073-080.Pdf (568.3Kb)

Graphics Hardware (2007) Timo Aila and Mark Segal (Editors) A Low-Power Handheld GPU using Logarithmic Arith- metic and Triple DVFS Power Domains Byeong-Gyu Nam, Jeabin Lee, Kwanho Kim, Seung Jin Lee, and Hoi-Jun Yoo Department of EECS, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea Abstract In this paper, a low-power GPU architecture is described for the handheld systems with limited power and area budgets. The GPU is designed using logarithmic arithmetic for power- and area-efficient design. For this GPU, a multifunction unit is proposed based on the hybrid number system of floating-point and logarithmic numbers and the matrix, vector, and elementary functions are unified into a single arithmetic unit. It achieves the single-cycle throughput for all these functions, except for the matrix-vector multiplication with 2-cycle throughput. The vertex shader using this function unit as its main datapath shows 49.3% cycle count reduction compared with the latest work for OpenGL transformation and lighting (TnL) kernel. The rendering engine uses also the logarithmic arithmetic for implementing the divisions in pipeline stages. The GPU is divided into triple dynamic voltage and frequency scaling power domains to minimize the power consumption at a given performance level. It shows a performance of 5.26Mvertices/s at 200MHz for the OpenGL TnL and 52.4mW power consumption at 60fps. It achieves 2.47 times performance improvement while reducing 50.5% power and 38.4% area consumption compared with the latest work. Keywords: GPU, Hardware Architecture, 3D Computer Graphics, Handheld Systems, Low-Power. 1. Introduction tive high graphics performance of 7.2Mvertices/s for As the mobile electronics advances rapidly, the OpenGL transformation and lighting (TnL). However, handsets like cell phones and PDAs are adopting the its dynamic range was limited by the fixed-point arith- realtime 3D computer graphics with high performance metic. processors consuming only limited power and area. Re- cently, the embedded 3D graphics API like OpenGL-ES In this paper, a power- and area-efficient GPU based [Khr05] is defined and it adopts the programmability on the logarithmic arithmetic is proposed. Although the into the 3D graphics pipeline to support various graphics logarithmic arithmetic carries a little computation error, effects. Thus, the vector processors called shaders are it can significantly reduce the arithmetic complexity included into the GPU for the programmability and it is [Mit62, CCS*00]. Therefore, the logarithmic arithmetic required to support matrix, vector, and elementary func- is adopted for our GPU, which is targeted for the hand- tions to process various geometry transformation and held systems with small size screens. In addition, the lighting kernels. proposed GPU is divided into triple power domains with dynamic voltage and frequency scaling (DVFS) for the There have been several studies on the programma- minimum power consumption at a given performance ble graphics processors. The conventional shader archi- level. The concept of using separate power domains for tecture incorporates the 4-way vector SIMD unit for the the geometry and rendering stages was proposed in vector multiply, multiply-add, and dot-product and spe- [SLS04]. It is extended into triple power domains and cial function unit (SFU) for reciprocal, reciprocal-squre- implemented in our GPU. root, logarithm ( ), and exponential ( x ) functions log2 x 2 The paper is organized as follows. First, the GPU ar- [LKM01]. A graphics processor for the cell phones is chitecture is proposed. The number system using the proposed in [KKF*03] and it incorporates 2-way SIMD logarithmic arithmetic is described for our GPU design. unit for multiply-add and special function unit for recip- Then, the multifunction unit based on this number sys- rocal, reciprocal-square-root, and power functions. tem is presented and followed by the vertex shader in- However, it showed limited performance of cluding this unit. The rendering engine using the loga- 185Kpolygons/s, which is far below the performance rithmic arithmetic is also presented. After that, the level supported by modern handheld GPUs [SWY04, power management scheme is described and the evalua- AYH*04, KCY*06, YCK*06]. The vertex shader based tion results will be shown. Finally, conclusions are made. on the fixed-point arithmetic is proposed in [SWY04] for the power- and area-efficient design. It shows rela- Copyright © 2007 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail [email protected]. Graphics Hardware 2007, San Diego, CA, August 04-05, 2007. © 2007 ACM 978-1-59593-625-7/07/0008 $5.00 B.-G. Nam and et al. / A Low-Power Handheld GPU using Logarithmic Arithmetic and Triple DVFS Power Domains 74 11b PMU 11bPMU 11b PMU RISC VS RE Domain Domain Domain Vdd Clk VddClk Vdd Clk 128b 16b Matrix Vertex Index Rendering 32b FIFO FIFO RISC Shader Engine 128b 128b (VS) VOB (RE) VIB 128b 32b 32b 32b External asynchronous SRAM Figure 1: Overall architecture of the proposed GPU 2. Handheld GPU Architecture 3.1. Logarithmic converter Figure 1 shows the architecture of the proposed When x is 32-bit FLP input, x and its logarithm can handheld GPU. It mainly consists of three major mod- be represented by (1) and (2), respectively. ules i.e. an ARM10-compatible RISC processor, a vertex shader, and a rendering engine. The RISC processor is x = 2(1e + m ) (1) included for simulation of the artificial intelligence and collision detections required for 3D gaming applications. The vertex shader (VS) and the rendering engine (RE) log22x = em++ log (1 ) (2) are designed based on the benefits of the logarithmic arithmetic, from which the complexities of arithmetic The e is the integer part and log (1+ m ) is the fractional operations are reduced. 2 part of the logarithmic number. The nonlinear term is approximated by piecewise linear expres- The GPU is divided into triple power domains to log2 (1+ m ) manage the power consumption of each module inde- sions as (3). pendently to get the lowest power consumption at a given performance level. Three power management log2 (1+ mamb ) ≈⋅ + (3) units (PMUs) are included to manage the clock frequency and supply voltage of each domain. The FIFOs where a and b are the approximation coefficients defined are used across the power domains for buffering the for each approximation region. It achieves maximum transferred data between each domain in spite of the 0.41% conversion error with 15 approximation regions. long response delay of the power management units. 3.2. Antilogarithmic converter 3. Hybrid Number System When X is the LNS, its FLP number can be repre- It is known that the logarithmic number system sented by (4). (LNS) can simplify the arithmetic operations like multiplication, division, and square-root into addition, sub- Xefef+ traction, and right shift, respectively. However, the addi- 22= =⋅ 22 (4) tion and subtraction in the LNS become more complex and requires nonlinear function evaluations. Therefore, a The integer part e directly becomes the exponent of the hybrid number system (HNS) of the LNS and floating- FLP number and the nonlinear term 2 f is approximated point number system (FLP) was introduced to solve this by piecewise linear expressions like (5). problem in [LW91], where the addition and subtraction are carried out in the FLP while the other operations are 2 f ≈ af⋅+ b (5) performed in the LNS. For the number conversion between floating-point and logarithmic numbers in the HNS, the logarithmic and antilogarithmic converters are where a and b are the approximation coefficients for proposed. These are based on the piecewise linear ap- each approximation region. It achieves maximum 0.08% proximation scheme to reduce area and power consump- conversion error with 8 approximation regions. tion [AS03, AbS03]. 4. Multifunction Unit ⓒ Association for Computing Machinery, Inc. 2007 B.-G. Nam and et al. / A Low-Power Handheld GPU using Logarithmic Arithmetic and Triple DVFS Power Domains 75 The proposed functional unit operates on the HNS to for the final accumulation required for the MAT. It is reduce the arithmetic complexities. It unifies the matrix also used as a rounding logic for other operations. (MAT), vector (VEC), and elementary (ELM) functions into a single arithmetic unit and its operation set Channel 0 Channel 1 Channel 2 Channel 3 log2 x y includes the matrix-vector multiplication, vector log c log c log c log c z my0 2 1 my1 2 2 my2 2 3 my3 2 4 multiply, divide, divide-by-square, multiply-add, lerp, 32 24 32 32 24 32 24 32 24 32 A B 32b x 24b 32b x 6b 32b x 6b 32b x 6b 32b x 6b Radix-4 Booth dot-product, cross-product, power, logarithm, and Encoder 012s0 34 5s1 678s2 91011s3 s 12 sign of32bx24b trigonometric (TRG) functions like sine, cosine, and Log / Alog LUT Log / Alog LUT Log / Alog LUT Log / Alog LUT (120B) (120B) (120B) (120B) ELM ELM arctangent, etc.. It achieves single-cycle throughput with ELM100 101 sign 10 0 101 sign ELM 100 101 sign 100 101 sign maximum 5-cycle latency for all the operations except 3:2 CSA 0 3:2 CSA 1 3:2 CSA 2 3:2 CSA 3 TRG 100 1 TRG 100 1 for the matrix-vector multiplication which takes 2-cycle MAT,VEC,TRG 1 0 1 0 MAT,VEC,TRG 1 0 1 0 throughput and 6-cycle latency.

073-080.Pdf (568.3Kb)

State Model Syllabus for Undergraduate Courses in Science (2019-2020)

PERL – a Register-Less Processor

Instruction Pipelining in Computer Architecture Pdf

Utilizing Parametric Systems for Detection of Pipeline Hazards

Pipeliningpipelining

Processor Design：System-On-Chip Computing For

Operand Registers and Explicit Operand Forwarding James Balfour, R

A CUDA Program • a CUDA Program Consists of a Mixture of the Following Two Parts

Pipelining Design Techniques

Realizing High IPC Through a Scalable Memory-Latency Tolerant Multipath Microarchitecture

Delicm: Scalar Dependence Removal at Zero Memory Cost

ECE 341 Lecture # 13 Lecture Topics Types of Pipeline Hazards