Exploiting Heterogeneous CPUs/GPUs

David Kaeli

Department of Electrical and Computer Engineering
Northeastern University, Boston, MA

General Purpose Computing

. With the introduction of multi-core CPUs, there has been a renewed interest in parallel computing paradigms and languages

. Existing multi-/many-core architectures are being considered for general-purpose platforms (e.g., GPUs, DSPs)

. Heterogeneous systems are becoming a common theme
. Are we returning to the days of the co-processor?

. How should we combine multi-core and many-core systems into a single design?

Heterogeneous Computing

“…electronic systems that use a variety of different types of computational units…” – Wikipedia

• The elements could have different instruction set architectures
• The elements could have different memory byte orderings (i.e., endianness)
• The elements may have different memory coherency and consistency models
• The elements may only work with specific operating systems and application programming interfaces (APIs)
• The elements could be integrated on the same or different chips/boards/systems

Trends in Heterogeneous Computing: X86 Microprocessors

. 1978 – 8086
  . Designed to run integer-based, CPU-bound programs (e.g., Dhrystone) efficiently
  . No explicit floating point support

. 1980 – Intel 8087
  . 50 KFLOPS!
  . IEEE 754 definition

. 1982 – Intel 80286/287

. 1985 – Intel 80386/387 and AMD w/387

. 1989 – Intel 80486DX
  . First integrated on-chip X87

Trends in Heterogeneous Computing: X86 Microprocessors

. 1996 – Intel Pentium with MMX
  . MMX multimedia extensions
. 1997 – AMD K6
  . MMX and FP support
. 1998 – AMD K6-2
  . Extends MMX with 3DNow!
  . SIMD vector instructions for graphics processing

. 1999 – Intel Pentium III
  . Introduces SSE to X86
. 2001-2005 – Intel Pentium IV/Prescott and AMD Athlon
  . SSE2 and SSE3
. 2006 – Intel Core and AMD K10
  . SSE4, SSE4.2 and SSE4a

Trends in Heterogeneous Computing: X86 Microprocessors

What spurred on these changes/advances?

. The inefficiency of X86 software emulation of floating point
. The need for increased precision in computations
. The desire for interactive games (e.g., Flight Simulator, Donkey Kong)
. The emergence of multimedia (voice, video, graphics)
. The competitive market!

Some other examples of heterogeneous integration

. The IBM Cell
  . Soul of the Sony PS3
  . Composed of one Power Processing Element (PPE) and 8 physical SPEs
  . 9 DMA units for memory transfers
. DSP/RISC hybrids
  . Integration of a classic fixed-point DSP and a RISC core
  . One instruction set
  . Shared memory architecture
. The TI OMAP
  . Integration of an ARM core and one or multiple DSPs
  . Popular cell-phone and media-player platform
  . Shared memory architecture
. Graphics Processing Units
  . More than 64% of Americans played a video game in 2009
  . High-end – primarily used for 3-D rendering in video-game graphics and movie animation
  . Mid/low-end – primarily used for computer displays
  . Manufacturers include NVIDIA, AMD/ATI, and IBM (Cell)
  . A very competitive commodity market

Enter GPGPU – desktop supercomputing!

. GPU manufacturers made their chips programmable
  . OpenGL and DirectX provide support for programming shaders
  . The NVIDIA GeForce3 was the first architecture to support this move (2001)
. NVIDIA’s CUDA had a huge impact on lowering the threshold for accessing the GPU for general-purpose computing (a minimal kernel sketch follows this list)
  . AMD’s Brook+ also played an important role
. GPU manufacturers decided to make chipsets specifically to support the programmable GPU market
  . NVIDIA Tesla and Fermi
. What spurred this change? The need for 3-D and 4-D data processing
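To make that programming model concrete, here is a minimal OpenCL C kernel in the SPMD style that CUDA popularized and OpenCL adopted. The operation (a simple saxpy) and the kernel and argument names are illustrative choices, not code from these slides:

```c
// Minimal OpenCL C kernel: every work-item computes one element of
// y = a*x + y, the single-program-multiple-data style that CUDA
// popularized. Kernel name and arguments are illustrative only.
__kernel void saxpy(const float a,
                    __global const float *x,
                    __global float *y,
                    const int n)
{
    int i = get_global_id(0);   // this work-item's global index
    if (i < n)                  // guard when the global size is padded
        y[i] = a * x[i] + y[i];
}
```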

A wide range of GPU apps

3-D image analysis, Acoustics, Adaptive radiation therapy, Astronomy, Audio, Automobile vision, Bioinformatics, Biological simulation, Broadcast, Cellular automata, Computer vision, Cryptography, CT reconstruction, Data mining, Digital cinema/projections, Electromagnetic simulation, Equity trading, Film, Financial, Fluid dynamics, GIS, Holographic cinema, Languages, Machine learning, Mathematics research, Military, Mine planning, Molecular dynamics, MRI reconstruction, Multispectral imaging, N-body simulation, Network processing, Neural networks, Oceanographic research, Optical inspection, Particle physics, Protein folding, Quantum chemistry, Radar, Ray tracing, Reservoir simulation, Robotic surgery, Robotic vision/AI, Satellite data analysis, Seismic imaging, Surgery simulation, Surveillance, Telescope, Ultrasound, Video, Video conferencing, Visualization, Wireless, X-ray

GPGPU is becoming mainstream research

Research activities are expanding significantly

[Figure: search results for the keyword “GPGPU” in the IEEE and ACM digital libraries]

AMD/ATI Radeon HD 5870

• Codename “Evergreen”
• 1600 SIMD cores
• L1/L2 memory architecture
• 153 GB/sec memory bandwidth
• 2.72 TFLOPS single precision
• OpenCL and DirectX 11
• Hidden memory microarchitecture
• Provides for vectorized operation (illustrated in the sketch below)
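As a sketch of what “vectorized operation” can mean in practice, the same kind of kernel can be written with OpenCL’s packed float4 type, which maps naturally onto the wide SIMD/VLIW lanes of a design like Evergreen. This is an assumed illustration, not vendor code; n4 is taken to be the element count divided by four, with buffers padded accordingly:

```c
// Vectorized variant using OpenCL's float4 type: each work-item now
// processes four packed floats, helping fill wide SIMD/VLIW lanes.
// Assumes the arrays are padded to a multiple of four elements.
__kernel void saxpy4(const float a,
                     __global const float4 *x,
                     __global float4 *y,
                     const int n4)       // element count / 4
{
    int i = get_global_id(0);
    if (i < n4)
        y[i] = a * x[i] + y[i];   // one expression operates on 4 floats
}
```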

Comparison of CPU and GPU Hardware Architectures

  CPU/GPU            Single-precision TFLOPs   Cores   GFLOPs/Watt   $/GFLOP
  NVIDIA 285         1.06                        240       5.8        $3.12
  NVIDIA 295         1.79                        480       6.2        $3.80
  AMD HD 5870        2.72                       1600      14.5        $0.16
  AMD HD 4890        1.36                        800       7.2        $0.18
  Intel Core i7 965  0.051                         4       0.39       $11.02

  Source: NVIDIA, AMD and Intel

. The medical imaging field is rapidly deploying new 3-D and 4-D imaging technologies to improve patient outcomes
. This move has created an avalanche of image data
  . Image reconstruction and image analysis have become major bottlenecks
. Accurate 3-D and 4-D image reconstruction requires compute-intensive algorithms
. The use of multi-modality imaging (e.g., CT and ultrasound) further exacerbates this problem
. Heterogeneous computing will play a large role in addressing these challenges

Developing a suite of Biomedical Image Analysis Libraries – AMD-NVIDIA/OpenCL

. Target applications:
  . Deformable registration – radiation oncology
  . 3-D iterative reconstruction – cardiovascular imaging
  . Maximum likelihood estimation – Digital Breast Tomosynthesis
  . Motion compensation in PET/CT images – cardiovascular imaging
  . Hyperspectral imaging – skin cancer screening
  . Image segmentation – brain imaging
. $1.3M NSF Award EEC-0946463

. Currently, coronary heart disease (CHD) is the single leading cause of death in America
. Health care costs related to CHD exceed $150B/year

. U.S. statistics for 2006 (American Heart Association):
  . Approximately 1,255,000 coronary attacks
  . Approximately 425,425 deaths
. Invasive coronary angiography is the state of the art for assessing coronary blockages
  . Inject dye into the bloodstream and then X-ray the heart
  . 8% complication rate
  . 0.2% mortality rate

3-D Cardiovascular Plaque Imaging

. 3-D CT imaging can be used to identify vulnerable plaque
  . A helical scan of the body is performed
  . Provides more accurate imaging of the cardiovascular system
  . Produces a detailed 3-D view of the blockage
  . Possesses few negative side effects
. The scanning geometry produces a tremendous amount of data to process
  . Image reconstruction can take days to generate a single view!

Iterative CT Image Reconstruction

. 3-D spiral cone-beam cardiac image reconstruction
  . Reconstruction performance is a barrier to improving image quality
  . Forward/backward projections consume more than 95% of total reconstruction time in an iterative helical cone-beam CT image reconstruction method (a simplified kernel sketch follows)
. Comparison of a single OpenCL/AMD HD 5870 implementation versus a multi-threaded, optimized version on an Intel Core 2 Duo
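To suggest why these projections map so well to a GPU, here is a greatly simplified backprojection kernel: every voxel can be accumulated independently, so one work-item is assigned per voxel. The indexing below assumes a toy parallel-beam geometry as a placeholder; the actual helical cone-beam code must model the real source/detector trajectory and interpolate between detector cells:

```c
// Greatly simplified backprojection sketch: one work-item per voxel sums
// the detector samples associated with it across all views. The indexing
// assumes a toy parallel-beam geometry (a placeholder, not the real
// helical cone-beam trajectory, which also requires interpolation).
__kernel void backproject(__global const float *projections, // P x ny x nx
                          __global float *volume,            // nz x ny x nx
                          const int nx, const int ny, const int nz,
                          const int num_projections)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int z = get_global_id(2);
    if (x >= nx || y >= ny || z >= nz)
        return;

    float acc = 0.0f;
    for (int p = 0; p < num_projections; ++p)
        acc += projections[(p * ny + y) * nx + x];  // placeholder geometry

    volume[(z * ny + y) * nx + x] = acc;  // voxels written independently
}
```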

Execution time comparison (one projection)

[Figure: reconstructed cardiac image and execution times in seconds]

. 31x speedup (250x250x9 volume, 1160 projections)

*In collaboration with H. Pien and C. Karl

. A new technology developed at MGH to:
  . Produce a 3-D image of the breast utilizing 15 or more 2-D projections
. 3-D imagery can help address the following issues related to 2-D mammography:
  . Increase the correct detection rate of cancers
  . Reduce the rate of misdiagnosed cancers – avoid unneeded biopsies

[Figure: two 2-D DBT images – a cancer (increase correct detection rate) and a hamartoma (decrease false-positive rate)]

Tomosynthesis Image Reconstruction

[Diagram: an X-ray source (15 views) produces X-ray projections on a 1196x2304 detector; the loop sets a 3-D volume (guess), computes forward projections, and corrects the 3-D volume (1196x2304x45) via backward projection]

• Utilizes a limited-angle tomography approach, using many 2-D images to generate a 3-D image
• Performs an iterative Maximum Likelihood Estimation for 3-D image reconstruction (a host-side sketch of this loop follows)
• Reconstruction time is a barrier to image-guided biopsy
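A host-side sketch of that iterative loop, under the usual ML-EM formulation (guess, forward-project, take the measured/computed ratio, backproject, apply a multiplicative update). The helper functions are hypothetical stand-ins for OpenCL kernel launches, declared here only to make the loop readable:

```c
/* Hypothetical stand-ins for OpenCL kernel launches; declared for
 * illustration only, not an actual reconstruction library API. */
void forward_project(const float *volume, float *projections);
void ratio_images(const float *measured, float *computed);
void backproject_ratios(const float *computed, float *correction);

/* Host-side sketch of the iterative ML-EM reconstruction loop. */
void mlem_reconstruct(float *volume,         /* current 3-D estimate  */
                      const float *measured, /* acquired projections  */
                      float *computed,       /* scratch projections   */
                      float *correction,     /* scratch volume        */
                      int num_voxels, int num_iterations)
{
    for (int it = 0; it < num_iterations; ++it) {
        forward_project(volume, computed);         /* compute projections */
        ratio_images(measured, computed);          /* measured / computed */
        backproject_ratios(computed, correction);  /* ratios -> voxels    */
        for (int v = 0; v < num_voxels; ++v)
            volume[v] *= correction[v];            /* multiplicative update */
    }
}
```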

Reconstruction Speedup

• 25X speedup
• Reduces false positives
• Patient receives feedback in the same visit
• Enables image-guided biopsy
• Improves patient outcomes

*In collaboration with R. Moore and W. Meleis

OpenCL – The future for heterogeneous computing

• Being developed by the Khronos Group – a non-profit industry consortium
• Open Computing Language

• LLVM compiler
• Looks a lot like CUDA
• A framework for writing programs that execute across a range of heterogeneous systems
• Present support for AMD/NVIDIA GPUs, Cell, X86 multi-core CPUs, IBM Power, and ARM

• More about OpenCL during this afternoon’s tutorial – a minimal device-enumeration example is sketched below
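As a small taste ahead of the tutorial, the host program below enumerates every OpenCL platform and device visible on a machine – the mechanism that lets one framework target AMD/NVIDIA GPUs, X86 CPUs, Cell, and others. It uses only standard OpenCL 1.0 entry points; error checking is omitted for brevity:

```c
// Enumerate OpenCL platforms and their devices using standard
// OpenCL 1.0 API calls. Error checking omitted for brevity.
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(8, platforms, &num_platforms);

    for (cl_uint p = 0; p < num_platforms; ++p) {
        char name[256];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);
        printf("Platform: %s\n", name);

        cl_device_id devices[8];
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL,
                       8, devices, &num_devices);
        for (cl_uint d = 0; d < num_devices; ++d) {
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                            sizeof(name), name, NULL);
            printf("  Device: %s\n", name);  // CPU, GPU, or accelerator
        }
    }
    return 0;
}
```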

GPU/OpenCL Strengths

. Supercomputing on the desktop
. Easy to program (small learning curve)
. Demonstrated success with several complex and important applications
  . Impressive speedups
  . Excellent cost/performance
. OpenCL provides a very natural platform for pursuing parallelization-based techniques
. Fusion should provide a rich path for further acceleration!

GPU/OpenCL Limitations

. Porting applications to the latest-and-greatest hardware becomes a time-consuming task
  . Suggests we need to raise the abstraction level
. The current OpenCL SDK still does not produce code as efficient as CUDA’s
  . Improvements to the compiler and device driver will continue to close this gap

Ongoing Research to support Heterogeneity on GPUs

. Memory space selection and memory transformation
  . How do we best utilize the available memory on the GPU? (see the local-memory sketch below)
  . How do we support vectorization?
. Multi-GPU exploitation
  . How can we write OpenCL programs for a single GPU, yet run them transparently on multiple GPUs?
. CPU/GPU virtualization
  . How can we present applications with a view of more GPUs/CPUs than are physically installed in the system?
. Binary translation
  . Support NVIDIA/CUDA PTX to AMD IL migration
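To illustrate the memory-space-selection question, the sketch below stages reused data in fast on-chip __local memory rather than re-reading __global memory. The work-group partial-sum pattern is standard; WG_SIZE is an assumed compile-time constant that must match the launch’s work-group size:

```c
// Memory-space selection in miniature: each work-group stages its data
// in on-chip __local memory, so after one read per item all further
// traffic stays on-chip. WG_SIZE (assumed, power of two) must equal the
// work-group size used at launch.
#define WG_SIZE 64

__kernel void partial_sums(__global const float *in,
                           __global float *out,   // one sum per work-group
                           const int n)
{
    __local float tile[WG_SIZE];
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    tile[lid] = (gid < n) ? in[gid] : 0.0f;   // single __global read
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int stride = WG_SIZE / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            tile[lid] += tile[lid + stride];  // all traffic is on-chip
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        out[get_group_id(0)] = tile[0];
}
```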

Summary and Future Work

• GPUs are revolutionizing desktop supercomputing
• A number of critical applications have been migrated successfully
• We will see shortly whether heterogeneous CPU/GPU systems are adopted as the status quo for the desktop market
  • The key will be power/performance/cost
• GPUs have already demonstrated their value in selected domains
• OpenCL is the future for heterogeneous computing
• The low end and the high end are meeting in the middle!

The NUCAR GPU Team

• Rodrigo Dominguez
• Byunghyun Jang
• Perhaad Mistry
• Dana Schaa
• Matt Sellitto
• Justin White
• Babatunde Ogunfemi
• Nicole Dawson – Spelman College
• Stevie-Mari Hawkins – Spelman College
• Rashidah Slater – Spelman College
• Joe Robinson – Northern Essex CC
• Many collaborators at NU, MGH, BU, RPI, and UPRM