Exploiting Heterogeneous CPUs/GPUs
David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA

General Purpose Computing
• With the introduction of multi-core CPUs, there has been a renewed interest in parallel computing paradigms and languages
• Existing multi-/many-core architectures are being considered for general-purpose platforms (e.g., Cell, GPUs, DSPs)
• Heterogeneous systems are becoming a common theme
• Are we returning to the days of the X87 co-processor?
• How should we combine multi-core and many-core systems into a single design?

Heterogeneous Computing
“... electronic systems that use a variety of different types of computational units ...” (Wikipedia)
• The elements could have different instruction set architectures
• The elements could have different memory byte orderings (i.e., endianness)
• The elements may have different memory coherency and consistency models
• The elements may only work with specific operating systems and application programming interfaces (APIs)
• The elements could be integrated on the same or different chips/boards/systems

Trends in Heterogeneous Computing: X86 Microprocessors
• 1978 – Intel 8086
  • Designed to run integer-based CPU-bound programs (e.g., Dhrystone) efficiently
  • No explicit floating point support
• 1980 – Intel 8087
  • 50 KFLOPS!!!!!
  • IEEE 754 definition
• 1982 – Intel 80286/287
• 1985 – Intel 80386/387 and AMD Am386 w/387
• 1989 – Intel 80486DX
  • First integrated on-chip X87

Trends in Heterogeneous Computing: X86 Microprocessors
• 1996 – Intel Pentium
  • MMX multimedia extensions
• 1997 – AMD K6
  • MMX and FP support
• 1998 – AMD K6-2
  • Extends MMX with 3DNow!
  • SIMD vector instructions for graphics processing
• 1999 – Intel Pentium III
  • Introduces SSE to X86
• 2001-2005 – Intel Pentium 4/Prescott and AMD Opteron/Athlon
  • SSE2 and SSE3
• 2006 – Intel Core and AMD K10
  • SSE4, SSE4.2 and SSE4a

Trends in Heterogeneous Computing: X86 Microprocessors
What spurred on these changes/advances?
• The inefficiency of emulating floating point on X86
• The need for increased precision in computations
• The desire to have interactive games (e.g., Flight Simulator, Donkey Kong)
• The emergence of multimedia (voice, video, graphics)
• The competitive market!

Some other examples of heterogeneous integration
• The IBM Cell
  • Soul of the Sony PS3
  • Composed of 1 Power Processing Element (PPE), with 8 physical SPEs
  • 9 DMA units for memory transfers
• The Analog Devices Blackfin
  • Integration of a classic fixed-point digital signal processor (DSP) and a microcontroller
  • One instruction set
  • Shared memory architecture
• The TI OMAP
  • Integration of an ARM and one or multiple DSPs
  • Popular cell-phone and media player platform
  • Shared memory architecture
• Graphics Processing Units
  • More than 64% of Americans played a video game in 2009
  • High-end – primarily used for 3-D rendering for videogame graphics and movie animation
  • Mid/low-end – primarily used for computer displays
  • Manufacturers include NVIDIA, AMD/ATI, IBM-Cell
  • Very competitive commodities market

Enter GPGPU – desktop supercomputing!
• GPU manufacturers made their chips programmable
  • OpenGL and DirectX provide support for programming shaders
  • NVIDIA GeForce3 was the first architecture to support this move (2001)
• NVIDIA’s CUDA had a huge impact on lowering the barrier to using the GPU for general-purpose computing
• AMD’s Brook+ also played an important role
• GPU manufacturers decided to make chipsets that specifically support the programmable GPU market
  • NVIDIA Tesla and Fermi
• What spurred this change?
  • The need for 3D and 4D data processing

A wide range of GPU apps
3D image analysis, Adaptive radiation therapy, Acoustics, Astronomy, Audio, Automobile vision, Bioinformatics, Biological simulation, Broadcast, Cellular automata, Fluid dynamics, Computer vision, Cryptography, CT reconstruction, Data mining, Digital cinema / projections, Electromagnetic simulation, Equity training,
Film, Financial, Languages, GIS, Holographic cinema, Machine learning, Mathematics research, Military, Mine planning, Molecular dynamics, MRI reconstruction, Multispectral imaging, N-body simulation, Network processing, Neural network, Oceanographic research, Optical inspection, Particle physics,
Protein folding, Quantum chemistry, Ray tracing, Radar, Reservoir simulation, Robotic vision / AI, Robotic surgery, Satellite data analysis, Seismic imaging, Surgery simulation, Surveillance, Ultrasound, Video conferencing, Telescope, Video, Visualization, Wireless, X-Ray

GPGPU is becoming mainstream research
• Research activities are expanding significantly
[Figure: search results for keyword “GPGPU” in IEEE and ACM]

AMD/ATI Radeon HD 5870
• Codename “Evergreen”
• 1600 SIMD cores
• L1/L2 memory architecture
• 153GB/sec memory bandwidth
• 2.72 TFLOPS SP
• OpenCL and DirectX11
• Hidden memory microarchitecture
• Provides for vectorized operation

Comparison of CPU and GPU Hardware Architectures

CPU/GPU             Single-precision TFLOPs   Cores   GFLOPs/Watt   $/GFLOP
NVIDIA 285          1.06                      240     5.8           $3.12
NVIDIA 295          1.79                      480     6.2           $3.80
AMD HD 5870         2.72                      1600    14.5          $0.16
AMD HD 4890         1.36                      800     7.2           $0.18
Intel Core i7 965   0.051                     4       0.39          $11.02

Source: NVIDIA, AMD and Intel

• The Medical Imaging field is rapidly deploying new 3-D and 4-D imaging technologies to improve patient outcomes
• This move has created an avalanche of image data
• Image reconstruction and image analysis have become major bottlenecks
• Accurate 3-D and 4-D image reconstruction requires compute-intensive algorithms
• The use of multi-modality imaging (e.g., CT and Ultrasound) further exacerbates this problem
• Heterogeneous computing will play a large role in addressing these challenges

Developing a suite of Biomedical Image Analysis Libraries – AMD-NVIDIA/OpenCL
• Target applications:
  • Deformable registration – radiation oncology
  • 3-D Iterative reconstruction – cardiovascular imaging
  • Maximum likelihood estimation – Digital Breast Tomosynthesis
  • Motion compensation in PET/CT images – cardiovascular imaging
  • Hyperspectral imaging – skin cancer screening
  • Image segmentation – brain imaging
$1.3M NSF Award EEC-0946463

• Currently, coronary heart disease (CHD) is the single leading cause of death in America
• Health care costs related to CHD exceed $150B/year
• U.S. in 2006 (American Heart Association)
  • Approximately 1,255,000 coronary attacks
  • Approximately 425,425 deaths
• Invasive coronary angiography is the state-of-the-art for assessing coronary blockages
  • Inject dye into the bloodstream and then X-ray the heart
  • 8% complication rate
  • 0.2% mortality rate

3-D Cardiovascular Plaque Imaging
• 3D CT imaging can be used to identify vulnerable plaque
• A helical scan of the body is performed
• Provides for more accurate imaging of the cardiovascular system
• Produces a detailed 3-D view of the blockage
• Has few negative side effects
• Scanning geometry produces a tremendous amount of data to process
Image reconstruction can take days to generate a single view!!

Iterative CT Image Reconstruction
• 3-D Spiral Cone-Beam Cardiac Image Reconstruction
• Reconstruction performance is a barrier to improving image quality
• Forward/backward projections consume more than 95% of total reconstruction time in an iterative helical cone-beam CT image reconstruction method
• Comparison of a single OpenCL/AMD HD 5870 implementation versus a multi-threaded optimized version on an Intel Core 2 Duo
[Figure: execution time comparison (one projection), execution time in seconds, shown alongside a reconstructed cardiac image]
• 31x speedup (250*250*9, 1160 projections)
*In collaboration with H. Pien and C. Karl

• A new technology developed at MGH to:
  • Produce a 3-D image of the breast utilizing 15 or more 2-D projections
• 3-D imagery can help address the following issues related to 2-D mammography
  • Increase the correct detection rate of cancers
  • Reduce the rate of misdiagnosed cancers – avoid unneeded biopsies
[Figure: 2-D versus DBT images of a cancer (increase correct detection rate) and of a hamartoma (decrease false positive rate)]

Tomosynthesis Image Reconstruction
• Utilizes a limited angle tomography approach using many 2-D images to generate a 3-D image
• Performs an iterative Maximum Likelihood Estimation for 3-D image reconstruction
• Reconstruction time is a barrier to image-guided biopsy
[Figure: reconstruction loop. X-ray source (15 views) and detector (1196x2304) yield X-ray projections; set 3D volume (guess); compute projections (forward); correct 3D volume (backward); 3D volume (1196x2304x45)]

Reconstruction Speedup
• 25X speedup
• Reduces false positives
• Patient receives feedback in the same visit
• Enables image-guided biopsy
• Improves patient outcomes
*In collaboration with R. Moore and W. Meleis

OpenCL – The future for heterogeneous computing
• Being developed by Khronos Group – a non-profit
• Open Computing Language
• LLVM compiler
• Looks a lot like CUDA
• A framework for writing programs that execute on a range of heterogeneous systems
• Present support for AMD/NVIDIA GPUs, Cell, X86 multi-core CPUs, IBM Power, and ARM
• More about OpenCL during this afternoon’s tutorial

GPU/OpenCL Strengths
• Supercomputing on the desktop
• Easy to program (small learning curve)
• Already have demonstrated success with several