Undergraduate
Category: Engineering and Technology
Degree Level: N/A
Abstract ID#: 1379

GPU-Compute on
Northeastern Undergraduate: Matthew Greenlaw
Mentor: PhD Candidate Fanny Nina-Paravecino
Principal Investigator: David Kaeli

ABSTRACT

Objective:
• Our research explores parallel implementations on the Raspberry Pi by manipulating its QPU
• Explore the benefits of general-purpose computing on a Raspberry Pi 2, Model B QPU
• Since the device does not presently support CUDA, OpenCL, or other high-level parallel computing languages, we work with QPU assembler code to access the VideoCore IV's SIMD processing capability

Motivation:
• The VideoCore IV provides four QPUs, each a 16-way SIMD processor
• Each processor has two powerful vector floating-point ALUs
• The Raspberry Pi 2 is inexpensive, consumes little power, and is developed by many hardware vendors
• The Raspberry Pi 2 GPU is programmable
• Software applications that can successfully employ the Raspberry Pi's QPU will enjoy energy and performance benefits

BACKGROUND

Parallel Programming
• Simultaneous execution of threads on a given instruction (SIMD) improves runtime efficiency [1]
[Diagram labels: Instruction Pool; Data Pool; 1, 2, 3, 4; SIMD Runtime]

Raspberry Pi 2, Model B GPU Architecture
• Supported by Broadcom, which does not support CUDA or OpenCL platforms
• Contains 4 QPUs with 16-way SIMD processors for parallelization [2]
• Each QPU has 2 ALUs to perform vector floating-point multiplication and addition
• The VideoCore IV can reach 24 GFLOPS of floating-point throughput

RASPBERRY PI GPU

• The Raspberry Pi 2 GPU is a Broadcom VideoCore® IV 3D chip that provides enhanced scalability with multiple floating-point processors called QPUs
• Based on the specifications for the QPU [3], we have developed an assembler that can access its registers, memory allocation, ALU units, etc.
• The assembler we are using is available on GitHub [2]

Figure 5: Assembler framework allowing interaction between host and device

• helloworld.asm is an assembly code program that controls the registers for the operands, memory access to the QPU, interaction with the host memory, and termination of the program
• assembler.cpp translates the assembly to binary code that the QPU understands
• helloworld.bin is the binary code sent to the QPU's driver.c executable
• driver.c controls the access to the QPU [3]
• Our research focuses on modifying assembler.cpp and the *.asm files to support a high-level parallel application

PRELIMINARY RESULTS

Performance comparison between sequential and parallel approaches:
• The benefits of the parallel approach are not apparent until many threads are assigned to the registers
• Since the Raspberry Pi can only run a limited number of threads compared to the Tesla K20c or the Maxwell 980Ti, it is not able to amortize the communication costs, so the sequential approach takes less time to run than the parallel approach
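The amortization argument in the preliminary results can be made concrete with a simple cost model. The sketch below is illustrative only: the overhead and per-element costs are made-up parameters, not measurements from this work, and `sequential_time`, `parallel_time`, and `break_even_elements` are hypothetical helper names. It assumes the parallel run pays a fixed host-to-device communication cost and then splits the work evenly across the available threads.

```python
import math


def sequential_time(n_elements, cost_per_element):
    """Time to process every element one after another on the host."""
    return n_elements * cost_per_element


def parallel_time(n_elements, cost_per_element, n_threads, comm_overhead):
    """Fixed communication overhead plus work split evenly across threads."""
    elements_per_thread = math.ceil(n_elements / n_threads)
    return comm_overhead + elements_per_thread * cost_per_element


def break_even_elements(cost_per_element, n_threads, comm_overhead):
    """Smallest problem size at which the parallel run is strictly faster."""
    if n_threads < 2 or cost_per_element <= 0:
        raise ValueError("need at least 2 threads and a positive per-element cost")
    n = 1
    while (parallel_time(n, cost_per_element, n_threads, comm_overhead)
           >= sequential_time(n, cost_per_element)):
        n += 1
    return n


# Hypothetical numbers: 16 threads, overhead worth 100 element-operations.
small_seq = sequential_time(64, 1.0)
small_par = parallel_time(64, 1.0, 16, 100.0)
threshold = break_even_elements(1.0, 16, 100.0)
```

Under these assumed parameters, a 64-element problem runs faster sequentially because the fixed overhead dominates, and the parallel version only wins once the problem grows past `threshold` elements. A device with more threads or a lower communication cost would reach that point sooner.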

[Diagram labels: 1, 2, 3, 4, 5; Sequential Runtime]
Figure 1: Runtime Comparison between Parallel and Sequential Programming

Figure 7: QPU Pipeline with two ALUs necessary to perform vector floating-point addition and multiplication.
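To illustrate the comparison in Figure 1, the sketch below emulates in plain Python how a 16-way SIMD unit applies one instruction to 16 data elements per step, while a sequential loop handles one element per step. It only counts issued instructions and does not model the QPU's actual timing; `LANES`, `sequential_add`, and `simd_add` are hypothetical names chosen for this example.

```python
LANES = 16  # SIMD width of one QPU, as stated in the poster


def sequential_add(a, b):
    """Add element by element; one instruction issued per element."""
    out, issued = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        issued += 1
    return out, issued


def simd_add(a, b, lanes=LANES):
    """Add in chunks of `lanes` elements; one instruction issued per chunk."""
    out, issued = [], 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        issued += 1
    return out, issued


a = list(range(64))
b = [10] * 64
seq_result, seq_issued = sequential_add(a, b)
simd_result, simd_issued = simd_add(a, b)
```

Both functions produce the same result, but the emulated SIMD version issues one instruction per 16 elements instead of one per element, which is the source of the runtime gap sketched in Figure 1.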

Figure 2: Traditional CUDA Template for Parallelization

Figure 3: Raspberry Pi 2, Model B

GPU                       Processing Power (GFLOPS)   Cost (USD)
Raspberry Pi 2, Model B   24                          35
Maxwell 980Ti             5630                        649.99
Tesla K20c                1170                        2999.95

Figure 4: Cost-Benefit Analysis of the Raspberry Pi Compared to Other GPU Architectures

Figure 6: VideoCore IV 3D Architecture, with a QPS to compute threads in parallel or sequentially for the Quad Processor.

CONCLUSIONS

1. We have evaluated the parallel performance of the Raspberry Pi and compared it to its sequential performance
2. We have shown that the Raspberry Pi is able to run parallel computations
3. The small number of threads available on the Raspberry Pi limits the benefits of parallelism for the application studied
4. Future research will continue to investigate parallel applications that reap benefits on the Raspberry Pi given the limited number of threads

The goal of this research is to leverage the Raspberry Pi to perform parallel computing by accessing its QPU.

REFERENCES

[1] David B. Kirk and Wen-mei W. Hwu. 2010. Programming Massively Parallel Processors: A Hands-On Approach (1st ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[2] "Hacking The GPU For Fun And Profit (Pt. 1)." Raspberry Pi Playground. 3 May 2014. Web. 11 Feb. 2016.
[3] Broadcom Co. VideoCore IV 3D Architecture Reference Guide. Irvine: Broadcom Co., 2013.