Low Overhead Memory Subsystem Design for a Multicore Parallel DSP Processor

Linköping Studies in Science and Technology Dissertation No. 1532 Low Overhead Memory Subsystem Design for a Multicore Parallel DSP Processor Jian Wang Department of Electrical Engineering Linköping University SE-581 83 Linköping, Sweden Linköping 2014 ISBN 978-91-7519-556-8 ISSN 0345-7524 ii Low Overhead Memory Subsystem Design for a Multicore Parallel DSP Processor Jian Wang ISBN 978-91-7519-556-8 Copyright ⃝c Jian Wang, 2014 Linköping Studies in Science and Technology Dissertation No. 1532 ISSN 0345-7524 Department of Electrical Engineering Linköping University SE-581 83 Linköping Sweden Phone: +46 13 28 10 00 Author e-mail: [email protected] Cover image Combined Star and Ring onchip interconnection of the ePUMA multicore DSP. Parts of this thesis are reprinted with permission from the IEEE. Printed by UniTryck, Linköping University Linköping, Sweden, 2014 Abstract The physical scaling following Moore’s law is saturated while the re- quirement on computing keeps growing. The gain from improving silicon technology is only the shrinking of the silicon area, and the speed- power scaling has almost stopped in the last two years. It calls for new parallel computing architectures and new parallel programming methods. Traditional ASIC (Application Specific Integrated Circuits) hardware has been used for acceleration of Digital Signal Processing (DSP) subsys- tems on SoC (System-on-Chip). Embedded systems become more com- plicated, and more functions, more applications, and more features must be integrated in one ASIC chip to follow up the market requirements. At the same time, the product lifetime of a SoC with ASIC has been much reduced because of the dynamic market. The life time of the design for a typical main chip in a mobile phone based on ASIC acceleration is about half a year and the NRE (Non-Recurring Engineering) cost of it can be much more than 50 million US$. The current situation calls for a new solution than that of ASIC. ASIP (Application Specific Instruction set Processor) offers comparable power consumption and silicon cost to ASICs. Its greatest advantage is the functional flexibility in a predefined application domain. ASIP based SoC enables software upgrading without changing hardware. Thus the product life time can be 5-10 times more than that of ASIC based SoC. This dissertation will present an ASIP based SoC, a new unified parallel DSP subsystem named ePUMA (embedded Parallel DSP Platform iii iv with Unique Memory Access), to target embedded signal processing in communication and multimedia applications. The unified DSP subsystem can further reduce the hardware cost, especially the memory cost, of embedded SoC processors, and most importantly, provide full pro- grammability for a wide range of DSP applications. The ePUMA processor is based on a master-slave heterogeneous multi-core architecture. One master core performs the central control, and multiple Single In- struction Multiple Data (SIMD) coprocessors work in parallel to offer a majority of the computing power. The focus and the main contribution of this thesis are on the memory subsystem design of ePUMA. The multi-core system uses a distributed memory architecture based on scratchpad memories and software controlled data movement. It is suitable for the data access properties of streaming applications and the kernel based multi-core computing model. The essential techniques include the conflict free access parallel memory architecture, the multi-layer interconnection network, the non-address stream data transfer, the transitioned memory buffers, and the lookup table based parallel memory addressing. The goal of the design is to minimize the hardware cost, simplify the software protocol for inter-processor communication, and increase the arithmetic computing efficiency. We have so far proved by applications that most DSP algorithms, such as filters, vector/matrix operations, transforms, and arithmetic functions, can achieve computing efficiency over 70% on the ePUMA platform. And the non-address stream network provides equivalent communication bandwidth by less than 30% implementation cost of a crossbar interconnection. Preface This thesis includes the following papers published during my research from August 2008 to July 2013: • Jian Wang, Andréas Karlsson, Joar Sohl, and Dake Liu. Convo- lutional decoding on deep-pipelined SIMD processor with flexible parallel memory. In Digital System Design (DSD), 2012, pages 529- 532. IEEE, 2012. • Jian Wang, Joar Sohl, Andréas Karlsson, and Dake Liu. An efficien- t streaming star network for multi-core parallel DSP processor. In Proceedings of the 2011 Second International Conference on Net- working and Computing, ICNC ’11, pages 332-336, Washington, DC, USA, 2011. IEEE Computer Society. • Jian Wang, Andréas Karlsson, Joar Sohl, Magnus Pettersson, and D. Liu. A multi-level arbitration and topology free streaming network for chip multiprocessor. In ASIC (ASICON), 2011 IEEE 9th International Conference on, pages 153-158, oct. 2011. • Jian Wang, Joar Sohl, and Dake Liu. Architectural support for re- ducing parallel processing overhead in an embedded multiprocessor. In Proceedings of the 2010 IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, EUC ’10, pages 47-52, Washington, DC, USA, 2010. IEEE Computer Society. • Jian Wang, Joar Sohl, Olof Kraigher, and Liu Dake. Software programmable data allocation in multi-bank memory of simd proces- v vi sors. In Proceedings of the 2010 13th Euromicro Conference on Dig- ital System Design: Architectures, Methods and Tools, pages 28-33. IEEE Computer Society, 2010. • Jian Wang, Joar Sohl, Olof Kraigher, and Dake Liu. ePUMA: A novel embedded parallel DSP platform for predictable computing. In Education Technology and Computer (ICETC), 2010 2nd Inter- national Conference on, volume 5, pages V5-32 -V5-35, june 2010. I am also co-author of the following publications: • Joar Sohl, Jian Wang, Andréas Karlsson, and Dake Liu. Automat- ic permutation for arbitrary static access patterns. In Parallel and Distributed Processing with Applications (ISPA), 2012 IEEE 10th International Symposium on, pages 215-222, july 2012. • Dake Liu, Andréas Karlsson, Joar Sohl, Jian Wang, Magnus Peters- son, and Wenbiao Zhou. ePUMA embedded parallel DSP processor with unique memory access. In Information, Communications and Signal Processing (ICICS) 2011 8th International Conference on, pages 1-5, dec. 2011. • Dake Liu, Joar Sohl, and Jian Wang. Parallel computing and it- s architecture based on data access separated kernels. IJERTCS, International Journals Embedded and Real-Time Communication systems. • Joar Sohl, Jian Wang, and Dake Liu. Large matrix multiplication on a novel heterogeneous parallel DSP architecture. In Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies, APPT ’09, pages 408-419, Berlin, Heidelberg, 2009. Springer-Verlag. Acknowledgments I would like to take this opportunity to express my gratitude to many people for their support during my study at Linköping university. I would like to thank the following: • Professor Dake Liu for the opportunity to work on this interesting project and for the advices during my study. • Professor Torkel Glad for the support and supervision in the final year of my study. • The head of Computer Engineering division Tomas Svensson for the support and encouragement during my time at the division. • My co-supervisor Dr. Andreas Ehliar for the advices in technical details and guide in writing this thesis. • PhD candidates Joar Sohl and Andréas Karlsson for great coopera- tion in the ePUMA project, for all the fruitful discussions and col- laboration in publications. And all PhD students I have met during my study for the comments and ideas about my work. • Ylva Jernling for your help in administration affairs, Olle Seger for the help in teaching and the valuable comments on this dissertation, and Anders Nilsson for the support in computer and software. And everyone else at Computer Engineering division, it has been a great time to work with you all. vii viii • Dr. Di Wu for all interesting discussions, valuable suggestions in my research and my career, and for the friendship. • All my other friends for the good time at Linköping. • My wife Fan Zhang for her love, support, encouragement, and un- derstanding during our time in Sweden. • My mother Zhongxia Yang, my sister Wenjing Wang and her hus- band Runlong Gong, and all my families for their constant support and encouragement. This research was supported by the Swedish Foundation for Strategic Research (SSF). Jian Wang Lund, March 2014 Contents Abbreviations xv I Background 1 1 Introduction 3 1.1 Motivation . 5 1.1.1 Programmable Multicore DSP . 5 1.1.2 DRAM and SPM Based Memory Subsystem . 6 1.1.3 Parallel Local Memory Architecture . 7 1.1.4 Design Goal and Challenge . 8 1.2 Contributions . 10 1.3 Thesis Organization . 12 2 High Performance Streaming Computing 15 2.1 Computing Characteristics . 15 2.2 Computation Kernels . 18 2.3 Data Flow Based Signal Processing . 21 2.3.1 LTE Uplink Baseband Signal Processing . 21 2.3.2 H.264 Decoder . 22 2.3.3 3D Rendering Pipeline . 23 3 Parallel Processor Architectures 25 3.1 Introduction . 25 3.2 Instruction Level Parallel Execution . 26 ix x Contents 3.2.1 Superscalar . 26 3.2.2 VLIW . 26 3.3 SIMD Architecture for Data Level Parallelism . 28 3.3.1 SIMD Arithmetic Functions . 28 3.3.2 SIMD Data Accesses . 29 3.3.3 SWAR Extension vs. SIMD Engine . 31 3.4 Chip Multiprocessor . 31 3.4.1 Multicore Computing Model . 32 3.4.2 Shared Memory vs. Distributed Memory . 34 3.4.3 Interconnection Architecture . 36 3.5 Performance Obstacles . 38 3.5.1 Streaming Data Movement . 39 3.5.2 Parallel Data Access for a SIMD Datapath . 40 3.5.3 DRAM Memory Bandwidth Efficiency . 41 3.5.4 Memory Coherency Overhead . 43 3.5.5 Control Overhead . 44 3.6 Programmable Architectures for Streaming Applications .

Low Overhead Memory Subsystem Design for a Multicore Parallel DSP Processor

Elementary Functions: Towards Automatically Generated, Efficient

REPORT Compaq Chooses SMT for Alpha Simultaneous Multithreading

Run-Time Adaptation to Heterogeneous Processing Units for Real-Time Stereo Vision

A Compiler Target Model for Line Associative Registers

Prozessorarchitektur Am Beispiel Des Amdathlon

Intel ® Atom™ Processor E6xx Series SKU for Different Segments” on Page 30 Updated Table 15

Adding Support for Vector Instructions to 8051 Architecture

Datasheet, Volume 2

A Bibliography of Publications in IEEE Micro

High Performance Motion Detection: Some Trends

Optimizing SIMD Execution in HW/SW Co-Designed Processors

MPC500 Family MPC509 User's Manual