A TMD-MPI/MPE Based Heterogeneous Video System
by Tony Ming Zhou
Supervisor: Professor Paul Chow
April 2010

Abstract

Advances in FPGA technology have enabled large-scale reconfigurable hardware system designs. In recent years, heterogeneous systems comprising embedded processors, memory units, and a wide variety of IP blocks have become an increasingly popular solution for building future computing systems. The TMD-MPI project has extended the software message-passing standard, MPI, to FPGA hardware design. It provides a new programming model that enables transparent communication and synchronization between tasks running on heterogeneous processing devices in the system. In this thesis project, we present the design and characterization of a TMD-MPI based heterogeneous video processing system. The system comprises hardware peripheral cores and a software video codec. By hiding low-level architectural details from the designer, TMD-MPI can improve development productivity and lower the difficulty of the design task. In particular, with the abstraction that TMD-MPI provides, the software video codec approach offers an easy entry point to hardware design. The primary focus is the functionality and the different configurations of the TMD-MPI based heterogeneous design.

Acknowledgements

I would like to thank my supervisor, Professor Paul Chow, for his patience and guidance, and the University of Toronto and the Division of Engineering Science for the wonderful five-year journey that led me to this point. Special thanks go to the TMD-MPI research group, in particular to Sami Sadaka, Kevin Lam, Kam Pui Tang, and Manuel Saldaña. Last but not least, I would like to thank my family and my friends for always being there for me. Stella, Kay, Grace, Amy, Rui, Qian, Chunan, and David, you have painted the colours of my university life.
Glossary

MPI     Message Passing Interface
API     Application Program Interface
FIFO    First-In-First-Out
NetIf   Network Interface
MPE     Message Passing Engine
TMD     Originally meant Toronto Molecular Dynamics machine, but this definition was rescinded as the platform is not limited to Molecular Dynamics. The name was kept in homage to earlier TM-series projects at the University of Toronto.
VGA     Video Graphics Array
RGB     Red-Green-Blue Colour Model
FPS     Frames Per Second
DVI     Digital Visual Interface
FSL     Xilinx Fast Simplex Link
HDL     Hardware Description Language
TX      Transmission/Transmitting
RX      Reception/Receiving
MPMC    Multi-Ported Memory Controller
BRAM    Xilinx Block RAM
PLB     Processor Local Bus

Contents

1 Introduction
    1.1 Motivation
    1.2 Objectives
2 Background
    2.1 Literature Review
    2.2 Distributed/Shared Memory Approaches
    2.3 The Building Blocks
3 Methods and Findings
    3.1 The Video System in Software
    3.2 The Video System on FPGA
        3.2.1 System Block Diagram
        3.2.2 Distributed Memory Model
        3.2.3 Shared Memory Model
4 Discussions and Conclusions
    4.1 Software vs. Hardware
    4.2 Conclusions and Future Directions
References
Appendix A: Video System – Software Prototype
Appendix B: Video System – Hardware System
Appendix C: File Structure

1 Introduction

1.1 Motivation

Chip development has become increasingly difficult due to the physical limits of transistor scaling. Parallel processing stands out as one of the best alternative routes to performance improvements.
Message Passing Interface, or MPI, is a specification for an API that allows computers to communicate with one another. After more than a decade of development, it has become the de facto standard for communication among software processes that model a parallel program with distributed memory. Hardware engines are generally better suited to parallel applications than software is. Modern FPGA technology has enabled advances in hardware design. With the aid of HDLs and the reprogrammability of FPGAs, software programs can now be accelerated in hardware without the high cost of ASIC design. However, unlike low-level hardware design, high-level system integration can be complex and time-consuming. Professor Paul Chow and his research group at the University of Toronto have built a lightweight subset implementation of the MPI standard, called TMD-MPI. It provides software and hardware middleware layers of abstraction for communication, enabling portable interaction between embedded processors, CEs, and x86 processors. Previous work demonstrated that TMD-MPI is a feasible high-level programming model for multiple embedded processors, but complex systems with heterogeneous processing units have yet to be tested [1].

1.2 Objectives

The development of TMD-MPI is still in its infancy compared to the MPI standard. Implementations and characterizations of designs are lacking. This undergraduate thesis project attempts to fill this gap by means of the design and characterization of a TMD-MPI based heterogeneous video system. Once a simple, feasible heterogeneous system has been successfully demonstrated, the thesis focuses on expanding the network of software elements to exploit more parallelism.

2 Background

2.1 Literature Review

Although heterogeneous systems have numerous performance and energy advantages, design complexity remains a major factor limiting their use. Successful designs require developer expertise in multiple languages and tools.
For instance, a typical engineer working on a heterogeneous FPGA system needs knowledge of HDLs, software coding, the interface details between source and destination engines, CAD tools, and vendor-specific FPGA details. Ideally, a specialized hardware or software element of the system could be designed independently of the other elements, yet still be portable and easy to integrate into the overall system. TMD-MPI achieves this by abstracting away the details of coordination between the different task-specific elements, and in addition it provides an easy-to-use entry point. Similar attempts have been made by OpenFPGA and NSF CHREC:

a) OpenFPGA released its General API Specification 0.4 in 2008 in an attempt to propose an industry-standard API for portable, high-level language access to reconfigurable FPGAs [2]. The scope of TMD-MPI is much larger: the types of interaction in GenAPI are very limited, because it focuses only on low-level x86-FPGA interaction and does not deal with higher levels.

b) NSF CHREC, on the other hand, developed a conceptually similar framework that adopts the message-passing approach. A more careful inspection reveals the differences. The hardware and software elements in their SCF heterogeneous design are statically mapped to one another [3]. In contrast, the mapping of TMD-MPI nodes is defined dynamically, so point-to-point communication paths can be redirected at run time for more versatility.

2.2 Distributed/Shared Memory Approaches

The primary focus of this thesis is functionality rather than performance. Speed and performance considerations aside, two approaches can be adopted from a high-level perspective. The first is a distributed memory system, where every processing unit is equipped with its own local memory. Because local data is not accessible to ranks other than its owner, the video frames must be passed as messages from one rank to another.
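As a rough illustration of what passing frames as messages implies, the Python sketch below models a chain of ranks in which each rank holds frames only in its own inbox (standing in for local memory) and forwards every frame whole to the next rank. It is not TMD-MPI code: the names `frame_bytes` and `run_pipeline` are hypothetical, and the default frame parameters (640 by 480 pixels at 4 bytes per pixel) are taken from the target format this thesis works with.

```python
from collections import deque


def frame_bytes(width=640, height=480, bytes_per_pixel=4):
    """Size in bytes of one uncompressed frame (defaults: 640x480, 32-bit pixels)."""
    return width * height * bytes_per_pixel


def run_pipeline(num_ranks, num_frames, payload_bytes):
    """Move frames down a chain of ranks and count the bytes on each link.

    Each rank only touches frames in its own inbox, which stands in for
    its local memory. Returns (bytes carried per link, frame ids that
    reached the last rank, in arrival order).
    """
    inboxes = [deque() for _ in range(num_ranks)]
    link_bytes = [0] * (num_ranks - 1)

    # Rank 0 is the source and owns every frame at the start.
    for frame_id in range(num_frames):
        inboxes[0].append(frame_id)

    # Each rank forwards its frames to its successor; the whole frame
    # is the message, so every hop costs payload_bytes.
    for rank in range(num_ranks - 1):
        while inboxes[rank]:
            frame_id = inboxes[rank].popleft()
            inboxes[rank + 1].append(frame_id)
            link_bytes[rank] += payload_bytes

    return link_bytes, list(inboxes[-1])


if __name__ == "__main__":
    per_link, delivered = run_pipeline(3, 30, frame_bytes())
    print("bytes per link for 30 frames:", per_link)
    print("frames delivered in order:", delivered == list(range(30)))
```

In this model every link in the chain carries the full frame payload, so the per-link load grows with both frame size and frame rate, regardless of how many ranks the pipeline has.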
Video streaming has a unidirectional flow of data, which makes it a well-suited application for the distributed memory approach.

The second is a shared memory approach, where all the processing units share a common memory space. Provided that the video data are properly managed in memory by a special engine and that the memory interface is not port limited, the memory contents are accessible to all the processing units. With this approach, assigning tasks to processing units can be as simple as passing memory addresses.

The desired video frame is 640 by 480 pixels in 32-bit RGB format, which is equivalent to 1200 kilobytes. If the desired frame rate is 30 FPS, then the distributed memory system's network must be able to handle 1200 KB × 30 = 36,000 KB/s, or roughly 35 MB/s. The shared memory approach offers a less traffic-intensive way of communicating, by passing 32-bit addresses as messages. The simplest application may only need to broadcast a base memory address to all processing units, resulting in a total network data traffic of 32 bits per rank.

The drawbacks of the shared memory approach are longer memory access times and a greater susceptibility to data corruption. By comparison, the shared memory approach is more asynchronous in nature: if a section of memory is assigned to more than one codec, race conditions can cause premature or delayed memory updates. Moreover, if an FPGA is not robust and causes a bit flip in a message on the network, a bad memory address in the shared memory model is more likely to result in a catastrophic failure than a bad pixel in the distributed memory model.

2.3 The Building Blocks

In this project, point-to-point communication channels are implemented using Xilinx Fast Simplex Links, which are essentially FIFOs. The FIFO is a powerful abstraction for on-chip