
A TMD-MPI/MPE Based Heterogeneous Video System

by Tony Ming Zhou

Supervisor: Professor Paul Chow

April 2010


Abstract:

Advancements in FPGA technology have enabled reconfigurable, large-scale hardware system designs. In recent years, heterogeneous systems comprising embedded processors, memory units, and a wide variety of IP blocks have become an increasingly popular solution for building future systems. The TMD-MPI project has extended the standard software message passing interface, MPI, to the scope of FPGA hardware design. It provides a new programming model that enables transparent communication and synchronization between tasks running on the heterogeneous processing devices in a system. In this thesis project, we present the design and characterization of a TMD-MPI based heterogeneous video processing system comprising hardware peripheral cores and software video codecs. By hiding low-level architectural details from the designer, TMD-MPI improves development productivity and reduces the level of difficulty. In particular, with the abstraction TMD-MPI provides, the software video codec approach offers an easy entry point into hardware design. The primary focus is the functionality and the different configurations of the TMD-MPI based heterogeneous design.


Acknowledgements

I would like to thank my supervisor, Professor Paul Chow, for his patience and guidance, and the University of Toronto and the Department of Engineering Science for the wonderful five-year journey that led me to this point. Special thanks go to the TMD-MPI research group, in particular to Sami Sadaka, Kevin Lam, Kam Pui Tang, and Manuel Saldaña. Last but not least, I would like to thank my family and my friends for always being there for me. Stella, Kay, Grace, Amy, Rui, Qian, Chunan, and David, you have painted the colours of my university life.


Glossary

MPI Message Passing Interface

API Application Program Interface

FIFO First-In-First-Out

NetIf Network Interface

MPE Message Passing Engine

TMD Originally meant Toronto Molecular Dynamics machine, but this definition was rescinded as the platform is not limited to Molecular Dynamics. The name was kept in homage to earlier TM-series projects at the University of Toronto

VGA Video Graphics Array

RGB Red-Green-Blue Colour Model

FPS Frames Per Second

DVI Digital Visual Interface

FSL Fast Simplex Link

HDL Hardware Description Language

TX Transmission/Transmitting

RX Reception/Receiving

MPMC Multi-Ported Memory Controller

BRAM Xilinx Block RAM

PLB Processor Local Bus


Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives

2 Background
  2.1 Literature Review
  2.2 Distributed/Shared Memory Approaches
  2.3 The Building Blocks

3 Methods and Findings
  3.1 The Video System in Software
  3.2 The Video System on FPGA
    3.2.1 System Block Diagram
    3.2.2 Distributed Memory Model
    3.2.3 Shared Memory Model

4 Discussions and Conclusions
  4.1 Software vs. Hardware
  4.2 Conclusions and Future Directions

References

Appendix A: Video System – Software Prototype

Appendix B: Video System – Hardware System

Appendix C: File Structure


1 Introduction

1.1 Motivation

Chip development has become increasingly difficult due to transistor physical scaling limitations. Parallel processing stands out as one of the best alternatives for improving performance. The Message Passing Interface, or MPI, is a specification for an API that allows computers to communicate with one another. After over a decade of development, it has become the de facto standard for communication among software processes in parallel programs with distributed memory.

Hardware engines are generally better suited for parallel applications than software. Modern FPGA technology has enabled advances in hardware design. With the aid of HDLs and the FPGA's reprogrammability, software programs can now be accelerated in hardware without the high cost of ASIC design.

However, unlike low-level hardware design, high-level system integration can be complex and time consuming. Professor Paul Chow and his research group at the University of Toronto have built a lightweight subset implementation of the MPI standard, called TMD-MPI. It provides software and hardware middleware layers of abstraction for communication, enabling portable interaction between embedded processors, computing engines (CEs), and x86 processors. Previous work demonstrated that TMD-MPI is a feasible high-level programming model for multiple embedded processors, but complex systems with heterogeneous processing units had yet to be tested [1].

1.2 Objectives

The development of TMD-MPI is still in its infancy compared to the MPI standard: implementations and characterizations of designs are lacking. This undergraduate thesis project attempts to fill this gap through the design and characterization of a TMD-MPI based heterogeneous video system.


After a simple, feasible heterogeneous system is successfully demonstrated, this thesis focuses on expanding the software element network to exploit more parallelism.

2 Background

2.1 Literature Review

Although heterogeneous systems have numerous performance and energy advantages, design complexity remains a major factor limiting their use. Successful designs require developer expertise in multiple languages and tools. For instance, a typical FPGA heterogeneous system engineer must know HDLs, software coding, the interface details between source and destination engines, CAD tools, and vendor-specific FPGA details. Ideally, a specialized hardware/software element of the system could be designed independently from the other elements, yet still be portable and easily integrated into the overall system. TMD-MPI achieves this by abstracting away the details of coordination between different task-specific elements, and in addition, it provides an easy-to-use entry point.

Similar attempts have been made by OpenFPGA and NSF CHREC:

a) OpenFPGA released its General API Specification 0.4 in 2008 in an attempt to propose an industry-standard API for high-level language access to reconfigurable FPGAs in a portable manner [2]. The scope of TMD-MPI is much larger: the types of interaction in GenAPI are very limited, as it focuses only on the low-level x86-FPGA interaction and does not address higher levels.

b) NSF CHREC, on the other hand, developed a conceptually similar framework adopting the message-passing approach. A closer inspection reveals the differences: the hardware and software elements in their SCF heterogeneous design are statically mapped to one another [3]. In contrast, the mapping of TMD-MPI nodes is dynamically defined, meaning point-to-point communication paths can be redirected at run-time for more versatility.


2.2 Distributed/Shared Memory Approaches

The primary goal of this thesis is functionality rather than performance. Speed and performance considerations aside, two high-level approaches can be adopted.

The first is a distributed memory system, where every processing unit is equipped with local memory. Because local data is not accessible to ranks other than its owner, the video frames must be passed as messages from one rank to another. Video streaming has a unidirectional flow of data, which makes it well suited to the distributed memory approach.

The second is a shared memory approach, where all processing units share a common memory space. Provided that the video data in memory is properly managed by a special engine and the memory interface is not port-limited, the memory contents are accessible to all processing units. With this approach, assigning tasks to processing units can be as simple as passing memory addresses.

The desired video frame is 640 px by 480 px in 32-bit RGB format, which is equivalent to 1200 kilobytes. At the desired frame rate of 30 FPS, a system that passes whole frames as messages must sustain roughly 1200 KB × 30 ≈ 35 MB/s of network traffic. The shared memory approach introduces a far less traffic-intensive way of communicating by passing 32-bit addresses as messages. The simplest application potentially only needs to broadcast a base memory address to all processing units, resulting in a total network data traffic of 32 bits per rank.
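For reference, the arithmetic behind these figures is:

    640 × 480 pixels × 4 bytes/pixel = 1,228,800 bytes = 1200 KB per frame
    1200 KB/frame × 30 frames/s = 36,000 KB/s ≈ 35 MB/s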

The drawbacks of the shared memory approach are longer memory access times and a proneness to data corruption. In comparison, the shared memory approach exhibits a more asynchronous nature: if a section of memory is assigned to more than one codec, race conditions can cause premature or delayed memory updates. Moreover, if the FPGA is not robust and a bit flips in a message on the network, a bad memory address in the shared memory model is more likely to result in a catastrophic failure than a bad pixel in the distributed memory model.

2.3 The Building Blocks

In this project, point-to-point communication channels are implemented using Xilinx Fast Simplex Links (FSLs), which are essentially FIFOs. The FIFO is a powerful abstraction for on-chip communication and is able to handle the bandwidth of this video system. Because FSLs are unidirectional, they are implemented in pairs for the transmission and reception of data. In the special case of a CE's interface with the TMD-MPE, an extra pair of command FIFOs is required exclusively for MPI commands (Fig. 1).

Figure 1. The FIFO pairs act as point-to-point communication channels.
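For illustration, the following is a minimal sketch of the low-level FIFO access that TMD-MPI and the TMD-MPE abstract away, assuming the blocking getfsl()/putfsl() macros from Xilinx's mb_interface.h and FSL channel 0; the actual channel numbers depend on the system configuration.

    #include <mb_interface.h>   /* Xilinx MicroBlaze FSL access macros */

    /* Forward one 32-bit word (one pixel) from the RX FIFO to the TX FIFO.
     * getfsl()/putfsl() are blocking reads/writes on the named FSL channel. */
    void forward_word(void)
    {
        unsigned int word;
        getfsl(word, 0);   /* blocking read from FSL channel 0 (RX FIFO) */
        putfsl(word, 0);   /* blocking write to FSL channel 0 (TX FIFO)  */
    }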

The software elements are implemented using Xilinx Microblaze soft-core processors. The message passing protocol is brought in at compile time by including the TMD-MPI library in the source code. An additional message passing engine, or MPE, was also created to perform a subset of the message passing functions in hardware. The hardware elements can be designed in any HDL, but always require a TMD-MPE to connect to the network.
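As a rough illustration of that programming model, the following is a minimal sketch of a software rank, assuming TMD-MPI exposes the usual subset of the standard MPI C API; the header name and build flow depend on the TMD-MPI release being used.

    #include "mpi.h"    /* TMD-MPI software library (subset of the MPI standard) */

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                 /* join the on-chip network      */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this node's rank              */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of ranks in the system */

        /* ... rank-specific work: MPI_Send()/MPI_Recv() to and from other ranks ... */

        MPI_Finalize();
        return 0;
    }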

Figure 2. Simplified NetIf diagram showing the TX and RX muxes. Left, RX half with RX FIFOs. Right, TX half with TX FIFOs.


To enable dynamic mapping of the nodes, extra state machines would be required in each element/core. The Network Interface (NetIf) block provides these path-routing functions, eliminating the need for extra states in each core. Another issue arises as the system grows: the Microblaze soft-processor has only eight FSL channels available, limiting the number of nodes it can connect to. The NetIf block also solves this issue by extracting the channel multiplexing away from the Microblazes. It functions as a multiplexer controlled by the destination rank in the TX half, and a demultiplexer controlled by the source rank in the RX half (Fig. 2). Connection-wise, on one side of the NetIf, a pair of Xilinx FSLs connects the NetIf to its node; on the other side, the NetIf may be connected to all other NetIfs for maximum freedom (Fig. 3a), or to fewer channels for improved performance at the cost of reduced system visibility (Fig. 3b).

Figure 3. Different ways of interconnecting the nodes; a NetIf is attached to each element: (a) left, (b) right.

Lastly, this thesis project builds upon a video system made available by Professor Paul Chow's summer student Jeff Goeders. The three-node system is a scalable TMD-MPI based video processing framework; it currently supports video streaming from the VGA port to the DVI port on the Xilinx Virtex-5 board [5].


3 Methods and Findings

3.1 The Video System in Software

Implementing with TMD-MPI first requires understanding how to use the MPI specification. Logically, the first step was to build a video frame processor prototype entirely in software.

This prototype application utilizes the multiple CPU cores available on a PC; if the available cores are insufficient, the additional cores are simulated as additional processes on the existing cores. In the application, input and output video frames were replaced with bitmap images in the same RGB format, the processing codec was coded in C++, and both the distributed memory and shared memory models were implemented with the aid of readily available software libraries.

Results:

Two instances of the application are shown below in Figures 4 and 5. Memory activity has been omitted from the figures. The black arrows in Figure 4 symbolize the flow of actual frame data as messages, whereas the black arrows in Figure 5 symbolize the flow of memory pointers as messages. Although streaming is better suited to a distributed memory model, the streaming and parallel processing applications can be interchanged between the two models.

Figure 4. Streaming using the distributed memory message passing model (codec stages: add noise, invert colour, flip image).


Figure 5. Parallel processing using the shared memory message passing model; data is passed to the ranks as pointers into memory, and the processed data is sent to the output node as a pointer.

3.2 The Video System on FPGA

3.2.1 System Block Diagram:

Figure 6. The heterogeneous video processing system block diagram


The system block diagram in Figure 6 lays out the overall picture of the heterogeneous video processing system. The Network Interface Block in the centre, comprised of many interconnected NetIfs, is shown as one big block for simplicity. The region above the NetIfs Block consists of the hardware elements. Hardware CEs interface with the system network through TMD-MPEs. Rank 1 is a video decoder that takes the VGA input and places it onto the system network, and Rank 2 is a video receiver that gets the video frames from the network and stores them in the external memory. All of the software elements reside below the NetIfs Block: the special Rank 0 process and a network of Microblazes running several specialized codec-related processes labelled Rank 3-N. Software elements interface with the system network through the TMD-MPI software library.

The 256 MB of external memory is not only ported for Rank 2's frame storage and the DVI-out core's video output; it is also useful when a hardware engine's local memory is insufficient, or when a shared memory model is implemented. The MPMC is the memory interface between the system and the external memory.

In MPI, nodes identify each other by rank. Sending and receiving messages requires source and destination ranks. Rank 0 acts as the central command centre of the system: it initializes all the ranks at the start of runtime, and it configures and reconfigures the mapping between ranks during runtime. In the codec network of Microblazes labelled Rank 3-N, the hardware settings may or may not be identical, depending on the peripheral devices needed for each process. In general, the quickest and most efficient way to implement the codec processes is to duplicate a fully functioning Microblaze together with its peripherals and settings, but load each copy with different source code for codec-specific functions.

3.2.2 Distributed Memory Model:

Jeff Goeders' framework demonstrated video streaming from a video decoder core (Rank 1) to a memory storage core (Rank 2); eventually, the video is output to a DVI port. The communication channel between the two cores is established based on the TMD-MPI rank-to-rank model described earlier. Therefore, a Microblaze (Rank 3) can be inserted into the streaming path between the two cores simply by changing the destination rank of Rank 1 and the source rank of Rank 2. Functionally, the following tasks must be performed by the Microblaze (a sketch follows the list):

• Receive frames from Rank 1 in units of 640 x 480 frames
• Apply the video codec effects
• Send modified frames out to Rank 2 in units of 640 x 480 frames.
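A minimal sketch of this codec loop is given below; the rank numbers follow the text, while apply_codec() and the external-memory frame buffer are illustrative placeholders rather than the actual implementation.

    #include "mpi.h"

    #define FRAME_WORDS  (640 * 480)   /* one 32-bit word per pixel         */
    #define RANK_DECODER 1             /* video decoder core (frame source) */
    #define RANK_STORE   2             /* memory storage core (frame sink)  */

    extern unsigned int *frame_buf;    /* 1200 KB buffer placed in external memory */
    void apply_codec(unsigned int *buf, int n);   /* hypothetical codec routine */

    void codec_loop(void)
    {
        MPI_Status status;
        for (;;) {
            /* receive a whole frame from Rank 1 as a single message */
            MPI_Recv(frame_buf, FRAME_WORDS, MPI_INT, RANK_DECODER, 0,
                     MPI_COMM_WORLD, &status);
            apply_codec(frame_buf, FRAME_WORDS);   /* e.g. invert colours */
            /* send the modified frame on to Rank 2 */
            MPI_Send(frame_buf, FRAME_WORDS, MPI_INT, RANK_STORE, 0,
                     MPI_COMM_WORLD);
        }
    }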

The requirements above pose several challenges. First, the Microblaze must operate at more than double the rate at which Rank 1 and Rank 2 respectively send and receive data. In addition, extra clock cycles are needed for the video processing itself. The most direct solution is to parallelize the task by dividing each frame into smaller units that are distributed equally among multiple Microblazes.

The second challenge is the local memory constraint on the Microblaze. Messages are sent and received in units of frames, 1200 KB in size. Unlike the hardware engines, the Microblaze suffers from its higher-level interaction with the FIFOs: hardware engines can receive and send data on a 32-bit-FIFO-entry basis, but a Microblaze must receive each 1200 KB message as a whole. The distributed memory and BRAM available to the Microblaze are both too small for this storage; the only option is the external off-chip memory. Note that using an external memory slightly violates the principles of a true distributed local memory model.

Results:

As expected, a single-Microblaze codec suffers from poor performance. Although the soft-core processor operates at the same 100 MHz system clock frequency, the effective rate of video streaming corresponds to only 1-10 MHz (Fig. 7).

Figure 7. The Microblaze as the bottleneck in the streaming path.


Three factors contribute to the slowdown of the Microblaze. First, Microblazes interface with the Xilinx FSLs less efficiently than hardware does; the extra clock cycles needed by the Microblaze recur for every 32-bit data entry, i.e., every pixel, on the FIFO. Second, due to the large size of a video frame (1200 KB), a local memory approach is not applicable and an external off-chip memory is used instead. Because the external memory must share the PLB bus with other peripheral devices, bus arbitration and the extra traffic introduce significant delay (Fig. 8); furthermore, the external memory is off the FPGA chip, and both the complex MPMC interface and the long physical distance translate to more delay. The last factor is the implicitly sequential execution of instructions in a conventional processor [1]. However, the contribution of this last factor is small, since a well-designed video codec should be well cached and pipelined by the processor.

Figure 8. The peripheral devices that are connected to the Microblaze in this video system.

The number of remaining ports on the MPMC limits the number of codec Microblazes to six. Even if they were perfectly parallelized, the combined effective frequency would be 60 MHz, still short of the speed of the other cores. Clearly, the design needed a new direction, so the shared memory model was introduced.

3.2.3 Shared Memory Model:

The shared memory approach, as described in an earlier section, enjoys the benefit of significantly reduced network traffic at the cost of memory access time. Implementations of the model may vary, but the key idea is that only a finite number of messages are sent to the codec Microblazes. In this project, six 32-bit words are sent to each codec Microblaze; they convey the base and high addresses of the video, the type of codec to run, and other control signals. The number of Xilinx FSL interface delay cycles associated with these six messages is negligible compared to the per-pixel delay cycles of the distributed memory model.
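A sketch of this hand-off is shown below; the exact packing of the six control words is illustrative, not the actual message layout used in the system.

    #include "mpi.h"

    #define CTRL_WORDS 6

    /* Rank 0 side: pass addresses and control words instead of pixel data. */
    void rank0_dispatch(int codec_rank, unsigned int base, unsigned int high,
                        unsigned int codec_type)
    {
        unsigned int ctrl[CTRL_WORDS] = { base, high, codec_type, 0, 0, 0 };
        MPI_Send(ctrl, CTRL_WORDS, MPI_INT, codec_rank, 0, MPI_COMM_WORLD);
    }

    /* Codec Microblaze side: process the frame in place in the shared memory. */
    void codec_rank_run(void)
    {
        unsigned int ctrl[CTRL_WORDS];
        MPI_Status status;

        MPI_Recv(ctrl, CTRL_WORDS, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);

        volatile unsigned int *p    = (volatile unsigned int *)ctrl[0];  /* base address */
        volatile unsigned int *high = (volatile unsigned int *)ctrl[1];  /* high address */
        for (; p < high; p++)
            *p = ~(*p);    /* e.g. colour inversion */
    }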

Note that the shared memory approach is at a disadvantage in terms of memory access time only when compared to a true distributed memory model. Because of the limited local memory available to the Microblazes, an external memory was used for the distributed memory system in this project. Given that, a simple analysis shows that the two models exhibit practically identical memory access times: the distributed-memory Microblaze first writes the data to memory as it receives it, then reads the data back for processing and transfers it to the next node; the shared-memory Microblaze first reads the data for processing, and later updates the memory contents. Exactly one read and one write take place in both models.

The same framework by Jeff Goeders was used as the groundwork on which the shared memory model is built. The codec-related Microblazes are placed in the system without source and destination changes to Rank 1 and Rank 2. Memory space management becomes a crucial task in this model: the video storage core and the DVI-out core memory spaces are separated from each other, and the codec Microblaze's memory access spans both spaces, as it is the agent that transfers the data from one space to the other (Fig. 9).

Figure 9. Microblaze spans two spaces as both the video processor and transferor.


Results:

The speed improvement is evident and linearly scalable as far as the measurements show (Fig. 10). The frame rate was measured for up to four Microblazes, and a linear trend was observed. The codec effects include a darkening effect, addition of noise, colour inversion, and colour change. The spread in the data is expected because different codecs were applied during the multiple runs.

The FPS was measured by a special function within each Microblaze rather than at the DVI-out core. Therefore, despite the limited number of available MPMC ports, the FPS benchmark for a larger number of Microblazes can still be simulated by reducing the portion of the task handled by each Microblaze.
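A sketch of such a per-Microblaze measurement is given below, assuming a free-running xps_timer counter at TMR_ADDR (see Appendix B) and the 100 MHz system clock; the address and register offset are illustrative.

    #define TMR_ADDR 0x83C00000u          /* illustrative timer base address */
    #define CLK_HZ   100000000u           /* 100 MHz system clock            */

    static inline unsigned int read_timer(void)
    {
        /* read the free-running counter register (offset assumed) */
        return *(volatile unsigned int *)(TMR_ADDR + 0x8);
    }

    /* Frames per second over a measurement window of t_end - t_start cycles. */
    unsigned int measure_fps(unsigned int frames, unsigned int t_start,
                             unsigned int t_end)
    {
        unsigned int cycles = t_end - t_start;   /* unsigned wrap-around is safe */
        return (unsigned int)(((unsigned long long)frames * CLK_HZ) / cycles);
    }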

In the shared memory model, most of the message passing occurs between the special Rank 0 node and the codec Microblazes. Extra measures must be taken to ensure the proper sequencing of tasks for data correctness. As a result, the software code in this model tends to be more complex.

Figure 10. Measured frame rate for up to four codec Microblazes.


4 Discussions and Conclusions

4.1 Software vs. Hardware

All of the software cores except Rank 0 are replaceable with hardware cores. One of the main goals is to assess the effectiveness of the software approach to TMD-MPI programming.

The functionality of the TMD-MPI library far exceeds that of its hardware counterpart, the TMD-MPE. There are 25 MPI commands available in the TMD-MPI library, whereas the TMD-MPE supports only 3 MPI commands in total: synchronous send, asynchronous send, and receive.

For these codecs, given the same task and clock, a specialized hardware engine is expected to outperform the software process. Hardware engines can be optimized for specialized tasks and are better suited for parallel applications; processors, on the other hand, are general-purpose oriented, making them less suited for specialized tasks. Moreover, the experimental results in the previous sections suggest that the software processes incur more delay, which is another major factor limiting a software process's performance.

In terms of development speed and difficulty, the software method has clear advantages because of better scalability and reduced compile time. For instance, functionally different processors are structurally identical in hardware, so scaling the software processes is as simple as duplicating an existing Microblaze and its settings. Compared to hardware development, the compile and debug time in software is significantly reduced: on average, regenerating the system after modifying a codec takes 1 minute in software but 90 minutes in hardware.

Well-designed hardware engines should occupy less area, but software-core based codecs have lower development costs. The cost function should not be evaluated simply on chip area and development cost; there are many other contributing factors. Thus, the cost comparison is inconclusive.


The comparisons drawn above are summarized in the following table:

                Software        Hardware
Functionality   Very good       Bad
Performance     Slow            Very fast
Development     Fast            Slow
Cost            Inconclusive    Inconclusive

4.2 Conclusions and Future Directions

This thesis project is an implementation of a TMD-MPI based heterogeneous video processing system. More specifically, the video processing units were implemented using Microblaze soft-core processors executing C programs in parallel. The characterization and analysis have demonstrated that TMD-MPI is a feasible and efficient approach to heterogeneous system design.

The performance drawbacks of the software processes limit the data throughput. Since TMD-MPI provides a scalable means of parallel processing, performance can be improved by duplicating the current software processes. Although the shared memory model outperformed the distributed memory model, a true distributed local memory model was not achieved, so the comparison is slightly unfair. Finally, the shared memory model provides more abstraction for the developer: by passing memory addresses as messages, it involves less network traffic and fewer hardware modifications. Based on the experience gained throughout this project, the shared memory model is the more scalable solution.

The TMD-MPI programming model is common to both software and hardware. For the MPI commands, a simple script could be written to convert between TMD-MPI and TMD-MPE commands. As C-to-HDL technology advances, automatic software code conversion becomes possible. The TMD-MPI approach suggests the possibility of an efficient, automated method of hardware development for designers with little hardware background.


The following is a list of future directions:

• Implement more MPI commands in the TMD-MPE for better hardware functionality. Since many functions are built upon the three basic ones available in the MPE, a higher-level MPE that utilizes the TMD-MPE's basic functions could be introduced.

• Expand the software codec network, trying more complex structures and parallelization for characterization. Because of the limited number of MPMC ports, the codec network of Microblazes may adopt both the distributed and shared memory models. As the control signals become complicated for Rank 0, local hierarchy methodologies like the tree structure in Fig. 3b might be needed.

• Implement a multi-board system to explore a higher level of scalability. The building pieces are already available: Sami Sadaka has a homogeneous video processing system, and Kevin Lam has a gigabit Ethernet bridge.

• Develop a cost-effective method of automated C-to-HDL conversion so that developers can enjoy both efficient software development and hardware-accelerated performance. One might wish to study tools such as Nios II C-to-Hardware Acceleration and FPGAC, among others. Although a completely different research topic, it is one of great interest for the TMD-MPI project.


References:

[1] M. Saldaña, A. Patel, C. Madill, and P. Chow, "MPI as an Abstraction for Software-Hardware Interaction for HPRCs," Second International Workshop on High-Performance Reconfigurable Computing Technology and Applications (HPRCTA), Austin, TX, USA, 2008.

[2] OpenFPGA, "OpenFPGA General API Specification 0.4," 2008. [Online]. Available: http://www.openfpga.org/Standards%20Documents/OpenFPGA-GenAPIv0.4.pdf [Accessed: Feb. 20, 2010].

[3] V. Aggarwal, R. Garcia, A. George, and H. Lam, "SCF: A Device- and Language-Independent Task Coordination Framework for Reconfigurable, Heterogeneous Systems," HPRCTA, Portland, OR, USA, Nov. 15, 2009.

[4] MPICH2: High-Performance and Widely Portable MPI. [Online]. Available: http://www.mcs.anl.gov/research/projects/mpich2/ [Accessed: Feb. 20, 2010].

[5] J. Goeders, "A Scalable, MPI-Based Video Processing Framework," University of Toronto, Aug. 2009.

[6] M. Saldaña, "Message Passing Engine (MPE) User's Guide," ArchES Computing, Sept. 2009.


Appendix A

Video System – Software Prototype

Description

This software MPI application is a picture-frame parallel processing program. There are currently five defined ranks in the system and two distinct tests; the number of ranks can be easily expanded. The project was organized in Microsoft Visual Studio (C++) and must run under the MPICH2 environment. A multi-core computer is currently not necessary, because MPICH2 can simulate one as a multi-process, single-core program. However, instructions for running the program on multiple computers are given below.

Key Functions:

void MPE_Master(): Executed by Rank 0 only. The function first creates and initializes the shared memory, then defines the tasks to be performed and assigns them to the other ranks.

void MPE_Slave(): The slave function is executed by all non-zero ranks. It contains a polling loop that continuously polls for tasks from Rank 0; each received task is translated and the corresponding codec is executed.

void Codec(int add, int size, int rank): The codec function takes three parameters: "add" is the base address in memory at which processing starts, "size" is the size of the frame section to be processed, and "rank" determines which codec to run.
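A minimal sketch of the slave polling loop described above is given below; the tag value shown is illustrative, the real ones are defined in the application source.

    #include "mpi.h"

    #define TASKSIZE 8
    #define dieTAG   99          /* illustrative value; defined in the application source */

    void Codec(int add, int size, int rank);   /* codec dispatch described above */

    void MPE_Slave(int my_rank)
    {
        int task[TASKSIZE];
        MPI_Status status;

        for (;;) {
            /* block until Rank 0 assigns a task; the tag identifies the message type */
            MPI_Recv(task, TASKSIZE, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

            if (status.MPI_TAG == dieTAG)
                break;                          /* shut down this rank */

            /* task[3] = memory address of the frame, task[2] = size to process */
            Codec(task[3], task[2], my_rank);
        }
    }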

Instructions for running on a single computer:
1) Set up MPICH2.
2) Go to the working directory, debug folder.
3) Run the MPI program in the command prompt:
   "mpiexec –n 5 source.bmp mpi_test1.exe"
   "mpiexec –n 5 source2.bmp mpi_test2.exe"

Format: mpiexec –n arg1 arg2 executable
   arg1 – the number of processes/ranks
   arg2 – the input picture frame to be processed
   executable – the MPI program generated by the compiler

Instructions for running on multiple computers:
1) Make sure that the MPICH2 versions are the same on all computers/machines.
2) Copy the executable to the same directory on each machine (node).


   For example, "C:\Program Files\MPICH2\examples\cpi.exe".
3) Set network connections: ensure that each machine shares its files with the other computers.
4) Set Windows Firewall: ensure that Windows Firewall allows file sharing by checking the option.
5) Add the MPICH2 path to the Windows user variables and system variables.
6) Run the MPI program in the command prompt. For example:
   "mpiexec –hosts 2 domainnameA 1 domainnameB 1 c:\program files\mpich2\examples\cpi.exe"

Software Variable Definitions:

1) Tasks are MPI messages of type int array, assigned by Rank 0 through the MPI_Send command (see the sketch after these definitions). Example task declaration:
   "int taskname[TASKSIZE] = {0, 1, 640*480, 0, 0, 0, 0, 0};"
   Format: t[TASKSIZE] = { 0: source rank, 1: destination rank, 2: size of frame to access, 3: memory address/pointer of the frame, 4-7: unused }

2) Tags are transferred with each MPI message and are used to determine the type of message the MPI command carries.
   TAG: the message only contains info about the source and destination.
   TAG_ext: the extended version of the previous tag; in addition to the source and destination ranks, the memory address and size are carried in the message.
   dieTAG: signal to shut down the current core.

3) A rank is the unique identity of each process. Rank 0 is always the system control centre responsible for initializing the system and assigning tasks.
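The sketch below shows Rank 0 assembling and dispatching one task using the conventions above; TASKSIZE and the TAG_ext value are illustrative.

    #include "mpi.h"

    #define TASKSIZE 8
    #define TAG_ext  1           /* illustrative value; defined in the application source */

    void assign_task(int dest_rank, int frame_size, int frame_addr)
    {
        int task[TASKSIZE] = { 0,           /* [0] source rank                 */
                               dest_rank,   /* [1] destination rank            */
                               frame_size,  /* [2] size of frame to access     */
                               frame_addr,  /* [3] memory address of the frame */
                               0, 0, 0, 0 };

        MPI_Send(task, TASKSIZE, MPI_INT, dest_rank, TAG_ext, MPI_COMM_WORLD);
    }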

Current Tests:
Test 1: The memory pointer is passed from codec to codec; the entire frame is operated on.
Test 2: Different memory pointers are assigned by Rank 0. The frame is divided into three sections, which are handled by three different ranks/codecs/processes.


Appendix B

Video System – Hardware System

Description

The hardware system is a heterogeneous video processing system. It is built upon Jeff Goeders' video streaming framework, and the software ranks provide a scalable solution to video processing. This project has been developed as a proof-of-concept implementation. The hardware engines interface with the network through TMD-MPE v1.0, and the soft-core processors interface with the network through the TMD-MPI library v1.0.

Block Diagram:

Instructions for generating the bitstream:
1) First source the Xilinx ISE 10.1 suite:
   "source /opt/xilinx10.1/10.1.sh"
2) In the ./xps_streaming and ./xps folders, type:
   "make –f system.make clean"  – cleans the created netlist
   "make –f system.make libs"   – generates the software libraries
   "make –f system.make bits"   – generates the unmodified bitstream

Distributed memory model, streaming example:
3) In the project root folder, ./xps_streaming needs to be changed to ./xps.


4) Execute the Python script to generate the final bitstream:
   "./compile.py ./streaming.cfg"
   Find the generated bitstream in ./bit.

Shared memory model example:
3) In the project root folder, the system is built in ./xps.
4) Execute the Python script to generate the final bitstream:
   "./compile.py ./compile.cfg"
   Find the generated bitstream in ./bit.

Tips for adding additional ranks:
• The NetIf channels must be connected in the correct order; the channel number corresponds to the NetIf number it connects with.
• The order in which a new core is defined in the ./*.cfg file is the order of ranks (0 being the first item, 1 next, 2 follows, etc.).
• The new rank definition must be added to the ./rt/rt_m0_b0_f0.mem file; the new rank number is added at the beginning, since the routing table is defined in reverse order.
• The number of routing table words must increase by 2 every time a new rank is added; this parameter, called C_NUM_OF_WORDS, can be found in ./xps/system.mhs.
• A new rank should always be initialized by Rank 0 to ensure the correct order of operations and prevent race conditions.

DIP Switches: The DIP switches at the bottom right of the Virtex-5 board are used for both debugging and system settings.

SW_Pin1 & SW_Pin2: Used for the UART mux; supports up to 4 Microblazes.

SW_Pin3 & SW_Pin4: Used only in the shared memory model, must be set before the system initializes.

SW_Pin[4:3]   Codec Effect
0             no modifications to the video
1             introduces a dotted pattern to the video
2             causes colour inversion of the video
3             divides the screen into four sections and applies the following codecs: dotted


SW_Pin 5-8: The four-bit control signal for the debug mux; signals from different cores are displayed on the GPIO LEDs based on this value:

SW_Pin[8:5]   Signal displayed on LEDs
0             vga_in_to_fsl_0_o_DBG_H_CNT
1             vga_in_to_fsl_0_o_DBG_V_CNT
2             vga_in_fsl_to_mpe_0_o_DBG_CS
3             vga_in_fsl_to_mpe_0_o_DBG_SEND_CNT
4             vga_mpe_to_ram_0_o_DBG_RECV_CS
5             vga_mpe_to_ram_0_o_DBG_RECV_CNT
6             vga_mpe_to_ram_0_o_DBG_PLB_CS
7             vga_mpe_to_ram_0_o_DBG_WRITE_FRAME_CNT
8             vga_in_analyze_0_o_DBG_H_SYNC_WIDTH
9             vga_in_to_fsl_0_o_DBG_V_SYNC_WIDTH

Rank 0 – The Control Microblaze: This software process is responsible for initializing the system and directing the traffic for the codec processes.

Parameter      Description
BASEADDR       The base address defined for the video frame to be processed
HIGHADDR       The high address defined for the video frame to be processed
TFT_ADDR       The starting address of the DVI-out memory space
TMR_ADDR       The address of the xps_timer counter
GPIO_DIP_SW    The address to read SW_Pin[4:3]
FPS_DISPLAY    The enable signal for displaying the FPS information
DEBUG_LEVEL    The debugging level; see the source code for more details
DEBUG_REPS     The number of repetitions for certain debugging stages

Other Tips:
• A null-modem RS-232 cable is required for the Virtex-5 board.
• The system infrastructure can be updated purely with the system.mhs and system.mss files.
• Initializing the DVI can sometimes fail if the port is plugged in; simply remove it and plug it back in after the configuration is complete.


Appendix C

File Structure: The design is located in /work/zhoutony/video_proc_Microblaze/

File/Folder                Description
EDK-XUPV5-LX110T-Pack      XUP Virtex-5 board definitions
arches-mpi                 TMD-MPI software library
compile.py                 Script to compile source code
compile.cfg                Configuration file for the shared memory model
streaming.cfg              Configuration file for the distributed memory model
doc                        Documentation folder
sim_scripts                Scripts and files necessary for simulation
xps                        Shared memory model project files
xps_streaming              Distributed memory streaming project files
src/mb0_streaming.c        Rank 0 code for the distributed memory model
src/mb1_streaming.c        Rank 3 code for the distributed memory model
src/mb0_multi_main.c       Rank 0 code for the shared memory model
src/mbx_multi_main.c       Rank 3-N code for the shared memory model
