VIRTUAL PROTOTYPING
Selecting memory controllers for DSP systems
By Deepak Shankar, Mirabilis Design

DSP systems often include multiple embedded processors and hardware accelerators. The performance of these systems is typically limited by factors such as I/O bandwidth, memory distribution, and memory speed. This is particularly true when the system components share a memory interface. For such systems, it is critical to choose the right memory controller. Different memory controllers offer latency distributions that make them suitable for specific applications. For example, a slot-based controller with fixed priorities can offer deterministic latencies, while buses such as PCI-Express and CoreConnect offer lower latency at peak loading but higher average latency.

It is extremely difficult to predict the performance of the memory system and the effect of contention without a solid model of the system. It is therefore important to invest in modelling before beginning development. This modelling should include the allocation of threads/tasks to resources, the identification of any custom hardware needs, and the determination of the size and speed of the I/O. The modelling can be performed using a number of methods, including "back of the napkin" calculations, spreadsheet analysis, or building a physical prototype.

In this article, we examine a unique "virtual prototyping" approach to modelling. We use this approach to model an MPEG II application in a Xilinx FPGA. We evaluate two memory access schemes for this application: the Multi-Port Multi-Channel (MPMC) Memory Controller from Xilinx and the CoreConnect Bus specification for FPGAs.

Figure 1: VisualSim model of the Virtex-4 FX platform: e405 with MPMC Memory Controller.
Figure 2: VisualSim model of the Xilinx Virtex-4 FX platform: e405 with CoreConnect model.

Virtual Prototyping
The virtual prototyping approach discussed in this article is based on the VisualSim solution from Mirabilis Design. VisualSim is a graphical platform for analyzing the performance of hardware and software systems. It is based on a library of parameterized components including processors, memory controllers, DMA, buses, switches and I/Os. Using this library of building blocks, a designer can construct a specification-level model of a system containing multiple processors, memories, sensors and buses. Model construction is a process of connecting icons that represent the IP in a graphical editor, as illustrated in Figure 1. Simulations can be explored by varying parameters for the input scenarios, data rates, priorities, speed, or size. By analyzing the simulation results, the designer can choose the solution that achieves the required latency with the lowest power, smallest fabric configuration, and highest system throughput.
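Before building the virtual prototype, a quick "back of the napkin" check of the aggregate traffic against the raw SDRAM bandwidth is still worthwhile. The short C sketch below does that arithmetic for the traffic mix used later in this article (33 MHz PCI, 100 Mbps Ethernet, a 200 MHz DDR2 interface with 8-byte accesses); the per-master demand figures, in particular the assumed processor miss traffic, are illustrative assumptions and not results from the VisualSim model.

#include <stdio.h>

/* First-order feasibility check: compare the peak demand of the shared
 * masters against the theoretical DDR2 bandwidth.  The "demand" numbers
 * below are illustrative assumptions for a napkin calculation only. */
int main(void)
{
    /* DDR2 SDRAM: 200 MHz clock, double data rate, 8-byte (64-bit) data path */
    const double sdram_clock_hz   = 200e6;
    const double sdram_bytes_xfer = 8.0;
    const double sdram_peak_bps   = sdram_clock_hz * 2.0 * sdram_bytes_xfer;

    /* Masters sharing the memory interface (assumed sustained rates) */
    const double pci_bps      = 33e6 * 4.0;         /* 33 MHz, 32-bit PCI      */
    const double ethernet_bps = 100e6 / 8.0;        /* 100 Mbps link           */
    const double cpu_bps      = 400e6 * 0.10 * 8.0; /* assume 10% of e405
                                                       cycles miss to SDRAM,
                                                       8 bytes per access      */

    const double demand_bps = pci_bps + ethernet_bps + cpu_bps;

    printf("SDRAM peak bandwidth : %.1f MB/s\n", sdram_peak_bps / 1e6);
    printf("Aggregate demand     : %.1f MB/s\n", demand_bps / 1e6);
    printf("Headroom             : %.1f%%\n",
           100.0 * (1.0 - demand_bps / sdram_peak_bps));
    return 0;
}

A calculation like this says nothing about contention, burst behaviour or arbitration latency, which is precisely why the rest of the article relies on a simulation model instead.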


The MPMC Memory Controller (Figure 1) is a popular option for this type of application, as it provides an extremely efficient means of interfacing the processor and key high-speed devices to SDRAM. The CoreConnect Bus (Figure 2) is another popular option. It supports multiple masters, including the PowerPC caches and key high-speed devices, connected via a slave port to the SDRAM. The preferred technique depends on which approach will provide the best performance in terms of throughput, latency, and processor efficiency.

To investigate the similarities and differences between the two approaches, we constructed models of both configurations using the VisualSim graphical modelling environment. The exploration models did not require any software coding. The designer only needs to connect the different modelling elements, create the right traffic mix from the processor and high-speed devices, and select parameter settings that match the anticipated design.
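Whether the exploration is done in VisualSim or with a home-grown script, it helps to be explicit about which parameters are being swept. The C declaration below collects the knobs discussed in this study (clock rates, controller width, burst capacities, FIFO depth, traffic rates) in one place; the structure and its field names are this article's own shorthand for illustration, not a VisualSim interface.

#include <stdio.h>

/* Illustrative parameter set for one exploration run. */
struct explore_params {
    double core_clock_hz;      /* PowerPC e405 clock                 */
    double mem_clock_hz;       /* MPMC or CoreConnect clock          */
    int    controller_width;   /* data path width in bytes           */
    int    read_burst_slots;   /* CoreConnect read channel capacity  */
    int    write_burst_slots;  /* CoreConnect write channel capacity */
    int    sdram_fifo_depth;   /* transaction-level FIFO buffers     */
    double ethernet_bps;
    double pci_clock_hz;
};

int main(void)
{
    /* Baseline configuration used in this study. */
    struct explore_params baseline = {
        .core_clock_hz     = 400e6,
        .mem_clock_hz      = 200e6,
        .controller_width  = 8,
        .read_burst_slots  = 4,
        .write_burst_slots = 2,
        .sdram_fifo_depth  = 32,
        .ethernet_bps      = 100e6,
        .pci_clock_hz      = 33e6,
    };

    /* A sweep copies the baseline and varies one field at a time,
     * e.g. the memory-controller clock. */
    struct explore_params run = baseline;
    run.mem_clock_hz = 400e6;

    printf("baseline mem clock: %.0f MHz, sweep run: %.0f MHz\n",
           baseline.mem_clock_hz / 1e6, run.mem_clock_hz / 1e6);
    return 0;
}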

Design Considerations
The MPMC Memory Controller and the CoreConnect Bus are explored using the same configuration. The instruction and data channels of the PowerPC e405, the Ethernet interface and the PCI interface are connected as Masters. The DDR2 SDRAM is a Slave. The Masters and Slaves were kept at the same speed and size characteristics in both models. The design considerations in this study are:

• What is driving the design: overall performance or a combination of power and performance?
• Do I need one or two e405 processors for my core tasks?
• If the e405 PowerPC is running at 400 MHz, what might be the best MPMC or CoreConnect Bus clock ratio? (A first-order check of the candidate ratios appears in the sketch after the project overview below.)
• Does my design have equal read and write memory activity, more reads than writes, or more writes than reads?
• What is the effect of the SDRAM speed on the processor/memory controller/bus clock ratio?
• What is the effect of the SDRAM speed on added application access to memory?

These design considerations overlap and are interdependent, making analysis complicated. The ability to model concurrency (that is, simultaneous events) in a deterministic manner within VisualSim allows for a consistent comparison of the MPMC Memory Controller and CoreConnect Bus configurations. Our exploration looked at processor stalls, cache hit ratios, I/O throughput and latency per task. We modified a combination of clock speed, controller width and data loading rates to study the performance. This article discusses only the overall performance of the design. For more details on this design exploration, or to view the detailed modelling effort, please contact the authors.

Project Overview
This project evaluates the performance of an MPEG II algorithm implemented in C. The target system is a Virtex-4 FX with up to two hardwired PowerPC e405 cores connected to DDR2 SDRAM.
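As flagged in the clock-ratio consideration above, the question of the best bus or controller clock relative to the 400 MHz e405 can be bounded with simple arithmetic before any simulation. The C sketch below enumerates candidate clocks and reports the core-to-bus ratio and raw DDR2 throughput at each; the candidate list is an assumption for illustration, not output from the VisualSim model.

#include <stdio.h>

/* Enumerate candidate memory-controller/bus clocks against the 400 MHz
 * e405 core clock and report the clock ratio plus the raw DDR2 bandwidth
 * available at that clock.  Assumes a 64-bit (8-byte) DDR2 data path. */
int main(void)
{
    const double core_clock_hz   = 400e6;   /* PowerPC e405 */
    const double candidates_hz[] = { 100e6, 133e6, 200e6, 400e6 };
    const int    n = sizeof candidates_hz / sizeof candidates_hz[0];

    for (int i = 0; i < n; i++) {
        double bus_hz  = candidates_hz[i];
        double ratio   = core_clock_hz / bus_hz;
        double peak_mb = bus_hz * 2.0 * 8.0 / 1e6;  /* DDR: 2 transfers/clock */
        printf("bus %6.0f MHz  core:bus ratio %4.1f:1  peak %7.1f MB/s\n",
               bus_hz / 1e6, ratio, peak_mb);
    }
    return 0;
}

The exploration in this article settles on the 200 MHz and 400 MHz points of this range, since lower CoreConnect clock rates did not meet the real-time requirement.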

Application software:
There are a total of 33 unique DSP tasks in the application. The tasks running on the processor include DFT, CS_Weighting, IR, and Q_Taylor_Weighting. These tasks are executed multiple times, and each loop has variable counts based on the size, depth and resolution of the incoming image.
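To drive the model, the 33 tasks have to be described in a form the traffic generator can consume. A minimal sketch of such a task table is shown below; the structure, the field names and the rule tying loop counts to image width, height and depth are hypothetical illustrations of the idea, not the actual VisualSim configuration or the real task parameters.

#include <stdio.h>

/* Hypothetical descriptor for one DSP task in the MPEG II workload.
 * Loop counts are scaled from the incoming image parameters, mirroring
 * the observation that each task's loop count varies with image size,
 * depth and resolution. */
struct dsp_task {
    const char *name;
    int base_loops;        /* loops for a reference image          */
    double bytes_per_loop; /* average SDRAM traffic per iteration  */
};

static int scaled_loops(const struct dsp_task *t,
                        int width, int height, int depth)
{
    /* Illustrative scaling rule: proportional to pixel count and depth. */
    double scale = (double)width * height * depth / (720.0 * 576.0 * 8.0);
    return (int)(t->base_loops * scale + 0.5);
}

int main(void)
{
    const struct dsp_task tasks[] = {
        { "DFT",                4, 2048.0 },
        { "CS_Weighting",       2, 1024.0 },
        { "IR",                 8,  512.0 },
        { "Q_Taylor_Weighting", 2, 1024.0 },
        /* ... remaining tasks of the 33-task set ... */
    };

    for (unsigned i = 0; i < sizeof tasks / sizeof tasks[0]; i++) {
        int loops = scaled_loops(&tasks[i], 720, 576, 8);
        printf("%-20s loops=%3d  traffic=%.1f KB\n",
               tasks[i].name, loops,
               loops * tasks[i].bytes_per_loop / 1024.0);
    }
    return 0;
}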

MPMC Memory Controller
The MPMC Memory Controller arbitrates access to SDRAM in a predictable fashion. As shown in Figure 3, the controller algorithm gives preference to the Processor Instruction (P0) and Data (P1) cache accesses for certain cycles, and secondary preference to the application access ports 2 (P2) and 3 (P3) if these slots are already in use. Figure 4 illustrates the access patterns observed for the MPMC Memory Controller.

Figure 3: MPMC Memory Controller Arbitration Algorithm.
Figure 4: MPMC Memory Controller Activity with SDRAM at top.
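The arbitration behaviour described in this section can be captured, in very reduced form, by a fixed-priority arbiter in which the cache ports win their designated slots and the application ports take what is left. The C sketch below is a simplification for illustration only; the alternating even/odd slot pattern is an assumption, and the real MPMC uses a time-slot table and port FIFOs that are not modelled here.

#include <stdio.h>

/* Simplified single-cycle arbitration decision for the four MPMC ports:
 * P0 = instruction cache, P1 = data cache, P2/P3 = application ports.
 * Cache ports are favoured in their designated slots; application ports
 * are served with secondary preference. */
enum { P0 = 0, P1, P2, P3, NONE = -1 };

static int arbitrate(int cycle, const int pending[4])
{
    /* Even cycles favour the instruction cache, odd cycles the data
     * cache (assumed slot pattern). */
    int favoured = (cycle % 2 == 0) ? P0 : P1;

    if (pending[favoured])
        return favoured;

    /* Secondary preference: the other cache port, then P2, then P3. */
    int order[3] = { favoured == P0 ? P1 : P0, P2, P3 };
    for (int i = 0; i < 3; i++)
        if (pending[order[i]])
            return order[i];

    return NONE;
}

int main(void)
{
    for (int cycle = 0; cycle < 8; cycle++) {
        /* Example traffic: caches request on the first four cycles only,
         * application ports request on every cycle. */
        int pending[4] = { cycle < 4, cycle < 4, 1, 1 };
        int winner = arbitrate(cycle, pending);
        if (winner == NONE)
            printf("cycle %d -> idle\n", cycle);
        else
            printf("cycle %d -> P%d granted\n", cycle, winner);
    }
    return 0;
}

Counting how often each port is granted over a full trace gives exactly the uniform-versus-variable latency picture discussed later in the Analysis section.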

CoreConnect Bus:
The CoreConnect Bus arbitrates from the master to the slave port using read, write, address and request channels. The slave port was connected directly to the SDRAM, and the bus arbitration provided high-speed burst access to the SDRAM. The read channel has a capacity of 4 bursts and the write channel has a capacity of 2 bursts. For this benchmark, we used bursts of 16 bytes, or two memory accesses of 8 bytes each. Figure 5 illustrates the access patterns observed for the CoreConnect Bus.

Figure 5: CoreConnect Bus Activity with SDRAM at bottom.

System Setup
The analysis used the embedded solution on a Xilinx Virtex-4 FX. The system was modelled as follows.

PowerPC:
The PowerPC e405 core executed at 400 MHz and the external SDRAM ran at 200 MHz. For this analysis, we used a 200 MHz DDR2 memory controller to interface to the SDRAM. The PCI received data at 33 MHz while the Ethernet ran at 100 Mbps. The PCI stimulus was received at a constant rate, while the Ethernet received data every 2 ms in bursts of 4 packets of 256 bytes. The external SDRAM was 1 GB and had 32 transaction-level FIFO buffers.

Floating-point co-processor:
We used the Xilinx APU/FCM floating-point co-processor for this benchmark. The co-processor was needed due to the vector-oriented, heavy DSP nature of the workload. The APU ran at 400 MHz to match the PowerPC pipeline speed. The PowerPC pipeline handled the instruction and data pre-fetch, and the APU simply executed the decoded instruction. Memory access from the APU was performed through the PowerPC bus interfaces.

DMA:
The PowerPC accessed memory using a direct memory call, while the PCI and Ethernet accessed memory using DMA. (The Ethernet and PCI have a lightweight DMA engine with a dedicated channel.) This was essential to fragment the incoming data and to prevent the network traffic from causing a bottleneck at the memory controller. To minimize the impact of instruction cache misses, software processing was triggered only after 4 bursts of data from the Ethernet were saved in the SDRAM and the DMA triggered an interrupt in the processor.
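The "trigger after four Ethernet bursts" rule is the kind of detail that is easy to express as a small state machine in the traffic model. The C sketch below shows one way to write it; the function name and the interrupt hook are hypothetical, while the 2 ms burst period, 4 x 256-byte packets per burst and 4-burst threshold are taken from the setup above.

#include <stdio.h>

/* Illustrative model of the Ethernet-to-SDRAM DMA path: a burst of
 * 4 x 256-byte packets arrives every 2 ms; once 4 bursts have been
 * written to SDRAM, the DMA raises an interrupt so the PowerPC starts
 * processing.  The hook below is hypothetical, not a Xilinx driver API. */
#define PACKETS_PER_BURST   4
#define PACKET_BYTES        256
#define BURST_PERIOD_US     2000
#define BURSTS_BEFORE_IRQ   4

static void raise_processor_interrupt(unsigned long time_us, long bytes)
{
    printf("t=%lu us: DMA interrupt, %ld bytes staged in SDRAM\n",
           time_us, bytes);
}

int main(void)
{
    long bytes_in_sdram = 0;
    int  bursts_pending = 0;

    /* Simulate 20 ms of Ethernet arrivals. */
    for (unsigned long t = 0; t <= 20000; t += BURST_PERIOD_US) {
        bytes_in_sdram += PACKETS_PER_BURST * PACKET_BYTES;
        bursts_pending++;

        if (bursts_pending == BURSTS_BEFORE_IRQ) {
            raise_processor_interrupt(t, bytes_in_sdram);
            bursts_pending = 0;   /* processor consumes the staged data */
            bytes_in_sdram = 0;
        }
    }
    return 0;
}

With the stated arrival pattern the processor is interrupted roughly every 8 ms, which keeps the disturbance from the network path infrequent, as intended.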

Analysis
The simulation ran on a 1.6 GHz Microsoft Windows XP (SP2 Standard Edition) machine with 512 MB of cache. We simulated 3.0 ms of system execution. This simulation took 67 seconds to complete. The model was constructed in about 4 days using standard elements in the VisualSim library.

We simulated both 200 MHz and 400 MHz clock rates for the CoreConnect Bus and the MPMC Memory Controller. (All other model parameters were held constant.) Using the 200 MHz MPMC, the end-to-end latency for the application was 87.190 µs, while the 400 MHz CoreConnect achieved a latency of 86.42 µs. Both numbers met our real-time threshold, and all the discussion below is based on these clock rates. We found that lower CoreConnect clock rates did not provide adequate performance.

We found that the MPMC offered uniform latency cycles for cache-to-SDRAM accesses, while the CoreConnect had more variable timing. The lowest cycle count was achieved using the CoreConnect, but its average count was significantly higher. Table 1 shows the details of our results.

Table 1: Instruction and Data Cache Statistics.

For 25 of the 33 tasks, the MPMC Memory Controller finished its tasks faster than the CoreConnect Bus. Some tasks took significantly longer when running with the CoreConnect Bus. Table 2 shows the details of our results.

Table 2: Task latency in cycles for key tasks in the application benchmark software.

As shown in Table 3, the mean processor stall for the application using both techniques was about 2.66%. While the average processor stall was 1.99% for the MPMC, it was 1.96% for the CoreConnect model. Unfortunately, the CoreConnect had a peak stall of 19.64%, and this caused the excessive latency for certain tasks. This indicated that when the CoreConnect was stalled, the impact on the application latency was higher than with the MPMC. This also confirmed our finding that the CoreConnect had a higher overall latency and processor stalling percentage. Increasing the network traffic did not cause any noticeable increase in the MPMC latency, but external traffic increases did cause the overall CoreConnect latency to increase.

Table 3: Latency and Stall Activity Statistics.

The MPMC Memory Controller had a significantly better mean hit ratio for the instruction cache: 94% versus 18.84% for the CoreConnect Bus. At certain times during the simulation, the CoreConnect did reach a 94% hit ratio, but the duration was very short. This indicated that the MPMC arbitrated SDRAM requests better than the CoreConnect Bus. Looking at the number of threads (33) and the thread size (1.8 KB on average), we see that increasing the PowerPC caches to 64 KB each would reduce SDRAM latency and achieve a 100% hit ratio for the instruction cache. The maximum value is a worst-case estimate, based on cache access activity with no prefetching or consideration of looping in the application code.

Summary
The MPMC achieves better SDRAM latency at almost half the clock speed of the CoreConnect Bus. Moreover, its uniform cycle count offered scalability in places where the network traffic was more volatile. Our recommendation is to use the MPMC if the lowest latency is required at the lowest clock rate and lowest power. In these two models one can see that the MPMC Memory Controller does a better job of arbitrating cache requests, resulting in better hit ratios and throughput for similar system conditions.

Virtual prototypes and performance modelling allowed us to explore a number of scenarios and clock configurations. The statistics generated provided us full visibility into the operation of the FPGA design. Moreover, the use of the fully validated VisualSim FPGA Modelling Toolkit components allowed us to save time without compromising overall accuracy. The toolkit has advanced simulation technology for FPGA systems to make relative comparisons between different configurations.

References
IBM 64-Bit Processor Local Bus, Architecture Specifications, Version 3.5, SA-14-2534-01, FY 2001.
Xilinx PLB Block RAM (BRAM) Interface Controller (v1.00b), DS420, October 25, 2005.
Xilinx OPB Block RAM (BRAM) Interface Controller, DS468, December 1, 2005.
Xilinx MCH_OPB Synchronous DRAM (SDRAM) Controller (v1.00a), DS492, April 4, 2005.
