VIRTUAL PROTOTYPING
Selecting memory controllers for DSP systems
By Deepak Shankar, Mirabilis Design

DSP systems often include multiple embedded processors and hardware accelerators. The performance of these systems is typically limited by factors such as I/O bandwidth, memory distribution, and memory speed. This is particularly true when the system components share a memory interface. For such systems, it is critical to choose the right memory controller. Different memory controllers offer latency distributions that make them suitable for specific applications. For example, a slot-based controller with fixed priorities can offer deterministic latencies, while buses such as PCI-Express and CoreConnect offer lower latency at peak loading but higher average latency.

It is extremely difficult to predict the performance of the memory system and the effect of contention without a solid model of the system. It is therefore important to invest in modelling before beginning development. This modelling should include the allocation of threads/tasks to resources, the identification of any custom hardware needs, and the determination of the size and speed of the I/O. The modelling can be performed using a number of methods, including "back of the napkin" calculations, spreadsheet analysis, or building a physical prototype.

In this article, we examine a unique "virtual prototyping" approach to modelling. We use this approach to model an MPEG II application in a Xilinx FPGA. We evaluate two memory access schemes for this application: the Multi-Port Multi-Channel (MPMC) Memory Controller from Xilinx and the CoreConnect Bus specification for FPGAs.

Figure 1: VisualSim model of the Virtex-4 FX platform: e405 with MPMC Memory Controller.
Figure 2: VisualSim model of the Xilinx Virtex-4 FX platform: e405 with CoreConnect model.

Virtual Prototyping
The virtual prototyping approach discussed in this article is based on the VisualSim solution from Mirabilis Design. VisualSim is a graphical platform for analyzing the performance of hardware and software systems. It is based on a library of parameterized components including processors, memory controllers, DMA, buses, switches and I/Os. Using this library of building blocks, a designer can construct a specification-level model of a system containing multiple processors, memories, sensors and buses. Model construction is a process of connecting icons that represent the IP in a graphical editor, as illustrated in Figure 1. Simulations can be explored by varying parameters for the input scenarios, data rates, priorities, speed, or size. By analyzing the simulation results, the designer can choose the solution that achieves the required latency with the lowest power, smallest fabric configuration, and highest system throughput.
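Before building the virtual prototype, a quick "back of the napkin" check of the aggregate traffic against the raw SDRAM bandwidth is still worthwhile. The short C sketch below does that arithmetic for the traffic mix used later in this article (33 MHz PCI, 100 Mbps Ethernet, a 200 MHz DDR2 interface with 8-byte accesses); the per-master demand figures, in particular the assumed processor miss traffic, are illustrative assumptions and not results from the VisualSim model.

#include <stdio.h>

/* First-order feasibility check: compare the peak demand of the shared
 * masters against the theoretical DDR2 bandwidth.  The "demand" numbers
 * below are illustrative assumptions for a napkin calculation only. */
int main(void)
{
    /* DDR2 SDRAM: 200 MHz clock, double data rate, 8-byte (64-bit) data path */
    const double sdram_clock_hz   = 200e6;
    const double sdram_bytes_xfer = 8.0;
    const double sdram_peak_bps   = sdram_clock_hz * 2.0 * sdram_bytes_xfer;

    /* Masters sharing the memory interface (assumed sustained rates) */
    const double pci_bps      = 33e6 * 4.0;         /* 33 MHz, 32-bit PCI      */
    const double ethernet_bps = 100e6 / 8.0;        /* 100 Mbps link           */
    const double cpu_bps      = 400e6 * 0.10 * 8.0; /* assume 10% of e405
                                                       cycles miss to SDRAM,
                                                       8 bytes per access      */

    const double demand_bps = pci_bps + ethernet_bps + cpu_bps;

    printf("SDRAM peak bandwidth : %.1f MB/s\n", sdram_peak_bps / 1e6);
    printf("Aggregate demand     : %.1f MB/s\n", demand_bps / 1e6);
    printf("Headroom             : %.1f%%\n",
           100.0 * (1.0 - demand_bps / sdram_peak_bps));
    return 0;
}

A calculation like this says nothing about contention, burst behaviour or arbitration latency, which is precisely why the rest of the article relies on a simulation model instead.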


The MPMC Memory Controller (Figure 1) is a popular option for this type of application, as it provides an extremely efficient means of interfacing the processor and key high-speed devices to SDRAM. The CoreConnect Bus (Figure 2) is another popular option. It supports multiple masters, including the PowerPC caches and key high-speed devices, connected via a slave port to the SDRAM. The preferred technique depends on which approach will provide the best performance in terms of throughput, latency, and processor efficiency.

To investigate the similarities and differences between the two approaches, we constructed models of both configurations using the VisualSim graphical modelling environment. The exploration models did not require any software coding. The designer only needs to connect the different modelling elements, create the right traffic mix from the processor and high-speed devices, and select parameter settings that match the anticipated design.
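Whether the exploration is done in VisualSim or with a home-grown script, it helps to be explicit about which parameters are being swept. The C declaration below collects the knobs discussed in this study (clock rates, controller width, burst capacities, FIFO depth, traffic rates) in one place; the structure and its field names are this article's own shorthand for illustration, not a VisualSim interface.

#include <stdio.h>

/* Illustrative parameter set for one exploration run. */
struct explore_params {
    double core_clock_hz;      /* PowerPC e405 clock                 */
    double mem_clock_hz;       /* MPMC or CoreConnect clock          */
    int    controller_width;   /* data path width in bytes           */
    int    read_burst_slots;   /* CoreConnect read channel capacity  */
    int    write_burst_slots;  /* CoreConnect write channel capacity */
    int    sdram_fifo_depth;   /* transaction-level FIFO buffers     */
    double ethernet_bps;
    double pci_clock_hz;
};

int main(void)
{
    /* Baseline configuration used in this study. */
    struct explore_params baseline = {
        .core_clock_hz     = 400e6,
        .mem_clock_hz      = 200e6,
        .controller_width  = 8,
        .read_burst_slots  = 4,
        .write_burst_slots = 2,
        .sdram_fifo_depth  = 32,
        .ethernet_bps      = 100e6,
        .pci_clock_hz      = 33e6,
    };

    /* A sweep copies the baseline and varies one field at a time,
     * e.g. the memory-controller clock. */
    struct explore_params run = baseline;
    run.mem_clock_hz = 400e6;

    printf("baseline mem clock: %.0f MHz, sweep run: %.0f MHz\n",
           baseline.mem_clock_hz / 1e6, run.mem_clock_hz / 1e6);
    return 0;
}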

Design Considerations
The MPMC Memory Controller and the CoreConnect Bus are explored using the same configuration. The instruction and data channels of the PowerPC e405, the Ethernet interface and the PCI interface are connected as Masters. The DDR2 SDRAM is a Slave. The Masters and Slaves were kept at the same speed and size characteristics in both models. The design considerations in this study are:

• What is driving the design: overall performance or a combination of power and performance?
• Do I need one or two e405 processors for my core tasks?
• If the e405 PowerPC is running at 400 MHz, what might be the best MPMC or CoreConnect Bus clock ratio? (A first-order check of the candidate ratios appears in the sketch after the project overview below.)
• Does my design have equal read and write memory activity, more reads than writes, or more writes than reads?
• What is the effect of the SDRAM speed on the processor/memory controller/bus clock ratio?
• What is the effect of the SDRAM speed on added application access to memory?

These design considerations overlap and are interdependent, making analysis complicated. The ability to model concurrency (that is, simultaneous events) in a deterministic manner within VisualSim allows for a consistent comparison of the MPMC Memory Controller and CoreConnect Bus configurations. Our exploration looked at processor stalls, cache hit ratios, I/O throughput and latency per task. We modified a combination of clock speed, controller width and data loading rates to study the performance. This article discusses only the overall performance of the design. For more details on this design exploration, or to view the detailed modelling effort, please contact the authors.

Project Overview
This project evaluates the performance of an MPEG II algorithm implemented in C. The target system is a Virtex-4 FX with up to two hardwired PowerPC e405 cores connected to DDR2 SDRAM.
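As flagged in the clock-ratio consideration above, the question of the best bus or controller clock relative to the 400 MHz e405 can be bounded with simple arithmetic before any simulation. The C sketch below enumerates candidate clocks and reports the core-to-bus ratio and raw DDR2 throughput at each; the candidate list is an assumption for illustration, not output from the VisualSim model.

#include <stdio.h>

/* Enumerate candidate memory-controller/bus clocks against the 400 MHz
 * e405 core clock and report the clock ratio plus the raw DDR2 bandwidth
 * available at that clock.  Assumes a 64-bit (8-byte) DDR2 data path. */
int main(void)
{
    const double core_clock_hz   = 400e6;   /* PowerPC e405 */
    const double candidates_hz[] = { 100e6, 133e6, 200e6, 400e6 };
    const int    n = sizeof candidates_hz / sizeof candidates_hz[0];

    for (int i = 0; i < n; i++) {
        double bus_hz  = candidates_hz[i];
        double ratio   = core_clock_hz / bus_hz;
        double peak_mb = bus_hz * 2.0 * 8.0 / 1e6;  /* DDR: 2 transfers/clock */
        printf("bus %6.0f MHz  core:bus ratio %4.1f:1  peak %7.1f MB/s\n",
               bus_hz / 1e6, ratio, peak_mb);
    }
    return 0;
}

The exploration in this article settles on the 200 MHz and 400 MHz points of this range, since lower CoreConnect clock rates did not meet the real-time requirement.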

Application software:
There are a total of 33 unique DSP tasks in the application. The tasks running on the processor include DFT, CS_Weighting, IR, and Q_Taylor_Weighting. These tasks are executed multiple times, and each loop has variable counts based on the size, depth and resolution of the incoming image.
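To drive the model, the 33 tasks have to be described in a form the traffic generator can consume. A minimal sketch of such a task table is shown below; the structure, the field names and the rule tying loop counts to image width, height and depth are hypothetical illustrations of the idea, not the actual VisualSim configuration or the real task parameters.

#include <stdio.h>

/* Hypothetical descriptor for one DSP task in the MPEG II workload.
 * Loop counts are scaled from the incoming image parameters, mirroring
 * the observation that each task's loop count varies with image size,
 * depth and resolution. */
struct dsp_task {
    const char *name;
    int base_loops;        /* loops for a reference image          */
    double bytes_per_loop; /* average SDRAM traffic per iteration  */
};

static int scaled_loops(const struct dsp_task *t,
                        int width, int height, int depth)
{
    /* Illustrative scaling rule: proportional to pixel count and depth. */
    double scale = (double)width * height * depth / (720.0 * 576.0 * 8.0);
    return (int)(t->base_loops * scale + 0.5);
}

int main(void)
{
    const struct dsp_task tasks[] = {
        { "DFT",                4, 2048.0 },
        { "CS_Weighting",       2, 1024.0 },
        { "IR",                 8,  512.0 },
        { "Q_Taylor_Weighting", 2, 1024.0 },
        /* ... remaining tasks of the 33-task set ... */
    };

    for (unsigned i = 0; i < sizeof tasks / sizeof tasks[0]; i++) {
        int loops = scaled_loops(&tasks[i], 720, 576, 8);
        printf("%-20s loops=%3d  traffic=%.1f KB\n",
               tasks[i].name, loops,
               loops * tasks[i].bytes_per_loop / 1024.0);
    }
    return 0;
}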

MPMC Memory Controller
The MPMC Memory Controller arbitrates access to SDRAM in a predictable fashion. As shown in Figure 3, the controller algorithm gives preference to the Processor Instruction (P0) and Data (P1) cache accesses for certain cycles, and secondary preference to the application access ports 2 (P2) and 3 (P3) if these slots are already in use. Figure 4 illustrates the access patterns observed for the MPMC Memory Controller.

Figure 3: MPMC Memory Controller Arbitration Algorithm.
Figure 4: MPMC Memory Controller Activity with SDRAM at top.
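The arbitration behaviour described in this section can be captured, in very reduced form, by a fixed-priority arbiter in which the cache ports win their designated slots and the application ports take what is left. The C sketch below is a simplification for illustration only; the alternating even/odd slot pattern is an assumption, and the real MPMC uses a time-slot table and port FIFOs that are not modelled here.

#include <stdio.h>

/* Simplified single-cycle arbitration decision for the four MPMC ports:
 * P0 = instruction cache, P1 = data cache, P2/P3 = application ports.
 * Cache ports are favoured in their designated slots; application ports
 * are served with secondary preference. */
enum { P0 = 0, P1, P2, P3, NONE = -1 };

static int arbitrate(int cycle, const int pending[4])
{
    /* Even cycles favour the instruction cache, odd cycles the data
     * cache (assumed slot pattern). */
    int favoured = (cycle % 2 == 0) ? P0 : P1;

    if (pending[favoured])
        return favoured;

    /* Secondary preference: the other cache port, then P2, then P3. */
    int order[3] = { favoured == P0 ? P1 : P0, P2, P3 };
    for (int i = 0; i < 3; i++)
        if (pending[order[i]])
            return order[i];

    return NONE;
}

int main(void)
{
    for (int cycle = 0; cycle < 8; cycle++) {
        /* Example traffic: caches request on the first four cycles only,
         * application ports request on every cycle. */
        int pending[4] = { cycle < 4, cycle < 4, 1, 1 };
        int winner = arbitrate(cycle, pending);
        if (winner == NONE)
            printf("cycle %d -> idle\n", cycle);
        else
            printf("cycle %d -> P%d granted\n", cycle, winner);
    }
    return 0;
}

Counting how often each port is granted over a full trace gives exactly the uniform-versus-variable latency picture discussed later in the Analysis section.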

CoreConnect Bus:
The CoreConnect Bus arbitrates from the master to the slave port using read, write, address and request channels. The slave port was connected directly to the SDRAM, and the bus arbitration provided high-speed burst access to the SDRAM. The read channel has a capacity of 4 bursts and the write channel has a capacity of 2 bursts. For this benchmark, we used bursts of 16 bytes, or two memory accesses of 8 bytes each. Figure 5 illustrates the access patterns observed for the CoreConnect Bus.

Figure 5: CoreConnect Bus Activity with SDRAM at bottom.

System Setup
The analysis used the embedded solution on a Xilinx Virtex-4 FX. The system was modelled as follows.

PowerPC:
The PowerPC e405 core executed at 400 MHz and the external SDRAM ran at 200 MHz. For this analysis, we used a 200 MHz DDR2 memory controller to interface to the SDRAM. The PCI received data at 33 MHz while the Ethernet ran at 100 Mbps. The PCI stimulus was received at a constant rate, while the Ethernet received data every 2 ms in bursts of 4 packets of 256 bytes. The external SDRAM was 1 GB and had 32 transaction-level FIFO buffers.

Floating-point co-processor:
We used the Xilinx APU/FCM floating-point co-processor for this benchmark. The co-processor was needed due to the vector-oriented, heavy DSP nature of the workload. The APU ran at 400 MHz to match the PowerPC pipeline speed. The PowerPC pipeline handled the instruction and data pre-fetch, and the APU simply executed the decoded instruction. Memory access from the APU was performed through the PowerPC bus interfaces.

DMA:
The PowerPC accessed memory using a direct memory call, while the PCI and Ethernet accessed memory using DMA. (The Ethernet and PCI have a lightweight DMA engine with a dedicated channel.) This was essential to fragment the incoming data and to prevent the network traffic from causing a bottleneck at the memory controller. To minimize the impact of instruction cache misses, software processing was triggered only after 4 bursts of data from the Ethernet were saved in the SDRAM and the DMA triggered an interrupt in the processor.
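The "trigger after four Ethernet bursts" rule is the kind of detail that is easy to express as a small state machine in the traffic model. The C sketch below shows one way to write it; the function name and the interrupt hook are hypothetical, while the 2 ms burst period, 4 x 256-byte packets per burst and 4-burst threshold are taken from the setup above.

#include <stdio.h>

/* Illustrative model of the Ethernet-to-SDRAM DMA path: a burst of
 * 4 x 256-byte packets arrives every 2 ms; once 4 bursts have been
 * written to SDRAM, the DMA raises an interrupt so the PowerPC starts
 * processing.  The hook below is hypothetical, not a Xilinx driver API. */
#define PACKETS_PER_BURST   4
#define PACKET_BYTES        256
#define BURST_PERIOD_US     2000
#define BURSTS_BEFORE_IRQ   4

static void raise_processor_interrupt(unsigned long time_us, long bytes)
{
    printf("t=%lu us: DMA interrupt, %ld bytes staged in SDRAM\n",
           time_us, bytes);
}

int main(void)
{
    long bytes_in_sdram = 0;
    int  bursts_pending = 0;

    /* Simulate 20 ms of Ethernet arrivals. */
    for (unsigned long t = 0; t <= 20000; t += BURST_PERIOD_US) {
        bytes_in_sdram += PACKETS_PER_BURST * PACKET_BYTES;
        bursts_pending++;

        if (bursts_pending == BURSTS_BEFORE_IRQ) {
            raise_processor_interrupt(t, bytes_in_sdram);
            bursts_pending = 0;   /* processor consumes the staged data */
            bytes_in_sdram = 0;
        }
    }
    return 0;
}

With the stated arrival pattern the processor is interrupted roughly every 8 ms, which keeps the disturbance from the network path infrequent, as intended.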

Analysis
The simulation ran on a 1.6 GHz Microsoft Windows XP (SP2 Standard Edition) machine with 512 MB of cache. We simulated 3.0 ms of system execution. This simulation took 67 seconds to complete. The model was constructed in about 4 days using standard elements in the VisualSim library.

We simulated both 200 MHz and 400 MHz clock rates for the CoreConnect Bus and the MPMC Memory Controller. (All other model parameters were held constant.) Using the 200 MHz MPMC, the end-to-end latency for the application was 87.190 µs, while the 400 MHz CoreConnect achieved a latency of 86.42 µs. Both numbers met our real-time threshold, and all the discussion below is based on these clock rates. We found that lower CoreConnect clock rates did not provide adequate performance.

We found that the MPMC offered uniform latency cycles for cache-to-SDRAM accesses, while the CoreConnect had more variable timing. The lowest cycle count was achieved using the CoreConnect, but its average count was significantly higher. Table 1 shows the details of our results.

Table 1: Instruction and Data Cache Statistics.

For 25 of the 33 tasks, the MPMC Memory Controller finished its tasks faster than the CoreConnect Bus. Some tasks took significantly longer when running with the CoreConnect Bus. Table 2 shows the details of our results.

Table 2: Task latency in cycles for key tasks in the application benchmark software.

As shown in Table 3, the mean processor stall for the application using both techniques was about 2.66%. While the average processor stall was 1.99% for the MPMC, it was 1.96% for the CoreConnect model. Unfortunately, the CoreConnect had a peak stall of 19.64%, and this caused the excessive latency for certain tasks. This indicated that when the CoreConnect was stalled, the impact on the application latency was higher than with the MPMC. This also confirmed our finding that the CoreConnect had a higher overall latency and processor stalling percentage. Increasing the network traffic did not cause any noticeable increase in the MPMC latency, but external traffic increases did cause the overall CoreConnect latency to increase.

Table 3: Latency and Stall Activity Statistics.

The MPMC Memory Controller had a significantly better mean hit ratio for the instruction cache: 94% versus 18.84% for the CoreConnect Bus. At certain times during the simulation, the CoreConnect did reach a 94% hit ratio, but the duration was very short. This indicated that the MPMC arbitrated SDRAM requests better than the CoreConnect Bus. Looking at the number of threads (33) and the thread size (1.8 KB on average), we see that increasing the PowerPC caches to 64 KB each would reduce SDRAM latency and achieve a 100% hit ratio for the instruction cache. The maximum value is a worst-case estimate, based on cache access activity with no prefetching or consideration of looping in the application code.

Summary
The MPMC achieves better SDRAM latency at almost half the clock speed of the CoreConnect Bus. Moreover, its uniform cycle count offered scalability in places where the network traffic was more volatile. Our recommendation is to use the MPMC if the lowest latency is required at the lowest clock rate and lowest power. In these two models one can see that the MPMC Memory Controller does a better job of arbitrating cache requests, resulting in better hit ratios and throughput for similar system conditions.

Virtual prototypes and performance modelling allowed us to explore a number of scenarios and clock configurations. The statistics generated provided us full visibility into the operation of the FPGA design. Moreover, the use of the fully validated VisualSim FPGA Modelling Toolkit components allowed us to save time without compromising overall accuracy. The toolkit has advanced simulation technology for FPGA systems to make relative comparisons between different configurations.

References
IBM 64-Bit Processor Local Bus, Architecture Specifications, Version 3.5, SA-14-2534-01, FY 2001.
Xilinx PLB Block RAM (BRAM) Interface Controller (v1.00b), DS420, October 25, 2005.
Xilinx OPB Block RAM (BRAM) Interface Controller, DS468, December 1, 2005.
Xilinx MCH_OPB Synchronous DRAM (SDRAM) Controller (v1.00a), DS492, April 4, 2005.
