
INSTRUCTION SET EXTENSION USING MICROBLAZE PROCESSOR

János Lazányi

Budapest University of Technology and Economics Department of Measurement and Information Systems H-1117 Budapest, Magyar tudósok körútja 2. email: [email protected]

ABSTRACT

Instruction set extension is a common way to improve the performance of embedded processors. During my work the MicroBlaze synthesizable processor was used. It has a special high-speed bus called Fast Simplex Link (FSL), which makes it possible to develop multiprocessor systems and to integrate customized IP cores into the design. In the Xilinx usage scenario the operands of the special instruction are transmitted through the FSL link. To achieve a higher data transfer rate, the FSL was used only to control the function of the custom IP, while the operands required for the operation are read directly from the on-chip data memory.

1. MICROBLAZE PROCESSOR USING FSL LINK

MicroBlaze is a 32 bit RISC, Harvard-style soft processor developed by Xilinx. It can be synthesized in FPGAs with a maximum clock speed of 150 MHz. The processor has a large variety of bus interfaces. For fast on-chip memory access it can use the Local Memory Bus (LMB), which allows internal Block-RAM memory to be used as both instruction and data storage. The peripherals can be connected through the CoreConnect compatible On-Chip Peripheral Bus (OPB). [1]

The processor core can have eight pairs of dedicated unidirectional Fast Simplex Link (FSL) channels to enable high speed point-to-point data transfer among multiple processor cores or, in some other implementations, between the processor and peripherals. Each link has a dedicated bit to distinguish between data and control information. The FSL bus is FIFO based, enabling the implementation of basic communication primitives like semaphores and pipes. The maximum transfer speed is 300 million words/sec. The FSL connection is highly integrated into the processor architecture, and it can be programmed through dedicated assembly and C language instructions. [2]

We can use the FSL link to connect the processor core with a custom IP block, as described in [3]. In this case the custom IP block is attached to the main CPU loosely. This solution has many advantages. We do not change the optimized processor core while adding a custom instruction, so we can fulfill the previously determined strict timing constraints defined for the CPU. With the high level language macros we can send data easily through the FSL link, which removes the need to rebuild the full tool-chain (assembler, linker, C compiler, debugger etc.) for each system separately. Fig. 1 shows the system architecture proposed by Xilinx.

Fig. 1. System architecture proposed by Xilinx.

In [3] the author presents an IDCT core that multiplies the input samples with constants read from a table. The operands can be transferred to the IDCT core in 16 clock cycles: eight clock cycles are necessary to read them from the internal data memory into the internal register file, and 8 more to transfer them through the FSL. The multiplication is done in 64 clock cycles. The data is written back in 16 more cycles, resulting in 96 clock cycles altogether. During the IDCT process the processor can execute other tasks, and read the data from the custom IDCT core only when necessary. In the described implementation the FSL has a FIFO capacity of 16 words to compensate for the speed difference between the CPU and the IDCT core.

If we could read the eight input samples directly from the memory through a Direct Memory Interface, the IDCT instruction would take 32 clock cycles less (a 33 percent improvement). We could reduce the system size too, by eliminating the FSL FIFO.

The described scenario of instruction set extension is effective only when relatively large algorithms (like IDCT) are detached from the main CPU; however, there is a significant overhead caused by operand and result transfer between the CPU and the custom IP core.
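The cycle budget quoted above can be tallied in a few lines of C. This is only a sketch: the constant and function names are hypothetical, and it assumes that the 32 saved cycles correspond to the 16 operand-transfer cycles plus the 16 write-back cycles, which the text implies but does not state explicitly.

```c
#include <assert.h>

/* Cycle counts quoted in the text for the FSL-attached IDCT core of [3]. */
enum {
    CYCLES_MEM_TO_REGS = 8,   /* data memory -> register file       */
    CYCLES_REGS_TO_FSL = 8,   /* register file -> FSL               */
    CYCLES_MULTIPLY    = 64,  /* multiplication inside the IDCT core */
    CYCLES_WRITE_BACK  = 16   /* results back to the CPU            */
};

/* Total cycles when operands and results travel through the FSL. */
int fsl_idct_cycles(void)
{
    return CYCLES_MEM_TO_REGS + CYCLES_REGS_TO_FSL
         + CYCLES_MULTIPLY + CYCLES_WRITE_BACK;
}

/* Total cycles if the core accessed the data memory directly.
 * Assumption: both the operand transfer and the write-back disappear. */
int direct_idct_cycles(void)
{
    return CYCLES_MULTIPLY;
}

/* Saving in whole percent (rounded down) relative to the original count. */
int saved_percent(int before, int after)
{
    assert(before > 0 && after <= before);
    return (before - after) * 100 / before;
}
```

Under these assumptions, saved_percent(fsl_idct_cycles(), direct_idct_cycles()) reproduces the 33 percent figure given in the text.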

2. PROPOSED ARCHITECTURE

We can increase the speed of the architecture described above by eliminating the operand and result transfer between the CPU and the custom IP block through the FSL, and using the FSL only to initiate the function.

The MicroBlaze processor can use internal Block RAM as data memory. The scenario described in [2] fetches data from the BRAM, stores it in one of the 32 internal registers, and passes it to the custom IP block.

If we connect our custom IP to the dual-ported data memory directly, through a Direct Memory Interface, we can access the CPU's memory directly (as shown in Fig. 2), eliminating the temporary internal register store and load instructions.

Fig. 2. Proposed system architecture.

By disabling caching in the segment of the data memory where the operands and results are stored, we can maintain data integrity throughout the whole system.

3. TEST SCENARIO

In some applications the custom instruction uses a large amount of data as operands and/or results. This data can be transferred through the FSL or can be read directly from the data BRAM. During my tests I used a Chemical Similarity Search [4] engine to evaluate the performance of the proposed scenario.

Each molecule has a binary representation of 1024 bit length. Two molecules A and B are identical if their binary descriptions X_A and X_B are equal. The goal is to find similar molecules, where the similarity S is defined in the following way:

$$ S_{A,B} = \frac{c}{a + b - c} \qquad (1) $$

We can download the binary representation of the molecules to the BRAM data memory through a system connection (like UART). Each molecule uses 32 memory words. Calculation of the similarity constant S can be initiated by transferring two memory pointers, pointing to X_A and X_B, through the FSL; afterwards the result is transferred through the same link.

The proposed architecture has a significant benefit: it eliminates the need to transfer all the operands (two times 1024 bits) to the Similarity Search Engine. Another improvement is that the direct BRAM access works at a much higher speed than the FSL, which is limited by the MicroBlaze processor.

The only limitation of the recommended system architecture is that only one custom peripheral unit can be connected directly to the data BRAM through the Direct Memory Interface.

4. CONCLUSION

The recommended system architecture makes it possible to develop a fast custom instruction interface for applications where a large amount of data is necessary for the auxiliary calculation. The custom instruction is interfaced to the MicroBlaze processor through an FSL interface, where commands are transmitted; the numerous operands and results are accessed directly from the on-chip BRAM.

In our test scenario we used this architecture to develop a chemical similarity check engine that calculates the likeness of two molecules, based on their 1024 bit long binary representation.

The improvement of this system is significant compared to the scheme where the custom instruction receives the operands through the FSL.

5. REFERENCES

[1] MicroBlaze Processor Reference Guide, Xilinx, 2004
[2] Fast Simplex Link Channel (FSL), Product specification, Xilinx, 2004
[3] Hans-Peter Rosinger: Connecting Customized IP to the MicroBlaze Soft Processor Using the Fast Simplex Link (FSL) Channel, Xilinx, 2004
[4] Peter Willett, "Chemical Similarity Searching", J. Chem. Inf. Comput. Sci., vol 38, pp. 983-996, 1998
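For reference, the similarity measure of Eq. (1) in the test scenario can be modelled in software as follows. This is a minimal sketch and not the hardware engine itself. It assumes the usual reading of this coefficient, in which a and b are the numbers of bits set in X_A and X_B and c is the number of bits set in both; the text as preserved here does not define these symbols. The function names and the handling of two all-zero descriptions are likewise hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 1024-bit description stored as 32 memory words, as stated in the text. */
#define WORDS_PER_MOLECULE 32

/* Count the bits set in one 32-bit word (clears the lowest set bit per step). */
static unsigned popcount32(uint32_t v)
{
    unsigned n = 0;
    while (v != 0) {
        v &= v - 1;
        n++;
    }
    return n;
}

/* Software model of Eq. (1): S = c / (a + b - c).
 * xa and xb point to the two descriptions in data memory, mirroring the
 * two pointers that the text says are passed to the engine. */
double similarity(const uint32_t *xa, const uint32_t *xb)
{
    unsigned a = 0, b = 0, c = 0;

    assert(xa != 0 && xb != 0);
    for (int i = 0; i < WORDS_PER_MOLECULE; i++) {
        a += popcount32(xa[i]);          /* bits set in X_A       */
        b += popcount32(xb[i]);          /* bits set in X_B       */
        c += popcount32(xa[i] & xb[i]);  /* bits set in both      */
    }
    if (a + b - c == 0) {
        return 1.0;  /* assumption: two all-zero descriptions count as identical */
    }
    return (double)c / (double)(a + b - c);
}
```

A model of this kind could serve as a check on the results returned by the engine over the FSL, since both operate on the same 32-word buffers.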
