ISSN 2319-8885, Vol.03, Issue.30, October-2014

Pages:6051-6058

www.ijsetr.com

A VLIW Architecture for Executing Scalar/Vector Instructions on Instruction Level Parallelism

ARSHIYA KAREEM1, B. UPENDER2
1PG Scholar, Dept of ECE, Shadan College of Engineering & Technology, Hyderabad, India, Email: cute.arshiya69@gmail.com
2Associate Professor, Dept of ECE, Shadan College of Engineering & Technology, Hyderabad, India.

Abstract: This paper proposes a new architecture for accelerating data-parallel applications based on the combination of VLIW and vector processing paradigms. It uses the VLIW architecture for processing multiple independent scalar instructions concurrently on parallel execution units. Data-level parallelism is expressed by the vector ISA and processed on the same parallel execution units of the VLIW architecture. The proposed processor, which is called VecLIW, has a unified register file of 64x32-bit registers in the decode stage for storing scalar/vector data. VecLIW can issue up to four scalar/vector operations in each cycle for parallel processing a set of operands and producing up to four results. However, it cannot issue more than one memory operation at a time, which loads/stores 128-bit scalar/vector data from/to data memory. Four 32-bit results can be written back into the VecLIW register file. The complete design of our proposed VecLIW processor is implemented using VHDL targeting the Xilinx FPGA Virtex-5, XC5VLX110T-3FF1136 device. The required numbers of slice registers and LUTs are 3,992 and 14,826 (14,570 for logic and 256 for memory), respectively. The number of LUT-FF pairs used is 17,425, where 13,433 have an unused flip-flop, 2,599 have an unused LUT, and 1,393 are fully used LUT-FF pairs.

Keywords: VLIW Architecture; Vector Processing; Data-Level Parallelism; Unified Data Path; FPGA/VHDL Implementation.

I. INTRODUCTION
One of the most important methods for achieving high performance is taking advantage of parallelism. The simplest way to take advantage of parallelism among instructions is through pipelining, which overlaps instruction execution to reduce the total time to complete an instruction sequence (see [2] for more detail). All processors since about 1985 have used the pipelining technique to improve performance by exploiting instruction-level parallelism (ILP). Instructions can be processed in parallel because not every instruction depends on its immediate predecessor. After eliminating data and control stalls, the pipelining technique can achieve an ideal performance of one clock cycle per operation (CPO). To further improve performance, the CPO would have to be decreased to less than one. Obviously, the CPO cannot be reduced below one if the issue width is only one operation per clock cycle. Therefore, multiple-issue scalar processors fetch multiple scalar instructions and allow multiple operations to issue in a clock cycle. Vector processors, on the other hand, fetch a single vector instruction (v operations) and issue multiple operations per clock cycle. Statically/dynamically scheduled superscalar processors issue varying numbers of operations per clock cycle and use in-order/out-of-order execution [3, 4]. Very long instruction word (VLIW) processors, in contrast, issue a fixed number of operations formatted either as one large instruction or as a fixed instruction packet, with the parallelism among the independent operations explicitly indicated by the instruction [5].

VLIW and superscalar implementations of traditional scalar instruction sets share some characteristics: multiple execution units and the ability to execute multiple operations simultaneously. However, the parallelism is explicit in VLIW instructions, whereas it must be discovered by hardware at run time in superscalar processors. Thus, for high performance, VLIW implementations are simpler and cheaper than superscalars because of further hardware simplifications. However, VLIW architectures require more compiler support (see [6] for more detail).

VLIW architectures are characterized by instructions that each specify several independent operations. Thus, VLIW is unlike CISC instructions, which typically specify several dependent operations. VLIW instructions are like RISC instructions except that they are longer, to allow them to specify multiple independent simple operations. A VLIW instruction can be thought of as several RISC instructions packed together, where RISC instructions typically specify one operation. The explicit encoding of multiple operations into a VLIW instruction leads to dramatically reduced hardware complexity compared to superscalar. Thus, the main advantage of VLIW is that the highly parallel implementation is much simpler and cheaper to build than equivalently concurrent RISC or CISC chips. See [7] for an architectural comparison between CISC, RISC, and VLIW.
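To make the bundle idea concrete, the following is a minimal VHDL sketch of slicing a 128-bit VLIW into four fixed-position 32-bit operations. The slot widths match the 4x32-bit format used by VecLIW later in this paper, but the entity and port names are our own illustration, not taken from any implementation.

    library ieee;
    use ieee.std_logic_1164.all;

    -- Hypothetical packing of four independent 32-bit operations into one
    -- 128-bit VLIW bundle; names and slot ordering are illustrative only.
    entity vliw_slicer is
      port (
        bundle : in  std_logic_vector(127 downto 0);
        op1    : out std_logic_vector(31 downto 0);
        op2    : out std_logic_vector(31 downto 0);
        op3    : out std_logic_vector(31 downto 0);
        op4    : out std_logic_vector(31 downto 0));
    end entity vliw_slicer;

    architecture rtl of vliw_slicer is
    begin
      -- Each fixed-position slot feeds its own execution unit; no dynamic
      -- dependence checking is needed because the compiler guarantees the
      -- four operations are independent.
      op1 <= bundle(127 downto 96);
      op2 <= bundle(95  downto 64);
      op3 <= bundle(63  downto 32);
      op4 <= bundle(31  downto 0);
    end architecture rtl;

Because the slots sit at fixed bit positions, the decoders for all four operations can work in parallel, which is exactly the hardware simplification the VLIW approach trades compiler effort for.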

On multiple execution units, this paper proposes a new processor architecture for accelerating data-parallel applications by the combination of VLIW and vector processing paradigms. It is based on the VLIW architecture for processing multiple scalar instructions concurrently. Moreover, data-level parallelism (DLP) is expressed efficiently using vector instructions and processed on the same parallel execution units of the VLIW architecture. Thus, the proposed processor, which is called VecLIW, exploits ILP using VLIW instructions and DLP using vector instructions.

The use of a vector instruction set architecture (ISA) leads to expressing programs in a more concise and efficient way (high semantic content), encoding parallelism explicitly in each vector instruction, and using simple design techniques (heavy pipelining and functional unit replication) that achieve high performance at low cost [8, 9]. Thus, vector processors remain the most effective way to exploit data-parallel applications [10, 11]. Therefore, many vector architectures have been proposed in the literature to accelerate data-parallel applications [12]. Commercially, the Cell BE architecture is based on a heterogeneous, shared-memory chip with nine processors: a Power processor element optimized for control tasks, and eight synergistic processor elements (SPEs) that provide an execution environment optimized for data processing. Each SPE performs both scalar and data-parallel SIMD execution on a wide data path. NEC Corporation introduced the SX-9 processor, which runs at 3.2 GHz with eight-way replicated vector pipes, each having two multiply units and two addition units. The peak vector performance of the SX-9 processor is 102.4 GFLOPS. For non-vectorized code, there is a scalar unit that runs at half the speed of the vector unit, i.e., 1.6 GHz.

To exploit VLIW and vector techniques, Salami and Valero proposed and evaluated adding vector capabilities to a μSIMD-VLIW core to speed up the execution of the DLP regions while reducing the fetch bandwidth requirements. A VLIW vector media coprocessor, the "vector coprocessor (VCP)," was also introduced, which included three asymmetric execution pipelines with cascaded SIMD ALUs. To improve performance efficiency, its designers reduced the area ratio of the control circuit while increasing the ratio of the arithmetic circuit.

This paper combines the VLIW and vector processing paradigms to accelerate data-parallel applications. On a unified parallel data path, our proposed VecLIW processes multiple scalar instructions packed in a VLIW as well as vector instructions by issuing up to four scalar/vector operations in each cycle. However, it cannot issue more than one memory operation at a time, which loads/stores 128-bit scalar/vector data from/to the data cache. Four 32-bit results can be written back into the VecLIW register file. The complete design of our proposed VecLIW processor is implemented using VHDL targeting the Xilinx FPGA Virtex-5, XC5VLX110T-3FF1136 device.

The rest of the paper is organized as follows. Section II presents the block diagram of a generic VLIW processor. Section III describes the VT architectural paradigm. Section IV describes the FPGA/VHDL implementation of VecLIW. Finally, Section V concludes this paper and gives directions for future work.

II. BLOCK DIAGRAM OF GENERIC VLIW PROCESSOR
VLIW architectures offer high performance at a much lower cost than dynamic out-of-order superscalar processors. By allowing the compiler to directly schedule machine resource usage, the need for expensive instruction issue logic is obviated. Furthermore, while the enormous complexity of superscalar issue logic limits the number of instructions that can be issued simultaneously, VLIW machines can be built with a large number of functional units, allowing a much higher degree of instruction-level parallelism (ILP). VLIW instructions indicate several independent operations. Instead of using hardware to discover parallelism, VLIW processors rely on a compiler that generates the VLIW code to specify the parallelism explicitly.

Fig.1. Block diagram of generic VLIW implementation

In VLIW, the complexity of hardware is moved to software. This trade-off has a benefit: the complexity is paid for only once, when the compiler is written, instead of every time a chip is fabricated. The result is a smaller chip, which leads to increased profits for the vendor and/or cheaper prices for the customers. It is also easier to deal with complexity in a software design than in a hardware design. Thus, the chip may cost less to design, be quicker to design, and may require less debugging, all of which are factors that can make the design cheaper. Moreover, improvements to the compiler can be made after chips have been fabricated, whereas improvements to superscalar dispatch hardware require changes to the microprocessor, which naturally incurs all the expenses of turning around a new chip design.

The VLIW instruction format encodes an operation for every execution unit. This assumes that every instruction will always have something useful for every execution unit. Unfortunately, it is not possible to pack every instruction with work for all execution units. Also, in a VLIW machine that has both integer and floating-point execution units, the best compiler would not be able to keep the floating-point units busy during the execution of an integer-only application. The problem with such VLIW instructions is that they do not make full use of all execution units, which results in a waste of precious processor resources: instruction memory space, instruction cache space, and bandwidth.

There are two solutions to reduce this waste of resources:
1. Instructions can be compressed with a more highly-encoded representation. Different techniques, such as Huffman encoding, can be employed to allocate the fewest bits to the most frequently used operations.
2. An instruction word can be defined that encodes fewer operations than the number of available execution units.

A. Generic Architecture of VLIW Processor
Each 128-bit VLIW instruction word consists of two operations. The architecture is built such that the two operations can be executed in parallel to maximize performance. Each operation uses a register file, which consists of a set of sixteen internal registers, each 64 bits wide.

Fig.2. Block diagram of VLIW architecture

Each operation is divided into four stages:
1. Fetch stage
2. Decode stage
3. Execute stage
4. Write back stage

Fetch stage: The next instruction is fetched from the memory address that is currently stored in the program counter (PC) and stored in the instruction register (IR). At the end of the fetch operation, the PC points to the next instruction that will be read at the next cycle.

Decode stage: This stage interprets the instruction. During this cycle, the instruction inside the IR (instruction register) gets decoded.

Execute stage: The control unit of the CPU passes the decoded information as a sequence of control signals to the relevant function units of the CPU to perform the actions required by the instruction, such as reading values from registers, passing them to the ALU to perform mathematical or logic functions on them, and writing the result back to a register. If the ALU is involved, it sends a condition signal back to the CU.

Write back stage: The result generated by the operation is stored in the main memory or sent to an output device. Based on the condition of any feedback from the ALU, the PC may be updated to a different address from which the next instruction will be fetched. During the decode stage, data are read from the register file, and during the write back stage, data are written into the register file. Based on these requirements, the VLIW microprocessor is implemented. The incoming instructions and data from external systems to the VLIW microprocessor are fetched by the fetch unit.
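The following is a minimal VHDL sketch of the generic fetch stage just described: the PC addresses the instruction memory and the fetched 128-bit word (two operations in this generic machine) is latched into the instruction register. The entity, ports, and the assumption of a combinational memory read are ours, not the paper's RTL.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Sketch of a generic fetch stage: latch the word addressed by PC and
    -- advance PC to the next 128-bit (16-byte) instruction word.
    entity fetch_stage is
      port (
        clk     : in  std_logic;
        rst     : in  std_logic;
        imem_rd : in  std_logic_vector(127 downto 0); -- word at address PC
        pc      : out unsigned(31 downto 0);
        ir      : out std_logic_vector(127 downto 0));
    end entity fetch_stage;

    architecture rtl of fetch_stage is
      signal pc_r : unsigned(31 downto 0) := (others => '0');
    begin
      pc <= pc_r;
      process (clk)
      begin
        if rising_edge(clk) then
          if rst = '1' then
            pc_r <= (others => '0');
          else
            ir   <= imem_rd;   -- latch the fetched instruction word (IR)
            pc_r <= pc_r + 16; -- PC now points to the next instruction
          end if;
        end if;
      end process;
    end architecture rtl;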

After the instruction and data have been fetched, they are given to the decode stage. The 128-bit instruction consists of two operations, and each operation is given to the corresponding decode stage. Each operation is also passed from the fetch stage to the register file to allow the data to be read from the register file for each corresponding operation. In the decode stage, the operations are decoded and passed on to the execute stage. The execute stage, as its name implies, executes the corresponding decoded operation. The execute stage has access to the shared register file for reading data during execution. Upon completion of execution of an operation, the final stage (write back stage) writes the results of the operation into the register file, or routes data to the output of the VLIW microprocessor for a read operation.

B. Top Level Architecture
Instructions and data are fetched using an external instruction memory that has its own instruction cache. The defined VLIW microprocessor loads instructions and data directly from the external instruction memory through the 6-bit bus interface word and the 128-bit bus interface data. The output interface signal jump from the VLIW microprocessor is fed back as an input to the external instruction memory as an indicator that a branch has been taken and that the instruction memory needs to pass another portion of instructions and data to the VLIW microprocessor. The top level architecture of the VLIW processor is shown in Fig.3 below.

Fig.3. Top level architecture of VLIW.

III. THE VT ARCHITECTURAL PARADIGM
Parallelism and locality are the key application characteristics exploited by computer architects to make productive use of increasing transistor counts while coping with wire delay and power dissipation. Conventional sequential ISAs provide minimal support for encoding parallelism or locality, so high-performance implementations are forced to devote considerable area and power to on-chip structures that extract parallelism or that support arbitrary global communication. The large area and power overheads are justified by the demand for even small improvements in performance on legacy codes for popular ISAs. Many important applications have abundant parallelism, however, with dependencies and communication patterns that can be statically determined. ISAs that expose more parallelism reduce the need for area- and power-intensive structures to extract dependencies dynamically. Similarly, ISAs that allow locality to be expressed reduce the need for long-range communication and complex interconnects. The challenge is to develop an efficient encoding of an application's parallel dependency graph and to reduce the area and power consumption of the micro-architecture that will execute this dependency graph.

In this paper, we unify the vector and multithreaded execution models with the vector-thread (VT) architectural paradigm. VT allows large amounts of structured parallelism to be compactly encoded in a form that allows a simple micro-architecture to attain high performance at low power by avoiding complex control and data path structures and by reducing activity on long wires. The VT programmer's model extends a conventional scalar control processor with an array of slave virtual processors (VPs). VPs execute strings of RISC-like instructions packaged into atomic instruction blocks (AIBs). To execute data-parallel code, the control processor broadcasts AIBs to all the slave VPs. To execute thread-parallel code, each VP directs its own control flow by fetching its own AIBs. Implementations of the VT architecture can also exploit instruction-level parallelism within AIBs. In this way, the VT architecture supports a modeless intermingling of all forms of application parallelism. This flexibility provides new ways to parallelize codes that are difficult to vectorize or that incur excessive synchronization costs when threaded. Instruction locality is improved by allowing common code to be factored out and executed only once on the control processor, and by executing the same AIB multiple times on each VP in turn. Data locality is improved as most operand communication is isolated to within an individual VP.

We are developing a prototype processor, SCALE, which is an instantiation of the vector-thread architecture designed for low-power and high-performance embedded systems. As transistors have become cheaper and faster, embedded applications have evolved from simple control functions to cell phones that run multitasking networked operating systems with real-time video, three-dimensional graphics, and dynamic compilation of garbage-collected languages. Many other embedded applications require sophisticated high-performance information processing, including streaming media devices, network routers, and wireless base stations. In this paper, we show how benchmarks taken from these embedded domains can be mapped efficiently to the SCALE vector-thread architecture. In many cases, the codes exploit multiple types of parallelism simultaneously for greater efficiency.

An architectural paradigm consists of the programmer's model for a class of machines plus the expected structure of implementations of these machines. This section first describes the abstraction the VT architecture provides to a programmer, and then gives an overview of the physical model for a VT machine.

A. VT Abstract Model
The vector-thread architecture is a hybrid of the vector and multithreaded models. A conventional control processor interacts with a virtual processor vector (VPV), as shown in Fig.4. The programming model consists of two interacting instruction sets, one for the control processor and one for the VPs. Applications can be mapped to the VT architecture in a variety of ways, but it is especially well suited to executing loops; each VP executes a single iteration of the loop and the control processor is responsible for managing the execution. A virtual processor contains a set of registers and has the ability to execute RISC-like instructions with virtual register specifiers. VP instructions are grouped into atomic instruction blocks (AIBs), the unit of work issued to a VP at one time. There is no automatic program counter or implicit instruction fetch mechanism for VPs; all instruction blocks must be explicitly requested by either the control processor or the VP itself.

Fig.4. Abstract model of the vector-thread architecture: a control processor interacts with a virtual processor vector (an ordered sequence of VPs).

IV. FPGA/VHDL IMPLEMENTATION OF VECLIW PROCESSOR
The complete design of the VecLIW is implemented using VHDL targeting the Xilinx FPGA Virtex-5, XC5VLX110T-3FF1136 device. A single Virtex-5 configurable logic block (CLB) comprises two slices, each containing four 6-input LUTs and four flip-flops, for a total of eight 6-input LUTs and eight flip-flops per CLB. Virtex-5 logic cell ratings reflect the increased logic capacity offered by the new 6-input LUT architecture. The Virtex-5 family is the first FPGA platform to offer a real 6-input LUT with fully independent (not shared) inputs. By properly loading a LUT, any 6-input combinational function can be implemented. Moreover, the LUT can also be configured as a 64x1 or 32x2 distributed RAM. Besides, a single LUT supports a 32-bit shift register. Fig.5 shows the top-level RTL schematic diagram of our proposed VecLIW. It is generated from synthesizing the VHDL code of the VecLIW processor on Xilinx ISE 13.4. Our implementation of the VecLIW pipeline consists of five components: Fetch Stage, Decode Stage, Execute Stage, Memory_Stage, and Writeback_Stage.

The first component (Fetch Stage) has three modules: Program counter, Instruction_adder, and Instruction memory. The 32-bit output address of the Program counter module is sent to the input of the Instruction memory module to fetch a 128-bit VLIW from the instruction cache when the read enable (RdEn) control signal is asserted. Note that the write enable (WrEn) control signal is used to fill the instruction cache with the program instructions through the 128-bit WrVLIW input. Depending on the PcUpdVal control signal ('1' for branch and '0' for other instructions), the input address of the Program counter module is connected either to the next sequential address (PC + 16, since each VLIW is 4x32 bits, or 16 bytes) or to the branch address (BrAddr) coming from the decode stage. The Program counter is updated synchronously when PcUpdEn is asserted. The outputs of the fetch stage, the next PC (NPC) and the 128-bit VLIW instruction, are sent to the decode stage through the IF/ID pipeline register. To accelerate the fetch stage, a carry-lookahead adder is implemented in the Instruction_adder module to increment the PC by 16. In addition to the 128x128-bit single-port RAM, 327 LUT-FF pairs are needed for implementing the fetch stage modules (see Table 1). Note that the fetch stage of VecLIW exceeds the corresponding baseline scalar processor with the same size of instruction memory (128x128-bit) by only 104 LUT-FF pairs.
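As a rough VHDL sketch of the Program counter and Instruction_adder behaviour just described: PcUpdEn, PcUpdVal, and BrAddr are the paper's signal names, while the port list, reset, and coding style are our assumptions rather than the actual VecLIW RTL.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Sketch of the VecLIW Program counter: hold when PcUpdEn = '0',
    -- otherwise load either PC + 16 or the branch target from decode.
    entity program_counter is
      port (
        Clk      : in  std_logic;
        Rst      : in  std_logic;
        PcUpdEn  : in  std_logic;             -- '0' stalls the fetch stage
        PcUpdVal : in  std_logic;             -- '1' branch, '0' sequential
        BrAddr   : in  unsigned(31 downto 0); -- branch target from decode
        Pc       : out unsigned(31 downto 0));
    end entity program_counter;

    architecture rtl of program_counter is
      signal pc_r : unsigned(31 downto 0) := (others => '0');
      signal npc  : unsigned(31 downto 0);
    begin
      -- Instruction_adder: next sequential address is PC + 16 because each
      -- VLIW is 4x32 bits (16 bytes); the paper uses a carry-lookahead
      -- adder here, whereas this sketch lets synthesis infer the adder.
      npc <= pc_r + 16;
      Pc  <= pc_r;

      process (Clk)
      begin
        if rising_edge(Clk) then
          if Rst = '1' then
            pc_r <= (others => '0');
          elsif PcUpdEn = '1' then            -- hazard unit can hold the PC
            if PcUpdVal = '1' then
              pc_r <= BrAddr;                 -- taken branch
            else
              pc_r <= npc;                    -- sequential fetch
            end if;
          end if;
        end if;
      end process;
    end architecture rtl;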

The second component (Decode Stage) has three modules: Control unit, Register file, and Hazard detection. Depending on the 4x1-bit IsUnSgn control signals, the 4x14-bit immediate values are signed-/unsigned-extended to 4x32-bit. The VecLIW register file has eight read ports and four write ports. In the decode stage, the fetched VLIW (4x32-bit instructions) is decoded and the register file is accessed to read the multi-scalar/vector elements (4x32-bit RsVal and 4x32-bit RtVal). The 2x4x32-bit outputs of the VecLIW registers are fed to the execute stage through the ID/EX pipeline register. Note that decoding is done in parallel with reading registers, which is possible because these fields are at a fixed location in the VecLIW instruction formats. Since the immediate portions of the VLIW are located in an identical place in the I-format of VecLIW instructions, the extension of the immediate values is also calculated during this cycle, just in case it is needed in the next cycle.
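The following is a minimal sketch of the signed/unsigned extension of one 14-bit immediate to 32 bits, controlled by IsUnSgn as described above. VecLIW replicates this logic four times, once per operation slot; the entity and port names are assumed for illustration.

    library ieee;
    use ieee.std_logic_1164.all;

    -- Extend one 14-bit immediate to 32 bits: zero-extend for unsigned
    -- instructions (IsUnSgn = '1'), sign-extend otherwise.
    entity imm_extend is
      port (
        IsUnSgn : in  std_logic;
        ImmIn   : in  std_logic_vector(13 downto 0);
        ImmOut  : out std_logic_vector(31 downto 0));
    end entity imm_extend;

    architecture rtl of imm_extend is
      signal ext : std_logic_vector(17 downto 0); -- the upper 18 bits
    begin
      -- Replicate '0' for zero extension or the sign bit ImmIn(13)
      -- for sign extension, then concatenate with the raw immediate.
      ext    <= (others => '0') when IsUnSgn = '1' else (others => ImmIn(13));
      ImmOut <= ext & ImmIn;
    end architecture rtl;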
The control unit of VecLIW is based on microprogramming, where the control variables are stored in a ROM (control memory). Each word in the control memory specifies the following control signals: IsUnSgn (0/1 for signed/unsigned instruction), RdRtDes (0/1 if the destination is RT/RD), IsImmIns (0/1 for immediate/non-immediate instruction), RdEn (1 for load instructions), WrEn (1 for store instructions), Wr2Reg (1 when the instruction writes its result in a register), and AluOpr (5-bit ALU operation).

The Hazard detection module in the decode stage can stall the fetch stage by disallowing the PC to be updated (PcUpdEn = '0'). Detecting interlocks early in the pipeline reduces the hardware complexity, because the hardware never has to suspend an instruction that has updated the state of the processor, unless the entire processor is stalled [2]. In VecLIW, the Hazard detection module interlocks the pipeline for a RAW (read after write) hazard with the source coming from a load instruction. If there is a RAW hazard with the source instruction being a load, the load instruction will be in the execute stage (RdEn = '1') when an instruction that needs the loaded data is in the decode stage. Besides the load interlock, the fetch stage should be stalled during the execution of vector instructions: a vector instruction processing v elements stalls the fetch stage for ⌈v/4⌉-1 clock cycles. For example, an 8-element vector instruction stalls fetch for ⌈8/4⌉-1 = 1 cycle; processing the second four elements results in activating PcUpdEn, and PcUpdVal is given the value '1' to increment the current PC by 16.

As shown in Table 1, 10,720 LUT-FF pairs are required to implement the Decode Stage component on the FPGA. The complexity of the VecLIW decode stage is 413% (10,720/2,597) higher than the corresponding scalar decode stage because the size of the VecLIW register file (8 read and 4 write ports) is about four times the size of the scalar register file (2 read and 1 write port). Moreover, the decode unit of VecLIW decodes four individual instructions instead of one in the case of the baseline scalar processor. Besides, the hazard detection unit of VecLIW checks the RAW hazard, with the source instruction being a load, against the four individual instructions in the VLIW instead of one scalar instruction.
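A compact sketch of that load-interlock check follows: stall the fetch stage (PcUpdEn = '0') when a load in the execute stage writes a register that any of the four decode-stage operations reads, or while a vector instruction is still issuing. The ports, 6-bit specifier width (for the 64-entry register file), and the VecBusy signal are our assumptions, not the paper's interface.

    library ieee;
    use ieee.std_logic_1164.all;

    -- Hazard detection sketch: RAW-after-load interlock plus vector stall.
    entity hazard_detect is
      port (
        ExRdEn    : in  std_logic;                        -- load in execute stage
        ExDestReg : in  std_logic_vector(5 downto 0);     -- load destination
        IdRs      : in  std_logic_vector(4*6-1 downto 0); -- four Rs specifiers
        IdRt      : in  std_logic_vector(4*6-1 downto 0); -- four Rt specifiers
        VecBusy   : in  std_logic;                        -- vector instruction still issuing
        PcUpdEn   : out std_logic);                       -- '0' freezes the fetch stage
    end entity hazard_detect;

    architecture rtl of hazard_detect is
      signal raw : std_logic;
    begin
      process (ExRdEn, ExDestReg, IdRs, IdRt)
        variable hit : std_logic;
      begin
        hit := '0';
        -- Compare the load destination against all four source pairs in the
        -- VLIW, instead of the single pair checked in a scalar pipeline.
        for i in 0 to 3 loop
          if IdRs(6*i+5 downto 6*i) = ExDestReg or
             IdRt(6*i+5 downto 6*i) = ExDestReg then
            hit := '1';
          end if;
        end loop;
        raw <= ExRdEn and hit;
      end process;

      PcUpdEn <= '0' when (raw = '1' or VecBusy = '1') else '1';
    end architecture rtl;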

Fig.5. Top-level RTL schematic diagram of the proposed VecLIW processor.

TABLE I: FPGA Statistics for VecLIW Stages

The execute stage of VecLIW has two modules: ALUs and Forward unit. The ALUs operate on four pairs of operands prepared in the decode stage. The ALUs perform operations depending on the VLIW opcodes (4x5-bit AluOpr) and functions (4x8-bit Function). For load/store operations, the first ALU adds the operands (RsVal1 and ImmVal1) to form the effective address, while the remaining ALUs are idle. For register-register operations, ALUi performs the operation specified by the 8-bit Functioni control signals on the values of 32-bit RsVali and 32-bit RtVali, where 1 ≤ i ≤ 4. For register-immediate operations, ALUi performs the operation specified by the 5-bit AluOpri control signals on the values of 32-bit RsVali and 32-bit ImmVali. In all cases, the results of the ALUs (4x32-bit AluOut) are placed in the EX/MEM pipeline register.

Instead of complicating the decode stage, the forward unit of VecLIW is placed in the execute stage. The key observation needed to implement the forwarding logic is that the pipeline registers contain both the data to be forwarded as well as the source and destination register fields. In the VecLIW processor, all forwarding logically happens from the ALU or data memory outputs to the ALU inputs and to the data to be written into memory (WrData). Thus, the forwarding can be implemented by comparing the destination registers of the instructions contained in the EX/MEM or MEM/WB stages against the source registers of the instructions contained in the ID/EX registers. In addition to the comparators required to determine when a forwarding path needs to be enabled, the multiplexers at the ALU inputs must be enlarged. The execute stage has about four times higher complexity than in the scalar processor: the number of LUT-FF pairs needed for the VecLIW execute stage is 3,850, where 3,114, 215, and 521 are for unused flip-flops, unused LUTs, and fully used LUT-FF pairs, respectively (see Table 1).
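The sketch below shows one forwarding multiplexer for a single ALU input, following the comparison scheme just described; VecLIW would replicate it for the eight ALU inputs of its four slots. The signal names and the two-level priority (younger EX/MEM result wins over older MEM/WB result) are assumptions in the style of a classic five-stage pipeline, not the paper's exact RTL.

    library ieee;
    use ieee.std_logic_1164.all;

    -- One forwarding path: pick the newest pending value of a source register.
    entity forward_one is
      port (
        IdExSrc     : in  std_logic_vector(5 downto 0);  -- source reg in execute
        ExMemDest   : in  std_logic_vector(5 downto 0);  -- dest reg one stage ahead
        MemWbDest   : in  std_logic_vector(5 downto 0);  -- dest reg two stages ahead
        ExMemWr2Reg : in  std_logic;                     -- EX/MEM writes a register
        MemWbWr2Reg : in  std_logic;                     -- MEM/WB writes a register
        RegVal      : in  std_logic_vector(31 downto 0); -- value read in decode
        ExMemVal    : in  std_logic_vector(31 downto 0); -- ALU result to forward
        MemWbVal    : in  std_logic_vector(31 downto 0); -- memory/WB result to forward
        AluIn       : out std_logic_vector(31 downto 0));
    end entity forward_one;

    architecture rtl of forward_one is
    begin
      -- The younger result (EX/MEM) has priority over the older one (MEM/WB);
      -- if neither pipeline register matches, use the register-file value.
      AluIn <= ExMemVal when (ExMemWr2Reg = '1' and ExMemDest = IdExSrc) else
               MemWbVal when (MemWbWr2Reg = '1' and MemWbDest = IdExSrc) else
               RegVal;
    end architecture rtl;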

The fourth stage is the Memory_Stage, which has one component called Data_memory. Not all instructions need to access the memory stage. If the instruction is a load, 128-bit data returns from memory and is placed in the MEM/WB pipeline register; if it is a store, then the data from the EX/MEM pipeline register (128-bit WrData) is written into memory. In either case, the address used (AluOut) is the one computed during the prior cycle and stored in the EX/MEM pipeline register. In addition to the 128x128-bit single-port RAM, Table 1 shows that 386 LUT-FF pairs are needed for the memory access stage of the VecLIW, which has 289 LUT-FF pairs over the baseline scalar stage with the same size of data memory.

The last stage is the Writeback_Stage, which writes the scalar/vector results into the VecLIW register file. It has multiplexers to select either the results coming from the memory (128-bit MemOut) or the results coming from the ALUs (4x32-bit AluOut). Depending on the opcodes, the register destination field is specified by RD or RT (RdRtDes). Only 156 LUT-FF pairs are needed for the VecLIW write back stage (see Table 1), which has 124 LUT-FF pairs over the corresponding write back stage of the baseline scalar processor. The complexity of the VecLIW write back stage is 488% (156/32) higher than the corresponding scalar write back stage because VecLIW writes back four elements per clock cycle instead of one.

Table 2 summarizes the complexity of the VecLIW pipeline shown in Fig.6. The total number of LUT-FF pairs used is 17,425, where the number with an unused flip-flop is 13,433, the number with an unused LUT is 2,599, and the number of fully used LUT-FF pairs is 1,393. The complexity of the four-issue VecLIW is 443% higher than the single-issue scalar processor.
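As a minimal sketch of the Writeback_Stage selection just described: per VLIW, choose either the loaded data (MemOut) or the ALU results (AluOut) as the 4x32-bit value driven to the register file's four write ports. The MemToReg select name and the single shared select are our assumptions for illustration.

    library ieee;
    use ieee.std_logic_1164.all;

    -- Write-back multiplexer: loads take their data from memory, all other
    -- register-writing instructions take it from the ALUs.
    entity writeback_mux is
      port (
        MemToReg : in  std_logic;                      -- '1' for load results
        MemOut   : in  std_logic_vector(127 downto 0); -- 4x32-bit loaded data
        AluOut   : in  std_logic_vector(127 downto 0); -- 4x32-bit ALU results
        WbData   : out std_logic_vector(127 downto 0)); -- to the 4 write ports
    end entity writeback_mux;

    architecture rtl of writeback_mux is
    begin
      WbData <= MemOut when MemToReg = '1' else AluOut;
    end architecture rtl;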

Fig.6. VecLIW pipeline

TABLE II: FPGA Statistics for VecLIW Processor

V. CONCLUSIONS AND FUTURE WORK
This paper proposes a new processor architecture called VecLIW for accelerating data-parallel applications. VecLIW executes multi-scalar and vector instructions on the same parallel execution data path. VecLIW has a modified five-stage pipeline for (1) fetching a 128-bit VLIW instruction (four individual instructions), (2) decoding/reading operands of the four instructions packed in the VLIW, (3) executing four operations on parallel execution units, (4) loading/storing 128-bit (4x32-bit scalar/vector) data from/to data memory, and (5) writing back 4x32-bit scalar/vector results. Moreover, this paper presents the FPGA implementation of our proposed VecLIW using VHDL targeting the Xilinx FPGA Virtex-5, XC5VLX110T-3FF1136 device. It requires 17,425 LUT-FF pairs, where the number with an unused flip-flop is 13,433, the number with an unused LUT is 2,599, and the number of fully used LUT-FF pairs is 1,393. The complexity of the four-issue VecLIW is 443% higher than the single-issue scalar processor. In the future, the performance of our proposed VecLIW will be evaluated on scientific and multimedia kernels/applications.

VI. REFERENCES
[1] M. I. Soliman, "A VLIW Architecture for Executing Multi-Scalar/Vector Instructions on Unified Datapath," vol. 03, no. 6, pp. 432-439, December 2013.
[2] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Morgan Kaufmann, September 2011.
[3] M. Johnson, Superscalar Microprocessor Design, Prentice Hall (Prentice Hall Series in Innovative Technology), 1991.
[4] J. Smith and G. Sohi, "The microarchitecture of superscalar processors," Proceedings of the IEEE, vol. 83, no. 12, pp. 1609-1624, December 1995.
[5] J. Fisher, "Very long instruction word architectures and the ELI-512," Proc. 10th International Symposium on Computer Architecture, Stockholm, Sweden, pp. 140-150, June 1983.
[6] J. Fisher, P. Faraboschi, and C. Young, Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools, Morgan Kaufmann, 2004.
[7] Philips, Inc., An Introduction to Very-Long Instruction Word (VLIW) Computer Architecture, Philips Semiconductors, 1997.
[8] R. Espasa, M. Valero, and J. Smith, "Vector architectures: past, present and future," Proc. 12th International Conference on Supercomputing, Melbourne, Australia, pp. 425-432, July 1998.
[9] F. Quintana, R. Espasa, and M. Valero, "An ISA comparison between superscalar and vector processors," in VECPAR, vol. 1573, Springer-Verlag, pp. 548-560, 1998.
[10] J. Smith, "The best way to achieve vector-like performance?" keynote speech, 21st International Symposium on Computer Architecture, Chicago, IL, April 1994.
[11] C. Kozyrakis and D. Patterson, "Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks," Proc. 35th International Symposium on Microarchitecture, Istanbul, Turkey, pp. 283-293, November 2002.
[12] K. Asanovic, Vector Microprocessors, Ph.D. thesis, Computer Science Division, University of California at Berkeley, 1998.
