PROTEUS JOURNAL ISSN/eISSN: 0889-6348

Survey on High Performance Reconfigurable Soft - Core for SIMD Applications

Maheswari. R Pattabiraman. V Kanagaraj Venusamy Associate Professor Professor Lecturer SCOPE, Vellore Institute of SCOPE, Vellore Institute of University of Technology and Technology, Chennai, Tamilnadu, Technology, Chennai, Tamilnadu, Applied Sciences - AlMusanaah, India. India. Muscat, Sultanate of Oman [email protected] [email protected] [email protected]

ABSTRACT—The prospective need of SIMD (Single requirement in most of the modern embedded systems. Instruction and Multiple Data) applications like video and The modern processor comes in a wide spectrum of image processing systems requires a processor with greater architecture ranging from hard-core to soft-core processor flexibility and computation to deliver high quality real time with CISC (Complex Instruction Set) and RISC (Reduced output. The main goal of work is to offer a wider survey over high performance processor through various reconfigurable Instruction Set) architecture. The general purpose hard- techniques targeting on real time SIMD dataset. The real- core processor like , Motorola 68000, etc., falls on its time multimedia streaming with extensive parallel data one side and reconfigurable FPGA based soft-core demands high computational processors with low power processor like MicroBlaze, SPARC, etc., falls on the other consumption. The modern processors with flexible computation supports on-the-fly system redesign with side of the spectrum. Each core contains various ranges of reconfiguration techniques at reduced cost. It also functional sub-blocks implemented either in FPGA (Field- accommodates new functionalities sustaining the increasing Programmable ) or in an ASIC (Application demand for processor with less NREs (Non-Recurring Specific Integrated Circuits). The major advantage of Engineering) cost and shorter time-to-market. The processors synthesizing the processor in FPGA rather than ASIC is with flexible platform permit the designer to incorporate the demanding changes in the existing standards of any that it provides re-configurability when altering the design application like wireless , telecommunication with no significant cost [1]. etc. Adaptive computing is the new processor paradigm evolved in early 90s to bridge the space which exists between A.MOTIVATION the generic processor and application specific processor. It In this embedded era, 90% of applications (mobile phones, also supports the inclusion of specific hardware accelerator in the existing RISC (Reduced Instruction Set ) based video surveillance systems, medical equipment like architecture to improve the performance of particular medical imaging systems, patient monitoring systems) application. In the last decade, processor based embedded target high performance computing at low power systems like SoC (System–on–Chip) have become more consumption. To accomplish this multimedia data prevalent due to their high performance at low power. Many embedded application designers like mobile and smart phone processing, there is a huge demand for high performance developers namely Philips N-Experia, Intel PXA etc., and low power processor. The two major solutions to explored SoC as computational devices. Thus this work improve the performance of the processor are through summarizes a literature survey of all the related works such hardware and . Hardware solutions always ensure as RC () , HPRC (High high performance and low power than software solutions, Performance Reconfigurable Computing), FPGA based processors, open source soft-core processor, reconfigurable but with very less flexibility to reconfigurability. Here architecture, and SIMD. The detailed comes the need for processor architecture, literature survey analyzing the impact of current techniques targeting processor performance is illustrated. Which has to balance between the power as well as performance of multimedia dataset. There exists a KEYWORDS—Reconfigurable Computing, High Performance significant performance gap between commercial SCP like Computing, FPGA, Soft-core Processor MicroBlaze (), Nios () and open source SCP

like OR1200, SPARC. As an example MicroBlaze and I . INTRODUCTION Nios have 31-37% lower run times than OR1200 (Gartner Modern embedded applications are ironic in Dataquest). To fill this gap, there is a need for multimedia data involving audio and video processing. reconfigurable high performance soft-core processor This real time streaming with extensive parallel data architecture. To support DLP and ILP in multimedia demands for high computational devices with low power applications, many techniques such as SIMD, SMP consumption. These high performance processors work as (Symmetric Multi-Processor), VLIW (Very Long the driving tool exploring the resource and design relevant 1 VOLUME 11 ISSUE 9 2020 http://www.proteusresearch.org/ Page No: 713 PROTEUS JOURNAL ISSN/eISSN: 0889-6348

Instruction Word), super-scalar instruction sets are being utilized in the processor. Out of these techniques, this

work focused on the wider study on SIMD instruction set along with super-scalar architecture as it provides greater flexibility in the reconfigurable architecture with improved

performance.

B.Processor

Processor architecture works as fundamental Fig.1. when exploring the resource and design relevant Impact of embedded systems in various application domains requirement in most of the modern embedded systems. As a result, most research investigations aim to achieve good D.FPGA improvement in processor performance at reasonably low An FPGA is a device with large collection of power dissipation. The hard-core processors provide high programmable logic cells interconnected through a performance for real-time embedded applications, but they network. The high density interconnections across the are characterized by non-reconfigurability feature. Fig.1 logic cells provides higher flexibility to implement any represents the trade-off between flexibility and computational model at low power consumption [2]. Since performance of processors based on its implementation. It these logic cells are independent blocks, exploiting is implicit that General Purpose Processor (GPP) proves to temporal and spatial parallelism is very efficient, i.e. it provide more flexibility adapting itself to configure in allows many processing elements to execute concurrently. many fields, but with degradation in performance. An They are typically described by register transfer level Application Specific Processor gives better performance (RTL) abstractions using an HDL (Hardware Description than GPP with optimized instruction set. Whereas ASIC Language). The building blocks of any FPGA are CLB gives higher performance than other implementation, (Configurable ), input and output blocks and altering its feature after fabrication is not feasible one. interconnection network. Through direct interconnect, all Thus reconfigurable processor lies in between the GPP and the adjacent CLBs are interconnected whereas through ASIC provides greater flexibility with better performance. programmable switching matrix (PSM), the horizontal and Almost all digital circuits are classically partitioned into vertical CLBs are intercomnected. various functional modules called as cores. Each core

contains various ranges of functional sub-blocks implemented either in FPGA or in an ASIC.

C .EMBEDDED SYSTEMS

Embedded system is a computational system with dedicated function to perform specific job. It integrates three components into functional units such as: • Hardware specific for the Application• Embedded software • RTOS (Real Time Operating Systems). Most of the functionality

of embedded systems will be implemented using software. Unlike GPP (General Purpose Processor), embedded systems demand variety of hardware design to quench the Figure 2: FPGA functional view need of various applications. The major challenge in The internal architecture of FPGA is shown in Fig. 2. Each designing an is that it needs to work and slice of FPGA contains flip-, and sustain in a diverse environment to handle various types of collections of digital logic gates called LUTs (Look-Up tasks and applications. Thus, the embedded designer needs Table). The LUTs are alsocalled functional generator of to address various issues such as high-computation, low FPGA. These FPGAs are reprogrammed dynamically power, reliability, flexibility, predictability and memory during compile time or execution time using virtual management. The occupancy of embedded systems in hardware. The CLB functional view of FPGA is shown in various fields is given in Fig 1. Fig.3. They are typically described by register transfer level (RTL) abstractions using HDL.

2 VOLUME 11 ISSUE 9 2020 http://www.proteusresearch.org/ Page No: 714 PROTEUS JOURNAL ISSN/eISSN: 0889-6348

Fig.3. CLB functional view The FPGA 7 allows the user to program the hardware even electronics, video and image processing, etc. were drifting during post-fabrication. In reality, FPGA exploits dynamic to the FPGA. reconfigurable fabric design methods to improve Fig.4.FPGA with Embedded Processor and without Embedded Processor performance of the targeted architecture (Wang et al. Survey (2009)). The major benefit of synthesizing the design in D.Embedded Processors FPGA is that it provides more flexibility when altering the The high performance processors work as a driving tool design with no significant cost. Redesigning the system for exploring the resources and design relevant using ASIC leads to heavy overhead, especially when it is ready for production. Major advantages of an FPGA with requirements in most of the modern embedded systems [3]. These modern processors are represented in a wide an embedded processor are as follows: • It merges spectrum of architecture ranging from hard-core to soft- processor and I/O functions onto a single board. • It core processor with CISC (Complex Instruction Set supports flexible design pattern for the application Computer) and RISC architecture [4]. The general purpose requirement with optimized power. • They mostly come in hard-core processors like Intel and Motorola 68000 fall on tightly coupled high speed logic and control system interfaced on a single chip. FPGA supports one side and reconfigurable FPGA (Field-Programmable Gate Array) based soft-core processors like MicroBlaze hardware/software co-design such that the and and SPARC (Scalable Processor Architecture) fall on other logics are programmed through software and implemented in hardware architecture (Moreno et al. (2010)). It side of the spectrum. Each soft-core contains various ranges of functional sub-blocks implemented either in efficiently exploits temporal and spatial parallelism which FPGA or in an ASIC (Application Specific Integrated allows logic cells to integrate many PEs executing concurrently. Many computationally intensive techniques Circuits). The major advantage of synthesizing the processor in FPGA rather than ASIC is that it provides re- such as high performance computing, video surveillance, configurability when altering the design with no medical electronics, video and image processing are drifting towards the FPGA. The occupancy of FPGA with significant cost [5]. embedded processor is prevalent from last decade as The embedded processors are realized in three different shown in Figure 1.5. The FPGA allows the user to ways such as: • Hardware processor • Software- program the hardware even during post fabrication. In programmed processor • Reconfigurable processor reality, FPGA exploits dynamic reconfigurable fabric a. Hardware processor: design methods to improve performance of the targeted architecture. The major benefit of synthesizing the design Processor realization using digital circuits for any specific in FPGA is that it provides more flexibility when altering application is called as hardware processor, also known as the design with no significant cost. Whereas redesigning Application Specific Integrated Circuits. Advantages: It the system using ASIC leads to heavy overhead, especially provides very high efficiency and performance. when it is ready for production. The occupancy of an Disadvantages: The functional units are not flexible FPGA with embedded processor is prevalent from last (cannot be altered after fabrication) and they are expensive decade as shown in Fig.4 depicting the survey of FPGA [6]. with Embedded Processor and without Embedded b. Software-programmed processor: Processor. Major advantages of an FPGA with embedded processor are as follows:It merges processor and I/O Processors designed through coding are often referred as functions onto a single board.It supports flexible design Software- programmed processors. Advantages: The pattern for the application requirement with optimized functional units are very flexible to change and they are power.They most comes in tightly coupled high speed inexpensive. Disadvantages: Low performance and works logic and control system interface on a single chip Works with fixed instruction set by hardware [7]. tradeoff between hardware and software codesign.Many d.Reconfigurable processor: computational intensive techniques such as high Reconfigurable processors fill the gap which exists performance computing, video surveillance, medical between hardware processors and software processors

3 VOLUME 11 ISSUE 9 2020 http://www.proteusresearch.org/ Page No: 715 PROTEUS JOURNAL ISSN/eISSN: 0889-6348

implementation. The major advantages of reconfigurable They are offered by Altera, Tensilica, Xilinx. e.g. Nios, processor is that it provides higher level of flexibility than MicroBlaze, Xtensa. Table 1.1 gives the comparative hardware processor and much higher performance than features of various commercial and open source SCP [11]. software programmed processor. The disadvantages of the Table 1.1: Commercial versus open source SCP Soft-core reconfigurable processor is that the programmable routing Commercial Open Source NiosMicroBlazeXtensa XL time across the logic blocks and interconnections of an OR1200 Leon Speed MHz 200 MHz 200 MHz 350 MHz FPGA is high. Since 1970s, the design field ofprocessor 300 MHz 125 MHz ASIC/ FPGA / FPGA IC Tech.

architecture has started its development. In late 1990s, the Table 1: Commercial vs Open Source Soft-core Processor emergence of reconfigurable logic in digital logic has evolved. In the 21st century, the introduction of open source and its architecture gains the market [8]. Due to the technological advancement of the semiconductor industry, high performance processor has been designed for real time embedded applications ranging from personal computing to telecommunication systems [9].

E.Intellectual Property Core Any logic block which is reusable with pre-designed components is called Intellectual Property (IP) core. The three forms of IP cores available in markets are: 1. Soft-Core: which are implemented using programmable logic, e.g. OR1200 (OpenRISC1200). 2. Firm-Core: Microprocessors which are implemented using gate-level net-list, e.g. Huawei (N/W routing). 3. Hard-Core: Microprocessors which are implemented in silicon using component layout, e.g. MIPS32@4Kc. d. processor trade-off a.Hard-Core Processors Processor architecture works as fundamental component Hard-Core processors provide high performance for the when exploring the resource and design relevant real-time embedded applications, but without requirement in most of the modern embedded systems. As reconfigurability feature. As they are optimized, it always a result, most research investigations aim to achieve good ensures very high processing speed ranging from 100 MHz improvement in processor performance at reasonably low to more than 1 GHz. The de-merit of hard-core processor power dissipation. The hard-core processors provide high is that functional units are fixed and cannot be modified performance for real-time embedded applications, but they after fabrication e.g. Intel Atom Processor, Altera Cyclone are characterized by non-reconfigurability feature. Figure 2 V SoC. represents the trade-off between flexibility and b.Soft-Core Processors performance of processors based on its implementation. It is implicit that GPP provides more flexibility adapting A soft-core is a traditional processor architecture itself to configure in many fields, but with degradation in fabricated in FPGA, which allows architect to mount any performance [12]. An application specific processor gives existing software integrated with reconfigurable hardware. better performance than GPP with optimized instruction The soft-core with FPGA based SoC is easier to set, whereas ASIC gives higher performance than other understand which provides higher level of abstractions. implementation, but altering its feature after fabrication is These cores are platform independent and they are more not a feasible one. The reconfigurable processor lies in flexible allowing user to edit the source code. In short, between the GPP and ASIC and provides greater flexibility they are easily modifiable and tuned to specific with better performance. Almost all digital circuits are requirements [10]. These cores can be realized as multiple classically partitioned into various functional modules cores with more features and custom instructions. The calledcores [13]. Each core contains various ranges of embedded SCP (Soft-Core Processors) comes in two forms functional sub-blocks implemented either in FPGA or in such as: • Open source cores • Commercial cores Open an ASIC. source cores: They are licensed under LGPL (Lesser GNU Public License) and GPL (GNU Public License). e.g. OR1200, Leon, Open Fire.

c.Commercial cores

4 VOLUME 11 ISSUE 9 2020 http://www.proteusresearch.org/ Page No: 716 PROTEUS JOURNAL ISSN/eISSN: 0889-6348

is also known as 9 dynamic reconfiguration [18]. The two types of regions available during reconfigurations are: • Static region: It keeps on operating its own functionality.

Reconfigurable region: It can be reconfigured with a new Fig.5: Flexibility and Performance Trade-off module. Fig.6 show the static and reconfigurable regions Fig.5 represents the trade-off between flexibility and of FPGA. performance of processors based on its implementation. It is implicit that GPP provides more flexibility adapting itself to configure in many fields, but with degradation in performance. An application specific processor gives better performance than GPP with optimized instruction set, whereas ASIC gives higher performance than other implementation, but altering its feature after fabrication is not a feasible one [14]. The reconfigurable processor lies in between the GPP and ASIC and provides greater flexibility with better performance. Almost all digital circuits are classically partitioned into various functional modules called cores. Each core contains various ranges of Fig.6.Static and reconfigurable regions of FPGA functional sub-blocks implemented either in FPGA or in b. module based partial reconfiguration an ASIC [15]. This technique allows the modular portion of the FPGA F.RECONFIGURABLE COMPUTING device to undergo reconfiguration. The main The architecture of Reconfigurable Computing (RC) characteristics of module based partial reconfigurations allows to adapt / modify its structure based on the are: It is a systematic design usually used for larger requirement of specific application/ domain after modulesIt ensures communication between the modules fabrication. It allows configuration at different levels like through tri-state buffers available in the units. It has a high architecture level, its planlevel, flow of tool, CAD, reconfiguration cost.It supports area constrained placement alteration in languages and modification at and routing . The main drawback of this module level. The computing with reconfigurable logic like based PR is that it does not provide cent-percent hardware FPGAs are configured to implement arbitrary logic and support to guarantee successful reconfiguration. Due to used for embedded systems and high-performance this, the latest devices do not support module based PR computing. The primary goal of any reconfigurable c. difference based partial reconfiguration computing is to provide concurrency/parallelism using an existing algorithm technique. The reconfigurable This type of reconfiguration technique is mostly computing also ensures parallelism through hardware [16]. appropriate for implementing very small design, which In reconfigurable computing, the programmable bit-file is support on the fly reconfiguration. It basically compares downloaded into FPGA which is made up of an array of the circuit implementation of two different designs and PLBs (Programmable Logic Block). These PLBs are then creates partial bit-stream only for those designs which interconnected by the programmable interconnect routing are different. It can be implemented in two different ways resources. The reconfigurable device can be used several like front-end difference based partial reconfiguration and times to design and implement different versions of the back-end difference based partial reconfiguration. Some of end product until an error-free state is reached in the the applications, which use this difference based partial design of systems. The system level architecture of reconfigurations are: • LUT based logic equations for reconfigurable computing falls under five classes: • Stand- changing the parameters of an equation • Parameter alone external processing unit • External interface alteration in filter applications • Memory blocks, which processing unit • Co-processor unit • Reconfigurable requires dynamic data movement The major limitation of functional unit • Processor embedded in reconfigurable difference based partial reconfiguration is that it is very fabric The reconfigurable computing allows the designer inefficient for larger designs [19]. to customize any functional unit, RF, interconnects, d. early access partial reconfiguration number of PEs (Processing Element) and memory elements. It also permits to add extra registers, specialized This early access partial reconfiguration (EARP) approach instructions and new functions [17]. is similar to module based PR but allows partial reconfigurable region of any rectangular size. macros a.partial reconfiguration are used to partition the pin during reconfiguration. In Partial Reconfiguration (PR) a portion of an FPGA is Depending upon the clocking of reconfiguration, these bus reconfigured when the remainder section is in macros are categorized into synchronous bus macro and operationdeprived of any loss of data. The real advantage asynchronous bus macro. In this method, the is visualized when PR is done during processor runtime. It communication between the two reconfigurable regions

5 VOLUME 11 ISSUE 9 2020 http://www.proteusresearch.org/ Page No: 717 PROTEUS JOURNAL ISSN/eISSN: 0889-6348

such as static and partial regions is carried out through

LUT based bus macros [20]. Fig. 7. SIMD operation in gray scale conversion G. PARALLELISM Image processing applications greatly benefited with The major categories of parallelism are i) data level SIMD architecture as it has a single instruction operating parallelism (DLP) ii) instruction level parallelism (ILP) repetitively over whole data streams. and iii) task level parallelism (TLP). Parallelism in DLP b.MISD involves the process of distributing the same dataset into various processing elements to execute the same The MISD paradigm asserts higher parallelism than SISD, instruction simultaneously. In ILP, the various instructions executing multiple instructions over single data using more are executed simultaneously in different functional units to than one processing unit. It always ensures data reduce the over-all execution time of the processor. TLP synchronization across the processing unit. ensures the division of the entire problem into different implementation is based on MISD. MIMD: This tasks and then spreads into various processing elements to architecture contains multiple PEs to process multiple achieve parallelism. To gain higher 11 performance, instructions over multiple dataset simultaneously compiler will implicitly exploit parallelism for ILP without exploiting task level parallelism. The major challenge in intervention of the designer [21]. However, for DLP and addressing MIMD is the way of accessing updated data TLP, designer needs to crucially investigate the either through shared memory or distributed memory exploitation of hardware level parallelism to achieve systems [24]. This results in an increased requirement of required performance. hardware component with an efficient message passing system to handle multiple data [25]. a. flynn taxonomy c. multi-cores Based on the nature of the data streams and instruction sets, Flynn’s taxonomy has four classifications of According to Moore’s law, rapid advancements in parallelism such as: • Single Instruction Single Data technology allows designers and (SISD) • Single Instruction Multiple Data (SIMD) • researchers to exploit reconfigurability and parallelism on Multiple Instruction Single Data (MISD) • Multiple a single . Intel incorporated multi- concepts on a Instruction Multiple Data (MIMD) SISD: The SISD is the single die through simultaneous multi-threading (SMT) conventional modeling of computing wherein a processing and hyper threading (HT) to execute more than one thread unit executes single instruction on single data. It supports simultaneously [26]. In 2005, Intel introduced dual core sequential execution of instructions over strong processor on a single die as true multi-core processor (Intel dependency dataset. The two major constraints of SISD technical report). In recent days, most of the computational are processor speed and memory speed [23]. SIMD: The systems have dual or quad core on a single die. Another SIMD architecture holds a special which parallel processing technique such as SMP is used to transmits the same instruction to all the processing realize multi-core in architecture. SMP helps to scale elements to execute a single instruction over various data the number of cores in x86 processor providing full access streams. For example, to convert the color image to gray to I/O devices through shared memory. These SMP based scale image, multiple values of RGB (Red Green Blue) as processors are widely used in many functional systems shown in Figure 7 performs same computation over single supported by vendors like Intel, Cyrix, VIA, and AMD. instruction as given in equation (1). The of The overall performance of the system could be increased these data streams are accomplished through NUMA through multi-core by multiplying the processing element (Non-) which takes very less in such a way that each core concurrently executes the time to retrieve the data from local memory. The CPU instructions. The significance of multi-core is that it () with SIMD ensures parallel achieves the speedup, proportional to the number of cores processing by allowing the processor to execute a single through data level parallelism and task level parallelism instruction with multiple dataset without need to fetch and [27]. These multi-cores come in two form such as decode the instruction more than once. homogeneous multi-core and heterogeneous multi-core. The greater part of multi-core CPU architectures are Y = R ∗ 0.29891 + G ∗ 0.58661 + B ∗ 0.11448(1) incorporated as homogeneous core also called as symmetric core in which all the processing elements of the CPU are identical. In heterogeneous multi-cores, each core is designed to execute various tasks of an application. Borkar in 2007[28] states that doubling the functionality of the core will increase its overall performance by 40%. It has been inferred that the processor with greater number of symmetric cores gives better performance rather than an equivalent processor holding as many number of heterogeneous cores

6 VOLUME 11 ISSUE 9 2020 http://www.proteusresearch.org/ Page No: 718 PROTEUS JOURNAL ISSN/eISSN: 0889-6348

H. ARCHITECTURES FOR MULTIMEDIA Molen polymorphic processor on reconfigurable hardware such as Xilinx Virtex FPGA. In 2007 Campi et al.[34] manufacturers have initiated and produced proposed DREAM which is an adaptive DSP (Digital tightly coupled instruction set architecture (ISA) extension Signal Processor) with pipelined reconfigurable with higher flexibility and performance for multimedia to increase the computational density of the targeted dataset [29]. SIMD and multi-core are the two popular device. Mucci et al. (2006) [35] proposedXiRisc solutions used for multimedia applications exploiting (extensible instruction set of RISC) reconfigurable parallelism to achieve high performance of a system. Most processor to increase the performance of a system with of the multimedia dataset support DLP rather ILP. The reduced energy compared to DSP and general purpose ISA extensions supporting SIMD with DLP in multimedia processors. The important milestones of reconfigurable applications are addressed either by hardware solutions or computing are shown in Fig.8. programmable solutions [30]. The various architectures, which support these solutions are homogeneous multi-core architecture and heterogeneous GPU () architecture. Real-time video capturing and processing play a vital role in consumer electronics, media, military, security, scientific, and industrial applications. This real-time video streaming provides extensive parallel data demanding high computation devices with lower Fig.8.Milestones of Reconfigurable Computing power consumption. The various multimedia applications of FPGA based processor are video surveillance, consumer B. Dynamic Reconfiguration electronic products such as mobiles, smart phones, and A favorable feature of an FPGA is the capability to camera. Video surveillance is an active technology to reconfigure the same component of the hardware circuit record the activities of people for security and safety for diverse tasks at various stages of an application during concerns. More resolution with high performance system execution. Furthermore, the functionalities can be is required towitness the stored real-time data for further exchanged on the coastalthoughportion of the system processing. The major research challenges which need to hardware remains in operation, this is termed as dynamic reconfiguration. Foldesy et al. in 2008 [36]presents a be addressed under hardware level across SIMD based universalstructure for determining the reconfiguration time multimedia applications are: • Demand of high from the system perception. They have conveyed a performance in real-time streaming • Low power speedup of 22.8 with the modified structure over the consumption and dissipation • Memory and I/O bandwidth physicaltechnique for gathering the results. overhead • Communication bandwidth overhead • Load

balancing OR1200 APPLICATION OpenRISC is one of the most developing open soft-core with large project and C. High Performance Reconfigurable Computing (HPRC) has strong open-core community support, which finds its HPRC is integration of high performance computing application in many embedded systems. In the year 2003, (HPC) with reconfigurable computing. It potentially Flextronics fabricated OR1200 as standalone ASIC. In permits the users to exploit parallelism to achieve recent days, Samsung used OpenRISC in manufacturing performance speedups through RC element. In HPC, wider their set-top box processor series like SDP-83 B, SDP- range of methods/metrics are available to improvise the 1003, SDP-1006 E series. Its fault-tolerant version is used performance of the systems such as power consumption, in many critical applications like Swedish space, AC processor speedup, memory management, network nodes, Microtec (defense), NXP (JN5148 ultra-low power ZigBee node configurations, processing element, FPGA count, transceiver chip) and NASAs TechEd Sat [31]. FPGA size and type [37]. The three major challenges to overwhelmedby HPRC are i) itsability of II. LITERATURE SURVEY programmingii)performance measures with respect to Memory and iii) its trading between the price vs This work summarizes a literature survey of all the related performance. The performance of the system is increased techniques such as RC, HPRC, and FPGA based through parallel simulations with dedicated hardware processors, open source soft-core processor, synchronization [38]. Recent research on HPRC is focused reconfigurable architecture, parallel computing and SIMD. on flexible FPGA based SoC supported by Altera, , A. Reconfigurable Computing Overview Xilinx allowing the end user to reconfigure the IP core. In order to ensure thesystem increased performance with low The design of fixed plus variable structure was proposed power and higher flexibility in designthe amalgamation of by Gerald Estrin in 1960 (Gerald and Turn (1963)). Due to general purpose CPU along with reconfigurable capability the evolution of microelectronics, in 1984 Xilinx released as like in FPGAs is necessary. In systems like this, new programmable blocks with interconnect logic called multicore processors always offergreat computation rates. FPGA. In early 90s, [32] Razdan et al. proposed In some HPRC design, FPGA assists as co-processor to the enhancement of FPGA with partial reconfiguration and CPU where the system entreats the suitable FPGA design adaptive instruction set in PRISC (Programmable to perform the target operation. In 2012 Radu et al.[39] Instruction Set Architecture). Using an embedded describe the application of CPU-intensive that attempt to PowerPC core, in 2004 Vassiliadis et al. [33] implemented 7 VOLUME 11 ISSUE 9 2020 http://www.proteusresearch.org/ Page No: 719 PROTEUS JOURNAL ISSN/eISSN: 0889-6348

completelyexploit all accessible instances, such that the used in smaller applications [47][48]. Chuan et al. in 2011 scientific work flow application implemented on top of the [49] used PicoBlaze soft-core to optimize the generic framework and they executed GAE (Google App chronological order of execution through dynamic Engine) with improved performance. scheduling algorithm such as early deadline first (EDF) which is very efficient by managing dynamic D.FPGA Based Processors reconfigurable modular task scheduling. Merging hardware and software design components to F. Reconfigurable Architecture work as a single embedded system is challenging. The advancements in FPGA technology allow the designer to In reconfigurable architecture, the speed of the processor create Programmable System-on–Chip (PSoC). It is an i.e both static and dynamic times are accelerated by integrated environment interconnecting the logic blocks of executing the application in the critical part of the FPGA with IP cores, DSP unit, microprocessor units, and hardware. Wang et al. (2009) proposed HERA memory units along with peripheral units. In 2007 Martin (HEterogeneous Reconfigurable Architecture) a et al. [40] proved that computational density in FPGA reconfigurable FPGA system with many PEs in two based processor improves faster for reconfiguration dimensional mesh supporting SIMD and MIMD computing than in a microprocessor. FPGA hardware computational models concurrently. In HERA architecture, always implements the required computation in dedicated the dataflow control is limited as it majorly focused on functional unit replicating across the chip. Thus, the study data-intensive computation. Many researches have been identifies that the FPGA based reconfigurable system conducted on reconfiguration and concluded that gains higher speed than microprocessor due to three prime reconfigurable computing in FPGA is used to hasten many factors like processor intensity, low clock latency and embedded applications [50]. Most of the image processing spatial parallelism. The designer needs to exploit these machines exploit reconfiguration due to its parallelism factors while designing an FPGA reconfigurable system feature. Horta et al. in 2012 [51] incorporate dynamic for the application. reconfiguration with plug-ins and implement on-chip system programmable with structured interconnections for Mattavelli et al. [41] achieved high performance dynamic reconfiguration. The performance of image and computing through FPGA based processor by accelerating video processing was evaluated by Ranganathan et al. HPC with FPGA. It potentially delivers enormous (2010) [52] using general purpose processor and ISA performance with thousand fold parallelism, especially for extension. They also made analysis with ISA extension in low precision computation. Greg et al. in 2011 [42] both in-order and out-of-order processors resulting in proposed an end-to-end FPGA tool flow, which significant reduction in overall execution time. The complements synthesis tools with three main extensions as mixture of multiple issue and out-of-order issue increases a coordination framework, intermediate fabrics and the overall performance by more than 3 times [53]. An performance analysis tools. expandable and high performance -vector E.Open Source Soft-Core Processors multiplication architecture is proposed by generating register- tree with complex structure to increase the As soft-core processors are very flexible towards the performance [54]. Somnath et al. (2011)[55] and requirement of applications, in last decade, FPGA based GwoGiun et al. (2012)[56] presented circuit withsoftware processor seems to be very prevalent [43]. This section co-design for achievingsubstantialenhancement in energy inspects the various open source soft-core processors like efficiency over traditional FPGA. SecretBlaze, LatticeMico32, LEON3, OpenFire competing commercial MicroBlaze. SecretBlaze: SecretBlaze is a 32- a. Parallel Computing bit RISC architecture with a five stage pipeline exploiting In parallel computing, it is essential to partition the instruction level parallelism. The implementation result of computational domains into sub-domains and map them SecretBlaze offers performance equivalent to MicroBlaze onto a set of PEs [57]. In data level parallelism, these sub- SCP, but this core does not intensely optimized for FPGA domains are mapped on to various PEs with very minimal devices [44]. LatticeMico32: LatticeMico32 SCP is a interactions, thereby maximizing the processor utilization Harvard based 32-bit architecture with better performance with minimal inter-processor communication [58]. widely used in networking applications. However, the Nobuaki et al.in 2011 [59] proposed CMA architecture architecture supports only the compatible which consists of a large PE array without memory peripheral interfaces [45]. LEON: LEON is SPARC V8 elements for mapping the applications data-flow. 32-bit, 7-stage pipeline architecture designed by Gaislerin Sudarshan et al. (2009) [60] proposed PARLGRAN 2010 [46]. It supports AMBA (Advanced (Parallel Granularity), a semi on-line scheduling Bus Architecture) an advanced high-performance bus and algorithmic methodologythrough data-parallelism to with on-chip communication. LEON exploit its performance. They support module chain is a structured and reliable architecture widely used in execution with Run Time Reconfiguration(RTR) military and space applications. As LEON uses windowing competence with PPC405 (PowerPC) Embedded register of SPARC, for each procedure call, all the local Processor. Due to hybrid characteristics of FPGA, these registers store their content into memory, which is very RTRs allow hardware reconfiguration during runtime of time consuming. OpenFire: OpenFire SCP is a 32-bit computation reducing the energy and size of the system RISC architecture which acts like a subset of MicroBlaze design [61]. of Xilinx. The performance of OpenFire is more comparable with MicroBlaze. Due to its longer latency G. Multi-Core in Commodity Systems cycle in hardware multiplier and smaller size, OpenFire is 8 VOLUME 11 ISSUE 9 2020 http://www.proteusresearch.org/ Page No: 720 PROTEUS JOURNAL ISSN/eISSN: 0889-6348

Multi-core architectures are prevalent in modern SIMD/MIMD architecture for vision application with high commodity systems, such as Tilera’s with 64 core [62] and performance computation at 7.2 W power dissipation. terascale processor in Intel with 80 cores [63]. In the last

decade, the occupancy of multi-core in the market is widespread, such as /B.E (Cell Broadband Engine) III.CONCLUSION from Sony which proposed eight RISC core in PowerPC By considering the significant features of HPRC, core. Nvidia offers sixteen to thirty PEs in single die. In 2008, Intel implemented Nehalem four-core processor in reconfigurable computing, the overall performance of the Intel-i7 architecture using HT. AMD Barcelona (2007) open source SCP, an FPGA based processor could be offers four-core processor with maximum performance of improved upon targeting SIMD applications. To work on 41.6 GFLOPS (Giga FLoating Operations per Second). extremely long data streams of SIMD, the reconfigurable Chrysos in 2012[64] proposed Xeon Phi a multi-core computing for a soft-core processor could be incorporated architecture in Intel MIC (Many Integrated Circuit) for better performance. The state-of-art analyzing the incorporating 61 cores of simple x86 architecture with impact of various techniques targeting processor 512-bit SIMD unit. IBM supports POWER7 chip with 4, 6, performance improvement and its analysis has been or 8 cores with SMT technique delivering 264.96 GFLOPS illustrated. Though there was a promising performance at 3.0 GHz. Recent FPGAs supports multi-core in SoC, improvement, this study identified that the performance such as the Xilinx series (Zynq) which integrates embedded ARM processor with traditional processor achieved so far meeting the increasing demand of the through logic blocks and interconnects [65]. HPC is the dynamic SIMD execution. This research leads to conclude most beneficial platformexplored due to the advancement for better performance soft-core in a reconfigurable FPGA of multi-core system. To exploit the functionality of 24 based SCP for SIMD extensions with incorporating DLP multi-core to greater extent, the application has to allocate may ensure the performance improvement with less power its computation task for execution across the cores with consumption. proper load balancing. Optimizing and evaluating the performance of multi-core systems becomes bottleneck for REFERENCES both hardware and software engineers. Gundolf et al. in [1] Iturbe X, Khaled Benkrid, Chuan Hong, Ali Ebrahim, Torrego 2015 [66], presents a customizable and highly scalable R, Martinez I, Arslan T and Perez J(2013). R3TOS: A novel eight core FPGA based processor called Paranut for reliable reconfigurable real-time for highly adaptive, efficient, and dependable computing on FPGA, IEEE SIMD, implemented in Virtex- 5 FPGA achieving 6.13 Transaction on Computation. Vol.62, No.8, pp 1542- 1556. CoreMarks/MHZ [2] Zine El Abidine and Ahmed (2009). Self-Partial and H.Analysis of Processors for SIMD Applications Dynamic Reconfiguration Implementation for AES using FPGA, International Journal of Computer Science, Vol. 2, No.2, pp 33-40. Numerous techniques and approaches have been proposed [3] Maheswari.R and Pattabiraman.V (2015). Performance in architecture level to meet the demanding performance, analysis of reconfigurable open soft core processor OR1200 with flexibility and power in SIMD applications [67], [68]. global hazard controller. International Journal of Pure and Applied While designing architecture for SIMD based applications, Mathematics, Vol. 101, No. 5, pp. 849-858. all its demands and the various requirements need to be [4] Hans-Peter Rosinger (2004). Connecting Customized IP to considered [69]. These extensive demands are MicroBlaze Soft Processor Using the Fast Simplex Link Channel, incorporated into the functional unit of the targeted system Xilinx-XAPP529 , Vol.3,No.1, pp.1-12. through reusable and reconfigurable techniques. All these [5] Shimoo Kosei, Yamawaki Akin and Iwane Masahiko (2004). functional units will be built as various IP blocks A New Parallel Processing Paradigm Using A Reconfigurable interconnected in SoC architecture. Modern architectures Computing System, International Symposium-Communication and with super-scalar, pipelined, out-of-order execution and Information Technologies, Sappom, Japan ,Vol.2, pp. 797-800. reconfigurability features made immense performance [6] Wong S, Anjam F and Nadeem F (2010). Dynamically enhancement by exploiting parallelism for SIMD [70] and Reconfigurable for a Softcore VLIW Processor, IEEE Design, Automation & Test in Europe Conference & Exhibition Lee G. et al. (2009) [71] have made analysis using vector (DATE), Dresden, pp. 969 – 972. processing and SIMD extension supporting programmable solutions with multimedia benchmarks. Xabier et al. [7] Codreanu V, Hobincu R (2010). Performance Gain from Data and Control Dependency Elimination in Embedded Processor, IEEE (2013) [72], Afonso et al. (2011) [73], Tan G. et al. International Symposium on Electronics and Telecommunications (2009)[74], and Tarek-Ghazawi et al. (2008)[75] have (ISETC), Timisoara, pp. 47- 50. proposed reconfigurable architecture, exploring parallel [8] Wong S, T.V. As, and G. Brown (2008): ρ-VEX :A and pipelined computation customizing for a specific Reconfigurable and Extensible Softcore VLIW Processor, in IEEE application. Jinlong et al. (2015) [76], Sai et al. (2009) International Conference on Field-Programmable Technologies [77], Ranganathan et al. (2010) [78] improve performance (ICFPT 08), Taipei, pp-369 - 372. by executing SIMD operations simultaneously using [9] Sommer, L., Stock, F., Solis-Vasquez, L., and Koch, A. multiple functional units of the processor and Kumar et al. (2020). Using Parallel Programming Models for Automotive (2014) [79], Stephan Wong et al. (2010) [80]utilized many Workloads on Heterogeneous Systems - a Case Study. In 28th EUROMICRO International Conference on Parallel, Distributed processing elements (PE) to execute the same operation in and Network-Based Processing (PDP’20). parallel over multiple data. Abbo et al. (2008) [81] proposed Xetal processor with DLP in a massively– [10] Hans-Peter Rosinger (2004). Connecting Customized IP to MicroBlaze Soft Processor Using the Fast Simplex Link Channel, parallel SIMD architecture achieving performance up to Xilinx-XAPP529 , Vol.3,No.1, pp.1-12. 140 GOPs (Group Of Pictures) at clock frequency 110 MHz. Nieto et al. (2015)[83] designed a hybrid 9 VOLUME 11 ISSUE 9 2020 http://www.proteusresearch.org/ Page No: 721 PROTEUS JOURNAL ISSN/eISSN: 0889-6348

[11] Jose Raul Garcia Ordaz and Dirk Koch, A Soft Dual- Many-Core Systems IEEE Transactions On Visualization And Processor System with a Partially Run-Time Reconfigurable Shared , Vol. 18, No. 1, pp-74-90. 128-Bit SIMD Engine, Conference: 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and [28] Borkar S. (2007). Thousand core chips: A technology Processors (ASAP) , July 2018 perspective. DAC ’07: Proceedings of the 44th annual Design Automation Conference. pp. 746-749. [12] Wei Han, Ying Yi, Mark Muir, Ioannis Nousias, Tughrul Arslan, Ahmet T. Edorgan, (2008). Efficient implementation of [29] Saghir M. A. R., El-Majzoub and Akl P. (2006). Datapath and wireless applications on multi-core platforms based on dynamically ISA customization for soft VLIW processors, IEEE International reconfigurable processors, IEEE Conference on Complex, Conference on Reconfigurable Computing and FPGA (ReConFig), Intelligent and Software Intensive Systems, Barcelona, pp.837 – San Luis Potosi, pp. 1-10. 842. [30] Garcia Valls M, Alonso A, and Puente, J.A. (2009). Dynamic [13] Saghir M.A.R, El-Majzoub, and Akl .P (2006). Datapath and adaptation mechanisms in multimedia embedded systems, In Proc. ISA customization for Soft VLIW Processors, IEEE International of IEEE International Conference on Industrial Informatics. pp. Conference on Reconfigurable Computing and FPGA(ReConFig), 188-193. San Luis Potosi ,pp.1-10 [31] OR1200 http://www.opencores.org. [14] Yew P.-C (2003) Is there exploitable thread-level parallelism [32] Razdan R. and Smith M. D. (1994). A high performance micro in general-purpose application programs?, IEEE International architecture with hardware programmable functional units, IEEE Symposium on Parallel and Distributed Processing (IPDPS ’03), Symposium on , pp. 172 - 180 . USA, p. 160.166. [33] Vassiliadis S., Wong S., Gaydadjiev S., Bertels K., Kuzmanov [15] Zine El Abidine and Ahmed (2009). Self partial and dynamic G. and Panainte E. M. (2004). The MOLEN polymorphic processor, reconfiguration implementation for AES using FPGA, International IEEE Transactions on , Vol. 53, No. 11, pp. 1363- 1375. Journal of Computer Science, Vol. 2, No. 2, pp. 33-40. [34] Campi F., Deledda A., Pizzotti M., Ciccarelli C., Mucci C. [16] El-Ghazawi, T.; El-Araby, E.; Miaoqing Huang; Gaj, K.; and Lodi A. (2007). A dynamically adaptive DSP for heterogeneous Kindratenko, V.;Buell (2008). The promise of high performance reconfigurable platforms, IEEE Conference Design, Automation & reconfigurable computing, IEEE Computer Society. Vol 41, No.2, Test in Europe Conference & Exhibition, Nice, pp. 1-6 pp. 69-76. [35] Mucci C., Bocchi M., Gagliardi P., Ciccarelli C., Lodi A., [17] Nicolas Siret Jean-Franc¸ Ois Nezan and Aimad Rhatay Toma M. and Campi F. (2006). A case study on multimedia (2011). Design of a processor optimized for syntax parsing in video applications for the XiRISC reconfigurable processor, IEEE decoders, IEEE International Conference on Design and Symposium on Circuits and Systems, Island of Kos, pp. 4859-4862. Architectures for Signal and Image Processing (DASIP), Tampere, pp.1-8. [36] Foldesy P., A. Zarandy and C. Rekeczky (2008). Configurable 3D-integrated focal-plane cellular sensor- processor array [18] Zine El Abidine and Ahmed (2009). Self-Partial and Dynamic architecture, International Journal of Circuit Theory and Reconfiguration Implementation for AES using FPGA, Applications, Vol. 36, No. 6, pp. 573-588. International Journal of Computer Science, Vol. 2, No.2, pp 33-40. [37] Rob Baxter, Stephen Booth, Mark Bull, Geoff Cawood, [19] Lotfifar. F, and S. Shahhiseini, (2008). Performance Modeling Kenton DMellow, Mark Parsons, James Perry, Alan Simpson and of partially reconfigurable computing systems, In Proc. of the 6th Arthur Trew (2007). High performance reconfigurable computing IEEE/ACS International Conference Systems and Applications the view from Edinburgh adaptive hardware and systems, (AICCSA), pp. 94-99. University of Edinburgh, pp. 273-279. [20] Salih Bayar and Arda Yurdakul (2008). Dynamic partial self- [38] Reynolds P. F., Pancerella C. M. and Srinivasan S. (1993). reconfiguration on Spartan-III FPGAs via a Parallel Configuration Design and performance analysis of hardware support for parallel Access Port (PCAP), Proceedings of the 2nd HiPEAC Workshop on simulations, Journal of Parallel and , Vol. 18, Reconfigurable Computing, Goteborg, Sweden, pp. 1–10. No. 4, pp. 435–453. [21] Liu D, Z. Shao, M. Wang, M. Guo, and J. Xue (2012). [39] Radu P., Michael Sperk and Simon Ostermann (2012). Optimal parallelization for maximizing iteration-level Evaluating high performance computing on Google App engine, parallelism, IEEE Transactions on Parallel and Distributed Systems, IEEE Software- Computer Society, Vol. 29, No. 2, pp. 52-58. Vol.23, No.3, pp. 564–572. [40] Martin C., Herbordt, Tom VanCourt, Yongfeng Gu., Bharat [22] Fahad Siddiqui,,Sam Amiri,Umar Ibrahim Minhas,Tiantai Sukhwani, Al Conti, Josh Model and Doug Disabello (2007). Deng,Roger Woods,Karen Rafferty and Daniel Crookes, FPGA- Achieving high performance with FPGA-based computing, IEEE Based Soft-Core Processors for Image Processing Applications, Computer Society, Vol. 40, No. 3, pp. 50-57 Journal of Signal Processing Systems volume 87, pages139– 156(2017). [41] Mattavelli M., Amer I. and Raulet M. (2010). The reconfigurable video coding standard [Standards in a Nutshell], [23] Maheswari R, Pattabiraman V, Sharmila P, Reconfigurable Signal Processing Magazine IEEE, Vol. 27, No. 3, pp. 159–167. Fpga Based Soft-Core Processor For SIMD Applications, Asian Journal of Pharmaceutical and Clinical Research, Vol. 10, no. 13, [42] Greg Stitt, Alan George, Herman Lam, Melissa Smith, Vikas Apr. 2017, pp. 180-6, Aggarwal, Gongyu Wang, James Coole, Casey Reard, Brian Holland and Seth Koehler (2011). An end-to-end tool flow or [24] Zhuo L. and Prasanna V. K. (2007). Scalable and modular FPGA-accelerated scientific computing, IEEE Design & Test of algorithms for floating-point matrix multiplication of reconfigurable Computers, pp. 68-77. computing systems, IEEE Transactions on Parallel and Distributed Systems, Vol. 18, No. 4, pp. 433-448. [43] Moussali R., Ghanem N. and Saghir A. R. (2007). Supporting multithreading in configurable soft processor cores in CASES ’07: [25] Maheswari.R and Pattabiraman.V (2017). Improved Proceedings of International conference on Compilers, architecture, reconfigurable hyper-pipeline soft-core processor on FPGA for and synthesis for embedded systems, Austria, pp. 155-159 SIMD. International Journal of High Performance Computing and Networking (InderScience), Volume: 10, Issue: 3 ,pp:207-217 [44] Barthe L., Cargnini L. V., Benoit P. and Torres L. (2011). The SecretBlaze: A configurable and costeffective open-source soft-core [26] Yew P.C. (2003) Is there exploitable thread-level parallelism processor, IEEE International Parallel & Distributed Processing in general-purpose application programs?, IEEE International Symposium, Anchorage (Alaska), USA, pp. 310-313 Symposium on Parallel and Distributed Processing, USA, pp. 160- 166 [45] Rahul R., Balwaik, Shailja R. Nayak and Amutha Jeyakumar (2013). Open-Source 32-Bit RISC soft-core processors, IOSR [27] Mark Howison, E. Wes Bethel and Hank Childs (2011). Journal of VLSI and Signal Processing, Vol. 2, No. 4, pp. 43-46 Hybrid Parallelism for Volume Rendering on Large-, Multi-, and 10 VOLUME 11 ISSUE 9 2020 http://www.proteusresearch.org/ Page No: 722 PROTEUS JOURNAL ISSN/eISSN: 0889-6348

[46] Gaisler A. (2010). SPARC V8 32-bit processor [63] Mattson T. G., Van Der Wijngaart R. and Frumkin M. (2008). LEON3/LEON3-FT Companion Core Data Sheet, Version 1.1, pp. Programming the Intel 80-core network-on-a-chip terascale 1-30. processor. Proceedings of the Conference on Supercomputing (SC’08). pp. 1-11. [47] Stephen Craven, Cameron Patterson and Peter Athanas (2005). Configurable soft processor arrays using the openfire [64] Chrysos G. (2012). Intel Xeon Phi co-processor (codename processor, Conference on Military and Aerospace Programmable Knights Corner). Proceedings of the 24th Hot Chips Symposium, Logic Devices, Washington, pp. 7-9. Cupertino, CA, pp. 325- 347. [48] Nieto, Vilarino D. L. and Brea V. M. (2012). SIMD/MIMD [65] Min Z., Leibo Liu, Shouyi Yin, Yansheng Wang, Wenjie dynamically-reconfigurable architecture for high-performance Wang and Shaojun Wei (2010). A reconfigurable multi-processor embedded vision systems, IEEE 23rd International Conference on SoC for media applications. IEEE Symposium on Circuits and Application-Specific Systems, Architectures and Processors, Delft, Systems, Paris, pp. 2011-2014. pp. 94-101. [66] Gundolf Kiefer, Michael Seider and Michael Schaeferling [49] Chuan Hong, Khaled Benkrid, Xabier Iturbe, Ali Ebrahim and (2015). ParaNut-An open, scalable and highly parallel processor Tughrul Arslan (2011). Efficient on-chip task scheduler and architecture for FPGA-based system, Embedded World Conference, allocator for reconfigurable operating systems, IEEE Transactions Nuremberg, Germany, pp. 1-7. on Embedded Systems Letters, Vol. 3, No. 3, pp. 85-88. [67] Diken E., O’Riordan M.J., Jordans R., Jozwiak and Corporaal [50] Haynes S. D., Stone J., Cheung P. Y. K. and Luk W. (2000). H. (2015). Mixed-length SIMD code generation for VLIW Video image processing with the Sonic architecture, IEEE architectures with multiple native vector-widths, IEEE Computer, Vol. 33, pp. 50–57. Conference:Application-specific Systems, Architectures and Processors (ASAP), Toronto, pp. 181-188. [51] Horta E. L., Lockwood John W., E. Taylor and David Parlour (2012). Dynamic hardware plugins in an FPGA with partial run- [68] Park C., Euiseong Seo, Ji-Yong Shin, Maeng and Joonwon time reconfiguration. Design Automation Conference, USA, pp. (2010), Exploiting internal parallelism of flash-based SSDs, IEEE 343-348. Transactions on Letters, Vol. 9, No. 1, pp. 9-12. [52] Ranganathan P., Adve S. and Jouppi N. (2010). Performance of image and video processing with general-purpose processors and [69] Sohan Purohit, Sai Rahul Chalamalasetti, Martin Margala and media ISA extensions, ACM SIGARCH Computer Architecture - Wim Vanderbauwhede (2013). /resource-efficient International Symposium on Computer Architecture, Vol. 27, No. 2, reconfigurable processor for multimedia applications, IEEE pp. 124- 135 Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 21, No. 7, pp. 1346-1350. [53] Azarmehr M., Ahmadi G.A., Jullien R. and Muscedere (2010). High-speed and low-power reconfigurable architectures of 2-digit [70] Marisol Garca-Valls, Pablo Basanta-Val and Iria Estvez-Ayres two-dimensional logarithmic number system-based recursive (2011). Real-time reconfiguration in multimedia embedded systems, multipliers, IET Circuits Devices System, Vol. 4, No. 5, pp. 374– IEEE Transactions on Consumer Electronics, Vol. 57, No. 3, pp. 381. 1280-1287. [54] Song Sun, Madhu Monga, Phillip H. Jones and Joseph [71] Lee G. G., Chen Y. K., Mattavelli M. and Jang E. S. (2009). Zambreno (2012). An I/O bandwidthsensitive sparse matrix-vector Algorithm/architecture co-exploration of visual computing: multiplication engine on FPGAs, IEEE Transactions on Circuits and overview and future perspectives, IEEE Transactions on Circuits Systems, Vol. 59, No. 1, pp. 113-123. and Systems for Video Technology, Vol. 19, No. 11, pp. 1576- 1587. [55] Somnath Paul, Subho Chatterjee, Saibal Mukhopadhyay and Swarup Bhunia (2011). Energy efficient reconfigurable computing [72] Xabier Iturbe, Khaled Benkrid, Chuan Hong and Ali Ebrahi using a circuit architecture software co-design approach, IEEE (2013). R3TOS: A novel reliable reconfigurable real-time operating Transactions on Circuits and Systems, Vol. 1, No. 3, pp. 369-380. system for highly adaptive, efficient, and dependable computing on FPGAs, IEEE Computer Society, pp. 1542-1556. [56] Gwo Giun Lee, He-Yuan Lin, Chun-Fu Chen and Tsung-Yaun Huang (2012). Quantifying intrinsic parallelism using linear algebra [73] Afonso, Rabie Ben, Alexandre Loyer, Jean-Luc Dekeyser, for algorithm/architecture co-exploration, IEEE Transactions on Belanger and Martial (2011). A prototyping environment for high Parallel and Distributed Systems, Vol. 23, No. 5, pp. 944-957 performance reconfigurable computing, IEEE International Workshop ReCoSoC, pp. 1-8. [57] Zhiyi Yu, Meeuwsen M. J., Apperson R. W., Sattari O., Lai M., Webb J. W., Work E. W., Truong D., Mohsenin T., and Baas B. [74] Tan G., Sun N. and Gao G. R. (2009). Improving performance M. (2008). AsAP: An Asynchronous Array of Simple Processors, of dynamic programming via parallelism and locality on multi-core IEEE Journal of Solid-State Circuits, Vol. 43, No. 3, pp. 695-705. architectures, IEEE Transactions on Parallel & Distributed Systems, Vol. 20, No. 2, pp. 261-274. [58] Zine El Abidine and Ahmed (2009). Self partial and dynamic reconfiguration implementation for AES using FPGA, International [75] Tarek El-Ghazawi, Esam El-Araby, Miaoqing Huang, Kris Journal of Computer Science, Vol. 2, No. 2, pp. 33-40. Gaj, Volodymyr Kindratenko and Duncan Buell (2008). The promise of high-performance reconfigurable computing, IEEE [59] Nobuaki O., Yoshihiro Yasuda, Yoshiki Saito, Daisuke Computer Society, pp. 9-76. Ikebuchi, Masayuki Kimura, Hideharu Amano, Hiroshi Nakamura and Kimiyoshi Usami (2011). Cool Mega-arrays: Ultralow-power [76] Jinlong Xu, Huihui Sun and Rongcai Zhao (2015). SIMD reconfigurable accelerator chips, IEEE Micro-Computer Society, vectorization of nested loop based on strip mining, Software Vol. 31, No. 6, pp. 6-18. Engineering, , Networking and Parallel/Distributed Computing (SNPD), IEEE/ACIS International [60] Sudarshan Banerjee, Elaheh Bozorgzadeh and Nikil Dutt Conference, pp. 1-7. (2009). Exploiting application data-parallelism on dynamically reconfigurable architectures: placement and architectural [77] Sai Rahul Chalamalasetti and Wim Vanderbauwhede (2009). considerations, IEEE Transactions on VLSI Systems, Vol. 17, No. A low cost reconfigurable soft processor for multimedia 2, pp. 234-247 applications: design synthesis and programming model, IEEE International Conference on Field Programmable Logic and [61] Russell Tessier, Kenneth Pocek and Andre Dehon (2015). Applications, Prague, pp. 534- 538. Reconfigurable Computing Architectures, Proceedings of the IEEE, Vol. 103, No. 3, pp. 332—354. [78] Ranganathan P., Adve S. and Jouppi N. (2010). Performance of image and video processing with general-purpose processors and [62] Wentzla D., Grin P., Homann H., Bao L., Edwards B., Ramey media ISA extensions, ACM SIGARCH Computer Architecture - C., Mattina M., Miao C., Brown J. and Agarwal A. (2007). On-chip International Symposium on Computer Architecture, Vol. 27, No. 2, interconnection architecture of the tile processor. IEEE Micro, Vol. pp. 124- 135. 27, No. 5, pp. 15-31. 11 VOLUME 11 ISSUE 9 2020 http://www.proteusresearch.org/ Page No: 723 PROTEUS JOURNAL ISSN/eISSN: 0889-6348

[79] Kumar C. and Azam M. S. (2014). A multi-processing architecture for accelerating Haar-based face detection on FPGA , IEEE, 9th International Conference on Industrial and Information Systems (ICIIS), pp. 1-5. [80] Stephan Wong, Fakhar Anjam and Faisal Nadeem (2010). Dynamically reconfigurable register file for a softcore VLIW processor. IEEE Conference & Exhibition on Design, Automation & Test, Europe, Dresden, pp. 969-972 [81] Abbo A. A., R. P. Kleihorst, V. Choudhary, L. Sevat, P. Wielage, S. Mouy, B. Vermeulen and M. Heijligers (2008). Xetal-II :A 107 gops, 600 mw processor for video scene analysis, IEEE Journal of Solid-State Circuits, Vol. 43, No. 1, pp. 192-201. [82] Nieto A., Vilarino, D.L. and Brea V. M. (2015). Precision: A reconfigurable SIMD/MIMD for computer vision systems-on-chip, IEEE Transactions on Computers, Vol. 9, No. 9, pp. 1-8.

12 VOLUME 11 ISSUE 9 2020 http://www.proteusresearch.org/ Page No: 724