Automatic Instruction-set Architecture Synthesis for VLIW Processor Cores in the ASAM Project

Roel Jordans a,∗, Lech Jóźwiak a, Henk Corporaal a, Rosilde Corvino b

a Eindhoven University of Technology, Postbus 513, 5600MB Eindhoven, The Netherlands
b Intel Benelux B.V., Capronilaan 37, 1119NG Schiphol-Rijk

Abstract

The design of high-performance application-specific multi-core processor systems is still a time-consuming task which involves many manual steps and decisions that need to be performed by experienced design engineers. The ASAM project sought to change this by proposing an automatic architecture synthesis and mapping flow aimed at the design of such application-specific instruction-set processor (ASIP) systems. The ASAM flow separated the design problem into two cooperating exploration levels, known as the macro-level and micro-level exploration. This paper presents an overview of the micro-level exploration, which is concerned with the analysis and design of the individual processors within the overall multi-core design, starting at the initial exploration stages and continuing up to the selection of the final design of the individual processors within the system. The designed processors use a combination of very long instruction word (VLIW), single-instruction multiple-data (SIMD), and complex custom DSP-like operations in order to provide an area- and energy-efficient, high-performance execution of the program parts assigned to each processor node. In this paper we present an overview of how the micro-level design space exploration interacts with the macro-level, how early performance estimates are used within the ASAM flow to determine the tasks executed by each processor node, and how an initial processor design is then proposed and refined into a highly specialized VLIW ASIP. The micro-level architecture exploration is then demonstrated with a walk-through description of the process on an example program kernel to further clarify the exploration and architecture specialization process.

The main findings of the experimental research are that the presented method enables automatic instruction-set architecture synthesis for VLIW ASIPs within a reasonable exploration time. Using the presented approach, we were able to automatically determine an initial architecture prototype that met the temporal performance requirements of the target application. Subsequent refinement of this architecture considerably reduced both the design area (by 4x) and the active energy consumption (by 2x).

Keywords: Very Long Instruction Word (VLIW), Application-Specific Instruction-set Processor (ASIP), Instruction-set architecture synthesis

1. Introduction

The recent nano-dimension semiconductor technology nodes enable the implementation of complex (heterogeneous) parallel processors and very complex multi-processor systems on a single chip (MPSoCs). Such MPSoCs may involve tens of different complex parallel processors and substantial memory and communication resources, and can realize complex high-performance computations

∗Corresponding author.
Email address: [email protected] (Roel Jordans)

Preprint submitted to Microprocessors and Microsystems, April 19, 2017

in an energy-efficient way. This facilitates rapid progress in mobile and autonomous computing, global networking, and wireless communication which, combined with substantial progress in sensor and actuator technologies, creates important new opportunities. Many traditional applications can now be served much better, but, more importantly, numerous new sorts of smart communicating mobile and autonomous cyber-physical systems have become technologically feasible and economically justified. Examples include various systems performing monitoring, control, diagnostics, communication, visualization, or a combination of these tasks, and representing (parts of) different mobile, remote, or poorly accessible objects, installations, machines, vehicles, or devices, or even being wearable or implantable in human or animal bodies. A new wave of the information technology revolution is arriving that has started to create much more coherent and fit-for-purpose modern smart communicating cyber-physical systems (CPS) that are part of the Internet of Things (IoT). The rapidly developing markets of modern CPS and IoT represent great opportunities, but these opportunities come with the high system complexity and stringent requirements of many modern CPS applications. Two essential characteristics of mobile CPS are:

• the requirement of (ultra-)low energy consumption, often combined with the requirement of high performance, and

• heterogeneity, in the sense of convergence and combination of various previously separate applications, systems, and technologies of different sorts in one system, or even in a single chip or package.

Smart wearable devices represent a huge heterogeneous CPS system area, covering many different fields and kinds of applications (from mobile personal computing and communications, through security and safety systems, to wireless health-care or sports applications), with each application having its own specific requirements. A modern car likewise involves numerous subsystems imposing different specific requirements. What is common to virtually all these applications, however, is that they are autonomous regarding their limited energy sources, and therefore are required to ensure (ultra-)low energy consumption. Some of these applications do not impose any high demands on computing and communication performance, and can satisfactorily be served by simple micro-controllers and short-distance communication with, e.g., a smartphone. Other applications impose more substantial computing and communication performance requirements and need more sophisticated low-power micro-controllers or micro-controller MPSoCs, equipped among others with on-chip embedded memories, multiple sensors, A/D converters, and Bluetooth Low Energy, ZigBee, or multi-standard communication. However, many of the modern complex and highly demanding mobile and autonomous CPS impose ultra-high demands that are difficult to satisfy, and for which micro-controller or general-purpose processor based solutions are not satisfactory.
These are mainly applications that are required to provide continuous autonomous service over a long time and involve big instant data generated/consumed by video sensors/actuators and other multi-sensors/actuators, and consequently demand at the same time (ultra-)low energy consumption (often below 1 W) and (ultra-)high performance in computing and communication (often above 1 Gbps). Examples of such applications include: mobile personal computing, communication and media, video sensing and monitoring, computer vision, augmented reality, image and other multi-sensor processing for health-care and well-being, mobile robots, intelligent car applications, etc. Specifically, smart cars and various wearable systems to a growing degree involve big instant data from multiple complex sensors (e.g. camera, radar, lidar, ultrasonic, EEG, sensor-network tissues, etc.) or from other systems. To process these big data in real time, they demand guaranteed (ultra-)high performance. Being used for safety-critical applications and demanded to provide continuous autonomous service over a long time, they are also required to be highly reliable, safe, and secure. The (ultra-)high demands of the modern mobile CPS, combined with today's unusual silicon and system complexity, result in serious system development challenges, such as:

• guaranteeing real-time high performance, while at the same time satisfying the requirements of (ultra-)low energy consumption, and high safety, security, and dependability;

• accounting in design for many aspects and for changed relationships among aspects (e.g. leakage power, negligible in the past, is a very serious issue now);

• complex multi-level multi-objective optimization (e.g. processing power versus energy consumption and silicon area) and adequate resolution of numerous complex design trade-offs (e.g. between various conflicting requirements, solutions at different design levels, or solutions for different system parts);

• reduction of the design productivity gap for the increasingly complex and sophisticated systems;

• reduction of the time-to-market and development costs without compromising the system quality, etc.

To satisfy the demands discussed above and overcome the challenges, a substantial system architecture and design technology adaptation is necessary. New sophisticated high-performance and low-power heterogeneous computing architectures are needed, involving highly parallel application-specific instruction-set processors (ASIPs) and/or hardware accelerators, as well as new effective design methods and design automation tools to efficiently create the sophisticated architectures and perform the complex processes of application mapping onto those architectures. The systemic realizations of complex and highly demanding applications, for which the customizable multi-ASIP MPSoC technology is especially suitable, demand performance and energy-usage levels comparable to those of ASICs, while requiring a short time-to-market and remaining cost-effective. Satisfaction of these stringent and often conflicting application demands requires construction of highly optimized application-specific hardware architectures and corresponding software structures mapped onto the architectures.
This can be achieved through an efficient exploitation of the different kinds of parallelism involved in these applications, implementation of critical parts of their information processing in application-specific hardware, as well as efficient trade-off exploitation among various design characteristics, and between solutions considered at different design levels and for different system parts. Modern complex mobile applications include many various parts and numerous different algorithms involving various kinds of information processing with various kinds of parallelism (task-level, loop-level, operation/instruction-level, and data parallelism). They are by their very nature complex and heterogeneous, and require a heterogeneous application-specific multi-processor system approach. In a customizable multi-ASIP MPSoC, the various ASIPs have to be customized together, in strict relation to the selection of the number of ASIPs, as well as to the mapping of the application's required computations on the particular ASIPs in both space and time. The MPSoC macro- and micro-architectures, at the multi-ASIP system level and at the single ASIP processor level, are strictly interrelated. Important trade-offs have to be resolved regarding the number and granularity of the individual processors, and the amount of parallelism and resources at each of the two architecture levels. Moreover, at both architecture levels, the optimized parallel software structures have to be implemented on parallel hardware structures optimized for them. The two architecture levels are also strongly interwoven through their relationships with the memory and communication structures. Each micro-/macro-architecture combination with a different parallel computation structure organization requires different compatible memory and communication architectures.
Moreover, the traditional algorithm and software development approaches require an existing and stable computation platform (HW platform, compilers, etc.), while for modern embedded systems such as MPSoCs based on adaptable ASIPs the hardware and software architectures have to be application-specific, and must be developed largely in parallel. Unfortunately, the efficiency of the required parallel HW and SW development is much too low with the currently available development technology, due to the lack of effective automated methods of industrial strength for many MPSoC design problems, and the weak interoperability of the HW/SW architecture design and hardware synthesis tools. The inefficiencies identified above result in a substantially lower than attainable quality of the resulting systems, much longer than necessary development times, and much higher development costs.

In consequence, the optimization of the performance/resources trade-off required by a particular application can only be achieved through a careful construction of an adequate application-specific macro-/micro-architecture combination, where at both architecture levels the optimized parallel software structures are implemented on parallel hardware structures optimized for them. The aim here is thus to find an adequate balance between the number and kinds of parallel processors, the complexity of the inter-processor communication, and the intra-processor parallelism and complexity. To achieve this aim, based on the application analysis and restructuring, several promising macro-architecture/micro-architecture combinations have to be automatically constructed and evaluated in an iterative process, and finally the best of them has to be selected for actual realization. This paper considers the highly relevant problem of an automated design technology for 1) constructing high-quality heterogeneous highly parallel computing platforms for the highly demanding cyber-physical applications briefly discussed above, and 2) efficiently mapping the complex applications on these platforms. Such a new design technology for ASIP-based MPSoCs, satisfying the requirements briefly sketched above, has been developed in the scope of the European research project ASAM (Architecture Synthesis and Application Mapping for heterogeneous MPSoCs based on adaptable ASIPs) of the ARTEMIS program. Its overview can be found in [1]. This paper is devoted to the part of the ASAM design technology related to the design methods and tools for a single ASIP-based HW/SW sub-system, developed by the ASAM research team from the Eindhoven University of Technology led by the second author of this paper.
Since the design flow developed by the ASAM project substantially differs from earlier published flows, after discussing the related work, the paper briefly explains the ASAM design flow. Subsequently, it focuses on the part of the design flow related to the automatic VLIW ASIP-based HW/SW architecture exploration and synthesis (which has not been published in as much detail previously), and explains the design methods and automation tools of this part. The paper proposes a novel formulation of the ASIP-based HW/SW sub-system design problem as an actual hardware/software co-design problem, i.e. the simultaneous construction of an optimized parallel software structure and a corresponding parallel ASIP architecture, as well as the mapping in space and time of the constructed parallel software on the parallel ASIP hardware. It proposes a novel solution of the problem so formulated, composed of a new design flow, its methods, and corresponding design automation tools. Specifically, it extensively elaborates on the instruction-set architecture synthesis of the adaptable VLIW ASIPs and discusses the related results of experimental research. The experimental results clearly confirm the high effectiveness and efficiency of the developed design methods and design automation tools.

2. Related work

The development of contemporary digital systems relies heavily on electronic design automation (EDA) tools. Placing, sizing, and connecting the billion or more transistors of a contemporary MPSoC is simply not possible without a huge amount of fully automated design assistance. Historically, EDA tools focused solely on the placement and routing of transistors. Over time, however, this limited approach became infeasible as circuit complexity increased. As a result, EDA tools adopted libraries of higher-level standard components. Initially these components were simple logic gates (and, or, etc.), but later the usage of only these small blocks also proved insufficient, and larger, so-called Intellectual Property (IP) blocks were added to the libraries. These IP blocks can be as simple as a memory controller, but may also contain complete processors including local cache memories. Nowadays the design and support of such IP libraries containing complex processor IP has become an important part of the digital electronics design industry, and the sole reason for the existence of companies such as ARM or Imagination Technologies. Managing a system-level design involving several such complex IP blocks is a very complex task which requires highly specialized tools. Currently, three major EDA tool vendors deliver such tools

(Synopsys1, Cadence2, and Mentor Graphics3), and virtually everyone designing or using IP blocks will be using the EDA tools of one or more of these companies. Mentor Graphics, the smallest of the three, focuses mostly on the realization of designs provided by human experts and does not (by itself) provide much support for choosing between alternative high-level designs. The tool that comes closest to providing an automated application-to-design path is Calypto Design Systems' Catapult-C, which started off as a product from Mentor Graphics. Catapult-C, however, is mostly aimed at the high-level synthesis of hardware accelerators only, and has no special advantage when used to design application-specific processor architectures. It can, however, be useful when creating complex custom operations for integration into a new processor design. In contrast, Cadence and Synopsys do provide tools which allow for a more automatic design of both hardware accelerators and application-specific processor architectures. This paper presents an overview of several of the currently available methods for the (automated) design of customized application-specific processor architectures, which are quite closely related to the presented research. However, the original design choices of these previous approaches often have a lasting impact on the abilities and strengths of each of these toolflows, as will be discussed below. This section first presents the commercially available tools from both Cadence and Synopsys, and then continues with a presentation of the recent research on the topic. We finalize the related work with a discussion of the SiliconHive ASIP-based MPSoC design framework that was used within the ASAM project.

2.1. Commercial EDA tools

Both Cadence and Synopsys provide a large portfolio of EDA tools. These various tools are aimed at different phases of the design process, and can often be used in combination with each other in a semi-integrated fashion to offer a complete design flow from a high-level design problem specification to a detailed circuit design. In the last decade, through a series of external acquisitions, both vendors have been moving to include more high-level design tools in their tool frameworks. This section briefly discusses the tools of both vendors which are relevant in relation to the automated instruction-set architecture synthesis of VLIW processors, the topic of this work.

2.1.1. Cadence

Similarly to Mentor Graphics, Cadence traditionally focused on providing tools that take a complete design and implement it in the latest technology. As such, Cadence mostly provides EDA tools that take a prepared high-level system design and iteratively translate it into a more detailed lower-level design until the actual circuit is realized. More recently, however, Cadence has strengthened its position in the automated high-level design market, first by acquiring Tensilica in 2013, and thereafter by the acquisition of Forte in 2014. Forte's Cynthesizer tool, together with the Cadence C-to-Silicon design-flow, provided Cadence with a high-level synthesis design-flow similar to that of Mentor Graphics. However, as with Mentor Graphics, Forte's tools and Cadence's C-to-Silicon design-flow mostly focus on the high-level synthesis of hardware accelerators and less on the automatic synthesis of application-specific processor architectures. The acquisition of Tensilica changed this. Tensilica was a company that specialized in customizable programmable IP solutions. Its tools include a language that allows a designer to describe a new processor architecture at the instruction-set architecture level and to re-use the processor IP. Based on this architecture description, the Tensilica tools automatically generate the processor architecture hardware design together with the required support software to program the newly designed processor architecture. These tools now live on as part of the Cadence Xtensa tool-suite.

1http://www.synopsys.com 2http://www.cadence.com 3http://www.mentor.com

The Cadence Xtensa design-flow4 automates the construction of new processor architectures and their corresponding support software. The designer is presented with a configurable base processor architecture which can be extended with extra operations. These operations are specified manually by the expert designer and are included directly in the processor datapath as designer-defined instructions. Both the hardware description (RTL) of an instantiated and/or extended processor architecture and the supporting system modeling and software tools (simulator/compiler) are then generated for the revised architecture in minutes. This provides an expert user with a methodology to quickly evaluate the effect of different processor architecture variations on the performance of the target application. This design-flow helps a lot when designing an application-specific processor architecture, but still relies on the design exploration and decisions of an experienced human designer. Identifying customization possibilities and other extensions, such as the addition of custom operation patterns, requires either the usage of external tools or the presence of an expert user.

2.1.2. Synopsys

Synopsys, currently the largest of the three main EDA companies, has been involved in electronic system-level design a bit longer than the other two but, like Cadence, has also recently been expanding its interest in processor architecture synthesis tools. These tools include Synopsys Processor Designer5, which features the design-flow that was acquired from CoWare in 2010, and the Synopsys ASIP Designer tools6 (formerly IP Designer), which were acquired from Target in 2014. The Processor Designer toolflow allows a user to describe a processor architecture in the LISA architecture description language and automatically creates both the hardware description and the software support tools. The LISA language provides high flexibility to describe the instruction-sets of various processors, such as SIMD, MIMD, and VLIW-type architectures [2, 3]. Moreover, processors with complex pipelines can be easily modeled. The original purpose of LISA was to automatically generate instruction-set simulators and assemblers [3]; it was later extended to also include hardware synthesis [2]. Synopsys ASIP Designer provides a different architecture description language (nML [4, 5]) which is also aimed at the description and synthesis of application-specific processor architectures and their support software. The nML language is very similar in intent and purpose to the LISA language, but several subtle differences exist. For example, nML aims more directly at describing the detailed instruction-set architecture of the processor, including the semantics and encoding of instructions, which makes it slightly more suited for generating a retargetable C compiler together with the simulator and processor hardware [4]. This stronger focus on generating a full compiler does, however, restrict the possible hardware optimizations compared to the LISA-based flow.
For example, sharing hardware resources within a function unit is more limited for nML-based architectures than it is for those described using LISA [2]. Apart from these small differences, Processor Designer and ASIP Designer remain very similar. In both cases an expert user has to provide an architecture description from which the tools are then able to generate a compiler and simulator. Using these generated tools, the user can then compile and simulate the target application on the proposed architecture. Cycle count and resource usage statistics are then gathered by the user, upon which further alterations to the processor architecture can be proposed. The selection and implementation of these extensions is done manually by the user. Hardware (RTL) generation is usually only performed after the user is satisfied with the performance of a simulated version of the processor, because of the time-consuming nature of running the actual hardware synthesis and the extremely slow RTL simulation. Synopsys also offers a high-level synthesis tool called Synphony C Compiler. Again, this

4http://ip.cadence.com/ipportfolio/tensilica-ip/xtensa-customizable 5http://www.synopsys.com/systems/blockdesign/processordev 6http://www.synopsys.com/dw/ipdir.php?ds=asip-designer

high-level synthesis tool is aimed more at non-programmable hardware accelerators and less at application-specific processor architecture design. However, this has not always been the case. The Synphony C Compiler was the product of Synfora, which originated in 2003 from the PICO project as a start-up company. PICO, which stands for Program-In, Chip-Out, did much more than the synthesis of hardware accelerators and will be discussed in more detail in Section 2.2.3.

2.2. Tools from research projects

In parallel to the commercial offerings discussed above, several research projects have recently been performed in relation to the high-level design of application-specific processor architectures. Most of these research projects focused on providing or improving architecture description languages for the construction of application-specific processor architectures (see Section 2.2.1). Two projects differentiate themselves from the others in that they provide support for automated architecture exploration. These projects, the TCE framework and the PICO project, will be discussed separately and in more detail in Sections 2.2.2 and 2.2.3 respectively.

2.2.1. Architecture description languages

Several domain-specific languages have been developed for the description of both the functionality and the structure of application-specific processor architectures. Using such an architecture description language (ADL) and its related EDA tools allows a designer to quickly instantiate variations of a processor architecture and consider the effects of design choices on the cost and performance of the resulting processor. Various design analysis tools, including simulation, are commonly provided with the ADL tools for this purpose. Examples of such languages are the nML and LISA languages as used by Synopsys ASIP Designer and Synopsys Processor Designer respectively. More diversity exists in research: ArchC [6] is still being developed at the University of Campinas in Brazil7, Codal [7, 8] is being researched by both Codasip8 and the Brno University of Technology in the Czech Republic, and a new version (3.0) of LISA [2, 9, 10] is in development at RWTH Aachen University9. The current research on these languages and their respective frameworks focuses mostly on the translation of a high-level processor architecture description into a corresponding structural description in a hardware description language such as VHDL or Verilog, as well as on the generation of programming tools such as a C/C++ compiler or assembler, debugging and instruction-set simulation tools, and application analysis and profiling tools. In some cases (e.g. LISA 3.0 and ArchC), support for system-level or multi-processor integration is also being added. LISA 3.0 also improves on its previous incarnation by adding support for reconfigurable computing [9, 10]. A reconfigurable processor introduces a reconfigurable logic component in or near the datapath. Such a component can implement either fine-grained reconfigurability, similar to a small field-programmable gate array (FPGA), or coarse-grained reconfigurability, more like a coarse-grained reconfigurable array (CGRA) [11].
This addition allows for further customization of the instruction-set even after the finalization of the processor silicon, for example during processor initialization or possibly even at runtime. In general, the process of using these ADL-based tools is very similar to that of the Synopsys tools described above. Support is provided for constructing both development tools, such as a compiler and simulator, as well as an RTL description of the hardware of a processor specified using the ADL. Application analysis tools focusing on highlighting hot spots and candidate instruction-set extensions can also be provided to the designer. However, as with the Cadence and Synopsys tools, the final decision making on which processor variation to consider for the next design iteration is left to the expert designer. Many more ADL frameworks exist, and this section named only a few which were relevant to the research presented in this paper; for more information on this topic see the book "Processor Description Languages" by Mishra and Dutt [12].

7http://www.archc.org 8http://www.codasip.com 9http://www.ice.rwth-aachen.de/research/tools-projects/lisa/lisa

Figure 1: Transport Triggered Architecture processor template

Source: http://tce.cs.tut.fi/screenshots/designing_the_architecture

2.2.2. TCE: TTA-based Co-design Environment

The TTA-based Co-design Environment10 is a set of tools aimed at designing processor architectures according to the Transport Triggered Architecture (TTA) template. TTA processors are, like VLIW processors, a subset of the explicitly programmed processor architectures, and can be seen as exposed-datapath VLIW processors. Unlike VLIW processors, TTA processors do not directly encode which operations are to be executed, but are programmed by specifying data movements. As a result, all register file bypassing from function units and register files is fully exposed to the compiler. The TTA programming model has the benefit of enabling software bypassing, a technique where short-lived intermediate results of computations are directly forwarded to the function unit consuming the data, while completely bypassing the register file. This reduces both the register file size and port requirements, as well as the energy consumption of the TTA architecture compared to more traditional VLIW architectures. A similar reduction of the register file energy consumption can be obtained using hardware bypassing [13, 14], but that technique generally has a larger hardware overhead, as it requires run-time detection of bypassing opportunities. Figure 1 illustrates the TTA processor architecture template. It shows how the function units, register file, and control unit are connected through sockets to the transfer buses. Programming is achieved by controlling the connections of the sockets with the buses. From this figure it is also clear that register file bypassing can be implemented in software, simply by forwarding a result from one function unit directly to the input of another. Research on Transport Triggered Architectures started with the MOVE project [15, 16] at Delft University of Technology during the 1990s.
Later, when the Delft MOVE project was discontinued, Tampere University of Technology continued the research and created the next generation of the MOVE framework, which they named the TTA-based Co-design Environment. Hoogerbrugge and Corporaal [17, 15, 16] investigated the automatic synthesis of TTA processor architectures as part of the MOVE project. Derivatives of this work are still available as the MOVE-PRO architecture [18] and within the TCE. Chapter 6 of the TCE manual [19], titled "Co-design tools", is dedicated to the tools available for supporting automatic processor architecture exploration. These tools

10http://tce.cs.tut.fi/

Figure 2: PICO design-flow organization [20]

are somewhat similar to the tools and techniques presented in this paper. However, this paper presents several techniques and tools that target another processor architecture style (VLIW). Most of the architecture exploration techniques proposed in this paper could, after small modifications, also apply to the TCE and could help to further reduce TTA-based processor architecture exploration times.

2.2.3. PICO: Program-In Chip-Out

As mentioned above, the PICO project, a grandparent to some parts of the current Synopsys design-flow, offered more than only the automatic synthesis of hardware accelerators which was incorporated into the Synphony C Compiler. In its original form, PICO covered the automatic synthesis of a processor system containing a set of non-programmable hardware accelerators combined with a single VLIW processor [20, 21]. In essence, the goals of PICO were very similar to those of the ASAM project. Both projects aimed to automatically develop an application-specific multi-processor system. However, there are also several key differences. PICO approaches the problem by synthesizing a system with a single VLIW processor and a set of hardware accelerators, whereas the ASAM project utilizes one or more heavily specialized, highly parallel VLIW processors and no hardware accelerators. Several other substantial differences between these two approaches stem mostly from the differences in their VLIW processor templates. For example, the PICO VLIW processor template uses a single register file for each data-type (integer, floating point, etc.), which severely limits the number of operations that can be executed in parallel: many read ports need to be available to provide the operands to each operation executed in parallel. Large many-ported register files quickly become very expensive in both area and energy, and limit the maximum operating frequency of the PICO VLIW processor [22, 23].

The PICO design-flow, illustrated in Figure 2, provides a fully automated design flow for developing the non-programmable accelerator (NPA) subsystems, the VLIW control processor, and the cache memory hierarchy. To limit the size of the design space, each of these three components (NPAs, VLIW architecture, and cache hierarchy) is explored independently from the others, because considering all three components combined resulted in a design space too large for any effective automated exploration [20, 21]. This is different from our approach. During the exploration, a Pareto-optimal set of solutions is obtained for each of the three system components. First, compute-intensive kernels are identified (2) (see step 2 in Figure 2) in the input C code (1) and NPA architectures are explored for each of these kernels (3), (4). The compute-intensive parts are then replaced with calls to the hardware accelerators in the original C code (5) and a set of alternative VLIW processor architectures is then designed (6), (7), (8). Finally, the cache hierarchy is tuned for the memory requirements of the application (9) and compatible VLIW, cache, and NPA designs are combined to form a set of Pareto-optimal designs (10). The focus of the system architecture exploration in PICO is on the trade-off between area and timing. Area is measured in either physical chip area or gate count, whereas an estimate of the application's processor runtime is used for the timing. Similar to our approach, the timing estimate is computed as the total sum of each basic block's schedule length multiplied by its profiled execution count.
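The schedule-length-times-execution-count timing estimate described above can be sketched as a simple weighted sum. The struct layout and field names below are illustrative inventions; only the formula itself comes from the text.

```c
/* Illustrative sketch of a profile-based cycle estimate: the total
 * runtime is the sum, over all basic blocks, of the block's schedule
 * length multiplied by its profiled execution count. */
struct basic_block {
    unsigned schedule_length; /* cycles for one execution of the block */
    unsigned exec_count;      /* profiled number of executions         */
};

static unsigned long estimate_cycles(const struct basic_block *bb, unsigned n)
{
    unsigned long total = 0;
    for (unsigned i = 0; i < n; i++)
        total += (unsigned long)bb[i].schedule_length * bb[i].exec_count;
    return total;
}
```

For a program with an inner-loop block of 4 cycles executed 1000 times, an outer-loop block of 7 cycles executed 10 times, and a 12-cycle prologue, this yields 4082 estimated cycles.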

2.3. The SiliconHive tools

The initial work on the SiliconHive VLIW ASIP development technology started at Philips. The results from that research were carried over into SiliconHive when it was spun out of Philips Research as part of the Philips Technology Incubator program in 2003. Intel acquired SiliconHive in 2011 after a period of growth and further development. This section uses a selection of the information made available through the ASAM project, and publications from both the earlier Philips and SiliconHive periods, to introduce the SiliconHive VLIW ASIP architecture template and the pre-existing SiliconHive development framework which were exploited by the ASAM project.

2.3.1. Overview

Similarly to the other tool flows discussed in this section, the SiliconHive tool flow, illustrated in Figure 3, offers an architecture description language (called TIM) and a set of tools to generate hardware RTL descriptions, an instruction-set simulator, a retargetable C compiler, and various other debugging and software development tools. In parallel, the SiliconHive tools also offer a second language (called HSD) that allows a user to construct a multi-processor system consisting of one or more VLIW processors (specified using the TIM language) and other hardware components (such as hardware accelerators, memory units, and peripherals) taken from a library or imported from external sources. Using an original application description (usually expressed in sequential C), the user starts by composing or selecting an initial MPSoC platform design and then decomposing the application code into a parallel version tuned for the initial platform. After mapping the application onto the platform, the user can compile and simulate the resulting HW/SW system to find the remaining critical points of the design. Using this information, the user can then manually refine the parallelization, mapping, and/or the platform composition in an iterative design process. Several pre-selected processor and platform designs are available for an easy start, but the parallelization, mapping, and design-space exploration steps have to be performed manually.

2.3.2. Architecture template

The SiliconHive ASIP design technology offers a highly flexible template of a customizable VLIW processor. Figure 4 illustrates this processor architecture template. A processor (cell) is organized in two parts: one part (coreio) contains the local memories and handles the interface with the external world, while the other part (core) performs the actual operations as described by the program in the local program memory.


Figure 3: SiliconHive flow showing the steps required for creating a custom MPSoC platform and its corresponding application implementation.

Interfacing to the external world and memories. SiliconHive processors usually contain one or more local scratchpad memories which are used for low-latency storage of (intermediate) data used by the algorithm running on the processor. Several memories with different organizations (e.g. different sizes and/or data-widths) and different implementations (e.g. register-based or SRAM) can be included in a single processor as required by the target application. Local memories are also connected through a slave interface to the global MPSoC interconnect hierarchy, so that the other elements of the MPSoC containing the SiliconHive processor can access these memories when providing input to and/or consuming output from the SiliconHive processor. One or more master interfaces may also be present in the coreio. These master interfaces can connect to an external memory, a direct memory access (DMA) controller, or the local memory of another SiliconHive processor. Stream interfaces (FIFOs) can also be added to the processor. Such FIFO interfaces are mainly suitable for small data transactions and are typically used to provide hand-shake signals while larger transactions are performed through a master interface. In parallel to these data storage and transfer components, the coreio of a SiliconHive processor also contains the program memory, as well as a set of status and control registers which can be used for reconfiguring the processor. Examples of such reconfiguration are actions like reprogramming the program memory, entering/leaving the low-power sleep mode, and starting/stopping a kernel. The core itself contains a very small sequencer block which interfaces the datapath with the status and control registers and fetches the appropriate instructions from the program memory. Instruction decoding is kept relatively cheap in the SiliconHive processor architecture by using a horizontally programmed processor style.
Instruction bits usually correspond directly to a subset of the configuration bits of the processor registers and input select multiplexers.

Core structure. The actual execution of the program is performed inside the datapath. Here, operations are executed within issue-slots. Each issue-slot is composed of a set of function-units

Figure 4: SiliconHive processor architecture template [1]

which implement the actual operations. This division of issue-slots into function-units can be somewhat confusing, as most of the related work [24, 22, 16, 15, 23] uses the term function-unit (or functional-unit) to designate what is called an issue-slot in SiliconHive terminology. However, this paper will use the SiliconHive terminology since it builds upon their processor architecture framework. Within the datapath, issue-slots are connected to multiple register files using an (optionally shared) interconnect [24, 25, 26, 27]. This provides an efficient implementation for very wide VLIW processor architectures without incurring the overhead of a large, centralized register file [22, 23]. The explicitly programmed nature of the SiliconHive processors makes it relatively easy to compute the final instruction width for newly constructed processor architectures. This is especially useful when modeling the impact of architecture changes.
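To illustrate why a horizontally encoded instruction word is easy to predict, the toy model below sums, per issue-slot, the bits needed for opcode selection and for the input-select multiplexers. All field names and sizes here are invented for the example; the actual TIM-derived encoding is more involved.

```c
/* Toy instruction-width model for a horizontally encoded VLIW:
 * every instruction bit corresponds directly to a configuration bit
 * (opcode select, operand mux select), so the total width is a plain
 * sum over the issue-slots. Field names are hypothetical. */
static unsigned bits_needed(unsigned n) /* ceil(log2(n)) for n >= 1 */
{
    unsigned b = 0;
    while ((1u << b) < n)
        b++;
    return b;
}

struct issue_slot {
    unsigned num_opcodes;  /* operations selectable in this slot   */
    unsigned num_operands; /* operand fields of this slot          */
    unsigned mux_inputs;   /* selectable sources per operand field */
};

static unsigned instruction_width(const struct issue_slot *s, unsigned n)
{
    unsigned width = 0;
    for (unsigned i = 0; i < n; i++)
        width += bits_needed(s[i].num_opcodes)
               + s[i].num_operands * bits_needed(s[i].mux_inputs);
    return width;
}
```

Adding or removing an issue-slot, an operation, or a mux input changes the width by a directly computable amount, which is what makes modeling architecture changes cheap.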

2.4. Compiler support for VLIW processors

One of the major difficulties in the development and usage of VLIW processors is enabling the programmer to obtain a sufficiently high resource utilization of the processor. The high amount of programming freedom provided by explicitly programmed processor architectures, which

enables the high performance benefits of these architectures, also increases the complexity of the scheduling process. Combining this with the presence of large, complex, custom operations and VLIW scheduling techniques such as software pipelining [28] easily results in a highly complex compiler. This high compiler complexity, up to the point where the creation of a compiler becomes infeasible, is often one of the motivations for moving parts of the decision-making process to the programmer, the user of the compiler. Annotations added to the input C code allow for a simplification of the compiler's decision-making process, which often leads to a better final result when an experienced programmer is using the compiler. Even with a good compiler, annotations may bring a substantial benefit as they allow for localized overrides of the compiler heuristics in cases where sub-optimal results are being produced.

Source code annotation. Source code annotation is a popular technique to enable some of the more esoteric processor features without invasive changes to the compiler itself. Classical examples are intrinsics and annotations that capture the mapping of data into one of the (possibly many) different memories of a processor. However, even when the compiler itself is capable of optimizing for such complex processor features, hints for enabling and disabling these specific optimizations can also be provided as source code annotations. Complex optimizations performed by the compiler can be very time consuming, while most of their benefit can only be observed for a (small) portion of the application code. Allowing these optimizations to be enabled for limited parts of the application code can significantly speed up the compilation process.

Complex operations. Complex operations, such as an FFT butterfly operation or those working on vector elements, are difficult to represent or efficiently detect in the C language. For this reason, most compilers provide direct access to such operations through intrinsics. A compiler intrinsic looks like a C function call, but it is translated within the compiler directly into the (complex) operation it represents. This allows the programmer to force the compiler to select the intended operation, and allows for a strong simplification of the custom-operation selection heuristics in the compiler itself.
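The idea can be sketched in plain C. The intrinsic name `asip_avg2` below is hypothetical (real SiliconHive intrinsic names differ); on a host without the custom operation, a plain C function with the same signature can serve as its emulation.

```c
/* Hypothetical intrinsic for a single-cycle average operation.
 * On the ASIP, a call like this would be translated by the compiler
 * directly into the custom operation; here it is emulated in C. */
static inline int asip_avg2(int a, int b)
{
    return (a + b) >> 1; /* emulation of the custom op */
}

/* The programmer forces operation selection by calling the
 * intrinsic directly, e.g. in a vertical down-sampling row: */
static void vscale_row(int *dst, const int *row0, const int *row1, int m)
{
    for (int w = 0; w < m; w++)
        dst[w] = asip_avg2(row0[w], row1[w]);
}
```

Because the call site names the operation explicitly, the compiler does not need pattern-matching heuristics to recognize the `(a + b) >> 1` idiom.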

2.4.1. Memory mapping

The mapping of (global) data arrays onto one of the local memories of a processor is often controlled by added annotations. For example, OpenCL [29] recognizes the private keyword, which denotes that the marked data should be mapped into the private memory of the processor running the kernel.

2.4.2. Optimization hints

Optimization hints can often be provided either by using standardized C keywords such as inline, restrict, and register, by in-line annotations such as __builtin_expect() in GCC, or by compiler directives using #pragma statements.
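The two hint forms named above can be combined in a small kernel. The function itself is a made-up example; `restrict` is standard C99 and `__builtin_expect()` is a real GCC/Clang builtin.

```c
#include <stddef.h>

/* 'restrict' promises the compiler that y and x never alias, allowing
 * more aggressive scheduling of loads and stores; __builtin_expect()
 * tells GCC that the clipping branch is expected to be rarely taken. */
static void scale_clip(int * restrict y, const int * restrict x,
                       size_t n, int max)
{
    for (size_t i = 0; i < n; i++) {
        int v = x[i] * 2;
        if (__builtin_expect(v > max, 0)) /* clipping is the rare case */
            v = max;
        y[i] = v;
    }
}
```

Neither hint changes the result; both only constrain or inform the compiler's optimization decisions.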

2.4.3. Code transformations

Code transformations form a basis for enabling a high instruction-level parallelism (ILP) and data-level parallelism (DLP), but also for controlling the buffer sizes required for intermediate data. Code transformations, such as speculation and unrolling, and scheduling techniques, such as software pipelining, mostly aim at increasing the explicitly available ILP within important sections of the program. However, such ILP-enhancing techniques usually also come at the cost of an increase of the program size. The SiliconHive compiler supports these ILP-enabling optimizations as automatic optimizations, but also provides the user with direct control through enabling/disabling these optimizations for parts of the code using annotations. Loop transformations, a subset of the code transformations with a focus on the structure of loop nests, generally aim at controlling the distribution and size of data elements communicated between consecutive loop nests, but also affect the sizes of data communicated between the processor and external inputs and outputs (such as external memories or other processing tiles). The loop

transformations used within the ASAM project are mainly loop fusion, loop tiling, and loop vectorization. These loop transformations are currently not provided by the SiliconHive compiler and are performed within the ASAM tools as source-level code transformations. For the purpose of loop transformations, formal mathematical models of loop nests, such as the Polyhedral model [30], are used. Such formal models allow for a direct analysis of the effects of loop transformations on the memory requirements of the transformed code. Although the SiliconHive compiler does not yet support such loop transformations, several tools are already available [31, 32, 33, 30, 34] which allow for translations to and from the polyhedral domain, as well as the automatic exploration of loop transformations. Section 3 further illustrates how these methods are used within the instruction-set architecture synthesis and the ASAM project in general.
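As a minimal sketch of the first of these transformations, loop fusion, the pair of functions below shows how fusing two consecutive loop nests localizes the intermediate data: the kernels and array names are invented for the illustration, not taken from the ASAM tools.

```c
#define N 64

/* Before fusion: kernel 1 materializes the whole intermediate array
 * tmp[], which kernel 2 then reads back. */
static void before(const int *in, int *out)
{
    int tmp[N];                 /* entire intermediate buffer is live */
    for (int i = 0; i < N; i++)
        tmp[i] = in[i] + 1;     /* kernel 1 */
    for (int i = 0; i < N; i++)
        out[i] = tmp[i] * 2;    /* kernel 2 */
}

/* After fusion: a single loop, and the intermediate shrinks to one
 * scalar, removing the buffer (and its memory traffic) entirely. */
static void after(const int *in, int *out)
{
    for (int i = 0; i < N; i++) {
        int t = in[i] + 1;      /* kernel 1 */
        out[i] = t * 2;         /* kernel 2 */
    }
}
```

Both versions compute the same result; only the size of the intermediate storage differs, which is exactly the effect analyzed in the polyhedral memory models.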

2.4.4. Extensions for architecture exploration

Using a compiler in the context of an architecture construction or exploration framework increases the demands on the compiler. A lot of time can be saved if the same compiler can be used for several variations of a specialized processor without the need for rebuilding (parts of) the compiler. In the SiliconHive compiler, this is achieved through the addition of several compiler controls. These compiler controls allow the programmer to override the set of available processor resources that the compiler is allowed to use. Such compiler flags make it very easy to investigate the effects of removing a specific function unit, or even a complete issue-slot, from a proposed processor architecture. However, code annotations pertaining to resource allocation, such as an explicit function unit binding or the use of a complex operation through an intrinsic, are considered definitive by the SiliconHive compiler. Thus, removing an explicitly used resource from a processor will result in a conflict with source code annotations present in the target application. For example, removing a function unit may result in the removal of an intrinsic from the compiler which was used in the target application. In some cases, replacement or emulation code can be provided by the programmer. Such emulation code needs to be provided in source code form, as pre-compiled emulation libraries will not have the opportunity to take removed resources into account. A similar difficulty appears when a load-store unit is removed, making one of the memories of the candidate processor inaccessible. Such a removal can invalidate the current memory mapping of the target application in a way that cannot be covered easily by the use of replacement code. It is quite likely that the entire memory mapping needs to be reconsidered in such a case.
As a result, either the exploration cannot have the freedom of removing load-store units and their related issue-slot and memory interfacing hardware, in which case the exploration needs to be provided with a set of different initial architectures representing different memory mappings; or the compiler needs to provide an automated method for finding an appropriate distribution of data across memories, so that the distribution of data across different local memories no longer requires annotation.
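In its simplest form, such an automated data-distribution step could be a greedy best-fit packing of the application's arrays over the available local memories. The sketch below is purely illustrative, with invented structures and capacities; it is not the SiliconHive compiler's actual algorithm.

```c
#define NUM_MEMS   2
#define NUM_ARRAYS 4

/* Illustrative greedy sketch: assign each array (assumed sorted by
 * decreasing size) to the local memory with the least remaining free
 * space that still fits it (best-fit). Returns -1 if infeasible. */
struct array_info { const char *name; unsigned bytes; int mem; };

static int assign_memories(struct array_info *arr, unsigned n,
                           const unsigned cap[NUM_MEMS])
{
    unsigned free_bytes[NUM_MEMS];
    for (int m = 0; m < NUM_MEMS; m++)
        free_bytes[m] = cap[m];

    for (unsigned i = 0; i < n; i++) {
        int best = -1;
        for (int m = 0; m < NUM_MEMS; m++)
            if (free_bytes[m] >= arr[i].bytes &&
                (best < 0 || free_bytes[m] < free_bytes[best]))
                best = m;
        if (best < 0)
            return -1;            /* no memory can hold this array */
        arr[i].mem = best;
        free_bytes[best] -= arr[i].bytes;
    }
    return 0;
}
```

A real implementation would additionally weigh access frequencies, port conflicts, and the load-store units that remain in the candidate architecture.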

3. ASAM design flow

As previously mentioned, the ASAM project focuses on the automatic synthesis of heterogeneous VLIW ASIP-based multi-processor systems. The ASAM design-flow is built upon the VLIW MPSoC design framework of Intel Benelux (formerly SiliconHive) that was presented above. The aims of the ASAM project included the automation of several of the (previously) manual design steps. In particular, the exploration, analysis, and decision making regarding the multi-ASIP platform design, application parallelization, ASIP customization, and application mapping steps were considered for automation. One of the key problems addressed by ASAM is that it is impossible to perform an efficient application parallelization and mapping without information on the performance of the application parts on specific processing elements of the platform, but it is equally impossible to construct a reasonable multi-processor platform and each of its application-specific processing elements without knowledge of the parallelization and mapping, due to the cyclic dependency between the two.

The ASAM project tries to break the cyclic dependency of this ‘chicken and egg’ problem by a tight coherent coupling of the various design stages and phases, as well as through statically computing early performance estimates of single tasks and combining them to propose promising application mappings using a probabilistic application partitioning and parallelization phase. The ASAM approach divides the MPSoC design space exploration into two main stages: the macro-architecture exploration and the micro-architecture exploration. These two stages are tightly coupled to form a coherent MPSoC architecture exploration framework. The macro-level architecture exploration focuses on the construction of the system out of a set of (initially unknown) VLIW ASIP processors, whereas the micro-level architecture exploration focuses on the analysis of the set of tasks assigned to a single ASIP and the construction of new VLIW ASIPs specialized for (groups of) specific tasks. Figure 5 illustrates the tight bi-directional cooperation between the macro- and micro-level architecture exploration. Further information on the ASAM flow can be found in [1]. The work presented in this paper lies within the micro-level architecture exploration and focuses mostly on the application analysis and mapping and the corresponding ASIP instruction-set architecture synthesis (highlighted in Figure 5). However, to adequately place this work in its context, some more information is needed about the macro-micro interaction, as well as about the application intra-task parallelization within the micro-level stage.

3.1. Macro- and micro-architecture exploration

The ASAM flow starts (1) with its input composed of the C code of the target application, as well as user-supplied constraints and design objectives. From these, it extracts the overall application structure and a set of compute-intensive kernels using the Compaan Compiler [35], which translates these kernels into tasks in a Polyhedral process network (2). The usage of the Compaan Compiler poses several constraints on the provided C code to enable automatic analysis (e.g. affine loop structures and access patterns, and statically allocated buffers). However, not all applications translate equally well to this structure. We therefore allow for more complex applications by providing an option to start the architecture exploration with a user-provided graph model of the application. In this case the only constraint is that the application can be decomposed into several communicating data-flow tasks. The difficulty here is that such a manually provided model only has limited optimization potential in the ASIP memory hierarchy exploration (discussed below), which also works within the Polyhedral model. The user-supplied constraints and design objectives consist of both structural and parametric requirements. These requirements guide the automated tools and are used to control exploration aspects, such as the granularity of the computational tasks and the available processor architecture components (both structural requirements), but also to allow user-defined limits on the energy consumption, throughput, and maximal area occupation (parametric requirements). Through these constraints and objectives, the user is able to control both the size and complexity of the overall exploration problem and can influence the trade-offs that are considered during the design process.

3.2. Finding good task combinations to be executed in single ASIPs

Each of the tasks of the Polyhedral process network is then analyzed by the micro-level application analysis (3). The best-case and worst-case execution times of the tasks for a future VLIW processor of unknown issue-width, designed according to the SiliconHive template, are then determined using the parallelism estimation methods presented in [36, 37]. These best-case and worst-case execution time estimates, computed by the micro-level for the set of tasks assigned by the macro-level to be executed on a single ASIP, are then taken by the macro-level and used during the probabilistic system exploration [38, 1, 39]. Using an evolutionary algorithm combined with Monte Carlo simulation and models of the inter-processor communication, the probabilistic system exploration finds promising task clusterings. For each of the task clusterings, the micro-level is then consulted to produce an initial parallel computation structure of the task cluster and a corresponding coarse customized VLIW processor architecture (4).


Figure 5: ASAM flow overview with the micro-level architecture exploration covered in this paper highlighted in red.

3.2.1. Deciding the ASIP memory hierarchy and initial VLIW ASIP architecture

In the micro-level application-parallelization and coarse architecture synthesis stage, multiple tasks within a single cluster may be transformed so that their data-locality and re-use are optimized. The micro-level application parallelization tool [1, 40] transforms the Polyhedral representation of the clustered loop kernels to reorder and fuse the kernel executions in such a way as to find possible promising trade-offs between the data-throughput and the processor area and power consumption. At this stage, the power consumption is assumed to be proportional to the area, which is a well-founded assumption [41] that sufficiently serves this early exploration stage. Currently, loop fusion, loop tiling, and kernel vectorization are considered during this exploration, but the repertoire of transformations can be extended. Loop fusion combines two kernels into a single new kernel, which localizes any intermediate results which were communicated between the original kernel pair. This reduces the total memory requirements for the task cluster. Loop tiling enlarges the granularity of the data on which the kernel is running. This allows for a trade-off between the size of the local cache or scratch-pad memories in the processor versus the bandwidth required between the considered processor and its external memory. Loop vectorization changes the data granularity for the computations performed within the kernel and reduces the number of instructions required for the execution of a given application (part). The application parallelization uses estimates of the instruction-level parallelism (ILP) available in each kernel. These estimates are obtained as one of the results of the application analysis, as explained in [36, 37].
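The tiling trade-off described above can be sketched in C. The local buffer standing in for a scratchpad, the memcpy calls standing in for DMA transfers, and the kernel body are all illustrative; the ASAM tools perform this transformation on the Polyhedral/Array-OL model, not on C source like this.

```c
#include <string.h>

#define N    32
#define TILE 8

/* Sketch of loop tiling: instead of streaming all N elements through
 * the processor at once, data is copied into a TILE-sized local buffer
 * (modeling a scratchpad) and processed strip by strip. A larger TILE
 * means a larger local memory but fewer, larger external transfers;
 * a smaller TILE the reverse. */
static void process_tiled(const int *ext_in, int *ext_out)
{
    int local[TILE];                               /* local scratchpad */
    for (int t = 0; t < N; t += TILE) {
        memcpy(local, &ext_in[t], sizeof local);   /* "DMA" in  */
        for (int i = 0; i < TILE; i++)
            local[i] = (local[i] * 3) >> 1;        /* compute   */
        memcpy(&ext_out[t], local, sizeof local);  /* "DMA" out */
    }
}
```

Varying TILE while keeping the kernel fixed sweeps exactly the local-memory-size versus external-bandwidth trade-off mentioned in the text.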
The application parallelization creates an optimized parallel structure of the application part mapped to a single VLIW ASIP and constructs a corresponding initial VLIW architecture with sufficient resources to achieve the required throughput. A Pareto-set of such coarse VLIW architecture designs is then returned to the macro-level exploration (5), which then uses the performance metrics of these more accurate designs in combination with a set of communication models (obtained from the communication and global memory exploration) to determine the final multi-processor system architecture. Each of these returned candidate VLIW processor architectures can be synthesized (6) and a transformed C code can be generated to match the selected high-level loop transformations (7). After the ASIP synthesis and corresponding C code generation, the ASIP-based hardware/software subsystem is ready for simulation using the SiliconHive tools (8).

3.2.2. Finalizing the ASIP design

The final optimization step is applied when the area and/or energy consumption of the thus far synthesized ASIP-based platform does not yet satisfy the design constraints, or for further optimization of the design objectives. This optimization is performed by the micro-level instruction-set architecture synthesis, separately for each single VLIW ASIP in the system and its corresponding application part, when requested by the macro-level architecture exploration (9). This processor architecture optimization step tries to improve the ASIP architecture both by the addition of application-specific instruction-set extensions as custom operations [42, 43, 44], and by the removal of unused or scarcely used processor components which were included as part of the processor building blocks used during the construction of the initial processor prototype. Several improvements to the processor area and energy models used in this step were previously presented in [45], the exploration algorithms of this step in [46, 47], and techniques for reducing the exploration time (10) in [48]. The resulting refined ASIP design, together with more precise area, energy, and execution time estimates, is then returned to the macro-level architecture exploration (11).

3.2.3. Finalizing the MPSoC platform

Finally, the macro-level architecture exploration continues with an exploration of the MPSoC interconnect and global memory structure (12) based on the available design alternatives for the

int input_image[N][M];
int temp_image[N/2][M];
int output_image[N/2][M/2];

void downsample2d(void)
{
    int h, w;

    // kernel 1: vertical down sampling
    for (h = 0; h < N/2; h++) {
        for (w = 0; w < M; w++) {
            temp_image[h][w] =
                (input_image[2*h][w] + input_image[2*h+1][w]) >> 1;
        }
    }

    // kernel 2: horizontal down sampling
    for (h = 0; h < N/2; h++) {
        for (w = 0; w < M/2; w++) {
            output_image[h][w] =
                (temp_image[h][2*w] + temp_image[h][2*w+1]) >> 1;
        }
    }
}

Listing 1: 2D down sampling of an N × M image, original code

various VLIW processor cores in the system. The macro-level also selects the appropriate VLIW instances from the set of refined architectures produced through the micro-level architecture exploration. Combining the selected VLIW instances and the synthesized interconnect and global memory structure allows the introduction of system-level power control (13) (e.g. voltage scaling, power gating), after which the full system can be simulated through the SiliconHive tools (14) or emulated on FPGA (15). After this final validation of the system design, the tools can finalize the design and (semi-)automatically produce the required RTL and software descriptions for performing further (ASIC) hardware synthesis (16) and the production of the final system prototype.

4. ASIP architecture exploration: An example

This section presents an example walk-through of the automatic ASAM micro-level architecture exploration and synthesis process, to further illustrate the ASAM approach for designing a fully customized VLIW ASIP processor together with its corresponding parallel software structure. The application shown in Listing 1, 2D down sampling, was selected for this demonstration. Down sampling is a common function in image processing which benefits from most of the considered transformations without being overly complex, and therefore it is appropriate to be used for the explanation. However, the methods presented here are equally applicable to much more complex applications and algorithms. As can be seen from the code in Listing 1, 2D down sampling consists of two main kernels, performing vertical and horizontal down sampling respectively. As the kernels are fairly small (with only a few operations each) and at the same time have a quite high communication requirement (half of the original image is communicated from the first kernel to the second kernel), it is fairly likely that the macro-level architecture exploration will decide to map both kernels onto a single processor. Array-OL [49], a graphically enriched representation of the polyhedral model, is used within the ASAM project in the second phase of the micro-level architecture exploration for performing the application restructuring and coarse processor architecture synthesis. Figure 6a shows a graphical representation of the Array-OL model corresponding to the input code of the 2D down sampling application. Both the vertical and horizontal down sampling kernels (V scale and H scale

18 N x M N/2 x M N/2 x M/2

V scale H scale

(a) Original code structure

N x M N/2 x M/2

V scale H scale

(b) After kernel fusion

N x M N/2 x M/2

V scale H scale

(c) After both kernel fusion and vectorization with vectorized data elements colored gray and a double box for denoting the vectorized iteration domain

Figure 6: Array-OL representation of the horizontal and vertical down-sampling kernels after different optimizations respectively) are shown as grey rectangles. The input and output sizes for each kernel iteration are illustrated next to the input and output ports. Both kernels consume two data elements and produce a single data element. The main difference between them is the orientation of the two consumed elements in the 2D data domain. The repetition domain surrounds each kernel and represents the loop-nest that wraps around each kernel in Listing 1. From both the Array-OL model and the original source code, it can be observed that this implementation of the algorithm requires half of the original input image in temporary storage locations, as well as, both the complete input and output image directly accessible by the processor. This data needs to be stored either in the ASIP local memories or in a, usually slower, external memory.
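Listing 1 itself is not reproduced in this excerpt; the following minimal C sketch (with illustrative names and a fixed 4x4 image size, not the actual ASAM listing) captures the two-kernel structure described above, including the N/2 x M temporary image between the kernels:

```c
#include <assert.h>

#define N 4            /* image height (rows)    */
#define M 4            /* image width  (columns) */

/* Vertical down-sampling: average two vertically adjacent pixels,
 * producing an N/2 x M intermediate image (half of the input size). */
static void v_scale(unsigned char in[N][M], unsigned char tmp[N/2][M]) {
    for (int y = 0; y < N / 2; y++)
        for (int x = 0; x < M; x++)
            tmp[y][x] = (unsigned char)((in[2 * y][x] + in[2 * y + 1][x]) >> 1);
}

/* Horizontal down-sampling: average two horizontally adjacent pixels,
 * producing the final N/2 x M/2 image. */
static void h_scale(unsigned char tmp[N/2][M], unsigned char out[N/2][M/2]) {
    for (int y = 0; y < N / 2; y++)
        for (int x = 0; x < M / 2; x++)
            out[y][x] = (unsigned char)((tmp[y][2 * x] + tmp[y][2 * x + 1]) >> 1);
}
```

Each kernel consumes two data elements and produces one; the kernels differ only in the orientation of the consumed pair, exactly as the Array-OL model shows.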

4.1. Application code restructuring and initial architecture construction

The application parallelization phase of the micro-level architecture exploration [1, 40, 50] takes the Array-OL model of Figure 6a as input and starts by optimizing the data locality and memory architecture of the customized processor. The first transformation considered is loop (or kernel) fusion. Its main advantages are a reduced temporary data storage and an increased kernel size. Figure 6b illustrates the effect of kernel fusion on the example code. In order to perform loop fusion, a single repetition domain needs to be put around both kernels. Before this is possible, the V scale kernel needs to be executed twice, so that it produces the appropriate data elements for the H scale kernel. This results in a second repetition domain wrapping only the V scale kernel. This second repetition domain has only two iterations and will be unrolled in the generated C code. Unrolling

repetition domains with a small repetition count increases the kernel size and usually has a positive effect on the instruction-level parallelism available in the application. As can be seen in Figure 6b, loop fusion significantly reduces the temporary storage requirements of the application; in this case, loop fusion successfully removed the entire temporary storage except for a few single data elements.

After the application of fusion, both vectorization and tiling are explored using a genetic algorithm [50, 40]. Vectorization is mainly used to increase the number of data elements that are processed per cycle, as it changes the data granularity at which the computations are performed. Two repetition domains exist in the fused version of the application: the inner and the outer repetition domain. Neither has dependencies on previous iterations, so both can be vectorized at will. However, the inner repetition domain wrapping the V scale kernel has only two iterations and would provide only a very limited performance gain when vectorized. The more logical choice is to unroll this inner loop and to vectorize the outer repetition domain. The main limitation here is that vectorization constrains the possible input image sizes (unless strip-mining is used or explicit padding is added to the data). In this case we assume that the input image width is a multiple of 32, which results in a maximum vector width of 16, since the application needs to read two (vector) elements next to each other to feed the inner repetition domain. Figure 6c illustrates the vectorized and fused kernels; a double box on the outer repetition domain and the colored data elements illustrate the changed data granularity. Care should be taken when vectorizing the H scale kernel, as it consumes two consecutive data elements to produce its result.
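The fused and unrolled loop nest described above (before vectorization) can be sketched in C as follows; this is a hand-written illustration with our own names, not tool-generated code. The V scale kernel is executed twice per fused iteration, so the H scale kernel can consume both of its inputs immediately and the temporary image disappears:

```c
#include <assert.h>

#define N 4
#define M 4

/* Fused down-sampling: both kernels in one loop nest. The two-iteration
 * repetition domain around V scale is unrolled, so each fused iteration
 * produces one output pixel and needs only two scalar temporaries
 * instead of an N/2 x M temporary image. */
static void down_sample_fused(unsigned char in[N][M],
                              unsigned char out[N/2][M/2]) {
    for (int y = 0; y < N / 2; y++) {
        for (int x = 0; x < M / 2; x++) {
            /* V scale, unrolled twice: two vertically averaged pixels */
            unsigned char v0 = (unsigned char)((in[2*y][2*x]   + in[2*y+1][2*x])   >> 1);
            unsigned char v1 = (unsigned char)((in[2*y][2*x+1] + in[2*y+1][2*x+1]) >> 1);
            /* H scale: horizontal average of the two intermediate pixels */
            out[y][x] = (unsigned char)((v0 + v1) >> 1);
        }
    }
}
```

The enlarged kernel body (four loads, three average patterns, one store per iteration) is what gives the later phases more instruction-level parallelism to exploit.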
When vectorizing such a kernel, vector shuffling operations are required to reorganize the data such that the consecutive elements are put into separate vector elements, so that the original kernel code can be kept.

Finally, tiling is applied to the restructured loop nest to enable more freedom in the global mapping of the input and output data arrays. Tiling is used in the ASAM project to improve the data locality of the processor cores and to adequately distribute data between the global memory and the local memories of each processor. This allows a significant reduction in the required size of the local memories for the final processor designs. It also allows us to use direct-memory-access (DMA) controllers to perform data transfers in parallel to the actual computations. A tiled implementation therefore enables streaming operation of the kernel: the restructured kernels can start processing data while it is still being produced by the source of the data (an image sensor or another processing step in a larger application), which represents a major benefit. The only requirements for a streaming application are that initially there is sufficient data to perform at least one iteration and that there is space available to store the (partial) results.

During the loop transformation exploration process, the parallelism available in the core loop nest(s) of the code is estimated using the methods presented in [36]. This enables the code restructuring and initial architecture construction process to construct an ASIP architecture with an issue-width that is appropriate for obtaining the predicted throughput. Figure 7 shows the processor that was constructed for the transformed C code of the down-sampling application.
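The tiled, streaming execution pattern can be sketched roughly as follows. The DMA engine is emulated here by a plain memcpy: dma_copy_rows is a hypothetical stand-in of our own, since on the real platform the transfer would be an asynchronous DMA request overlapping with the computation on the previous tile:

```c
#include <assert.h>
#include <string.h>

#define M 4           /* image width         */
#define TILE_ROWS 2   /* input rows per tile */

/* Stand-in for a DMA transfer between external memory and a local
 * buffer. A real ASIP would issue this asynchronously so that the
 * copy runs in parallel with the computation. */
static void dma_copy_rows(unsigned char *local, const unsigned char *ext,
                          int rows) {
    memcpy(local, ext, (size_t)rows * M);
}

/* Streaming down-sampling: fetch TILE_ROWS input rows at a time into a
 * small local buffer, apply the fused averaging, and emit one output
 * row per tile. Only 2*M + M/2 bytes of local storage are needed,
 * regardless of the total image height. */
static void down_sample_streaming(const unsigned char *ext_in, int n_rows,
                                  unsigned char *ext_out) {
    unsigned char tile[TILE_ROWS][M];   /* local input buffer  */
    unsigned char row[M / 2];           /* local output buffer */
    for (int t = 0; t < n_rows / TILE_ROWS; t++) {
        dma_copy_rows(&tile[0][0], ext_in + t * TILE_ROWS * M, TILE_ROWS);
        for (int x = 0; x < M / 2; x++) {
            unsigned char v0 = (unsigned char)((tile[0][2*x]   + tile[1][2*x])   >> 1);
            unsigned char v1 = (unsigned char)((tile[0][2*x+1] + tile[1][2*x+1]) >> 1);
            row[x] = (unsigned char)((v0 + v1) >> 1);
        }
        memcpy(ext_out + t * (M / 2), row, M / 2);
    }
}
```

Processing can begin as soon as the first two input rows exist, which is exactly the streaming property described above.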
The constructed processor is a wide VLIW ASIP with four 16-way vector issue-slots, two vector memories for storing the data of the input and output buffers, three scalar issue-slots (including FIFO and control operations), and a small scalar memory for storing the width and height parameters of the kernel.

Running the restructured code on the constructed initial processor architecture demonstrates that this combination of ASIP architecture and parallel application code is indeed capable of processing pixels at the predicted throughput of a full vector width of input pixels (16 pixels) per cycle. The transformed code also utilizes the proposed processor architecture quite well, with an 85% utilization of the issue-slots in the core loop. However, large parts of the instruction-set remain unused, which results in an inefficient use of the provided function units and requires an unnecessarily wide program memory. As such, this architecture is still quite overdimensioned and can be substantially reduced during the architecture refinement phase to decrease both the area and the power consumption, while still realizing the required throughput.

At this point, the tasks of the code restructuring and initial architecture construction are completed and the second phase of the micro-level architecture exploration can be concluded. The proposed initial processor architecture is returned to the macro-level architecture exploration.

[Figure 7 shows the constructed processor datapath: seven issue-slots (three scalar, four 16-way vector) with their register files, the scalar data memory, program memory, FIFOs, and two vector memories; the detailed net-list labels are omitted here.]

Figure 7: Initial processor architecture for the down-scaling application based on the code restructuring and initial architecture construction exploration decisions

The macro-level can now explore the system memory architecture based on each processor's minimal memory size requirements. Extra buffer space can be added based on the overall mapping of the target application tasks and their respective communication buffer size requirements during the global interconnect and memory hierarchy exploration.

4.2. ASIP instruction-set architecture synthesis through architecture refinement

As already mentioned above, the initial coarse processor architecture proposed by the second phase of the micro-level architecture exploration is usually overdimensioned. It is composed of issue-slots taken from a standard library and thereby supports a large variety of operations in each issue-slot. However, usually not all of these operations need to be replicated in each issue-slot, and many of them can be removed without any impact on the execution time of the target application. In some specific cases, even the number of VLIW issue-slots can be reduced. Furthermore, the register files have also been introduced with quite large sizes and may be reduced as well. Removing these redundant resources from the initial architecture greatly simplifies the structure of the interconnect between the issue-slot outputs and the register file inputs. This in turn can result in a large reduction of the number of program word bits required for each instruction in the program memory, which further reduces the processor's area and energy consumption.

The third phase of the micro-level architecture exploration, the instruction-set architecture synthesis, performs the architecture refinement using, in combination, the following three refinement techniques: instruction-set extension, (passive) architecture shrinking, and (active) architecture reduction.

4.2.1. Instruction-set extension

Instruction-set extension can (optionally) be applied as the first step of the instruction-set architecture synthesis when a very high performance and/or an extremely low power solution is required. During this step, frequently occurring operation patterns are identified in the target application, function units implementing each of them in hardware are constructed, and the corresponding application-specific instructions are added to the instruction-set of the initial prototype obtained from the previous exploration phase.

For example, the down-sampling application has a frequently occurring pattern in which two values are added and the result is shifted right by one position, which effectively computes the average of both values. This operation pattern is the key computation in both the vertical and horizontal down-sampling kernels. Implementing the complete pattern as a single complex operation provides two benefits: it results in a smaller kernel body, as fewer operations are required to encode the entire algorithm, and it improves the latency of the kernel execution by executing both the addition and the shift in the same clock cycle. As an additional advantage, the use of complex operations also reduces the number of register file accesses, as intermediate values between the operations in a pattern are no longer stored in the register file. This results in a further reduction of the energy consumption and can lower the register file pressure, which allows us to further reduce the register file size.

The detection and selection of candidate operation patterns for the creation of custom operations, as well as the insertion of these custom operations into the initial prototype, is performed by the designer of the processor with the help of an operation pattern enumeration tool [42, 43, 44].
This tool supports the designer by enumerating candidate operation patterns based on the frequency of their occurrence, as well as the possibilities for hardware sharing between custom operations.
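The add-then-shift pattern discussed above can be illustrated with a small scalar sketch; the names avg and avg_two_ops are ours, chosen for illustration (in the final architecture the pattern appears as the vector vavg unit visible in Figure 9):

```c
#include <assert.h>

/* The frequently occurring add-then-shift pattern fused into a single
 * custom "avg" operation: one instruction instead of two, and no
 * register-file round trip for the intermediate sum. */
static inline unsigned avg(unsigned a, unsigned b) {
    return (a + b) >> 1;
}

/* Without the custom operation, the compiler emits the pattern as two
 * dependent instructions, writing the intermediate sum to a register: */
static inline unsigned avg_two_ops(unsigned a, unsigned b) {
    unsigned sum = a + b;   /* add: register file write, then read */
    return sum >> 1;        /* shift right by one position         */
}
```

Both variants compute the same result; the custom operation merely collapses the dependent add and shift into one issue-slot cycle and removes the intermediate register access.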

4.2.2. Architecture shrinking

Architecture shrinking passively strips from an oversized ASIP architecture the components that are unused in the execution of a given parallel application structure, and estimates the effects of their removal on the area and performance of the so modified processor architecture design. During the shrinking process, individual issue-slots, function-units, register-files, memories, and/or (custom) operations can be removed from the architecture. Register files

and memories can also be resized to provide exactly the required amount of space. As a result, the connectivity of the interconnect, as well as the size of the instruction word and program memory, can be drastically reduced without any negative impact on the temporal performance (latency, throughput) of the overall ASIP design. The passive shrinking approach is fully implemented in the area and energy modeling [48], which allows for an efficient estimation of the benefits of a specific architectural shrinking.

In our implementation, we allow for user control of the kinds of elements that are removed during the shrinking process. For example, enabling or disabling the removal of single operations from a candidate architecture can have a large impact on the functional completeness and re-programmability of the resulting processor. Removing all operations except those required for the functioning of the target application will result in an architecture that may only (effectively) support small variations on the original algorithm, but will also provide an efficiency closer to that of a non-programmable hardware implementation. Conversely, keeping some of the less costly (though currently unused) instructions in some of the issue-slots of the final architecture design can improve the support for variations of the original algorithm at the cost of a somewhat lower energy efficiency. The granularity at which to perform the exploration depends strongly on the intended purpose of the design and, as such, is left to the designer of the processor architecture.
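The bookkeeping behind passive shrinking can be pictured with a toy model (made-up component costs and a made-up usage table, not the actual ASAM area/energy model): every component the mapped application never touches is stripped, and the remaining area is what the estimator reports.

```c
#include <assert.h>

/* Toy component table: an area cost and the number of times the
 * compiled application actually uses the component, taken from the
 * schedule. Values here are illustrative only. */
typedef struct { int area; int uses; } Component;

/* Passive shrinking: drop every component the mapped application never
 * uses and report the remaining area. In the real flow, register files
 * and memories are additionally resized to exactly the space needed. */
static int shrink_area(const Component *c, int n) {
    int area = 0;
    for (int i = 0; i < n; i++)
        if (c[i].uses > 0)        /* unused components are stripped */
            area += c[i].area;
    return area;
}
```

Because the estimate only subtracts unused parts, shrinking is cheap to evaluate and never changes the application schedule, which is why it has no impact on latency or throughput.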

4.2.3. Architecture reduction

Often, performing only the passive architecture shrinking for a given compiled parallel application structure will not result in the most efficient architecture design. For instance, instruction scheduling heuristics may have decided to map several operations of the same kind onto different issue-slots while these could have been mapped onto the same issue-slot. This may result in the same operation being supported by multiple issue-slots when this is not strictly required for achieving the required performance of the algorithm. The ASIP architecture reduction actively tries to suppress the usage of specific architecture components (issue-slots or function-units) by disabling them as selection alternatives for an operation in the compiler. Doing so renders them unused in the resulting application mapping, which allows the subsequent architecture shrinking to remove them from the final design. The instruction-set exploration strategies implemented for this active reduction stage are described in [46, 47].

An appropriate combination of the above three architecture refinement techniques results in a fast and efficient method for investigating instruction-set architecture variants of the initial prototype produced in the second stage of the micro-level architecture exploration, and for selecting the most preferred architecture.
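The disable-remap-check cycle of active reduction can be sketched as a toy greedy loop (this is our own simplified model, not the actual ASAM implementation: the real flow re-runs the compiler and performance estimator for each tentative change, which here is stood in for by a simple per-operation unit count):

```c
#include <assert.h>

#define N_UNITS 4

/* Toy architecture model: each function unit supports one operation
 * kind ('+' for add, '>' for shift) and has an area cost. */
typedef struct { char op; int area; int enabled; } Unit;

/* Stand-in for the compiler-in-the-loop performance check: the
 * application needs this many units of each kind per cycle to keep
 * its schedule length. */
static int still_fast_enough(const Unit *u, int need_add, int need_shift) {
    int adds = 0, shifts = 0;
    for (int i = 0; i < N_UNITS; i++)
        if (u[i].enabled) { if (u[i].op == '+') adds++; else shifts++; }
    return adds >= need_add && shifts >= need_shift;
}

/* Active reduction: tentatively disable each unit as a selection
 * alternative; keep the change only if the remapped application still
 * meets its performance requirement, otherwise roll it back. */
static int reduce(Unit *u, int need_add, int need_shift) {
    int area = 0;
    for (int i = 0; i < N_UNITS; i++) {
        u[i].enabled = 0;                       /* hide unit from compiler  */
        if (!still_fast_enough(u, need_add, need_shift))
            u[i].enabled = 1;                   /* removal hurt: roll back  */
    }
    for (int i = 0; i < N_UNITS; i++)
        if (u[i].enabled) area += u[i].area;
    return area;                                /* area after shrinking     */
}
```

Disabled units end up unused in the mapping, so the subsequent passive shrinking removes them and their interconnect from the final design.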

4.3. Experimental results

Application of our architecture refinement techniques to the example down-sampling application demonstrates their high effectiveness. Figure 8 shows the effects of the various optimization stages on both the area and the energy consumption of the different architecture variants. Each bar in the graphs is subdivided to show the area and energy distribution across the various components of the processor architecture, with the most costly components at the bottom. It should be noted, however, that these figures only show the energy and area cost of the core processor architecture with its local memories. The other components of the system, such as the (usually larger but much cheaper per bit) external memories, are not accounted for in the graphs.

The first (leftmost) bar shows the area and energy requirement of the initial prototype that was proposed by the previous micro-level architecture exploration phase. All bars are normalized in relation to this initial prototype. The second bar illustrates the effect of adding the custom operation pattern for the add-shift (avg) custom operation, which is applied to two vectors of data elements. It shows an increase in the processor area (due to the extra hardware) and a decrease in the active energy consumption due to a decreased number of reads from the register file. The total execution time of the algorithm did not change due to the inclusion of this operation. This can be explained by the fact that the cycle-count of the down-sampling application is limited by the initiation interval of the main loop kernel,

[Figure 8: two bar charts, (a) area and (b) dynamic energy, for the variants initial, initial+avg, shrinking, shrinking+avg, reduced, reduced+avg, and full-custom; the bars are subdivided into issue-slot, register-file, program memory, data memory, and FIFO contributions, plus clock, interconnect, and instruction-decoder contributions in the energy chart.]

Figure 8: Estimated area and energy during the different stages of optimization, normalized to the initial prototype. Both the initial prototype and an extended version (with the avg operation) are shown. The architecture cost of the actual (full-custom) implementation is also shown for reference to illustrate the estimation error.

which in turn is dominated by the amount of load operations from the input memory. Instead, the introduction of the custom operation results in a decrease of the required number of issue-slots for the processor. This is achieved by joining operations that were already scheduled in parallel into a single custom operation. The main area and energy savings from this optimization come from the removal of an issue-slot and its register files, as well as the corresponding reduction in the width of the program memory, as fewer operations need to be encoded per instruction.

The results of passively shrinking both the initial architecture and the version that includes the custom operation are shown in bars three and four respectively. Shrinking has a large impact on the size of the issue-slots, as many standard operations can be completely removed and others are only needed in a few issue-slots. The register files provided by the initial architecture are also significantly over-dimensioned. The results shown in this example correspond to a full customization of the processor architecture and include the removal of all unused elements from the initial (extended) architecture.

Continuing the process by actively exploring further possible architecture reductions results in another decrease in both the area and energy requirements of the processor architecture design, as shown by the fifth and sixth bars. In both cases, the design was optimized to improve the energy-delay product of the final processor, in an attempt to find a smaller, more efficient version of the architecture without giving up too much temporal performance. As a result, both proposed reduced architectures (with and without the custom operation) consume about 5% less energy than their shrunk versions. For the processor architecture that includes the custom operation, the architecture reduction phase was able to remove a complete vector issue-slot.
This resulted in a final proposed architecture with an issue-slot utilization as high as 91%.

A significant speedup of the exploration process can be achieved by observing that many symmetrical intermediate exploration points arise when an initial architecture composed of templated issue-slots is reduced. For example, removing an operation from one issue-slot may simply cause the compiler to use the same operation from another issue-slot. In some cases this can still be beneficial, but in many cases an equivalent solution will be found. To exploit this, we have introduced the BuildMaster framework [48]. This framework recognizes when the compilation result for a new architecture alternative will be equivalent to one that was considered earlier, and directly reuses the performance estimation of the previously considered equivalent architecture. This significantly reduces the number of compilation and performance estimation runs required during the overall exploration. Furthermore, the BuildMaster framework also recognizes when two architecture alternatives lead to different compiled applications that nevertheless share the same internal control-flow structure. In this case, the compilation of the target program is still required, but we may be able to reuse the simulation results obtained for the previously considered alternative. More detail on these techniques is available in [48].

The exploration times required for the original and extended initial architectures also clearly show the effectiveness of our intermediate result caching techniques. The exploration of both architectures took only 54 minutes in total on a 2.8 GHz Intel Core i7 with 6 GB of RAM. Our BuildMaster caching framework was able to reuse a significant number of compilation and simulation results: over 72% of the compilation time and over 94% of the simulation time were avoided by the use of our caching techniques.
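The core idea of reusing results for equivalent architecture alternatives can be pictured with a toy memoization sketch. This is purely illustrative: we only know BuildMaster's behavior as described in [48], so the signature string, the linear lookup table, and the dummy cost function below are all our own stand-ins:

```c
#include <assert.h>
#include <string.h>

#define CACHE_SIZE 64

/* Stand-in for the expensive compile-and-simulate step; the counter
 * lets us observe how many evaluations actually ran. */
static int evaluations = 0;
static int compile_and_simulate(const char *arch) {
    evaluations++;
    return (int)strlen(arch);   /* dummy cycle count */
}

/* BuildMaster-style result cache keyed by an architecture signature:
 * two architecture variants that the compiler would treat identically
 * map to the same signature and therefore reuse the earlier result. */
static const char *keys[CACHE_SIZE];
static int vals[CACHE_SIZE];
static int n_cached = 0;

static int cycles_for(const char *signature) {
    for (int i = 0; i < n_cached; i++)
        if (strcmp(keys[i], signature) == 0)
            return vals[i];                 /* cache hit: no rebuild */
    int c = compile_and_simulate(signature);
    keys[n_cached] = signature;             /* assumes static strings */
    vals[n_cached++] = c;
    return c;
}
```

Every cache hit replaces a full compilation and simulation run, which is where the reported 72% and 94% savings come from.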
Overall, the automatic exploration framework considered 407 different processor architecture variations in less than one hour and produced two highly optimized processor designs. Both final designs proposed by the automated architecture exploration reduce the area of the initial design by more than a factor of 4, and the energy consumption by almost a factor of 2. After the exploration, we verified the predicted results by comparing them to the results obtained by actually constructing the proposed architecture using the SiliconHive tools. The result of this verification is shown in the final bar of each graph, and the resulting architecture is shown in Figure 9. From this experiment we can learn that the area required for actually constructing the proposed processor architecture is slightly higher, while the energy required for running our algorithm on this architecture is slightly lower, than predicted. These effects demonstrate some of the limitations of our current area and energy model in relation to the architecture template. In the case of our down-sampling application, three discrepancies between

[Figure 9: the final full-custom processor datapath: six issue-slots (three scalar, three vector) with their reduced register files, the scalar data memory, program memory, FIFOs, and two vector memories, including the custom vavg unit; the detailed net-list labels are omitted here.]

Figure 9: The final full-custom architecture based on the exploration result

the predicted and obtained area and energy numbers can be observed. Firstly, the load-store unit connected to the input memory is only used for load operations. However, being a load-store unit, it is derived from a standard template library element that has a fixed set of input and output ports. Therefore, an extra register file must be added in order to actually construct the processor without diverging from the current template, and in order to keep all input and output ports correctly connected. This is the third register file from the right in Figure 9; it is not used, but it does provide space for one 512-bit wide vector element and represents approximately 10% of the register file area in the final architecture. Improving the architecture template library such that it allows for a load-store unit with only load operations (and thus fewer input ports) would enable the removal of this extra register file and bring the total register file area back down to the predicted value. The automated tools presented in this paper are currently not able to perform this optimization, but the SiliconHive template does allow the construction of such a load-only unit.

Secondly, the architecture model is completely agnostic of the operations implemented within the function units (except for their names) and only counts accesses to register files. This is a limitation of the current implementation of the exploration tools and should be resolved as part of future work. As a result, the architecture exploration is currently unable to recognize when a specific register file read or write port is unused in the proposed architecture.
Removing such unused register file ports during the final architecture construction further simplifies the interconnect, reduces the cost of the register files in general, and reduces the number of instruction word bits required for programming the final processor architecture.

Finally, the automated exploration framework resizes the program memory based on the number of instructions required for encoding the target application. However, in its current implementation, it does not take into account the small processor initialization routine that also needs to be loaded into the program memory. In the case of the down-sampling application, the addition of this initialization code results in a different rounding of the program memory size (from 32 lines to 64 lines). This results in a program memory area that is larger than predicted, though not twice as large, as fewer bits are required to encode the instruction word due to the reduction of the interconnect complexity explained previously.
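The rounding effect can be made concrete with a one-function sketch, assuming the program memory depth is rounded up to the next power of two, as the jump from 32 to 64 lines in the text suggests (the exact rounding rule of the SiliconHive template is not stated here):

```c
#include <assert.h>

/* Program memory depth rounded up to the next power of two. Adding
 * even a small initialization routine can therefore double the memory:
 * 32 application instructions fit in 32 lines, but 32 plus a few init
 * instructions require 64. */
static unsigned pmem_lines(unsigned instructions) {
    unsigned lines = 1;
    while (lines < instructions)
        lines <<= 1;
    return lines;
}
```

This is why the constructed program memory is larger than predicted, while the narrower instruction word (from the simplified interconnect) keeps the area from fully doubling.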

5. Conclusion

In the scope of the ASAM project, a novel automated design technology has been developed for constructing high-quality, heterogeneous, highly-parallel ASIP-based multi-processor platforms for highly-demanding cyber-physical applications, and for efficiently mapping complex applications onto these platforms. This paper discussed a part of the ASAM technology. The paper briefly explained the ASAM design flow, which substantially differs from earlier published flows, focused on the design methods and tools for the synthesis of a single VLIW ASIP-based HW/SW sub-system, extensively elaborated on the instruction-set architecture synthesis of the adaptable VLIW ASIPs, and discussed the related results of experimental research.

As explained in the previous sections, the ASAM design flow and its tools implement an actual coherent HW/SW co-design process by performing a quality-driven simultaneous co-development and co-tuning of the parallel application software structure and the processing platform architecture, to produce HW/SW systems highly optimized for a specific application. Moreover, the ASAM flow and its tools consider the macro-architecture and micro-architecture synthesis as one coherent complex system architecture synthesis task, and not as two separate tasks, as in the state-of-the-art methods. There are common aims and a strong, consistent collaboration between the two sub-tasks. The macro-architecture synthesis proposes a certain number of customizable ASIPs of several types, with a part of the application assigned to each of the proposed ASIPs. The micro-architecture synthesis proposes an architecture for each of the ASIPs, together with its local memories, communication, and other blocks, and correspondingly restructures its software to implement the assigned application part as effectively and efficiently as possible.
The decisions on the application-specific processor, memory, and communication architectures are made in strict collaboration to ensure their compatibility and an effective trade-off exploration among the

different design aspects. Subsequently, the RTL-level HDL descriptions of the constructed ASIPs are automatically generated and synthesized into an actual hardware design, and the restructured software of the application parts of the particular ASIPs is compiled. At several stages of the application restructuring and ASIP design, including the actual HW/SW implementation, the micro-architecture synthesis provides feedback to the macro-architecture synthesis on the physical characteristics of each particular sub-system implemented with each ASIP core. This way, well-informed system-level architecture decision re-iterations are possible and the macro-/micro-architecture trade-off exploration is enabled. After several iterations of the combined macro-/micro-architecture exploration and synthesis, an optimized MPSoC architecture is constructed.

To our knowledge, the ASIP-based MPSoC design flow as explained above has not yet been explored in any of the previously performed and published works. The related research in the MPSoC, ASIP, application analysis and restructuring, and other areas considers only some of the sub-problems of the adaptable ASIP-based MPSoC design in isolation. As a result, the proposed partial solutions are usually not directly useful in the much more complex actual context. The paper proposed a novel formulation of the ASIP-based HW/SW sub-system design problem as an actual hardware/software co-design problem, i.e. the simultaneous construction of an optimized parallel software structure and a corresponding parallel ASIP architecture, as well as the mapping in space and time of the constructed parallel software onto the parallel ASIP hardware. It discussed a novel solution of the so formulated problem, composed of a new design flow, its methods, and the corresponding design automation tools.
As explained in Sections 3 and 4, the ASIP-based HW/SW sub-system architecture exploration and synthesis flow involves the following three main phases:

Phase 1: performing application analysis and characterization;

Phase 2: performing coherent co-synthesis of parallel software structures and the related parallel processing, communication, and storage architectures;

Phase 3: performing instruction-set architecture synthesis and the related refined application restructuring.

While Phase 2 performs an actual coarse software/hardware co-development by exploring and selecting promising parallel software structures and the corresponding ASIP hardware architectures, Phase 3 performs a fine hardware and software co-tuning through refinement of the coarse architecture and application software. Phase 2 produces one or several promising coarse designs of a parallel ASIP-based HW/SW system. Each of these initial prototypes guarantees satisfaction of all the hard constraints (e.g. computation speed), but can be oversized. For instance, an initial prototype may involve too many and/or too large register files, memories, or communication structures; it may involve more issue-slots than necessary for the satisfaction of the hard constraints; its issue-slots may include only the standard operations and no application-specific operations; and some of its issue-slots may support too many operations. Phase 3 accepts such a coarse, possibly oversized, ASIP-based HW/SW sub-system design as its input and performs its precise instruction-set architecture synthesis.
The ISA synthesis of Phase 3 involves several collaborating processes, such as instruction-set extension, which replaces some of the most frequent compound operation patterns with application-specific instructions; architecture shrinking, which removes the unused components from an oversized architecture; and architecture reduction, which disables the use of specific operations for application mapping, observes the consequences, and removes the operations and their related hardware from the ASIP architecture if the consequences are positive. Moreover, we introduced two intermediate-result caching techniques which greatly accelerate the iterative process of architecture refinement by memoizing and reusing information on the previously considered architectures. The BuildMaster caching framework automatically recognizes when previous compilation and/or simulation results can be re-used, and in this way substantially reduces the required architecture exploration time. Finally, the paper discussed the results of experimental research that compared several implementations of the presented architecture refinement methods, demonstrated the benefits of combining the shrinking and reduction techniques, and compared their effectiveness and exploration time.

The main findings of the experimental research are that the presented method enables an automatic instruction-set architecture synthesis for VLIW ASIPs within a reasonable exploration time. Using the presented approach, we were able to automatically determine an initial architecture prototype that met the temporal performance requirements of the target application. Subsequently, refinement of this architecture considerably reduced both the design area (by 4x) and the active energy consumption (by 2x). This was achieved by automatically exploring 407 architecture variations in a targeted search for the most suitable design. The entire process of proposing and refining the custom VLIW ASIP into this final design took only about one hour. Furthermore, we found that the final design has a low resource overhead and comes very close to the design that an experienced engineer would propose for the same target application. The experimental results confirm the high effectiveness and efficiency of the developed design methods and design automation tools.

Acknowledgement

The research reported in this paper has been performed in the scope of the ASAM project of the European ARTEMIS Research Program and has been partly supported by the ARTEMIS Joint Undertaking under Grant No. 100265.


Roel Jordans received the MSc and PhD degrees in Electrical Engineering from Eindhoven University of Technology in 2009 and 2015, respectively. Afterwards, he worked as a researcher on the MAMPS tool flow within the PreMaDoNA project. His dissertation focused on the automatic design-space exploration of VLIW ASIP instruction-set architectures within the ASAM project. His research interests include compilers and compilation techniques for application-specific systems, digital signal processing systems based on customized VLIW architectures, and reliability and fault-tolerant design for space applications. Currently he is employed at the Radboud University Nijmegen, where he is active as science DSP architect in the Radboud Radio Lab, working on the software-defined radio astronomy receiver for the NCLE mission. In parallel, he is the primary lecturer of the parallelization, compilation, and platforms course at Eindhoven University of Technology. He also serves as a program committee member for the EUROMICRO Symposium on Digital System Design.

Lech Jozwiak is an Associate Professor and Head of the Section of Digital Circuits and Formal Design Methods at the Faculty of Electrical Engineering, Eindhoven University of Technology, The Netherlands. He is the author of a new information-driven approach to digital circuit synthesis, of theories of information relationships and measures and of the general decomposition of discrete relations, and of a methodology of quality-driven design, all of considerable practical importance. He is also a creator of a number of practical products in the fields of application-specific embedded systems and EDA tools. His research interests include system, circuit, and information theory, artificial intelligence, embedded systems, re-configurable and parallel computing, dependable computing, multi-objective circuit and system optimization, and system analysis and validation. He is the author of more than 150 journal and conference papers, some book chapters, and several tutorials at international conferences and summer schools. He is an Editor of "Microprocessors and Microsystems", "Journal of Systems Architecture", and "International Journal of High Performance Systems Architecture". He is a Director of EUROMICRO; co-founder and Steering Committee Chair of the EUROMICRO Symposium on Digital System Design; Advisory Committee and Organizing Committee member of the IEEE International Symposium on Quality Electronic Design; and program committee member of many other conferences. He is an advisor and consultant to industry, the Ministry of Economy, and the Commission of the European Communities. He recently advised the European Commission in relation to Embedded and High-performance Computing Systems for the purpose of the Framework Program 7 preparation. In 2008 he was a recipient of the Honorary Fellow Award of the International Society of Quality Electronic Design for "Outstanding Achievements and Contributions to Quality of Electronic Design".
His biography is listed in “The Roll of Honour of the Polish Science” of the Polish State Committee for Scientific Research and in Marquis “Who is Who in the World” and “Who is Who in Science and Technology”.

Henk Corporaal received the M.S. degree in theoretical physics from the University of Groningen, Groningen, The Netherlands, and the Ph.D. degree in electrical engineering, in the area of computer architecture, from the Delft University of Technology, Delft, The Netherlands. He has been teaching at several schools for higher education. He has been an Associate Professor with the Delft University of Technology in the field of computer architecture and code generation. He was a Joint Professor with the National University of Singapore, Singapore, and was the Scientific Director of the joint NUS-TUE Design Technology Institute. He was also the Department Head and Chief Scientist with the Design Technology for Integrated Information and Communication Systems Division, IMEC, Leuven, Belgium. Currently, he is a Professor of embedded system architectures with the Eindhoven University of Technology, Eindhoven, The Netherlands. He has co-authored over 250 journal and conference papers in the (multi)processor architecture and embedded system design area. Furthermore, he invented a new class of very long instruction word architectures, the Transport Triggered Architectures, which is used in several commercial products and by many research groups. His current research interests include single- and multiprocessor architectures and the predictable design of soft and hard real-time embedded systems.

Rosilde Corvino is a software engineer at Intel Benelux B.V., Eindhoven. Before that, she worked as a scientist and project manager in the Electronic Systems group of the Department of Electrical Engineering at Eindhoven University of Technology, The Netherlands. In 2010, she was a post-doctoral research fellow in the DaRT team at INRIA Lille Nord Europe, where she was involved in the Gaspard2 project. She earned her PhD in micro- and nanoelectronics in 2009 from the University Joseph Fourier of Grenoble. In 2005/06, she obtained a double Italian and French M.Sc. in electronic engineering. Her research interests include design space exploration, parallelization techniques, data transfer and storage mechanisms, high-level synthesis, and application-specific processor design for data-intensive applications. She is the author of numerous research papers and a book chapter.
