Journal of Electrical and Computer Engineering
ESL Design Methodology
Guest Editors: Deming Chen, Kiyoung Choi, Philippe Coussy, Yuan Xie, and Zhiru Zhang

Copyright © 2012 Hindawi Publishing Corporation. All rights reserved.
This is a special issue published in “Journal of Electrical and Computer Engineering.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Editorial Board
The editorial board of the journal is organized into sections that correspond to the subject areas covered by the journal.
Circuits and Systems
M. T. Abuelma’atti, Saudi Arabia; Yong-Bin Kim, USA; Gabriel Robins, USA; Ishfaq Ahmad, USA; H. Kuntman, Turkey; Mohamad Sawan, Canada; Dhamin Al-Khalili, Canada; Parag K. Lala, USA; Raj Senani, India; Wael M. Badawy, Canada; Shen-Iuan Liu, Taiwan; Gianluca Setti, Italy; Ivo Barbi, Brazil; Bin-Da Liu, Taiwan; Jose Silva-Martinez, USA; Martin A. Brooke, USA; João A. Martino, Brazil; Nicolas Sklavos, Greece; Chip Hong Chang, Singapore; Pinaki Mazumder, USA; Ahmed M. Soliman, Egypt; Y. W. Chang, Taiwan; Michel Nakhla, Canada; Dimitrios Soudris, Greece; Tian-Sheuan Chang, Taiwan; Sing K. Nguang, New Zealand; Charles E. Stroud, USA; Tzi-Dar Chiueh, Taiwan; Shun-ichiro Ohmi, Japan; Ephraim Suhir, USA; Henry S. H. Chung, Hong Kong; Mohamed A. Osman, USA; Hannu Tenhunen, Sweden; M. Jamal Deen, Canada; Ping Feng Pai, Taiwan; George S. Tombras, Greece; Ahmed El Wakil, UAE; Marcelo A. Pavanello, Brazil; Spyros Tragoudas, USA; Denis Flandre, Belgium; Marco Platzner, Germany; Chi Kong Tse, Hong Kong; P. Franzon, USA; Massimo Poncino, Italy; Chi-Ying Tsui, Hong Kong; Andre Ivanov, Canada; Dhiraj K. Pradhan, UK; Jan Van der Spiegel, USA; Ebroul Izquierdo, UK; F. Ren, USA; Chin-Long Wey, USA; Wen-Ben Jone, USA

Communications
Sofiène Affès, Canada; K. Giridhar, India; Adam Panagos, USA; Dharma Agrawal, USA; Amoakoh Gyasi-Agyei, Ghana; Samuel Pierre, Canada; H. Arslan, USA; Yaohui Jin, China; Nikos C. Sagias, Greece; Edward Au, China; Mandeep Jit Singh, Malaysia; John N. Sahalos, Greece; Enzo Baccarelli, Italy; Peter Jung, Germany; Christian Schlegel, Canada; Stefano Basagni, USA; Adnan Kavak, Turkey; Vinod Sharma, India; Guoan Bi, Singapore; Rajesh Khanna, India; Iickho Song, Korea; Jun Bi, China; Kiseon Kim, Republic of Korea; Ioannis Tomkos, Greece; Z. Chen, Singapore; D. I. Laurenson, UK; Chien Cheng Tseng, Taiwan; René Cumplido, Mexico; Tho Le-Ngoc, Canada; George Tsoulos, Greece; Luca De Nardis, Italy; C. Leung, Canada; Laura Vanzago, Italy; M.-Gabriella Di Benedetto, Italy; Petri Mähönen, Germany; Roberto Verdone, Italy; J. Fiorina, France; Mohammad A. Matin, Bangladesh; Guosen Yue, USA; Lijia Ge, China; M. Nájar, Spain; Jian-Kang Zhang, Canada; Z. Ghassemlooy, UK; M. S. Obaidat, USA

Signal Processing
S. S. Agaian, USA; A. Constantinides, UK; Karen O. Egiazarian, Finland; Panajotis Agathoklis, Canada; Paul Cristea, Romania; W. S. Gan, Singapore; Jaakko Astola, Finland; Petar M. Djuric, USA; Z. F. Ghassemlooy, UK; Tamal Bose, USA; Igor Djurović, Montenegro; Ling Guan, Canada; Martin Haardt, Germany; S. Marshall, UK; Srdjan Stankovic, Montenegro; Peter Handel, Sweden; Antonio Napolitano, Italy; Yannis Stylianou, Greece; Alfred Hanssen, Norway; Sven Nordholm, Australia; Ioan Tabus, Finland; Andreas Jakobsson, Sweden; S. Panchanathan, USA; Jarmo Henrik Takala, Finland; Jiri Jan, Czech Republic; Periasamy K. Rajan, USA; A. H. Tewfik, USA; S. Jensen, Denmark; Cédric Richard, France; Jitendra Kumar Tugnait, USA; Chi Chung Ko, Singapore; W. Sandham, UK; Vesa Välimäki, Finland; M. A. Lagunas, Spain; Ravi Sankar, USA; Luc Vandendorpe, Belgium; J. Lam, Hong Kong; Dan Schonfeld, USA; Ari J. Visa, Finland; D. I. Laurenson, UK; Ling Shao, UK; Jar-Ferr Yang, Taiwan; Riccardo Leonardi, Italy; John J. Shynk, USA; Mark Liao, Taiwan; Andreas Spanias, USA

Contents
ESL Design Methodology, Deming Chen, Kiyoung Choi, Philippe Coussy, Yuan Xie, and Zhiru Zhang Volume 2012, Article ID 358281, 2 pages
Parametric Yield-Driven Resource Binding in High-Level Synthesis with Multi-Vth/Vdd Library and Device Sizing, Yibo Chen, Yu Wang, Yuan Xie, and Andres Takach Volume 2012, Article ID 105250, 14 pages
Task-Level Data Model for Hardware Synthesis Based on Concurrent Collections, Jason Cong, Karthik Gururaj, Peng Zhang, and Yi Zou Volume 2012, Article ID 691864, 24 pages
Selectively Fortifying Reconfigurable Computing Device to Achieve Higher Error Resilience, Mingjie Lin, Yu Bai, and John Wawrzynek Volume 2012, Article ID 593532, 12 pages
A State-Based Modeling Approach for Efficient Performance Evaluation of Embedded System Architectures at Transaction Level, Anthony Barreteau, Sébastien Le Nours, and Olivier Pasquier Volume 2012, Article ID 537327, 16 pages
High-Level Synthesis under Fixed-Point Accuracy Constraint, Daniel Menard, Nicolas Herve, Olivier Sentieys, and Hai-Nam Nguyen Volume 2012, Article ID 906350, 14 pages
Automated Generation of Custom Processor Core from C Code, Jelena Trajkovic, Samar Abdi, Gabriela Nicolescu, and Daniel D. Gajski Volume 2012, Article ID 862469, 26 pages
Hardware and Software Synthesis of Heterogeneous Systems from Dataflow Programs, Ghislain Roquier, Endri Bezati, and Marco Mattavelli Volume 2012, Article ID 484962, 11 pages
High-Level Synthesis: Productivity, Performance, and Software Constraints, Yun Liang, Kyle Rupnow, Yinan Li, Dongbo Min, Minh N. Do, and Deming Chen Volume 2012, Article ID 649057, 14 pages

Hindawi Publishing Corporation Journal of Electrical and Computer Engineering Volume 2012, Article ID 358281, 2 pages doi:10.1155/2012/358281
Editorial

ESL Design Methodology
Deming Chen,1 Kiyoung Choi,2 Philippe Coussy,3 Yuan Xie,4 and Zhiru Zhang5
1 Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
2 School of Electrical Engineering, Seoul National University, Seoul 151-742, Republic of Korea
3 Department of Sciences and Techniques, Lab-STICC, Université de Bretagne-Sud, Lorient 56321 Cedex, France
4 Department of Computer Science and Engineering, Pennsylvania State University at University Park, University Park, PA 16802-1294, USA
5 High-Level Synthesis Department, Xilinx Inc., San Jose, CA 95124, USA
Correspondence should be addressed to Deming Chen, [email protected]
Received 15 May 2012; Accepted 15 May 2012

Copyright © 2012 Deming Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction

ESL (electronic system level) design is an emerging design methodology that allows designers to work at higher levels of abstraction than typically supported by register transfer level (RTL) descriptions. Its growth has been driven by the continuing complexity of IC design, which has made RTL implementation less efficient.

ESL methodologies hold the promise of dramatically improving design productivity by accepting designs written in high-level languages such as C, SystemC, C++, and MATLAB, and implementing the function straight into hardware. Designers can also leverage ESL to optimize performance and power by converting compute-intensive functions into customized cores in System-on-Chip (SoC) designs or FPGAs. ESL can also support early embedded-software development, architectural modeling, and functional verification.

ESL has been predicted to grow steadily in both user base and revenue in the coming decade. Meanwhile, design challenges in ESL remain. Important research challenges include effective hardware/software partitioning and codesign, high-quality high-level synthesis, seamless system IP integration, accurate and fast performance/power modeling, and efficient debugging and verification.

At the invitation of the Journal of Electrical and Computer Engineering of the Hindawi Publishing Corporation, we put together a special issue on ESL design methodology. After the call for papers, we received submissions from around the globe, and after a careful review and selection procedure, eight papers were accepted into this special issue. These papers cover a wide range of important topics for ESL, with rich content and compelling experimental results. We introduce summaries of these papers next, categorized into four sections: high-level synthesis, modeling, processor synthesis and hardware/software codesign, and design for error resilience.

2. High-Level Synthesis

In the paper “Parametric yield-driven resource binding in high-level synthesis with multi-Vth/Vdd library and device sizing,” Y. Chen et al. demonstrate that the increasing impact of process variability on circuit performance and power requires statistical approaches in analysis and optimization at all levels of design abstraction. The paper presents a variation-aware high-level synthesis method that integrates resource sharing with Vth/Vdd selection and device sizing to effectively reduce power consumption under a given timing yield constraint. Experimental results demonstrate significant power yield improvement over conventional worst-case deterministic techniques.

D. Menard et al. present, in the paper “High-level synthesis under fixed-point accuracy constraint,” a new method to integrate high-level synthesis (HLS) and word-length optimization (WLO). The proposed WLO approach is based on analytical fixed-point analysis to reduce the implementation cost of signal processing applications. The authors demonstrate that, for a latency-constrained application, area savings can be obtained by iteratively performing WLO and HLS, taking advantage of the interactions between the two processes.

In “High-level synthesis: productivity, performance, and software constraints” by Y. Liang et al., a study of HLS targeting FPGAs in terms of performance, usability, and productivity is presented. For the study, the authors use an HLS tool called AutoPilot and a set of popular embedded benchmark kernels. To evaluate the suitability of HLS for real-world applications, they also perform a case study using various stereo matching algorithms from computer vision research. Through the study, they provide insights into current limitations of mapping general-purpose software to hardware using HLS, along with some future directions for HLS tool development. They also provide several guidelines for hardware-friendly software design.

3. Modeling

In “Task-level data model for hardware synthesis based on concurrent collections” by J. Cong et al., a task-level data model (TLDM) that can be used for task-level optimization in hardware synthesis for data processing applications is proposed. The model is based on the Concurrent Collections model, which provides flexibility in task rescheduling. Polyhedral models are embedded in TLDM for concise expression of task instances, array accesses, and dependencies. The authors present examples showing the benefits of the proposed TLDM specification in modeling task-level concurrency for hardware synthesis on heterogeneous platforms.

A. Barreteau et al., in the paper “A state-based modeling approach for efficient performance evaluation of embedded system architectures at transaction level,” address the important topic of performance evaluation for SoCs based on transaction-level modeling (TLM). The authors propose a generic execution model and a specific computation method to support hardware/software architecture evaluation. The benefits of the proposed approach are highlighted through two case studies.

4. Processor Synthesis and Hardware/Software Codesign

The paper “Automated generation of custom processor core from C code” by J. Trajkovic et al. presents a novel solution for constructing a processor core from given application C code. The proposed methodology starts with an initial data path design obtained by matching code properties to hardware elements, and iteratively refines it under given design constraints. The experimental results show that the technique scales very well with the size of the C code, and they demonstrate its efficiency on a wide range of applications, from standard academic benchmarks to industrial-size examples such as an MP3 decoder.

The paper “Hardware and software synthesis of heterogeneous systems from dataflow programs” by G. Roquier et al. stresses that the sequential programming model does not naturally expose the potential parallelism needed to target heterogeneous platforms. This work therefore presents a design method that automatically generates hardware and software components, and their interfaces, from a unique high-level description of the application based on the dataflow paradigm. The work targets heterogeneous architectures composed of reconfigurable hardware units and multicore processors. Experimental results using several video coding algorithms show the effectiveness of the approach in terms of both portability and scalability.

5. Design for Error Resilience

The paper “Selectively fortifying reconfigurable computing device to achieve higher error resilience” by M. Lin et al. introduces the concept of Selectively Fortified Computing (SFC) for mission-critical applications with limited inherent error resilience. The SFC methodology differs from conventional approaches that use static temporal and/or spatial redundancy and complicated error prediction or estimation techniques. It selectively allocates hardware redundancy to the key components of a design in order to maximize overall error resilience. Experimental results from a 720p H.264/AVC encoder prototype implemented on a Virtex-5 device demonstrate the effectiveness of SFC under a wide range of error rates.

6. Concluding Remarks

ESL design is a fast-evolving field. We hope this special issue provides a snapshot of current research activities in the ESL design area and offers useful references for researchers interested in this exciting field. Finally, we wholeheartedly thank all 29 reviewers for this special issue, whose effort has made it successful.

Deming Chen
Kiyoung Choi
Philippe Coussy
Yuan Xie
Zhiru Zhang

Hindawi Publishing Corporation Journal of Electrical and Computer Engineering Volume 2012, Article ID 105250, 14 pages doi:10.1155/2012/105250
Research Article

Parametric Yield-Driven Resource Binding in High-Level Synthesis with Multi-Vth/Vdd Library and Device Sizing

Yibo Chen,1 Yu Wang,2 Yuan Xie,1 and Andres Takach3
1 Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA
2 Department of Electronics Engineering, Tsinghua University, Beijing 100084, China
3 Design Creation and Synthesis, Mentor Graphics Corporation, Wilsonville, OR 97070, USA
Correspondence should be addressed to Yibo Chen, [email protected]
Received 3 August 2011; Revised 4 January 2012; Accepted 15 January 2012
Academic Editor: Zhiru Zhang
Copyright © 2012 Yibo Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The ever-increasing chip power dissipation in SoCs has imposed great challenges on today’s circuit design. It has been shown that multiple threshold and supply voltage assignment (multi-Vth/Vdd) is an effective way to reduce power dissipation. However, most prior multi-Vth/Vdd optimizations are performed under deterministic conditions. With increasing process variability that significantly impacts both the power dissipation and the performance of circuit designs, it is necessary to employ statistical approaches in analysis and optimization for low power. This paper studies the impact of process variations on the multi-Vth/Vdd technique at the behavioral synthesis level. A multi-Vth/Vdd resource library is characterized for delay and power variations at different voltage combinations. Meanwhile, device sizing is performed on the resources in the library to mitigate the impact of variation and to enlarge the design space for better-quality design choices. A parametric yield-driven resource binding algorithm is then proposed, which uses the characterized power and delay distributions and efficiently maximizes power yield under a timing yield constraint. During the resource binding process, voltage level converters are inserted between resources when required. Experimental results show that significant power reduction can be achieved with the proposed variation-aware framework, compared with traditional worst-case deterministic approaches.
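As a reading aid for the statistical formulation used throughout the paper, the parametric yield notion from the abstract can be sketched numerically: if a bound unit’s delay is modeled as a normal distribution and its leakage power as a log-normal, the timing yield is P(delay ≤ clock period) and the power yield is P(power ≤ budget). The snippet below is a minimal illustration with hypothetical characterization numbers, not data from the paper:

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def timing_yield(delay_mu, delay_sigma, t_clk):
    """Probability that a unit's (normally distributed) delay meets the clock period."""
    return normal_cdf(t_clk, delay_mu, delay_sigma)

def power_yield(ln_mu, ln_sigma, p_budget):
    """P(power <= p_budget) when ln(power) ~ N(ln_mu, ln_sigma^2) (log-normal power)."""
    return normal_cdf(math.log(p_budget), ln_mu, ln_sigma)

# Hypothetical 16-bit adder characterized at one low-power voltage corner:
# delay ~ N(2.0 ns, 0.15 ns); leakage with ln(P / 1 uW) ~ N(1.0, 0.3).
ty = timing_yield(2.0, 0.15, 2.25)   # clock period 2.25 ns
py = power_yield(1.0, 0.3, 4.0)      # power budget 4 uW
print(f"timing yield = {ty:.3f}, power yield = {py:.3f}")
```

A yield-driven binding algorithm of the kind described here would evaluate such yields for each candidate (Vth, Vdd, size) variant of a resource and keep the variant that maximizes power yield while the timing yield constraint still holds.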
1. Introduction

Integrating billions of nanoscale transistors on a single chip has resulted in great challenges for chip designers. One of these challenges is that the pace of productivity gains has not kept up with the increases in design complexity. Consequently, we have seen a recent trend of moving design abstraction to a higher level, with an emphasis on electronic system level (ESL) design methodologies. A very important component of ESL is raising the level of abstraction of hardware design. High-level synthesis (HLS) provides this component through automation that generates optimized hardware from a high-level description of the function or algorithm to be implemented in hardware. HLS generates a cycle-accurate specification at the register-transfer level (RTL) that is then used in existing ASIC or FPGA design methodologies. Commercial high-level synthesis tools [1] have recently gained a lot of attention, as evidenced in recent conference HLS workshops (DATE 2008, DAC 2008, and ASPDAC 2009), conference panels, and publications that track the industry. While high-level synthesis is able to quickly generate implementations of circuits, it is not intended to replace existing low-level synthesis. The major benefit of high-level synthesis is high design efficiency: the ability to perform fast prototyping, functional verification, and early-stage design space exploration, which in turn provides guidance for succeeding low-level design steps and helps produce high-quality circuits.

Power consumption and process variability are among the other critical design challenges as technology scales. Because it is believed that tackling these issues at a higher level of the design hierarchy can lead to better design decisions, a lot of work has been done on low-power high-level synthesis [2–4] as well as process-variation-aware high-level synthesis [5–8]. These techniques have been successfully implemented, but most of the existing work focuses on one side of the issues in isolation. Recently, Srivastava et al. [9] explored the multi-Vth/Vdd/Tox design space with consideration of process variations at the gate level. Nevertheless, variation-aware low-power exploration for behavioral synthesis is still in its infancy.

Multiple threshold and supply voltage assignment (multi-Vth/Vdd) has been shown to be an effective way to reduce circuit power dissipation [2, 3, 10, 11]. Existing approaches assign circuit components on critical paths to operate at a higher Vdd or lower Vth, while noncritical portions of the circuit are made to operate at a lower Vdd or higher Vth. The total power consumption is thus reduced without degrading circuit performance. However, circuit performance is now affected by process variations. If the variations are underestimated, for example by using nominal delays of circuit components to guide the design, noncritical components may become critical due to the variations, and circuit timing constraints may be violated. On the other hand, in existing corner-based worst-case analysis, variations are overestimated, resulting in design specs that are hard to meet; this consequently increases design effort and degrades circuit performance.

Device sizing is a well-studied technique for performance and power tuning at the gate or circuit level [12]. To improve performance, upsizing a high-Vth transistor, which increases switching power and die area, can be traded off against using a low-Vth transistor, which increases leakage power. Therefore, combining multi-Vth assignment and device sizing as an integrated problem can increase design flexibility and further improve design quality. Meanwhile, in terms of mitigating process variations, increasing the size of transistors can reduce the randomness of device parameters through averaging.

This paper presents a variation-aware power optimization framework for high-level synthesis using simultaneous multi-Vth/Vdd assignment and device sizing. Firstly, the impact of parameter variations on the delay and power of circuit components is explored at different operating points of threshold and supply voltage. Device sizing is then performed to mitigate the impact of variations and to enlarge the design space for better-quality design choices. A variation-characterized resource library, containing the parameters of delay and power distributions at different voltage “corners” and different device sizes, is built once for a given technology, so that high-level synthesis can query the delay/power characteristics of resources. The concept of parametric yield, defined as the probability that the design meets specified constraints such as delay or power constraints, is then introduced to guide design space exploration. Statistical timing and power analysis on the data flow graph (DFG) is used to propagate the delay and power distributions through the DFG and to estimate the overall performance and power yield of the entire design. A variation-aware resource binding algorithm is then proposed to maximize power yield under a timing yield constraint by iteratively searching for the operations that have the maximum potential for performance/power yield improvement and replacing them with better candidates from the multi-Vth/Vdd resource library. During the resource binding process, voltage level converters are inserted for chaining of resource units having different Vdd supplies.

The contributions of this paper can be summarized as follows:

(i) this is the first work to apply multi-Vth/Vdd techniques during high-level synthesis in the context of both delay and power variations. A flow for variation-aware power optimization in multi-Vth/Vdd HLS is proposed, including library characterization, statistical timing and power analysis methodologies for HLS, and resource binding optimization with a variation-characterized multi-Vth/Vdd library;

(ii) combined multi-Vth/Vdd assignment and device sizing for high-level synthesis is performed at the granularity of function units, to improve design quality and at the same time reduce design complexity;

(iii) voltage level conversion is explored during resource binding in high-level synthesis, enabling full utilization of multi-Vdd components for parametric yield maximization.

2. Related Work

Prior research tightly related to this paper falls into three categories: (1) gate-level power minimization by simultaneous multi-Vth assignment and gate sizing; (2) low-power high-level synthesis using multi-Vth or multi-Vdd; (3) process-variation-aware high-level synthesis.

Several techniques have been proposed to treat Vth allocation and transistor sizing as an integrated problem [13–16]. Wei et al. [14] presented simultaneous dual-Vth assignment and gate sizing to minimize total power dissipation while maintaining high performance, while Karnik et al. [16] improved simultaneous Vth allocation and device sizing using a Lagrangian relaxation method. However, all of the reported techniques focus on tuning at the transistor or gate level. While this fine granularity can yield optimal results, it also leads to high design complexity.

Shiue [2] proposed low-power scheduling schemes with multi-Vdd resources that maximize the utilization of resources operating at reduced supply voltages. Khouri and Jha [3] performed high-level synthesis using a dual-Vth library for leakage power reduction. Tang et al. [4] formulated the synthesis problem using dual Vth as a maximum weight-independent set (MWIS) problem, with which near-optimal leakage power reduction is achieved with greatly reduced run time. Very recently, Insup et al. explored optimal register allocation for high-level synthesis using dual supply voltages [17]. However, all of these techniques were applied under deterministic conditions without taking process variation into consideration.

Process-variation-aware high-level synthesis has recently gained much attention. Jung and Kim [6] proposed a timing-yield-aware HLS algorithm to improve resource sharing and reduce overall latency. Lucas et al. [8] integrated timing-driven floorplanning into variation-aware high-level
design. Mohanty and Kougianos’s work [18] took into account the leakage power variations in low-power high-level synthesis; however, the major difference between [18] and our work is that the delay variation of function units was not considered in [18], so the timing analysis during synthesis was still deterministic. Recently, Wang et al. [19] proposed a joint design-time optimization and postsilicon tuning framework that tackles both timing and power variations; adaptive body biasing (ABB) was applied to function units to reduce leakage power and improve power yield.

[Figure 1: The delay variation for 16-bit adders in IBM Cu-08 technology (courtesy of IBM). Bars show the normalized delay variability (σ/μ) of adder implementations including static, passgate, dynamic, Brent-Kung, Manchester, carry-select, Kogge-Stone, Han-Carlson, radix-2, and radix-4 designs.]

3. Multi-Vth/Vdd Library Characterization under Process Variations

Scheduling and resource binding are key steps in the high-level synthesis process. The scheduler is in charge of sequencing the operations of a control/data flow graph (CDFG) in control steps and within control steps (operator chaining), obeying control and data dependencies and cycle constraints while optimizing for area/power/performance. The binding process binds operations to hardware units in the resource library to complete the mapping from abstracted circuit descriptions into practical designs. This section presents the characterization of the variation-aware multi-Vth/Vdd resource library, including the delay and power characterization flow and the selection of dual threshold and supply voltages.

3.1. Variation-Aware Library Characterization Flow. In order to facilitate design space exploration while considering process variations, the resource library of functional units for HLS has to be characterized for delay/power variations. As shown in Figure 1, under the influence of process variations, the delay and power of each component are no longer fixed values but are represented by probability density functions (PDFs). Consequently, the characterization of function units with delay and power variations requires statistical analysis methodologies.

Process variations come from a set of sources, including random doping fluctuation (RDF) [20] and geometric variations of the gate (primarily on channel length) [21]. Since both RDF and channel length variations manifest themselves as fluctuations of the effective threshold voltage of the transistor [22], their effects can be expressed as variations of Vth. Since this work focuses on demonstrating the effectiveness of variation-aware synthesis rather than on comprehensive modeling of all variation effects, we focus on Vth variations under a simplified assumption of normally distributed Vth, rather than covering all physical-level variation factors with their different distributions. The magnitude of Vth variations in real circuits can be obtained via on-chip sensing and measurement. In this work, we use the NCSU FreePDK 45 nm technology library [23] for all characterization and experiments. We set the standard deviation σ of Vth to 50 mV, which is projected from the silicon measurement data in [24].

We then use a commercial gate-level statistical timing analysis tool, Synopsys PrimeTime VX, to perform the characterization. This variation-aware tool increases the accuracy of timing analysis by considering the statistical distribution of device parameters such as threshold voltage and gate oxide thickness. Given the distribution of a device parameter P, PrimeTime VX calculates the distribution of gate delay continuously throughout the range of values, using linear interpolation and extrapolation at the library-defined functional operating points, as shown in Figure 2. Validation against SPICE Monte Carlo statistical analysis [25] shows that PrimeTime VX analysis achieves similar accuracy while significantly reducing running time.

[Figure 2: Calculating the delay distribution as a function of parameter P. The gate delay is D = f(P); the distribution of the parameter P maps to a distribution of the delay D.]

The characterization flow takes as input the statistical distributions of process parameters (Vth in this work) and generates the statistical distributions of delay and power for each resource in the library. To characterize the delay of function units under the impact of process variations, the following steps are performed:

(1) all the standard cells in the technology library are characterized using variation-aware analysis, and the
results including parameters of cell delay distribu- Note that in our work we only characterize subthreshold tions are collected to build a variation-aware technol- leakage, since it starts dominant for technology nodes of ogy library; 45 nm and below. The gate leakage can also be characterized (2) the function units used in HLS are then synthesized with similar methods. and linked to the variation-aware technology library; 3.2. Multi-Vth/Vdd Library Characterization. Previous imple- (3) statistical timing analysis for the function units is mentations using multiple threshold and supply voltages in performed using PrimeTime VX, and the parameters conjunction have shown a very effective reduction in both of delay distributions are reported. dynamic and leakage power [11]. Therefore, our approach Statistical power characterization for function units in considers the combination of dual threshold and dual supply voltages, and characterizations are performed at the four the resource library can be done using Monte Carlo analysis L H H H in SPICE. The power consumption of function units consists “corners” of voltage settings, namely (Vth, Vdd), (Vth , Vdd), L L H L L H of dynamic and leakage components. While dynamic power (Vth, Vdd), and (Vth , Vdd), where (Vth, Vdd) is the nominal is relatively immune to process variation, leakage power case and the other three are low-power settings. Note that is greatly affected and becomes dominant as technology although only 4 voltage settings are discussed in this paper, it continues scaling down [26]. Therefore, in this paper only is natural to extend the approach presented here to deal with leakage power is characterized using statistical analysis. more voltage settings. To reduce the process technology cost, However, this do not mean that considerations for dynamic in this paper, the multi-Vth/Vdd techniques are applied at the power can be omitted. In fact, dynamic power optimization granularity of function units. 
That means, all the gates inside in high-level synthesis has been a well-explored topic [1]. a function unit operate at the same threshold and supply Our variation-oriented work emphasizing leakage power voltages. Voltages only differ from function units to function optimization can be stacked on or integrated into existing units. power optimization approaches in high-level synthesis, to The selection of appropriate values of threshold and further reduce the total power consumption of circuits. supply voltages for power minimization has been discussed According to Berkeley short-channel BSIM4 model [27], under deterministic conditions [11]. Rules of thumb are higher threshold voltages can lead to exponential reduction derived for the second Vdd and Vth as functions of the origi- in leakage power, which is given by: nalvoltages[11]: V − V − L H H 0.72 0.55 = gs th − Vds V = 0.43V +0.82V + − − 0.2, Ileakage I0 exp 1 exp , (1) dd dd th K K 2 nVt Vt (4) L = − H H 0.72 − 0.49 − W Vth 0.024Vdd +1.14Vth + 2 0.18, I = μ C (n − 1)V 2 , (2) K K 0 0 ox L th where K stands for the initial ratio between dynamic and where Ileakage is the gate leakage current, Vgs and Vds are static power. While the empirical models in [11] are validated the gate voltages, Vt is the thermal voltage, and n is the onactualcircuitbenchmarks[29], they may not be accurate subthreshold swing factor. Since we assume that the device under the impact of process variations. A refined model parameter Vth follows normal distribution, Ileakage follow taking into account the process variations is presented in log-normal distribution. Therefore, the leakage power of a [9]. As shown in Figure 3, the total power reduction with function units is the sum of a set of log-normal distributions, variation awareness is plotted under different combinations H L which describe the leakage power of each library cell. 
of Vth2 (Vth )andVdd2 (Vdd), and this guides the optimal According to [28], the sum of several log-normal random value selection in this work. variables can be approximated by another log-normal ran- The characterization results (which will be further dis- dom variable, as shown in (3): cussed in Section 6) show that, power reduction is always achieved at the cost of delay penalties. Moreover, larger delay V1 V2 Vn PFU = P1 + P2 + ···+ Pn = k1e + k2e + ···+ kne , variations are observed for slower units operating at high- (3) Vth or low-Vdd, which means larger probability of timing violations when they are placed on the near-critical paths. where PFU describes the leakage power distribution of the This further demonstrates the necessity of statistical analysis function unit; while Pn, kn,andVn are the corresponding and parametric yield-driven optimization approaches. variables for library cells that build up the function unit. The mean and deviation of PFU can be estimated via iterative 3.3. Device Sizing for the Resource Library. Conventionally, moment matching out of the leakage power distributions of device sizing is an effective technique to optimize CMOS library cells [28]. circuits for dynamic power dissipation and performance. In The power characterization flow is stated as follows. this work, we show that device sizing may also be utilized Process variations are set in the MOS model files, and to mitigate the impact of process variations. As previously 1000 runs of Monte Carlo iterations are performed for each mentioned, the sources of process variations mainly consists library cell. After the characterization, the parameters of the of random doping fluctuation (RDF) [20]andgeometric leakage power distributions of library cells are extracted. variations (GVs). GV affect the real Vth through the drain Journal of Electrical and Computer Engineering 5 induced barrier lowering (DIBL) effect. 
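Stepping back to the leakage statistics above: equations (1)-(3) imply that, with Vth normally distributed, each cell's leakage is log-normal, and the function-unit total can be collapsed to a single log-normal by matching the first two moments (Wilkinson-style moment matching, in the spirit of [28]). A minimal sketch follows; every device constant below is an illustrative assumption, not a value from this paper.

```python
import math

# Illustrative device constants (assumptions, not values from the paper)
I0, n, Vt = 1e-7, 1.5, 0.026      # A; subthreshold swing factor; thermal voltage (V)
Vgs, Vds = 0.0, 1.1               # bias voltages (V); the Vds factor in (1) is ~1 here
C = I0 * (1.0 - math.exp(-Vds / Vt))

def lognormal_params(vth_mean, vth_sigma):
    """Per (1): I = C*exp((Vgs - Vth)/(n*Vt)); Vth ~ Normal  =>  log I ~ Normal."""
    mu = math.log(C) + (Vgs - vth_mean) / (n * Vt)
    sigma = vth_sigma / (n * Vt)
    return mu, sigma

# A toy "function unit" made of three identical cells, Vth ~ N(0.37 V, 50 mV)
cells = [lognormal_params(0.37, 0.05) for _ in range(3)]

# Moment matching for the sum in (3): equate the mean/variance of
# P_FU = P_1 + ... + P_n with those of a single log-normal.
mean_sum = sum(math.exp(mu + s * s / 2.0) for mu, s in cells)
var_sum = sum((math.exp(s * s) - 1.0) * math.exp(2.0 * mu + s * s) for mu, s in cells)
sigma2_fu = math.log(1.0 + var_sum / mean_sum ** 2)   # matched log-variance
mu_fu = math.log(mean_sum) - sigma2_fu / 2.0          # matched log-mean
print("matched log-normal (mu, sigma^2):", mu_fu, sigma2_fu)
```

The matched pair reproduces the exact mean and variance of the sum; higher-order moments are only approximated, which is the usual tradeoff of Wilkinson's method.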
Originally, both RDF and GVs have almost equal importance in determining the Vth variance. As we propose to use low-Vdd and high-Vth resource units in the design, the difference between the supply voltage and the threshold voltage diminishes, and this reduces the DIBL effect. As a result, the uncertainty in Vth arising from GVs rapidly falls with Vdd. On the other hand, the RDF-induced Vth variation is independent of Vdd changes and is solely a function of the channel area [30]. Therefore, the Vth variation resulting from RDF becomes dominant as Vdd approaches Vth.

Figure 3: Optimal selection of dual threshold and supply voltages under process variation (normalized total mean power plotted against Vth2 (V) and Vdd2 (V)).

Due to the independent nature of RDF variations, it is possible to reduce their impact on circuit performance through averaging. Therefore, upsizing the device, with its enlarged channel area, can be an effective way to mitigate variability. According to [31], the Vth variance σVth resulting from RDF is roughly proportional to (WL)^(−1/2), which means we can increase the transistor width, the channel length, or both. Conventional sizing approaches focus on tuning the transistor width for performance. In terms of process variability mitigation, the measurement data of Vth variation for 4 different device sizes is plotted in Figure 4, which shows that increasing the transistor width is the more effective way to reduce the Vth variance [24]. Although a larger transistor width means larger leakage power, the fluctuations in leakage power are reduced and the design space for resource binding is significantly enlarged; thus, using resources with larger sizes in the design may still improve the parametric yield.
In this work, we upsize all the function units in the resource library to generate alternatives for power tuning and variability mitigation. The sizing is performed on all the gates with two different settings: the basic size (1W1L) and the double-width size (2W1L). We then perform the variation characterization for the upsized function units under all four voltage "corners" presented in the previous section. The characterization results will be presented in Section 6.
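Since σVth from RDF scales roughly as (WL)^(−1/2) [31], the sizing settings above have a predictable effect on variability. A two-line sketch of that scaling law (the 50 mV baseline matches the σ assumed in Section 3; the scaling exponent is the only modeling assumption):

```python
import math

SIGMA_VTH_1W1L = 0.050  # V; baseline sigma at the basic size (1W1L)

def sigma_vth(width_mult, length_mult):
    """RDF-induced Vth sigma scales roughly as (W*L)**-0.5 [31]."""
    return SIGMA_VTH_1W1L / math.sqrt(width_mult * length_mult)

print(sigma_vth(2, 1))  # 2W1L: ~0.0354 V, a 1/sqrt(2) reduction
print(sigma_vth(2, 2))  # 2W2L: 0.025 V
```

Doubling only the width (the 2W1L setting used in this work) already buys a 1/√2 reduction in σVth, consistent with the trend in Figure 4.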
Figure 4: Vth variations with different device sizes (Vth variance in mV versus gate width and length, normalized to the minimal size).

4. Yield Analysis in Statistical High-Level Synthesis

In this section, a parametric yield analysis framework for statistical HLS is presented. We first show the necessity of statistical analysis with a simple motivational example and then demonstrate the statistical timing and power analysis for HLS, as well as the modeling and integration of level converters for multi-Vdd HLS.
4.1. Parametric Yield. To bring process-variation awareness to the high-level synthesis flow, we first introduce a new metric called parametric yield. The parametric yield is defined as the probability of the synthesized hardware meeting a specified constraint, Yield = P(Y ≤ Ymax), where Y can be delay or power.
Figure 5 shows a motivational example of yield-aware analysis. Three resource units R1, R2, and R3 have the same circuit implementation but operate at different supply or threshold voltages. Figure 5 shows the delay and power distributions for R1, R2, and R3. In this case the mean power follows μP(R3) < μP(R2) < μP(R1), and the mean delay follows μD(R1) < μD(R2) < μD(R3), which is as expected since power reduction usually comes at the cost of increased delay. The clock cycle time TCLK and the power consumption constraint PLMT (e.g., the TDP (thermal design power) of most modern microprocessors) are also shown in the figure. If the variation is disregarded and nominal-case analysis is used, any of the resource units can be chosen since they all meet timing. In this case, R3 would be chosen as it has the lowest power consumption. However, from a statistical point of view, R3 has a low timing yield (approximately 50%) and is very likely to cause timing violations. In contrast, with corner-based worst-case analysis only R1 can be chosen under the clock cycle time constraint (the worst-case delay of R2 slightly violates the limit), whereas R1 has a poor power yield. In fact, if we set a timing yield constraint instead of enforcing the worst-case delay limitation, R2 can be chosen with a slight timing yield loss but a well-balanced delay and power tradeoff. Therefore, a yield-driven statistical approach is needed to explore the design space and maximize one parametric yield under other parametric yield constraints.

Figure 5: Motivating example of yield-driven synthesis (delay and power PDFs of R1, R2, and R3 against the TCLK and PLMT limits).

4.2. Statistical Timing and Power Analysis for HLS. High-level synthesis (HLS) is the process of transforming a behavioral description into a register-level structural description. Operations such as additions and multiplications in the DFG are scheduled into control steps. During the resource allocation and binding stages, operations are bound to corresponding function units in the resource library meeting type and latency requirements.
Given the clock cycle time TCLK, the timing yield of the entire DFG, YieldT, is defined as

YieldT = P(T1 ≤ TCLK, T2 ≤ TCLK, ..., Tn ≤ TCLK), (5)

where P() is the probability function and T1, T2, ..., Tn are the arrival time distributions at control steps 1, 2, ..., n, respectively.
The arrival time distribution of each clock cycle can be computed from the delay distributions of the function units bound at that cycle. Two operations, sum and max, are used to compute the distributions:
(i) the sum operation is used when two function units are chained in cascade within a clock cycle, as shown in CC1 and CC2 of Figure 6. The total delay can be computed as the "sum" of their delay distributions (normal distributions assumed);
(ii) the max operation is used when the outputs of two or more units are fed to another function unit in the same clock cycle, as shown in CC1 of Figure 6. The "maximum" delay distribution can be computed out of the contributing distributions using tightness probability and moment matching [19].
With these two operations, the arrival time distribution of each clock cycle is computed, and the overall timing yield of the DFG is obtained using (5).
The total power consumption of a DFG can be computed as the sum of the power consumptions of all the function units used in the DFG. Given a power limitation PLMT, the power yield of the DFG, YieldP, is computed as the probability that the total power PDFG is less than the requirement, as expressed in (6):

YieldP = P(PDFG ≤ PLMT). (6)

Since dynamic power is relatively immune to process variations, it is regarded as a constant portion which only affects the mean value of the total power consumption. Therefore, the total power is still normally distributed, although statistical analysis is only applied to the leakage power. As mentioned in Section 3, our proposed yield-driven statistical framework can be stacked on existing approaches for dynamic power optimization to further reduce the total power consumption of circuits.

4.3. Voltage Level Conversion in HLS. In designs using multi-Vdd resource units, voltage level converters are required when a low-voltage resource unit drives a high-voltage resource unit. Level conversion can be performed either synchronously or asynchronously. Synchronous level conversion is usually embedded in flip-flops and occurs at the active clock edge, while asynchronous level converters can be inserted anywhere within the combinational logic block.
When process variations are considered, asynchronous level converters are even more favorable, because they are not bounded by clock edges, and timing slacks can be passed through the converters. Therefore, time borrowing can happen between low-voltage and high-voltage resource units. As slow function units (due to variations) may get more time to finish execution, the timing yield can be improved, and the impact of process variations is consequently reduced.
While many fast and low-power level conversion circuits have been proposed recently, this paper uses the multi-Vth level converter presented in [32], taking advantage of the fact that there is no extra process technology overhead for multi-Vth level converters, since multi-Vth is already deployed for the function units. The level converter is composed of two dual-Vth cascaded inverters. Its delay and power are characterized in HSPICE using the listed parameters [32]. The delay penalty of a level converter can be accounted for by summing its delay with the delay of the function unit it is associated with. The power penalty can be addressed by counting the level converters used in the DFG and adding the corresponding power to the total power consumption.

5. Yield-Driven Power Optimization Algorithm

In this section, we propose our yield-driven power optimization framework based on the aforementioned statistical
timing and power yield analysis. During the high-level synthesis design loop, resource binding selects the optimal resource instances in the resource library and binds them to the scheduled operations at each control step. A variation-aware resource binding algorithm is then proposed to maximize the power yield under a preset timing yield constraint, by iteratively searching for the operations with the maximum potential of timing/power yield improvement and replacing them with better candidates in the multi-Vth/Vdd resource library.

(a) No level conversion; (b) synchronous conversion (SLCs); (c) asynchronous conversion (ALCs).
Figure 6: Timing yield computation in multi-Vdd high-level synthesis with different level conversions. Shadowed operations are bound to function units with low supply voltages, and the bars indicate the insertion of level converters.

5.1. Variation-Aware Resource Binding Algorithm Overview. Our variation-aware resource binding algorithm takes a search strategy called variable-depth search [19, 33, 34] to iteratively improve the power yield under performance constraints. The outline of the algorithm is shown in Algorithm 1, where a DFG is initially scheduled and bound to the resource library with nominal voltages (Vth^L, Vdd^H). A lower-bound constraint on the timing yield is set, so that the probability that the design can operate at a given clock frequency will be larger than or equal to a preset threshold (e.g., 95%). In the algorithm, a move is defined as a local and incremental change of the resource bindings. As shown in the subroutine GenMove in Algorithm 1, the algorithm generates a set of moves and finds a sequence of moves that maximizes the accumulated gain, which is defined as α·ΔYieldD + ΔYieldP, where α is a weighting factor to balance the weights of the timing and power yield improvements. The optimal sequence of moves is then applied to the DFG, and the timing and power yields of the DFG are updated before the next iteration. The iterative search ends when there is no yield improvement or the timing yield constraint is violated.
Note that our worst-case resource binding algorithm uses the same search strategy (variable-depth search) [19, 33, 34] as the variation-aware resource binding algorithm. The key difference is that, instead of iteratively improving the power yield under performance constraints, the worst-case resource binding algorithm iteratively reduces the power consumption under specified performance constraints, where both the power consumption calculation and the performance constraints are specified as deterministic numbers, rather than using the concepts of power yield and performance yield.

VABinding(DFG, ResLib, Constraints, LCStrategy)
  Initialization
  (1) Scheduling using the ASAP strategy
  (2) Initial binding to (Vth^L, Vdd^H) resources
  Variation-aware resource binding
  (3) while ΔYieldP > 0 AND YieldD ≥ Constraint
  (4)   do for i ← 1 to MAXMOVES
  (5)     do Gain_i ← GenMove(DFG, ResLib, LCStrategy)
  (6)        Append Gain_i to Gain List
  (7)   Find the subsequence Gain_1, ..., Gain_k in Gain List so that G = Gain_1 + ··· + Gain_k is maximized
  (8)   if G > 0
  (9)     do Accept moves 1 ··· k
  (10)       Evaluate ΔYieldP and YieldD

GenMove(DFG, ResLib, LCStrategy)
  (1) MOVE: Choose a move using the steepest-descent heuristic [33]
  (2) Check whether and where level conversion is needed
  (3) if LCStrategy = Avoidance AND NeedConversion
  (4)   do goto MOVE
  (5) if LCStrategy = Synchronous AND NeedConversion
        Check whether the conversion is synchronous or not
  (6)   do if the conversion is inside operation chaining
  (7)     do goto MOVE
  (8) Count the overhead of level conversion
  (9) Evaluate the Gain of this move
  (10) Return Gain

Algorithm 1: Outline of the variation-aware resource binding algorithm.

5.2. Voltage Level Conversion Strategies. Moves during the iterative search may result in low-voltage resource units driving high-voltage resource units. Therefore, level conversion is needed during resource binding. However, if resources are selected and bound so that low-voltage resource units never drive high-voltage ones, level conversion will not be necessary, and the delay and power overheads brought by level converters can be avoided. This reduces the flexibility of resource binding for multivoltage module combinations and may consequently decrease the attainable yield improvement. The tradeoff in this conversion-avoidance strategy can be explored and evaluated within our proposed power optimization algorithm.
We also incorporate two other strategies of level conversion in the power optimization algorithm for comparison. All three strategies are listed as follows:
(i) level conversion avoidance: resource binding is performed with the objective that low-voltage resources never drive high-voltage ones. As shown in Figure 6(a), no dark-to-light transition between operations is allowed (where dark operations are bound to low-Vdd units), so that level conversion is avoided. This is the most conservative strategy;
(ii) synchronous level conversion: voltage level conversion is done synchronously in the level-converting flip-flops (SLCs). As shown in Figure 6(b), the dark-to-light transition only happens at the beginning of each clock cycle. The flip-flop structure proposed in [35] is claimed to have a smaller delay than the combination of an asynchronous converter and a conventional flip-flop. However, as discussed previously, synchronous level conversion may reduce the flexibility of resource binding as well as the possibility of timing borrowing. The effectiveness of this strategy is to be explored by the optimization algorithm;
(iii) asynchronous level conversion: asynchronous level converters (ALCs) are inserted wherever level conversion is needed, as the dark-to-light transition can happen anywhere in Figure 6(c). This aggressive strategy provides the maximum flexibility for resource binding and timing borrowing. Although it brings in delay and power overhead, it still has great potential for timing yield improvement.

5.3. Moves Used in the Iterative Search. In order to fully explore the design space, three types of moves are used in the iterative search for resource binding:
(i) resource rebinding: in this move, an operation is assigned to a different function unit in the library with different timing and power characteristics. The key benefit of the multi-Vth/Vdd techniques is that they provide an enlarged design space for exploration, and local optimal improvements are more likely to be obtained;
(ii) resource sharing: in this move, two operations that were originally bound to different function units are merged to share the same function unit. This type of move reduces the resource usage and consequently improves the power yield;
(iii) resource splitting: in this move, an operation that originally shared a function unit with other operations is split from the shared function unit. This type of move might lead to other moves such as resource rebinding and resource sharing.
After each move, the algorithm checks where the low-supply-voltage function units are used and decides whether to insert or remove level converters, according to the predefined level conversion strategy. If a move is against the strategy, it is revoked, and new moves are generated until a qualifying move is found.

5.4. Algorithm Analysis. It has to be noted that, in the procedure GenMove shown in Algorithm 1, even though the returned Gain might be negative, it still could be accepted. Since a sequence with a cumulative positive gain is considered, the negative gains help the algorithm escape from local minima through hill climbing.
As for the computational complexity, it is generally not possible to give nontrivial upper bounds on the run time of local search algorithms [33]. However, for variable-depth search in general graph partitioning, Aarts and Lenstra [33] found a near-optimal growth rate of run time of O(n log n), where n is the number of nodes in the graph. In our proposed algorithm, the timing and power yield evaluation, as well as the level converter insertion, are performed at each move. Since the yield can be updated using a gradient computation approach [19], the run time for each move is at most O(n). Therefore, the overall run time of the proposed resource binding algorithm is O(n^2 log n).
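The prefix-gain acceptance rule at the heart of Algorithm 1 — commit the subsequence Gain_1, ..., Gain_k with the maximum cumulative gain G, even when individual gains are negative — is the classic variable-depth-search device [33]. A schematic sketch, with a stubbed move generator standing in for GenMove (the real algorithm derives each gain from the ΔYield evaluation described above):

```python
import random

def best_prefix(gains):
    """Return (k, G): the prefix length k maximizing G = gains[0]+...+gains[k-1].
    Locally negative gains survive if a later move recovers them (hill climbing)."""
    best_k, best_g, run = 0, 0.0, 0.0
    for i, g in enumerate(gains, 1):
        run += g
        if run > best_g:
            best_k, best_g = i, run
    return best_k, best_g

def variable_depth_search(gen_move, max_moves=8, max_passes=50):
    """Skeleton of the iterative search: each pass proposes a move sequence and
    commits only the best-gain prefix; stop when no improving prefix exists."""
    total = 0.0
    for _ in range(max_passes):
        gains = [gen_move() for _ in range(max_moves)]
        k, g = best_prefix(gains)
        if k == 0:           # no improving prefix: local optimum reached
            break
        total += g           # commit moves 1..k (yield re-evaluation omitted here)
    return total

random.seed(1)
print("accumulated gain:", variable_depth_search(lambda: random.uniform(-1.0, 1.0)))
```

In the full algorithm each pass would also re-check the timing yield constraint and revoke moves that violate the chosen level conversion strategy.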
(a) Vth = 0.37 V; (b) Vth = 0.56 V. Delay (ns) for bkung, kogge, pmult, booth, and mux21 at sizes 1W1L and 2W1L.
Figure 7: Delay characterization of function units with multi-Vth/Vdd and variation awareness.

(a) Vth = 0.37 V; (b) Vth = 0.56 V.
Figure 8: Delay characterization of function units with multi-Vth/Vdd and variation awareness.
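As a concrete companion to Section 4.2 before the numbers are discussed: chained units add their (assumed normal) delay distributions, converging units take a moment-matched max via the tightness probability (the approach cited as [19], usually attributed to Clark), and both yields reduce to Gaussian tail probabilities as in (5)-(6). All numeric delays and powers below are illustrative, not characterization data from this paper.

```python
import math

def phi(x):   # standard normal pdf
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(x):   # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def stat_sum(d1, d2):
    """'sum' of two independent normal delays (units chained in a cycle)."""
    (m1, s1), (m2, s2) = d1, d2
    return m1 + m2, math.sqrt(s1 * s1 + s2 * s2)

def stat_max(d1, d2):
    """Clark-style moment-matched 'max' of two independent normal delays."""
    (m1, s1), (m2, s2) = d1, d2
    a = math.sqrt(s1 * s1 + s2 * s2)   # assumes a > 0
    t = (m1 - m2) / a                  # tightness: P(d1 > d2) = Phi(t)
    m = m1 * Phi(t) + m2 * Phi(-t) + a * phi(t)
    e2 = ((m1 * m1 + s1 * s1) * Phi(t)
          + (m2 * m2 + s2 * s2) * Phi(-t)
          + (m1 + m2) * a * phi(t))
    return m, math.sqrt(max(e2 - m * m, 0.0))

# Toy clock cycle: two parallel units (max) feed a chained third unit (sum)
arrival = stat_sum(stat_max((3.0, 0.3), (3.2, 0.4)), (1.5, 0.2))
T_CLK = 5.5
timing_yield = Phi((T_CLK - arrival[0]) / arrival[1])   # per-cycle P(T <= T_CLK), cf. (5)

# Power yield per (6): total power treated as normal, P_LMT deterministic
mu_p, sigma_p, P_LMT = 40.0, 4.0, 45.0
power_yield = Phi((P_LMT - mu_p) / sigma_p)
print("timing yield:", timing_yield, "power yield:", power_yield)
```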
6. Experimental Results

In this section, we present the experimental results of our variation-aware power optimization framework for high-level synthesis. The results show that our method can effectively improve the overall power yield of given designs and reduce the impact of process variations.
We first show the variation-aware delay and power characterization of the function units. The characterization is based on the NCSU FreePDK 45 nm technology library [23]. The voltage corners for the characterization are set as Vth^L = 0.37 V, Vth^H = 0.56 V, Vdd^L = 0.9 V, and Vdd^H = 1.1 V. The characterization results for five function units, including two 16-bit adders bkung and kogge, two 8-bit × 8-bit multipliers pmult and booth, and one 16-bit multiplexer mux21, are depicted in Figures 7, 8, 9, and 10. In the figures, the color bars show the nominal-case values, while the error bars show the deviations. It is clearly shown that with lower Vdd and/or higher Vth, significant power reductions are achieved at the cost of delay penalties. Meanwhile, upsizing the transistors can improve the circuit performance but also yields larger power consumption. In terms of variability mitigation, both voltage scaling and device sizing have a significant impact on the delay and leakage variations. We can explore this trend further in Figures 11 and 12, where the delay and power distributions of the function unit bkung are sampled at a third Vth of 0.45 V. The plotted curves show that the magnitude of delay variation increases for higher-Vth units, which means larger probabilities of timing violations if these high-Vth units are placed on near-critical paths. The figures also show that upsizing the device can effectively reduce the delay and leakage variations, as depicted by the error bars in Figures 11 and 12.

(a) Vth = 0.37 V; (b) Vth = 0.56 V. Leakage (μW).
Figure 9: Leakage power characterization of function units with multi-Vth/Vdd and variation awareness.

(a) Vth = 0.37 V; (b) Vth = 0.56 V.
Figure 10: Leakage power characterization of function units with multi-Vth/Vdd and variation awareness.

Figure 11: The delay and power tradeoff with increasing Vth for bkung with default device sizing (1W1L).

Figure 12: The delay and power tradeoff with increasing Vth for bkung with doubled device sizing (2W1L).

With the variation-aware multi-Vth/Vdd resource library characterized, our proposed resource binding algorithm is applied to a set of industrial high-level synthesis benchmarks, which are listed in Table 1. A total power limitation PLMT is set for each benchmark to evaluate the power yield improvement. The dynamic power consumption of the function units is estimated by Synopsys Design Compiler with multi-Vth/Vdd technology libraries generated by Liberty NCX. In this work, with the FreePDK 45 nm technology, the dynamic power is about 2 times the mean leakage power. The power yield before and after the improvement is then computed using (6) in Section 4.2. The proposed resource binding algorithm is implemented in C++, and experiments are conducted on a Linux workstation with an Intel Xeon 3.2 GHz processor and 2 GB RAM. All the experiments run in less than 60 s of CPU time.

Table 1: The profile of test benchmarks.

Name    Nodes number  Edges number  Add number  Mult number
CHEM    33            51            15          18
EWF     34            49            20          12
PR      44            132           26          16
WANG    52            132           26          24
MCM     96            250           64          30
HONDA   99            212           45          52
DIR     150           312           84          64
STEAM   222           470           105         115

We compare our variation-aware resource binding algorithm against the traditional deterministic approach, which uses the worst-case (μ + 3σ) delay values of the function units in the multi-Vth/Vdd library to guide the resource binding. For the deterministic approach, we leverage a commercial HLS tool called Catapult-C to obtain the delay/area/power estimation. The worst-case-based approach naturally leads to 100% timing yield; however, the power yield is poor, as shown in the motivational example in Figure 5. In contrast, our yield-aware statistical optimization algorithm takes the delay and power distributions as inputs, explores the design space with the guidance of YieldGain, and iteratively improves the power yield under a slight timing yield loss. The comparison results are shown in Figures 13, 14, 15, and 16, respectively.
Figure 13 shows the power yield improvement against the worst-case-delay-based approach with different level conversion strategies. A fixed timing yield constraint of 95% is set for the proposed variation-aware algorithm, using the function units with default device sizes (1W1L). The overheads of the level converters used in this paper are listed in Table 2. The usage of function units and level converters under the three listed conversion strategies (conversion avoidance, synchronous conversion, and asynchronous conversion) is listed in Table 3, in which "Vdd-H FUs number" and "Vdd-L FUs number" show the numbers of function units with high/low supply voltages, respectively, and "LCs number" counts the number of converters used in the design. The last column shows the total power overhead of the asynchronous level converters. The average power yield improvements for the three strategies are 11.7%, 17.9%, and 22.2%, respectively. From Figure 13 and Table 3 we can see that larger power yield improvements can be achieved when more low-Vdd function units are used in the design. The results also validate our claims in Sections 4.3 and 5.2 that asynchronous level conversion is more favorable in statistical optimization, because it enables timing borrowing between function units and leads to a timing yield improvement that can compensate for the overhead of the converters. Therefore, compared to the synchronous case, more asynchronous converters are used while yielding better results.

Table 2: The overheads of voltage level converters.

Type                     Delay (ps)   Power (nW)
Synchronous converter    80           Negligible
Asynchronous converter   200          3790

Table 3: The usage of function units and level converters with different level conversion strategies.

Bench    LC-avoidance      LC-synchronous          LC-asynchronous
name     Vdd-H   Vdd-L     Vdd-H   Vdd-L   LCs     Vdd-H   Vdd-L   LCs     Overhead
CHEM     5       1         3       3       1       2       4       2       4.2%
EWF      6       1         4       3       1       3       4       2       3.7%
PR       7       2         6       3       2       4       5       3       4.1%
WANG     9       2         8       3       2       5       6       3       3.4%
MCM      20      4         16      8       4       15      9       5       2.4%
HONDA    22      5         17      10      6       15      12      6       3.9%
DIR      28      4         20      12      8       14      18      12      4.8%
STEAM    34      8         25      19      11      21      23      13      5.0%

Figure 13: Power yield improvement against the deterministic worst-case approach with different level conversion strategies and a timing yield constraint of 95%.

Figure 14 shows the power yield improvement with the multi-Vth technique only, which means only the resource units with the nominal supply voltage Vdd^H can be selected. In this case, no level conversion is needed, so there is no overhead for level converters. Only function units with default device sizes (1W1L) are used. The average power yield improvements against the worst-case-delay-based approach under timing yield constraints of 99%, 95%, and 90% are 5.7%, 7.9%, and 9.8%, respectively. At a timing yield of 95%, the average power yield improvement (7.9%) is smaller than in the LC-avoidance case (11.5%) in Figure 13, which shows that using multi-Vdd resource units can further improve the power yield.

Figure 14: Power yield improvement against the deterministic worst-case approach with multi-Vth only and different timing yield constraints.

Figure 15: Power yield improvement against the deterministic worst-case approach with asynchronous level conversion and different timing yield constraints.

Figure 15 shows the power yield improvement against the worst-case-delay-based approach under different timing yield constraints. Asynchronous level conversion is chosen in this series of experiments. Only function units with default device sizes (1W1L) are used. The average power yield improvements under timing yield constraints of 99%, 95%, and 90% are 11.6%, 20.6%, and 26.9%, respectively. It is clearly shown that the power yield improvement largely depends on how much timing yield loss is affordable for the design. This will further push forward the design space exploration for a
well-balanced timing and power tradeoff.
Figure 16 shows the power yield improvement against the worst-case-delay-based approach, using function units with different device sizes. Asynchronous level conversion is chosen in this series of experiments, and a fixed timing yield constraint of 95% is set for the proposed variation-aware algorithm. Compared with the average 20.6% yield improvement in the case using the default device size (1W1L) only, using both default-size (1W1L) and double-size (2W1L) resources can lead to an average power yield improvement of 30.9%. Obviously, upsized devices with higher performance and smaller variability provide additional flexibility for design space exploration; however, this is achieved at the cost of larger silicon area.

Figure 16: Power yield improvement against the deterministic worst-case approach with asynchronous level conversion and different timing yield constraints. (Legend: 1W1L; 1W1L + 2W1L.)

7. Conclusions

In this paper, we investigate the impact of process variations on multi-Vth/Vdd and device sizing techniques for low-power high-level synthesis. We characterize the delay and power variations of function units under different threshold and supply voltages, and feed the variation-characterized resource library to the HLS design loop. Statistical timing and power analysis for high-level synthesis is then introduced to help our proposed resource binding algorithm explore the design space and maximize the power yield of designs under given timing yield constraints. Experimental results show that significant power reduction can be achieved with the proposed variation-aware framework, compared with traditional worst-case-based deterministic approaches.

Acknowledgment

This work was supported in part by NSF Grants 0643902, 0903432, and 1017277, NSFC Grants 60870001/61028006, and a grant from SRC.

References

[1] P. Coussy and A. Morawiec, High-Level Synthesis: From Algorithm to Digital Circuit, Springer, 2009.
[2] W. T. Shiue, "High level synthesis for peak power minimization using ILP," in Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors, pp. 103–112, July 2000.
[3] K. S. Khouri and N. K. Jha, "Leakage power analysis and reduction during behavioral synthesis," IEEE Transactions on Very Large Scale Integration Systems, vol. 10, no. 6, pp. 876–885, 2002.
[4] X. Tang, H. Zhou, and P. Banerjee, "Leakage power optimization with dual-Vth library in high-level synthesis," in Proceedings of the 42nd Design Automation Conference (DAC '05), pp. 202–207, June 2005.
[5] W. L. Hung, X. Wu, and Y. Xie, "Guaranteeing performance yield in high-level synthesis," in Proceedings of the International Conference on Computer-Aided Design (ICCAD '06), pp. 303–309, November 2006.
[6] J. Jung and T. Kim, "Timing variation-aware high-level synthesis," in Proceedings of the IEEE/ACM International Conference
[9] A. Srivastava, T. Kachru, and D. Sylvester, "Low-power-design space exploration considering process variation using robust optimization," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 1, pp. 67–78, 2007.
[10] K. Usami and M. Igarashi, "Low-power design methodology and applications utilizing dual supply voltages," in Proceedings of the Design Automation Conference (ASPDAC '00), pp. 123–128, Yokohama, Japan, 2000.
[11] A. Srivastava and D. Sylvester, "Minimizing total power by simultaneous Vdd/Vth assignment," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 23, no. 5, pp. 665–677, 2004.
[12] C. P. Chen, C. C. N. Chu, and D. F. Wong, "Fast and exact simultaneous gate and wire sizing by Lagrangian relaxation," in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD '98), pp. 617–624, ACM, New York, NY, USA, 1998.
[13] S. Sirichotiyakul, T. Edwards, C. Oh et al., "Stand-by power minimization through simultaneous threshold voltage selection and circuit sizing," in Proceedings of the 36th Annual Design Automation Conference (DAC '99), pp. 436–441, June 1999.
[14] L. Wei, K. Roy, and C. K. Koh, "Power minimization by simultaneous dual-Vth assignment and gate-sizing," in Proceedings of the 22nd Annual Custom Integrated Circuits Conference (CICC '00), pp. 413–416, May 2000.
[15] P. Pant, R. K. Roy, and A. Chatterjee, "Dual-threshold voltage assignment with transistor sizing for low power CMOS circuits," IEEE Transactions on Very Large Scale Integration Systems, vol. 9, no. 2, pp. 390–394, 2001.
[16] T. Karnik, Y. Ye, J. Tschanz et al., "Total power optimization by simultaneous dual-Vt allocation and device sizing in high performance microprocessors," in Proceedings of the 39th Annual Design Automation Conference (DAC '02), pp. 486–491, June 2002.
[17] S. Insup, P. Seungwhun, and S. Youngsoo, "Register allocation for high-level synthesis using dual supply voltages," in Proceedings of the 46th ACM/IEEE Design Automation Conference (DAC '09), pp. 937–942, July 2009.
[18] S. P. Mohanty and E. Kougianos, "Simultaneous power fluctuation and average power minimization during nano-CMOS behavioral synthesis," in Proceedings of the 20th International Conference on VLSI Design held jointly with the 6th International Conference on Embedded Systems (VLSID '07), pp. 577–582, January 2007.
[19] F. Wang, X. Wu, and Y. Xie, "Variability-driven module selection with joint design time optimization and post-silicon tuning," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '08), pp. 2–9, March 2008.
[20] R. W. Keyes, "Physical limits in digital electronics," Proceedings of the IEEE, vol. 63, no. 5, pp. 740–767, 1975.
[21] D. S. Boning and S.
Nassif, “Models of process variations on Computer-Aided Design (ICCAD ’07), pp. 424–428, Novem- in device and interconnect,” in Design of High Performance ber 2007. Microprocessor Circuits, IEEE Press, 2000. [7] F. Wang, G. Sun, and Y. Xie, “A variation aware high level [22] B. Zhai, S. Hanson, D. Blaauw, and D. Sylvester, “Analysis and synthesis framework,” in Proceedings of the Design, Automation mitigation of variability in subthreshold design,” in Proceed- and Test in Europe (DATE ’08), pp. 1063–1068, March 2008. ings of the International Symposium on Low Power Electronics [8] G. Lucas, S. Cromar, and D. Chen, “FastYield: Variation-aware, and Design, pp. 20–25, August 2005. layout-driven simultaneous binding and module selection for [23] NCSU, “45 nm FreePDK,” http://www.eda.ncsu.edu/wiki/ performance yield optimization,” in Proceedings of the Asia and FreePDK. South Pacific Design Automation Conference (ASP-DAC ’09), [24] M. Meterelliyoz, A. Goel, J. P. Kulkarni, and K. Roy, “Accurate pp. 61–66, January 2009. characterization of random process variations using a robust 14 Journal of Electrical and Computer Engineering
low-voltage high-sensitivity sensor featuring replica-bias cir- cuit,” in Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC ’10), pp. 186–187, February 2010. [25] C. Jacoboni and P. Lugli, The Monte Carlo Method for Semi- conductor Device Simulation, Springer, 1990. [26] N. S. Kim, T. Austin, D. Blaauw et al., “Leakage current: Moore’slawmeetsstaticpower,”Computer, vol. 36, no. 12, pp. 68–64, 2003. [27] K. M. Cao, W. C. Lee, W. Liu et al., “BSIM4 gate leakage model including source-drain partition,” in Proceedings of the IEEE International Electron Devices Meeting, pp. 815–818, December 2000. [28] N. C. Beaulieu, A. A. Abu-Dayya, and P. J. McLane, “Compar- ison of methods of computing lognormal sum distributions and outages for digital wireless applications,” in Proceedings of the IEEE International Conference on Communications,pp. 1270–1275, May 1994. [29] S. H. Kulkarni, A. N. Srivastava, and D. Sylvester, “A new algo- rithm for improved VDD assignment in low power dual VDD systems,” in Proceedings of the 2004 International Symposium on Lower Power Electronics and Design (ISLPED ’04), pp. 200– 205, August 2004. [30] M. J. M. Pelgrom, A. C. J. Duinmaijer, and A. P. G. Welbers, “Matching properties of MOS transistors,” IEEE Journal of Solid-State Circuits, vol. 24, no. 5, pp. 1433–1440, 1989. [31] J. Kwong and A. P. Chandrakasan, “Variation-driven device sizing for minimum energy sub-threshold circuits,” in Pro- ceedings of the 11th ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED ’06), pp. 8–13, October 2006. [32] S. A. Tawfik and V. Kursun, “Multi-Vth level conversion cir- cuits for multi-VDD systems,” in Proceedings of the IEEE Inter- national Symposium on Circuits and Systems (ISCAS ’07),pp. 1397–1400, May 2007. [33] E. Aarts and J. K. Lenstra, Local Search in Combinatorial Optimization, Princeton University Press, 2003. [34] A. Raghunathan and N. K. 
Jha, “Iterative improvement algo- rithm for low power data path synthesis,” in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 597–602, November 1995. [35] F. Ishihara, F. Sheikh, and B. Nikolic,´ “Level conversion for dual-supply systems,” IEEE Transactions on Very Large Scale Integration Systems, vol. 12, no. 2, pp. 185–195, 2004. Hindawi Publishing Corporation Journal of Electrical and Computer Engineering Volume 2012, Article ID 691864, 24 pages doi:10.1155/2012/691864
Research Article Task-Level Data Model for Hardware Synthesis Based on Concurrent Collections
Jason Cong, Karthik Gururaj, Peng Zhang, and Yi Zou
Computer Science Department, University of California, Los Angeles, Los Angeles, CA 90095, USA
Correspondence should be addressed to Karthik Gururaj, [email protected]
Received 17 October 2011; Revised 30 December 2011; Accepted 11 January 2012
Academic Editor: Yuan Xie
Copyright © 2012 Jason Cong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The ever-increasing design complexity of modern digital systems makes it necessary to develop electronic system-level (ESL) methodologies with automation and optimization at higher abstraction levels. How concurrency is modeled in the application specification plays a significant role in ESL design frameworks. State-of-the-art concurrent specification models are not suitable for modeling task-level concurrent behavior for the hardware synthesis design flow. Based on the Concurrent Collections (CnC) model, which provides the maximum freedom of task rescheduling, we propose a task-level data model (TLDM), targeted at task-level optimization in hardware synthesis for data processing applications. Polyhedral models are embedded in TLDM for concise expression of task instances, array accesses, and dependencies. Examples are shown to illustrate the advantages of our TLDM specification compared to other widely used concurrency specifications.
1. Introduction

As electronic systems become increasingly complex, the motivation for raising the level of design abstraction to the electronic system level (ESL) increases. One of the biggest challenges in ESL design and optimization is that of efficiently exploiting and managing the concurrency of a large number of parallel tasks and design components in the system. In most ESL methodologies, task-level concurrency specifications, as the starting point of the optimization flow, play a vital role in the final implementation quality of results (QoR). The concurrency specification encapsulates detailed behavior within the tasks, and explicitly specifies the coarse-grained parallelism and the communications between the tasks. System-level implementation and optimization can be performed directly from the system-level information of the application. Several ESL methodologies have been proposed previously. We refer the reader to [1, 2] for a comprehensive survey of state-of-the-art ESL design flows.

High-level synthesis (HLS) is a driving force behind ESL design automation. Modern HLS systems can generate register-transfer level (RTL) hardware specifications that come quite close to hand-generated designs [3] for synthesizing computation-intensive modules into a form of hardware accelerators with bus interfaces. At this time, it is not easy for current HLS tools to handle task-level optimizations such as data transfer between accelerators, programmable cores, and memory hierarchies. The sequential C/C++ programming language has inherent limitations in specifying task-level parallelism, while SystemC requires many implementation details, such as explicitly defined port/module structures. Both languages impose significant constraints on optimization flows and heavy burdens on algorithm/software designers. A tool-friendly and designer-friendly concurrency specification model is vital to the successful application of automated ESL methodology in practical designs.

The topic of concurrency specification has been researched for several decades—from the very early days of computer science to today's research in parallel programming. From the ESL point of view, some of the results were too general and led to high implementation costs for general hardware structures [4, 5], and some results were too restricted and could only model a small set of applications [6, 7]. In addition, most of the previous models focused only on the description of the behavior or computation, but introduced redundant constraints unintended for implementation—such as a restrictive execution order of iterative task instances. CnC [8] proposed the concept of decoupling the algorithm specification from the implementation optimization; this provides larger freedom for task scheduling optimization and a larger design space for potentially better implementation QoR. But CnC was originally designed for multicore processor-based platforms and contains a number of dynamic syntax elements in the specification—such as dynamic task instance generation, dynamic data allocation, and unbounded array indices. A hardware-synthesis-oriented concurrency specification covering both behavior correctness and optimization opportunities is needed by ESL methodologies to automatically optimize the design QoR.

In this paper we propose a task-level concurrency specification (TLDM) targeted at ESL hardware synthesis based on CnC. It has the following advantages:

(1) allowing a maximum degree of concurrency for task scheduling;
(2) support for the integration of different module-level specifications;
(3) support for mapping to heterogeneous computation platforms, including multicore CPUs, GPUs, and FPGAs (e.g., as described in [9]);
(4) static and concise syntax for hardware synthesis.

The remainder of our paper is organized as follows. Section 2 briefly reviews previous work on concurrency specifications. Section 3 overviews the basic concepts in CnC. Section 4 presents the details of our TLDM specification. In Section 5 we illustrate the benefits of our TLDM specification with some concrete examples. Finally, we conclude our paper in Section 6 with a discussion of ongoing work.

2. Concurrent Specification Models

A variety of task-level concurrent specification models exist, and each concurrent specification model has its underlying model of computation (MoC) [10]. One class of these models is derived from precisely defined MoCs. Another class is derived from extensions of sequential programming languages (like SystemC [11]) or hardware description languages (like Bluespec SystemVerilog [12]) in which the underlying MoC has no precise or explicit definition. These languages always have the ability to specify different MoCs at multiple abstraction levels. In this section we focus on the underlying MoCs in the concurrent specification models and ignore the specific languages that are used to textually express these MoCs.

The analyzability and expressibility of a concurrent specification model are determined by the underlying MoC [10]. Different MoCs define different aspects of task concurrency and implementation constraints for the applications. The intrinsic characteristics of each specific MoC are used to build an efficient synthesizer and optimizer for the MoC. The choice of the MoC greatly influences the applicable optimizations and the final implementation results as well. The key considerations in MoC selection are the following.

(i) Application scope: the range of applications that the MoC can model or efficiently model.

(ii) Ease of use: the effort required for a designer to specify the application using the MoC.

(iii) Suitability for automated optimization: while a highly generic MoC might be able to model a large class of applications with minimal user changes, it might be very difficult to develop efficient automatic synthesis flows for such models.

(iv) Suitability for the target platform: for example, an MoC which implicitly assumes a shared-memory architecture (such as CnC [8]) may not be well suited for synthesis onto an FPGA platform, where support for an efficient shared-memory system may not exist.

While most of these choices appear highly subjective, we list some characteristics that we believe are essential to an MoC under consideration for automated synthesis.

(i) Deterministic execution: unless the application domain/system being modeled is nondeterministic, the MoC should guarantee that, for a given input, execution proceeds in a deterministic manner. This makes it more convenient for the designer (and ESL tools) to verify correctness when generating different implementations.

(ii) Hierarchy: in general, applications are broken down into subtasks, and different users/teams could be involved in designing/implementing each subtask. The MoC should be powerful enough to model such applications in a hierarchical fashion. An MoC that supports only a flat specification would be difficult to work with because of the large design space available.

(iii) Support of heterogeneous target platforms and refinement: modern SoC platforms consist of a variety of components—general-purpose processor cores, custom hardware accelerators (implemented on ASICs or FPGAs), graphics processing units (GPUs), memory blocks, and interconnection fabric. While it may not be possible for a single MoC specification to efficiently map to different platforms, the MoC should provide directives to refine the application specification so that it can be mapped to the target platform. This also emphasizes the need for hierarchy, because different subtasks might be suited to different components (e.g., FPGA versus GPUs); hence, the refinements could be specific to a subtask.

2.1. General Models of Computation. We start our review of previous work with the most general models of computation. These models impose minimum limitations in the specification and hence can be broadly applied to describe a large variety of applications.

Communicating sequential processes (CSP) [4] allows the description of systems in terms of component processes that operate independently and interact with each other solely through message-passing communication. The relationships between different processes, and the way each process communicates with its environment, are described using various process algebraic operators.

Hierarchy is supported in CSP, where each individual process can itself be composed of subprocesses (whose interaction is modeled by the available operators). CSP allows processes to interact in a nondeterministic fashion with the environment; for example, the nondeterministic choice operator in a CSP specification allows a process to read a pair of events from the environment and decide its behavior based on the choice of one of the two events in a nondeterministic fashion.

One of the key applications of the CSP specification is the verification of large-scale parallel applications. It can be applied to detect deadlocks and livelocks between the concurrent processes. Examples of tools that use CSP to perform such verification include FDR2 [13] and ARC [14]. Verification is performed through a combination of CSP model refinement and CSP simulation. CSP is also used for software architecture description in the Wright ADL project [15] to check system consistency; this approach is similar to FDR2 and ARC.

Petri nets [16] consist of places that hold tokens (tokens represent input or output data) and transitions that describe the process of consuming and producing tokens (transitions are similar to processes in CSP). A transition is enabled when the number of tokens on each input arc is greater than or equal to the required number of input tokens. When multiple transitions are enabled at the same time, any one of them can fire in a nondeterministic way; also, a transition need not fire even if it is enabled. Extensions were proposed to the Petri net model to support hierarchy [17]. Petri nets are used for modeling distributed systems—the main aim being to determine whether a given system can reach any one of the user-specified erroneous states (starting from some initial state).

Event-driven model (EDM). The execution of concurrent processes in EDM is triggered by a series of events. The events could be generated by the environment (system inputs) or processes within the system. This is an extremely general model for specifying concurrent computation and can, in fact, represent many specific models of computation [18]. This general model can easily support hierarchical specification but cannot provide deterministic execution in general.

Metropolis [18] is a modeling and simulation environment for platform-based designs that uses the event-driven execution model for functionally describing the application/computation. Implementation platform modeling is also provided by users as an input to Metropolis, from which the tool can perform synthesis, simulation, and design refinement.

Transaction-level models (TLMs). In [19] the authors define six kinds of TLMs; however, the common point of all TLMs is the separation of communication and computation. The kinds of TLMs differ in the level of detail specified for the different computation and communication components. The specification of computation/communication could be cycle-accurate, approximately timed or untimed, purely functional, or implementation specific (e.g., a bus for communication).

The main concern with using general models for hardware synthesis is that the models may not be suitable for analysis and optimization by the synthesis tool. This could lead to conservative implementations that may be inefficient.

2.2. Process Networks. A process network is the abstract model in most graphical programming environments, where the nodes of the graph can be viewed as processes that run concurrently and exchange data over the arcs of the graph. Processes and their interactions in process networks are much more constrained than those of CSP. Determinism is achieved by two restrictions: (1) each process is specified as a deterministic program, and (2) the quantity and the order of data transfers for each process are statically defined. Process networks are widely used to model the data flow characteristics in data-intensive applications, such as signal processing.

We address three representative process network MoCs (KPN, DPN, and SDF). They differ in the way that they specify the execution/firing and data transfer; this brings differences in determining the scheduling and the communication channel size.

2.2.1. Kahn Process Network (KPN). KPN [20] defines a set of sequential processes communicating through unbounded first-in-first-out (FIFO) channels. Writes to FIFOs are nonblocking (since the FIFOs are unbounded), and reads from FIFOs are blocking—which means the process will be blocked when it reads an empty FIFO channel. Peeking into FIFOs is not allowed under the classic KPN model. Applications modeled as KPN are deterministic: for a given input sequence, the output sequence is independent of the timing of the individual processes.

Data communication is specified in the process program in terms of channel FIFO reads and writes. The access patterns of data transfers can, in general, be data dependent and dynamically determined at runtime. It is hard to statically analyze the access patterns and optimize the process scheduling based on them. A FIFO-based self-timed dynamic scheduling is always adopted in KPN-based design flows, where the timing of the process execution is determined by the FIFO status.

Daedalus [21] provides a rich framework for exploration and synthesis of MPSoC systems that use KPNs as a model of computation. In addition to downstream synthesis and exploration tools, Daedalus provides a tool called KPNGen [21], which takes as input a sequential C program consisting of affine loop nests and generates the KPN representation of the application.

2.2.2. Data Flow Process Network (DPN). The dataflow process network [5] is a special case of KPN. In dataflow process networks, each process consists of repeated "firings" of a dataflow "actor." An actor defines an (often functional) quantum of computation. Actors are assumed to fire (execute atomically) when a certain finite number of input tokens are available on each input edge (arc). A firing is defined as consuming a certain number of input tokens and producing a certain number of output tokens. The firing condition for each actor and the tokens consumed/produced during the firing are specified by a set of firing rules that can be tested in a predefined order using only blocking reads. The mechanism of actor firings helps to reduce the overhead of context switching in multicore platforms, because the synchronization occurs only at the boundary of the firing, not within the firings. But compared to KPN, this benefit does not help much in hardware synthesis.

[Figure 1: Hierarchical FSMs. (a) Flat FSM with states A, B, and C; (b) states B and C grouped into a hierarchical state D.]
2.2.3. Synchronous Data Flow Graph (SDF). SDF [7] is a more restricted MoC than DPN, in which the number of tokens that can be consumed/produced by each firing of a node is fixed statically. The fixed data rate feature can be used to efficiently synthesize the cyclic scheduling of the firings at compile time. Algorithms have been developed to statically schedule SDFs (such as [7]). An additional benefit is that for certain kinds of SDFs (satisfying a mathematical condition based on the number of tokens produced/consumed by each node), the maximum buffer size needed can be determined statically [7].

[Figure 2: Parallel FSMs. (a) Two concurrent FSMs, one with states A and B (condition p) and one with states X and Y (condition q); (b) the combined FSM, whose transition from state AX to state BY is labeled (p, q).]

StreamIt [22] is a programming language and a compilation infrastructure that uses the SDF MoC for modeling real streaming applications. An overall implementation and optimization framework is built to map from SDF-based specifications of large streaming applications to various general-purpose architectures such as uniprocessors, multicore architectures, and clusters of workstations. Enhancements proposed to the classic SDF model include split and join nodes that model data parallelism of the actors in the implementations. In general, StreamIt has the same application expressiveness as SDF.

KPN, DPN, and SDF are suitable for modeling the concurrency in data processing applications, in which execution is determined by data availability. The possibility of data streaming (pipelining) is intrinsically modeled by the FIFO-based communication. However, these models are not user-friendly enough as a preferred concurrency specification for data processing applications, because users need to change the original shared-memory-based coding style into a FIFO-based one. In addition, these models are not tool-friendly enough in the sense that (1) data reuse and buffer space reuse are relatively harder to perform in this overconstrained FIFO-based communication, (2) there are no concise expressions for iterative task instances (which perform identical behavior on different data sets), as they are embedded in one sequential process with a fixed and overconstrained execution order, and (3) it is hard to model access conflicts in shared resources, such as off-chip memory, which is an essential common problem in ESL design.

2.3. Parallel Finite State Machines (FSMs). FSMs are mainly used to describe applications/computations that are control-intensive. Extensions have been proposed to the classic FSM model to support concurrent and communicating FSMs, such as StateChart [6] and Codesign FSM (CFSM) [23]. Figure 1 shows that, in a hierarchical FSM, separate states and their transitions can be grouped into a hierarchical state D, and state D encapsulates the details of states B and C. Similarly, Figure 2 shows a concurrent FSM (Figure 2(a)) and one transition of its combined FSM (Figure 2(b)). A state of the combined FSM is a combination of the states of the concurrent FSMs. The Koski synthesis flow [24] for ESL synthesis of MPSoC platforms uses the StateChart model for describing the functionality of the system. All the state changes in the StateChart are assumed to be synchronous, which means that, in the case of the transition from AX to BY, conditions p and q are validated simultaneously, and there is no intermediate state like AY or BX during the transition. However, in real systems the components being modeled as FSMs can make state changes at different times (asynchronously). CFSM breaks this limitation and introduces asynchronous communication between multiple FSMs. The Polis system [25] uses the CFSM model to represent applications for hardware-software codesign, synthesis, and simulation.

For a real application with both control and data flow, it is possible to embed the control branches into the module/actor specification of a data streaming model. For example, as mentioned in [26], additional tokens can be inserted into the SDF specification to represent global state and irregular (conditional) actor behavior. However, this would lose opportunities to optimize the control path at the system level, which may lead to a larger buffer size and more write operations to the buffer.
There is a series of hybrid specifications containing both FSM and data flow. Finite state machine with datapath (FSMD) models the synchronized data paths and FSM transitions at the clock cycle level. Synchronous piggyback dataflow networks [26] introduce a global state table into the conventional SDF to support conditional execution. This model of computation was used by the PeaCE ESL synthesis framework [27], which, in turn, is an extension of the Ptolemy framework [19]. The FunState model [28] combines dynamic data flow graphs with a hierarchical and parallel FSM similar to StateCharts. Each transition in the FSM may be equivalent to the firing of a process (or function) in the network. The con- [...]

[...] launch computations that can run in parallel with the thread that invokes spawn, and the sync construct, which makes the invoking thread wait for all the spawned computations to complete and return. The Cilk implementation also involves a runtime manager that decides how the computations generated by spawn operations are assigned to different threads.

Habanero-Java/C [33, 34] includes a set of task-parallel programming constructs, in the form of extensions to standard Java/C programs, to take advantage of today's multicore and heterogeneous architectures. Habanero-Java/C has two basic primitives: async and finish. The async statement, async [...]
[Figure: data access example. task1 accesses B[i][j][0..P-1], that is, any data elements with the same B0 and B1 but different B2. Code excerpt: for (i=0;i
Table 1: Syntax and semantics summary of CnC and TLDM.
                           CnC                               TLDM
Computation                Step                              Task
Communications             Data item                         Data
                           Control item                      Not supported
Index of set               Control item tag                  Iterator vector
                           Data item tag                     Subscript vector
Execution model            (1) Control item is generated     Input data are ready
                           (2) Input data are ready
Iteration and data domain  Unbounded and dynamically         Bounded and statically
                           determined                        specified
Data accesses              Embedded in step body             Specified in task definition
Dependence                 Implicitly specified              Can be explicitly specified
DSA                        Enforced                          Not enforced
class tldm_task {
    string task_id;
    tldm_iteration_domain* domain;
    vector< ...
Listing 2
task body in our TLDM specification. A task body can be explicitly specified as a C/C++ task function, or a pointer to the in-memory object in an intermediate representation such as a basic block or a statement, or even the RTL implementation (see Listing 2).

An iteration domain specifies the range of the iterator vectors for a task. We consider the boundaries of iterators in four cases. In the first, simple case, the boundaries of all the iterators are determined by constants or precalculated parameters that are independent of the execution of the current task. In the second case, the boundaries of the inner iterators are in the form of a linear combination of the outer iterators and parameters (such as a triangular loop structure). The iteration domain of the first two cases can be modeled directly by a polyhedral model in a linear matrix form as affine_iterator_range. In the third case, the boundaries of the inner iterators are in a nonlinear form of the outer iterators and parameters. By considering the nonlinear terms of outer iterators as pseudoparameters for the inner iterators, we can also handle the third case in a linear matrix form by introducing separate pseudoparameters into the linear matrix form. In the fourth and most complex case, the iterator boundary is determined by some local data-dependent variables varying in each instance of the task. For example, in an iterative algorithm, a data-dependent convergence condition needs to be checked in each iteration. In TLDM, we separate the iterators into data-independent (first three cases) and data-dependent (the fourth case). We model data-independent iterators in a polyhedral matrix form (class polyhedral_set is described below in this subsection); we model data-dependent iterators in a general form as a TLDM expression, which specifies the execution condition for the task instance. The execution conditions of multiple data-dependent iterators are merged into one TLDM expression in a binary-tree form (see Listing 3).

The data set specifies all the data storage elements in the dataflow between different tasks or different instances of the same task. Each tldm_data object is a multidimensional array variable or just a scalar variable in the application. Each data object has a member scope to specify in which task hierarchy the data object is available. In other words, the data object is accessible only by the tasks within its scope hierarchy. If the data object is global, its scope is set to NULL. The boundaries (or sizes) of a multidimensional array are predefined constants and modeled by a polyhedral matrix subscript_range. The detailed form of subscript_range is similar to that of affine_iterator_range in the iteration domain. Accesses out of the array bound are forbidden in
class tldm_iteration_domain {
  vector
Listing 3
class tldm_data {
  string data_id;
  tldm_task* scope;
  int n_dimension;                 // number of dimensions
  polyhedral_set subscript_range;  // ranges in each dimension
  tldm_data_body* body;
};
Listing 4 our TLDM execution model. A body pointer for data objects is also kept to refer to the array definition in the detailed specification (see Listing 4).

The access set specifies all the data access mappings from task instances to data elements. Array data accesses are modeled as a mapping from the iteration domain to the data domain. If the mapping is in an affine form, it can be modeled by a polyhedral matrix. Otherwise, we assume that the possible ranges of the nonaffine accesses (such as indirect accesses) are bounded. We model the possible range of each nonaffine access by its affine (or rectangular) hull in the multidimensional array space, which can also be expressed as a polyhedral matrix (class polyhedral_set). A body pointer (tldm_access_body*) for an access object is kept to refer to the access reference in the detailed specification (see Listing 5).

The dependence set specifies timing precedence relations between task instances of the same or different tasks. A dependence relation from (task0, instance0) to (task1, instance1) imposes a constraint in task scheduling that (task0, instance0) must be executed before (task1, instance1). The TLDM specification supports explicit dependence imposed by specified tldm_dependence objects and implicit dependence embedded in the data access objects. Implicit dependence can be analyzed statically by a compiler or optimizer to generate derived tldm_dependence objects. The explicit dependence specification provides the designer with the flexibility to add user-specified dependence to help the compiler deal with complex array indices. User-specified dependence is also a key factor in relieving designers from the limitations of dynamic single assignment, while maintaining the program semantics. The access0 and access1 fields point to the corresponding tldm_access objects and can be optional for user-specified dependence. In most practical cases, the dependence between task instances can be modeled by affine constraints on the corresponding iterators of the two dependent tasks as a dependence relation. If the dependence relation between the two iterators is not affine, either a segmented affine form or an affine hull can be used to specify the dependence in an approximate way (see Listing 6).

A polyhedral_set object specifies a series of linear constraints on scalar variables in the data set of the application. The scalar variables can be the iterators (including loop iterators and data subscripts) and the loop-independent parameters. Each constraint is a linear inequality or equality of these scalar variables, and is modeled as an integer vector consisting of the linear coefficients for the scalar variables, a constant term, and an inequality/equality flag. Multiple constraints form an integer matrix (see Listing 7).

Figure 4 shows examples of polyhedral matrices for the iteration domain, data domain, data access, and task dependence, respectively. The first row of each matrix represents the list of iterators and parameters (variables), where A0 and A1 are the dimensions of array A, and the # and $ columns are the constant terms and inequality flags, respectively. The following rows of the matrices represent linear constraints (polyhedral matrix), where the elements of the matrices are the linear coefficients of the variables. For example, the second constraint of the iteration domain case is -i + 0j + N - 1 ≥ 0 (i ≤ N - 1).
(Figure 4 residue: polyhedral matrices for the iteration domain of "for i = 0..N-1, for j = 0..N-1", the data domain A[N][N], the access A[i][i+1], and the dependence from task1 to task2, each with # (constant) and $ (flag) columns.)
Figure 4: Polyhedral matrix representations.
class tldm_access {
  tldm_data* data_ref;  // the data accessed
  bool is_write;        // write or read access
  polyhedral_set iterators_to_subscripts;  // access range
  tldm_access_body* body;
};
Listing 5
class tldm_dependence {
  tldm_task* task0;
  tldm_task* task1;
Listing 8
class polyhedral_set {
  int n_iterator;
  int n_parameter;
  vector

iteration boundaries, some data items can be used as parameters in specifying the iteration range by arrow notations.

[parameter list] -> {subscript vector range};
[parameter list] -> : (task1);
// while (res > 0.1) { res = task2(); }
[res] ->

Listing 9

input data list -> (task name : iterator vector) -> output data list;
e.g.
// A[i][j], B[i][j] -> task1 -> C[j][i]
[A : i, j], [B : i, j] -> (task1 : i, j) -> [C : i, j]
// A[i][0], ..., A[i][N] -> task2 -> B[2*i+1]
[A : i, 0..N] -> (task1 : i) -> [B : 2*i+1]

Listing 10

definitions into the body of the parent task. Data objects can also be defined in the task bodies to become local data objects (see Listing 12).

4.4. Examples of TLDM Modeling. Tiled Cholesky is an example that is provided by the Intel CnC distribution [35]. Listings 13, 14 and 15 show the sequential C-code specification and TLDM modeling of the tiled Cholesky example. In this program we assume that the tiling factor p is a constant. The TLDM built from the tiled Cholesky example is shown in Listing 14. The data set, task set, iteration domain, and access set are modeled in both C++ and textual TLDM specifications.

as a range of iterators. An additional iterator (t) is introduced to the "while" loop to distinguish the iterative instances of the task. If we do not want to impose an upper bound for the iterator t, an unbounded iterator t is supported as in Listing 19. A tldm_expression exe_condition is built to model the execution condition of the "while" loop, which is a logic expression of tldm_data convergence_data. Conceptually, the instances of the "while" loop may execute simultaneously in the concurrent specification. CnC imposes dynamic single-assignment (DSA) restrictions to avoid nondeterministic results. But the DSA restriction requires multiple duplications of the iteratively updated data in
the convergence iterations, which leads to a large or even unbounded array capacity. Instead of the DSA restriction in the specification, we support user-specified dependence to reasonably constrain the scheduling of the "while"-loop instances and avoid data access conflicts. Because the convergence condition will be updated and checked in each "while"-loop iteration, the execution of the "while" loop must be done in a sequential way. So we add dependence from task1

(task name0 : iterator vector) -> (task name1 : iterator vector)
e.g. // task1 -> task2

Listing 11

Listing 12

(Figure 5 residue: design flow from TLDM to Implementation.)

Figure 5: TLDM-based design flow.

Listings 16 and 17 show how our TLDM models nonaffine iteration boundaries. The nonaffine form of outer loop iterators and loop-independent parameters is modeled as a new pseudo-parameter. The pseudo-parameter nonAffine(i) in Listing 17 is embedded in the polyhedral model of the iteration domain. The outer loop iterators (i) are associated with the pseudo-parameter as an input variable. In this way we retain the possibility of parallelizing loops with nonaffine-bounded iterators and keep the overall

Listings 20 and 21 show an indirect access example. Many nonaffine array accesses have bounded ranges of possible data units. For example, in the case of Listing 20, the indirect access has a preknown range (x <= idx[x] <= x+M). We can conservatively model the affine or rectangular hull of the possible access range in tldm_access objects. In Listing 21, the first subscript of the array A is not a specific value related to the iterators, but a specific affine range of the iterators (j <= A0 < j+M).

order of task instances in sequential program languages, our TLDM can efficiently support selection of the proper execution order of task instances to reduce the buffer requirement between tasks. Listing 22 shows the example program to be optimized. The TLDM specification for this example is shown in

4.5. Demonstrative Design Flow Using TLDM.
The detailed design optimizations from the TLDM specification are beyond the scope of this paper. We show a simple TLDM-based design flow in Figure 5 and demonstrate how buffer size is reduced by task instance rescheduling as an example of task-level optimizations using TLDM. A task-level optimization flow determines the task-level parallelism and pipelining between the task instances, the order of the instances of a task, and the interconnection buffers between the tasks. After task-level optimization, parallel hardware modules are generated automatically by high-level synthesis tools. RTL codes for both the hardware modules and their interconnections are integrated into a system RTL specification and then synthesized into the implementation netlist. Different task instance scheduling will greatly affect the buffer size between the tasks. Instead of a fixed execution

Figure 6. Two tasks are generated with their iteration domains. Four affine accesses to array A are notated in the TLDM graph. Three sets of data dependence are notated in the TLDM graph: two intratask dependences (dep0 and dep1), and one intertask dependence (dep2).

To evaluate the buffer size needed to transfer temporary data from task t0 to t1, we need to analyze the data element access pattern of the two tasks. Following the execution order defined in the sequential C program, Figures 7(a) and 7(b) show the access order of the data t0 writes and t1 reads, respectively. A0 and A1 are the array subscripts in the reference (like A[A0][A1]). The direction of the intratask data dependence is also shown in the two subgraphs. As we can see in the two illustrations, the access orders are not matched: one is row by row and the other is column by column. To maintain all the active data between the two tasks, an N×N buffer is needed.

int i, j, k;
data_type A[p][p][p+1];
for (k = 0; k

Listing 13: Example C-code of tiled Cholesky.

dep0: t0 → t0

5.
Advantages of TLDM

Task-level hardware synthesis requires a desired concurrency specification model that is (i) powerful enough to model all the applications in the domain, (ii) simple enough to be used by domain experts, independent of the implementation details, (iii) flexible enough for integrating diverse computations (potentially in different languages or stages of

(Figure 7 residue: access patterns over subscripts A0/A1 for t0: A[i][j] and t1: A[j+1][i] on an (N,N) array; the optimized schedule alternates t0<i=1..N, j=c> column writes with t1<i=c, j=N-1..1> column reads over time.)

Figure 7: (a) Initial access order of t0 writes. (b) Initial access order of t1 reads. (c) Optimized access order of t0 writes. (d) Optimized task instance scheduling for buffer reduction.

(Figure 8 residue: vector addition — int A[N]; // input; for (i=0; i…) — TLDM versus process network.)

Figure 8: Implicit parallelism (TLDM) versus explicit parallelism (process network).

the buffer size between tasks by reordering the execution order of the loop iterations. But these optimizations are relatively hard for a KPN-based flow because the optimizer needs to analyze deep into the process specification and interact with the module-level implementation. However, in our TLDM specification, the necessary task-instance-level dependence information is explicitly defined without the redundant order constraints of sequential languages. This offers the possibility and convenience of performing task-level optimizations without touching the module-level implementation.

5.2. Heterogeneous Platform Supports. Heterogeneous platforms, such as the customized heterogeneous platform being developed by CDSC [9, 36], have gained increased attention

(Figure 9 residue: task1(B[i][j] ← A[i][j]) and task3(D[j][i] ← C[j][i]) over arrays A, C, D — TLDM versus KPN.)

Figure 9: Support for task rescheduling.
in high-performance computation system design because of the benefits brought about by efficient parallelism and customization in a specific application domain. Orders-of-magnitude efficiency improvements for applications can be achieved in the domain using domain-specific computing platforms consisting of customizable processor cores, GPUs, and reconfigurable fabric.

CnC is essentially a coordination language. This means that the CnC specification is largely independent of the internal implementation of a step collection, and individual step collections could be implemented in different languages/models. This makes integrating diverse computational steps simpler for users, compared to rewriting each step into a common concurrent specification. The module-level steps could be implemented in C/C++/Java for GPPs, CUDA/OpenCL for GPUs, or even HDL such as VHDL and Verilog for hardware accelerator designs. In addition, because CnC is a coordination description supporting different implementations, it naturally supports an integrated flow to map domain-specific applications onto heterogeneous platforms by selecting and refining the module-level implementations. Our TLDM model inherits this feature directly from CnC.

IPs) are coordinated with a common TLDM model. For each execution of a task instance in TLDM, the module-level function is called once for the C++ and CUDA cases, or a synchronization cycle by enabled and done signals is performed for the RTL case, which means the RTL IP has processed one image. The related input data and space for output data are allocated in a unified and shared memory space, so that the task-level data transfer is transparent to the module designers; this largely simplifies the module design in a heterogeneous platform.

To map the application in Figure 10 onto a heterogeneous platform, each task in the TLDM specification can have multiple implementation candidates—such as multicore CPU, many-core GPU, and FPGA versions. These module-level implementation candidates share the same TLDM specification. A synthesizer and simulator can be used to estimate or profile the physical metrics for each task and each implementation candidate.
According to this module-level information and the task-level TLDM specification, efficient task-level allocation and scheduling can be performed to determine the architecture mapping and optimize the communication.

5.3. Suitability for Hardware Synthesis. Our TLDM is based on the Intel CnC model. Task, data, and iterator in TLDM are the counterparts of step, data item, and control tag in CnC, respectively. Regular tasks and data are specified in a compact and iterative form and indexed by iterators and array subscripts. Both task instances in TLDM and prescribed steps in CnC need to wait for all the input data to be available in order to start the execution, and do not ensure the availability of the output data until the execution of the current instance is finished. The relative execution order of task and step instances is only constrained by true data dependence, not by textual positions in sequential programs. However, there are also great differences between TLDM and

We take medical image processing applications as an example. There are three major filters (denoise, registration, and segmentation) in the image processing pipeline. An image sequence with multiple images will be processed by the three filters in a pipelined way. At the task level, we model each filter as a task, and each task instance processes one image in our TLDM model. The task-level dataflow is specified in the TLDM specification, which provides enough information for system-level scheduling. Module-level implementation details are embedded in module-level specifications, which can support different languages and platforms. In the example shown in Figure 10, we are trying to map three filters onto a FPGA+GPU heterogeneous platform.
One typical case is that denoise is implemented in an FPGA by a high-level synthesis flow from a C++ specification, registration is mapped onto a GPU, and segmentation is specified as a hardware IP in the FPGA. Module specifications using C++ (for HLS-FPGA synthesis), CUDA (for GPU compiling), and RTL (for hardware design

CnC. By restricting the general dynamic behavior allowed by CnC, TLDM is more suitable for hardware synthesis for most practical applications, and it differs from the general CnC in the following five aspects.

(1) CnC allows dynamically generating step instances; this is hard to implement in hardware synthesis. In addition,

tldm_app tiled_cholesky;
// iterator variables
tldm_data iterator_i("i");  // scalar variable
tldm_data iterator_j("j");
tldm_data iterator_k("k");
// environment parameters
tldm_data param_p("p");
// array A[p][p][p+1]
tldm_data array_A("A", 3);  // n_dimension = 3
array_A.insert_index_constraint(0, ">=", 0);
array_A.insert_index_constraint(0, "<", p);
array_A.insert_index_constraint(1, ">=", 0);
array_A.insert_index_constraint(1, "<", p);
array_A.insert_index_constraint(2, ">=", 0);
array_A.insert_index_constraint(2, "<", p+1);
// attach data into tldm application
tiled_cholesky.attach_data(&iterator_i);
tiled_cholesky.attach_data(&iterator_j);
tiled_cholesky.attach_data(&iterator_k);
tiled_cholesky.attach_data(&param_p);
tiled_cholesky.attach_data(&array_A);
// iteration domain of task seq_cholesky
tldm_iteration_domain id_k;
id_k.insert_iterator(iterator_k);  // "k"
id_k.insert_affine_constraint("k", 1, ">=", 0);         // k*1 >= 0
id_k.insert_affine_constraint("k", -1, "p", 1, ">", 0); // -k+p > 0
// accesses A[k][k][k+1]
tldm_access acc_A0(&array_A, WRITE);
acc_A0.insert_affine_constraint("A(0)", 1, "k", -1, "=", 0);  // A0 = k
acc_A0.insert_affine_constraint("A(1)", 1, "k", -1, "=", 0);  // A1 = k
acc_A0.insert_affine_constraint("A(2)", 1, "k", -1, "=", 1);  // A2 = k+1
//
accesses A[k][k][k]
tldm_access acc_A1(&array_A, READ);
acc_A1.insert_affine_constraint("A(0)", 1, "k", -1, "=", 0);
acc_A1.insert_affine_constraint("A(1)", 1, "k", -1, "=", 0);
acc_A1.insert_affine_constraint("A(2)", 1, "k", -1, "=", 0);
// task seqCholesky
tldm_task seq_cholesky("seqCholesky");
seq_cholesky.attach_id(&id_k);
seq_cholesky.attach_access(&acc_A0);
seq_cholesky.attach_access(&acc_A1);
seq_cholesky.attach_parent(NULL);
tiled_cholesky.attach_task(&seq_cholesky);
// iteration domain of task tri_solve
tldm_iteration_domain id_kj = id_k.copy();  // 0 <= k < p
id_kj.insert_iterator(iterator_j);  // "j"
id_kj.insert_affine_constraint("j", 1, "k", -1, ">=", 1);  // j - k >= 1
id_kj.insert_affine_constraint("j", -1, "p", 1, ">", 0);   // -j + p > 0
// accesses A[j][k][k+1]
tldm_access acc_A2 = acc_A0.copy();
acc_A2.replace_affine_constraint("A(0)", 1, "j", -1, "=", 0);  // A0 = j
// accesses A[j][k][k]
tldm_access acc_A3 = acc_A1.copy();
acc_A3.replace_affine_constraint("A(0)", 1, "j", -1, "=", 0);  // A0 = j
// accesses A[k][k][k+1]
tldm_access acc_A4 = acc_A1.copy();
acc_A4.replace_affine_constraint("A(2)", 1, "k", -1, "=", 1);  // A2 = k+1
// task TriSolve
tldm_task tri_solve("TriSolve");
tri_solve.attach_id(&id_kj);
tri_solve.attach_access(&acc_A2);
tri_solve.attach_access(&acc_A3);
tri_solve.attach_access(&acc_A4);
tri_solve.attach_parent(NULL);
tiled_cholesky.attach_task(&tri_solve);
// iteration domain of task update
tldm_iteration_domain id_kji = id_kj.copy();
id_kji.insert_iterator(iterator_i);  // "i"
id_kji.insert_affine_constraint("i", 1, "k", -1, ">=", 1);  // i - k >= 1
id_kji.insert_affine_constraint("i", -1, "j", 1, ">=", 0);  // -i + j >= 0
// accesses A[j][i][k+1]
tldm_access acc_A5 = acc_A2.copy();
acc_A5.replace_affine_constraint("A(1)", 1, "i", -1, "=", 0);  // A1 = i
// accesses A[j][k][k+1]
tldm_access acc_A6 = acc_A4.copy();
acc_A6.replace_affine_constraint("A(0)", 1, "j", -1, "=", 0);  // A0 = j
// accesses A[i][k][k+1]
tldm_access acc_A7 = acc_A4.copy();
acc_A7.replace_affine_constraint("A(0)", 1, "i", -1, "=", 1);  // A0 = i
// task Update
tldm_task update("Update");
update.attach_id(&id_kji);
update.attach_access(&acc_A5);
update.attach_access(&acc_A6);
update.attach_access(&acc_A7);
update.attach_parent(NULL);
tiled_cholesky.attach_task(&update);
// dependence: TriSolve

Listing 14: C++ TLDM specification of tiled Cholesky.

A data collection in CnC is defined as unbounded: if a new control tag (like array subscripts) is used to access the data collection, a new data unit is generated in the data collection. In TLDM, the iteration domain and data domain are statically defined, and the explicit information of these domains helps to estimate the corresponding resource costs in hardware synthesis. The data domain in TLDM is bounded: out-of-bounds accesses are forbidden in the specification.

Consider the CnC specification for the vector addition example in Figure 11. The environment is supposed to generate control tags that prescribe the individual step instances that perform the computation. Synthesis tools would need to synthesize hardware to (i) generate the control tags in some sequence and (ii) store the control tags till the prescribed step instances are ready to execute. Such an implementation would be inefficient for such a simple program. TLDM works around this issue by introducing the concept of iteration domains. Iteration domains specify that all control tags are generated as a numeric sequence with a constant stride. This removes the need to synthesize hardware for explicitly generating and storing control tags.

(2) In CnC, input/output accesses to the data collections are embedded implicitly in the step specification. TLDM adopts explicit specification of inter-task data accesses.
In our simple example in Figure 12, the task task reads data from data collection A, but for a given step instance i, task-level CnC does not specify the exact data element to be read. This makes it hard for hardware synthesis to get an efficient implementation without the access information. TLDM specifies the access patterns directly, which can be used for dependence analysis, reuse analysis, memory partitioning, and various memory-related optimizations.

// DNS.cpp : C++ (HLS design flow)
void denoise(double* img_in, double* img_out) {
#pragma hls interface-memory img_in
#pragma hls interface-memory img_out
  ...  // module-level detailed spec.
}
// REG.cu : CUDA (GPU design flow)
__global__ void RegKernel(double** img_in, double* img_out) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  ...  // module-level detailed spec.
}
// SEG.v : verilog (RTL design flow)
module seg_top (
  input clk, rst, enable,
  output done,
// medical_image.tldm : TLDM

(Figure 10 residue: task pipeline Input_img → Denoise → Denoised_img → Registration → Registered_img → Segmentation → Contour_img.)

Figure 10: Heterogeneous computation integration using TLDM.

(Figure 11 residue: vector addition — Environment; int A[N]; // input; int B[N]; // output; for (i=0; i… — sequential code versus CnC versus TLDM.)

Figure 12: TLDM versus CnC: iterator domain and access pattern are explicitly specified in TLDM.

is the coefficient associated with each variable. Standard polyhedral libraries can be used to perform transformations on these integer sets and to analyze specific metrics such as counting the size. The polyhedral model framework helps us to utilize linear/convex programming and linear algebra properties in our synthesis and optimization flow.

(3) In the static specification of the domain and mapping relations, a polyhedral model is adopted in TLDM to model the integer sets and mappings with linear inequalities and equations.
As shown in Figure 13, the range of the iterators can be defined as linear inequalities, which can be finally expressed concisely with integer matrices. Each row of the matrix is a linear constraint, and each column of the matrices

Note that TLDM introduces a hybrid specification to model the affine parts in the polyhedral model and nonaffine parts in the polyhedral approximation or nonaffine execution conditions. Compared to the traditional linear

<> :: (top level)  // top level task, single instance
{
  <> :: [data type p];  // no domain for scalar variable
  // data A[p][p][p+1]
  [p] -> {0<=a0 :: [array A];
  // task seqCholesky
  [p] -> :: (task1) {body id("seqCholesky");};
  [A : k, k, k] -> (task1 : k) -> [A : k, k, k+1];
  // task TriSolve
  [p] -> :: (task2) {body id("TriSolve");};
  // task Update
  [p] -> :: (task3) {body id("Update");};
  [A : j, k, k+1], [A : i, k, k+1] -> (task3 : k, j, i) -> [A : j, i, k+1];
  // dependence
  (task2 : k, j) -> (task3 : k, j, (k+1)..j)
};

Listing 15: Textual TLDM specification of tiled Cholesky.

for (i = 0; i

Listing 16: Example C-code of nonaffine iterator boundary.

tldm_iteration_domain id_ij;
id_ij.insert_iterator(iterator_i);  // "i"
id_ij.insert_iterator(iterator_j);  // "j"
id_ij.insert_affine_constraint("i", 1, ">=", 0);         // i*1 >= 0
id_ij.insert_affine_constraint("i", -1, "N", 1, ">", 0); // -i+N > 0
id_ij.insert_affine_constraint("j", 1, ">=", 0);         // j*1 >= 0
id_ij.insert_affine_constraint("j", -1, "nonAffine(i)", 1, ">", 0);  // -j + (i*i) > 0
[N] ->

Listing 17: TLDM specification of nonaffine iterator boundary.

while (!convergence)
  for (i = 0; i

Listing 18: Example C-code of convergence algorithm.
tldm_data convergence_data("convergence");
tldm_iteration_domain id_ti;
id_ti.insert_iterator(iterator_t);  // "t" for the while loop
id_ti.insert_iterator(iterator_i);  // "i"
id_ti.insert_affine_constraint("i", 1, ">=", 0);         // i*1 >= 0
id_ti.insert_affine_constraint("i", -1, "N", 1, ">", 0); // -i+N > 0
id_ti.insert_affine_constraint("t", 1, ">=", 0);         // t >= 0
tldm_expression exe_condition(&iterator_t, "!", &convergence_data);
id_ti.insert_exe_condition(&exe_condition);
// non-DSA access: different iterations access the same scalar data unit
tldm_access convergence_acc(&convergence_data, WRITE);
tldm_task task1("task1");
task1.attach_id(&id_ti);
task1.attach_access(&convergence_acc);
// dependence needs to be specified to avoid access conflicts
tldm_dependence dept(&task1, &task1);
// dependence : task1

Listing 19: TLDM specification of convergence algorithm.

for (i = 0; i

Listing 20: Example C-code of indirect access.

(Figure 13 residue: iteration-domain inequalities over (i, N) and (i, j, M, N, p), expressed as integer matrices with constant-term (#) and flag ($) columns.)

Figure 13: Polyhedral representation for iterator domain.

int A[N];
while (!converge)
  converge = task1(A[0..N] ← A[0..N]);

(Figure 14 residue: sequential code versus CnC ([int A]) versus TLDM.)

Figure 14: DSA (CnC) versus user-specified dependence (TLDM).

tldm_data array_A("A");
// accesses A[idx[j]][i]
// implication: x <= idx[x] < x+M
tldm_access indirect_acc(&array_A, READ);
indirect_acc.insert_affine_constraint("A(0)", 1, "j", -1, ">=", 0);          // A0 >= j
indirect_acc.insert_affine_constraint("A(0)", 1, "j", -1, "M", -1, "<", 0);  // A0 < j+M
indirect_acc.insert_affine_constraint("A(1)", 1, "i", -1, "=", 0);           // A1 = i
[A : j..(j+M), i] -> (task2 : i, j)

Listing 21: TLDM specification of indirect access.

in an array, we support a polyhedral-based form to specify the affine (rectangle) hull of the possible accessed array elements.
With this approach, we can conservatively and efficiently handle the possible optimizations for many of the nonaffine array accesses—such as indirect accesses (shown in Listing 21).

polyhedral model, we support the extensions in the following three aspects. First, if the boundaries of the iterator are in the nonaffine form of outer iterators and task-independent parameters, parallelism and execution order transformation for this non-affine-bound iterator is still possible. We introduce pseudoparameters to handle these inner-loop-independent nonaffine boundaries. Second, if the boundaries of the iterator are dependent on data items generated by the task itself—for example, in the case of a convergent algorithm—nonaffine data-dependent execution conditions

for (i = 1; i <= N; i++)
  for (j = 0; j <= N; j++)
    t0: A[i][j] = A[i-1][j] + ...;
for (i = 0; i <= N; i++)
  for (j = N-1; j >= 0; j--)
    t1: A[j][i] = A[j+1][i] + ...;

Listing 22: Example application for buffer reduction.

(4) CnC restricts dynamic single assignment (DSA) in the data access specification to achieve the intrinsic deterministic characteristics of a concurrent specification. But practical experience shows that the DSA restriction is not convenient for designers when specifying their applications for efficient hardware implementation. Memory space reuse in CnC is forbidden, which makes the designer lose the capability of specifying memory space reuse schemes. In addition, loop-carried variables need to maintain one duplication for each iteration; this leads to an unbounded data domain when the iteration domain is unbounded. For example, in Figure 14 array A needs to have different copies for each iteration in the CnC specification; this leads to an additional dimension of A to index the specific copy. In TLDM we do not enforce the DSA restriction in the specification, but DSA is still recommended.
For those cases in which designers intend to break the DSA restriction, write access conflicts need to be handled by the designer using our user-specified dependence specifications to constrain the possible conflicting accesses into different time spans. In Figure 14 the data dependence (task1 : t) -> (task1 : t+1) ensures the sequential execution of the convergence iterations, which can guarantee correctness without DSA constraints.

are specified explicitly by tldm_expression objects, which are easy to analyze in the optimization and hardware generation flows. Third, for nonaffine array accesses or block data access

col_buf[N];
for (i = 0; i <= N; i++)
{  // pragma hls loop pipeline
  t0(1..N, i, col_buf);   // write one column of array A[N][N]
  t1(i, N-1..1, col_buf); // read one column of array A[N][N]
}

Listing 23: Generated HW module after task-level scheduling.

(5) The traditional CnC specification is not scalable because it does not support a hierarchy of steps. In our TLDM specification, task graphs are built in a hierarchical way. Designers can specify a set of highly coupled subtasks as a compound task. The hardware synthesis flow can select various methodologies to perform coarse-grained or fine-grained optimizations based on the task hierarchy. In addition, specific synthesis constraints or optimization targets can be applied to different compound tasks according to the computation characteristics of the compound tasks. Data elements can also be defined in a smaller scope for better locality in memory optimizations.

5.4. Discussion and Comparison. Table 2 compares some widely used concurrency specifications in different aspects. CSP is the general underlying MoC of the event-driven Metropolis framework. The PSM model is specified in a structural language, SpecC, and used in the SCE framework. SDF and KPN are the most popular models for data-intensive applications and have been adopted by the StreamIt and Daedalus projects. CnC and TLDM were recently proposed to separate implementation optimizations from concurrency specifications.

SDF has partially scalable data parallelism because the parallelism can be relatively easily exploited by automated transformations on the specification, as in StreamIt. Reordering the task instances can help to improve task parallelism and memory usage, which can only be supported by CnC and TLDM. CSP and CnC support dynamic generation of task instances, and the dynamic behavior is hard to analyze statically by a hardware synthesizer. On the contrary, SDF, KPN, and TLDM are limited in multicore processor-based platforms because the user needs to transform the dynamic specification into an equivalent static specification.

CSP (Metropolis), PSM, and TLDM have explicit specification of the interfaces and accesses between tasks, which can fully encapsulate the implementation details of the task body for different platforms. In SDF, KPN, and CnC, the access information is embedded in the task body specification, and additional processes are needed to extract the information from the different specifications of heterogeneous platforms. From the comparison, we can see that the proposed TLDM is the most suitable specification for hardware synthesis, especially for data-intensive applications.

6. Conclusion and Ongoing Work

This paper proposes a practical high-level concurrent specification for hardware synthesis and optimization for data processing applications. In our TLDM specification, parallelism of the task instances is intrinsically modeled, and the dependence constraints for scheduling are explicit. Compared to previous concurrent specifications, TLDM aims

In Table 2 the first column lists the features of the concurrency specifications. In the table, solid circles, hollow circles, and blanks mean full support, partial support, and no support, respectively.
All of these models are designed to specify the applications in a static and bounded way to support data streaming applications because they are the with minimal overconstraints for concurrency. A polyhedral most popular application types in electronic system design. model is embedded in the specification of TLDM as a For control-intensive applications, CSP and CnC can sup- standard and unified representation for iteration domains, port general irregular (conditional) behavior like dynamic data domains, access patterns, and dependence. Extensions instance generation and conditional execution; PSM has an for nonaffine terms in the program are comprehensively intrinsic support for concurrent FSMs; KPN and TLDM considered in the specification as well to support the analysis support data-dependent execution, but their process instance and synthesis of irregular behavior. Concrete examples show is statically defined; SDF has regular data accesses and actor the benefits of our TLDM specification in modeling a task- firing rates, which make it inefficient to model conditional level concurrency for hardware synthesis in heterogeneous execution. CSP is nondeterministic because of its general platforms. We are currently in the process of developing the trigger rules; PSM has parallel processes that can write the TLDM-based hardware synthesis flow. same data; other models are all deterministic. Hierarchy is fully supported by CSP, PSM, and TLDM, and hierarchical Acknowledgments CnC is under development now. In CSP, PSM, and KPN, data parallelism has to be specified explicitly because they This work is partially funded by the Center for Domain- do not have the concept of task instance as CnC and TLDM. Specific Computing (NSF Expeditions in Computing Award Journal of Electrical and Computer Engineering 23 Table 2: Summary of the concurrency specifications. 
CSP PSM SDF KPN CnC TLDM Data streaming application •••••• Control-intensive application •• ◦•◦ Nondeterministic behavior •◦ Hierarchical structure •• ◦• Scalable data parallelism ◦•• Task instance reordering •• HW Synthesis ••• • Multicore platform ••◦◦•◦ Heterogeneous platform ••◦◦◦• CCF-0926127) and the Semiconductor Research Corpora- [16] T. Murata, “Petri nets: properties, analysis and applications,” tion under Contract 2009-TJ-1984. The authors would like Proceedings of the IEEE, vol. 77, no. 4, pp. 541–580, 1989. to thank Vivek Sarkar for helpful discussions and Janice [17] Petri net, 2010, http://en.wikipedia.org/wiki/Petri net. Wheeler for editing this paper. [18] A. Davare, D. Densmore, T. Meyerowitz et al., “A next-gen- eration design framework for platform-based design,” DVCon, 2007. References [19]J.Buck,S.Ha,E.A.Lee,andD.G.Messerschmitt,Readings in Hardware/Software Co-Design. Ptolemy: A Framework for [1] A. Gerstlauer, C. Haubelt, A. D. Pimentel, T. P. Stefanov, D. Simulating and Prototyping Heterogeneous Systems,Kluwer D. Gajski, and J. Teich, “Electronic system-level synthesis Academic, Norwell, Mass, USA, 2002. methodologies,” IEEE Transactions on Computer-Aided Design [20] G. Kahn, “The semantics of a simple language for parallel pro- of Integrated Circuits and Systems, vol. 28, no. 10, pp. 1517– gramming,” in Proceedings of the IFIP Congress,J.L.Rosenfeld, 1530, 2009. Ed., North-Holland, Stockholm, Sweden, August 1974. [2] A. Sangiovanni-Vincentelli, “Quo vadis, SLD? Reasoning about the trends and challenges of system level design,” Pro- [21] H. Nikolov, M. Thompson, T. Stefanov et al., “Daedalus: ceedings of the IEEE, vol. 95, no. 3, Article ID 4167779, pp. 467– toward composable multimedia MP-SoC design,” in Proceed- 506, 2007. ings of the 45th Design Automation Conference (DAC ’08),pp. [3] “An independent evaluation of the AutoESL autopilot high- 574–579, ACM, New York, NY, USA, June 2008. level synthesis tool,” Tech. 
Hindawi Publishing Corporation
Journal of Electrical and Computer Engineering
Volume 2012, Article ID 593532, 12 pages
doi:10.1155/2012/593532

Research Article

Selectively Fortifying Reconfigurable Computing Device to Achieve Higher Error Resilience

Mingjie Lin,1 Yu Bai,1 and John Wawrzynek2

1 Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816, USA
2 Department of Electrical Engineering and Computer Science, University of California at Berkeley, Berkeley, CA 94720, USA

Correspondence should be addressed to Mingjie Lin, [email protected]

Received 11 September 2011; Revised 9 January 2012; Accepted 11 January 2012

Academic Editor: Deming Chen

Copyright © 2012 Mingjie Lin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

With the advent of 10 nm CMOS devices and "exotic" nanodevices, the location and occurrence time of hardware defects and design faults become increasingly unpredictable, posing severe challenges to existing techniques for error-resilient computing, because most of them statically assign hardware redundancy and do not account for the error tolerance inherent in many mission-critical applications. This work proposes a novel approach to selectively fortifying a target reconfigurable computing device in order to achieve hardware-efficient error resilience for a specific target application.
We intend to demonstrate that such error resilience can be significantly improved with effective hardware support. The major contributions of this work include (1) the development of a complete methodology to perform sensitivity and criticality analysis of hardware redundancy, (2) a novel problem formulation and an efficient heuristic methodology to selectively allocate hardware redundancy among a target design's key components in order to maximize its overall error resilience, and (3) an academic prototype of an SFC computing device that illustrates a 4-times improvement of error resilience for an H.264 encoder implemented on an FPGA device.

1. Introduction

With the advent of 10 nm CMOS devices and "exotic" nanodevices [1–3], error resilience is becoming a major concern for many mission- and life-critical computing systems. Unfortunately, as these systems grow in both design size and implementation complexity, device reliability diminishes severely due to escalating thermal profiles, process-level variability, and harsh environments (such as outer space, high-altitude flight, nuclear reactors, or particle accelerators). Unlike design faults, device failures are often spatially probabilistic (the locations of hardware defects and design faults within a chip are unknown) and temporally unpredictable [4, 5] (the occurrence times of hardware defects are hard to foretell).

1.1. Related Work. Traditionally, either special circuit techniques or conservative design is employed to ensure correct operation and to achieve high error resilience. Existing circuit techniques include guard banding, conservative voltage scaling, and even radiation hardening [6, 7]. For example, qualified versions of SRAM-based FPGAs, such as Xilinx's QPro [8], are commercially available for mitigating SEUs at the circuit level. More recently, hardware designers have started to employ information redundancy, hardware redundancy, time redundancy, or a combination of these techniques in order to circumvent hardware defects at the device, circuit, and architectural levels. Specifically, the use of redundant hardware, data integrity checking, and data redundancy across multiple devices is gaining considerable popularity [9–11]. In fact, several engineers and researchers [12, 13] have demonstrated that logic-level triple-modular redundancy (TMR) with scrubbing of the FPGA's programming (or configuration) data effectively mitigates radiation-induced single-bit upsets (SBUs). However, most of these redundancy techniques detect and recover from errors through expensive redundancy allocated statically, that is, based on previously known or estimated error information, usually treating all components within a system indiscriminately, and they therefore incur huge area and performance penalties.

Recognizing that TMR is costly, a recent study from BYU [14] proposed applying a partial TMR method to the most critical sections of the design while sacrificing some reliability of the overall system. Specifically, they introduced an automated software tool that applies TMR incrementally at a very fine level until the available resources are utilized, thus maximizing the reliability gain for a specified area cost. Despite the apparent conceptual similarity to this work, there are several important distinctions between the studies [14–16] and this one. First, while the study [14] focuses primarily on full or partial TMR, we consider the more general case of selecting among n-ary modular redundancy (nMR) choices, where n = 3, 5, 7, .... This generalization turned out to be important: in Section 7, we will show that in our target H.264 application, if more than TMR, such as 7-MR, is applied to certain components, the overall error resilience can be significantly improved. In addition, we provide a stochastic formulation of maximizing a system's error resilience when hardware failures are spatially probabilistic and temporally unpredictable. As a result, we can in principle obtain a mathematically provable optimal solution that maximizes the system's overall error resilience under a constraint on the total available hardware redundancy. Finally, while the techniques studied in [14–16] are mostly specific to FPGA fabric, our proposed methodology can be readily introduced into logic synthesis for application-specific integrated circuit (ASIC) designs.

Identifying the most critical circuit structures, and thus trading hardware cost for SEU immunity, is not a new idea. In fact, Samudrala et al. proposed the selective triple modular redundancy (STMR) method, which uses signal probabilities to find the SEU-sensitive subcircuits of a design [15]. Morgan et al. have proposed a partial mitigation method based on the concept of persistence [16]. We approach the criticality analysis differently in this study. Instead of being logic-circuit- or FPGA-fabric-specific, we analyze individual system criticality by probabilistically computing the output sensitivity subject to various amounts of input perturbation. Therefore, we are able not only to distinguish critical from noncritical components, but also to quantitatively assign different criticality values, which leads to a more cost-effective allocation of hardware redundancy.

This paper proposes a selectively fortified computing (SFC) approach to providing error resilience that preserves the delivery of expected performance and accurate results, despite the presence of faulty components, in a robust and efficient way. The key idea of SFC is to judiciously allocate hardware redundancy according to each key component's criticality to the target device's overall error resilience. It leverages two emerging trends in modern computing. First, while ASIC design and manufacturing costs are soaring with each new technology node, the computing power and logic capacity of modern FPGAs steadily advance. As such, we anticipate that FPGA-like reconfigurable fabrics, due to their inherent regularity and built-in hardware redundancy, will become increasingly attractive for robust computing. Moreover, we believe that modern FPGA devices, as they grow bigger and more powerful, will become increasingly suitable hardware platforms for applying the SFC concept and methodology to improve fault tolerance and computing robustness. Second, many newly emerging applications, such as data mining, market analysis, cognitive systems, and computational biology, are expected to dominate modern computing demands [17]. Unlike conventional computing applications, they typically process massive amounts of data and build mathematical models in order to answer real-world questions and to facilitate the analysis of complex systems, and they are therefore tolerant of imprecision and approximation. In other words, for these applications, computation results need not always be perfect as long as the accuracy of the computation is "acceptable" to human users [18].

The rest of the paper is organized as follows. Section 2 overviews the proposed SFC framework and Section 3 discusses its target applications. We then delve into more detailed descriptions of criticality analysis and of optimally allocating hardware redundancy within a constrained hardware allowance. Subsequently, Section 6 describes in detail our H.264 encoder prototype and numerous design decisions. Finally, we present and analyze the error resilience results from an SFC design against its baseline without hardware redundancy in order to demonstrate its effectiveness, with Section 7 concluding the paper.

2. SFC Framework Overview

There are two conceptual steps, depicted in Figure 1, to achieving hardware-efficient reliable computing by the approach of selectively fortified computing (SFC): criticality probing and hardware redundancy allocation. Because there is a fundamental tradeoff between hardware cost and computing reliability, it is infeasible to construct a computing system that can tolerate faults in every one of its elements. Therefore, the most critical hardware components in a target device need to be identified, and their associated computation should be protected as a priority. In SFC, we develop methods to identify sensitive elements of the system whose failures might cause the most critical system failures, prioritize them based on their criticality to user perception and their associated hardware cost, and allocate hardware redundancy efficiently and accordingly.

Figure 1: Conceptual flow of the SFC (selectively fortified computing) methodology (application -> criticality analysis -> criticality map -> discriminatively allocating hardware redundancy -> place and route -> SFC reconfigurable device).

3. Target Applications

Many mission-critical applications, such as multimedia processing, wireless communications, networking, and recognition, mining, and synthesis, possess a certain degree of inherent resilience; that is, when facing device or component failures, their overall performance degrades gracefully [19–22]. Such resilience is due to several factors. First, the input data to these algorithms are often quite noisy, and the applications are nevertheless designed to handle them; moreover, the input data are typically large in quantity and frequently possess significant redundancy. Second, the algorithms underlying these applications are often statistical or probabilistic in nature. Finally, the output data of these applications are mostly for human consumption, so some imperfections can be tolerated thanks to the limits of human perception. We believe that these applications are ideal SFC candidates because their overall system error resilience can potentially be improved significantly by selectively fortifying their key components. In this study, we illustrate our proposed SFC strategy by focusing on a highly efficient 720p HD H.264 encoder [23]. We choose the H.264 encoder because it is widely used in many mission-critical instruments. Moreover, H.264 contains a variety of computational motifs, from highly data-parallel algorithms (motion estimation) to control-intensive ones (CABAC), as depicted in Figure 2.

Our baseline FPGA implementation consists of five major functional components that account for more than 99% of the total execution time. Among these functional components, IME (integer motion estimation) finds the closest match for an image block from a previous reference image and computes a vector to represent the observed motion. While it is one of the most compute-intensive parts of the encoder, the basic algorithm lends itself well to data-parallel architectures. The next step, FME (fractional motion estimation), refines the initial match from integer motion estimation and finds a match at quarter-pixel resolution. FME is also data parallel, but it has some sequential dependencies and a more complex computation kernel that make it more challenging to parallelize. IP (intraprediction) then uses previously encoded neighboring image blocks within the current image to form a prediction for the current image block. Next, in DCT/Quant (transform and quantization), the difference between the current and the predicted image block is transformed and quantized to generate quantized coefficients, which then go through the inverse quantization and inverse transform to generate the reconstructed pixels. Finally, CABAC (context-adaptive binary arithmetic coding) is used to entropy-encode the coefficients and other elements of the bit-stream. Unlike the previous algorithms, CABAC is sequential and control dominated. While it takes only a small fraction of the execution time (1%), CABAC often becomes the bottleneck in parallel systems due to its sequential nature.

To appreciate the limited resilience characteristics of the HD H.264 encoder, we first implement all five components using a standard HDL synthesis flow with Xilinx ISE Design Suite 13.0. Subsequently, errors are injected separately into four key components, IME (integer motion estimation), FME (fractional motion estimation), DCT/Quant (transform and quantization), and CABAC (context-adaptive binary arithmetic coding), through VPI interfaces (see Section 5.2). These four components are quite diverse and representative in their computing patterns. More importantly, they together consume more than 90% of the total execution load and more than 75% of the total energy consumption [23]. For each of our target components X (I, II, III, and IV in Figure 2), we conduct 10000 fault injection experiments for each error rate r according to the procedure outlined in Section 5.2. At each redundancy ratio w_i, we repeat this procedure, measure the output deviation between the resulting image and its reference, and quantify the resulting error as described in Section 5.3. More details of our measurement procedure can be found in Section 5.

Before discussing the results presented in Figures 3 and 4, we formally define several key parameters, some of which will be used in subsequent sections. Let h_i denote the hardware cost of implementing component i in our target design. We define the redundancy ratio w_i = (h_i − h_{i,0})/h_{i,0}, where h_{i,0} denotes the hardware cost of component i without any redundancy.
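As a quick sanity check of the redundancy-ratio definition above, the following sketch exercises it with invented gate counts (the numbers are illustrative only and are not measured from the H.264 design):

```python
# Sanity check of the redundancy ratio w_i = (h_i - h_{i,0}) / h_{i,0}.
# Gate counts below are invented for illustration; only the ratios
# reproduce statements from the text (e.g., w_i = 2 for a TMR unit).

def redundancy_ratio(h_i, h_i0):
    """h_i: hardware cost with redundancy; h_i0: baseline cost without any."""
    return (h_i - h_i0) / h_i0

h_i0 = 1000                                  # baseline component cost
print(redundancy_ratio(3 * h_i0, h_i0))      # 2.0 -> TMR (three full copies)
print(redundancy_ratio(5 * h_i0, h_i0))      # 4.0 -> 5-MR
print(redundancy_ratio(1500, h_i0))          # 0.5 -> fractional (partial redundancy)
```

Note that a real TMR unit also needs a majority voter on top of the three copies, so its w_i sits slightly above 2; the text's "roughly speaking, w_i = 2" reflects this.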
4 Journal of Electrical and Computer Engineering Video signal Output bitstream Preprocessing Output buffer − Transform Quantization Entropy coding IV III Inverse quantization Inverse transform + Deblocking filter Intraframe prediction Frame buffer MC prediction I, II Motion estimation Figure 2: Block diagram of a standard H.264 encoder. Figure 3 plots our measurement results for IME (inte- high error rate. Interestingly, as we increase hardware redun- ger motion estimation). Different curves represent output dancy ratio, the critical point noticeably shifts towards high responses for different hardware redundancy ratios. Intu- error rate. For our particular IME unit instance, diminish itively, the higher the value of wi is, the more computing return of hardware redundancy quickly appears as wi goes fortification the target device will possess. Roughly speaking, beyond 4, which is manifested by the flattening in its curve. wi = 2 is equivalent of a TMR (triple modular redundancy) Finally, at each wi, the middle portion of each curve in Figure 3 exhibits a strong linearity, thus can be accurately unit. The values of wi can be fractional because of partial redundancy as discussed in Section 6.1. In our experiments, approximated as a straight line. × −4 error injection rate is measured as in error/gate/clock-cycle. In Figure 4, we fix error inject rate at 3.0 10 We defer all discussions of error modeling and injection error/gate/clock-cycle and plot our measurement results for all four key components. The values of w can be fractional procedure to Section 5.2.NoteinFigure 3 that errors injected i because of partial redundancy as discussed in Section 6.1. at rate below 1 × 10−4 error/gate/clock-cycle do not impact In general, different component responds quite differently the output quality by more than 1%, revealing the inherent to the same error rate. As wi increases, the output error resilience of such applications to rare data errors. 
However, decreases consistently except for the component IV: CABAC. × −4 error rates beyond 2.0 10 error/gate/clock-cycle impact The relative small slopes of all curves in the middle portion image outputs significantly. We call this error rate as critical again display early occurrence of diminished return of point. In our set of experiments, we injected errors by hardware redundancy. The most important finding of this randomly selecting one of considered subcomponents and experiment is to discover that DCT is far more sensitive to flipping one bit at a randomly chosen location. Our IME hardware redundancy than other components, while CABAC unit becomes totally dysfunctional after an average of 67 show exactly the opposite. Intuitively, this means that if error injections when the error injection rate was as low facing hardware cost constrains, DCT should be assigned as 1.13 × 10−4 error/gate/clock-cycle. These results confirm with a higher priority of hardware fortification and will have that, without proper system design, we cannot expect IME to a larger benefit in improving the overall error resilience of produce useful results on unreliable hardware with relatively the whole system. Finally, these results show that without Journal of Electrical and Computer Engineering 5 10 The unit used here can be either gate count or physical 9 chip area. Furthermore, assuming all hardware defects are 8 distributed in an i.i.d. manner, that is, all hardware defects are independent and identically distributed, all following a 7 uniform distribution. Therefore, for any randomly occurring 6 device error, the probability for one specific component 5 ff K = Ci to be a ected is Ai/ j=1 Aj gi. Note that in our 4 proposed DFHS methodology, the random distribution of 3 hardware effects can be in other form and does not limit Normalized output error 2 the applicability of our proposed approach. 
In other words, if the randomly distributed hardware failures are more 1 concentrated (i.e., occurring with higher probability) in 12345678910 certain component i, we can accordingly increase Ci to Error injection rate (×10−4 error / gate/ clock-cycle) accommodate such changes. Finally, we define the error probability of each compo- wi= wi= 4 0 nent Ci to be ei = F(wi, ni), a function of two variables: wi wi= 2 wi= 6 and ni. wi denotes the hardware redundancy ratio defined in Section 3, which quantifies how much hardware redundancy ff Figure 3: Output error versus error injection rate at di erent wi for is built in the component C . Associated with each of the IME unit. i n functional components, there exist several choices of design alternatives having different hardware redundancy 10 ratios w (as defined in Section 3). Typically, higher wi values 9 intuitively indicate more error resilience for a particular component at the price of high hardware usage. n denotes 8 i the number of hardware defects occurring in the component 7 under consideration. 6 Given all the above definitions, further assuming error 5 propagation within this target system is multiplicative, the 4 objective of maximizing the overall error resilience of the G 3 target system , under the attack of N randomly located hardware defects, can be formulated as minimizing the Normalized output error 2 overall product of error probability and criticality among all 1 its components: 0 ⎛ ⎞ 01234567 K K N j Ai Hardware redundancy ratio (wi) = = ⎝ ⎠ E w, N ϕigiei ϕi K F wi, j , (1) i=1 i=1 j=1 k=1 Ak IV: CABAC II: FME III: DCT I: IME where w denotes w1, w2, ..., wK . Clearly, constrained by Figure 4: Output error versus hardware redundancy ratio wi for a total hardware resource H, the problem is to determine different units. 
which hardware redundancy ratio to select for each key component in order to achieve the greatest error resilience while keeping the total hardware cost within the allowable computing fortification, the inherent error resilience for amount, taking into consideration each component’s distinct these key components, although exists, is quite limited. criticality. This formulation of the problem leads to minimiz- ing the total system error probability E: 4. Stochastic Formulation of Error Resilience n arg min E w, N ,s.t. Ai ≤ H. (2) w ,i∈[1,K] For a target design G consisting of K functional components, i i=1 each component Ci will be assigned a criticality value ϕi as defined in Section 5. Intuitively, a higher criticality value for a Interestingly, because both gi and ei depend on wi, particular component Ci means more importance in its effect (1) is fundamentally a nonlinear multivariate optimization towards the overall correctness of final results. Moreover, problem. More discussion on solving this optimization we should allocate relatively more hardware resources to problem can be found in Section 6. components with higher criticality in order to maximize the overall system’s error resilience when constrained by total 5. Criticality Analysis hardware usage. Assume each component Ci consumes Ai units of hard- Criticality analysis (CA) provides relative measures of sig- ware in the target application’s final FPGA implementation. nificance of the effects of individual components on the 6 Journal of Electrical and Computer Engineering overall correctness of system operation. In essence, it is a Input vector tool that ranks the significance of each potential failure for each component in the system’s design based on a failure rate and a severity ranking. 
Intuitively, because some parts of an application may be more critical than others, more computing resources should be allocated to these more critical components, that is, stronger computing "fortification," in order to achieve higher overall error resilience. In the framework of SFC, because the target computing device to be fortified is synthesized with a standard HDL flow and then realized (placed and routed) on a reconfigurable device such as an FPGA, criticality analysis is the key step in optimally allocating hardware redundancy within the target implementation.

Criticality analysis has been extensively studied in the software domain [24–26], but remains quite rare in research on error-resilient computing devices. Our proposed approach to quantifying criticality directly benefits from mature reliability engineering techniques and scenario-based software architecture analysis. The key insight underlying our approach is that criticality analysis can be recast as a combined problem of uncertainty analysis and sensitivity analysis. Sensitivity analysis involves determining the contribution of individual input factors to the uncertainty in model predictions. The most commonly used approach for sensitivity analysis on spatial models is Monte Carlo simulation. There are a number of techniques for calculating sensitivity indices from the Monte Carlo simulations, some more effective or efficient than others (see Figure 5).

Figure 5: Conceptual diagram of sensitivity and criticality analysis (varying wi between a component without hardware redundancy and one with hardware redundancy, and observing the input and output vectors).

Figure 6 depicts our procedure of criticality analysis. For each target system implemented in HDL, we first dissect it into multiple components. A set of key components, often selected based on the designer's domain-specific knowledge, then undergoes our criticality analysis. The fault modeling and injection mechanism is discussed in Section 5.2, and how to quantify the output errors is described in Section 5.3.

To facilitate later discussion, we now define two metrics, sensitivity ψi and criticality ϕi, for a specific component i at error rate e. The sensitivity value is ψi = Δyi(e)/Δwi. Moreover, we define the criticality value ϕi as

    \phi_i = E\left[ \frac{1/\psi_i(e)}{\sum_j 1/\psi_j(e)} \right],    (3)

where E[·] denotes the expectation taken over a wide range of error injection rates. In our case, we picked an i.i.d. uniform distribution over the different error rates. ψi can be further approximated by the linear equation ψi(e) = ai(e) + bi(e)wi. Both the ai and bi values can be obtained by performing least-squares data fitting on our empirical data, exemplified in Figure 4.

5.1. System Component Dissection. For a given target hardware implementation, before performing sensitivity and criticality analysis, the first step is to decompose the overall system into components. In this study, such system dissection typically follows individual module boundaries naturally. In most instances, the block diagram of the logic implementation clearly shows the boundaries between different system modules.

We now use IME as an example to illustrate our FPGA implementation in detail. More importantly, we demonstrate how we perform component dissection in this SFC study. IME computes the motion vector predictor (MVP) of the current macroblock (MB). There are many kinds of algorithms for block-based IME. The most accurate strategy is the full search (FS) algorithm. By exhaustively comparing all reference blocks in the search window, FS gives the most accurate motion vector, the one yielding the minimum sum of absolute differences (SAD) or sum of squared differences (SSD). Because motion estimation in H.264/AVC supports variable block sizes and multiple reference frames, high computational complexity and huge data traffic become the main difficulties in VLSI implementation [23]. Therefore, current VLSI designs usually adopt a parallel architecture to increase the total throughput and cope with the high computational complexity. We implemented our IME unit in Verilog HDL by closely following the block diagram in Figure 7. Among all units, the PE array and the four-input comparators perform computation on input pixel values and are most critical to the accuracy of the final outputs. In total, we have 41 four-input comparators and 138 PE units. Our approach allows fortifying either some or all of these subunits.
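As a small worked example of (3), the sketch below averages each component's normalized inverse sensitivity over a set of error injection rates. The sensitivity samples are synthetic stand-ins chosen for illustration, not measured data from our experiments.

```python
# Sketch of the criticality metric in (3): phi_i = E[(1/psi_i) / sum_j (1/psi_j)],
# with the expectation taken over error injection rates. Synthetic data only.

def criticality(psi_samples):
    """psi_samples[r][i] is the sensitivity of component i at error rate r."""
    K = len(psi_samples[0])
    phi = [0.0] * K
    for psi in psi_samples:            # one row of sensitivities per error rate
        inv = [1.0 / p for p in psi]
        denom = sum(inv)
        for i in range(K):
            phi[i] += inv[i] / denom
    return [x / len(psi_samples) for x in phi]  # average over error rates

# Synthetic sensitivities for four components at three error rates:
samples = [[0.5, 1.0, 2.0, 4.0],
           [0.6, 1.2, 2.4, 4.8],
           [0.4, 0.8, 1.6, 3.2]]
phi = criticality(samples)
print(phi[0] > phi[1] > phi[2] > phi[3])  # lower sensitivity -> higher criticality here
print(abs(sum(phi) - 1.0) < 1e-9)         # the criticality values sum to one
```

Note that, by construction, the criticality values are normalized to sum to one at each error rate, so the average over rates is normalized as well.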
5.2. Fault Modeling and Injection. Much of our fault modeling and error injection mechanism is based on prior studies [27–29]. Our fault injection and evaluation infrastructure exploits the Verilog programming interface (VPI) in order to quickly and flexibly inject emulated hardware defects into Verilog-based digital designs in real time. Unlike many traditional fault injection methods, which either need to modify the HDL code or utilize simulator commands for fault injection, our VPI-based approach is minimally invasive to the target design. Because the VPI is standardized, the framework is independent of the simulator used. Additionally, it does not require recompilation for different fault injection experiments, unlike techniques that modify the Verilog code for fault injection.

Figure 6: Flow chart of criticality analysis (error injection and probabilistic error modeling feed a perfect and a defective system; the resulting output variance drives sensitivity analysis and then criticality analysis).

As shown in Figure 8, at the beginning of the simulation, the simulator calls the system command $HDL InjectError, which generates one fault injection callback that is subsequently registered according to the specified arguments, fault type and place. Each such callback is specified by its time instant and duration and will be scheduled to execute at the injection instant and at the end of the injection. When the injection instant is reached, the corresponding callback is executed by the simulator. This callback sets the conditions under which the value modification at the specified place will occur. During the injection, the value of the injection place is modified by using the VPI to control the Verilog simulator. At the end of the fault, the injection callback is executed again to restore the normal value. One advantage of VPI-based fault injection is that it can be applied to all VPI-compliant Verilog simulators.

In order to make fault injection and evaluation feasible, we limit the fault models used in this work to single-bit faults. A variety of fault models are defined in order to represent real physical faults that occur in integrated circuits (ICs). In total, we consider four kinds of hardware faults: bit flip, toggling bit flip, stuck-at-0 or stuck-at-1, and open fault. In this work, the bit flip, which forces an inverted value at the instant ts, is differentiated from the toggling bit flip, which toggles the original value. In principle, the fault models can be readily extended with additional fault types.

Figure 7: Block diagram of an integer motion estimator design with a macroblock parallel architecture and sixteen 2D PE arrays [23].

Figure 8: Block diagram of the VPI simulation and fault injection flow [29].

5.3. Evaluating Fault Impact. The criticality analysis in SFC requires accurately quantifying the impact of injected hardware faults by comparing the result images with the reference ones. Unfortunately, computing the overall differences of pixel values will not detect and record the visual differences critical to human perception [30]. We instead calculate the Hausdorff distance to extract and record local and global features from both images, compare them, and define the percentage of similarity.

The Hausdorff distance measures the extent to which each point of a "model" set lies near some point of an "image" set, and vice versa. Thus, this distance can be used to determine the degree of resemblance between two objects that are superimposed on one another. In this paper, we provide efficient algorithms for computing it between all possible relative positions of a binary image and a model.

Given two finite point sets A = {a1, a2, ..., ap} and B = {b1, b2, ..., bq}, the Hausdorff distance is defined as

    H(A, B) = max(h(A, B), h(B, A)),    (4)

where

    h(A, B) = \max_{a \in A} \min_{b \in B} \| a - b \|    (5)

and ||·|| is some underlying norm on the points of A and B (e.g., the L2 or Euclidean norm). The Hausdorff distance H(A, B) is the maximum of h(A, B) and h(B, A). Thus, it measures the degree of mismatch between two sets by measuring the distance of the point of A that is farthest from any point of B, and vice versa. Intuitively, if the Hausdorff distance is d, then every point of A must be within distance d of some point of B, and vice versa. The notion of resemblance encoded by this distance is thus that each member of A be near some member of B, and vice versa. Unlike most methods of comparing shapes, there is no explicit pairing of points of A with points of B (e.g., many points of A may be close to the same point of B). The function H(A, B) can be trivially computed in time O(pq) for two point sets of size p and q, respectively, and this can be improved to O((p + q) log(p + q)) [30].
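For reference, the definitions in (4) and (5) translate directly into the trivial O(pq) procedure. Below is a minimal sketch using the Euclidean norm on two made-up planar point sets; the faster O((p + q) log(p + q)) algorithm of [30] is not shown.

```python
# Direct O(p*q) Hausdorff distance, following (4) and (5) with the L2 norm.
import math

def h(A, B):
    """Directed distance h(A, B) = max_{a in A} min_{b in B} ||a - b||."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """H(A, B) = max(h(A, B), h(B, A))."""
    return max(h(A, B), h(B, A))

A = [(0, 0), (1, 0), (0, 1)]
B = [(0, 0), (1, 0), (0, 3)]
print(h(A, B))          # 1.0: (0, 1) is distance 1 from its nearest point of B
print(hausdorff(A, B))  # 2.0: driven by (0, 3), which is distance 2 from (0, 1)
```

The asymmetry of the directed distance is visible here: h(A, B) = 1 while h(B, A) = 2, so the symmetric H(A, B) takes the larger of the two.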
6. Optimally Allocating Hardware Redundancy

Assume that our target system consists of n functional components and that error propagation within the target system is multiplicative. Associated with each of the n functional components, there exist several choices of design alternatives having different hardware redundancy ratios w (as defined in Section 3). Let ei represent the ith component with a specified inherent error tolerance, and let Ri(w) denote the derived error tolerance function of the ith component when the hardware redundancy ratio of ei is w. Given the overall hardware cost limit H, the problem is to determine which hardware redundancy ratio to select for each key component in order to achieve the greatest error resilience while keeping the total hardware cost within the allowable amount. This formulation of the problem leads to the maximization of the system reliability R, given by the product of the unit error resiliences:

    \arg\max_{w_i,\, i \in [1,n]} R = \prod_{i=1}^{n} R_i(w_i), \quad \text{s.t. } \sum_{i=1}^{n} h_{i,0}(w_i + 1) \leq H.    (6)

6.2. Solving Allocation Problem. In this section, we determine the optimal allocation for a generic hardware design subject to a set of hardware redundancy options with various costs, that is, we solve the optimization problem defined in (6). Note that arg max_{wi, i∈[1,n]} R equals arg max_{wi, i∈[1,n]} ln R = arg max_{wi, i∈[1,n]} Σ_{i=1}^{n} ln Ri(wi), where ln(ai + biwi) can be expanded as a Taylor series:

    \ln(a_i + b_i w_i) \approx \ln(a_{i,0} + b_{i,0} w_{i,0}) + \left. \frac{d \ln(a_i + b_i w_i)}{d w_i} \right|_{w_i = w_{i,0}} \cdot (w_i - w_{i,0}) + \cdots \approx \alpha_i w_i + \beta_i,    (7)

where αi = bi,0/(ai,0 + bi,0wi,0) and βi = ln(ai,0 + bi,0wi,0) − bi,0wi,0/(ai,0 + bi,0wi,0). In practice, wi,0 should be chosen close to the optimal solution wi* of (6) in order to make the approximation more accurate. Therefore, (6) becomes

    \arg\max_{w_i,\, i \in [1,n]} \sum_{i=1}^{n} (\alpha_i w_i + \beta_i), \quad \text{s.t. } \sum_{i=1}^{n} h_{i,0}(w_i + 1) \leq H.    (8)
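The first-order expansion in (7) is easy to check numerically. The sketch below uses illustrative coefficients ai,0, bi,0 and expansion point wi,0 (not values fitted from our data) and confirms that ln(a + bw) and αw + β agree closely near wi,0.

```python
# Numerical check of the Taylor linearization in (7):
# ln(a + b*w) ~ alpha*w + beta around the expansion point w0.
import math

def linearize(a0, b0, w0):
    """alpha = b0 / (a0 + b0*w0); beta = ln(a0 + b0*w0) - alpha*w0."""
    alpha = b0 / (a0 + b0 * w0)
    beta = math.log(a0 + b0 * w0) - alpha * w0
    return alpha, beta

a0, b0, w0 = 1.0, 0.5, 2.0          # illustrative coefficients only
alpha, beta = linearize(a0, b0, w0)
for w in (1.8, 2.0, 2.2):           # the gap shrinks as w approaches w0
    print(round(math.log(a0 + b0 * w) - (alpha * w + beta), 4))
```

Replacing each ln Ri(wi) with its linear surrogate αiwi + βi is what turns (6) into the linear program (8), at the cost of accuracy away from the expansion point.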