Cross Layer Reliability Estimation for Digital Systems
Total Page:16
File Type:pdf, Size:1020Kb
Politecnico di Torino Porto Institutional Repository [Doctoral thesis] Cross layer reliability estimation for digital systems Original Citation: Vallero, Alessandro (2017). Cross layer reliability estimation for digital systems. PhD thesis Availability: This version is available at : http://porto.polito.it/2673865/ since: June 2017 Published version: DOI:10.6092/polito/porto/2673865 Terms of use: This article is made available under terms and conditions applicable to Open Access Policy Article ("Public - All rights reserved") , as described at http://porto.polito.it/terms_and_conditions. html Porto, the institutional repository of the Politecnico di Torino, is provided by the University Library and the IT-Services. The aim is to enable open access to all the world. Please share with us how this access benefits you. Your story matters. (Article begins on next page) Doctoral Dissertation Doctoral Program in Computer Engineering (29thcycle) Cross layer reliability estimation for digital systems By Alessandro Vallero ****** Supervisor(s): Prof. Stefano Di Carlo, Supervisor Doctoral Examination Committee: Prof. Regis Leveugle , Referee, TIMA Grenoble Prof. Marco Ottavi, Referee, University of “Rome Tor Vergata” Prof. Ramon Canal, UPC Barcelona Prof. Lorena Angel, TIMA Grenoble Prof. Dimitris Gizopoulos, University of Athens Politecnico di Torino 2017 Declaration I hereby declare that, the contents and organization of this dissertation constitute my own original work and does not compromise in any way the rights of third parties, including those relating to the security of personal data. Alessandro Vallero 2017 * This dissertation is presented in partial fulfillment of the requirements for Ph.D. degree in the Graduate School of Politecnico di Torino (ScuDo). I would like to dedicate this thesis to my grandparents who always open my mind toward new horizons Acknowledgements I would like to acknowledge Prof. Stefano Di Carlo, at first. He made my Ph.D. a pleasant journey during which I had the chance to learn so many things from him, from both a professional and a human point of view. Secondly, I would like to thank my mother, Linda, and my sister, Greta. They supported me emotionally, even in very stressful moments. I acknowledge Computer Architecture Lab of University of Athens. It was a pleasure to work together. During my visiting you made me feel like at home. Thank you Prof. Dimitris Gizopoulos, Sotiris Tselonis, Manolis Kaliorakis, Athanasios Chatzdimitriou, Giorgos Papadimitriou. In addition, I would like to thank all the CLERECO partners with which I collaborated for my Ph.D. activities. I would also like to acknowledge my housemates, Ivan and Daniele, who suffered me for more than two long years! Finally, I thank all the guys of my lab at Politecnico di Torino (Lab 6) who always laughed at my jokes and contributed to create a pleasant atmosphere. Abstract Forthcoming manufacturing technologies hold the promise to increase multifuctional computing systems performance and functionality thanks to a remarkable growth of the device integration density. Despite the benefits introduced by this technology improvements, reliability is becoming a key challenge for the semiconductor industry. With transistor size reaching the atomic dimensions, vulnerability to unavoidable fluctuations in the manufacturing process and environmental stress rise dramatically. Failing to meet a reliability requirement may add excessive re-design cost to recover and may have severe consequences on the success of a product. One of the open challenges for future technologies is building “dependable” systems on top of unreliable components, which will degrade and even fail during normal lifetime of the chip. Conventional design techniques are highly inefficient. They expend significant amount of energy to tolerate the device unpredictability by adding safety margins to a circuit’s operating voltage, clock frequency or charge stored per bit. Unfortunately, the additional cost introduced to compensate unre- liability are rapidly becoming unacceptable in today’s environment where power consumption is often the limiting factor for integrated circuit performance, and energy efficiency is a top concern. Attention should be payed to tailor techniques to improve the reliability of a system on the basis of its requirements, ending up with cost-effective solutions favoring the success of the product on the market. Cross-layer reliability is one of the most promising approaches to achieve this goal. Cross-layer reliability techniques take into account the interactions between the layers composing a complex system (i.e., technology, hardware and software layers) to implement efficient cross-layer fault mitigation mechanisms. Fault tolerance mechanism are carefully implemented at different layers starting from the technology up to the software layer to carefully vi optimize the system by exploiting the inner capability of each layer to mask lower level faults. For this purpose, cross-layer reliability design techniques need to be comple- mented with cross-layer reliability evaluation tools, able to precisely assess the reliability level of a selected design early in the design cycle. Accurate and early reliability estimates would enable the exploration of the system design space and the optimization of multiple constraints such as performance, power consumption, cost and reliability. This Ph.D. thesis is devoted to the development of new methodologies and tools to evaluate and optimize the reliability of complex digital systems during the early design stages. More specifically, techniques addressing hardware accelerators (i.e., FPGAs and GPUs), microprocessors and full systems are discussed. All developed methodologies are presented in conjunction with their application to real-world use cases belonging to different computational domains. Contents List of Figures xi List of Tables xvii Nomenclature xviii 1 Introduction4 1.1 Motivation . .4 1.2 Goals . .7 1.3 Structure of the thesis . .9 2 Dependability of digital systems 12 2.1 Dependability . 12 2.2 Reliability metrics of digital systems . 14 2.2.1 Failure Rate . 15 2.2.2 Failures In Time . 15 2.2.3 Mean Time To Failure . 15 2.2.4 Mean Time To Repair . 16 2.2.5 Mean Time Between Failures . 16 2.2.6 Availability . 16 2.2.7 Mean Workload To Failure . 17 viii Contents 2.2.8 Mean Instruction To Failure . 17 2.3 Dependability threats . 17 2.3.1 Faults . 17 2.3.2 Errors . 20 2.3.3 Failures . 21 2.4 Soft errors . 23 2.5 Dependability enhancement techniques . 26 2.5.1 The traditional approach . 28 2.5.2 Cross-layer reliability enhancement . 31 3 Fault tolerance in autonomous FPGA-based systems 33 3.1 Introduction . 33 3.2 State-of-the-art . 35 3.3 The proposed methodology . 37 3.3.1 The proposed system architecture . 37 3.3.2 The proposed partitioning methodology . 39 3.3.3 The proposed partitioning algorithm . 41 3.4 Experimental results . 44 3.5 Conclusion . 48 4 GPGPU Reliability evaluation 49 4.1 Introduction . 50 4.2 Related works . 52 4.3 The Southern Islands AMD architecture . 53 4.4 SIFI architecture and functionalities . 54 4.4.1 Fault injection engine . 56 4.4.2 ACE analysis engine . 57 Contents ix 4.5 Experimental results . 59 4.5.1 SIFI results . 59 4.5.2 A multi-faceted comparison of reliability between AMD and NVIDIA GPU architectures . 65 4.6 Conclusions . 77 5 A Bayesian model for reliability evaluation of CPU-based systems 80 5.1 Introduction . 81 5.2 Reliability analysis . 83 5.3 Experimental results . 89 5.3.1 Framework implementation . 89 5.3.2 Experiment setup . 90 5.3.3 Results . 92 5.4 Conclusion . 93 6 Reliability estimation of complex digital systems 95 6.1 Introduction . 96 6.2 Related works . 98 6.2.1 Architectural level reliability analysis . 98 6.2.2 Accounting for software effects in reliability analysis . 98 6.2.3 Bayesian models for reliability estimation . 99 6.3 The proposed system level reliability framework . 100 6.3.1 Qualitative model of the system . 101 6.3.2 Quantitative model of the system . 104 6.3.3 Reasoning on the model of the system . 108 6.4 Integrated tools framework . 109 6.4.1 Technology Domain . 109 6.4.2 Hardware Domain . 111 x Contents 6.4.3 Software Domain . 112 6.4.4 System Domain . 112 6.5 Results . 113 6.5.1 The experimental setup . 113 6.5.2 The validation framework . 115 6.5.3 Experimental Results . 119 6.6 Conclusions . 135 7 Reliability optimization of complex digital systems 137 7.1 Introduction . 138 7.2 System level modeling . 139 7.3 System optimization methodology . 142 7.3.1 Extremal optimization theory . 142 7.3.2 Definitions and notation . 143 7.3.3 Optimization algorithm . 144 7.4 Experimental Results . 149 7.4.1 Experimental setup . 149 7.4.2 Results and discussion . 151 7.4.3 From benchmarks to real applications . 155 7.5 Conclusion . 156 8 Conclusions 159 References 160 Appendix A The SER of future devices 175 List of Figures 2.1 All the aspects of dependability. 13 2.2 The chain of dependability threats. 18 2.3 Classification of faults. 19 2.4 Classification of errors caused by faulty bits. 22 2.5 Failure modes of computing systems. 22 2.6 The dependability enhancing mechanisms. 27 2.7 The conventional approach to enhance reliability usually targets one or two layers of the system. 28 2.8 Cross-layer information sharing and cooperation. 31 3.1 The proposed system architecture (Source: [1]). 38 3.2 Circuit basic components and tiles organization (Source: [1]). 39 3.3 The recovery strategies when a permanent faults occur. Fig 3.3a illustrates the relocation of the faulty logic tile into a recovery tile and the reconfiguration of the interconnection tile. Fig 3.3d shows how interconnections are reconfigured when a permanent fault is detected inside the active interconnection tile (Source: [1]). 42 3.4 FEMIP basic components (Source: [1]).