
UNIVERSITY OF THESSALY

PHD THESIS

System software techniques to enhance reliability of modern platforms.

Author: Konstantinos PARASYRIS
Supervisor: Nikolaos BELLAS

Advising committee: Nikolaos BELLAS, Spyros LALIS, Christos D. ANTONOPOULOS

A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy

in the

Department of Electrical and Computer Engineering

October 17, 2018


“Optimism is the faith that leads to achievement. Nothing can be done without hope and confidence.”

"Helen Keller"


Abstract

Konstantinos PARASYRIS

System software techniques to enhance reliability of modern platforms.

Chip manufacturers introduce redundancy at various levels of CPU design to guarantee correct operation even for worst-case combinations of non-idealities in process variation and system operating conditions. This redundancy is implemented partly in the form of wide voltage margins. This PhD dissertation is based on the observation that these conservative design rules are mostly excessive, as they account for execution scenarios that rarely appear during the lifetime of a system. Shrinking these margins saves energy but exposes the hardware to faults; if such faults are ignored, they lead to application crashes or even system-wide failures. Our software-based approach handles these faults to enable execution under such conditions. It builds on the observation that in many application domains it is not the exact output that matters, but a sufficiently good approximation of it. We therefore propose a programming model in which the developer can declare which parts of the application are more significant than others. The programming model extends OpenMP, a widely used parallel programming model. The developer provides information about the significance of computations, and the programming model exposes a parameter, called ratio, which controls the extent of quality degradation and energy efficiency.

The idea of significance-aware computing is applied to two different computing paradigms: fault tolerant computing and approximate computing. For fault tolerant computing, we implement a significance-centric programming model and runtime support which set the supply voltage of a multicore CPU to sub-nominal values to reduce the energy footprint, and provide mechanisms to control output quality. Developers specify the significance of application tasks according to their contribution to the output quality and provide check and repair functions for handling faults. On a multicore system we evaluate our approach using an energy model which quantifies the energy reduction. When executing the least significant tasks unreliably, our approach leads to 20% CPU energy reduction with respect to a reliable execution, with minimal quality degradation. For approximate computing, we implement a similar programming model that combines the significance and ratio features. Using analytical models of the energy consumption of the application, the approach can efficiently decide the degree of approximation needed to meet user-defined energy requirements.

Many application domains, however, require their computations to be performed without any errors. Our study of the x86-64 Haswell and Skylake multicore microarchitectures reveals wide voltage margins which can be removed without observing any errors. These margins reach up to 22% and 13% of the nominal supply voltage for the Skylake and Haswell architectures respectively, and they vary across microarchitectures, across chip parts of the same microarchitecture, and across workloads. We introduce a model which can be used to dynamically adjust the supply voltage of modern multicore x86-64 CPUs to just above the minimum required for safe operation. We identify a set of performance metrics, directly measurable via performance monitoring counters, with high predictive value for the minimum tolerable supply voltage (Vmin). We use benchmarks that vary in terms of application domain, resource utilization and pressure, and software/hardware interaction characteristics to train a Vmin prediction model. At execution time these metrics are monitored and serve as input to the model, in order to predict and apply the appropriate Vmin for the workload. Compared to conventional approaches, our methodology achieves up to 42% energy savings for the Skylake family and 34% for the Haswell family for complex, real-world applications.

Last but not least, during the course of the thesis we implemented the infrastructure to accurately observe the resilience of applications to faults and to identify the voltage and frequency margins of modern processors. GemFI is a tool based on Gem5, a cycle-accurate full-system simulator. GemFI provides fault injection methods and is easily extensible to support future fault models. It also supports multiple processor models and ISAs and allows fault injection in both functional and cycle-accurate simulations. GemFI offers fast-forwarding of simulation campaigns via checkpointing, and it facilitates the parallel execution of campaign experiments on a network of workstations. XM2 enables the evaluation of software on systems operating outside their nominal margins. It supports both bare-metal and OS-controlled execution, offers an API to control the fault injection procedure, and provides automatic management of experimental campaigns.


Περίληψη

Κωνσταντίνος Παρασύρης

Increasing the fault resilience of applications on modern platforms

For the last several decades the semiconductor industry has relied on Moore's law, which predicts that the number of transistors per unit area in CMOS integrated circuits doubles every 18 months. In contrast to the past (before 2004), when processor designers used the growing transistor count to pursue a corresponding increase in performance, the new reality makes the reduction of power (and energy) consumption the foremost challenge in processor design. At the same time, the high transistor density leads to unreliable operation of modern processors. This unreliability is partly due to dynamic current and supply-voltage fluctuations, which are more likely to cause timing errors in small CMOS geometries. Fabrication faults due to imperfections of the photolithography process also become more likely. In addition, transient faults caused by external factors, such as alpha particles, have a larger impact at smaller CMOS geometries. To achieve reliable operation under these conditions, the designers of modern processors employ conservative design techniques, such as wide supply-voltage (Vdd) and clock-frequency margins, so that the processor is protected against any possibility of timing errors. These conservative techniques may protect the reliable operation of the processor, but they result in a large waste of power and energy, which in several cases reaches up to 35%.

The central idea of this PhD dissertation is that these conservative design techniques are almost always unnecessary, corresponding to operating conditions that will almost never occur simultaneously during the lifetime of the processor. Using techniques mainly at the level of system software and applications, the dissertation proposes operating the processor very close to its extreme operating conditions and eliminating most of the design margin. For example, dynamically reducing the supply voltage of a processor during operation can bring large improvements in power consumption but, if left uncontrolled, may produce incorrect results or even abruptly interrupt its operation. Conversely, increasing the clock frequency of a processor may improve performance and reduce execution time, but it can compromise the reliability of its operation.

The dissertation builds on the idea that, for many applications (or individual application phases), the exact result either does not matter or is too expensive in machine cycles and energy consumption to be worth computing. We propose a new programming model in which the programmer can characterize the significance of the different parts of an application and their contribution to the quality of the final result. The programming model extends the well-known OpenMP model, which is widely used in parallel programming. Through the significance information that the programmer provides at the source-code level, and an additional parameter called ratio, the programmer can control the trade-off between energy reduction and output-quality reduction.

The dissertation implements this idea in two different computing domains: approximate computing and fault tolerant computing. In the fault tolerant domain, the system-software infrastructure (programming model and runtime system) is implemented to cover unreliable computing environments. In unreliable environments, which may be caused by reduced supply voltage and/or increased clock frequency, any hardware error can occur, beyond the programmer's control. This requires mechanisms for detecting and correcting errors, as well as mechanisms for isolating errors so that they do not propagate to parts of the code that are critical for the correctness of the application. These mechanisms must achieve the highest possible error-detection rate while imposing a small overhead on application performance. The dissertation proposes and implements analytical models of performance and energy consumption for multicore processors. Equally important is the analytical modeling of the timing errors (fault modeling) that occur in a processor as a function of its supply voltage. The complete reliable-computing software stack, together with these models, is used in extensive application simulations on the GemFI simulator to estimate the improvement in energy consumption (due to reduced supply voltage) without corrupting the final results of the computation and without interrupting the operation of the processor. The experiments showed that the supply voltage of a processor can be reduced by 15% on average before reaching the Point of First Failure (PoFF), below which the processor is exposed to massive timing errors and any attempt to operate it is impossible (or not worthwhile). The reliable-computing software stack allows the supply voltage to be reduced down to the PoFF (but not lower), successfully handling the errors that appear in this region and achieving a large reduction in power/energy consumption without any impact on processor speed.

In the approximate computing domain, we propose and implement a similar programming model and a runtime system which, using analytical models of the performance and power consumption of the processor, makes informed decisions about the flow and coordination of program execution. This mechanism allows the programmer to dynamically trade the output quality of an application against system performance and power/energy consumption.

For many applications, however, exact computations are necessary, as they cannot tolerate any reduction in output quality. Therefore, through a series of experiments on Intel x86-64 processors, we identified the maximum supply-voltage reduction for which the processor remains in a safe (error-free) operating zone. These experiments revealed that the voltage-reduction margins reach up to 22% for Skylake processors and 13% for Haswell processors. This analysis of the voltage margins made it possible to build a machine-learning model used to predict future supply-voltage margins. The model takes as input the values of the processor's performance counters (while the program is running) and predicts the voltage margin for the next time interval. In other words, this dissertation has experimentally demonstrated the correlation between microarchitecture-level events (as described by the processor's performance counters) and the maximum supply-voltage margin. The model is used by a sophisticated voltage/frequency governor that sets the processor voltage at each point in time, achieving a large reduction in power consumption without the risk of reduced reliability. Compared to the conventional Intel DVFS (Dynamic Voltage and Frequency Scaling) governor, our method achieves energy reductions of 42% on Skylake processors and 34% on Haswell processors.

This dissertation also studied the effect that compiler optimizations of application code have on the supply-voltage margins. Using the XM2 tool, it was found that transformations which reduce main-memory accesses have a large effect on the supply-voltage margins. In general, however, the supply-voltage margin is more a function of the processor's microarchitecture than of the code optimizations applied by the compiler.

The dissertation also created the research infrastructure for the experiments conducted to evaluate the techniques described in the previous paragraphs. GemFI is a tool for injecting and simulating logic faults in a multicore processing system. It allows the user to inject various types of faults, at any point in time during program execution, and in any part of the processor architecture. GemFI, which is based on the well-known Gem5 simulator, can fully execute application and system software for a wide range of different architectures, thus enabling the study of system reliability under different conditions. A second tool was also developed, the eXtended Margins eXperiment Manager (XM2), which automates the creation of experiments on real processors that can be placed in unreliable voltage and frequency states. The tool was used to build such experiments for characterizing hardware reliability and for determining the fault tolerance of application software running on x86-64 (Skylake, Haswell) and ARM processors.


Acknowledgements

This thesis is the result of research work conducted while I was pursuing my PhD degree in the Department of Electrical and Computer Engineering of the University of Thessaly in Greece. I acknowledge the funding agencies which made this research possible through financial means. These include the European Commission, through the SCoRPiO EU project, as well as the Center for Research & Technology Hellas (CERTH).

First and foremost I would like to thank my mentors, professors Nikolaos Bellas, Spyros Lalis, and Christos D. Antonopoulos from the University of Thessaly. They have been exceptionally good at guiding me from my initial steps, throughout my Master's and PhD, and have molded me as a researcher. Without their guidance and mentoring none of this work would have been possible. They were always available to discuss and provide constructive criticism.

I would also like to thank my colleagues, who provided me with help, as well as stress relief with their wit and humor, easing the burden of research. Special thanks to my friend and colleague Vassilis Vassiliadis, with whom I have shared pretty much all of my research career thus far. Our joint research efforts and our stimulating, often heated, discussions were very educational and most of the time relaxing.

Last but not least, I owe a great many thanks to my friends and family. I am especially indebted to my parents, Wyanda and Antonis, for their unconditional love and support all along my academic pursuits. My friends will always have a special place in my heart because they were there during good times and bad times to support me with patience and love. The least I can do is dedicate this thesis to them all.


Contents

Abstract

Περίληψη

Acknowledgements

1 Introduction
  1.1 The reality of power consumption
  1.2 Reliability and Power
  1.3 Design Space exploration
    1.3.1 Hardware Level
    1.3.2 Software level
    1.3.3 Significance Definition
  1.4 Contributions
    1.4.1 Significance Aware Computing
      Significance aware fault tolerant programming model
      Significance aware Runtime system for fault tolerant computing
      Power/energy and fault modeling of the unsafe region
      Significance-aware programming model for approximate computing
      Significance aware Runtime system for approximate computing
    1.4.2 Exploiting Voltage Margins for Energy Efficiency
    1.4.3 Experimental Frameworks for Reliability Analysis
  1.5 Outline

2 Significance Aware Fault Tolerant Computing
  2.1 Contributions
  2.2 Programming Model Objectives & Properties
    2.2.1 Significance Characterization
    2.2.2 Safety Isolation
    2.2.3 Architecture Neutrality
    2.2.4 Parallelism Expression
    2.2.5 Relaxed Synchronization
    2.2.6 User Friendliness
  2.3 High Level Description
    2.3.1 Task-Based Programming model


      Pragmas for the Expression of Parallelism and Significance
      Early Error Detection To Minimize Fault Propagation
      Elastic Synchronization
      Significance of Data
  2.4 Syntax
    2.4.1 Task Definition and Significance Characterization
    2.4.2 Synchronization
  2.5 Example
  2.6 Programmer Insight
  2.7 Significance-aware Runtime System
    2.7.1 Runtime Execution Management
    2.7.2 Memory Management
    2.7.3 Life of a group-of-tasks
  2.8 Evaluation
    2.8.1 Benchmarks
    2.8.2 Evaluation of Programming model and Runtime System
    2.8.3 Runtime Overhead
  2.9 Energy Reduction Evaluation methodology
  2.10 Execution Time and Energy Consumption Model
    2.10.1 Execution time modeling
    2.10.2 Power and energy modeling
    2.10.3 Calibration and validation
  2.11 Fault Model and Fault Injection Methodology
    2.11.1 Fault modeling
    2.11.2 Simulation-based fault injection
    2.11.3 Software-based fault injection during native execution
  2.12 Experimental Evaluation

3 Significance Aware Approximate Computing
  3.1 Contributions
  3.2 Programming Model
  3.3 Runtime support for significance aware approximate computing
    3.3.1 Life of a group-of-tasks
    3.3.2 Approximate vs Fault Tolerant Runtime Support
  3.4 Experimental Evaluation
    3.4.1 Approach
    3.4.2 Experimental Results

4 Modeling and Prediction of Voltage Margins in Multicore CPUs
  4.1 Background
  4.2 Contributions
  4.3 Methodology
  4.4 Offline Characterization Background


    4.4.1 Methodology to identify maxVmargin
    4.4.2 Results of offline maxVmargin Characterization
    4.4.3 Performance Counter Profiling
  4.5 Modeling phase
    4.5.1 Combine Offline Data
    4.5.2 Data Splitting
    4.5.3 Model fitting
      Feature Number
      Feature Selection Algorithm
      Supervised Learning Algorithm
      Hyper-parameter selection
    4.5.4 Safety Margin
  4.6 Evaluation
    4.6.1 Mixed Workload Long Run Evaluation
  4.7 Voltage Emergencies

5 Experimental Frameworks for Reliability Analysis
  5.1 Contributions
  5.2 GemFI: Fault Injection Tool for Studying the Behavior of Applications on Unreliable Substrates
    5.2.1 The Gem5 Simulator
    5.2.2 GemFI Design and Implementation
      GemFI User Interface
    5.2.3 Simple Example
    5.2.4 GemFI Internals and Implementation
    5.2.5 Simulation Checkpointing
    5.2.6 Simulation Campaigns on a Network Of Workstations
    5.2.7 Validation
      Validation Methodology
      Experimental Results
    5.2.8 GemFI Performance Evaluation
  5.3 XM2: A Framework for Evaluating Software on Reduced Margins Hardware
    5.3.1 Platform Requirements
    5.3.2 Tool Design and Configuration
    5.3.3 Configuration File
    5.3.4 Run-time Library API
    5.3.5 Example
    5.3.6 Flow of a Fault Injection Campaign
    5.3.7 Evaluation
  5.4 Arm Cortex A53 Vulnerability Analysis
    5.4.1 Instruction Level Error Resiliency Analysis
    5.4.2 Error Resiliency of Source Code and Algorithm Transformations
    5.4.3 Compiler Optimizations VS Frequency Margins
      Source Code Transformations
      Memory Access Pattern Optimizations
      SIMD Optimizations
  5.5 GemFI versus XM2

6 Related work
  6.1 Approximate computing
  6.2 Fault Tolerant computing
    6.2.1 Power and Energy-Aware Optimization
  6.3 Voltage Margin Characterization and Prediction
  6.4 Fault Injection Tools

7 Conclusions
  7.1 Future Work

Related publications

Contribution to Joint Publications

Bibliography


List of Figures

1.1 Description of the reliability of a generic CPU for different operating points in terms of supply voltage and frequency.
1.2 Application domains with intrinsic error resiliency.
1.3 Vision of our approach. Applications should be categorized based on their contribution to the output quality. Using this information the applications should be scheduled in hardware with appropriate levels of reliability.

2.1 A single threaded execution of an abstract application; dark rectangles correspond to parts of the application that can be parallelized.
2.2 Application tasks are created and tagged with significance information.
2.3 Non-reliable tasks may execute error identification and correction functions after their termination.
2.4 Result-checks at the group-of-tasks granularity.
2.5 A case of relaxed synchronization which results in termination of late tasks.
2.6 The configurations FastRel, SlowRel and FastUnRel used by the runtime system to reduce the energy footprint by exploiting the significance of computations. Our approach exploits non-nominal configurations within the unsafe region, which are energy-efficient but unreliable.
2.7 The typical life of a group-of-tasks in the context of significance aware unreliable computing.
2.8 Breakdown of task execution time, for each benchmark.
2.9 Evaluation approach: we build the performance, energy and fault models (left), and use these models to drive experiments and estimate energy consumption (right).
2.10 Relative error for the execution time and energy as predicted by our model vs. a real execution, for our application benchmarks, when half of the tasks execute in the FastRel = (3.7GHz, 1.06V) configuration and the other half in a lower-power SlowRel configuration. All SlowRel configurations are shown on the x-axis.
2.11 Effects of single fault injection, using the GemFI simulator at the architectural CPU level, and the software-based approach during native execution.


2.12 Energy gains of a single task for Sobel executed at voltages Vl < Vh, for constant frequency fh = 3.7GHz.
2.13 Experimental results for different Vl values for the SlowRel and FastUnRel configurations. Percentage of experiments which achieved a certain quality (left), and energy gains with each protection scenario (right).
2.14 DCT output at 0.89V, with one fault injected every 100,000 cycles. The images correspond (from left to right) to the BP, B-RC, B-SF and FS protection configurations, resulting in PSNRs of 12, 13, 15 and 37 dB respectively. A fault-free execution results in a PSNR of 43 dB. NP deterministically leads to crashes.
2.15 Quality vs. energy trade-offs using the ratio parameter in the FS configuration.

3.1 The typical life of a group-of-tasks in the context of significance aware approximate computing.
3.2 Different levels of approximation for the Sobel benchmark.
3.3 Execution time, energy and quality of results for the benchmarks used in the experimental evaluation under different runtime policies and degrees of approximation. In all cases lower is better. Quality is depicted as PSNR−1 for Sobel and DCT; relative error (%) is used in all other benchmarks. The accurate execution and the approximate execution using perforation are visualized as lines. Note that perforation was not applicable for Fluidanimate.
3.4 Different levels of perforation for the Sobel benchmark. Accurate execution, and perforation of 20%, 70% and 100% of loop iterations on the upper left, upper right, lower left and lower right quadrants respectively.
3.5 The normalized execution time of benchmarks under different task categorization policies, with respect to that over the significance-agnostic runtime system.

4.1 Overview of our approach for margin characterization, modeling and dynamic prediction.
4.2 Offline characterization of maxVmargin methodology.
4.3 Evaluation of maxMargin settings for 34 benchmarks (10 runs each) in each workstation; the higher the bar, the wider the exploitable voltage margin. The horizontal dotted lines show the maximum (red) and minimum (black) values of maxMargin.
4.4 On the left, the percentage of the total single-core experiments in which the respective core was ranked as weakest or strongest. On the right, the percentage of the total experiments in which the core configuration was considered the weakest.


4.5 Average (across all configurations) failure probability CDF for each CPU, with respect to the applied Vmargin.
4.6 Profiling performance metrics.
4.7 Methodology used during the model training phase.
4.8 The number of dispatched uops in port 1 during the execution time of an application.
4.9 Prediction of our model with and without the safety margin, for samples in the validation data set.
4.10 The bars show the average dynamic maxVmargin applied by xDVS for the Skylake (left) and Haswell (right) workstations. The min-max bars represent the minimum and the maximum maxVmargin applied by xDVS. The gray diamond represents the maxVmargin as identified by offline characterization at the granularity of the whole application.
4.11 Timeline showing the applied maxVmargin for consecutive single-core executions of four applications on Skylake 2 (left), and a snapshot of a full-system-utilization execution of the gromacs application on Skylake 4 (right).
4.12 Timeline showing the applied maxVmargin while executing the large applications at full system utilization for the Skylake (left) and Haswell (right) workstations.
4.13 Energy gains of xDVS when compared with the Intel P-state governor for Skylake (left) and Haswell (right) CPUs. The grey horizontal lines represent the maxVmargin obtained by the offline characterization.

5.1 An architectural overview of GemFI. The red components of the architecture demonstrate the possible locations where faults can be injected, whereas the red ovals represent applications which use the extended ISA.
5.2 GemFI functionality on each simulated instruction.
5.3 Simple checkpoint-restore mechanism to speed up simulation campaigns.
5.4 Different categories of results for the DCT benchmark: a) a strictly correct result, b) a relaxed correct result, c) an SDC, d) the difference between (a) and (b) (loss of quality).
5.5 Application behavior when fault injecting different architectural components.
5.6 Correlation of the timing of fault injection with the effect on the application.
5.7 GemFI average overhead compared with unmodified Gem5. The chart also depicts the 95% confidence interval for each application.
5.8 Effect of GemFI optimizations on the execution time of fault injection campaigns (y-axis in logarithmic scale).


5.9 System architecture of XM2. It comprises a single monitoring system and multiple target systems. The components corresponding to dark gray boxes are supplied by the user. XM2 includes a built-in classifier of results; however, the latter can be substituted by a user-provided one.
5.10 Flow chart of the main steps performed by XM2 for the basic case of an experimental campaign that does not result in crashes. The dark box is the only state where the target system is configured in an unreliable state.
5.11 Overhead of XM2 in terms of execution time and additional lines of code (LOC) when compared to a native execution and the original version of the code respectively.
5.12 Experimental results for different applications and different overclocked configurations.
5.13 Experimental results of the instruction error resiliency characterization when Vu = 1.2V, fu = 1450MHz. The X-axis shows the different microkernels and the Y-axis presents the classification of the experiments according to the effects on execution.
5.14 Experimental results stressing the branch predictor for the two microkernels for different overclocked frequencies (fu).
5.15 Experimental results of the cache microkernels of Table 5.4 for unRel = (1.2V, 1430MHz), when the hardware prefetcher is enabled (left) and disabled (right). The Y-axis presents the classification of the experiments according to the effects of overclocked execution.
5.16 Execution time and energy consumption of the application benchmarks for the different compiler optimization levels, relative to O0.
5.17 Evaluation of maxfmargin settings for 8 benchmarks (1000 runs each) in each Raspberry Pi; the higher the bar, the wider the exploitable frequency margin. The horizontal dotted lines show the maximum (red) and minimum (black) values of maxfmargin.
5.18 (a) Performance metrics and energy consumption of the transposed and tiled MM versions, with respect to the original implementation. (b) maxfmargin for all Raspberry Pis and all MM implementations.
5.19 (a) Normalized performance metrics and energy consumption of the three benchmarks, with respect to the implementations without SIMD instructions. (b) maxfmargin for all Raspberry Pis and benchmarks.

7.1 Vision of our approach. The applications and the computations should provide information to the software stack about their quality/energy requirements. Using this information the computations can be scheduled in hardware with different energy-reliability settings.


List of Tables

2.1 Lines of code (LOC) for the tasks and corresponding result-check and correction functions for each benchmark. The result-check functions are implemented based on the original task code, which was modified to reduce its computational complexity.
2.2 SlowRel and FastUnRel configuration settings used in our evaluation, and average fault rates of the FastUnRel configurations.
2.3 Average task execution time in cycles (thousands), number of tasks executed reliably/unreliably, and number of voltage and frequency transitions, for each benchmark.

3.1 Benchmarks used for the evaluation. For all cases except Jacobi, the approximation degree is given by the percentage of accurately executed tasks. In Jacobi, it is given by the error tolerance in convergence of the accurately executed iterations/tasks (10^-5 in the native version).

4.1 Benchmarks used to characterize the voltage margins of the CPUs.
4.2 Characteristics of the workstations.
4.3 Most influential performance metrics for Vmin, as ranked by the MI algorithm.

5.1 Alpha instruction formats.
5.2 API to the run-time library of XM2.
5.3 Raspberry Pi 3B specifications.
5.4 Different memory access patterns used by the source code transformation case study.
5.5 Brief comparison between GemFI and XM2.


Dedicated to my family and friends


Chapter 1

Introduction

The scalability of the semiconductor manufacturing process, as predicted by Moore's law, has been the driving force behind the increase in the capabilities of computer systems. The 2013 International Technology Roadmap for Semiconductors (ITRS) report warns that CMOS devices, due to their continued scaling, will reach their atomistic and quantum-mechanical physical boundaries within the next 3-12 years [47]. The scientific community is therefore motivated to investigate novel designs which exploit heterogeneous technologies and emerging information-processing paradigms. In contrast to the past, when technology was driven by the quest for higher performance, the primary goal of systems has shifted to optimizing the power consumption/performance equilibrium.

1.1 The reality of power consumption

For over four decades Moore's law, coupled with Dennard scaling [91], ensured an exponential performance increase in every process generation through device, circuit, and architectural advances. Up to 2005, Dennard scaling meant increased transistor density at constant power density. Had Dennard scaling continued, by the year 2020 [67] we would have had approximately a 40-fold increase in energy efficiency compared to 2013. Unfortunately, Dennard scaling has ended because of the slowdown of voltage scaling, caused by the slower scaling of leakage current.

The end of Dennard scaling has changed the semiconductor industry dramatically. To continue the proportional scaling of performance and exploit Moore's law, processor designers have focused on building multicore systems that service multiple tasks in parallel, instead of building faster single cores. Even so, limited voltage scaling increasingly results in a larger fraction of a chip being unusable, commonly referred to as Dark Silicon [28]. It is expected that by 2020 only nine percent of the total number of transistors will be able to be activated at any point in time due to tight power budgets [67].

To make things even worse, static and dynamic variations [13] force the addition of voltage guardbands to ensure correct processor operation. The added guardbands increase energy consumption and force operation at higher voltage or lower frequency. They may also result in lower yield or field returns if a part operates at higher power than its specification allows. The guardbands are becoming more prominent with area scaling, the use of more cores per chip, and core-to-core variations. The average power cost of these guardbands can be on the order of 35% [22].

For all these reasons, power consumption is nowadays a prime design parameter. If we want to go faster, we need to find ways to become more power efficient. For example, one of the major concerns in building the next generation of High Performance Computing (HPC) systems and overcoming the exascale barrier is power consumption. Such systems would ideally consume approximately 20 Megawatts and would maximize their performance by achieving greater degrees of parallelism. All other things being equal, if one design uses less power than another, it has headroom to improve performance by using more resources or operating at a higher frequency. Simply put, a more energy-efficient chip has headroom to provide more functionality and to service more tasks in a given time frame.

1.2 Reliability and Power

Hardware designers go to great lengths to improve hardware reliability. They use guardbands in their designs against adverse combinations of factors that affect hardware reliability. This conservative design methodology essentially results in area, performance, power and energy overheads. Such design choices, though, are not unreasonable, as computation accuracy and hardware reliability have traditionally been primary concerns in the design of computing systems. After all, developers expect hardware to always behave in a reliable and predictable way. In the event that a hardware fault arises and manifests as an error at the software level, it is treated as a rare scenario, with developers actively spending effort to mask the error from user space, regardless of the magnitude of its effects.

In the terminology used for dependable and reliable operation of computer systems, a fault is an incorrect value within the internal state of a system. A fault at the hardware level can result in errors, and an error can result in a failure. Possible sources of hardware unreliability are voltage droops, transistor variability, aging, temperature, or even alpha particles temporarily affecting hardware functionality. The traditional design technique addressing this problem, also known as guardbanding, usually involves a combination of techniques, such as: i) higher supply voltage levels (voltage margins), ii) larger transistors, iii) logic for error detection and correction, and iv) spare cells. Although guardbands have successfully ensured correct operation to date, their effectiveness in detecting and correcting all errors is questioned by researchers, as geometries and supply voltages are scaled down and circuits become more vulnerable to failures [124, 20, 111].

The extent of guardbanding that may be necessary to protect circuits against all potential errors will lead to significant power overheads that are not sustainable by future systems, thus conflicting with power dissipation, which is another major challenge of the semiconductor industry. Note that these guardbands are pessimistic, as they have to compensate for worst-case scenarios and combinations of non-determinism, switching patterns, temperature and aging effects. According to [22], the average power cost of guardbands is roughly 35%. However, most of the time these guardbands represent mere overhead, as worst-case scenarios and combinations appear very seldom during application execution. The main reason for these pessimistic guardbands and the resulting energy inefficiency is that modern computing systems execute programs under strict correctness requirements.

Guardbanding not only increases the power consumption of hardware but implicitly limits its performance as well. For example, [28] warns that, regardless of chip organization and topology, multicore scaling is power limited. A side-effect of this issue is that at the 7nm process node more than 50% of the transistors in a general-purpose processor will have to be powered off in every cycle. This is a trend that will be visible at larger scale as well. Even though it will be possible to fit thousands of cores on a die, it will be impossible to activate more than a few tens or low hundreds of them simultaneously [38].

The issue at hand is quite alarming. Even if future applications have the inherent parallelism to make efficient use of thousands of cores, the performance of our computing systems is going to be restricted by the extreme power dissipation. In other words, unless we manage to design novel architectures, technology is bound to hit the same power wall as single-core architectures did.

1.3 Design Space exploration

Assuming that the quality of the output is related to the reliability of the system, there is a need to rethink the design flow and propose abstractions that shift the reliability versus energy dissipation balance towards a design philosophy in which errors may be allowed to happen and be safely ignored. In this direction, one should consider and exploit opportunities presented at the hardware level as well as at the software level.

1.3.1 Hardware Level

In Figure 1.1 we present the possible operating points of a generic CPU. We briefly describe four different operating regions and discuss their trade-offs between energy and reliability:

Nominal Operating Points. Typically, a CPU manufacturer defines a finite number of operating points, called Nominal Operating Points (NOP), which guarantee error-less operation. These settings are the conventional operating points of modern CPUs. For this increased reliability, the system design needs to pay the overhead of the guardbands.


[Figure 1.1 sketches the operating regions of a generic CPU in the supply voltage/frequency plane: the Nominal Operating Points defined by the manufacturer, where the processor operates reliably; the Safe Region, where reliability is not significantly reduced and errors seldom appear; the Unsafe Region, where reliability is significantly reduced and errors appear frequently; and the Non-Functional Region, where the CPU cannot operate at all.]

FIGURE 1.1: Description of the reliability of a generic CPU for different operating points in terms of supply voltage and frequency.

Safe Region. One can remove the voltage margins imposed by the guardbands by decreasing the supply voltage (also referred to as undervolting) or by increasing the operating frequency (also referred to as overclocking) without significantly reducing the reliability of the system [95, 4, 5, 94, 74, 151]. In this region, errors are due to irregularities which may appear only rarely, or even not at all, during the life cycle of a given processor. The energy reduction mainly depends on the magnitude of the margins.

Unsafe Region. It is possible to perform even more aggressive undervolting or overclocking and operate in the unsafe region, where the hardware starts to malfunction and produce timing errors during the execution of a workload [11, 26, 22]. Since the voltage reduction is higher than in the safe region, the energy gains are larger. However, at these settings errors occur frequently, so protection mechanisms against these errors are needed.

Non-Functional Region. In this region the CPU is not operational at all, as the majority of the paths within the netlist of the system experience timing violations.

1.3.2 Software level

For a lot of applications, not all computations and not all data are equally critical, requiring to be performed or maintained at 100% accuracy or correctness. Several application domains offer the opportunity to trade off quality of output (QoO) for significant improvements in energy consumption. For such applications, it may be possible to only approximate the final output (or part of it), rather than computing the exact result.

Typically, such applications share a common property: they have relaxed accuracy constraints. In other words, they can accept a range of possible values as "correct". Figure 1.2 lists a few examples of application domains which have intrinsic error resiliency and are consequently approximation friendly.

[Figure 1.2 groups the domains with potential to approximate as: statistical and probabilistic computations (Monte Carlo), streaming nature, (too) many noisy measurements (big data), self-healing (iterative methods), perceptual limitations (visualization, multimedia, etc.), and computational redundancy (communications).]

FIGURE 1.2: Application domains with intrinsic error resiliency.

For example:

Visualization applications are amenable to approximations because their output is typically consumed by humans. Application developers can exploit the perceptual limitations of the human eye to approximate computations without inflicting noticeable quality degradation on the output.

Streaming applications are inherently amenable to approximations since they do not maintain a large state. They consume input data, perform computations, and produce output data. If an error occurs during the computation of a specific output data batch, the next batch will not be severely affected. In that sense, streaming applications inherently exhibit computational isolation.

Some iterative methods. For example, in the presence of errors, iterative numerical methods still tend to converge to a correct solution but will typically require more iterations.

Randomized computations. If errors are random, their effect on these algorithms becomes part of the randomization process.

Approximate computing is an emerging paradigm that allows a controlled decrease of output quality in exchange for energy efficiency [6, 121, 149, 89]. Typically, approximate computing is used as a term to describe disciplined, aggressive optimization at the algorithmic level to gracefully trade off computation accuracy for performance/energy efficiency. A similar notion to approximate computing is fault tolerant computing, in which the underlying hardware may exhibit unexpected behaviour such as computation errors, crashes, or even infinite loops. This uncertainty/unreliability can be the result of a wide variety of causes. It may be due to unreliable execution on energy-efficient substrates, or even to hostile environments such as space (alpha particles, cosmic rays, solar wind flux, etc.). In any case, maintaining acceptable levels of reliability and an acceptable quality of output while executing code on unreliable hardware requires the use of fault tolerance techniques. Although both paradigms exploit similar features to achieve energy efficiency, they exhibit large differences. Approximate computing is controlled and predictable: a developer consciously designs an approximate version of the code and/or uses hardware with reduced precision. On the other hand, execution under unreliable conditions is less predictable and typically requires the application of fault tolerant computing techniques. For example, the developer should detect and correct errors before they irreversibly contaminate the application state.

One factor that contributes to the energy footprint of current computer technology is that all parts of a program are considered to be equally important, and thus are all executed with full accuracy. However, as shown by previous work [73], in several classes of computations not all parts or execution phases of a program affect the quality of its output equally. In fact, the output may remain virtually unaffected even if some computations produce incorrect results or fail completely. Significance-aware computing [61, 90] exploits the algorithmic property of computational significance to create optimization opportunities in terms of performance and power efficiency of applications.
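As a concrete, deliberately simple illustration of the approximate computing idea, the sketch below applies loop perforation, one common approximation technique, to a generic element-wise computation. The function names and the perforation strategy are invented for this example; they are not the significance-aware mechanism proposed in this thesis, which is introduced in Chapters 2 and 3.

    #include <stddef.h>

    /* Stand-in for an expensive per-element kernel (purely illustrative). */
    static float expensive_pixel_op(float v)
    {
        return v * v;
    }

    /* Loop perforation sketch: only a fraction `ratio` (in (0, 1]) of the
     * iterations run the expensive kernel; skipped elements reuse the most
     * recently computed value. Energy drops roughly in proportion to the
     * skipped work, at the cost of output quality. */
    void process_perforated(const float *in, float *out, size_t n, float ratio)
    {
        size_t stride = (ratio > 0.0f && ratio <= 1.0f) ? (size_t)(1.0f / ratio) : 1;
        if (stride == 0)
            stride = 1;

        float last = 0.0f;
        for (size_t i = 0; i < n; i++) {
            if (i % stride == 0)
                last = expensive_pixel_op(in[i]);  /* accurate iteration */
            out[i] = last;                         /* perforated iteration reuses it */
        }
    }

With ratio = 1.0 every element is computed accurately; with ratio = 0.25 only every fourth element is, and the rest inherit the previous value, trading output quality for reduced work.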

1.3.3 Significance Definition

A formal definition of significance can be provided as follows. Assume code that implements the function $y = f(\vec{x})$, where $\vec{x}$ is the vector of the function inputs. The significance of $\vec{x}$ to the output $y$ can be defined using interval arithmetic [92] and first-order adjoint analysis. The range of possible input values is the input interval vector $[\vec{x}] = [\underline{\vec{x}}, \overline{\vec{x}}] = \{\vec{x} \in \mathbb{R}^n \mid \underline{\vec{x}} \le \vec{x} \le \overline{\vec{x}}\}$, and an evaluation of $f$ in interval arithmetic is obtained by replacing all variables and intermediate elementary functions $\varphi$ with their interval versions. The significance of an input element $x_i \in \vec{x}$ to the final result $y$ is equal to

$$ S_y(x_i) = w\big([x_i] \cdot \nabla_{[x_i]}[y]\big) $$

where $w(\cdot)$ is the width of the interval. The first-order derivative $\nabla_{[x_i]}[y] = \partial [f]([\vec{x}]) / \partial [x_i]$ is the derivative of the function result $[y]$ with respect to the input variable $[x_i]$. In other words, the bounds of the interval derivative $\nabla_{[x_i]}[y]$ are the steepest downward and upward slopes, respectively, of $y = f(\vec{x})$ over the interval $[x_i]$, which quantify the impact of all possible values from $[x_i]$ on the final result $y$. If the range (width) of $S_y$ is large, then $x_i$ strongly affects the value of $y$. As such, the code that produces the value of $x_i$ is highly significant for the accuracy of the final output $y$. More information on the algorithmic property of significance and a methodology for determining the significance of computations automatically can be found in [141].
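To make the definition concrete, consider a small worked example (the function and intervals are chosen purely for illustration and are not taken from the thesis): let $y = f(x_1, x_2) = 8x_1 + 0.1x_2$ with $[x_1] = [x_2] = [0, 1]$.

    % Worked example (illustrative): y = 8*x1 + 0.1*x2 over unit intervals.
    \[
      \nabla_{[x_1]}[y] = \frac{\partial [f]}{\partial [x_1]} = [8, 8],
      \qquad
      \nabla_{[x_2]}[y] = \frac{\partial [f]}{\partial [x_2]} = [0.1,\, 0.1]
    \]
    \[
      S_y(x_1) = w\big([0,1] \cdot [8,8]\big) = w([0,8]) = 8,
      \qquad
      S_y(x_2) = w\big([0,1] \cdot [0.1,\,0.1]\big) = w([0,\,0.1]) = 0.1
    \]

Since $S_y(x_1) \gg S_y(x_2)$, the computation producing $x_1$ is far more significant to the accuracy of $y$ than the one producing $x_2$, making the latter a better candidate for unreliable or approximate execution.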

1.4 Contributions

The granularity of significance is not bounded within the scope of a single application, but can be projected onto the entire software stack. For example, the Operating System (OS) is more important (significant) than a media application, as an error corrupting the OS could propagate to multiple applications and make the entire system unstable. Likewise, a security application is more significant than a gaming application. Using similar reasoning, the quality of the output can be projected onto the entire software stack, where the quality of the system takes into account all the applications within the system. This vision is depicted in Figure 1.3.

[Figure 1.3 sketches this vision: applications (visualization, streaming, animation; HPC, operating system, security, financial; iterative and randomized applications) are first categorized by the system software according to their application-level significance, then by computational significance into non-significant and significant computations, and are finally scheduled onto processing cores operating in the Unsafe Region, the Safe Region, or at Nominal Operating Points, according to the required error resilience; the OS kernel, for example, should always execute on reliable cores.]

FIGURE 1.3: Vision of our approach. Applications should be categorized based on their contribution to the output quality. Using this information the applications should be scheduled in hardware with appropriate levels of reliability.

This PhD dissertation aims at improving power and energy efficiency by exploiting the correctness/reliability opportunities presented both at the software level and at the hardware level. Applications which can tolerate quality degradation use the significance-driven approach, which aims to gracefully trade off application output quality for improved energy efficiency by forcing the CPU to operate in the unsafe region (Figure 1.3, black circle 1).

The significance-driven approach can also be exploited in approximate computing, in which alternative, more energy-efficient approximate versions of the non-significant computations are scheduled for execution at Nominal Operating Points (Figure 1.3, black circle 2).

In any case, though, the significance-based computing model is not suited to all applications. For instance, the significance of computations may be highly input-dependent, hard to specify at design time, and difficult or costly to extract even at run time. Also, some programs may require all tasks to be executed without inaccuracies. To this end, in this thesis we also achieve energy efficiency by executing applications in the safe operating region of CPUs, in which errors are not expected to happen and remain a rare event (Figure 1.3, black circle 3).


Finally, this PhD dissertation distributes the responsibility for energy efficiency and reliability across various design layers, starting from the circuit all the way up to the application layer, where available slack in output QoS can be exploited to increase energy efficiency. The realization of such a system requires modeling the effects of potential hardware misbehaviour induced by various sources, and also tools that can easily propagate such effects. To this end, we designed, implemented and deployed experimental frameworks which are used to study the trade-offs between energy/power efficiency and execution resilience (both in terms of quality of output and execution resilience). The next sections describe the contributions in greater detail.

1.4.1 Significance Aware Computing

We investigate significance aware computing, based on the premise that specific phases of a computation may incur a high toll on performance and energy without a corresponding contribution to the quality of the result. For example, the Discrete Cosine Transform (DCT), a module of popular video compression kernels which transforms a block of image pixels to a block of frequency coefficients, can be partitioned into layers of significance, owing to the fact that the human eye is more sensitive to lower spatial frequencies than to higher ones. By explicitly tagging operations that contribute to the computation of higher frequencies as less significant, one can leverage smart underlying system software to trade off video quality for energy and performance improvements. Significance aware computing aims to exploit regions of an application which are amenable to aggressive optimizations that do not severely impact the final application outcome but lead to improvements in energy/power/performance.

There is a need for an intuitive programming model which is user-friendly but also offers the necessary expressiveness and functionality to address all of the key challenges of the significance aware paradigm. To this end, we introduce a programming model which extends OpenMP, one of the most popular parallel programming models, and provides the following main features:

Significance Characterization: The developer can specify the significance of different parts of the computation based on how strongly each part contributes to the quality or correctness of the end-result.

Quality Control: On the one hand, the application should always terminate with results that are acceptable to the user. On the other hand, it should be elastic; different users may have varying expectations on the performance and output quality of an application. In fact, even the same user may need the same application, in different execution scenarios, to modularly adapt the quality/energy trade-offs. We introduce a single knob, called ratio, that controls the proportion of computations which will be approximated or executed on unreliable but more energy efficient hardware. The ratio feature can be used to effectively trade off quality for energy efficiency.

We introduce two different versions of the significance-aware programming model which correspond to (i) fault tolerant computing in unreliable hardware and (ii) approximate computing.

Significance aware fault tolerant programming model

We present a programming model which facilitates application execution on hardware within the unsafe region, where multiple errors may occur during the execution of an application. This necessitates mechanisms to isolate/protect unreliable computations from reliable ones, in conjunction with methods for detecting and correcting severe silent errors. Consequently, efficient significance-aware fault tolerant computing requires an intuitive and user-friendly way to provide all of this information to an intelligent runtime system which supports and orchestrates the execution of code on mixed-reliability hardware. Besides the significance characterization and the ratio knob, the programming model also facilitates the following features:

Sanity Control: The developer can introduce code for checking and repairing the output of the parts that are executed unreliably.

Relaxed Synchronization: Since errors may result in infinite loops or prolong the execution of the application, the programming model provides primitives to support relaxed synchronization, which uses timing watchdogs to break out of such rare occurrences.

Significance aware Runtime system for fault tolerant computing

Besides the expressiveness of the programming model, the runtime system should effectively support unreliable operation. Initially, the runtime system should use the significance information combined with the ratio knob to decide which computations will be executed at nominal operating points and which computations will be scheduled for unreliable operation in the unsafe region. Furthermore, the runtime system should isolate the side-effects of the unreliable execution of non-significant parts, so that they do not silently propagate to the rest of the computation in an uncontrolled way. It should also protect crucial information concerning the runtime state and the application from errors. Finally, the runtime system is responsible for hiding hardware details from the programming model and, consequently, from the developer.

Our evaluation shows the effectiveness of the different protection mechanisms provided by the runtime system. We show that traditional system software protection mechanisms are not adequate; however, their combination with programmer wisdom provides effective protection against crashes and silent data corruptions, while enabling considerable energy gains (on average 20%). Interestingly, modern processors, with the assistance of our framework, can produce acceptable results until the error rate becomes too high. Beyond that point, additional energy gains are too low, and massive failure rates defeat any realistic software-based protection mechanism.

Power/energy and fault modeling of the unsafe region

A key challenge is to model the operation of a core in the unsafe operating region. This PhD dissertation presents an analytical energy and power model of a CPU when operating in the unsafe region. Moreover, it associates the operation in this region with the probability of faults due to timing violations. The methodology presented combines simulation with software fault injection to emulate the unreliable operation, and it uses the analytical models to estimate the energy consumption of the system in such settings.

Significance-aware programming model for approximate computing

A similar programming model is proposed that facilitates approximate computing (on reliable hardware). In this case, quality degradation is induced in the application due to the inaccuracies of the approximate versions. Besides the significance characterization and the ratio, the programming model should facilitate the following feature:

Approximation characterization: The programming model should provide the ability to define an approximate version of the computation. This version should be more energy efficient, but may also create a less accurate result.
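Purely as a syntax sketch (the clause name approxfun() and the function names below are placeholders of ours; the actual syntax is defined in Chapter 3), such an annotation could look as follows:

/* Hypothetical sketch: sobel_row_fast() is a cheaper, less accurate variant
 * that the runtime may substitute for sobel_row() on less significant tasks. */
#pragma omp task significance(sgnf[i]) approxfun(sobel_row_fast(img, i))
sobel_row(img, i);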

Significance aware Runtime system for approximate computing

The runtime system should use the significance as well as the ratio value to decide for which computations the approximate version of the code should be used. The runtime system can also adjust the ratio knob to maximize the quality of the output in energy-constrained environments.
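A minimal sketch of this decision, assuming a profile-derived function energy_model(ratio) that predicts the energy of a run for a given ratio (both the function and its behaviour are assumptions of ours, not part of the actual runtime API):

/* Pick the largest ratio whose predicted energy fits the user budget;
 * a larger ratio means more tasks are executed accurately, i.e. higher quality. */
double select_ratio(double budget, double (*energy_model)(double ratio)) {
    double best = 0.0;
    for (int i = 0; i <= 20; i++) {          /* candidate ratios 0.00, 0.05, ..., 1.00 */
        double r = i / 20.0;
        if (energy_model(r) <= budget)
            best = r;
    }
    return best;
}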

1.4.2 Exploiting Voltage Margins for Energy Efficiency

The significance aware programming models can be used to reduce the energy consumption assuming that the user has access to the source code of an application which is amenable to approximation. There are applications which cannot sustain quality degradation, or for which there is no access to the source code. However, as presented in Section 1.3.1, the hardware design already presents opportunities for energy efficiency without reducing the quality of the output. These opportunities exist within the safe region of operation. To maximize energy efficiency, one should be able to predict the boundary between the safe and unsafe regions and execute applications just above this boundary.


To this end, we developed a model to predict a reduced but safe CPU supply voltage to achieve higher energy efficiency. The model is used by a dynamic voltage scaling governor which monitors the utilization of the CPU resources and uses the prediction model to apply a new, safe supply voltage to the system. To train the prediction model we perform an offline characterization phase of the voltage margins of two mainstream x86-64 microarchitectures, the Haswell i7-4790 and the Skylake Xeon E3-1220 v5 processors, using a diverse set of benchmarks which stress different components of the CPU microarchitecture. Interestingly, our analysis shows that the voltage margins depend on a multitude of factors. The main contributions of our approach are:

• We develop a model that takes as input selected CPU performance counters and core utilization, and estimates the voltage margin of the workload on the specific CPU part at the CPU base frequency. This estimation can be exploited to safely undervolt CPUs and achieve higher energy efficiency.

• We evaluate the effectiveness of our model using a governor, called xDVS. The governor, instructed by our model, dynamically adjusts the CPU supply voltage to levels below the conservative nominal values. Compared to the stock Intel (P-state) DVFS governor, our approach achieves energy savings up to 40% for Skylake and 34% for Haswell CPUs. A simplified sketch of such a control loop is given after this list.

• The model used by xDVS is verified via a long-running (72 consecutive hours) and dynamically changing application workload on 6 different workstations, without any observable degradation of system reliability. Our approach proves able to handle cases where a single application monopolizes the CPU, but most importantly real-world scenarios, where a mixture of different applications execute simultaneously. To the best of our knowledge there is no prior work that supports dynamic undervolting for such realistic execution schemes. During this execution, the supply voltage for the Skylake and Haswell CPUs was adjusted a total of 2.6 × 10⁶ times, with the average amount of undervolting being 216 mV and 116 mV, respectively.
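The following is a simplified sketch of such a control loop, assuming a linear model over counter-derived features; the helper functions, the feature count and the numeric constants are illustrative placeholders of ours and not the actual xDVS implementation (Chapter 4).

#include <unistd.h>

#define NFEAT 4                               /* assumed number of features        */

/* Stubs standing in for platform-specific code: reading performance
 * counters and programming the voltage regulator. */
static void read_features(double feat[NFEAT]) {
    for (int i = 0; i < NFEAT; i++) feat[i] = 0.0;
}
static void apply_voltage_offset(double mv) { (void)mv; }

/* Hypothetical weights of a linear voltage-margin model, fitted offline. */
static const double w[NFEAT]  = {0.0, 0.0, 0.0, 0.0};
static const double bias_mv   = 50.0;         /* assumed values, in mV             */
static const double guard_mv  = 10.0;

void xdvs_loop(void) {
    double feat[NFEAT];
    for (;;) {
        read_features(feat);                  /* normalized counter rates          */
        double margin_mv = bias_mv;           /* predicted safe amount of undervolting */
        for (int i = 0; i < NFEAT; i++)
            margin_mv += w[i] * feat[i];
        double offset_mv = guard_mv - margin_mv;  /* negative offset = undervolt   */
        if (offset_mv > 0.0) offset_mv = 0.0;     /* never go above nominal        */
        apply_voltage_offset(offset_mv);
        usleep(10 * 1000);                    /* assumed 10 ms control period      */
    }
}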

We note that our dynamic Vmin predictions are accurate and safe for conventional workloads. Our scheme still may require hardware fail-safe mechanisms to protect the CPU from worst-case, extreme voltage fluctuations, such as those induced by voltage droop viruses. Section 4.7 discusses this point in more detail. Moreover, in our evaluation we do not categorize applications as important or less important. All applications are executed in hardware operating in similar settings.

1.4.3 Experimental Frameworks for Reliability Analysis

Relaxing the hardware correctness requirements reduces the reliability of the system. A major challenge of this PhD dissertation was to understand under which circumstances errors manifest on the hardware and how different applications mask the respective errors. To be more precise, we faced the following challenges:

• Understand how and under what circumstances modern CPU microarchitectures fail when executing code at reduced margins

• Identify the voltage margins of the system and associate them with their respective energy gains.

• Evaluate the resilience of applications – or the whole software stack

To overcome these challenges we developed the following two fault injection tools, which were used in this PhD dissertation to conduct the experimentation:

GemFI: A simulation-based fault injection tool. The primary objective of the tool is to enable fault injection based on different fault models and on systems with various configurations. A variety of different system configurations and architectures can be supported without affecting the implementation of fault injection in GemFI.

XM2: A fault injection tool for real hardware. It can be used to identify the margins of a system, or to study the effect of real faults on the application and associate the undervolting/overclocking with the respective energy gains. We use XM2 in several case studies in which we identify:

• The effect of common compiler optimizations on the energy efficiency and the frequency margins of four ARM Cortex-A53 processor parts;

• The effect of memory access pattern optimizations on the energy efficiency and the frequency margins;

• The interaction of SIMD instructions with the energy efficiency and the frequency margins.

1.5 Outline

In Chapter 2 we present the significance driven programming model for fault tolerant computing. We detail the challenges of fault tolerant computing and their solution through the programming model and the runtime system. We use a power/time/energy model to estimate the energy footprint of applications when executed in the unsafe region. We also implement a fault injection methodology to simulate the errors that manifest during execution in the unsafe region. Finally, using these models, we evaluate our programming model.

In Chapter 3 we present the approximate variant of our programming model and the accompanying runtime system. We evaluate our approach and present a case study, in which a model-based approach can automatically discern the appropriate levels of approximation and concurrency, as well as the CPU frequency, for energy-restricted execution of applications. An offline analysis drives the decisions of an intelligent runtime system, which fine-tunes the approximation degree of an application so that its energy consumption remains within a budget that is specified by the end-user at execution time.

In Chapter 4 we present a methodology to detect safe margins for multicore CPU operation. We create a prediction model to estimate a safe and energy efficient operating point within the safe region. The model is evaluated using a dynamic voltage scaling governor.

In Chapter 5 we present the experimental framework used by this work to model the various energy-reliability trade-offs when operating at sub-nominal settings. We discuss related work in Chapter 6 and present our concluding remarks in Chapter 7.


Chapter 2

Significance Aware Fault Tolerant Computing

As already discussed, part of the energy inefficiency problem of modern computing systems is that all computations are treated as equally important, despite the fact that, in many cases, only a subset of these computations may be critical to achieve an acceptable quality of service (QoS). In several application domains it is not the precise result that matters to the user but rather an approximation of the output [24, 33]. More generally, within each program all computations are treated as equally important for the quality of the end result, although in most cases this assumption is not true. In this chapter we lay the foundations for not only approaching the theoretical limits of energy reduction of conventional technology, but also moving beyond those limits by accepting quality degradation in a controlled manner.

Applications, or different phases of the same application, have different significance for the quality of the end-result. System designers have traditionally made very strict assumptions about the quality of the output, insisting that the application state and the processor state are completely accurate and correct. However, this requirement can be relaxed. For example, the user may be satisfied with a lower but acceptable quality of the output, even if the intermediate state of the system is not accurate. In the end, one may ask whether strict, bit-exact application correctness is even required for some applications.

In any case, the proposed significance-based computing model does not fit all applications. For instance, computation significance may be highly input-dependent, hard to specify at design time and difficult or costly to extract even at run time. Also, some programs may require all computations to be executed without any inaccuracy or any chance of data corruption. Our approach focuses on application domains that offer the opportunity to trade off quality of output for significant improvements in energy consumption. Such application domains include:

Visualization applications are amenable to approximations because their output is typically consumed by humans. The correction part of the result-check function can exploit the perceptual limitations of the human eye to approximate computations without inflicting noticeable quality degradation on their output. In our evaluation we use two benchmarks from this category: DCT and Sobel.

Streaming applications are inherently amenable to approximations since they do not maintain a large state. They consume input data, perform computations, and produce output data. If an error occurs during the computation of a specific output data batch, the next batch will not be severely affected. In that sense, streaming applications inherently exhibit computational isolation. Blackscholes, one of the benchmarks used in the experimental evaluation, falls in this category.

Iterative methods tend to be self-healing. For example, in the presence of errors, Monte Carlo simulations or iterative numerical methods still tend to converge to a correct solution, but will typically require more iterations. Such applications in our evaluation are Jacobi and K-means.

2.1 Contributions

We present a programming model and the accompanying runtime system that elevates the significance of computations to a first-order concern for developers. We provide a single knob to gracefully trade off the quality of the output for energy gains. More precisely, in this chapter we make the following contributions:

• We introduce a significance-centric programming model and runtime system, which allows computations to be executed on potentially unreliable but low-power hardware in a controlled way, to trade off output quality for greater energy efficiency.

• We introduce a single knob, called ratio, to trade off quality for energy gains in a flexible way. The ratio can be set at execution time, thereby allowing the user or a higher-level framework to control the energy/quality trade-off to meet dynamically varying application requirements.

• We present a power, energy and fault modeling methodology which simulates operation in the unsafe region.

• We evaluate our programming model and runtime system. Our approach effectively trades off output quality to optimize the energy consumption of applications using the significance aware fault tolerant computing programming model. The evaluation also demonstrates the limitations of executing applications in the unsafe region.


2.2 Programming Model Objectives & Properties

Our programming model enables the development of applications which target unreliable hardware without uncontrolled degradation of their output quality. The design objectives and properties of the programming model are the following:

2.2.1 Significance Characterization

The programming model should allow the developers to characterize computations according to their degree of significance in a straightforward and intuitive way. Significant computations will be executed correctly, at the expense of power consumption and/or performance, while computations characterized as non-significant will be executed in a way that may produce incorrect results. The significance of a code region should be expressible either statically, at compile time, or at run time, since significance may very well be input-related and/or context-dependent.

2.2.2 Safety Isolation

The model should be safe: computations are by default considered significant and are thus executed correctly, unless the programmer explicitly allows imprecise computations. When imprecise computations are allowed, a silent data corruption in the output of an early part of the application can affect the execution of the following parts which depend on the faulty one. Such a scenario would obviously violate the isolation attribute. Our model should be equipped with mechanisms to tackle this situation by providing functionality to the developer to detect errors early and even correct the computed values of non-reliable computations. According to this scheme, the propagation of errors across the application can be controlled or avoided.

Moreover, both the design of the programming model and its implementation should promote, and if possible guarantee, the isolation between significant and non-significant code regions. In other words, errors manifesting in non-significant tasks should not be fatal for significant tasks or the whole application. Although isolation is a desired property, it is not straightforward to guarantee it, since almost all realistic applications are characterized by data flow among their parts (such as objects, tasks, functions, etc.).

2.2.3 Architecture Neutrality

The programming model should assume as little as possible about the underlying architecture, making applications portable to different architectures with no, or at least reasonable, effort. It should be noted that our main focus is on functional portability. The power/performance/reliability balance is highly architecture dependent; therefore, the power/performance/reliability efficiency is expected to differ when moving to different hardware.


2.2.4 Parallelism Expression

While the key objective of the programming model is to support significance-aware programming and execution, we also wish to be able to express and handle parallelism. This is important to exploit next-generation multi-core architectures, which are quite likely to include unreliable cores or allow cores to operate at power levels that may introduce faults.

2.2.5 Relaxed Synchronization

An important prerequisite for parallel execution of jobs is the existence of synchronization mechanisms to ensure correctness of the final result. Due to the uncertain behavior of jobs executing on unreliable hardware, traditional synchronization mechanisms are overly stringent and are therefore not a feasible option. To this end, we envision mechanisms offering more elastic synchronization, by extending synchronization constructs with timing watchdogs and more flexible synchronization achievement criteria.

2.2.6 User Friendliness

A programming model must hide most of the intricate implementation and computation mapping details from the developers. In our case, we need to simultaneously support the expression of both significance and parallelism, which already puts a lot of burden on the programmer. Therefore, we should make sure that the migration of parallel, or even sequential, codes to our model does not impose unreasonable overhead on the programmer. Moreover, it would be desirable to allow incremental porting, without widespread and intrusive changes to the original source code.

2.3 High Level Description

In the next paragraphs we will utilize an abstract figure to demonstrate the basic concepts of the model. Figure 2.1 depicts an initial, single threaded application, which however has areas (dark rectangles) that can be approximated or even dropped.

2.3.1 Task-Based Programming Model

FIGURE 2.1: A single threaded execution of an abstract application; dark rectangles correspond to parts of the application that can be parallelized

The programming model adopts a task-based paradigm, similar to OpenMP [12]. Task-based models are quite popular in both academia and industry. Such models offer a straightforward way to express communication (data transfers) across tasks. Parallelism is expressed by the programmer in the form of independent tasks; however, the scheduling of the tasks is not explicitly controlled by the programmer. The developer simply defines dependencies among tasks, and the underlying runtime system schedules tasks on available hardware as soon as all their dependencies have been met. Adopting a task-based model is an appealing decision:

• It can express parallelism in an efficient manner (objective 2.2.4).

• It minimizes the burden related to controlling the parallelism (objective 2.2.4).

• It provides isolation between different jobs (tasks). A task is independent from others during its execution, with well defined input and output data flows (objective 2.2.2).

• Each task is a computation with well defined entry and exit points. This facilitates error detection / recovery tests at task boundaries (objective 2.2.2).

Pragmas for the Expression of Parallelism and Significance

The proposed programming model supports task creation and significance characterization by annotating the input source code with #pragma compiler directives. Pragma-based programming models have the advantage of facilitating non-invasive and progressive code transformations, without requiring a complete code rewrite. Adopting compiler directives to support parallelism, in combination with the task-based model, improves the user friendliness of our proposed model (objective 2.2.6). Pragmas identify tasks and characterize them as significant or not, thus steering their execution on reliable and non-reliable cores, respectively. Each task pragma construct specifies a function which is equivalent to the task body, along with its data-flow and accompanying significance information. Thus, the main granularity of significance characterization is that of a task.

Significance takes values in the range [0.0, 1.0] and characterizes the relative importance of tasks for the quality of the end-result of the application. Whether a task will be executed reliably or unreliably is decided by the runtime system during execution time. This allows developers to flexibly assign significance values to tasks depending on information available only during execution time, for example the value of a variable. Figure 2.2 shows the decomposition of the abstract example presented in Figure 2.1 into computation chunks. These chunks are categorized with various significance values (shades of gray), using the programmer's intuition, knowledge, and/or domain expertise. Tasks can be executed in parallel. Note that after the parallel execution of tasks, implicit synchronization may be necessary.


FIGURE 2.2: Application tasks are created and tagged with significance information

FIGURE 2.3: Non-reliable tasks may execute error identification and correction functions after their termination

Early Error Detection to Minimize Fault Propagation

Non-reliable tasks are expected to have unpredictable behavior, due to the uncertainty of their execution, and can result in faults at the hardware level. These faults can lead to arbitrary error propagation up to the final output of the program, or to total program crashes, if they are not identified and isolated early on. The programming model provides mechanisms that aim at identifying errors and, if possible, even correcting them (objective 2.2.2). In the event of detected errors, the developer is given the opportunity to specify the recovery strategy (such as ignoring the error, re-executing, or even assigning a default value to the task output).

The programming model allows the programmer to specify result-check functions, which will be executed right after a non-significant task completes. In this function, which is guaranteed (by contract) to be executed reliably, the user can check for possible errors or failures and even provide a default value to the calling program as an acceptable replacement of the result of a failed execution. We expect the computational overhead of a result-check function to be minimal, preferably orders of magnitude lower than that of the corresponding task. Otherwise, the overhead introduced by its execution would cancel the performance/power consumption benefits of unreliable execution. The running example is now modified to include the aforementioned task-level result-check functions, which are illustrated as red rectangles in Figure 2.3.

Although result-check functions offer a degree of error detection and isolation, in some occasions checking a single task's output is not sufficient to decide if the computation is correct. A typical example is image and video processing applications, in which the acceptance of the result for a pixel can be judged only in relation to the values of neighboring pixels. To support error detection/correction at a coarser granularity, tasks can be grouped, and result-check functions can also be executed at the end of each group. We extend the running example to allow for group-level result-check functions, as shown in Figure 2.4.

FIGURE 2.4: Result-checks at the group-of-tasks granularity

Elastic Synchronization

In the context of the programming model we support a more elastic explicit synchronization model for barriers (objective 2.2.5). Barriers can be equipped with watchdog timers which terminate the wait after the specified time frame has elapsed. In the case where the synchronization was not successful, in the traditional sense, the run-time system terminates all tasks that did not reach the barrier (if they have not already crashed) and resumes execution of the following code.

Whenever a non-reliable task fails or gets stuck in an infinite loop, the entire application will wait. The use of an elastic barrier in this case safeguards the application from the otherwise unavoidable deadlock: it ensures that the task group will terminate, either successfully or with an error code, thus allowing the application to continue or try to recover. The running example is now finalized as shown in Figure 2.5 and serves as an example of relaxed synchronization. Note that, due to a timing constraint which was not met, the last non-significant task was terminated by the run-time.

FIGURE 2.5: A case of relaxed synchronization which results in the termination of late tasks

Significance of Data

The main point of interest in a significance-aware computing platform concerns the quality of the output results. After all, the role of computations, significant or not, is to produce output data that are acceptable to the end user. In this PhD dissertation we have chosen to explicitly characterize only the significance of computational tasks, and not that of data. In other words, there is no explicit tagging of the significance of data structures in the code. The rationale is that early error detection and correction using result-checking functions (as explained earlier in this section) precludes the flow of unacceptably erroneous data between tasks. If data are deemed unacceptable by a result-checking function, one option may be to substitute the erroneous value with a default value so that the program can continue. This approach greatly reduces the amount of effort that the programmer has to expend to express significance in the code.

2.4 Syntax

In this section we define the API of the proposed programming model which realizes concepts and ideas expressed in Section 2.3.

2.4.1 Task Definition and Significance Characterization

Tasks are defined using the task directive (Listing 2.1). The significance() clause specifies the significance of the task and takes values in the range [0.0, 1.0], indicating the importance of the task with respect to the output quality/correctness of the result. Depending on their significance, tasks may be executed on top of reliable or unreliable hardware. The significance expression is evaluated at execution time, thus allowing the programmer to parameterize task significance depending on information accessible only at execution time (e.g., values of variables).

The taskcheck() clause specifies a result-check function, which is invoked only if the task is executed unreliably. The result-check function is always executed reliably, and can be used by the developer to:

• inspect the task status to see if it completed its execution normally or has crashed;

• assess whether the task output is wrong;

• assign meaningful default values to the task output;

• request a re-execution of the task.

The result-check function has implicit access to all arguments of the corresponding task, and may return either TRC_SUCCESS or TRC_REDO to the runtime. In the latter case, the task is re-executed reliably.

#pragma omp task [significance(...)]
                 [label(...)] [in(...)] [out(...)]
                 [taskcheck(resultcheck())]

LISTING 2.1: #pragma omp task


#pragma omp taskwait [label(...)]
                     [ratio(...)]
                     [time(...)]
                     [groupcheck(resultcheck())]

LISTING 2.2: #pragma omp taskwait

The programmer can define the inputs and outputs of the task, via the in() and out() clauses, respectively. This information can be used by the runtime to infer task dependencies and schedule tasks accordingly. Finally, the label() clause can be used to group tasks, and to assign the group a common identifier (name), which is in turn used as a reference to implement synchronization at the granularity of task groups (Section 2.4.2).
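As a brief illustration of how these clauses combine (the identifiers img, out, row_filter() and row_check() are ours and do not come from the benchmarks of Section 2.8), written in the pseudo-code style of Listing 2.3:

for (int r = 0; r < nrows; r++) {
    /* Odd rows are tagged as less significant; the significance
     * expression is evaluated at run time. */
    #pragma omp task label(filter) significance(r % 2 ? 0.3 : 1.0) \
                     in(img) out(out) taskcheck(row_check())
    row_filter(img, out, r);
}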

2.4.2 Synchronization

The programming model supports explicit barrier-type synchronization through the #pragma omp taskwait directive (Listing 2.2). A taskwait can serve as a global barrier, instructing the runtime to wait for all tasks spawned up to that point in the code. Alternatively, it can implement a barrier at the granularity of a specific task group, if the label() clause is present; in this case the runtime system waits for the termination of all tasks of that group.

Furthermore, the omp taskwait barrier can be used to control the quality degradation of application results. Through the ratio() clause, the programmer can instruct the runtime to treat (at least) the specified percentage of all tasks, either globally or in a specific group, depending on the existence of the label() clause, as significant and the remaining tasks as non-significant. The runtime must also respect the relative ordering of tasks with respect to their significance. In other words, a task with a higher value of the respective significance() clause should not be treated as non-significant while a task with lower significance is treated by the runtime as significant. The ratio takes values in the range [0.0, 1.0] and serves as a single, straightforward knob to enforce a minimum quality in the performance / quality / energy optimization space. Smaller ratios give the runtime more opportunities for optimization, however at a potential quality penalty.


 1  int dct_taskrescheck(...) {        /* DCT task result-check function.        */
 2    if ( task_crashed() )            /* Takes the same arguments as the task,  */
 3      coeff = 0;                     /* returns int.                           */
 4    else if ( abnormal(coeff) )
 5      coeff = 0;
 6    return TRC_SUCCESS;
 7  }
 8  void dct_task(...) {               /* Calculate the coefficients for a specific 2x4 block */
 9    ...                              /* over a number of different 8x8 blocks. */
10  }
11  void DCT(..., double taskratio) {  /* DCT calculation. The taskratio is an extra parameter. */
12    float sgnf[] = {1.0, 0.9, 0.7, 0.3,   /* Significance look-up table        */
13                    0.8, 0.4, 0.3, 0.1};  /* for each of the 2x4 sub-blocks    */
14    for each 2x4 sub-block K {       /* Iterate over the blocks of the DCT coefficient space. */
15      #pragma omp task significance(sgnf[K]) taskcheck(dct_taskrescheck())
16      dct_task(...);                 /* Task to calculate the Kth 2x4 sub-block, over all 8x8 blocks */
17    }
18    #pragma omp taskwait ratio(taskratio) time(16)
19  }

LISTING 2.3: Programming model use case: DCT pseudo-code

2.5 Example

Listing 2.3 presents a task-based implementation of DCT using our programming model. Line 15 defines a task to compute the frequency coefficients of a specific 2x4 sub-block. All tasks created in this loop have varying significance depending on their position in the 8x8 block: upper-left sub-blocks have higher significance than lower-right ones, as encoded in the sgnf array. In line 15, dct_taskrescheck() is specified as the result-check function. This function checks whether the task crashed (Line 2) or whether its output is wrong (Line 4). In both cases the correction sets the respective coefficients to 0. Since this correction does not require task re-execution, the function returns TRC_SUCCESS (Line 6).

In Line 18 of Listing 2.3, the barrier for all DCT tasks is specified with a timeout of 16 msec; this corresponds to a target frame rate of 30 fps, assuming DCT accounts for almost 50% of the computation time for each frame. Note that the taskratio is an open parameter that is supplied when the program is invoked. In effect, it serves as a knob to set the “borderline” between the most-significant sub-blocks that have to be computed reliably, and the less-significant sub-blocks that may be computed unreliably. No group-level result-check function is used in the example, because task-level result checks and repairs are sufficient.

2.6 Programmer Insight

The programming model assumes that the developer is sufficiently familiar with the application to make good decisions as to how to structure the computation in tasks, which tasks to characterize as more significant, and which result-check functions to provide. Similar to parallelism, significance is a key algorithmic aspect that requires the programmer's full attention, but unlike parallelism, task significance is orthogonal to the underlying platform architecture.

Choosing result-check functions is also important. If the result-check function is too complex, it is practically useless, as the same result could be achieved simply by declaring the task as significant and executing it reliably in the first place. If too simple, the result-check function may erroneously mis-characterize and destroy good task output, possibly deteriorating the end result of the computation.

Finally, task granularity is an important parameter that should be considered when using this programming model. Fine-grained tasks may allow for a richer (more diverse) significance characterization, which in turn can be exploited to achieve a smoother degradation of output quality at increased energy gains. The downside is that having many small tasks will also increase the task management overhead of the runtime system, both in terms of time and energy consumption.

2.7 Significance-aware Runtime System

The runtime system is designed for a multicore shared memory platform, in which cores can be set to operate in various voltage-frequency configurations (V, f), even ones below nominal values. Unsafe settings only apply to the cores of the CPU, including the integer and FPU pipeline logic as well as the L1 and L2 caches. Modules critical to the correct operation of all cores, such as buses, memory controllers and cache coherence mechanisms, are set to a safe setting and thus always operate reliably. Our power model takes this into account, and all reported energy gains come from undervolting the core part. A user-level library implements the runtime system and runs on top of the Linux operating system. A source-to-source compiler, which we developed based on [148], lowers programs that use the primitives of our programming model to code with calls to the runtime system API. Finally, the produced source code is compiled into machine code using the standard gcc tool chain.
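For illustration only, a task annotated as in Listing 2.1 is conceptually lowered into runtime calls of the following form; the rt_* names are hypothetical placeholders of ours and not the actual internal API:

/* Conceptual lowering of:
 *     #pragma omp task significance(s) taskcheck(check())
 *     body(args);
 */
rt_task_t *t = rt_task_create(body, args, s);  /* register task body and significance */
rt_task_set_check(t, check);                   /* attach the result-check function    */
rt_task_submit(t);                             /* runs once its dependencies are met  */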

2.7.1 Runtime Execution Management

We consider three different hardware configurations:

FastRel: This configuration is a high-performance nominal point of operation, with high voltage/frequency (Vh, fh), where a core executes code fast.

SlowRel: This configuration is a slower nominal operating point in terms of supply voltage and frequency (Vl, fl). Consequently, parts of the application executed in this configuration have low power consumption but, at the same time, increased execution time.

FastUnRel: In this configuration cores are set to operate in the unsafe region. The operating point, called FastUnRel (Vl, fh), uses the same (low) voltage as SlowRel and the same (high) frequency as FastRel. Code execution in FastUnRel is equally fast as in FastRel, yet more energy-efficient. At the same time, execution is potentially unreliable due to timing faults, since FastUnRel is outside the nominal range of operation.

FIGURE 2.6: The configurations FastRel, SlowRel and FastUnRel used by the runtime system to reduce the energy footprint by exploiting the significance of computations. Our approach exploits non-nominal configurations within the unsafe region, which are energy-efficient but unreliable.

We assume that the runtime system can switch the operation of cores dynamically. Due to the difference in their voltage, the transition between FastRel and SlowRel requires a voltage and frequency scaling step, which introduces significant delay. In contrast, given that SlowRel and FastUnRel have the same voltage, the transition between them can be done quickly via clock stretching [19].

Figure 2.6 illustrates the principle of operation. The main application thread and the master runtime thread are executed reliably in the FastRel configuration. The tasks of the application can be executed reliably in the FastRel configuration, or unreliably in the FastUnRel configuration, depending on their relative significance and the user-supplied task ratio (see Section 2.4.2). Task execution is done using separate worker threads, with each worker being placed on a different core.

To reduce the number of voltage transitions, task scheduling is done in two alternating phases. In the first phase, workers are configured to operate in FastRel, and the master thread schedules all the tasks in the ready list that have been flagged for reliable execution. Before the second phase starts, all workers soft-checkpoint crucial context information, to be used for recovery in case of corruption from faults¹. Afterwards, the main thread requests the memory allocator (discussed in Section 2.7.2) to protect all the memory pages as well as the stack of the main application thread. This effectively forces all data, including non-significant output data, into a read-only state. Reliable task input/output data can be mixed with unreliable input/output data in the same memory page. However, the Operating System (OS) assigns privileges at the granularity of a page; therefore, when locking a page in a read-only state, even unreliable tasks cannot write to their output data locations. To overcome this, each worker allocates extra memory in which the non-significant tasks will store their results. These memory locations have read-write permissions.

¹ We use the Linux getcontext() function. The state is copied to a read-only memory page to prevent it from being written accidentally.

At this point the second phase starts. Workers switch from FastRel to SlowRel, and the master thread proceeds with the scheduling of all the tasks that have been flagged for unreliable execution. When a worker is assigned a task, it switches to the FastUnRel configuration and executes the task. If during task execution an event causes the OS to take over (e.g., an I/O event), the worker switches to SlowRel prior to executing the kernel code, and switches back to FastUnRel mode when it resumes the execution of the application task. When the task completes or crashes, the core is switched back to SlowRel, the previously saved state is restored, and the result-check function of the task is invoked. If the result-check function requests task re-execution, the worker repeats the execution but maintains the core in the reliable SlowRel configuration. When all tasks have finished their execution, or the synchronization timing constraint is reached, the main thread requests the allocator to revert the protected memory privileges to their previous state. Afterwards, the main thread copies the computed output data from unreliable tasks back to their original memory locations. In case the group result-check function requests re-execution of the task group, the master thread configures all workers to operate in FastRel. Then, all tasks in the group that have been flagged for unreliable execution are re-scheduled from scratch and are executed reliably. The overhead of switching to a different voltage level is amortized by the execution of a large number of tasks.

The runtime supports the following levels of protection:

No Protection (NP): The runtime system does not employ any error detection/correction mechanism or programmer-supplied significance information. All tasks of the application are executed unreliably (FastUnRel configuration) and are susceptible to faults. A task crash leads to the abrupt termination of the entire application.

Basic Protection (BP): All application tasks are executed unreliably as in NP, but the runtime system identifies and handles errors using the standard processor/OS protection mechanisms, including the internal soft-checkpointing of critical state and the memory protection mechanism. As a result, task crashes are properly caught. However, the programmer-supplied result-check functions are ignored.

Basic & Result Checking (B-RC): In addition to BP, when an application task completes its execution normally or crashes, the runtime system invokes the programmer-supplied result-check function to detect and correct possible errors.


Basic & Significance (B-SF): On top of BP, the runtime system takes into account the programmer-supplied significance of tasks and the ratio, and schedules them for execution accordingly. As a consequence, the most significant tasks are executed reliably (in the FastRel configuration), while the less-significant tasks are executed unreliably (in the FastUnRel configuration). Task crashes are caught and handled as in BP, and the programmer-supplied result-check functions are ignored.

Full System (FS): The entire protection arsenal is employed, including basic runtime system protection, task scheduling based on the programmer-supplied significance information, and invocation of the result-check functions for unreliable tasks.

Full System & Re-Execution (FS-RE): Like FS, but if the task result-check functions detect a task crash or invalid output, they request a full task re-execution, rather than trying to repair the task output.

2.7.2 Memory Management

Our framework has been developed on a shared memory system, which is the worst case scenario in terms of reliability. Erroneous stores by code that executes on an unreliable core may affect global data structures or the memory of significant computations.

The runtime system utilizes a custom dynamic memory manager which requests memory slabs from the OS at the granularity of pages and serves dynamic allocations from either the application or the runtime system. When switching to an unreliable execution phase, the memory manager assigns read-only privileges to all used heap pages, to protect them from rogue stores from tasks executed on non-reliable cores. Should such a store be attempted, it leads to an exception which is handled accordingly by the runtime system. Besides the heap, faults can also corrupt the stack. The runtime system allocates its internal data structures dynamically, thus there is no information within the stack or the global data space to be protected from application errors. All memory pages used as stack by the main application thread are also set to be read-only prior to executing unreliable code. The current framework does not implement any global data protection, as this requires compiler support.
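A minimal, self-contained sketch of this mechanism on Linux, assuming POSIX mprotect() and a SIGSEGV handler (the real allocator tracks every heap slab it hands out and cooperates with the runtime to repair or terminate the offending task):

#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static void  *slab;                     /* one page-aligned heap slab          */
static size_t slab_len;

static void segv_handler(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)si; (void)ctx;
    /* A rogue store from an unreliable task hit a protected page. The real
     * runtime records the fault and handles the task; here we just exit.   */
    _exit(EXIT_FAILURE);
}

static void protect_heap(void)   { mprotect(slab, slab_len, PROT_READ); }
static void unprotect_heap(void) { mprotect(slab, slab_len, PROT_READ | PROT_WRITE); }

int main(void) {
    long page = sysconf(_SC_PAGESIZE);
    slab_len = (size_t)page;
    posix_memalign(&slab, (size_t)page, slab_len);   /* page-aligned allocation */

    struct sigaction sa = {0};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = segv_handler;
    sigaction(SIGSEGV, &sa, NULL);

    protect_heap();      /* entering an unreliable execution phase            */
    /* ... unreliable tasks run here; any store into slab raises SIGSEGV ...  */
    unprotect_heap();    /* back to a reliable phase                          */
    free(slab);
    return 0;
}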

2.7.3 Life of a group-of-tasks

Figure 2.7 illustrates the typical life of a group-of-tasks in an application implemented using our significance aware fault tolerant computing programming model. For each group instantiated during the life of an application, the runtime system receives a collection of tasks with varying significance values (represented with different colors in Figure 2.7) and a desired approximation level in the form of a taskwait ratio.


FIGURE 2.7: The typical life of a group-of-tasks in the context of significance aware unreliable computing

Based on the ratio value, the runtime system partitions the tasks into two sets, the most significant tasks and the least significant ones. The most significant ones are executed under reliable conditions (FastRel configuration), whereas the runtime system schedules the least significant ones on hardware that is operating unreliably (FastUnRel configuration). Tasks which are executed unreliably are monitored for abnormal behaviour (crashes and infinite loops). In the event that they complete their execution without an obvious error, their outputs undergo an error detection phase. Tasks whose outputs are considered to be invalid may then be re-executed or undergo an error correction phase. Result checking may also be performed at the granularity of a group-of-tasks; this way, we enable developers to assess the validity of the aggregated result of a group-of-tasks.
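The control flow of Figure 2.7 can be summarized by the following pseudo-code sketch; all type and function names are illustrative and do not correspond to the actual runtime API:

/* Pseudo-code: handling of one task group by the runtime system. */
void execute_group(group_t *g, double ratio) {
    partition_by_significance(g, ratio);        /* most vs. least significant tasks */
    run_reliably(g->most_significant);          /* FastRel configuration            */

    checkpoint_and_protect_memory();
    run_unreliably(g->least_significant);       /* FastUnRel; monitored for crashes */
                                                /* and infinite loops               */
    for (int i = 0; i < g->n_least; i++)        /* per-task result checks           */
        if (result_check(g->least_significant[i]) == TRC_REDO)
            rerun_reliably(g->least_significant[i]);  /* SlowRel configuration      */
    restore_memory_protection();

    if (group_result_check(g) == GROUP_REDO)    /* optional group-level check       */
        rerun_group_reliably(g);                /* FastRel configuration            */
}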

2.8 Evaluation

In this section we evaluate the programmability of the proposed programming model and the extra overhead induced by the various protection mechanisms provided by the runtime system. Section 2.12 evaluates the energy gains of the proposed programming model.

2.8.1 Benchmarks

We use five benchmarks, listed in Table 2.1, and apply three different methodologies to perform significance characterization on them. In DCT we use domain expertise to identify the significance of different parts of the computation. The tasks that compute low frequency coefficients are close to the upper left corner of each 8x8 frequency block, and are more significant than the ones computing coefficients towards the lower right corner of the block.

In Blackscholes and the iterative benchmarks K-means and Jacobi we employed a profile-driven approach. More specifically, in Blackscholes we injected bit-flips in the input data and observed the output quality. All parts of the code appear to be equally significant, since faults had similar manifestations regardless of the task computations. Therefore, all tasks are assigned equal values of significance, since all stock options are considered equally important. In Jacobi and K-means we injected bit-flips in the input data of a randomly chosen iteration, and compared the relative error of the faulty execution with an error-free one. In both Jacobi and K-means we observe that errors in the last few iterations tend to severely reduce the output quality, and thus infer that these are the most significant ones.

Finally, in Sobel we exploit the perceptual properties of the human eye, and randomly distribute the significance among tasks. This way, errors are spread across the entire output image and the loss of quality is not clustered in a specific area of the image.
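Returning to the profile-driven characterization above, the injected corruption can be as simple as flipping one random bit of a double-precision input value, as in the following sketch (flip_random_bit() is an illustrative helper of ours, not part of the fault injection frameworks of Chapter 5):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Flip one randomly chosen bit of a double; memcpy() avoids undefined
 * behaviour from type punning. The corrupted value replaces the original
 * input of a randomly chosen iteration, and the relative error of the
 * final output is then compared against an error-free run. */
double flip_random_bit(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits ^= (uint64_t)1 << (rand() % 64);
    memcpy(&x, &bits, sizeof bits);
    return x;
}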

2.8.2 Evaluation of Programming model and Runtime System

In all benchmarks we used a very simple result-check function. The result-check function of DCT detects errors in the task output via a heuristic out-of-bounds check;


Benchmark      Domain            Sgnf. Characterization   LOC (Task)   LOC (TRC function)
DCT            Multimedia        Domain expertise          39           34
Sobel          Image Filter      Randomly                  54           42
Blackscholes   Finance           Profile-driven           117          105
K-means        Data mining       Profile-driven           141           57
Jacobi         Numerical Solver  Profile-driven            62           39

TABLE 2.1: Lines of code (LOC) for the tasks and corresponding result-check and correction functions for each benchmark. The result-check functions are implemented based on the original task code, which was modified to reduce its computational complexity.

coefficients that do not respect the bounds are set to zero. In Sobel the task result-check function corrects only tasks that crashed during their execution, by running an approximate version of the Sobel filter, using a lightweight stencil with just 2/3 of the filter taps. Blackscholes is a benchmark of the Parsec suite [8]. Results are checked with the isfinite() macro; this is a glibc floating point classification macro which returns a non-zero value if the value under inspection is neither NaN nor infinite. If the check fails, the function uses a faster implementation of the Blackscholes formula, substituting costly mathematical operations (such as exp(), sqrt(), log()) with approximate versions. In K-means the result-check function of non-significant tasks is minimalistic, exploiting the error-tolerant nature of this iterative application: if a point attempts to subscribe itself to a cluster but miscalculates the cluster's id, then it reverts to its previous cluster. Also, if the runtime system reports an error, then all points computed by the task are subscribed back to their previous clusters. In Jacobi it is hard to create an error detection mechanism, since assessment of the quality of results is associated with the application in which the solver is used. We implement a simple result-check function which uses the glibc isfinite() macro to detect obvious errors in the output of tasks. In the event of detecting such an error, the current solution estimate is replaced with that of the previous iteration.

In our benchmarks, the result-check part was simple, mostly based on range checks. For the correction part, we reused the original task code and, in some cases, modified it to perform the computation approximately. Table 2.1 shows that result-check functions are almost as large as the tasks themselves. Nevertheless, since we heavily reused the existing task code, the actual effort to implement the result-check function was minimal.
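For concreteness, the Blackscholes check described above essentially reduces to the following sketch, written in the pseudo-code style of Listing 2.3; blackscholes_approx() is an illustrative name for the cheaper fallback, while task_crashed() and TRC_SUCCESS come from the programming model:

#include <math.h>

int option_rescheck(double *price, double S, double K, double r,
                    double sigma, double T) {
    if (task_crashed() || !isfinite(*price))             /* crash, NaN or infinity */
        *price = blackscholes_approx(S, K, r, sigma, T); /* cheaper recomputation  */
    return TRC_SUCCESS;
}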

2.8.3 Runtime Overhead

Figure 2.8 breaks down the execution time of a task for each benchmark using a fixed frequency of 3.7 Ghz. The time spent by the runtime system to create and schedule tasks, to protect the memory and to checkpoint the state of each task be- fore execution is less than 5% of the total task execution time. Task creation and scheduling overhead is practically constant, at about 5000 cycles. The same applies


[Figure: percentage of task execution time per benchmark (DCT, Sobel, Blackscholes, K-means, Jacobi), broken down into: Correct Error, Detect Error, Execute Task, Checkpoint, Tag Significance, Protect Memory, Create Task.]

FIGURE 2.8: Breakdown of task execution time, for each benchmark.

to checkpointing, which costs approximately 2000 cycles. Noticeably, in Sobel and Blackscholes the overhead of correction is comparable with the task execution time. These two benchmarks execute an approximate version of the computation as a correction heuristic, whereas the rest simply discard the computed erroneous solutions, which incurs almost zero overhead.

2.9 Energy Reduction Evaluation Methodology

Commercially available platforms do not allow individual cores to be operated below nominal settings, hence cannot be used to support the runtime model that was described in Section 2.7. We note that operation at sub-nominal (V, f) values is possible in some, but not all, conventional CPUs, and only for the entire CPU. This is not useful for the purposes of our work, because at least one core always has to operate reliably in order to run the OS and the runtime system. As a consequence, we cannot take real measurements of the performance, energy consumption and fault rate/behavior of the system. To evaluate our framework we use a suitable model for estimating the execution time and energy consumption of a computation as a function of the voltage-frequency settings of the FastRel, SlowRel and FastUnRel configurations, and the tasks that are executed in these configurations. The model also takes into account the task management overheads of the runtime system, as well as the cost for performing the voltage and frequency scaling steps needed for a switch between FastRel and SlowRel/FastUnRel. This allows us to run computations on a real platform, trace their execution, profile the performance and power parameters used as input to the model for the FastRel and SlowRel configurations, and extrapolate estimates for the FastUnRel configuration. The performance model is described in detail in Section 2.10.


[Figure: workflow of the evaluation approach. The offline model building phase combines native profiling runs (yielding N_{FR→SR}, N_{SR→FR}, N_u, N_r, C, C_dc, T_{FR→SR}, T_{SR→FR}, E_total, P_total, T_total) with simulation-based fault injection (yielding P_crash) to create the energy model and the fault model. The online execution phase uses these models to inject faults and estimate energy consumption.]

FIGURE 2.9: Evaluation approach: we build the performance, energy and fault models (left), and use these models to drive experiments and estimate energy consumption (right).

In any case, this performance modeling is insufficient. When executing in the FastUnRel configuration the hardware may experience timing errors, which in turn trigger the respective runtime protection mechanisms. A major challenge is to associate the operation in the FastUnRel configuration with the probability of hardware faults due to timing violations. Another issue is how to assess the impact of such faults on the actual execution and outputs of a given task, which is entirely application-specific. We use a model that estimates the fault rate as a function of the voltage-frequency setting of FastUnRel. We use a combination of simulation-based and software-based fault injection to observe the impact of faults on the benchmarks.

Figure 2.9 illustrates the workflow of our evaluation approach, which is split into two phases. During the model building (offline) phase we run the benchmarks in the GemFI simulator (see Section 2.11.2) to obtain the probability of crash (P_crash). We also perform native executions and measure the number of reliable and unreliable tasks (N_r, N_u), the number of transitions from FastRel to SlowRel and vice versa (N_{FR→SR}, N_{SR→FR}), the average time required to perform a voltage and frequency transition from FastRel to SlowRel and vice versa (T_{FR→SR}, T_{SR→FR}), the length (in cycles) of a task (C), the length of its result-check function (C_dc), the total energy and power consumption (E_total, P_total) of the program, as well as its total execution time (T_total). We create the energy model using the above information as well as the operating frequency (f), the supply voltage (V) and the number of cores (c). The energy model input parameters are the number of reliable/unreliable tasks (N_r, N_u), the number of active cores (c), the number of voltage and frequency transitions (N_{FR→SR}, N_{SR→FR}), and the supply voltage and the operating frequency of all configurations. Its goal is to estimate the energy cost of an unreliable execution. We model application error resiliency via simulation-based fault injection combined with observations in the literature. The fault model input arguments are the supply voltages of the FastRel and the FastUnRel configurations. The energy and fault models are discussed in more detail in the following sections. During the execution phase, the models are used to inject errors and to estimate the energy consumption of the execution. After each execution, the system reports the values for the number of tasks that were executed reliably (N_r) and unreliably (N_u), and the number of transitions between the FastRel and SlowRel/FastUnRel configurations (N_{FR→SR} and N_{SR→FR}). Using this data, we estimate the energy footprint of the computation.

2.10 Execution Time and Energy Consumption Model

We introduce an analytical model for the performance and energy consumption of a program as a function of the core frequency, the voltage, the number of tasks that are executed reliably and unreliably, and the number of voltage and frequency transitions. Our model is agnostic to the CPU structure and captures the execution phases of an application. Therefore, it accounts for both the core and uncore components of the CPU. The model is validated on our CPU platform, where it predicts the actual energy consumption of our benchmark applications with high accuracy over a wide range of different (nominal and thus reliable) core configurations.

2.10.1 Execution time modeling

As discussed, the runtime uses three different voltage/frequency configurations, FastRel = (V_h, f_h), SlowRel = (V_l, f_l) and FastUnRel = (V_l, f_h). Equation 2.1 expresses the time for executing a given piece of code N times, where C denotes the number of cycles spent for code execution, and f is the frequency of the core depending on its configuration setting (f_h for FastRel/FastUnRel and f_l for SlowRel).

T(N, f, C) = \frac{C}{f} \times N    (2.1)

Tasks can be executed in parallel by the workers of the runtime system, on different cores. Besides task execution itself, the system software spends additional time to schedule tasks and to manage unreliable task execution. Equations 2.2 give the total execution time of an application:

T_{FastRel} = T(N_r, f_h, C),   T_{SlowRel} = T(N_u, f_l, C_dc),   T_{FastUnRel} = T(N_u, f_h, C)

T_{vfs} = N_{FR→SR} \times T_{FR→SR} + N_{SR→FR} \times T_{SR→FR}    (2.2)

T_{Total} = \max_{i=1}^{Workers}\left(T_{FastRel_i}\right) + \max_{i=1}^{Workers}\left(T_{SlowRel_i} + T_{FastUnRel_i}\right) + T_{vfs}

The execution time for each worker in each configuration is expressed by T_{FastRel}, T_{SlowRel} and T_{FastUnRel}. Variable C is the average number of cycles required to execute a task in FastRel/FastUnRel, while C_dc is the average number of cycles required by the runtime system to prepare for an unreliable task execution and to execute the result-check/repair function in the SlowRel configuration. Variables N_r and N_u express the number of reliable and unreliable tasks executed by the worker, respectively. T_{vfs} captures the time required to switch between the FastRel and SlowRel configurations. Variable N_{FR→SR} denotes the number of times the runtime system switches from FastRel to SlowRel, and T_{FR→SR} is the average time required to perform this transition. Similar parameters apply for the reverse direction. Finally, the total execution time of the application is the maximum execution time among all workers for the first task scheduling phase (in the FastRel configuration), plus the maximum execution time among all workers for the second task scheduling phase (in the SlowRel/FastUnRel configurations), plus the time spent on the respective voltage and frequency transitions.

2.10.2 Power and energy modeling

The total power dissipation of a CMOS circuit is given by Equations 2.3. P_dyn is the dynamic power dissipation, P_leak is the power dissipation due to transistor leakage current, and P_shortC is the power dissipation due to the short circuit formed when both the PMOS and NMOS transistor trees momentarily conduct current during CMOS switching. Since modern fabrication technologies which use high-k dielectric materials can control leakage current, it is the P_dyn component that dominates power dissipation. Therefore, our model considers the idle power consumption of a processor as a constant P_idle, equal to the sum of P_leak and P_shortC. The uncore power consumption of the CPU is included in P_idle. Since P_idle is a constant, all the energy gains are a result of the undervolted core part of the CPU. P_dyn is the product of the supply voltage squared (V^2), the frequency (f) and the activity factor A(~c). We have observed that the activation of a new core in our multicore platform results in power steps. The number of cores used by the application is captured via vector ~c, where ~c[n] is 1 if n cores are active, else 0. A(~c) is the dot product of ~c and a vector ~w containing per-core switching capacitance values which are obtained via regression.

P_{Total} = P_{idle} + P_{dyn},   P_{idle} = P_{leak} + P_{shortC}    (2.3)

P_{dyn}(~c, V, f) = A(~c) \times V^2 \times f,   A(~c) = ~c \cdot ~w

The total energy dissipation E_{Total} is given by Equations 2.4. In general, this depends on the hardware/core configuration and the time spent to execute the runtime management functions, the application tasks, and their result-check/repair functions, as discussed above.


[Figure: relative error (%) of the model's execution time and energy estimates for SlowRel configurations ranging from (1.2 GHz, 0.86 V) to (3.5 GHz, 1.04 V).]

FIGURE 2.10: Relative error for the execution time and energy as predicted by our model vs. a real execution, for our application benchmarks when half of the tasks execute in the FastRel = (3.7 GHz, 1.06 V) configuration and the other half in a lower-power SlowRel configuration. All SlowRel configurations are shown on the x-axis.

E_{FastRel} = P(~c, V_h, f_h) \times \max_{i=1}^{Workers}\left(T_{FastRel_i}\right),   E_{FastUnRel} = P(~c, V_l, f_h) \times \max_{i=1}^{Workers}\left(T_{FastUnRel_i}\right)

E_{SlowRel} = P(~c, V_l, f_l) \times \max_{i=1}^{Workers}\left(T_{SlowRel_i}\right)

E_{vfs} = N_{FR→SR} \times T_{FR→SR} \times P(V_h, f_h) + N_{SR→FR} \times T_{SR→FR} \times P(V_l, f_l)

E_{Total} = E_{FastRel} + E_{SlowRel} + E_{FastUnRel} + E_{vfs}    (2.4)
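The following sketch puts Equations 2.3 and 2.4 together. The per-phase maximum worker times are assumed to come from the timing model sketched earlier; the activity-factor vectors and function names are illustrative placeholders rather than the thesis implementation.

#include <stddef.h>

/* Equation 2.3: P = P_idle + A(c) * V^2 * f, with A(c) the dot product of the
 * core-activation vector c and the regressed per-core capacitance vector w.  */
static double power(const double *c, const double *w, size_t ncores,
                    double p_idle, double V, double f)
{
    double activity = 0.0;
    for (size_t i = 0; i < ncores; i++)
        activity += c[i] * w[i];
    return p_idle + activity * V * V * f;
}

/* Equation 2.4: total energy, given the per-phase maximum worker times and
 * the voltage/frequency transition counts and durations. */
static double total_energy(const double *c, const double *w, size_t ncores,
                           double p_idle, double V_h, double V_l,
                           double f_h, double f_l,
                           double t_fastrel_max, double t_slowrel_max,
                           double t_fastunrel_max,
                           double n_fr_sr, double t_fr_sr,
                           double n_sr_fr, double t_sr_fr)
{
    double e_fastrel   = power(c, w, ncores, p_idle, V_h, f_h) * t_fastrel_max;
    double e_slowrel   = power(c, w, ncores, p_idle, V_l, f_l) * t_slowrel_max;
    double e_fastunrel = power(c, w, ncores, p_idle, V_l, f_h) * t_fastunrel_max;
    double e_vfs = n_fr_sr * t_fr_sr * power(c, w, ncores, p_idle, V_h, f_h)
                 + n_sr_fr * t_sr_fr * power(c, w, ncores, p_idle, V_l, f_l);
    return e_fastrel + e_slowrel + e_fastunrel + e_vfs;
}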

2.10.3 Calibration and validation

We calibrate and validate the timing and energy models based on measurements taken on our platform, for the benchmarks presented in Section 2.8.1. The parameters f_h and f_l are known, while N_r, N_u, N_{FR→SR}, N_{SR→FR} can be measured. C and C_dc are profiled using likwid [135] by accessing the x86 performance counters. Similarly, T_{FR→SR} and T_{SR→FR} are profiled using the FTaLaT tool [84]. Finally, the transition overhead between the SlowRel and FastUnRel configurations is negligible, since clock adjustment is very fast.

As a first step, we execute all tasks of each application reliably under different configurations (V, f, ~c), and measure the power consumption. We then perform linear regression using least squares to derive the parameters ~w and P_idle of the power model. Finally, we validate the accuracy of our model by forcing the runtime system to execute tasks in different (V, f) configurations. To this end, we execute half of the tasks of each application in the FastRel = (1.06V, 3.7GHz) configuration, and the other half in various lower-power but still reliable configurations. The latter are different candidates for SlowRel. Cores enter these configurations by applying a software-driven voltage and frequency transition.

Figure 2.10 depicts the relative error of model-based estimates vs. the execution time and energy that was measured using likwid. Our model closely matches the


real numbers for various SlowRel configurations, with the relative error ranging from 0.004% to 2.7%. In Jacobi the increased error is due to load balancing issues: different executions of the benchmark result in different task-to-worker assignments. This impacts the execution time of the benchmark, hence the increase in the relative error. Note that our x86 platform does not allow placing individual cores in a non-nominal configuration, where actual timing violations and faults might occur. Thus it is impossible to validate the execution time and energy consumption estimates of the model for non-nominal FastUnRel configurations. Still, the accuracy of the model for this wide range of real operating points gives us sufficient confidence to use the model to extrapolate for non-nominal FastUnRel configurations as well.

2.11 Fault Model and Fault Injection Methodology

This section introduces the fault model we use for the different unreliable execution configurations. We discuss how we combine simulation-based and software-based fault injection to map the fault rates derived from the model into actual errors at the application level. Note that it was impossible to conduct our full evaluation using a purely simulated execution platform. Given the vast number of fault injection experiments required to acquire statistically significant results, we would have to limit executions to non-realistic input sizes, despite using a large compute farm for the simulations. Therefore, we adopt a hybrid approach. Initially, we use detailed simulations for injecting faults at the architectural level of an x86 CPU model, and observe the impact they have on each of our benchmarks. Afterwards, we use these observations to drive fault injection via software when running the benchmarks and our runtime system natively, on our platform. The latter setup, in conjunction with the performance model discussed in Section 2.10, makes it possible to evaluate the fault-tolerance mechanisms that are provided by our framework for different SlowRel/FastUnRel configurations.

2.11.1 Fault modeling

A key challenge is to associate the operation of a core in the unreliable FastUnRel configuration with the probability of hardware faults due to timing violations. Besides undervolting (or overclocking), the number and distribution of faults in a CPU also depend on the type of instructions executed: instructions which activate long paths, which are close to the critical path, tend to fail more frequently [103]. The failure probability of each instruction is also closely related to the micro-architectural design of the CPU, optimizations used by synthesis, placement & routing CAD tools, the manufacturing process and process variability, ambient temperature, IR drops, aging, etc. Even identical chips with the same micro-architecture, using the same technology libraries, and running identical code, can have highly different behavior [22]. Moreover, whether a fault manifests as an error does not only depend on


the paths which are activated during the current cycle, but also on the paths which were activated in previous cycles [137].

It is almost impossible to model such complex phenomena in a practical way, as the conclusions are specific to the particular system used to make the observations and create the model, and cannot be generalized to other systems. To the best of our knowledge there is no modern-CPU fault model which combines all these observations in a unified and applicable method. For these reasons, we abstract out the instruction mix of applications, by taking into account only the effects of voltage scaling. The Point of First Failure (PoFF) is used to indicate the point at which circuits start to exhibit massive errors (one error every ∼10 million cycles). Prior to this point errors still occur, however at rates that are orders of magnitude lower [22]. If one goes beyond the PoFF, the fault rate increases exponentially, by one order of magnitude for every 10mV drop of the supply voltage [11, 22].

To guarantee functional correctness, designers typically account for parameter variations by imposing conservative margins that guard against worst-case scenarios. The extent of the voltage margins required for fault-free operation across all operating conditions of the chip is on average around 15% [36, 108, 50]. We determine the PoFF based on Equation 2.5, where ρ is the percentage of the extra voltage margin that guarantees fault-free operation, and V_n is the nominal supply voltage. A CPU part with tight margins has a low ρ and, therefore, low energy benefits when using our approach. We select the average case, ρ = 15%, which is consistent with several observations in the literature [36, 11, 22]. Based on the same reports, we model the fault rate as shown in Equation 2.6. Its parameters are the voltage V_PoFF, which can be obtained from Equation 2.5 using as input argument the nominal voltage V_n, and the voltage of the requested (unreliable) operating point V_u. In our case, V_n = V_h, the voltage setting of FastRel, and V_u = V_l, the voltage setting of the FastUnRel configuration. Our model obtains the constants β, γ via regression on the data provided in [11, 22]. Note that Equation 2.6 is CPU agnostic.

V_{PoFF} = \frac{100 - \rho}{100} \times V_n    (2.5)

Err(V_{PoFF}, V_u) = \beta \times 10^{\gamma (V_u - V_{PoFF})}    (2.6)

V_n(f) = \delta \times f + \eta    (2.7)

Finally, the nominal supply voltage V_n is linearly dependent on the operating frequency, as modeled by Equation 2.7. Parameters δ and η depend on the system configuration. We deduce their values by monitoring the supply voltage of the CPU of our x86 platform, while commanding changes of the operating frequency.
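A small sketch of Equations 2.5–2.7 follows. The regressed constants β, γ, δ and η are not reported here, so they are left as parameters; the function names are our own.

#include <math.h>

/* Equation 2.5: Point of First Failure, given margin rho (in %) and V_n. */
static double v_poff(double rho, double v_nominal)
{
    return (100.0 - rho) / 100.0 * v_nominal;
}

/* Equation 2.6: fault rate at unreliable voltage V_u (beta, gamma regressed). */
static double fault_rate(double v_poff_volts, double v_u,
                         double beta, double gamma)
{
    return beta * pow(10.0, gamma * (v_u - v_poff_volts));
}

/* Equation 2.7: nominal voltage as a linear function of frequency. */
static double nominal_voltage(double freq, double delta, double eta)
{
    return delta * freq + eta;
}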


FIGURE 2.11: Effects of single fault injection, using the GemFI simulator at the architectural CPU level, and the software-based approach during native execution.

2.11.2 Simulation-based fault injection

We use the GemFI framework [96] to execute our benchmarks on a simulated out-of-order CPU supporting the x86 ISA. GemFI injects faults at different CPU pipeline stages. In the fetch stage, a fault corrupts a single bit of the instruction. In the decoding stage, the selection of registers is corrupted so that the instruction in question reads from, or writes to, a different register. In the execution stage, faults corrupt a single bit of the computed result. Finally, faults in the memory stage corrupt a single bit of the value being transferred from/to memory. Even though we only inject faults to a subset of the CPU modules, these faults can propagate to the majority of the CPU modules. For example, when a fault is injected during the execution stage of an integer instruction, the fault corrupts the result of the operation. If the result is stored in a register, the fault propagates and corrupts the register file. Also, when injecting a fault to a memory write, the fault can corrupt the data cache hierarchy and even propagate to the main memory. Note that we model transient faults, i.e. the injection of the fault only lasts for one clock cycle.

Simulated fault injection captures the "default" impact of faults on an application executed on top of unreliable hardware, without employing any of the protection mechanisms provided by our framework. Consequently, all application tasks are susceptible to faults, and the result-check functions are ignored. The number of experiments for each application and pipeline stage (see above) is determined based on the methodology described in [75], for a 99% confidence level and 1% error margin. For the purpose of our evaluation, we categorize the outcome of program execution in three bins: (i) crash if the program did not terminate normally, (ii) inexact if the result is not bit-wise identical to that of a reliable execution, and (iii) exact if the result is bit-wise identical to that of a reliable execution. The output of this phase is the probability for a single fault to result in a crash (P_crash) for each benchmark. This probability is used by the software fault injection mechanisms during native execution.


2.11.3 Software-based fault injection during native execution

For the native (fast) executions of the benchmarks on our platform, we use software-based fault injection. This is designed to have two possible effects: (i) it forces a crash, and (ii) it corrupts a randomly chosen register of the CPU. The former is done with the probability P_crash computed in the GemFI simulation, and the latter with probability 1 − P_crash. As in the simulation experiments, all protection mechanisms are disabled, and faults are injected in all application tasks. To validate that software-based fault injection yields realistic results, we compare the outcome of the native executions with the respective outcomes of simulated executions on GemFI. Figure 2.11, which summarizes the results for all benchmarks, shows that software-based fault injection has practically equivalent effects to simulation-based fault injection using GemFI.

Finally, we support native execution scenarios with multiple fault injections. In these scenarios the runtime selectively executes application tasks in reliable or unreliable mode, and the different protection mechanisms of our framework come into play. The voltage and frequency settings for the FastRel and the FastUnRel configurations are decided as follows. We pick f_h in order to maximize performance, and derive the respective nominal voltage V_h from Equation 2.7. We then set V_l = ε × V_h, with ε < 1.0. Frequency f_l is derived from Equation 2.7, and the fault rate of the FastUnRel configuration is derived from Equation 2.6, using V_l and V_h as parameters. The rate increases for smaller values of ε. Given a target fault rate, we randomly generate a set of fault injection intervals, expressed as the number of cycles between faults, drawn from a uniform distribution whose mean corresponds to the target fault rate. We then use the performance counter infrastructure of x86 CPUs to interrupt application execution at those intervals and invoke the software-based fault-injection logic. For each application, combination of protection mechanisms, and voltage level (fault rate) we perform 10,000 multiple fault injection experiments, for a confidence interval of 95% and an error margin of 2.5%.
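A minimal sketch of the interval generation step described above: the mean interval (in cycles) is taken as the inverse of the target fault rate, and the uniform draw spans [0, 2·mean] so that its mean matches the target. The distribution bounds are our own assumption, since the exact bounds are not spelled out above.

#include <stdlib.h>

/* Draw the number of cycles until the next injected fault, uniformly
 * distributed with the requested mean interval. The runtime programs a
 * performance counter to overflow after this many cycles and injects the
 * fault from the overflow handler. */
static unsigned long next_fault_interval(double target_fault_rate)
{
    double mean_cycles = 1.0 / target_fault_rate;           /* e.g. 1e5 for a rate of 1e-5 */
    double u = (double)rand() / ((double)RAND_MAX + 1.0);    /* u in [0, 1)                 */
    return (unsigned long)(2.0 * mean_cycles * u);            /* mean equals mean_cycles     */
}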

2.12 Experimental Evaluation

We study the behavior of the benchmarks for the different protection levels supported by our runtime system (Section 2.7), using the fault modeling methodology discussed in Section 2.11.1, on our CPU clocked at up to 3.70 GHz. We fix the FastRel configuration to the highest performance configuration, with V_h = 1.06V, f_h = 3.7GHz. To determine proper FastUnRel configurations, we run our benchmarks for different values of V_l while keeping the frequency at f_h, observe their behavior and compute the corresponding energy gains.

Figure 2.12 demonstrates the energy gains for a single task of Sobel when executed at different FastUnRel configurations, in comparison with an execution in the FastRel configuration. The "sweet spot" is around 0.88V. If we undervolt further, inducing faults at higher rates, tasks are practically certain to crash the CPU.


[Figure: energy gain (%) vs. supply voltage (0.85 V to 1.05 V) for a single Sobel task.]

FIGURE 2.12: Energy gains of a single task of Sobel executed at voltages V_l < V_h, for constant frequency f_h = 3.7GHz.

        SlowRel               FastUnRel
Freq.      Voltage     Freq.     Voltage   Fault Rate
1.67 GHz   0.90V                 0.90V     10^-7
1.54 GHz   0.89V       3.7 GHz   0.89V     10^-6
1.41 GHz   0.88V                 0.88V     10^-5

TABLE 2.2: SlowRel and FastUnRel configuration settings used in our evaluation, and average fault rates of the FastUnRel configurations.

This increases the overhead due to the activation of protection and task correction mechanisms in the SlowRel configuration, and cancels any energy gains. In contrast, when a core operates in voltage regions higher than the PoFF, the failure rate is very small, and the functionality of our framework is rarely activated. Since these effects are observed in all the application benchmarks in our evaluation, we focus on the "promising" voltage range from 0.88V to 0.90V. In our evaluation we set FastRel = (1.06V, 3.7GHz). Table 2.2 summarizes the configurations used in our experiments.

Figure 2.13 summarizes our experimental results for a range of voltage settings and protection mechanisms. For each benchmark we present two diagrams. The

Benchmark    C      N_r    N_u     N_{FR→SR}   N_{SR→FR}
DCT          133K   4096   28672   1           1
Sobel        50K    410    3684    1           1
Blscls       197K   90     10      1           1
Kmeans       283K   1500   13500   83          83
Jacobi       594K   830    7470    83          83

TABLE 2.3: Average task execution time in cycles (thousands), number of tasks executed reliably/unreliably, and number of voltage and frequency transitions, for each benchmark.


[Figure: for each benchmark (DCT, Sobel, Blackscholes, K-means, Jacobi), the left panel shows the percentage of experiments reaching a given quality (PSNR in dB for DCT and Sobel, relative error for the rest) at 0.88V, 0.89V and 0.90V, and the right panel shows the energy gain (%) of each protection scenario at the corresponding voltage and error rate (1E-5, 1E-6, 1E-7).]

FIGURE 2.13: Experimental results for different V_l values for the SlowRel and FastUnRel configurations. Percentage of experiments which achieved a certain quality (left), and energy gains with each protection scenario (right).

left one depicts the cumulative distribution function (CDF) of the percentage of experiments (y-axis) achieving a specific quality threshold (x-axis) under different protection mechanisms (different lines). For the media benchmarks (DCT, Sobel) the quality metric is PSNR (higher is better). For the remaining benchmarks quality is quantified by the relative error w.r.t. the fully reliable execution (lower is better). The two extreme bins of the x-axis correspond, on the one side, to experiments which resulted in bitwise exact results, and on the other side to experiments producing very bad output quality. The percentage of crashed experiments can be deduced by subtracting the percentage of worst-quality experiments from 100%. The percentage of experiments which resulted in bitwise exact results is equal to the percentage of experiments which provide the best quality in the CDF. For a specific quality target, the height of each CDF line at the specific quality corresponds to the


percentage of experiments which achieve that specific quality of results. Intuitively, the sooner (to the left) and the higher the lines rise, the better the respective protection configurations.

The diagrams to the right depict the average energy gains against a fully reliable execution (FastRel state) using our runtime in the NP configuration. The number of voltage and frequency transitions, the average execution time of a task in cycles, as well as the number of reliable and non-reliable tasks are given in Table 2.3. In all scenarios where task significance information is taken into account, the task ratio is fixed to 10%, except for DCT, in which the requested ratio is 13%. In DCT all tasks which compute the upper left coefficient corner need to be executed reliably; these tasks correspond to 13% of the total number of tasks. In scenarios that do not exploit significance information, all tasks are executed unreliably.

When no protection mechanism is active, all experiments result in crashes. Basic protection (BP) eliminates crashes, and can even lead to satisfactory behavior as long as the fault rate remains moderate. As expected, error resilience increases as more protection mechanisms are employed. As an exception, result-check functions (B-RC) may produce worse results compared with BP, by discarding partially good results produced by tasks before they crash. On the other hand, energy gains are typically reduced as the amount of protection increases. Therefore, we select naive result-check (RC) functions, which do not spend a lot of time to detect and correct an error. This increases the energy gains, however it decreases the quality of the end result. Another interesting observation is that task re-execution (FS-RE) does not guarantee perfect results, as is clearly visible from the CDFs in Figure 2.13. A task is re-executed reliably only if it crashes or the result-check function requests a re-execution. Since the result-check functions are simple, they miss silent data corruptions, which in turn lead to imperfect results. Finally, when combining all protection mechanisms, the application error resiliency is pushed to significantly higher fault rates. In the following paragraphs, we discuss the behavior of each application in more detail.

The two image processing benchmarks demonstrate similar behavior. The transition from NP to BP completely eliminates any program crashes. However, there is no guarantee for the quality of the output. The produced outputs are of unacceptable quality when executed in all FastUnRel configurations. Even the addition of a result-check function (B-RC) does not increase the quality; the same is observed for the B-SF scenario. In DCT under B-RC the detection part of the result-check function is able to detect many errors; however, when errors corrupt tasks that should have been treated as significant, there is no efficient way to correct them. This motivates the usage of the significance information by our runtime system. On the other hand, in Sobel the detection part is incapable of detecting many errors. In the B-SF scenario significant tasks are protected by the software stack, however there is no increase in the quality of the output. In the case of DCT the absence of a result-check function allows faults that manifest on non-significant tasks to negatively impact the end quality.


[Figure: four DCT output images, labeled BP, B-RC, B-SF and FS.]

FIGURE 2.14: DCT output at 0.89V, with one fault injected every 100,000 cycles. The images correspond (from left to right) to the BP, B-RC, B-SF and FS protection configurations, resulting in PSNRs of 12, 13, 15 and 37 dB respectively. A fault-free execution results in a PSNR of 43 dB. NP deterministically leads to crashes.

Figure 2.14 illustrates the output of four protection configurations (excluding NP and FS-RE) for the DCT benchmark. The corrupted images show the effect of faults when protection is not adequate, while the rightmost image shows that even in a highly faulty environment, our approach almost eliminates visible artifacts. In Sobel the significance characterization of tasks simply spreads unreliability uniformly within the output; however, PSNR does not capture such effects. It is interesting to note that for Sobel at 0.88V the B-RC scenario leads to smaller energy gains than B-SF. Under such high error rates tasks tend to crash frequently, which is detectable by the runtime system, and therefore the correction part is invoked. However, in Sobel the correction part of the result-check function is almost as costly as the task itself (Table 2.3), so correcting a large number of tasks incurs excessive overhead. The combination of the result-check function with the significance values (FS scenario) results in increased quality. Even at the highest fault rates the FS scenario delivers quality higher than 35 dB for DCT and 30 dB for Sobel, respectively, for all experiments. Similar behavior is observed for the FS-RE scenario. In the case of Sobel, the detection part of the result-check function is unable to detect most faults except the ones which lead to task crashes. Therefore the correction part (re-execution in FS-RE) is rarely executed. Consequently, the negative (energy-wise) impact of the re-execution is not captured in this benchmark.

In the FS configuration, when the voltage is decreased from 0.90V to 0.88V, the energy gains of DCT slightly increase from 18% to 21%, whereas in Sobel the energy gain is reduced from 20.0% to 16.0%. The result-check function of DCT sets a default value (0) for the erroneous output. For Sobel, an approximate version of the task is executed. Therefore, the energy gains due to undervolting are eliminated by executing the result-check function more frequently due to the higher fault rate. A similar trend is observed for DCT in the FS-RE configuration. Re-executing the entire task every time its output is detected as erroneous outweighs all energy savings and results in energy losses.

The computationally intensive Blackscholes uses mathematical functions, such as logarithms, square roots, etc., which return NaN or inf when arguments are outside


[Figure: energy gain (%) and quality (PSNR^-1 for Sobel, relative error (%) for Blackscholes) for different ratio values at 0.88V, 0.89V and 0.90V.]

FIGURE 2.15: Quality vs. energy trade-offs using the ratio parameter in the FS configuration.

their definition range. Detecting such errors is easy. Since many errors can be detected, FS-RE computes exact outputs 24% of the time when operating at 0.90V. The resulting output quality is exceptional, with a relative error less than 0.03% across all experiments. However, at high fault frequencies the application results in energy losses, since a large number of tasks need to be re-executed in the SlowRel domain.

K-means and Jacobi demonstrate similar characteristics. At 0.90V, in all protection scenarios, both applications result in a relative error less than 10^-6 %. In K-means the quality decreases rapidly for higher fault rates. Neither the result-check function nor the significance values increase output quality. The result-check function has no efficient way to correct errors, and the small subset of the last significant iterations is unable to assign the points to the correct centers. For Jacobi, at 0.89V BP has better quality than B-RC. In B-RC, when an error is detected, the current solution estimate is replaced with that of the previous iteration. At high fault rates errors are frequently detected and therefore the respective iterations are discarded. In Jacobi it is better to rely on the self-healing attributes of the application than to correct the result.

Figure 2.15 presents experiments in which we vary the ratio parameter and record the energy savings and output quality for different values, for the Sobel and Blackscholes benchmarks under the FS configuration. The ratio knob allows the user to select the percentage of reliably executed tasks and can effectively control the trade-off between energy savings and quality loss. A similar behavior is observed in all benchmarks.

Note that energy gains are best obtained when shaving voltage guard bands to the point of first failure. More aggressive voltage scaling diminishes improvements in energy efficiency, due to the overhead of error detection and correction.


Chapter 3

Significance Aware Approximate Computing

In this chapter we present the approximate version of our significance-aware approach. Although the significance-aware fault tolerant programming model provides significant energy gains, it is hard for the developer to fully comprehend the interactions between significance identification, the ratio, error manifestation, and error detection and correction. Developers are more familiar with approximate computing, as it presents a more intuitive trade-off between quality and energy. To this end, we port our fault tolerant programming model to support approximate computing. The semantics of our approach, namely the significance clause and the ratio clause, remain the same, but we allow the developer to define alternative, approximate and more energy-efficient task functions. We envision the approximate programming model as an intermediate step prior to adopting the fault tolerant one. Developers, using the approximate variant, can grasp the interactions between significance tagging and the ratio clause.

3.1 Contributions

In this chapter we make the following contributions:

• To reduce the energy footprint of applications we port the significance-aware fault tolerant programming model to support approximate computing. In this variant of the programming model the developer can supply approximate versions of non-significant tasks. The basis of the programming model is similar to the fault tolerant programming model, however the mechanisms to reduce the energy consumption are slightly different.

• We experimentally evaluate our approach and show that it is superior to loop perforation [125], a compiler-based approximation technique.


3.2 Programming Model

The significance-aware programming model for approximate computing is similar to the programming model described in Chapter 2. Both programming models use the same syntax and semantics to define the significance of computations, parallelism and the synchronization of tasks. The main difference is that, instead of providing functions for error detection and correction, the developer provides an approximate version of the original task.

 1 int sblX(byte *img, int y, int x) {
 2   return img[(y-1)*WIDTH+x-1]
 3     + 2*img[y*WIDTH+x-1] + img[(y+1)*WIDTH+x-1]
 4     - img[(y-1)*WIDTH+x+1]
 5     - 2*img[y*WIDTH+x+1] - img[(y+1)*WIDTH+x+1];
 6 }
 7
 8 int sblX_appr(byte *img, int y, int x) {
 9   return /* img[(y-1)*WIDTH+x-1] Omitted tap */
10     + 2*img[y*WIDTH+x-1] + img[(y+1)*WIDTH+x-1]
11     /* - img[(y-1)*WIDTH+x+1] Omitted tap */
12     - 2*img[y*WIDTH+x+1] - img[(y+1)*WIDTH+x+1];
13 }
14
15 /* sblY and sblY_appr are similar */
16 void row_acc(byte *res, byte *img, int i) {
17   unsigned int p, j;
18   for (j=1; j<WIDTH-1; j++) {
19     p = sqrt(pow(sblX(img,i,j),2)
20              + pow(sblY(img,i,j),2));
21     res[i*WIDTH+j] = (p > 255) ? 255 : p;
22   }
23 }
24
25 void row_appr(byte *res, byte *img, int i) {
26   unsigned int p, j;
27   for (j=1; j<WIDTH-1; j++) {
28     /* approximate gradient magnitude:       */
29     /* |sblX| + |sblY| instead of sqrt       */
30     p = abs(sblX_appr(img,i,j))
31         + abs(sblY_appr(img,i,j));
32     res[i*WIDTH+j] = (p > 255) ? 255 : p;
33   }
34 }
35
36 double sobel(void) {
37   int i;
38   byte img[WIDTH*HEIGHT], res[WIDTH*HEIGHT];
39   /* Initialize img array and reset res array */
40   ...
41   for (i=1; i<HEIGHT-1; i++) {
42 #pragma omp task approxfun(row_appr(res, img, i)) \
43         label(sobel) in(img) \
44         out(res) \
45         significant(0.1 + (0.8*i)/HEIGHT) /* ranges from 0.1 to 0.9 */
46     row_acc(res, img, i);
47   }
48 #pragma omp taskwait label(sobel) ratio(0.35)
49 }

LISTING 3.1: Programming model use case: Sobel filter

Listing 3.1 depicts our programming model in the implementation of the Sobel filter, which we use as a running example.


#pragma omp task [significant(...)] [label(...)] [in(...)] [out(...)] [approxfun(function())]

LISTING 3.2: #pragma omp task

Tasks are specified using the #pragma omp task directive (Listing 3.2), followed by the task body function. The significant, label, in, out clauses have the same semantics as the ones used in the fault tolerant version. For tasks with significance less than 1.0, the programmer may provide an alternative, approximate task body, through the approxfun() clause. This function is executed whenever the runtime opts to approximate a task. It typically implements a simpler version of the computation in the task, which may even degenerate to setting default values for the task output. If the runtime system decides to execute a task approximately and the programmer has not supplied an approxfun version, the task is dropped. The approxfun function implicitly takes the same arguments as the function implementing the accurate version of the task body.

As an example, lines 41-46 of Listing 3.1 create a separate task to compute each row of the output image. The significance of the tasks gradually ranges between 0.1 and 0.9 (line 45), so that there are no extreme quality fluctuations across the output image. The approximate function row_appr implements a lightweight version of the computation. All tasks created in the specific loop belong to the sobel task group, using img as input and res as output (lines 43-44).

#pragma omp taskwait [label(...)] [ratio(...)]

LISTING 3.3: #pragma omp taskwait

Explicit barrier-like synchronization is supported via the #pragma omp taskwait directive (Listing 3.3). The synchronization primitive has the same semantics as its counterpart in the fault tolerant version, however the user can neither define a group result-check function nor specify a timing constraint. As an example, line 48 of Listing 3.1 specifies a barrier for the tasks of the sobel task group. The runtime needs to ensure that, at a minimum, the most significant 35% of the group tasks are executed accurately. Note that the runtime may opt for a higher ratio, provided this is feasible within the energy budget of the program.

3.3 Runtime Support for Significance Aware Approximate Computing

We demonstrate how to extend existing runtime systems to support our programming model for approximate computing. To this end, we extend a task-based parallel runtime system that implements OpenMP 4.0-style task dependencies [136].


Our runtime system is organized as a master/slave work-sharing scheduler. The master thread starts executing the main program sequentially. For every task call encountered, the task is enqueued in a per-worker task queue. Tasks are distributed across workers in round-robin fashion. Workers select the oldest tasks from their queues for execution. When a worker's queue runs empty, the worker may steal tasks from other workers' queues.

The runtime system furthermore implements an efficient mechanism for identifying and enforcing dependencies between tasks that arise from annotations of the side effects of tasks with in(...) and out(...) clauses. Dependence tracking is, however, orthogonal to our approximate computing programming model. Therefore, we provide no further details on this feature.

The job of the runtime system is to selectively execute a subset of the tasks approximately while respecting the constraints given by the programmer. The relevant information consists of (i) the significance of each task, (ii) the group a task belongs to, and (iii) the fraction of tasks that may be executed approximately for each task group. Moreover, preference should be given to approximating tasks with lower significance values as opposed to tasks with high significance values.

The runtime system has no a priori information on how many tasks will be issued in a task group, nor what the distribution is of the significance levels in each task group. This information must be collected at runtime. In the ideal case, the runtime system knows this information in advance. Then, it is straightforward to execute approximately those tasks with the lowest significance in each task group. We have designed two runtime policies which work without this information, and estimate it at runtime [138]. Global Task Buffering (GTB) is a globally controlled policy based on buffering issued tasks and analyzing their properties. Local Queue History (LQH) estimates the distribution of significance levels using per-worker local information1.

In GTB the master thread stores tasks in a buffer as it creates them, postponing the issue of the tasks to the worker queues. When the buffer is full it sorts the tasks based on their significance. Given a per-group ratio of accurate tasks R and B tasks in the buffer, the R × B tasks with the highest significance level are executed accurately. The LQH policy avoids the step of task buffering. Tasks are issued to worker queues immediately as they are created. The worker decides whether to approximate a task right before it starts its execution, based on the distribution of significance levels of the tasks executed so far, and the target ratio of accurate tasks (supplied by the programmer).
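The following is a minimal sketch of the GTB decision step under the description above: buffered tasks are sorted by significance and the R × B most significant ones are marked for accurate execution. The task type and the marking mechanism are illustrative placeholders; the actual runtime additionally handles task issue, stealing and dependence tracking.

#include <stdlib.h>
#include <stddef.h>

typedef struct {
    double significance;   /* value from the significant(...) clause */
    int    run_accurate;   /* 1: run the exact body, 0: run approxfun */
} task_t;

/* Comparator: sort tasks by descending significance. */
static int by_significance_desc(const void *a, const void *b)
{
    double sa = ((const task_t *)a)->significance;
    double sb = ((const task_t *)b)->significance;
    return (sa < sb) - (sa > sb);
}

/* GTB: given B buffered tasks and a target accurate ratio R for the group,
 * execute the R*B most significant tasks accurately, approximate the rest. */
static void gtb_mark(task_t *buffer, size_t B, double R)
{
    qsort(buffer, B, sizeof(task_t), by_significance_desc);
    size_t accurate = (size_t)(R * (double)B);
    for (size_t i = 0; i < B; i++)
        buffer[i].run_accurate = (i < accurate);
}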

3.3.1 Life of a group-of-tasks

Figure 3.1 illustrates the typical life of a group-of-tasks in an application implemented using our significance aware approximate computing programming model. For each group instantiated during the life of an application, the runtime system

1Both policies (GTB, LQH) are implemented by Charalampos Chalios.


[Figure: flow of a task group: group tasks → partition tasks into most and least significant according to the significance ratio → invoke the exact implementation for the most significant tasks and the approximate alternative for the least significant ones → wait for all tasks to complete.]

FIGURE 3.1: The typical life of a group-of-tasks in the context of significance aware approximate computing

receives a collection of tasks with varying significance values and a desired approximation level in the form of a significance ratio. Afterwards, the runtime system partitions the tasks into two sets, the most significant tasks and the least significant ones. The most significant ones are executed in an accurate way, whereas the runtime system invokes the approximate implementation for the least significant ones.

3.3.2 Approximate vs Fault Tolerant Runtime Support

A large portion of the approximate and fault tolerant runtime systems is identical, as they support similar features. The approximate variant does not require any protection mechanism such as memory management (Section 2.7.2), relaxed synchronization, soft check-pointing, or taking into account the operating settings of the CPU. Regarding task scheduling, since the approximate variant does not require specific CPU settings in terms of operating frequency and voltage, tasks are flagged as ready for execution right after their final significance is decided. On the other hand, to minimize the number of transitions between the FastRel and the FastUnRel settings, the fault tolerant runtime needs to wait for all tasks of a group to be instantiated prior to starting the execution of tasks.

3.4 Experimental Evaluation

We performed a set of experiments to investigate the performance of the proposed programming model and runtime policies, using different benchmark codes that were re-written using the task-based pragma directives. In particular, we evaluate our approach in terms of: (i) The potential for performance and energy reduction; (ii) The potential to allow graceful quality degradation; (iii) The overhead incurred


Benchmark      Approximate    Approximation Degree          Quality
               or Drop        Mild     Med      Aggr
Sobel          A              80%      30%      0%           PSNR
DCT            D              80%      40%      10%          PSNR
MC             D, A           100%     80%      50%          Relative Error
K-means        A              80%      60%      40%          Relative Error
Jacobi         D, A           10^-4    10^-3    10^-2        Relative Error
Fluidanimate   A              50%      25%      12.5%        Relative Error

TABLE 3.1: Benchmarks used for the evaluation. For all cases, except Jacobi, the approximation degree is given by the percentage of accurately executed tasks. In Jacobi, it is given by the error tolerance in convergence of the accurately executed iterations/tasks (10^-5 in the native version).

by the runtime mechanisms. In the sequel, we introduce the overall evaluation approach, and discuss the results achieved for various degrees of approximation under different runtime policies.

3.4.1 Approach

We use a set of six benchmarks, outlined in Table 3.1, where we apply different approximation approaches, subject to the nature/characteristics of the respective computation.

The approximate version of the Sobel tasks uses a lightweight Sobel stencil with just 2/3 of the filter taps. Additionally, it substitutes the costly formula \sqrt{sbl_x^2 + sbl_y^2} with its approximate counterpart |sbl_x| + |sbl_y|. The way of assigning significance to tasks ensures that the approximated pixels are uniformly spread throughout the output image.

For the tasks of the Discrete Cosine Transform (DCT) [141], we assign higher significance to tasks that compute lower-frequency coefficients.

For Monte Carlo (MC) a modified, more lightweight methodology is used to decide how far from the current location the next step of a random walk should be.

Approximated K-means tasks compute a simpler version of the Euclidean distance, while at the same time considering only a subset (1/8) of the dimensions (a sketch is given below). Only accurate results are considered when evaluating the convergence criteria.

In Jacobi, we execute the first 5 iterations approximately, by dropping the tasks (and computations) corresponding to the upper right and lower left areas of the matrix. This is not catastrophic, due to the fact that the matrix is diagonally dominant and thus most of the information is within a band near the diagonal. All the following steps, until convergence, are executed accurately, however at a higher target error tolerance than the native execution (see Table 3.1).

In Fluidanimate, each time step is executed as either fully accurate or fully approximate, by setting the ratio clause of the omp taskwait pragma to either 0.0 or 1.0. In the approximate execution, the new position of each particle is estimated assuming it will move linearly, in the same direction and with the same velocity as it did in the previous time steps.
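As an illustration of the K-means approximation described above, the sketch below computes a simplified distance over only the first 1/8 of the dimensions and replaces the squared terms with absolute differences; the function names, the choice of which dimensions to keep, and the use of absolute differences are our own assumptions.

#include <stddef.h>

/* Accurate squared Euclidean distance over all dimensions (the square root is
 * omitted since it does not change the argmin when assigning points to centers). */
static double dist_acc(const double *p, const double *c, size_t dims)
{
    double d = 0.0;
    for (size_t i = 0; i < dims; i++)
        d += (p[i] - c[i]) * (p[i] - c[i]);
    return d;
}

/* Approximate distance: consider only 1/8 of the dimensions and use the
 * absolute difference instead of the square, as a cheaper proxy. */
static double dist_appr(const double *p, const double *c, size_t dims)
{
    double d = 0.0;
    for (size_t i = 0; i < dims / 8; i++)
        d += (p[i] > c[i]) ? (p[i] - c[i]) : (c[i] - p[i]);
    return d;
}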


FIGURE 3.2: Different levels of approximation for the Sobel benchmark

Three different degrees of approximation are studied for each benchmark: Mild, Medium, and Aggressive (see Table 3.1). They correspond to different choices in the quality vs. energy and performance space. No approximate execution led to abnormal program termination. It should be noted that, with the partial exception of Jacobi, quality control is possible solely by changing the ratio parameter of the taskwait pragma. This is indicative of the flexibility of our programming model. As an example, Figure 3.2 visualizes the results of different degrees of approximation for Sobel: the upper left quadrant is computed with no approximation, the upper right is computed with Mild approximation, the lower left with Medium approximation, whereas the lower right corner is produced when using Aggressive approximation.

The quality of the final result is evaluated by comparing it to the output produced by a fully accurate execution of the respective code. The appropriate metric for the quality of the final result differs according to the computation. For benchmarks involving image processing (DCT, Sobel), we use the peak signal to noise ratio (PSNR) metric, whereas for MC, K-means, Jacobi and Fluidanimate we use the relative error.

In the experiments, we measure the performance of our approach for the different benchmarks and approximation degrees, for the two different runtime policies GTB and LQH. For GTB, we investigate two cases: the buffer size is set so that tasks


are buffered until the synchronization barrier (referred to as Max Buffer GTB); or the buffer size is set to a smaller value, depending on the computation, so that task execution can start earlier (referred to as GTB). As a reference, we compare our approach against:

• A fully accurate execution of each application, using a significance-agnostic version of the runtime system.

• An execution using loop perforation [125], a simple yet usually effective compiler technique for approximation. Loop perforation is also applied in three different degrees of aggressiveness. The perforated version executes the same number of tasks as those executed accurately by our approach.

The experimental evaluation is carried out on a system equipped with 2 Intel(R) Xeon(R) CPU E5-2650 processors clocked at 2.00 GHz, with 64 GB of shared memory. Each CPU consists of 8 cores. Although the cores support SMT execution (hyper-threading), we deactivated this feature during our experiments. We use the CentOS 6.5 Linux operating system with the 2.6.32 Linux kernel. Each execution pinned 16 threads on all 16 cores. Finally, energy and power are measured using likwid [135] to access the Running Average Power Limit (RAPL) registers of the processors.

3.4.2 Experimental Results

Figure 3.3 depicts the results of the experimental evaluation of our system. For each benchmark we present execution time, energy consumption and the corresponding error metric.

The approximated versions of the benchmarks execute significantly faster and with less energy consumption compared to their accurate counterparts. Although the quality of the application output deteriorates as the approximation level increases, this typically happens in a graceful manner, as can be observed in Figure 3.2 and the 'Quality' column of Figure 3.3. The GTB policies with different buffer sizes are comparable with each other. Even though Max Buffer GTB postpones task issue until the creation of all tasks in the group, this does not seem to penalize the policy. In most applications tasks are coarse-grained and are organized in relatively small groups, thus minimizing the task creation overhead and the latency for the creation of all tasks within a group. LQH is typically faster and more energy-efficient than both GTB flavors, except for K-means.

In the case of Sobel, the perforated version seems to significantly outperform our approach in terms of both energy consumption and execution time. However, the cost of doing so is unacceptable output quality, even for the mild approximation level, as shown in Figure 3.4. Our programming model and runtime policies achieve graceful quality degradation, resulting in acceptable output even with aggressive approximation, as illustrated in Figure 3.2.



FIGURE 3.3: Execution time, energy and quality of results for the benchmarks used in the experimental evaluation under different runtime policies and degrees of approximation. In all cases lower is better. Quality is depicted as PSNR⁻¹ for Sobel and DCT; relative error (%) is used for all other benchmarks. The accurate execution and the approximate execution using perforation are visualized as lines. Note that perforation was not applicable to Fluidanimate.

DCT is friendly to approximations: it produces visually acceptable results even if a large percentage of the computations is dropped. Our policies, with the exception of the Max Buffer version of GTB, perform comparably to loop perforation in terms of performance and energy consumption, yet result in higher quality results².


FIGURE 3.4: Different levels of perforation for the Sobel benchmark. Accurate execution, Perforation of 20%, 70% and 100% of loop iterations on the upper left, upper right, lower left and lower right quadrants respectively.

This is due to the fact that our model offers more flexibility than perforation in defining the relative significance of code regions in DCT. The problematic performance of GTB (Max Buffer) is discussed later in this section, when evaluating the overhead of the runtime policies and mechanisms.

The approximate version of MC significantly outperforms the original accurate version, without suffering much of a penalty on its output quality. Randomized algorithms are inherently susceptible to approximations without requiring much sophistication. It is characteristic that the performance of our approach is almost identical to that of blind loop perforation. We observe that the LQH policy attains slightly better results. In this case, we found that the LQH policy undershoots the requested ratio, evidently executing fewer tasks accurately³. This affects quality, which is lower than that achieved by the rest of the policies.

K-means behaves gracefully as the level of approximation increases. Even in the aggressive case, all policies demonstrate relative errors less than 0.45%. The GTB policies are superior in terms of execution time and energy consumption in comparison with the perforated version of the benchmark. Noticeably, the LQH policy exhibits slow convergence to the termination criteria. The application terminates when the number of objects which move to another cluster is less than 1/1000 of the total object population.

²Note that PSNR is a logarithmic metric.
³4.6% and 5.1% more tasks than requested are approximated for the aggressive and the medium case, respectively.



FIGURE 3.5: The normalized execution time of benchmarks under different task categorization policies, with respect to that over the significance-agnostic runtime system

As mentioned in Section 3.4.1, objects which are computed approximately do not participate in the termination criteria. GTB policies behave deterministically, therefore always selecting tasks corresponding to specific objects for accurate execution. On the other hand, due to the effects of dynamic load balancing in the runtime and its localized perspective, LQH tends to evaluate accurately different objects in each iteration. Therefore, it is more challenging for LQH to achieve the termination criterion. Nevertheless, LQH produces results with the same quality as a fully accurate execution, with significant performance and energy benefits.

Jacobi is an application with unique characteristics: approximations can affect its rate of convergence in deterministic, yet hard to predict and analyze ways. The blind perforation version requires fewer iterations to converge, thus resulting in lower energy consumption than our policies. Interestingly enough, it also results in a solution closer to the real one, compared with the accurate execution.

The perforation mechanism could not be applied on top of the Fluidanimate benchmark. This is because if the evaluation of the movement of part of the particles during a time step is totally dropped, the physics of the fluid are violated, leading to completely wrong results. Our programming model offers the programmer the expressiveness to approximate the movement of the liquid for a set of time steps. Moreover, in order to ensure stability, it is necessary to alternate accurate and approximate time steps. In our programming model this is achieved in a trivial manner, by alternating the parameter of the ratio clause at taskwait pragmas between 100% and the desired value in consecutive time steps. It is worth noting that Fluidanimate is so sensitive to errors that only the mild degree of approximation leads to acceptable results. Even so, the LQH policy requires less than half the energy of the accurate execution, with the two versions of the GTB policy being almost as efficient.

Next, we evaluate the overhead of the two runtime policies and mechanisms.


We measure the performance of each benchmark when executed with a significance-agnostic version of the runtime system, which does not include the execution paths for classifying and executing tasks according to significance. We then compare it with the performance attained when executing the benchmarks with the significance-aware version of the runtime. All tasks are created with the same significance and the ratio of tasks executed accurately is set to 100%, therefore eliminating any benefits of approximate execution. Figure 3.5 summarizes the results. It is evident that the significance-aware runtime system typically incurs negligible overhead. The overhead is in the order of 7% in the worst case (DCT under the GTB Max Buffer policy). DCT creates many lightweight tasks, therefore stressing the runtime. At the same time, given that for DCT task creation is a non-negligible percentage of the total execution time, the latency between task creation and task issue introduced by the Max Buffer version of the GTB policy results in a measurable overhead.


Chapter 4

Modeling and Prediction of Voltage Margins in Multicore CPUs

In this chapter we describe an end-to-end methodology that exploits CPU voltage margins to increase the energy efficiency of modern CPUs. The methodology identifies and sets a new CPU supply voltage which is lower than the nominal one. The objectives of the methodology are:

1. Error avoidance. The new CPU supply voltage should be lower than the nominal one to reduce the power consumption of the system but should be high enough so that errors do not manifest during the execution of various workloads.

2. Source code agnostic. The methodology can be applied without any knowledge about the source code of the applications.

In contrast to the previous chapters, in this chapter we avoid both errors and source code modifications: many applications are not error resilient and require 100% output quality, and the source code of many applications is not always available. Our methodology, presented in this chapter, overcomes these difficulties.

4.1 Background

In this chapter we seek to identify an operating point for the CPU at which the CPU consumes less energy and applications execute without errors. Any CPU operates with a supplied voltage (V) and an operating frequency (f); the pair (V, f) is referred to as an operating point. Typically, a CPU manufacturer defines a finite number of operating points, called Nominal Operating Points (NOP), which guarantee error-less operation, as presented in Equation 4.1. M denotes the total number of nominal operating points for a given CPU. In the context of this thesis we refer to any one of the nominal points as (Vn, fn). Executing a workload¹ (W) on top of an operating point (V, f) is defined using the operator presented in Equation 4.2. The Quality of Output (QoO) of a workload when executed on a NOP is always considered Bitwise Exact (BE) (Equation 4.3).

¹The workload depends on the executable code and the inputs to the code.

NOP = {(V0, f0),..., (Vm, fm) | 0 < m < M} (4.1)

output = Exec(W, (V, f)) (4.2)

output = Exec(W, (Vn, fn)) =⇒ QoO(output) = BE (4.3)

Creating a more energy efficient operating point of a CPU is feasible using two methods. For a given nominal operating point (Vn, fn), one can create a new more energy efficient operating point by:

Undervolting: (Vu, fn), reduce the supply voltage Vu below Vn (Vu < Vn), while keeping the operating frequency unchanged.

Overclocking: (Vn, fo), increase the operating frequency fo beyond fn (fo > fn), and keep the supply voltage unchanged.

In the context of this thesis we refer to the amount of undervolting or overclocking as margin. The voltage margin (Vmargin) is defined in Equation 4.4 and is the distance between the nominal supply voltage (Vn) and the undervolted supply voltage (Vu). Similarly, in the case of overclocking, the frequency margin (fmargin), as defined in Equation 4.5, refers to the distance of the nominal operating frequency (fn) from the overclocked frequency (fo).

Vmargin = Vn − Vu | Vu ∈ [0, Vn)    (4.4)

fmargin = fo − fn | fo ∈ (fn, ∞)    (4.5)
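As a purely illustrative example (the undervolted value here is chosen only for the arithmetic; the nominal 1.15 V is taken from Table 4.2): with Vn = 1.15 V and Vu = 0.90 V, Equation 4.4 gives Vmargin = Vn − Vu = 0.25 V = 250 mV, i.e. about 21.7% of Vn, which is consistent with the 17% to 24% Skylake margins reported in Section 4.4.2.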

Executing applications on CPUs operating at reduced margins comes with risks, as the reliability of the system is reduced. Figure 1.1 depicts the various operating point regions of a CPU. The larger the voltage reduction or the frequency increase from the NOP (green dashed line), the higher the reliability reduction. The boundary between the safe and unsafe region depends on the microarchitecture, the specific chip part and the executed workload. Equations 4.6 and 4.7 define the maximum Vmargin and fmargin, respectively, which provide bit-wise exact QoO for a specific workload, architecture and chip part, as maxVmargin and maxfmargin.

maxVmargin = max { Vn − Vu | 0 ≤ Vu ≤ Vn, QoO(Exec(W, (Vu, fn))) = BE }    (4.6)

maxfmargin = max { fo − fn | fn ≤ fo < ∞, QoO(Exec(W, (Vn, fo))) = BE }    (4.7)


4.2 Contributions

In this chapter we show the feasibility of executing applications within the safe region without reducing the reliability or the quality of output (QoO) of the executing applications. To be more precise, we make the following contributions.

• Motivated by the extent of the maxVmargin of two Haswell and four Skylake processors, we present a modeling methodology that takes as input selected CPU performance counters and core utilization, and estimates the voltage margin of the workload on the specific CPU. This estimation can be exploited to safely undervolt CPUs and achieve significant energy gains.

• Our model is used by a dynamic voltage scaling governor, called xDVS², which we extensively verify via a long-running experiment (72 consecutive hours) that dynamically changed the application workload on 6 different workstations, without any degradation of system reliability.

FIGURE 4.1: Overview of our approach for margin characterization, modeling and dynamic prediction.

²The xDVS governor was implemented by Panos Koutsovasilis.


4.3 Methodology

Figure 4.1 illustrates the steps of our approach:

1. The maxVmargin is determined for different cores (and all cores together) of the target CPU part for different workloads via offline characterization;

2. The same workloads are profiled to quantify their interaction with the CPU using online performance counters;

3. We combine the maxVmargin characterization and the profiling data to fit two models for each CPU part, one for single core execution and one for multi core execution, that predict a reduced, yet still safe supply voltage V'dd (Vmin < V'dd < Vdd).

4. The models are used by a dynamic voltage scaling governor to adjust at exe- cution time the supply voltage of the CPU below nominal values.
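To make step 4 concrete, the following is a minimal sketch of such a governor loop; it is not the actual xDVS implementation, and read_counters, predict_margin_mv and apply_offset_mv are hypothetical callbacks standing in for the PMU sampling, the trained per-CPU model of Section 4.5 and the platform-specific voltage-offset (MSR) write, respectively:

    import time

    def dvs_governor(read_counters, predict_margin_mv, apply_offset_mv,
                     safety_margin_mv=7.0, interval_s=0.1):
        """Periodically predict a safe voltage margin and undervolt accordingly.

        read_counters():      returns the vector of selected performance metrics
                              collected over the last interval
        predict_margin_mv(x): the trained per-CPU margin model (Section 4.5)
        apply_offset_mv(mv):  platform-specific voltage offset write (negative
                              values undervolt below the nominal supply voltage)
        """
        while True:
            x = read_counters()
            margin = predict_margin_mv(x) - safety_margin_mv   # see Section 4.5.4
            apply_offset_mv(-max(0.0, margin))                 # never exceed nominal
            time.sleep(interval_s)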

4.4 Offline Characterization Background

In this section we describe the offline characterization (steps 1 and 2 in Figure 4.1).

4.4.1 Methodology to identify maxVmargin

Figure 4.2 summarizes the methodology we followed to characterize the maxVmargin. To identify the maxVmargin of each of the applications Ai presented in Table 4.1, we execute each benchmark in different core configurations (Cj): either occupying a single core, or all cores of the target CPU. Single-core experiments are executed once per core, with the running thread pinned on the respective core while the rest of the cores are idle. To fully utilize the CPU, multi-threaded benchmarks are executed with a degree of parallelism equal to the number of cores, whereas in the case of sequential (single-process) benchmarks we achieve full utilization by co-executing as many copies of the benchmark as the number of cores.


FIGURE 4.2: Offline characterization of maxVmargin methodology


Parsec [9]: Bodytrack (Computer Vision), Facesim (Animation), Blackscholes (Financial Analysis), Swaptions (Financial Analysis), Freqmine (Data Mining), Fluidanimate (Fluid Dynamics)
SPEC2006 [40]: Bwaves (Fluid Dynamics), Leslie3D (Fluid Dynamics), Lbm (Fluid Dynamics), Bzip2 (Compression), H264ref (Video Compression), DealII (Finite Element Analysis), Gcc (C Compiler), Gamess (Quantum Chemistry), Gobmk (Artificial Intelligence), Gromacs (Molecular Dynamics), Hmmer (Gene Sequence Search), Libquantum (Quantum Computing), Milc (Quantum Chromodynamics), Mcf (Combinatorial Optimization), Namd (Molecular Dynamics), Omnetpp (Event Simulation), Perlbench (Programming Language), Povray (Ray-tracing), Sjeng (Artificial Intelligence), Soplex (Linear Programming, Optimization), Sphinx3 (Speech Recognition), Tonto (Quantum Chemistry), Xalanbmk (XML Processing), Zeusmp (Physics / CFD)
GIMPS [145]: Prime (Mersenne prime stress test)
Linpack [49]: Linpack (HPC)
Firestarter [37]: Firestarter (Processor stress test)
Stress-NG [132]: Stress-NG (Linux stress test)

TABLE 4.1: Benchmarks used to characterize the voltage margins of the CPUs.

We wait initially for all processes/threads to perform their initialization routines (performing I/O etc.) and then we apply the instructed Vmargin. In the end, we experiment with n + 1 application mappings to core configurations, where n is equal to the number of cores within each CPU. The first n configurations are single core executions, and in the case of the n + 1 core configuration we use all the available cores. In the end, for each combination of application (Ai), chip part and core configuration (Cj) we have a single maxVmargin,i,j.

We use XM² (presented in Section 5.3) to identify the maxVmargin. For each benchmark and core configuration we determine maxVmargin using a binary search algorithm, which looks for the maximum applicable maxVmargin within a range [low, high]. Initially, low = 0mV and high = 500mV. In the first step of the search we set Vmargin equal to the middle of this interval. During execution we monitor the system for Machine Check Exceptions (MCEs), application crashes, kernel panics etc. When an experiment terminates, the output is compared against the correct, "golden" output in order to detect any Silent Data Corruptions (SDCs). To account for potentially non-deterministic behavior, we experiment with every configuration and Vmargin combination 10 times. If all experiments complete successfully, the region [low, Vmargin] is marked as safe, the low bound is increased to low = Vmargin, and Vmargin is adjusted accordingly. If problematic behavior is detected during any experiment, the region [Vmargin, high] is marked as unsafe, and the high bound is decreased to high = Vmargin. The algorithm terminates when the interval width becomes less than 5mV. At this point maxVmargin = Vmargin.
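A compact sketch of this search is shown below (illustrative only); run_safely is a hypothetical helper that executes the given application/core configuration 10 times at the requested margin and returns True only if every run completes with bit-wise exact output and without MCEs, crashes or kernel panics:

    def find_max_vmargin(app, cores, run_safely,
                         low_mv=0, high_mv=500, width_mv=5):
        """Binary search for the maximum safe undervolting margin (in mV)."""
        while high_mv - low_mv >= width_mv:
            margin = (low_mv + high_mv) // 2
            if run_safely(app, cores, margin):
                low_mv = margin      # [low, margin] is safe: raise the lower bound
            else:
                high_mv = margin     # [margin, high] is unsafe: lower the upper bound
        return low_mv                # maxVmargin, resolved to within width_mv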

4.4.2 Results of offline maxVmargin Characterization

We perform the experimental analysis on 6 workstations, two of them featuring an Intel Haswell i7 CPU (called Haswell 1 and Haswell 2) and four featuring an Intel Skylake Xeon CPU (called Skylake 1 to Skylake 4). In all experiments, we set the operating frequency to the maximum nominal frequency of the respective CPU. We also disable the Intel Turbo Boost technology and the Intel P-state DVFS governor. All workstations run Ubuntu 16.04 LTS with Linux kernel version 4.10.0-38-generic. Table 4.2 outlines the characteristics of each workstation as well as the maximum nominal supply voltage for both architectures under maximum utilization.

Figure 4.3 illustrates the experimentally identified maxVmargin for the four Skylake and two Haswell CPUs, running 34 benchmarks. Since the SPEC2006 benchmarks are single-threaded, the respective margins for the fully utilized CPUs are determined by executing simultaneously four instances of the benchmark on each 4-core CPU. The maxVmargin spans from 17% to 24% and from 9% to 13% of the nominal Vn for the Skylake and Haswell microarchitectures, respectively. The difference between the min and max maxVmargin values (7% and 4% of Vn for Skylake and Haswell, respectively) is the workload dependent margin.

Notably, unlike previous studies on ARMv8 [95] and Itanium [4] CPUs that revealed intermediate voltage regions of unsafe operation where indications of erratic behavior may be observed, for the architectures investigated in this study the transition to unreliable operation when the voltage reduction is larger than maxVmargin is abrupt and always leads to crashes. Even in the few cases where SDCs or MCE errors were observed, these errors were accompanied by an immediate system or application crash.

We also observe margin variations across different cores of the same part (difference between the margins of the strongest and weakest core of each CPU). The maxVmargin variation when executing the same single-threaded benchmark on different cores can reach up to 45mV and 32mV for Skylake and Haswell, respectively. Also, in contrast to the findings of the characterization of ARMv8 and Itanium, the CPUs in this study do not exhibit the pattern of a consistently weakest and a consistently strongest core.

Parameters              Skylake Workstation   Haswell Workstation
CPU                     Xeon E3-1220 v5       Core i7-4790
Technology              14nm                  22nm
# of Cores              4                     4
CPU Base Freq.          3.00 GHz              3.60 GHz
CPU Max Turbo Freq.     3.50 GHz              4.00 GHz
Supply Voltage (Vdd)    1.15V                 1.07V
L1 D-Cache              128KB                 32KB
L1 I-Cache              128KB                 32KB
L2 Cache                1MB                   256KB
CPU LLC Cache           8 MB                  8 MB
CPU TDP                 80 W                  84 W
RAM Size                8 GB                  16 GB

TABLE 4.2: Characteristics of the workstations.



FIGURE 4.3: Evaluation of maxMargin settings for 34 benchmarks (10 runs each) in each workstation; the higher the bar, the wider the exploitable voltage margin. The horizontal dotted lines show the maximum (red) and minimum (black) values of maxMargin.



FIGURE 4.4: On the left we present the percentage of the total single core experiments in which the respective core was ranked as weakest or strongest. On the right we present the percentage of the total experiments in which the core configuration was considered the weakest.

Figure 4.4 (left) shows for each chip the percentage of single-core experiments in which each core was classified as strongest or weakest. In our case the strongest/weakest core varies across chips and even on the same chip across benchmarks. In most experiments, the exploitable margins on fully utilized CPUs for both families are narrower than the corresponding ones when a single core is utilized (with a few exceptions such as gromacs on the Skylake architecture). In Figure 4.4 (right) we compare the most conservative single core maxVmargin with the respective full utilization maxVmargin. Typically the full utilization configuration demonstrates narrower margins. This observation strengthens our approach of characterizing both single-core workloads and full CPU utilization ones.

Figure 4.5 shows the cumulative distribution function (CDF) of the average (across all configurations) failure probability of each CPU, as a function of the applied Vmargin. The Skylake family exhibits wider maxVmargin than the Haswell family, by 103mV on average.


FIGURE 4.5: Average (across all configurations) failure probability CDF for each CPU, with respect to the applied Vmargin


Note that lower slope CDF curves indicate a broader range of undervolting opportunities, depending on the characteristics of the workload and the resource pressure exercised. For example, Skylake 4 offers a maxVmargin range between 196mV and 271mV in which different benchmark configurations will run successfully. On the contrary, Haswell 2 has a narrower dynamic range (105 to 145mV), exhibiting a stepwise behavior. All 4 parts of the Skylake family have similar margins, with Skylake 1 being able to operate at lower Vu values than the rest.

4.4.3 Performance Counter Profiling

Figure 4.6 summarizes the methodology we followed to quantify the utilization and pressure that an application-core configuration exerts on the microarchitectural components. This quantification is performed through a set of performance metrics observable through the respective Performance Monitoring Unit (PMU) counters.

Each application (Ax) is described by two collections of profiling information, one for single core execution (Cs) and one for the full utilization execution (Cf). Each collection consists of several samples (Si), and each sample consists of several performance metrics (mj), also called features. The total number of samples depends on the execution time of the application-core configuration, as we sample the performance metrics every 100ms using the Linux Perf tools [101]. Finally, the number of performance metrics collected for each sample depends on the architecture. We profile 84 and 79 performance metrics for Skylake and Haswell respectively, which are the ones also used in Intel's Top-down Microarchitecture Analysis Method (TMAM) [83]. Since only up to 8 performance counters per core can be monitored simultaneously, to collect data for all respective counters we perform multiple executions for each benchmark configuration, and in each execution we record a subset of performance counters, until all counters are covered.
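Because only up to 8 counters can be programmed at once, the metric list is simply split into groups of at most 8 and the same benchmark configuration is profiled once per group; a trivial sketch of the grouping (with placeholder metric names) is shown below:

    def event_groups(metrics, max_per_run=8):
        """Split a list of performance metrics into per-run groups of at most 8."""
        return [metrics[i:i + max_per_run] for i in range(0, len(metrics), max_per_run)]

    # Example: the 84 Skylake metrics require ceil(84 / 8) = 11 profiling runs.
    groups = event_groups(["metric_%d" % k for k in range(84)])
    assert len(groups) == 11 and all(len(g) <= 8 for g in groups)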


FIGURE 4.6: Profiling performance metrics



FIGURE 4.7: Methodology Used during the model training phase

4.5 Modeling phase

The modeling phase consists of several steps commonly used in machine learning approaches, presented in Figure 4.7. Initially we combine the observations of the offline phases (maxVmargin characterization and profiling). Then the applications are split into a training, a validation and a testing set. Splitting the applications into sets allows our methodology to assess how the results of our offline phase will generalize to an independent/unseen application. The training and the validation sets are used during the model training phase, while the testing set is used during the evaluation of the dynamic voltage scaling governor. Finally, we perform an iterative methodology to train our prediction model. The iterative procedure produces multiple models and selects 2 models for each chip, one for single core execution and one for multi core executions. Both selected models are the ones that minimize the RMSE.

4.5.1 Combine Offline Data

The results of the maxVmargin characterization (Section 4.4.2) demonstrate that the maxVmargin, besides being dependent on the workload, also depends on the resilience of the specific cores (inter-core variation) and on the degree of CPU utilization. Moreover, in a realistic scenario, the operating system may, independently of our mechanisms, modify the topology of active cores via thread migration. Our model should provide a safe setting irrespective of workload mapping to cores. To capture these variations, we construct two models for each CPU part.

Single Core Model: The model covers the case of single core execution and is trained using the maxVmargin of the weakest core for each benchmark. In other words, we map the profiling collection Cs for each application (Ai) to the minimum observed min(maxVmargin,i,x) | x ∈ [0, n) of single core executions. This design decision misses opportunities for more aggressive undervolting, as we always respect the weakest core for each application. In any case though, the penalty is on average 5mV and therefore negligible.



FIGURE 4.8: The number of dispatched uops in port 1 during the execution time of an application.

Full Utilization Model: The model covers the case where all cores are occupied (full utilization). For each application Ai we use the maxVmargin,i,n and the profiling collection Cf.

Although we create two models for each part, the methodology used is the same for all parts and all models.
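For illustration, a minimal sketch of this data-combination step (made-up values, not the thesis tooling) is shown below: the single core model pairs each application's profile with the margin of its weakest core, while the full utilization model uses the all-cores configuration directly.

    def single_core_targets(max_vmargin):
        """max_vmargin maps (application, core_config) -> margin in mV; the core
        configurations 0..n-1 are the single core runs of Section 4.4.1."""
        apps = {app for app, _ in max_vmargin}
        return {app: min(mv for (a, _), mv in max_vmargin.items() if a == app)
                for app in apps}

    # Example for one application on a 4-core CPU (made-up numbers):
    targets = single_core_targets({("hmmer", 0): 250, ("hmmer", 1): 245,
                                   ("hmmer", 2): 255, ("hmmer", 3): 248})
    assert targets["hmmer"] == 245   # the weakest core defines the training target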

4.5.2 Data Splitting

Most applications exhibit different execution phases in terms of CPU resource pressure and performance characteristics. For example, Figure 4.8 shows the number of uops dispatched for execution in port 1 for the first 30 secs of execution for two applications with a single phase (swaptions, hmmer) and one application with multiple phases (facesim). However, our offline characterization determines the single worst-case maxVmargin across all execution phases of an application. This can negatively affect the effectiveness of training our model, as it will overgeneralize, trying to correlate wildly varying performance counter patterns with the same maxVmargin.

To provision for such cases, we bias the training input set to include mainly applications with few execution phases, such as hmmer and swaptions. We first rank the 34 benchmarks according to the number of phases they exhibit, normalized to their execution time. Phase change detection is performed by monitoring large changes (more than 10%) on any of the Level-1 performance metrics of TMAM [83]. The smaller the number of phase changes, the higher the ranking of the application. Then, we select the top 90% (most stable) applications for training and validation; 90% of the selected applications are used for training and 10% for validation. The remaining 10% of applications (the ones with the largest number of phase changes) serve as the testing (evaluation) set. The validation set includes bodytrack, freqmine, gcc, and the testing set includes facesim, zeusmp, fluidanimate, stress_ng.


4.5.3 Model fitting

To build an accurate model, several parameters need to be identified:

i. Number of features: how many features the model should use.

ii. Feature selection algorithm: which feature selection algorithm to use, since different algorithms rank features in a different order.

iii. Model type: what kind of model to use, for example linear models, decision trees, random forests, etc.

iv. Hyper-parameters: what hyper-parameters to use for this model. In Machine Learning (ML) techniques, a hyper-parameter is a parameter whose value is set before the learning process begins. Different model types require different hyper-parameters; for example, simple algorithms such as ordinary least squares regression require none, whereas Ridge [42] requires a regularization hyper-parameter.

We performed an exhaustive search over this four-dimensional space to discover the combination which minimizes the Root Mean Square Error (RMSE) on the validation set. Due to the limitations of the Intel PMU, at execution time we are limited to concurrently measuring up to 8 PMU events per core; hence we limit the search algorithm to test up to 8 features. We use all the feature selection algorithms provided by the Scikit-learn [100] ML library. Moreover, we tested several different classification and regression supervised learning algorithms such as linear regression, nearest neighbour, Support Vector Machines (SVM), decision trees, and ensemble methods. Finally, depending on the supervised learning algorithm, we searched for different hyper-parameters. Below we present the modeling parameters which minimize the RMSE, as identified by the exhaustive search.

Feature Number

To identify the optimal number of features we start from one feature and test up to eight features, as the PMUs of our architectures can measure up to eight performance counters simultaneously. In the end, the optimal number of features was estimated to be the maximum (8).

Feature Selection Algorithm

The optimal feature selection algorithm was mutual information (MI), an algorithm commonly used for feature selection in machine learning [69]. It ranks different features by assigning weights, so that the higher the weight, the more important the feature is for modeling. The algorithm assigned the highest importance to the metrics listed in Table 4.3, in decreasing order of importance. Note that the highest ranked metrics essentially characterize the instruction mix of the workload.


Skylake (in decreasing order of importance):
UOPS_DISPATCHED.PORT_0: Uops dispatched for execution in port 0 (Port 0 is responsible for Int., FP, vector ALU, mult, div and branch operations).
UOPS_DISPATCHED.PORT_4: Uops dispatched for execution in port 4 (Port 4 is responsible for Store operations).
UOPS_DISPATCHED.PORT_1: Uops dispatched for execution in port 1 (Port 1 is responsible for Int., FP and vector ALU operations).
UOPS_DISPATCHED.PORT_5: Uops dispatched for execution in port 5 (Port 5 is responsible for Int. and vector ALU operations).
UOPS_DISPATCHED.PORT_2: Uops dispatched for execution in port 2 (Port 2 is responsible for Load operations).
EXE_ACTIVITY.PORTS: Cycles for which one uop began execution on any port, and the Reservation Station was not empty.
UOPS_EXECUTED.THREAD: Number of Uops executed by this hardware thread.
MEM_UOPS.ALL_STORES: Number of store operations retired.

Haswell (in decreasing order of importance):
IDQ.ALL_DSB_CYCLES_4_UOPS: Cycles in which the Decode Stream Buffer (DSB) is delivering 4 Uops.
UOPS_EXECUTED.PORT_0: Uops dispatched for execution in port 0 (Port 0 is responsible for Int., FP, vector ALU, mult, div and branch operations).
UOPS_EXECUTED.PORT_1: Uops dispatched for execution in port 1 (Port 1 is responsible for Int., FP and vector ALU operations).
UOPS_EXECUTED.PORT_5: Uops dispatched for execution in port 5 (Port 5 is responsible for Int. and vector ALU operations).
UOPS_EXECUTED.PORT_2: Uops dispatched for execution in port 2 (Port 2 is responsible for Load operations).
UOPS_EXECUTED.PORT_6: Uops dispatched for execution in port 6 (Port 6 is responsible for Int. and branch operations).
UOPS_EXECUTED.PORT_3: Uops dispatched for execution in port 3 (Port 3 is responsible for Load operations).
MEM_UOPS.ALL_STORES: Number of store operations retired.

TABLE 4.3: Most influential performance metrics for Vmin, as ranked by the MI algorithm.

During the offline profiling phase, the performance metrics are collected through multiple executions for the same configuration of each experiment. After the model training procedure identifies all the parameters of the optimal model, we repeat the experiments so that the selected metrics (Table 4.3) are collected during the same execution, and we retrain the model. These data are normalized to take values between 0 and 1. As a normalizer, we use the sum of all available slots during the sampling period, which is equal to the number of pipeline slots (4 in our architectures) multiplied by the number of clock cycles (the respective counter does not fall under the limitation of 8 PMU events per core).


Supervised Learning Algorithm

The values obtained by offline characterization do not always exhibit a simple, monotonic behavior with respect to the obtained performance metrics. Consequently, linear regression models do not adequately capture the maxVmargin of the applications. Instead, to predict maxVmargin as a combination of the aforementioned metrics, we employ a machine learning ensemble technique called Random Forest Regression (RFR) [78]. A random forest is a collection of regression decision trees, each used to independently predict a value based on an input vector. The model predicts by averaging over the predictions of all regression trees.

Hyper-parameter selection

Typically, an RFR is defined by a predefined number of simple estimators (the decision trees) and by the maximum depth of each decision tree. Using a large number of estimators and/or using deep trees for prediction can both incur high performance penalties and result in overfitting. We detect overfitting by cross-validating the models. In the end, the models that minimized the RMSE for both the training and validation set consist of only three estimators, with the maximum depth of each estimator being equal to three.

In the end, our model resulted in an average RMSE of 7.01mV and 5.45mV for the Skylake and Haswell families, respectively. Generally, there is no single optimal model parameter; rather, it is the combination of parameters that minimizes the RMSE. For example, when using 8 performance counters with the MI feature selection algorithm, decision trees with a maximum depth equal to 8 resulted in overfitting.
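A condensed sketch of the selected configuration using Scikit-learn is given below (illustrative only; X_train/X_val are per-sample feature matrices and y_train/y_val the corresponding maxVmargin targets in mV, assumed to have been built from the combined offline data):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import SelectKBest, mutual_info_regression
    from sklearn.metrics import mean_squared_error

    def fit_margin_model(X_train, y_train, X_val, y_val, n_features=8):
        # Rank features with mutual information and keep the 8 most informative.
        selector = SelectKBest(mutual_info_regression, k=n_features)
        selector.fit(X_train, y_train)
        # Small random forest: 3 estimators, maximum depth 3 (Section 4.5.3).
        model = RandomForestRegressor(n_estimators=3, max_depth=3, random_state=0)
        model.fit(selector.transform(X_train), y_train)
        # Validation RMSE; this value is also reused as the safety margin (Section 4.5.4).
        pred = model.predict(selector.transform(X_val))
        rmse = float(np.sqrt(mean_squared_error(y_val, pred)))
        return selector, model, rmse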

4.5.4 Safety Margin

Over- or under-prediction is a common side-effect of many modeling approaches. In our case, over-predicting the maxVmargin would result in reducing the supply voltage below Vmin, leading to unreliable operation. Therefore, as a last step, we introduce a small safety margin to the estimated maxVmargin value. For each individual model, the safety margin is set equal to the RMSE between the values predicted for the validation data and the maxVmargin values observed during the offline characterization for the respective applications in the validation set. Equation 4.8 provides the final offset that is applied to the MSR registers of the CPU (where X denotes the input vector to the model).

appliedVmargin(X) = predictedVmargin(X) − safetyMargin    (4.8)

The safety margin controls the aggressiveness of our methodology. Using very small values would result in aggressive undervolting at the risk of reduced reliability, whereas too large safety margins would merely decrease the energy gains.



FIGURE 4.9: Prediction of our model with and without the safety margin, for samples in the validation data set.

A conservative, yet pessimistic, safety margin is the maximum error between the predicted values and the validation data. Instead, we use the RMSE as safety margin (7.01 and 5.45 mV for the Skylake and Haswell families, respectively) and trust the modeling procedure to correctly handle the outliers. Our approach is validated in the evaluation; no failures have been observed during application execution.

Figure 4.9 shows the predictions of our model for benchmarks in the validation set, running on the two Skylake processors, with and without the safety margin. The black line represents the maxVmargin values that would be predicted by a perfect model (corresponding to the observed Vmin). Predictions above the line correspond to application phases that can be executed at a Vdd lower than the conservative, application-wide Vmin which was obtained in the offline characterization. Such cases are discussed in Section 4.6. Including the safety margin reduces power efficiency, but enables safe operation.

4.6 Evaluation

The model introduced in the previous section can be used online, to enable fine- grained undervolting at runtime, according to the resource pressure quantified by performance counters samples and the cores utilized by the current workload. To this end, we use an extended dynamic voltage scaling governor (xDVS), which is invoked periodically (every 100 msec in our experiments). Upon invocation it feeds the model with the performance counter measurements collected during the previ-

ous interval, and uses the suggested maxVmargin to derive a less-than-nominal but 0 still safe supply voltage Vdd. In this section we evaluate the ability of the governor to dynamically identify voltage margins based on our prediction model and drive the CPU to a more energy



FIGURE 4.10: The bars show the average dynamic maxVmargin applied by xDVS for Skylake (left) and Haswell (right) workstations. The min-max bars represent the minimum and the maximum maxVmargin applied by xDVS. The gray diamond represents the maxVmargin as identified by offline characterization at the granularity of the whole application.


FIGURE 4.11: Timeline showing the applied maxVmargin for consecutive single core executions of four applications on Skylake 2 (left), and a snapshot of full system utilization execution for the gromacs application on Skylake 4 (right).

We quantify the resulting energy gains using the benchmarks in the test set (stress_ng, zeusmp, fluidanimate, facesim), which have not been used during model training and validation, and we compare the energy gains of xDVS against the Intel P-state DVFS governor. In addition, we evaluate xDVS with 4 larger-scale applications, namely Gem5 [10], a CPU miner [51], the compilation of the Linux Kernel [80] and Polybench [102]. More specifically, Gem5 simulates an ARM processor using the system emulation mode. During the simulation we execute a variety of simple micro-kernels such as integer and floating point matrix multiplication, sorting algorithms, and combinatorial problem solving kernels. The CPU miner employs five different hashing algorithms (Bitcore, Sha256d, Xevan, Timetravel and Cryptonight [93]), all used to perform mining for different cryptocurrencies such as Litecoin and Bitcoin. Finally, we use multiple solvers and stencils included in the Polybench suite, such as Alternating Direction Implicit (ADI), Jacobi, LU factorization, Gram–Schmidt process, and Gauss–Seidel. Each one of these 4 larger-scale applications is executed for approximately 1 hour.

Figure 4.10 shows the average maxVmargin applied by xDVS when applications are executed on all cores (evaluating the full utilization model) or on the weakest core (evaluating the single core model). For the large-scale applications, for which there is no offline characterization, we present the average dynamic maxVmargin across all single core executions.



FIGURE 4.12: Timeline showing the applied maxVmargin while executing the large applications at full system utilization for the Skylake (left) and Haswell (right) workstations.


FIGURE 4.13: Energy gains of xDVS when compared with the Intel P-state governor for Skylake (left) and Haswell (right) CPUs. The grey horizontal lines represent the maxVmargin obtained by the offline characterization.

The maxVmargin applied by xDVS includes the extra safety margin, thus we expect it to be on average more pessimistic than the maxVmargin identified by offline characterization (as shown in Figure 4.10). The voltage applied by xDVS (including the safety margin) is on average 9.3mV and 8.2mV higher than the one identified by offline characterization for the Skylake and Haswell microarchitectures, respectively.

Despite the safety margin, there are cases in which xDVS successfully identifies application phases and adjusts the supply voltage to lower values than those identified in the offline characterization, without compromising system reliability. Note in Figure 4.10 that xDVS identifies wider static and dynamic (phase-dependent) voltage margins for the Skylake family, which is compatible with the findings of the offline characterization presented in Figure 4.5.

The left side of Figure 4.11 shows the dynamically applied maxVmargin when four benchmarks are scheduled for consecutive execution on the same core of Skylake 2. xDVS is able to capture the dynamic nature of the applications. In the facesim case (an iterative application that executes 3 separate kernels per iteration), the maxVmargin varies between 201mV and 233mV while the model captures the phases of the application. A similar behavior is observed for gromacs (Figure 4.11, right), in which xDVS captures the periodic phase changes and drives undervolting even more aggressively than what the static maxVmargin dictates. For brevity, we do not include graphs for the remaining CPUs, as they present similar trends.

Figure 4.12 shows the maxVmargin timeline when the four larger-scale applications are scheduled for execution on each workstation. The graphs reveal relatively large intra-family margin variations, as well as variations due to different workload characteristics. Note that the xDVS governor is able to capture the different algorithms consecutively used by the CPU miner application. Although the variation between different predictions is small, as observed in the offline characterization, the margin between correct execution and failure is small (less than 5mV). Therefore, even minimal adjustments to the supply voltage can have a noticeable impact on system reliability.

Figure 4.13 shows that the xDVS governor achieves 29.59% and 21.93% average CPU energy gains, and can reach up to 42.68% and 34.37%, for the Skylake and Haswell processors respectively, compared with the Intel P-state governor. Higher gains are, as expected, obtained when the cores of the CPU are highly utilized. For the energy consumption measurements we used the Linux Perf tool.

4.6.1 Mixed Workload Long Run Evaluation

To stress the reliable execution of xDVS as well as the ability of the model to predict safe margins for mixed workloads, we orchestrated a custom long-running execution. We created a workload pool comprising the testing set benchmarks, the four larger applications, as well as idle tasks. We randomly select a benchmark from this pool and randomly define the level of parallelism (either in the form of number of threads or instances). We keep spawning benchmarks this way until all cores are assigned one benchmark. Every time an application terminates, another application is spawned. We left this experiment running for 72 hours. All workstations successfully terminated their executions. Our xDVS governor performed on average 2.6 × 10^6 voltage regulations per workstation during that period.

Note that when xDVS is enabled, Intel's Turbo Boost technology is disabled and the operating frequency is pinned to the CPU maximum nominal frequency. This penalty, due to the lower operating CPU frequency, translates to an application execution slowdown of 8.73% and 5.59% on average for Skylake and Haswell, respectively. However, the actual performance overhead of the xDVS governor itself is minimal. When the governor is active and performs estimations every 100ms, the prediction requires only 160ns, while applying the new maxVmargin requires 155µs. On average, the performance overhead is equal to merely 0.04% of execution time.

4.7 Voltage Emergencies

Our technique is designed to predict Vmin voltage oscillations when executing conventional workloads on a multi-core CPU. However, aggressive voltage scaling comes at the cost of increased risk of CPU malfunctioning in pathological cases, when, for example, a voltage droop virus is injected (potentially maliciously) for execution. Such viruses (also called stressmarks) are artificially generated to produce large voltage fluctuations and expose the susceptibilities of the power delivery network of microprocessors [133, 66, 65]. Typically, stressmarks consist of a periodic sequence of high activity and low activity instruction regions at a frequency equal to the resonance frequency of the power delivery network (50−200MHz in modern CPUs). This pattern simulates the invocation of high CPU activity workloads immediately after a period of very low activity, which causes large di/dt swings and large voltage droops. Such viruses can affect the reliable operation of CPUs, even when the power supply is at nominal voltage levels.

Our methodology depends on the average behavior of the current workload, as quantified by the monitored performance counters. The experimental evaluation demonstrates that for typical workloads the performance counters can be used as indicators to aggressively reduce the supply voltage. However, the sampling of the performance counters and the supply voltage adjustment is performed every 100 ms, which is too coarse to capture events caused by viruses, which appear at the granularity of tens of nsecs. The problem of voltage droop detection and mitigation in modern CPUs and GPUs is (and should be) addressed at the hardware level with specialized circuitry [127, 116]. A high speed droop detection mechanism continuously monitors the power grid, causing a rapid charge shunt to the Vdd rail to correct a voltage emergency within a few clock cycles.

Our scheme makes the CPU more vulnerable to voltage emergencies and would probably require stricter thresholds and faster response times from such hardware mechanisms to counteract the effects of voltage droop viruses. Ideally, these mechanisms could also inform the software stack when voltage emergencies are detected, acting as a trigger towards more conservative undervolting. In general, stronger error protection and error recovery mechanisms to correct timing violations are very helpful for extra protection under aggressive voltage scaling. Note also that voltage droop viruses are difficult to generate and may differ not only across different microarchitectures but also across different CPU parts. Moreover, on top of a realistic software stack, the background operating system activity (jitter) smooths out large voltage swings caused by such viruses [106].


Chapter 5

Experimental Frameworks for Reliability Analysis

This dissertation seeks energy reduction by relaxing the correctness requirements of computer systems. Relaxing the hardware correctness requirements reduces the reliability of the system. A major challenge of this thesis was to understand under which circumstances errors manifest on the hardware and how different applications mask the respective errors. More precisely, we faced the following challenges:

• Understand how and under what circumstances modern CPU microarchitectures fail when executing code at reduced margins.

• Identify the voltage margins of the system and associate them with their respective energy gains.

• Evaluate the resilience of individual applications and the complete software stack.

5.1 Contributions

To overcome these challenges we developed two fault injection frameworks which were used throughout this thesis to conduct the experimentation. Below we present the contributions of this chapter:

GemFI: A simulation-based fault injection tool that extends the popular Gem5 simulator. The primary objective of the tool is to enable fault injection based on different fault models and on systems with various configurations. A variety of different system configurations and architectures can be supported without affecting the implementation of fault injection in GemFI.

XM2: A framework that monitors and manages the operation of systems when their CPU is either overclocked or undervolted. It can be effectively used to identify the margins of a system, or study the effect of real faults on the application and associate the undervolting/overclocking with the respective energy gains.


• We use XM2 to analyze the effects of source code and compiler optimizations on the frequency margins of the ARM Cortex A53 processor.

In the remainder of the chapter we describe and compare these two frameworks and present the analysis performed on the ARM Cortex A53 processor.

5.2 GemFI: Fault Injection Tool for Studying the Behavior of Applications on Unreliable Substrates

5.2.1 The Gem5 Simulator

Gem5 is a popular open-source system simulator. It provides a modular platform for computer system-level architecture research, encompassing system-level architecture as well as processor micro-architecture. Object-oriented design enhances the flexibility of Gem5. The ability to construct configurations from independent objects facilitates multicore and multi-system design. Moreover, Gem5 provides four different CPU models, each of them representing a different point in the speed vs simulation accuracy trade-off. Atomic Simple is a single-IPC CPU model. Timing Simple is similar but also simulates the timing of memory references. InOrder is a pipelined in-order CPU. Finally, O3 is a pipelined out-of-order CPU model. Gem5 also supports two memory system models: classic and ruby. The classic model is fast and easily configurable, while the ruby model provides a flexible infrastructure capable of accurately simulating a wide variety of cache coherence memory systems.

Gem5 operates in two modes: System Call Emulation (SE) and Full System (FS). In SE mode applications execute on simulated "bare metal". Whenever the program executes a system call, Gem5 traps and emulates the call, usually by passing it to the host OS. Currently there is no thread scheduler in SE mode. Therefore, threads are statically mapped to a core, hindering its use with multi-threaded applications. FS mode offers an environment for running an operating system (OS) on top of the simulator. There is support for interrupts, exceptions and I/O devices. Applications are executed under the control of the OS. Gem5 supports a number of ISAs, including Alpha, MIPS, ARM, Power, SPARC and x86. The simulator's modularity allows these different ISAs to be easily implemented on top of the generic CPU models and the memory system.

5.2.2 GemFI Design and Implementation

We extended Gem5 with fault injection capabilities, following the General Processor fault model described in [147]. The result, GemFI, is a configurable tool for studying the effect of faults in a processor. GemFI was developed using C++ and Python. It fully supports the Alpha and Intel x86 ISAs. Supporting more instruction sets is rather straightforward, since the implementation of GemFI is fairly ISA-agnostic. GemFI supports full system simulation mode as well as the execution of multi-threaded applications. An architectural overview of GemFI is depicted in Fig. 5.1, whereas the following sections discuss its main features in more detail.

FIGURE 5.1: An architectural overview of GemFI. The red components of the architecture demonstrate the possible locations where faults can be injected, whereas the red ovals represent applications which use the extended ISA.

GemFI User Interface

GemFI provides an API consisting of two intrinsic functions.

• void fi_activate_inst(int id) is translated to a pseudo-assembly instruction. Its successive occurrences toggle (activate/deactivate) the manifestation of faults for the specific process/thread. The executing thread is assigned a numerical id which can be used as an identifier of the thread in the fault injection configuration.


• void fi_read_init_all() checkpoints the simulation. Upon restoring from the checkpoint, it resets all the internal information of GemFI, allowing the same checkpoint to be used as a starting point for multiple experiments with potentially different fault injection configurations.

On GemFI invocation the user also provides – at command line – an input file specifying the faults to be injected in the upcoming simulation. Each line of the input file describes the attributes of a single fault. Faults are characterized by four attributes: Location, Thread, Time and Behavior.

Location: Fault location specifies the micro-architectural modules to be targeted for fault injection. The user specifies the core, the module within the core and finally the specific bit location to be corrupted. Supported locations include registers (integer, floating point, special purpose), the fetched instruction, the selection of read/write registers during the decoding stage, the result of an instruction at the execution stage, the PC address and finally memory transactions (loads/stores).

Thread: The thread attribute allows faults to be selectively injected into specific threads, using the id assigned to the thread upon execution of fi_activate_inst(id) as an identifier.

Time: Another important aspect of the fault injection configuration is its timing. Timing is relative to a simulation milestone, marked by the execution of fi_activate_inst. Faults are scheduled relative to the number of instructions already executed, or to the number of elapsed simulation ticks of the targeted thread.

Behavior: The value of the specified faulty location can be corrupted in the following ways:

• by assigning an immediate value provided by the user to the location;
• by XORing the running value at this location with a user-specified constant;
• by flipping the running value at bit locations; multiple bit flips are supported by injecting multiple faults on the same module;
• by setting all bits of the location to 0 or 1.

To emulate the behavior of transient and permanent faults, the user can define how long the fault is active in terms of the number of simulation ticks or the number of instructions. For example, a fault injected in the execution stage of the processor can be injected continuously for the next N instructions (or for the next N simulation cycles) if so instructed by the user.
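As an illustration of the corruption behaviors listed above, the following sketch shows how a 64-bit value read from the targeted location could be modified before being written back. The code is illustrative only; the names and structure do not reproduce the actual GemFI implementation.

    #include <stdint.h>

    /* Illustrative sketch of the four corruption modes. 'operand' is the
     * user-supplied immediate, XOR constant, or bit position, depending on mode. */
    typedef enum { FI_SET_VALUE, FI_XOR, FI_FLIP_BIT, FI_SET_ALL } fi_mode_t;

    static uint64_t corrupt(uint64_t value, fi_mode_t mode, uint64_t operand)
    {
        switch (mode) {
        case FI_SET_VALUE: return operand;                   /* overwrite with immediate */
        case FI_XOR:       return value ^ operand;           /* XOR with user constant   */
        case FI_FLIP_BIT:  return value ^ (1ULL << operand); /* flip a single bit        */
        case FI_SET_ALL:   return operand ? ~0ULL : 0ULL;    /* all bits to 1 or to 0    */
        }
        return value;                                        /* unknown mode: unchanged  */
    }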


1 "RegisterInjectedFault Inst:2457 Flip:21 2 Threadid:0 system.cpu1 occ:1 int 1"

LISTING 5.1: A sample input file to GemFI

#include
int main(int argc, char *argv[]){
    int id = 0;
    initialize_input_data();
    fi_read_init_all();
    fi_activate_inst(id);
    foo();
    fi_activate_inst(id);
}

LISTING 5.2: Modified source code of an application for fault injection.

5.2.3 Simple Example

Listing 5.1 outlines a user-provided fault configuration example. The fault is injected in the 21st bit of register R1 of the CPU (location), when the application fetches the 2457th instruction after the initiation of fault injection for this thread (fi_activate_inst). The fault is activated for a single instruction (occ:1) and only for the thread with id equal to 0.

The end user compiles (or cross-compiles) the application to be tested (Listing 5.2). Target applications must contain at least one call to initialize fault injection. Afterwards, the user moves the binaries into the disk image serving as the virtual disk of GemFI. Using the command line, the user provides a configuration file (Listing 5.1) describing all the faults to be injected in the simulation. After fi_activate_inst(id) is called, the thread identifier is stored in the internal data structures of GemFI. Simulation continues normally, until it is time for a fault to be injected. At that time, GemFI alters the state of the target hardware structure according to the fault specification in the configuration file.

5.2.4 GemFI Internals and Implementation

FIGURE 5.2: GemFI functionality on each simulated instruction.

Fig. 5.2 demonstrates the steps executed by GemFI for each simulated instruction. Threads that have enabled fault injection are internally represented as instances of a class (ThreadEnabledFault), containing all per-thread information necessary for fault injection, such as the number of instructions the thread has executed on each core. Each simulated core has a pointer to a ThreadEnabledFault object. If the thread executing on the core has not activated fault injection, the pointer is NULL. When a thread executes fi_activate_inst(), GemFI looks in a hash table to identify whether the specific thread has already activated fault injection. Threads are identified at the hardware/simulator level by their unique Process Control Block (PCB) address. If the thread is not found in the hash table, a new ThreadEnabledFault object is created and the running core is set to point to that object. On the other hand, if there was already an entry in the hash table, the invocation of fi_activate_inst() deactivates fault injection for the specific thread. The thread is removed from the hash table, the corresponding ThreadEnabledFault object is destroyed and the core's pointer is set to NULL. During context switches, which are identified by the change of the PCB address, GemFI checks whether the switched-in thread has activated fault injection, in order to properly set the core's pointer to the thread's ThreadEnabledFault object. Monitoring context switches allows GemFI to eliminate the overhead of checking the fault injection status of the executing thread in the hash table on each simulated clock tick.

Faults are described in the input file provided by the user at the GemFI command line. The file is parsed at startup and each fault is inserted into one of five internal queues. Each queue corresponds to a different pipeline stage. On each simulation tick, GemFI checks if fault injection has been enabled for the running thread. In such a case, it prefetches the corresponding ThreadEnabledFault objects. Then, for each instruction served at a pipeline stage, GemFI updates the thread's data and scans the corresponding queue for faults targeting the executing thread at the specific simulation point. Queue entries are sorted according to the timing of each fault. If such a fault is found, the value of the targeted location is corrupted according to the fault's behavior.
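The per-thread bookkeeping described above can be summarized by the following simplified sketch. The structures and names are illustrative and do not reproduce the actual GemFI classes.

    #include <stdint.h>

    #define NUM_CORES 4             /* illustrative value */

    /* Per-thread fault injection state, created on the first fi_activate_inst(). */
    typedef struct ThreadEnabledFault {
        uint64_t pcb_addr;                    /* PCB address identifies the thread   */
        int      fi_id;                       /* id passed to fi_activate_inst(id)   */
        uint64_t insts_executed[NUM_CORES];   /* per-core executed instruction count */
    } ThreadEnabledFault;

    /* Each simulated core points to the state of the thread it currently runs,
     * or to NULL if that thread has not activated fault injection. On a context
     * switch (detected by a PCB address change), this pointer is refreshed from
     * a hash table keyed by pcb_addr, so the per-tick fast path only needs a
     * single pointer check. */
    typedef struct Core {
        ThreadEnabledFault *active_fi_thread;
    } Core;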

5.2.5 Simulation Checkpointing

Checkpointing allows saving the state of a process or a system at a specific time snapshot and reverting to it later, to restart the execution from that point if needed. Checkpointing is necessary in order to avoid losing simulations in case of unexpected failures. It is particularly useful when simulation campaigns are executed on non-dedicated networks of workstations, a feature supported by GemFI.


FIGURE 5.3: Simple checkpoint-restore mechanism to speed up simulation campaigns.

Gem5 provides checkpointing, however with limitations. One method is to switch the simulation from the O3 to the atomic simple mode, create the checkpoint, and revert back to O3 mode to continue the simulation. This requires a pipeline flush, presenting a potential loss of realism. The second method requires simulating the MOESI hammer cache coherency protocol, which however dramatically increases simulation time. We used DMTCP (Distributed MultiThreaded Checkpointing) [2] to checkpoint the state of the Linux process running the simulator, instead of checkpointing the internal state of the simulator. A feature of DMTCP is its ability to take checkpoints either by programmatically invoking checkpointing from within the process to be checkpointed, or asynchronously, by setting environment variables. The ability to invoke DMTCP from within the simulator allows us to exploit the front-end checkpointing mechanism of Gem5, while altering the checkpointing back-end to use the DMTCP API.

Apart from protecting against unexpected problems in simulation campaigns, checkpointing can be used to speed up simulations. Before starting simulation campaigns, the user executes one simulation up to the point when fault injection is activated (including booting of the operating system and application initialization). Using GemFI's API the user can checkpoint the simulation at this point. The saved state is then used as a starting point for all experiments in the campaign (Fig. 5.3). Upon restoring a checkpoint, GemFI parses the fault configuration file again. Therefore, this strategy allows fast-forwarding of the execution to the checkpoint and spawning of multiple experiments, with different fault injection configurations, from that point on. As a result, the cumulative execution time of the simulation campaign is significantly reduced, as we demonstrate in Sec. 5.2.8.
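A minimal sketch of how a process can request a checkpoint programmatically is shown below. It assumes DMTCP's application-level API (dmtcp.h); the exact function names, constants and return-value semantics are assumptions and should be verified against the installed DMTCP version.

    #include <stdio.h>
    #include <dmtcp.h>   /* assumed DMTCP application-level API header */

    static void checkpoint_now(void)
    {
        if (!dmtcp_is_enabled()) {          /* process not started under DMTCP */
            fprintf(stderr, "DMTCP not present; skipping checkpoint\n");
            return;
        }
        /* Blocks until the checkpoint image has been written. After a later
         * restart, execution resumes from this call. */
        int rc = dmtcp_checkpoint();
        if (rc == DMTCP_AFTER_RESTART)
            fprintf(stderr, "resumed from a restored checkpoint\n");
    }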

5.2.6 Simulation Campaigns on a Network Of Workstations

GemFI is accompanied by a set of shell scripts which facilitate launching simulation campaigns on a network of workstations (NoW). The workstations need to share a network file-system, in order to store the fault description files of the experiments, the simulation checkpoints and the output of each simulation. The main steps for parallel execution of simulation campaigns on a NoW are the following:

1. The configuration files for all experiments are stored on a network share.

2. A simulation is executed up to the point where fault injection is activated and the simulator process is checkpointed. The checkpoint is stored on the share.

3. Each workstation gets a local copy of the checkpoint.

4. Each workstation checks the share for experiments to be executed. It selects one of the remaining experiments and executes it locally, starting from the checkpointed state (one possible claiming mechanism is sketched after this list).

5. Simulation results are moved from the workstation back to the network share.

6. Steps 4 and 5 are repeated until there are no experiments left.
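One simple way to implement the selection in step 4 so that two workstations never pick the same experiment is to claim experiments through atomic lock-file creation on the shared filesystem. This is only a sketch of the idea, not the mechanism used by the actual scripts.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Returns 1 if this workstation successfully claimed experiment 'exp_id',
     * 0 if another workstation already owns it. O_CREAT|O_EXCL makes the
     * creation of the lock file atomic on the shared filesystem. */
    int claim_experiment(const char *share_dir, int exp_id)
    {
        char lock_path[512];
        snprintf(lock_path, sizeof(lock_path), "%s/exp_%04d.lock", share_dir, exp_id);
        int fd = open(lock_path, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0)
            return 0;
        close(fd);
        return 1;   /* claimed: run this experiment locally from the checkpoint */
    }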

5.2.7 Validation

In order to validate the functional correctness of GemFI, we conducted an experimental study using a set of benchmark applications. Our simulated system was configured as a single-core Alpha CPU coupled with a tournament branch predictor, an L1 instruction cache, an L1 data cache and a unified L2 cache. DCT is a kernel of JPEG image compression and decompression [129]. We applied the kernel on a gray-scale 512x512 image. Jacobi is applied on a diagonally dominant 64x64 matrix. Monte Carlo PI estimates the value of PI by randomly selecting 10^5 points within a unit square and evaluating whether they fall into the circle of radius one inscribed in it. Knapsack is a solution of the zero-one knapsack combinatorial problem using a genetic algorithm. We use an input of 24 items and a weight limit of 500. The Deblocking filter is a kernel of the AVS video decoding process [30]. We apply it on a 720x240 pixel image. Canneal is a benchmark of the PARSEC Benchmark Suite [9]. Canneal employs a simulated annealing (SA) algorithm to minimize the routing cost of a chip design by randomly swapping netlist elements. It was applied on 100 nets, allowing up to 100 swaps in each step. The number of executions of each application for every experiment varied from 2501 to 2504 and has been calculated using the method presented in [75], setting 99% as the target confidence level and 1% as the error margin.

The execution of each application was simulated both with our tool and the original Gem5 simulator. When simulating with GemFI we did not inject any faults. We then compared the application output from the two experiments, as well as the statistical results provided by the simulator. For all benchmarks the results were identical. This indicates that GemFI does not corrupt the simulation process.
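For completeness, a commonly used finite-population sample-size bound for statistical fault injection campaigns is shown below; we assume here that the method of [75] follows this standard form, and quote it only to make the 99% confidence / 1% error-margin setting concrete:

    n = \frac{N}{1 + e^{2}\,\frac{N-1}{t^{2}\,p\,(1-p)}}

where N is the size of the fault population, e the error margin, t the cut-off point of the normal distribution for the chosen confidence level, and p the estimated fault-manifestation probability (p = 0.5 in the worst case).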


FIGURE 5.4: Different categories of results for the DCT benchmark: (a) a strictly correct result, (b) a relaxed correct result, (c) an SDC, (d) the difference between (a) and (b) (loss of quality).

Validation Methodology

We launched simulation campaigns in which applications are injected with faults. We use a single event upset fault model. Each experiment injects a bit-flip fault, using a uniform distribution for the Location, Time and Behavior. As mentioned earlier, GemFI can support any user-provided realistic fault model. We initially checkpoint after the system boot-up and the initialization phase of the application under investigation. For each experiment in a campaign, we restore from the checkpoint, start simulating in O3 mode and inject the fault. The simulation continues until the affected instruction commits or squashes (for example, due to a branch misprediction). At that point we switch to atomic simulation and, after application termination (normal or crash), we evaluate the quality of the end result. When injecting a fault we print information on the affected assembly instruction. This information is used postmortem to correlate, either analytically or statistically, the fault with the simulation result.

The outcome of each experiment can be classified in the following categories: crashed, non propagated, strictly correct result, correct result and SDC (Silent Data Corruption). Crashed are experiments which fail to successfully terminate. Non propagated are experiments in which faults did not manifest as errors (for example, they were inserted in registers, but the corrupted register was either not used during the execution of the application or overwritten before the erroneous value was used).


Integer Operate:           Opcode | Ra | Rb | Unused | 0 | Function | Rc
Integer Operate, Literal:  Opcode | Ra | Literal | 1 | Function | Rc
Floating Point Operate:    Opcode | Ra | Rb | Function | Rc
Memory Format:             Opcode | Ra | Rb | Displacement
Branch Format:             Opcode | Ra | Displacement
CALL_PAL Format:           Opcode | Function

TABLE 5.1: Alpha instruction formats (fields listed from bit 31 down to bit 0)

Strictly correct experiments produce results which are bit-wise identical to those produced by the corresponding error-less execution. Correct experiments produce results that are within acceptable quality margins, although not bit-wise identical to those of the error-less execution. The degree of tolerance is application dependent. For DCT we compare the produced compressed image with the uncompressed one used as input. Images with PSNR higher than 30 dB are regarded as correct, since typical PSNR values in lossy image and video compression range between 30 and 50 dB [143]. For the deblocking filter, outputs with PSNR higher than 80 dB, when compared with the error-free execution, are characterized as correct [143]. For PI estimation we accept experiments that have computed the first two decimal points correctly, since this is the accuracy expected by the error-free execution for the 10^5 test points. Since the tolerance of Jacobi is highly dependent on the application domain, we characterize as correct solutions that result in the same (bit-exact) output as the golden model, converging after a potentially different number of iterations. Correct Canneal executions are those that reduce the total cost of routing and produce a correct chip. Finally, SDCs are executions that terminate normally, yet produce results outside the acceptable range compared to the results of the error-free execution. Fig. 5.4 depicts an example of the different classes of results.
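For reference, the PSNR between the golden output x and a test output y with N samples and maximum representable value MAX (255 for 8-bit images) is computed as:

    \mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(x_i - y_i)^2, \qquad
    \mathrm{PSNR} = 10\,\log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)\ \text{dB}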

Experimental Results

Fig. 5.5 depicts the results of the fault injection campaigns, correlating the Location of the fault with application behavior. The last column of each chart summarizes the results for the specific application. All applications demonstrate their highest resiliency to faults targeting floating point registers. Most applications use a small subset of these registers, hence there is a low probability for a fault to affect a live register. Moreover, floating point registers are typically used to store data and not system state or control flow information. Deblocking, a benchmark with no floating point operations, behaves exactly as expected, demonstrating 100% strict correctness.

On the other hand, faults on the integer register file result in higher crash rates. The compiler uses integer registers for storing important information (global pointer, stack pointer, frame pointer, return address register). Moreover, the compiler uses integer registers for control flow information (loop iterators, base addresses for memory translation). The integrity of these registers is crucial. Integer registers tend to be live during large spans of the application lifetime. Therefore, any fault affecting them has a high probability of causing a crash.


FIGURE 5.5: Application behavior when fault injecting different architectural components (panels: DCT, Jacobi, Deblocking, Knapsack, Monte Carlo PI, Canneal).

For example, DCT and Jacobi, which are characterized by many memory accesses and use multi-level loop nests, exhibit almost twice the crash rate compared with the other applications.

In order to validate fault injection at the fetch stage, we correlated the affected bit location and the instruction type with the end result of the application. The analysis is ISA dependent; Table 5.1 summarizes the Alpha instruction formats. Experiments affecting unused bits always resulted in strictly correct results. Faults affecting branch instructions were validated by checking the simulation statistical information. For example, when inserting a fault into the displacement bits of the instruction and the branch is not taken, the simulation statistics were the same and the end result was categorized as strictly correct. Faults affecting the Ra field may cause no error, should the result of the branch remain the same. Whenever faults altered the displacement field of memory instructions, the application would crash with high probability.


The same was observed when the error altered the Ra value of a memory instruction, since the base address was then read from another register. Finally, we observed that, exactly as expected, when faults were injected into the opcode or the function field and the resulting opcode/function is not implemented, the benchmarks always terminated their execution due to an illegal instruction.

A similar analysis was applied for faults inserted in the selection of registers during the decoding stage. Errors which affect the selection of the base register of load/store instructions usually cause a segmentation fault. An interesting observation is that faults inserted in the decoding stage of the PI algorithm result in crashes at almost half the probability compared with the remaining applications, because PI performs almost no data accesses from memory. Errors in the decoding stage usually lead to SDCs. This is expected, since operations are executed with different inputs. Correct results may be produced only by faults which alter a squashed instruction, or due to inherent, algorithmic application resiliency.

Faults introduced in the execution stage which alter memory access instructions tended to result in crashes, because at this stage the virtual address of the memory transfer is calculated. Faults altering the resulting address usually lead to segmentation violations. The variation of the percentage of crashes among different applications is consistent with the variation of the percentage of memory operations in the instruction mix. In Knapsack, which makes heavy use of arrays and pointers, 42% of faults in the execution stage result in crashes. On the other hand, PI estimation, with almost no data accesses from memory, suffers almost no crashes. Correct and strictly correct results when fault injecting in the execution stage were found to be due to faults that were masked during the remaining execution of the application, or faults that affected the less significant bits of data computations. Faults altering the result of data loads/stores rarely resulted in crashes, and when they did it was because the error affected a store/load of an address. For example, altering the stored or loaded value of the return address usually led to a crash. Finally, faults altering the value of the PC address were almost always fatal for the affected applications. Correct results were obtained in the few cases when the corrupted PC address was close to the correct one. This, in practice, corresponds to a small forward or backward jump.

Another interesting aspect of the experimental validation is the correlation of the timing of fault injection with the effects on the application. Fig. 5.6 depicts the results from three fault injection campaigns with interesting trends. The horizontal axis corresponds to the timing of fault injection normalized to the application execution time and the vertical axis corresponds to the fraction of experiments that resulted in each of the classes of outcomes. Acceptable represents the union of correct and strictly correct results. For Monte Carlo PI estimation, the time when fault injection took place appears to be uncorrelated with application behavior.


FIGURE 5.6: Correlation of the timing of fault injection with the effect on the application (panels: Monte Carlo PI, Knapsack, Jacobi).

This is reasonable, since the application iteratively produces random numbers, which are used to compute the final result. All iterations affect the final result similarly, therefore we did not expect different behavior with respect to the timing of the faults. On the other hand, Knapsack demonstrates a different behavior. The later the faults are injected, the more likely the results are acceptable. Faults corrupting data in a manner that does not result in values which converge towards the solution will be discarded on the following iteration, after applying the fitness function. This effect becomes more intense on each consecutive iteration of the algorithm. In Jacobi, faults inserted at the beginning of the execution tend to result in strict correctness. The later the faults are injected, the more correct results appear at the expense of strictly correct ones. Given that the input matrix is diagonally dominant, errors which do not alter important variables of the application (e.g., iterators) but alter input or intermediate data will have no significant effect on the results, since the algorithm is bound to converge. However, more iterations may be needed to achieve convergence.

5.2.8 GemFI Performance Evaluation

In order to evaluate the overhead of GemFI we compare the execution time for simulated runs of all the aforementioned benchmarks on both GemFI and the unmodified Gem5 simulator. We measure and compare simulation time for the part of the application for which fault injection is active (between fi_activate_inst() calls). Despite activating the fault injection functionality, in this set of experiments we do not actually inject any faults in the GemFI simulations. Should we inject faults, the behavior of applications would potentially change, thus making the comparison between the two tools infeasible. It should be noted that, despite the fact that no faults are injected, all GemFI functionality is activated – especially the modules of GemFI that are executed on each simulated cycle, which are responsible for most of the overhead – apart from the last step of the process described in Fig. 5.2, the fault injection itself. However, the actual fault injection step would, in any case, be activated only once, with negligible overhead. Moreover, since no faults are injected, there are no opportunities to switch to the atomic simple mode after fault manifestation, therefore the simulation is performed in the high-overhead O3 CPU model.


FIGURE 5.7: GemFI average overhead compared with unmodified Gem5. The chart also depicts the 95% confidence interval for each application.

Fig. 5.7 depicts the experimental results, which can be considered as a worst-case overhead scenario for GemFI. The overhead varies from -0.1% to 3.3%. It is mainly dependent on the number of instructions simulated with fault injection enabled. The overhead introduced by GemFI is clearly minimal. For PI estimation GemFI appears to perform better than Gem5, however this observation is not statistically significant.

Using the checkpointing methodology presented in Sec. 5.2.5, GemFI is able to significantly reduce the time for executing simulation campaigns. Fig. 5.8 summarizes the simulation time for the campaigns discussed in Sec. 5.2.7, with and without using the checkpointing capability to fast-forward the simulation to the point where fault injection is activated. The benefit from checkpointing is a 3x to 244x (64.5x on average) speedup with respect to the non fast-forwarded execution of the campaign. The speedup is mainly dependent on the ratio of the execution time spent for each application on the pre- and post-checkpoint code.

FIGURE 5.8: Effect of GemFI optimizations on the execution time of fault injection campaigns (y-axis in logarithmic scale).

The third set of bars in Fig. 5.8 depicts the execution time of the simulation campaigns on a network of 27 workstations, using the meta-simulation infrastructure discussed in Sec. 5.2.6. Each workstation is equipped with a quad core Intel Xeon E5520 CPU at 2.27 GHz and 8 GB RAM. On each workstation we execute 4 experiments (simulations) simultaneously. The additional speedup, compared with execution on a single system with checkpoint-based fast forwarding, is consistent with the number of simultaneously executed experiments (in all cases approximately 108x).


5.3 XM2: A Framework for Evaluating Software on Reduced Margins Hardware

XM2 adopts a reusable, platform-neutral approach that can scale to simultaneous multi-board, multi-CPU and multi-process execution campaigns, with or without operating system support. The framework supports user-defined methods for the collection and classification of the different execution outcomes, and can manage very large campaigns, thus relieving the user from manually initiating, controlling and monitoring the experiments.

5.3.1 Platform Requirements

XM2 facilitates the design and implementation of experimental campaigns to characterize either the hardware itself, or the resilience of software operating on top of overclocked or undervolted hardware. XM2 can be used on top of different hardware platforms and software stack configurations. It assumes the following support from the underlying hardware and software of the platform used for the experiments:

Hardware support: The hardware must provide support for controlling and scaling the system operating point (voltage, frequency) beyond the normal working envelope. Modern Intel x86-64 CPUs offer such capabilities, starting from the Haswell family, through the programmable Fully Integrated Voltage Regulator (FIVR) [14]. Several processors based on ARM architectures offer similar functionality. The AppliedMicro X-Gene 2 [95] chip does so through the SLIMpro management processor included in the chip. The ARM Cortex A53 processor in Raspberry PI 3b boards can be set to operate in non-nominal conditions via a configuration file.

Compiler support for common function attributes: Every application using XM2 must be linked with a thin library, used to notify external systems about the execution status of the application. It also undertakes the management of the application and supports data exchange with external systems. The library exploits the common function attributes constructor and destructor provided by the gcc compiler.

Connectivity: We assume that the target (tested) system supports TCP connections to other systems. These are used to orchestrate the execution campaign, to supply input data, and to collect the results of the computation from the target system. The TCP functionality is provided either by the OS, or directly by the aforementioned library when running on bare metal.

Remote reset support: When operating at extended configurations, errors leading to full system failure are likely. Therefore, the target system needs to offer a hardware interface for a full/clean reset.
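A minimal sketch of how such a library can hook application start-up and termination through the gcc constructor and destructor function attributes is shown below; the notification messages are placeholders and do not reflect the actual XM2 library internals.

    #include <stdio.h>

    /* Runs before main(): e.g., connect to the Slave daemon and prepare inputs. */
    __attribute__((constructor))
    static void xm2_lib_init(void)
    {
        fprintf(stderr, "XM2 library: application starting\n");
    }

    /* Runs after main() returns or exit() is called: e.g., report termination. */
    __attribute__((destructor))
    static void xm2_lib_fini(void)
    {
        fprintf(stderr, "XM2 library: application terminated\n");
    }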

5.3.2 Tool Design and Configuration

An experimental characterization campaign on top of unreliable hardware typically involves the execution of multiple experiments under the same configuration (in terms of the underlying hardware, its voltage/frequency configuration, the input set of the application, etc.).


FIGURE 5.9: System architecture of XM2. It comprises a single monitoring system and multiple target systems (monitoring system: Master daemon, Coordinators, Classifier, database, input files, expConf; each target system: Slave daemon, XM2 library, user application, reliability configurator). The components corresponding to dark gray boxes are supplied by the user. XM2 includes a built-in classifier of results, however the latter can be substituted by a user-provided one.

After each experiment terminates, its results are checked and classified, depending on the potential effects of faults. This experimental procedure continues until the number of experiments is sufficient to provide statistically significant results.

Figure 5.9 presents a high-level overview of XM2. It is structured in a distributed way, comprising a single monitoring system and one or more target systems of the same hardware architecture. The monitoring system deploys a Master daemon which spawns a Coordinator thread for each target system.


Every target system spawns a Slave daemon, which receives commands from the Coordinator and orchestrates the experimental campaign locally, through the library which manages and invokes the application.

5.3.3 Configuration File

The user defines an experimental campaign through a single configuration file, called expConf. The user defines the following parameters:

Target Application: An absolute path to the binary file which will be executed on the target system.

Target System: The Internet Protocol (IP) or Media Access Control (MAC) addresses of the target system(s).

Input File: Input of the application to be executed on the target system(s). XM2 supports only a single input file per application. If the application requires multiple inputs, they need to be combined into one file by the user.

Operating Configuration: The voltage and frequency settings for the reliable (Nominal) and unreliable (unRel) configuration of the target system. Applying aggressive overclocking or undervolting settings increases the frequency of errors. Notably, the user can change both the frequency and the supply voltage simultaneously in the unRel configuration.

Result Classification: XM2 comes with a default classifier, which characterizes the outcome of each experiment as: (i) Exact: if the result is identical to that of a nominal execution; (ii) SDC: if the result differs from that of a nominal execution; (iii) Data Abort: if the CPU raised a data abort trap due to accessing a non-existent physical memory address; (iv) Illegal Instruction: if the CPU raised a trap because it detected a non-existent opcode; (v) CPU Crash: if the execution time exceeds, by far, the time of a nominal execution. Nominal outputs for the same target may differ, for example in multi-threaded applications which use floating point arithmetic. To be flexible, XM2 allows users to provide their own classifiers that implement a customized comparison between the golden results and the application output. For example, one can use a deviation threshold to detect erroneous results.

Termination Criteria: The user may define a custom binary which is used by XM2 to determine when to terminate a campaign. XM2 invokes the user-defined binary, passing as input the number of experiments classified in each category. The default termination checker terminates a campaign simply when reaching a predefined number of experiments, which can be set by the user via the expConf file.

Nominal Experiments: The user defines the number of experiments to be performed by XM2 in the Nominal setting. These experiments are used to profile the execution time of the application and to obtain the error-free (golden) output files.
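A sketch of a custom termination checker is shown below. The invocation interface is an assumption made for illustration: here the per-category counts are passed as command-line arguments in the order Exact, SDC, Data Abort, Illegal Instruction, CPU Crash, and the exit code (0 = continue, 1 = terminate) conveys the decision back to XM2.

    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        if (argc < 6)
            return 0;                     /* not enough information yet: continue */

        long total = 0;
        for (int i = 1; i <= 5; i++)
            total += atol(argv[i]);       /* assumed order: Exact, SDC, Data Abort,
                                             Illegal Instruction, CPU Crash */
        long sdc = atol(argv[2]);

        /* Example policy: stop after 100 observed SDCs or 2000 experiments. */
        return (sdc >= 100 || total >= 2000) ? 1 : 0;
    }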


void readInput(void *ptr, size_t sz, size_t nmemb): Receives nmemb * sz bytes from the input file available at the Coordinator filesystem and stores them in the memory region pointed to by ptr.

void writeOutput(void *ptr, size_t sz, size_t nmemb): Sends nmemb * sz bytes from the memory region pointed to by ptr to the Coordinator.

void switchToRel(): Switches to the Nominal state. The implementation of the function is architecture dependent. It is blocking – the Slave daemon needs to acknowledge the state switch to the Coordinator.

void switchToUnRel(): Switches to the unRel state. Same semantics as switchToRel.

TABLE 5.2: API to the run-time library of XM2.

5.3.4 Run-time Library API

The run-time library that accompanies XM2 needs to be linked with the target application. It offers an API that enables the application to control data exchange with the Coordinator node, as well as hardware switching between the Nominal and unRel states. Table 5.2 lists the primitives of the API.

5.3.5 Example

Listing 5.3 provides an example of the expConf configuration file. The file initially assigns a name to the target system and specifies it using its IP and MAC address. In this example there are two target systems (PC_A, PC_B). The specified Nominal operating point is used by XM2 to compute error-free (golden) outputs as well as to determine the normal execution time of the application for each target system. The configuration file also includes a list of unRel configurations. A separate experimental campaign will be executed for each of those configurations. The configuration file also provides the paths to the application binary and the input file. The keyword Monitor indicates that these files reside in the monitoring filesystem and need to be fetched over the network. In this example, the user also specifies the Classifier binary, which will be used to classify the outcome of each experiment/run. The user specifies the number of experiments (10) to be performed by XM2 in the Nominal configuration to obtain the golden output file. Finally, the user defines the maximum number (500) of experiments to be performed for each unRel configuration.

Listing 5.4 outlines the modified source code of a mini-application implementing a Sobel filter. Lines 4, 5, 7 and 8 contain function calls to the run-time library of our tool. Finally, Listing 5.5 outlines a classifier which categorizes experiments as Exact, Acceptable or SDC. Exact experiments are those that produce a bit-wise exact copy of the result of the error-free execution.


1 { 2 "TargetSystem : { 3 "idName" : "PC_A" 4 "IP Address" : " 10 ...." , 5 "MAC address" : "AA: BB :CC:DD: EE : FF" , 6 " Nominal " : [ 1 . 2 , 1200 ] , 7 " unRel " : [[ 1 . 2 , 1320 ] , [ 1 . 2 , 1330 ] , [ 1 . 0 , 1320 ]]} 8 "TargetSystem_ : { 9 "idName" : "PC_B" 10 .... } 11 "Application":{ 12 " Path " : [ "pathToExecutable" , " Monitor " ] 13 "InputFiles" : [[ "pathToInput" , 262144 , " Monitor " ]] 14 "Classifier" : "/path/to/psnr.exe" 15 "Termination" : { " d e f a u l t " : { 500 }} , 16 "NominalExp" : { 10 }} 17 }

LISTING 5.3: An example expConf file using the json format.

1  #include
2  void sobel(unsigned char *in, unsigned char *out);
3  int main(int argc, char *argv[]){
4      readInput(in, sizeof(char), SIZE);
5      switchToUnRel();
6      sobel(in, out);
7      switchToRel();
8      writeOutput(out, sizeof(char), SIZE);
9      return 0;
10 }

LISTING 5.4: Source code of the application extended with calls to the run-time API.

#include <stdio.h>
#include <math.h>
float PSNR(unsigned char *gld, unsigned char *tst);
int main(int argc, char *argv[]){
    float res = PSNR(gold, test);
    if (isinf(res))
        printf("Exact inf\n");
    else if (res > 50.0)
        printf("Acceptable %f\n", res);
    else
        printf("SDC %f\n", res);
    return 0;
}

LISTING 5.5: Source code of a custom classifier.

Acceptable experiments produce outputs with a PSNR higher than 50.0 dB in comparison with the golden output. All other experiments are categorized as SDC. Note that the classifier is not invoked if the application terminates abruptly due to a runtime error or a crash. In this case, the framework automatically classifies the experiment as CPU Crash, Illegal Instruction, or Data Abort.

5.3.6 Flow of a Fault Injection Campaign

Figure 5.10 illustrates a simplified time-line of a fault injection campaign controlled by our tool. Initially, the user provides the expConf file to the Master daemon.


FIGURE 5.10: Flow chart of the main steps performed by XM2 for the basic case of an experimental campaign that does not result in crashes (swimlanes: Master, Coordinator, Slave / user application). The dark box is the only state where the target system is configured in an unreliable state.

The daemon creates a new database and spawns a Coordinator thread for each of the target systems. The Coordinator connects to the respective Slave daemon on the target system using the TCP protocol, and transfers the Nominal configuration, the application binary and the inputs to the target system. The inputs will be used when the target application calls the readInput function. The Slave daemon performs the Nominal number of experiments as specified in the expConf (without transitioning to the unRel state). The purpose of this step is to produce error-free golden outputs as well as to profile the time required to execute the application under nominal conditions. The golden file is used by the classifier for comparison against the outputs of unreliably executed code.

At this point the actual experimental campaigns start. The Coordinator sends the configuration parameters of the unRel state to the Slave. These parameters will be used for all subsequent experiments.


The Slave then spawns the application. Any requests to read input data by the application using the XM2 API calls are handled transparently. Before a Slave transitions to the unRel state it first notifies the Coordinator. At this point the Coordinator starts a watchdog which waits for the application to terminate. The maximum waiting time is equal to the profiled time when the code was executed reliably, increased by 10%. In case the unRel frequency is lower than the frequency of the Nominal point, we proportionally increase the waiting time to match the maximum expected performance degradation due to frequency scaling. If the Coordinator does not receive any information about the application status within this period, the Slave is reset and the corresponding experiment is flagged as CPU Crash. If the application terminates abruptly, e.g. due to executing an Illegal Instruction, the Slave informs the Coordinator and the experiment is classified accordingly. In case the application terminates normally, the Slave sends all output data, as defined by the writeOutput call, to the Coordinator.

Afterwards, the Coordinator invokes the classifier binary to characterize the experiment. If the experiment is not flagged as Exact, the Coordinator resets the Slave so that it is re-initialized to a valid state. The Coordinator then evaluates the termination criterion and either terminates the campaign or proceeds to deploy the next experiment. When a campaign terminates, XM2 prints the statistics for the different experiment classifications, as well as the path to the output database. In the database, we store the classification of each experiment and a path to the raw output data of the executions. If there are more unRel configurations specified in the expConf, a new experimental campaign starts, otherwise the tool terminates.
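The watchdog timeout described above can be summarized by the following sketch (variable names are illustrative):

    /* Timeout = profiled nominal execution time + 10%, stretched further when
     * the unRel frequency is lower than the Nominal one (a slower clock means
     * a proportionally longer expected run time). */
    static double watchdog_timeout(double nominal_secs,
                                   double f_nominal_mhz,
                                   double f_unrel_mhz)
    {
        double timeout = nominal_secs * 1.10;
        if (f_unrel_mhz < f_nominal_mhz)
            timeout *= f_nominal_mhz / f_unrel_mhz;
        return timeout;
    }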

5.3.7 Evaluation

To evaluate our framework we use three Raspberry PI 3b boards (Table 5.3) as target systems. We set the Nominal configuration to f = 600MHz, Vdd = 0.8V. Even though XM2 supports both undervolting and overclocking, for the evaluation we perform overclocking. We overclock the system by providing a list of unRel states starting from V = 1.2V, fu = 1370MHz, with intermediate steps increasing fu by 10MHz, up to the highest frequency state (V = 1.2V, fu = 1450MHz). The termination criterion for the experimental campaign is a number of experiments equal to 2000, which provides a confidence level of 98% and an error margin of 2.5%. For the evaluation, we use the default classifier.

We use Circle, a C++ library supporting execution on bare metal, to evaluate the error resiliency of software under unreliable execution without any interference from the OS software stack, e.g. the scheduler of the Linux kernel or background OS services. Circle provides several C++ classes which selectively enable or use different hardware features (MMU, interrupt support, etc.). The target systems are reset whenever necessary using a small circuit per system, which employs a transistor operating as a switch to connect the respective reset pins on the Raspberry PI 3b.


System on Chip:   Broadcom BCM2837
Instruction Set:  ARMv8
CPU:              4x ARM Cortex-A53 @ 1.2GHz, in-order, dual instruction decode and execute;
                  6KB conditional predictor; 256-entry indirect predictor; NEON advanced SIMD;
                  8-64KB I-Cache with parity; 8-64KB D-Cache with ECC
RAM:              1 GB LPDDR2 (900MHz)

TABLE 5.3: Raspberry 3B Specifications

FIGURE 5.11: Overhead of XM2 in terms of execution time and additional lines of code (LOC), when compared to a native execution and the original version of the code respectively (benchmarks: DCT, Blackscholes, Inversek2j).

The circuits are controlled by the monitoring system through a serial interface.

We evaluate the programmers' effort to use our tools in terms of extra lines of code (LOCs) that are introduced into the source code of an application. Moreover, we quantify the communication overhead introduced by XM2 between the Coordinator and the target system. We use three benchmarks: Blackscholes [9], Inversek2j and DCT. Blackscholes implements a mathematical model for a market of derivatives, Inversek2j calculates the angles of a 2-joint arm using the kinematic equations, and DCT is a module of the JPEG compression and decompression algorithm.

Figure 5.11-left presents the execution time overhead (%) due to the communication protocol and data exchange between the Coordinator and the target. XM2 adds, in the worst case (DCT), an extra 5% of execution time compared with a native execution on the target platform under the same configuration. The execution time to compute DCT is not negligible compared to the time needed to transfer the data. Consequently, this benchmark results in the highest overhead. The remaining benchmarks are mainly compute-bound. On average XM2 introduces an execution time overhead of 2.5%.

Figure 5.11-right illustrates the programmers' effort to prepare an application for our framework. In Blackscholes, the developer needs to unpack and pack the input/output data prior to transferring them, thus the volume of the new code is equal to 8.7% of the existing one. The remaining two benchmarks are small in terms of LOCs, so even small code additions translate into a large relative overhead.


In both cases we simply replace the fread and fwrite calls with the readInput and writeOutput functions of the XM2 API. Moreover, we add two extra function calls to switch between states. On average, preparing applications for XM2 requires 5.6% extra LOCs.

FIGURE 5.12: Experimental results for different applications (Blackscholes, Inversek2j, DCT) and different overclocked configurations (1420-1450MHz). Bars show the classification of experiments as Exact, SDC, Data Abort, Illegal Instruction or CPU Crash.

Figure 5.12 shows the results of the experimental campaigns evaluating the reliability of the system under different overclocked configurations. Blackscholes uses double precision arithmetic. Due to the representation of such numbers, faults are unlikely to be masked. Therefore, this benchmark suffers the highest percentage of SDCs (up to 26%) when executed at fu = 1440MHz. Inversek2j uses primarily trigonometric functions, which heavily rely on branches. The experiments indicate that faults corrupt the computation of the target address, resulting in decoding memory that does not contain instructions. Consequently, 74% of the experiments result in Illegal Instructions when executed at fu = 1440MHz. Finally, 36% of DCT experiments result in CPU Crash when executed at the same frequency. This benchmark employs six nested loops to iterate through the image pixels and apply the coefficient transformation. Corruptions in the control flow of these loops often result in infinite loops. Therefore, execution is terminated by the watchdog and the experiments are classified as CPU Crashes.

5.4 ARM Cortex A53 Vulnerability Analysis

In this section we demonstrate the versatility of XM2 using three case studies. The first two studies focus on recording and analyzing the behavior of small kernel programs running on the overclocked Raspberry PI platform, whereas the third studies the effect of compiler and source code transformations on the maxFmargin of the applications.


FIGURE 5.13: Experimental results of the instruction error resiliency characterization when Vu = 1.2V, fu = 1450MHz. The X-axis shows the different microkernels (integer: add, sub, mul, div; floating point: vadd.f32, vsub.f32, vmul.f32, vdiv.f32; logic: and, or, xor) and the Y-axis presents the classification of the experiments according to the effects of overclocking on execution.

5.4.1 Instruction Level Error Resiliency Analysis

Initially, we employ our tool to assess the resiliency of ARM instructions when executed individually in the overclocked Cortex A53 pipeline, while causing minimal disruptive events such as cache misses and branch mispredictions (Listing 5.6). We selected a subset of instructions that perform integer (add, sub, mul, div), floating-point (vadd.f32, vsub.f32, vmul.f32, vdiv.f32), and boolean (or, and, xor) arithmetic. The microkernels use only four registers: two as input, one for temporary storage, and one to accumulate the final result, which is propagated to the Coordinator after the end of the execution. In our case, we evaluate the error resiliency of instructions using one input file which sets the registers to the value of 1. However, the user could define multiple input files and perform multiple experimental campaigns. Finally, the Nominal and unRel configurations are the same as in the previous section. We observe that microkernels which execute the same instruction multiple times, regardless of the instruction, produce exact results even when we increase the CPU frequency by 20% (from fu = 1200 MHz to fu = 1440 MHz). When we overclock by an additional 10 MHz, the reliability of most kernels drops significantly, as shown in Figure 5.13. This abrupt fall in reliability confirms previous findings that there are (V, f) settings, called Points of First Failure (PoFF), at which circuits start to exhibit massive errors.

for () {
    instruction r0, r1, r2
    instruction r3, r3, r0
    ...
}

LISTING 5.6: Template of the microkernels used to stress the same execution path of the pipeline.
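For concreteness, the following is a possible C instantiation of this template for the integer add microkernel, using GCC extended inline assembly. The iteration count and the exact register allocation are assumptions; only the accumulation structure follows the template above.

/* Illustrative instantiation of the Listing 5.6 template for the add
 * microkernel (GCC extended inline assembly; the loop count is an assumption). */
#include <stdio.h>

int main(void) {
    int in1 = 1, in2 = 1;   /* the two input registers, set to 1 by the input file */
    int acc = 0;            /* accumulator propagated to the Coordinator at the end */

    for (long i = 0; i < 100000000L; i++) {
        int tmp;
        /* instruction r0, r1, r2 */
        __asm__ volatile("add %0, %1, %2" : "=r"(tmp) : "r"(in1), "r"(in2));
        /* instruction r3, r3, r0 */
        __asm__ volatile("add %0, %0, %1" : "+r"(acc) : "r"(tmp));
    }
    printf("%d\n", acc);
    return 0;
}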


FIGURE 5.14: Experimental results stressing the branch predictor with the Sorted (Srt) and Random (Rnd) microkernels for different overclocked frequencies (fu = 1420 to 1450 MHz).

FIGURE 5.15: Experimental results of the cache microkernels of Table 5.4 for unRel = (1.2V, 1430 MHz), when the hardware prefetcher is enabled (left) and disabled (right). The Y-axis presents the classification of the experiments according to the effects of overclocked execution.

5.4.2 Error Resiliency of Source Code and Algorithm Transformations

In this case study, we demonstrate how XM2 is used to identify source code transformations with a large effect on application resiliency. We focus on transformations that affect the branch prediction mechanism and the cache hierarchy. To evaluate the vulnerability of the branch prediction mechanism we employ two simple kernels. The Sorted kernel traverses an array that contains sorted values and compares each value with the mean of the array. Depending on the outcome of each comparison, a variable is increased or decreased. The branch predictor is able to correctly predict the behavior of the branches in the Sorted version. The second kernel, called Random, traverses the same array; however, as the name implies, the values are stored randomly within the array, resulting in a very high misprediction rate (and subsequent pipeline flushes). Figure 5.14 shows that the Random microkernel has higher percentages of SDCs across all frequencies. The two kernels have similar behavior only for extreme overclocking (fu = 1450 MHz), which leads to an increased percentage of Illegal Instructions. We believe that this is due to the high branch misprediction rate, which sets the Program Counter to memory addresses without valid code or to a wrong memory segment.
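A minimal sketch of the two kernels follows; the array size, the value range and the use of rand()/qsort() are assumptions made for illustration.

/* Minimal sketch of the two branch-predictor kernels (Sorted / Random).
 * Array size, value range and the use of rand()/qsort() are assumptions. */
#include <stdlib.h>
#include <stdio.h>

#define N (1 << 20)
static int a[N];

/* Traverse the array and compare each element with the mean;
 * depending on the outcome, increase or decrease a counter. */
static long traverse(const int *v, long mean) {
    long counter = 0;
    for (long i = 0; i < N; i++) {
        if (v[i] > mean) counter++;   /* well predicted when v is sorted */
        else             counter--;   /* ~50% mispredictions when v is random */
    }
    return counter;
}

static int cmp(const void *x, const void *y) {
    return *(const int *)x - *(const int *)y;
}

int main(void) {
    long sum = 0;
    for (long i = 0; i < N; i++) { a[i] = rand() % 1000; sum += a[i]; }
    long mean = sum / N;

    long random_result = traverse(a, mean);   /* Random kernel: values stored randomly */
    qsort(a, N, sizeof(int), cmp);
    long sorted_result = traverse(a, mean);   /* Sorted kernel: branches become predictable */

    printf("%ld %ld\n", sorted_result, random_result);
    return 0;
}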


For the cache we create kernels which perform read, write or memcopy operations using different memory access patterns (Table 5.4). Figure 5.15 presents the results of the cache evaluation. Note that the write operation shows a higher degree of robustness in comparison with the rest of the operations. As the CPU automatically enables the read allocate mode during the execution of the write microkernels, the cache is barely utilized. The read operations are slightly more robust than the memcopy operations. In fact, the memcopy operations usually result in a CPU Crash, whereas the read operations result in Illegal Instructions. When comparing the different access patterns, the more complex the pattern, the higher the number of experiments which terminate abnormally. In the case of the L2Stride pattern, all experiments result in a CPU Crash. We observe that the strided patterns have high CPU Crash probabilities. We assume that the prefetcher may have a negative impact on reliability. To validate our assumption, we programmatically disable the prefetcher and recompile the cache microkernels. Executing these binaries is trivial, since the expConf files are the same and only the attribute describing the binary paths needs to change. The results of the fault injection campaign without the prefetcher are presented on the right side of Figure 5.15. In general, deactivating the prefetcher slightly increases application resiliency; however, for the strided patterns the results remain the same.

5.4.3 Compiler Optimizations VS Frequency Margins

We use the gcc 4.9.3 compiler. While modern compilers provide users with specific options to optimize their code, individual optimizations are usually grouped in higher-level options, such as O0, O1, O2, O3, Os. Our study only considers these options.

Row: This pattern accesses bytes in the same order as they are stored in main memory. This is the optimal way to access the memory, resulting in the lowest L1, L2 cache and TLB miss rates.
Col: This pattern accesses the first byte of each memory page, leading to the worst memory access performance. Each memory access causes an L1 and L2 cache miss.
Stride: This pattern iterates through the 2D array using a stride equal to the cache line size (64 bytes). We disable the hardware prefetcher, so that every memory access leads to an L1 and L2 cache miss. The prefetcher would otherwise have been able to detect this strided pattern.
L2Stride: This pattern is similar to Stride. We disable the hardware prefetcher, so that every memory access leads to an L1 miss and an L2 cache hit. Similarly to Stride, the prefetcher would otherwise have been able to detect this strided memory access pattern.

TABLE 5.4: Different memory access patterns used by the source code transformation case study.
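As an illustration of how such kernels can be written, the sketch below implements the Row and Col read patterns of Table 5.4 on a 2D array. The array dimensions and the use of a volatile buffer are assumptions; the Stride and L2Stride variants would follow the same structure with the corresponding stride.

/* Sketch of the Row and Col read microkernels of Table 5.4.
 * Array dimensions and the volatile qualifier are illustrative assumptions. */
#include <stdio.h>

#define ROWS 4096
#define COLS 4096                      /* one row spans a 4 KB page */
static volatile unsigned char data[ROWS][COLS];

/* Row: touch bytes in the order they are laid out in memory
 * (lowest L1/L2 cache and TLB miss rates). */
static unsigned long read_row(void) {
    unsigned long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += data[i][j];
    return sum;
}

/* Col: walk down a column, so consecutive accesses are one page apart
 * and every access misses in the L1 and L2 caches. */
static unsigned long read_col(void) {
    unsigned long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += data[i][j];
    return sum;
}

int main(void) {
    printf("%lu %lu\n", read_row(), read_col());
    return 0;
}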



FIGURE 5.16: Execution time and energy consumption of the application benchmarks for the different compiler optimization levels, relative to O0.

We use several applications/kernels taken from various benchmark suites [40, 9, 146]. In this study, we analyze Sobel, DCT, Inversek2j, Blackscholes, Swaptions, Fluidanimate, Sjeng and Libquantum. Swaptions uses the Heath-Jarrow-Morton framework to price a portfolio of swaptions. Sjeng is a chess-player application that finds the next move via a combination of alpha-beta and priority proof-number tree searches. Finally, Libquantum simulates a quantum computer. Since compiler optimizations aim at improving performance, we first analyze the effects of the different optimization levels (O0, O1, O2, O3, Os) on the execution time and energy consumption of our benchmark applications. Increasing the optimization level augments the previous set of optimizations with additional ones. In the case of Os, the compiler uses most, but not all, of the O2 optimizations, together with some extra optimizations that decrease the size of the executable. Figure 5.16 shows the normalized energy consumption and execution time of the different compiler optimization levels with respect to O0. As expected, the higher the compiler effort, the greater the performance and energy gains. DCT presents the highest speedup when using the O3 optimizations. On the other hand, Inversek2j shows almost no speedup when compiled with increasing optimization levels. This is because it extensively uses trigonometric functions that are included in an already optimized version of the standard C library. According to our measurements, the different optimization levels do not impact CPU power consumption in a significant way, except for the Os level, which in some applications (Blackscholes, DCT, Inversek2j) increases the power consumption. This is due to the instruction selection performed at this optimization level, as well as the fact that alignment and function inlining are not performed. In any case, the energy gains when using higher optimization levels are typically due mainly to the reduced execution time.

Figure 5.17 illustrates the experimentally identified maxfmargin for the different optimization levels on the four Raspberry PIs. The exploitable extra frequency ranges from 9% to 19% of the nominal CPU frequency (1200 MHz). The highest frequency at which all applications can be executed reliably is equal to 1309, 1356, 1346 and 1356 MHz for the four Raspberry PIs.



FIGURE 5.17: Evaluation of maxfmargin settings for 8 benchmarks (1000 runs each) on each Raspberry PI; the higher the bar, the wider the exploitable frequency margin. The horizontal dotted lines show the maximum (red) and minimum (black) values of maxfmargin.

These correspond to a CPU part-specific static frequency margin of 109, 156, 146 and 156 MHz, respectively. The workload-specific dynamic frequency margin for the four Raspberry PIs is equal to 75, 69, 69 and 69 MHz, respectively. Different optimization levels impact the dynamic frequency margin and can increase or decrease maxfmargin by up to 32 MHz for a given application. Interestingly, O0 has a wider margin than higher optimization levels for the same application in 62.5% of the configurations (combinations of different CPU parts and different applications). Despite the increased maxfmargin of O0, the decrease in execution time due to the extra frequency margin is relatively small, resulting in lower energy gains compared to higher optimization levels. Thus, using higher optimization levels is more beneficial not only in terms of performance but also in terms of energy gains, even though these levels have smaller frequency margins than O0.



FIGURE 5.18: (a) Performance metrics and energy consumption of the transposed and tiled MM versions, with respect to the original implementation. (b) maxfmargin for all Raspberry PIs and all MM implementations.

When comparing the remaining optimization levels (O1, O2, O3, Os), there is no dominant optimization level in terms of frequency margins. On the other hand, in 80% of the total cases O3 is the most energy efficient optimization level.

Source Code Transformations

Developers very often try to reduce the execution time of their applications by employing more efficient algorithms, optimizing memory accesses, reducing the number of instructions, or using special instructions for parallel processing and vectorization. In this section, we optimize a Matrix Multiplication (MM) kernel by using more efficient memory access patterns as well as Single Instruction Multiple Data (SIMD) instructions. In both cases we observe the effects of the optimizations on the energy efficiency, the execution time and the maxfmargin of the different benchmark versions.

Memory Access Pattern Optimizations

The matrix multiplication (MM) kernel performs a multiplication between two floating-point matrices (C = A ∗ B). We consider three different implementations/versions. The so-called original version accesses the first matrix (A) in a row-wise fashion and the second matrix (B) in a column-wise fashion. The second implementation performs the multiplication with the transposed matrix BT, which is allocated in a new 2D array. Finally, the third version uses a tiled version of the matrix multiplication. The size of the tile is equal to the cache line size (64 bytes). Figure 5.18a presents the performance metrics and energy consumption of the transposed and tiled MM versions, normalized to the original implementation. As expected, both optimized versions have significantly fewer L1-cache misses. They also demonstrate a significantly decreased CPI, which directly translates to performance and energy gains.
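The sketch below outlines the three MM versions described above. The matrix size is an illustrative assumption, while the tile of 16 floats corresponds to the 64-byte cache line mentioned in the text.

/* Sketch of the three MM versions (C = A * B); N is an assumption,
 * the tile of 16 floats corresponds to a 64-byte cache line. */
#define N 512
#define T 16

/* Original: A row-wise, B column-wise (poor locality on B). */
void mm_original(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < N; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
}

/* Transposed: multiply with BT, stored in a new 2D array, so that
 * both operands are traversed row-wise. */
void mm_transposed(const float A[N][N], const float B[N][N], float C[N][N]) {
    static float BT[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            BT[j][i] = B[i][j];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < N; k++)
                acc += A[i][k] * BT[j][k];
            C[i][j] = acc;
        }
}

/* Tiled: blocked loops so that the working set of each block fits in cache. */
void mm_tiled(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0f;
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < ii + T; i++)
                    for (int k = kk; k < kk + T; k++)
                        for (int j = jj; j < jj + T; j++)
                            C[i][j] += A[i][k] * B[k][j];
}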



FIGURE 5.19: (a) Normalized performance metrics and energy consumption of the three benchmarks, with respect to the implementations without SIMD instructions. (b) maxfmargin for all Raspberry PIs and benchmarks.

Figure 5.18b presents the maxfmargin of the different MM versions. In contrast to the compiler optimization analysis, where the non-optimized versions exhibit larger maxfmargin, the memory access optimizations present mixed results. On the one hand, in all Raspberry PIs the widest maxfmargin is found for the tiled MM version, which on one of the Raspberry PIs (PI1) yields an increase in the maximum frequency of up to 50 MHz compared to the original version. On the other hand, in three out of four PIs, the transposed MM version has lower frequency margins than the original. Also, on average, the extra frequency margin results in an additional performance gain of 2.5%. We also observe margin variations across different parts; this difference can reach up to 63 MHz when comparing the original version on PI1 with the same version on PI2.

SIMD Optimizations

We use SIMD instructions to optimize the execution time and energy efficiency of Blackscholes, Sobel and tiled MM. Figure 5.19a presents the performance metrics and energy consumption, relative to the normal versions of the benchmarks without SIMD instructions. Figure 5.19b shows the frequency margins of the benchmarks for the four Raspberry PIs. As can be seen, when using SIMD instructions the execution time of Sobel and MM is decreased to 36% and 46% of the normal versions, respectively. This speedup translates into energy gains, since the power consumption does not increase significantly when using SIMD instructions. Blackscholes does not show any reduction in execution time because many math functions used by that benchmark do not have a SIMD equivalent. On PI1, the SIMD version of the Blackscholes benchmark greatly increases the frequency margin, by 44 MHz. This increase in frequency provides an extra performance gain of 3.5% on top of the performance gain obtained by the SIMD instructions. In general, the use of SIMD instructions increases the maximum frequency by 1% on average.
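As an example of the kind of SIMD rewrite applied here, the sketch below vectorizes an inner dot-product loop with NEON intrinsics (the building block of the transposed/tiled MM). It assumes a row-wise layout for both operands and a length that is a multiple of 4; it is an illustration, not the exact code used in the experiments.

/* Sketch of a NEON-vectorized dot product, as could be used in the inner loop
 * of the SIMD MM version. Assumes both operands are laid out row-wise
 * (transposed B) and that n is a multiple of 4. */
#include <arm_neon.h>

float dot_neon(const float *a, const float *bt, int n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (int k = 0; k < n; k += 4) {
        float32x4_t va = vld1q_f32(&a[k]);    /* load 4 floats from each operand */
        float32x4_t vb = vld1q_f32(&bt[k]);
        acc = vmlaq_f32(acc, va, vb);         /* vmla.f32-style multiply-accumulate */
    }
    /* horizontal sum of the four lanes */
    float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    s = vpadd_f32(s, s);
    return vget_lane_f32(s, 0);
}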


                           GemFI                      XM2
Fault Injection Accuracy   Depends on fault model     Realistic
Fault Injection Control    High                       None
Determinism                True                       None
Execution                  Simulation Speed           Native Speed

TABLE 5.5: Brief Comparison between GemFI and XM2

5.5 GemFI versus XM2

Table 5.5 presents a high-level comparison between GemFI and XM2. The fault injection results of XM2 are always realistic and accurate, as they represent the manifestation of timing faults as errors at the software level. On the other hand, GemFI does not provide a fault model; this is to be supplied by the user. The higher the accuracy of the fault model used, the higher the accuracy of the fault injection performed by GemFI. GemFI, being a simulation-based fault injection tool, offers high control over the fault injection procedure. For example, the user knows exactly the number and the characteristics of the faults. Using this information, the user can replicate the exact same fault injection scenario as many times as needed. On the other hand, XM2 does not provide such features. The only control the user has over the fault injection procedure is to select the operating point of an experiment. The user is not aware whether a fault manifested in a hardware component and was masked by the application's resiliency. Moreover, a specific corruption of the output of an application cannot be replicated. Finally, XM2 is much faster than GemFI, as fault injection takes place at native execution speed. However, GemFI offers mechanisms which allow simultaneous fault injection on multiple compute nodes, which can speed up the fault injection procedure significantly.


Chapter 6

Related work

This chapter discusses related work and the differentiation between our research and previous efforts. As approximate and unreliable computing are very active research subjects, we organize the literature discussion across four axes: approximate computing, fault tolerant computing (including power- and energy-aware optimization), voltage margin characterization and prediction, and fault injection tools.

6.1 Approximate computing

The authors in [89] present a thorough survey of the approximation techniques used in computing systems. They discuss the opportunities and obstacles in the use of approximate computing and present techniques for finding approximable program portions and monitoring output quality, along with language support for expressing approximable variables/operations. Quickstep [87] is a tool that approximately parallelizes sequential programs. The parallelized programs are subjected to statistical accuracy tests for correctness. Quickstep tolerates races that occur after removing synchronization operations that would otherwise be necessary to preserve the semantics of the sequential program. Quickstep thus exposes additional parallelization and optimization opportunities by approximating the data and control dependencies in a program. However, Quickstep does not enable algorithmic and application-specific approximation and does not include energy-aware optimizations in the runtime system. Variability-aware OpenMP [104] and variation tolerant OpenMP [105] are sets of OpenMP extensions that enable a programmer to specify blocks of code that can be computed approximately. The programmer may also specify error tolerance in terms of the number of most significant bits in a variable which are guaranteed to be correct. We follow a different scheme that allows approximate (in our context, not significant) tasks to be selectively dropped from execution, and dynamic error checks to detect and recover from errors via selective task restarting. Variability-aware OpenMP applies approximation only to specific FPU operations, which execute on specialized FPUs with configurable accuracy. In contrast, we explore selective approximation at the granularity of tasks, using the significance abstraction. Our programming and execution model thus provides additional flexibility to drop or approximate code, while preserving output quality.


Furthermore, our framework for significance-aware approximate computing does not require specialized hardware support and runs on commodity systems. Several frameworks for approximate computing discard parts of code at runtime, while asserting that the quality of the result complies with quality criteria provided by the programmer. Green [6] is an API for loop-level and function approximation. Loops are approximated with a reduction of the loop trip count. Functions are approximated with multi-versioning. The API includes calibration functions that build application-specific QoS models for the outputs of the approximated blocks of code, as well as re-calibration functions for correcting unacceptable errors that may occur due to approximation. Sloan et al. [130] provide guidelines for manual control of approximate computation and error checking in software. These frameworks delegate the control of approximate code execution to the programmer. Emeuro [85] efficiently breaks down an application into subroutines of varying granularity and automatically generates approximate alternatives for said subroutines through the use of Artificial Neural Networks (ANNs). At execution time, an intelligent runtime system explores the high-dimensional subroutine space and generates a graph of computations which comprises nodes that are either accurate versions of the subroutines or approximate ones implemented as ANNs. Additionally, Emeuro employs a denoising autoencoder-based heuristic to detect ANNs which are incapable of producing outputs of acceptable quality for a given input. To this end, each ANN is coupled with a denoising autoencoder (DAE) whose aim is to reconstruct the input of the ANN. If the difference between the actual input of the ANN and the one reconstructed by its DAE is larger than some user-specified constraint, the ANN is considered to be a sub-optimal choice. In this scenario, a different subroutine graph is investigated. We explore an alternative approach where the programmer uses a higher level of abstraction for approximation, namely computational significance, while the system software translates this abstraction into energy- and performance-efficient approximate execution. Loop perforation [125] is a compiler technique that classifies loop iterations into critical and non-critical ones. The latter can be dropped, as long as the results of the loop are acceptable from a quality standpoint. Input sampling and code versioning [150] also use the compiler to selectively discard inputs to functions and substitute accurate function implementations with approximate ones. Similarly to loop perforation and code versioning, our framework benefits from task dropping and the execution of approximate versions of tasks. However, we follow a different approach whereby these optimizations are driven by user input on the relative significance of code blocks and are used selectively in the runtime system to meet user-defined quality criteria, energy savings and performance gains. While these approaches demonstrate aggressive performance optimization thanks to approximation, they do not consider parallelism in execution.


Furthermore, these techniques operate at a granularity different from parallel tasks and do not target the specific runtime energy optimization opportunities which are exposed through approximation. Several software and hardware schemes for approximate computing follow a domain-specific approach. ApproxIt [149] is a framework for approximate iterative methods, based on a lightweight quality control mechanism. Unlike our task-based approach, ApproxIt uses coarse-grain approximation at a minimum granularity of one solver iteration. Other tools automate the generation and execution of approximate computations. SAGE [120] is a compiler and runtime environment for automatic generation of approximate kernels in machine learning and image processing applications. Paraprox [119] implements transparent approximation for data-parallel programs by recognizing common algorithmic kernels and then replacing them with approximate equivalents. ASAC [114] provides sensitivity analysis for automatically generated code annotations that quantify significance. Contrary to our work on automatic significance analysis, ASAC systematically perturbs the variables of a program and observes the results. It then applies a hypothesis tester to check against a correct output and subsequently scores each variable to rank its contribution to the output of the program. In our work, we rely on interval analysis in conjunction with automatic algorithmic differentiation to qualitatively estimate the contribution of different parts of a code to the application's output quality. Hardware support for approximate computation has taken the form of programmable vector processors [142], neural networks that approximate the results of code regions in hardware [29], and low-voltage probabilistic storage [117]. These frameworks assume non-trivial, architecture-specific support from the system software stack, whereas our approximate computing work depends only on compiler and runtime support for task-parallel execution, which is already widely available on commodity multi-core systems.

6.2 Fault Tolerant computing

Gschwandtner et al. [35] use a similar iterative approach to execute error-tolerant solvers on processors that operate at near-threshold voltage (NTC) and reduce energy consumption by replacing cores operating at nominal voltage with NTC cores. Schmoll et al. [123] present algorithmic and static analysis techniques to detect variables that must be computed reliably and variables that can be computed approximately in an H.264 video decoder. Although we follow a domain-agnostic approach in our framework, we provide sufficient abstractions for implementing the aforementioned application-specific approximation methods. Topaz [1] is a task-based framework which executes computations unreliably. An online outlier detection mechanism detects and then re-computes unacceptable task results reliably. Chisel [88] selects approximate kernel operations to minimize an application's energy consumption while satisfying its accuracy specification.


Rinard et al. [112], in one of the chronologically earlier efforts on task-based unreliable computing, propose a software mechanism that allows the programmer to identify task blocks and then creates a profile-driven probabilistic fault model for each task. This is accomplished by injecting faults at task execution and observing the resulting output distortion and output failure rates. Task Level Vulnerability (TLV) [105] captures dynamic circuit-level variability for each OpenMP task running on a specific processing core. TLV meta-data are gathered during execution by circuit sensors and error detection units to provide characterization in the context of an OpenMP task. Based on the TLV meta-data, the OpenMP runtime apportions tasks to cores aiming at minimizing the number of instructions that incur errors. In contrast to our approach, this work does not consider error recovery and user-specified approximate execution paths. Rehman et al. [109] present a framework for reliable code generation and execution using reliability-driven compilation. A compiler generates multiple, functionally equivalent versions of a given function which differ in terms of vulnerability and execution time. Upon profiling the versions, the runtime system selects one that both increases the reliability of the system and meets the application's real-time constraints. Their work enforces functional correctness but does not exploit the algorithmic characteristic of significance. [118] introduces a system that selects a reliability robustness mechanism (Triple or Double Modular Redundancy, DMR/TMR) as well as the CPU operating voltage and frequency. Its goal is to minimize power consumption while achieving the reliability and timing requirements of the system. In our work, we do not seek functional correctness, but we offer a mechanism to exploit the algorithmic significance to allow errors to manifest only in non-significant computations. [110] introduces the instruction vulnerability index (IVI) for software reliability estimation. Vulnerability indexes at the granularity of the function (FVI) and the application (AVI) are computed based on the IVI. Given a user-specified constraint on the tolerable performance overhead, they perform compiler transformations to enhance code reliability. In our work we do not take into account the instruction vulnerability. We consider the algorithmic property of significance to steer application execution on reliable and unreliable cores. Relax [70] is an architectural framework that lets programmers turn off recovery mechanisms as well as annotate regions of code for which hardware errors can occur. The hardware supports error detection and a C/C++ language-level recovery mechanism provides error recovery from hardware faults at different levels of code granularity. Hardware support for error-tolerant and approximate computing spans a range of designs and novel architectures. Razor [26] is a processor design which is based on dynamic detection and correction of timing failures of the critical paths due to below-nominal supply voltage. The key idea is to tune the supply voltage by monitoring the error rate during operation, using shadow latches controlled by delayed clocks.


The observation that the sequence of instructions in an application binary can have a significant impact on the timing error rate is studied in [41]. A number of simple, yet effective code transformations that reduce the error rate are introduced. In [48] a hardware module monitors the processor pipeline and checks for possible control flow violations (infinite loops). This module is used by the OS/compiler/application to detect errors and take corrective action. ERSA is a multi-core architecture where cores are either fully reliable or have relaxed reliability [73]. ERSA uses an explicit and application-specific mapping of code to cores with different levels of reliability. Our work follows a different approach: the programmer uses significance to indicate code with relaxed correctness semantics and the framework implements error detection and recovery, potentially approximating the task output. EnerJ [121] proposes a programming model which explicitly declares data structures that may be subject to unreliable computation in return for increased performance or fault tolerance. EnerJ allows operations to be computed on aggressively voltage-scaled processors and data structures to be stored in DRAM with a low refresh rate and SRAM with a low supply voltage. Exposing such computing to the programmer requires expanding the processor ISA with instructions that offer only an expectation, rather than a guarantee, that a certain operation will be performed correctly [27]. Contrary to our framework, EnerJ specifies significance at the granularity of data and does not consider task-parallel execution, whereas we use the granularity of a task as our vehicle. Furthermore, EnerJ does not explore the idea of error detection and correction, whereas we provide a systematic approach to using Artificial Neural Networks to automate the process of error detection. There has also been a large amount of work which aims to solve the problem of efficient error detection. Current state-of-the-art approaches to online error detection rely on duplicating the instructions of selected application parts which are considered error-prone. Unsafe instructions are first identified via compiler analysis and/or profiling. Subsequently, a compiler pass hardens the application by duplicating the unsafe instructions and inserting checks [31, 81, 23, 71]. The checks typically involve redundancy in the form of instruction duplication. When a check detects an error, an earlier checkpoint is restored. IPAS [71] expects the user to include a verification function that is used to check whether an injected fault has propagated to the output of the code which is targeted for software hardening against soft errors. This function is only used to train an Artificial Neural Network that drives the selection of instructions prior to their duplication. Other works [39, 34, 56] rely on manually implemented Light-Weight Checks (LWCs) to detect errors at the outputs of computations. [34] uses manual LWCs to determine when an approximate alternative to a function computes outputs which differ severely from the exact implementation. [56] relies on manually implemented LWCs to detect errors on the output of unreliably executed code. [39] falls back to instruction duplication whenever light-weight error detectors result in low detection coverage.


Two offline debugging mechanisms and three online monitoring mechanisms for approximate programs are presented in [113]. Among the offline mechanisms, the first one identifies the correlation between QoR and each approximate operation by tracking the execution and error frequencies of different code regions over multiple program executions with varying QoR values. The second mechanism tracks which approximate operations affect any approximate variable and memory location. The online mechanisms complement the offline ones and they detect and compensate for QoR loss while maintaining the energy gains of approximation. The first mechanism compares the QoR of precise and approximate variants of the program for a random subset of executions. This mechanism is useful for programs where QoR can be assessed by sampling a few outputs, but not for those that require bounding the worst-case errors. The second mechanism uses programmer-supplied "verification functions" that can check a result with much lower overhead than computing the result. The third mechanism stores past inputs and outputs of the checked code and estimates the output of the current execution based on interpolation of previous executions with similar inputs. They show that their offline mechanisms help in effectively identifying the root of a quality issue instead of merely confirming the existence of an issue, and that the online mechanisms help in controlling QoR while maintaining high energy gains. Our method could also be applied to detect errors due to approximation, but we chose to evaluate our automatic error detectors by checking for errors at the output of code which executes under unreliable conditions. [62] presents an output-quality monitoring and management technique which can ensure meeting a given output quality. Based on the observation that simple prediction approaches, e.g. linear estimation, moving average, and decision trees, can accurately predict approximation errors, they use a low-overhead error detection module which tracks predicted errors to find the elements which need correction. Using this information, the recovery module, which runs in parallel to the detection module, re-executes the iterations that lead to high errors. This becomes possible since the approximable functions or codes are generally those that simply read inputs and produce outputs without modifying any other state, such as map and stencil patterns. Our approach differs in that we use an ANN to detect errors, whereas [62] uses hardware-accelerated ANNs to approximate code whose output is subsequently error-checked. Large errors in the approximated computations are corrected by executing the accurate code on the CPU.

6.2.1 Power and Energy-Aware Optimization

Dynamic quality control of non-functional application properties including power and energy has been explored in HeartBeats [43], a framework for user-directed execution steering; PowerDial [44], an environment for adapting applications to execute efficiently under power and load fluctuations; Metronome [128], an operating system substrate for dynamic performance and power management; and the Angstrom processor [45], which provides hardware support for monitoring performance, power, energy and temperature with user-controlled settings.


Dynamic power and energy optimization in the runtime system has been explored in several parallel programming models including OpenMP [21], message passing and hybrid models [76, 79], new parallel programming languages that natively support transparent adaptation to dynamic execution conditions, such as [131], as well as distributed programming frameworks [53]. Cohen et al., in Energy Types [18], used a type-based system for expressing phases of computation which are then executed in different energy states, to optimize overall energy efficiency. However, this system did not consider approximation or dynamic parallel execution as techniques for saving energy.

6.3 Voltage Margin Characterization and Prediction

Many research approaches have emerged in the last few years that relax conservative guardbands to improve energy efficiency. Prior work focusing on commercially available chips includes [95, 4, 5, 94, 74, 151]. In contrast to our work, all previous efforts in predicting voltage margins using performance counters only report theoretical energy gains, without deploying their model. They rely only on theoretical results which, if not validated on real hardware, could result in unreliable operation and even system crashes. Our FSM-based governor uses performance counter values to reduce the voltage margin of the system at run time on real hardware, for unseen workloads which consist of several different applications executing concurrently. In particular, the authors in [95] present an automated system-level analysis of multi-core CPUs based on the ARMv8 64-bit architecture when pushed to operate in scaled voltage conditions. Due to the manifestation of SDCs before system crashes, the authors propose a severity function that can predict safe, SDC-free undervolt levels for each core of the processor. Based on this function and the corresponding core Vmin resulting from the offline characterization, they produce a linear regression model that predicts the Vmin of a core for a single-threaded workload. Their model is trained and evaluated using only single-core executions. In contrast to their work, we use multi-threaded workloads and demonstrate their importance, as multi-threaded workloads usually result in a more conservative Vmin. Moreover, [95] states that Vmin prediction is uncorrelated to performance counters (R² = 0) and that, in combination with the limited dynamic variation of Vmin in ARMv8, a naive prediction using average values is sufficient. On the x86-64 architecture we observe higher dynamic variations, and our model is able to track these variations of the current workload. Note also that we identify margin deviations between off-the-shelf, commercially available parts, and not parts from extreme corner nodes (TTT, TFF and TSS in [95]). In a more recent paper [59], the same authors present a comprehensive statistical analysis for the same platform that improves the prediction of dynamic variations.


In [94] the authors characterize the voltage margins of two x86-64 microprocessors (Sandy Bridge-E and Haswell) for a subset of the SPEC CPU2006 benchmark suite. Similar to [95], they do not consider the implications induced by multi-core executions. Finally, in our work we also quantify the voltage margin deviations between off-the-shelf parts that belong to the same CPU family. The heuristics presented in [4] and [5], which dynamically reduce voltage margins while always preserving safe operation, are based on the error correction (ECC) hardware built into modern processors such as the server-class Intel Itanium 9560. A key observation there is that, as Vdd is lowered, ECC-correctable errors appear before uncorrectable errors (SDCs and CPU crashes). The rate of ECC-correctable errors is used as an indicator of how to adjust the Vdd voltage. In our work, errors reported by the ECC mechanism appear very rarely, and they are always accompanied by immediate CPU crashes. We do not rely on ECC, but rather predict a safe supply voltage using a selected set of performance counters as estimators. Eventually, the methodology we propose is generic enough to be applied, by measuring performance counters, to any processor that provides the ability to manipulate the supply voltage. The authors in [151] exploit the large margins available when only one core in a server-class 8-core Power7+ processor is utilized, turning under-utilized margin into power and performance benefits. Similar to our work, this paper finds that as more cores are progressively utilized by a multi-threaded application, causing larger IR drops across the chip, the benefits of an adaptive margin scheme begin to diminish. A study of the voltage margins of several Kepler and Fermi GPUs is presented in [74]. They first characterize the impact of process, temperature and voltage variation on Vmin, and then predict safe values of Vmin by deploying a linear regression and a neural network model. They show that high energy gains can be achieved by shaving conservative guardbands in modern GPUs. Our work targets CPU architectures, which are significantly more complex than GPUs. Moreover, CPUs, contrary to GPUs, typically serve volatile workloads with diverse characteristics, consisting of mixed user and OS jobs. Our model is able to provide accurate voltage margin predictions for workloads which consist of a mixture of different benchmarks. Based on microarchitectural events (such as branch mispredictions and cache misses) that flush and stall the CPU pipeline and cause large voltage droops, [107] proposes a voltage emergency predictor that learns the signatures of such voltage emergencies and triggers a CPU throttling mechanism (e.g. increase voltage or decrease frequency) to prevent their recurrence. This work is based on CPU modeling in the Simplescalar simulator. In a subsequent paper, the authors measure run-time voltage emergencies on a Core 2 Duo processor and attribute them to pipeline stalls and flushes [106]. Based on these experimental observations, they propose that a mechanism that uses more aggressive margins and a recovery (checkpoint-based) scheme may be better than a fail-safe static margin.


Moreover, they propose a program scheduling mechanism so that the combined voltage droop is canceled out as much as possible. In the same spirit, [46, 86] identify and mitigate voltage emergencies introduced by multithreaded workloads, by correlating performance counters and by introducing other thread synchronization mechanisms, respectively. Unlike our work, the previous papers do not resort to undervolting to reduce voltage margins; instead, they describe techniques and provide solid design guidelines that could be exploited by the supplementary mechanism we describe in Section 4.7. A number of proposed techniques present design approaches at the circuit or micro-architectural level that trade reliability for lower voltage, by attempting to reduce the voltage down to the point that produces the maximum allowable errors without causing catastrophic failures [57]. Several approaches propose methods which ensure correct operation of caches under undervolted conditions at the microarchitectural level [25, 17, 144]. Architectural techniques are presented to eliminate data corruption and, by extension, enable cache operation at scaled voltage settings. In our work we are only interested in the behavior of the whole CPU, and not of any specific component. EVAL is a framework for dynamic adaptation of supply voltage, processor frequency and body bias using a machine learning algorithm [122]. Similar ideas include dynamic pipeline adaptation, transferring the time slack of faster pipeline stages to the slower ones (ReCycle) [134], and using variable latency techniques to mitigate the impact of variations on the register file and execution units in a microprocessor [77]. Furthermore, there are many approaches that employ simulation and modeling techniques to provide design guidelines for future hardware. The authors in [64] propose a multi-core processor which can scale its resources and the number of operating cores to fewer than the number integrated, in order to meet certain power constraints. Several approaches [55, 54, 72] employ regression analysis to map certain performance counters to micro-architectural events and power-delivery estimation. [63] explores ECC protection mechanisms to enable low-power caches through detailed SRAM failure modeling. [16] introduces a workload-dependent technique to identify the paths excited by individual applications on ultra-low-power microprocessors and reduce the voltage to a level that meets the timing of those paths (instead of all paths).

6.4 Fault Injection Tools

In CLKSCREW [133], the voltage and frequency scaling capabilities of modern processors are exploited in order to compromise system security, by injecting faults during code execution and extracting cryptographic keys from the ARM TrustZone. This work focuses on the security risks raised by modern energy management techniques. RIFLE [82] and MESSALINE [3] introduce deterministic and reproducible fault injection techniques at the pin level of a processor. FIAT [7] and FERRARI [60] implement software-level fault injection which models complex systems with great accuracy; however, ensuring that the simulated models are realistic and restraining simulation time are significant challenges.


VERIFY [126], MEFISTO [52] and GemFI [96] are fault injection simulators that provide high accuracy in both the location and the timing of the fault, but they introduce significant overhead. Finally, REFINE [32] allows fault injection in the back-end of the LLVM compiler. In [15] a multi-faceted microarchitecture-level toolset for reliability assessment of modern microprocessors is presented. The framework is built around the Gem5 simulator and provides several modes of operation which employ acceleration features for all stages of a fault-injection-based reliability assessment campaign. The same authors in [58] examine the effectiveness of microarchitectural fault injection for x86 and ARM microprocessors in a differential way: by developing and comparing two fault injection frameworks on top of the most popular performance simulators, MARSS and Gem5. In our work we provide two tools. GemFI is a simulation-based fault injection tool which is built on a broadly used, reconfigurable simulator (Gem5). The purpose of GemFI is to support any arbitrary fault model, by allowing the user to describe the faults in an input file. Moreover, to the best of our knowledge, GemFI is the first infrastructure that can target specific application areas while minimizing the changes to the original source code of the application under test. On the other hand, XM2 performs fault injection natively on the targeted system; therefore, it offers native execution speed and does not rely on any fault model, as faults manifest as real hardware errors.


Chapter 7

Conclusions

In the first chapters of our work we discussed the significance-based computing paradigm. Our main objective was to formulate the notion of computational significance of an application while providing flexible mechanisms to trade off quality for energy efficiency. We encapsulate these objectives in two parameters, significance and ratio. The significance defines the relative importance of a computation with respect to the quality of the output. On the other hand, the ratio parameter defines the percentage of computations which should be treated as significant and which should not. This conceptual significance-ratio idea is applied to two different computing paradigms, the fault tolerant and the approximate one. We introduce and evaluate a framework that targets fault tolerant computing and enables execution on platforms operating unreliably outside their normal voltage/frequency envelope, in order to aggressively reduce energy consumption. We present a programming model for the development of error-resilient programs in a disciplined manner. Our model exploits programmer wisdom to characterize task significance and to provide check and repair mechanisms for the outputs of tasks that are executed unreliably. We evaluate the effectiveness of different protection mechanisms. We show that traditional system software protection mechanisms are not adequate; however, their combination with programmer wisdom provides effective protection against crashes and silent data corruptions, while enabling considerable energy gains. Interestingly, modern processors, with the assistance of our framework, can produce acceptable results until they reach the Point of First Failure. When decreasing the supply voltage below that point, either the additional energy gains are too low, or massive failure rates defeat any realistic software-based protection mechanism. A similar programming model that exposes the significance and ratio parameters is introduced for approximate computing. This information is used to achieve energy-constrained execution with graceful quality degradation. An offline, profile-based training process produces a model which predicts the energy footprint of a given application as a function of its input size, the number of cores used, the processor frequency, and the ratio of accurate to total number of tasks. This model is exploited by the runtime system of an energy-constrained multi-core platform to steer execution towards a configuration that maximizes the quality of the output while complying with the energy constraints.


Significance-based computing presents significant energy gains; however, it is not always applicable, as it requires source code modifications and error handling mechanisms. Therefore, we introduce an application-agnostic modeling methodology, based on the monitoring of selected CPU performance metrics, to estimate and eliminate voltage margins without lowering the reliability of the system. Our offline characterization of CPUs from the Intel Skylake and Haswell microarchitectures reveals wide voltage margins in both CPU families, and considerable margin variation across CPU parts, CPU cores, benchmarks and workload configurations, thus motivating the need for customized and dynamic voltage adjustment to significantly improve energy efficiency. We evaluate our model's effectiveness by using a voltage governor. The governor relies on our prediction model to continuously adjust the supply voltage and drive the processor to a more energy efficient, yet safe, zone of operation. The experimental evaluation across a wide range of benchmarks and larger-scale applications shows large energy savings, up to 42.68% and 34.37% over the standard Intel P-state DVFS governor for parts of the Skylake and Haswell families, respectively. Finally, we present two experimental frameworks for reliability analysis to study the effects of sub-nominal CPU operation. GemFI is a simulation-based fault injection tool which enables the injection of transient, intermittent and permanent faults. It simulates unreliable environments in full-system, cycle-accurate mode. It is not limited to specific fault models, but is easily extensible and facilitates support for future fault models. Moreover, GemFI features such as checkpointing allow the execution of large-scale fault injection campaigns. XM2 is a software framework which facilitates the experimental evaluation of the effects of voltage/frequency margins on the operation of CPU platforms. XM2 can be used to study the behavior of software running on platforms operating, potentially unreliably, outside their nominal operational envelope.

7.1 Future Work

There are several open research opportunities based on our work. An interesting research direction is the relationship between significance, task granularity, quality of results, and the overhead of error detection and correction. This is a problem similar to choosing the appropriate granularity for parallel code to optimize the resulting performance. It is important that application developers can make educated decisions regarding the granularity of their application tasks. Otherwise, the resulting implementations might under-utilize the hardware or even produce output of lower quality. Another interesting research direction is to study the interaction between the voltage margins of modern processors and the operating frequency of the system. Using a similar modeling methodology, one can express the margins of the system as a function of the performance metrics and the operating frequency.
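A minimal sketch of this direction is shown below, assuming a simple linear model; the choice of metrics, the number of coefficients and the linear form are all assumptions, not an implementation developed in this thesis.

/* Illustrative sketch: express the exploitable frequency margin as a function
 * of selected performance metrics and the operating frequency. The linear
 * form, the number of metrics and all coefficients are assumptions. */
#define NUM_METRICS 4

typedef struct {
    double bias;
    double freq_coeff;                  /* weight of the operating frequency */
    double metric_coeff[NUM_METRICS];   /* weights of the selected counters  */
} margin_model_t;

/* Predicted margin (MHz) for the current workload and frequency. */
double predict_margin(const margin_model_t *m,
                      const double metrics[NUM_METRICS], double freq_mhz) {
    double margin = m->bias + m->freq_coeff * freq_mhz;
    for (int i = 0; i < NUM_METRICS; i++)
        margin += m->metric_coeff[i] * metrics[i];
    return margin;
}

/* A power-capping governor could then pick the highest (V, f) setting whose
 * predicted power stays within the user-defined budget while f stays below
 * the nominal frequency plus the predicted margin. */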


[Figure 7.1 shows applications and tasks, ordered by decreasing importance and task significance, exposing their quality/energy requirements to an energy/quality-aware scheduler, which maps computations onto heterogeneous hardware whose cores range from reliable to severely unreliable; decreasing core reliability offers increasing opportunities for energy efficiency by trading off quality for reliability.]

FIGURE 7.1: Vision of our approach. The applications and the computations should provide information to the software stack about their quality/energy requirements. Using this information the computations can be scheduled on hardware with different energy-reliability settings.

Using such a model, one can create a power-capping governor which selects the voltage-frequency setting of a processor according to a user-defined power budget.

Moreover, we also plan to study the interplay between computation approximation and the dynamic margins of an application. We have shown that the voltage margins of both x86 and ARM processors are workload dependent. The approximate versions of a code, for example, could exhibit wider voltage or frequency margins, allowing our governor to set the operating point of the system to lower values. Such a method would increase the energy efficiency of the system even further, as the approximate versions by design consume less energy, and these computations can additionally be executed at a more energy-efficient processor operating point.

This thesis concludes with the following suggestion towards the realization of energy-efficient computing on next-generation hardware, depicted in Figure 7.1. The transition to sub-nominal operation requires effort on the side of the software stack, to design systems which express clear quality requirements. Of course, not all parts of an application are amenable to errors. There are critical regions of software which would significantly degrade the reliability of the system if an erroneous event corrupted their execution flow; such sensitive software includes, for example, operating systems and runtime systems. Fortunately, there are classes of applications, as presented in this thesis, in which the majority of computations exhibit error-resiliency characteristics; such applications can be executed on deeply unreliable hardware settings, while the remaining software can benefit from execution on hardware that operates in the safe region.
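As a rough illustration of the power-capping direction discussed earlier in this section, the sketch below assumes a per-frequency characterization of the minimum safe voltage and a simple C*V^2*f power proxy, and picks the highest frequency whose predicted power fits a user-defined budget. The frequency/voltage table, the effective-capacitance constant, and the guard band are hypothetical values chosen only for illustration.

/*
 * Sketch of a power-capping operating-point selector (illustrative only).
 * Falls back to the lowest frequency if no point fits the budget.
 */
#include <stdio.h>

typedef struct { double freq_ghz; double vmin_volts; } op_point_t;

static const op_point_t table[] = {        /* hypothetical characterization */
    { 1.2, 0.78 }, { 1.8, 0.84 }, { 2.4, 0.90 }, { 3.0, 0.98 }, { 3.5, 1.06 },
};
static const int    n_points        = sizeof(table) / sizeof(table[0]);
static const double eff_capacitance = 12.0;   /* illustrative, W / (V^2 * GHz) */
static const double guard_band_v    = 0.02;

static double predict_power_w(op_point_t p) {
    double v = p.vmin_volts + guard_band_v;
    return eff_capacitance * v * v * p.freq_ghz;  /* dynamic-power proxy */
}

int main(void) {
    double power_budget_w = 65.0;                 /* user-defined cap */
    int best = 0;                                 /* lowest point as fallback */
    for (int i = 0; i < n_points; i++)
        if (predict_power_w(table[i]) <= power_budget_w) best = i;

    printf("selected %.1f GHz @ %.2f V (predicted %.1f W, budget %.1f W)\n",
           table[best].freq_ghz, table[best].vmin_volts + guard_band_v,
           predict_power_w(table[best]), power_budget_w);
    return 0;
}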


Related Publications

[1] Konstantinos Parasyris, Georgios Tziantzoulis, Christos D. Antonopoulos, Nikolaos Bellas. GemFI: A Fault Injection Tool for Studying the Behavior of Applications on Unreliable Substrates. In Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014

[2] Konstantinos Parasyris, Nikolaos Bellas, Christos D. Antonopoulos, and Spyros Lalis. Exploring the Effects of Code Optimizations on CPU Frequency Margins. In Proceedings of the 1st Workshop on Approximate and Transprecision Computing on Emerging Technologies, ATCET 2018

[3] Konstantinos Parasyris, Nikolaos Bellas, Christos D. Antonopoulos, and Spyros Lalis. A Framework for Evaluating Software on Reduced Margins Hardware. In Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018

[4] Konstantinos Parasyris, Vassilis Vassiliadis, Christos D. Antonopoulos, Spyros Lalis, and Nikolaos Bellas. Significance-Aware Program Execution on Unreliable Hardware. ACM Transactions on Architecture and Code Optimization, TACO 2017, 14(2):12:1–12:25

[5] Panos Koutsovasilis, Konstantinos Parasyris, Christos D. Antonopoulos, Nikolaos Bellas and Spyros Lalis. Dynamic Undervolting to Improve Energy Efficiency on Multicore X86 CPUs. (Under review)

[6] Vassilis Vassiliadis, Jan Riehme, Jens Deussen, Konstantinos Parasyris, Christos D. Antonopoulos, Nikolaos Bellas, Spyros Lalis, and Uwe Naumann. Towards automatic significance analysis for approximate computing. In Proceedings of the 2016 International Symposium on Code Generation and Optimization, CGO 2016

[7] Vassilis Vassiliadis, Charalampos Chalios, Konstantinos Parasyris, Christos D. Antonopoulos, Spyros Lalis, Nikolaos Bellas, Hans Vandierendonck, and Dimitrios S. Nikolopoulos. Exploiting significance of computations and profile-driven regression for energy-constrained approximate computing. International Journal of Parallel Programming, IJPP 2016, 44(5):1078–1098


[8] Vassilis Vassiliadis, Charalampos Chalios, Konstantinos Parasyris, Christos D. Antonopoulos, Spyros Lalis, Nikolaos Bellas, Hans Vandierendonck, and Dimitrios S. Nikolopoulos. A Significance-driven Programming Framework for Energy-constrained Approximate Computing. In Proceedings of the 12th ACM International Conference on Computing Frontiers, CF 2015

[9] Vassilis Vassiliadis, Konstantinos Parasyris, Charalampos Chalios, Christos D. Antonopoulos, Spyros Lalis, Nikolaos Bellas, Hans Vandierendonck, and Dimitrios S. Nikolopoulos. A programming model and runtime system for significance-aware energy-efficient computing. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, volume 50, pages 275–276. PPoPP 2015 (poster abstract)


Contribution to Joint Publications

The results presented in this thesis have been partially published in [68, 97, 98, 99, 141, 140, 139, 138, 96]. This appendix discusses my contribution to each of the aforementioned publications.

In [99] I contributed to the design of the significance-aware fault tolerant programming model. I contributed the runtime system, the power/time/energy/fault model, and the hybrid software/simulation-based fault injection methodology. Additionally, I contributed to the process of benchmarking, as well as to the analysis of the results. These contributions facilitate the development of applications using the significance-aware programming model, as well as the implementation of the result-check functions.

For [96] I contributed to the design of the fault injection tool and the checkpointing methodology, as well as to the deployment of the tool on a network of workstations. Additionally, I contributed to the validation of the tool and to the experimentation with it. For [97] I contributed to the design, monitoring and deployment of the fault injection tool. Additionally, I integrated the functionality of the tool with Circle [115]. Finally, I contributed the case studies regarding the ARM Cortex A53 processor. In [98] I contributed the study on the correlation of compiler and source code optimizations with the extent of the frequency margins of the ARM Cortex A53 processor.

For [141, 140, 139, 138], I performed the benchmarking and analysed the results of the experimental campaigns. I also contributed to the design of the approximation extensions of our significance-aware computing programming model.

Finally, for [68] I contributed the entire modeling methodology. Additionally, I contributed the synchronization library used by the applications in the offline characterization. The benchmarking for the offline characterization and the online evaluation was performed jointly by me and Panos Koutsovasilis.


Bibliography

[1] Sara Achour and Martin C. Rinard. “Approximate Computation with Outlier Detection in Topaz”. In: Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applica- tions. OOPSLA 2015. New York, NY, USA: ACM, 2015, pp. 711–730. ISBN: 978-1-4503-3689-5. DOI: 10.1145/2814270.2814314. URL: http://doi. acm.org/10.1145/2814270.2814314. [2] J Ansel, K Arya, and G Cooperman. “DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop Proc”. In: Proc. of the IEEE Inter- national Symposium on Parallel & Distributed Processing (IPDPS). 2009. [3] Jean Arlat et al. “Fault Injection for Dependability Validation: A Methodology and Some Applications”. In: IEEE Trans. on Software Engineering (1990). [4] A. Bacha and R. Teodorescu. “Using ECC Feedback to Guide Voltage Specula- tion in Low-Voltage Processors”. In: 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. 2014. DOI: 10.1109/MICRO.2014.54. [5] Anys Bacha and Radu Teodorescu. “Dynamic Reduction of Voltage Margins by Leveraging On-chip ECC in Itanium II Processors”. In: SIGARCH Comput. Archit. News 41.3 (June 2013), pp. 297–307. ISSN: 0163-5964. DOI: 10.1145/ 2508148.2485948. URL: http://doi.acm.org/10.1145/2508148. 2485948. [6] Woongki Baek and Trishul M. Chilimbi. “Green: A Framework for Support- ing Energy-conscious Programming Using Controlled Approximation”. In: Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation. PLDI ’10. New York, NY, USA: ACM, 2010, pp. 198– 209. ISBN: 978-1-4503-0019-3. DOI: 10.1145/1806596.1806620. URL: http: //doi.acm.org/10.1145/1806596.1806620. [7] J. H. Barton et al. “Fault Injection Experiments Using FIAT”. In: IEEE Trans. on Computers (1990). [8] Christian Bienia. “Benchmarking Modern Multiprocessors”. PhD thesis. Prince- ton University, 2011. [9] Christian Bienia et al. “The PARSEC Benchmark Suite: Characterization and Architectural Implications”. In: Proceedings of the 17th international conference on Parallel architectures and compilation techniques (PACT). ACM. 2008, pp. 72– 81.


[10] Nathan Binkert et al. “The Gem5 Simulator”. In: ACM SIGARCH Computer Architecture News (2011). [11] David Blaauw et al. “Razor II: In Situ Error Detection and Correction for PVT and SER Tolerance”. In: IEEE Int. Solid-State Circuits Conference, ISSCC , Digest of Technical Papers. 2008. [12] OpenMP Architecture Review Board. OpenMP Application Program Interface (version 4.0). Tech. rep. 2013. [13] Keith A Bowman et al. “A 45 nm resilient microprocessor core for dynamic variation tolerance”. In: IEEE Journal of Solid-State Circuits 46.1 (2011), pp. 194– 208. [14] Edward A Burton et al. “FIVR—Fully Integrated Voltage Regulators on 4th Generation Intel® Core TM SoCs”. In: Applied Power Electronics Conference and Exposition (APEC), 2014 Twenty-Ninth Annual IEEE. IEEE. 2014, pp. 432–439. [15] A. Chatzidimitriou and D. Gizopoulos. “Anatomy of microarchitecture-level reliability assessment: Throughput and accuracy”. In: 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 2016, pp. 69–78. DOI: 10.1109/ISPASS.2016.7482075. [16] H. Cherupalli, R. Kumar, and J. Sartori. “Exploiting Dynamic Timing Slack for Energy Efficiency in Ultra-Low-Power Embedded Systems”. In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 2016, pp. 671–681. DOI: 10.1109/ISCA.2016.64. [17] Zeshan Chishti et al. “Improving Cache Lifetime Reliability at Ultra-low Volt- ages”. In: In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 2009, pp. 89–99. ISBN: 978-1-60558-798-1. [18] Michael Cohen et al. “Energy Types”. In: Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications. OOPSLA ’12. New York, NY, USA: ACM, 2012, pp. 831–850. ISBN: 978-1-4503- 1561-6. DOI: 10.1145/2384616.2384676. URL: http://doi.acm.org/ 10.1145/2384616.2384676. [19] Jeremy Constantin et al. “Exploiting Dynamic Timing Margins in Micropro- cessors for Frequency-over-scaling with Instruction-based Clock Adjustment”. In: Proc. of the Design, Automation & Test in Europe Conference & Exhibition. 2015. [20] C. Constantinescu. “Trends and challenges in VLSI circuit reliability”. In: IEEE Micro 23.4 (2003), pp. 14–19. ISSN: 0272-1732. DOI: 10.1109/MM.2003. 1225959.


[21] Matthew Curtis-Maury et al. “Prediction Models for Multi-dimensional Power- performance Optimization on Many Cores”. In: Proceedings of the 17th Inter- national Conference on Parallel Architectures and Compilation Techniques. PACT ’08. New York, NY, USA: ACM, 2008, pp. 250–259. ISBN: 978-1-60558-282-5. DOI: 10.1145/1454115.1454151. URL: http://doi.acm.org/10. 1145/1454115.1454151. [22] Shidhartha Das et al. “A self-tuning DVS processor using delay-error detec- tion and correction”. In: Solid-State Circuits, IEEE Journal of 41.4 (2006). [23] Moslem Didehban and Aviral Shrivastava. “nZDC: A Compiler technique for near Zero Silent data Corruption”. In: Proceedings of the 53rd Annual Design Automation Conference. ACM. 2016, p. 48. [24] Arnaud Doucet, Simon Godsill, and Christophe Andrieu. “On Sequential Monte Carlo Sampling Methods for Bayesian Filtering”. In: Statistics and com- puting 10.3 (2000). [25] H. Duwe et al. “Rescuing Uncorrectable Fault Patterns in On-Chip Memories through Error Pattern Transformation”. In: In Proceedings of the 43rd Annual International Symposium on Computer Architecture (ISCA). 2016, pp. 634–644. DOI: 10.1109/ISCA.2016.61. [26] Dan Ernst et al. “Razor: A Low-Power Pipeline Based on Circuit-Level Tim- ing Speculation”. In: Proc. of the 36th Annual IEEE/ACM Int. Symposium on Microarchitecture. 2003. [27] Hadi Esmaeilzadeh et al. “Architecture Support for Disciplined Approximate Programming”. In: Proc. of the Seventeenth Int. Conference on Architectural Sup- port for Programming Languages and Operating Systems. 2012. [28] Hadi Esmaeilzadeh et al. “Dark silicon and the end of multicore scaling”. In: Computer Architecture (ISCA), 2011 38th Annual International Symposium on. IEEE. 2011, pp. 365–376. [29] Hadi Esmaeilzadeh et al. “Neural Acceleration for General-Purpose Approx- imate Programs”. In: Proceedings of the 2012 45th Annual IEEE/ACM Inter- national Symposium on Microarchitecture. MICRO-45. Washington, DC, USA: IEEE Computer Society, 2012, pp. 449–460. ISBN: 978-0-7695-4924-8. DOI: 10. 1109/MICRO.2012.48. URL: http://dx.doi.org/10.1109/MICRO. 2012.48. [30] Liang Fan, Siwei Ma, and Feng Wu. “Overview of AVS video standard”. In: Proc. of the IEEE International Conference on Multimedia and Expo (ICME). 2004. [31] Shuguang Feng et al. “Shoestring: probabilistic soft error reliability on the cheap”. In: ACM SIGARCH Computer Architecture News. Vol. 38. 1. ACM. 2010, pp. 385–396.


[32] Giorgis Georgakoudis et al. “REFINE: Realistic Fault Injection via Compiler- based Instrumentation for Accuracy, Portability and Speed”. In: Proc. of the Int. Conference for High Performance Computing, Networking, Storage and Analy- sis. 2017. [33] Íñigo Goiri et al. “ApproxHadoop: Bringing Approximations to MapReduce Frameworks”. In: Proc. of the 22th Int. Conference on Architectural Support for Programming Languages and Operating Systems. 2015. [34] Beayna Grigorian and Glenn Reinman. “Dynamically adaptive and reliable approximate computing using light-weight error analysis”. In: Adaptive Hard- ware and Systems (AHS), 2014 NASA/ESA Conference on. IEEE. 2014, pp. 248– 255. [35] Philipp Gschwandtner et al. “On the potential of significance-driven execu- tion for energy-aware HPC”. English. In: Computer Science - Research and De- velopment (2014), pp. 1–10. ISSN: 1865-2034. DOI: 10.1007/s00450-014- 0265-9. URL: http://dx.doi.org/10.1007/s00450-014-0265-9. [36] Meeta S. Gupta et al. “Tribeca: Design for PVT Variations with Local Recovery and Fine-grained Adaptation”. In: Proc. of the 42Nd Annual IEEE/ACM Int. Symposium on Microarchitecture. 2009. [37] D. Hackenberg et al. “Introducing FIRESTARTER: A Processor Stress Test Utility”. In: In Proceedings in the 4th Conference on International Green Comput- ing (IGC). 2013, pp. 1–9. DOI: 10.1109/IGCC.2013.6604507. [38] Nikos Hardavellas et al. “Toward dark silicon in servers”. In: IEEE Micro 31.4 (2011), pp. 6–15. [39] Siva Kumar Sastry Hari, Sarita V Adve, and Helia Naeimi. “Low-cost program- level detectors for reducing silent data corruptions”. In: Dependable Systems and Networks (DSN), 2012 42nd Annual IEEE/IFIP International Conference on. IEEE. 2012, pp. 1–12. [40] John L. Henning. “SPEC CPU2006 Benchmark Descriptions”. In: SIGARCH Comput. Archit. News 34.4 (Sept. 2006), pp. 1–17. ISSN: 0163-5964. DOI: 10. 1145 / 1186736 . 1186737. URL: http : / / doi . acm . org / 10 . 1145 / 1186736.1186737. [41] Giang Hoang, Robby Bruce Findler, and Russ Joseph. “Exploring Circuit Timing- aware Language and Compilation”. In: Proc. of the 16th Int. Conference on Ar- chitectural Support for Programming Languages and Operating Systems. 2011. [42] Arthur E. Hoerl and Robert W. Kennard. “Ridge Regression: Biased Estima- tion for Nonorthogonal Problems”. In: Technometrics 12.1 (1970), pp. 55–67. DOI: 10.1080/00401706.1970.10488634. eprint: http://amstat. tandfonline.com/doi/pdf/10.1080/00401706.1970.10488634. URL: http://amstat.tandfonline.com/doi/abs/10.1080/00401706. 1970.10488634.


[43] Henry Hoffmann et al. “Application Heartbeats: A Generic Interface for Spec- ifying Program Performance and Goals in Autonomous Computing Environ- ments”. In: Proceedings of the 7th International Conference on Autonomic Comput- ing. ICAC ’10. New York, NY, USA: ACM, 2010, pp. 79–88. ISBN: 978-1-4503- 0074-2. DOI: 10.1145/1809049.1809065. URL: http://doi.acm.org/ 10.1145/1809049.1809065. [44] Henry Hoffmann et al. “Dynamic Knobs for Responsive Power-aware Com- puting”. In: Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS XVI. New York, NY, USA: ACM, 2011, pp. 199–212. ISBN: 978-1-4503-0266-1. DOI: 10. 1145 / 1950365 . 1950390. URL: http : / / doi . acm . org / 10 . 1145 / 1950365.1950390. [45] Henry Hoffmann et al. “Self-aware Computing in the Angstrom Processor”. In: Proceedings of the 49th Annual Design Automation Conference. DAC ’12. New York, NY, USA: ACM, 2012, pp. 259–264. ISBN: 978-1-4503-1199-1. DOI: 10. 1145 / 2228360 . 2228409. URL: http : / / doi . acm . org / 10 . 1145 / 2228360.2228409. [46] X. Hu et al. “Orchestrator: A low-cost Solution to Reduce Voltage Emergen- cies for Multi-threaded Applications”. In: 2013 Design, Automation Test in Eu- rope Conference Exhibition (DATE). 2013, pp. 208–213. DOI: 10.7873/DATE. 2013.056. [47] SI ITRS. “International technology roadmap for semiconductors: Executive summary”. In: Semiconductor Industry Association, Tech. Rep (2013). [48] Ravishankar K Iyer et al. “Recent Advances and New Avenues in Hardware- Level Reliability Support”. In: IEEE Micro 25.6 (2005). [49] Antoine Petitet Jack J. Dongarra Piotr Luszczek. “The LINPACK Benchmark: Past, Present and Future”. In: Concurrency and Computation Practice and Expe- rience 10 (9 2003). [50] Norman James et al. “Comparison of split-versus connected-core supplies in the POWER6 microprocessor”. In: 2007 IEEE Int. Solid-State Circuits Confer- ence. Digest of Technical Papers. 2007.

[51] ArtForz. Jeff Garzik. CPU-Miner. URL: https://github.com/tpruvot/ cpuminer-multi. [52] Eric Jenn et al. “Fault Injection into VHDL Models: The MEFISTO Tool”. In: Proc. of the Symposium on Fault-Tolerant Computing (FTCS). 1994. [53] Myeongjae Jeon et al. “Adaptive Parallelism for Web Search”. In: Proceedings of the 8th ACM European Conference on Computer Systems. EuroSys ’13. New York, NY, USA: ACM, 2013, pp. 155–168. ISBN: 978-1-4503-1994-2. DOI: 10. 1145 / 2465351 . 2465367. URL: http : / / doi . acm . org / 10 . 1145 / 2465351.2465367.


[54] W. Jia, K. A. Shaw, and M. Martonosi. “Stargazer: Automated Regression- Based GPU Design Space Exploration”. In: In Proceedings of the International Symposium on Performance Analysis of Systems Software (ISPASS). 2012, pp. 2– 13. DOI: 10.1109/ISPASS.2012.6189201. [55] P. J. Joseph, Kapil Vaswani, and M. J. Thazhuthaveetil. “Construction and Use of Linear Regression Models for Processor Performance Analysis”. In: In Pro- ceedings of the 12th International Symposium on High-Performance Computer Ar- chitecture (HPCA). 2006, pp. 99–108. DOI: 10.1109/HPCA.2006.1598116. [56] Edin Kadric, Kunal Mahajan, and André DeHon. “Energy reduction through differential reliability and lightweight checking”. In: Field-Programmable Cus- tom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Sympo- sium on. IEEE. 2014, pp. 243–250. [57] Andrew B. Kahng et al. “Designing a Processor from the Ground up to Al- low Voltage/Reliability Tradeoffs”. In: In Proceedings of the 16th International Conference on High-Performance Computer Architecture (HPCA). 2010, pp. 1–11. [58] M. Kaliorakis et al. “Differential Fault Injection on Microarchitectural Sim- ulators”. In: 2015 IEEE International Symposium on Workload Characterization. 2015, pp. 172–182. DOI: 10.1109/IISWC.2015.28. [59] M. Kaliorakis et al. “Statistical Analysis of Multicore CPUs Operation in Scaled Voltage Conditions”. In: IEEE Computer Architecture Letters 17.2 (2018), pp. 109– 112. ISSN: 1556-6056. DOI: 10.1109/LCA.2018.2798604. [60] Ghani A. Kanawati, Nasser A. Kanawati, and Jacob A. Abraham. “FERRARI: A Flexible Software-Based Fault and Error Injection System”. In: IEEE Trans. Comput. 44 (1995). [61] Georgios Karakonstantis et al. “Significance driven computation on next- generation unreliable platforms”. In: Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE. IEEE. 2011, pp. 290–291. [62] Daya S. Khudia et al. “Rumba: An Online Quality Management System for Approximate Computing”. In: SIGARCH Comput. Archit. News 43.3 (2015), pp. 554–566. [63] Jangwoo Kim et al. “Modeling SRAM Failure Rates to enable Fast, Dense, Low-Power Caches”. In: SELSE’09 (2009). [64] N.S. Kim. Resource and Core Scaling for Improving Performance of Power-Constrained Multicore Processors. US Patent 9,606,842. 2017. URL: https://www.google. com/patents/US9606842. [65] Y. Kim et al. “AUDIT: Stress Testing the Automatic Way”. In: In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MI- CRO). 2012, pp. 212–223.


[66] Youngtaek Kim and Lizy Kurian John. “Automated Di/Dt Stressmark Gener- ation for Microprocessor Power Delivery Networks”. In: In Proceedings of the 17th IEEE/ACM International Symposium on Low-power Electronics and Design (ISLPED). 2011, pp. 253–258. ISBN: 978-1-61284-660-6. [67] Jonathan Koomey et al. “Implications of historical trends in the electrical effi- ciency of computing”. In: IEEE Annals of the History of Computing 33.3 (2011), pp. 46–54. [68] Panos Koutsovasilis et al. “Dynamic Undervolting to Improve Energy Ef- ficiency on Multicore X86 CPUs”. In: IEEE Transactions on Parallel and Dis- tributed Systems (TPDS) (2018). [69] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. “Estimating Mutual Information”. In: Phys. Rev. E 69 (6 2004), p. 066138. DOI: 10.1103/ PhysRevE.69.066138. [70] Marc de Kruijf, Shuou Nomura, and Karthikeyan Sankaralingam. “Relax: An Architectural Framework for Software Recovery of Hardware Faults”. In: Proc. of the 37th Int. Symposium on Computer Architecture. 2010. [71] Ignacio Laguna et al. “Ipas: Intelligent protection against silent output cor- ruption in scientific applications”. In: Proceedings of the 2016 International Sym- posium on Code Generation and Optimization. ACM. 2016, pp. 227–238. [72] Benjamin C. Lee and David M. Brooks. “Accurate and Efficient Regression Modeling for Microarchitectural Performance and Power Prediction”. In: SIG- PLAN Not. 41.11 (Oct. 2006), pp. 185–194. ISSN: 0362-1340. DOI: 10.1145/ 1168918.1168881. URL: http://doi.acm.org/10.1145/1168918. 1168881. [73] Larkhoon Leem et al. “ERSA: Error Resilient System Architecture for Proba- bilistic Applications”. In: Proc. of the Conference on Design, Automation and Test in Europe. 2010. [74] J. Leng et al. “Safe limits on Voltage Reduction Efficiency in GPUs: A Di- rect Measurement Approach”. In: 2015 48th Annual International Symposium on Microarchitecture (MICRO). 2015. DOI: 10.1145/2830772.2830811. [75] Régis Leveugle et al. “Statistical fault injection: quantified error and confi- dence”. In: Design, Automation & Test in Europe Conference & Exhibition, 2009. 2009. [76] Dong Li et al. “Strategies for Energy-Efficient Resource Management of Hy- brid Programming Models”. In: IEEE Trans. Parallel Distrib. Syst. 24.1 (Jan. 2013), pp. 144–157. ISSN: 1045-9219. DOI: 10.1109/TPDS.2012.95. URL: http://dx.doi.org/10.1109/TPDS.2012.95. [77] Xiaoyao Liang and David M. Brooks. “Mitigating the Impact of Process Vari- ations on Processor Register Files and Execution Units”. In: In Proceedings of the 39th Annual International Symposium on Microarchitecture (MICRO). 2006.


[78] A. Liaw and M. Wiener. “Classification and Regression by RandomForest”. In: R news 2.3 (2002), pp. 18–22. [79] Min Yeol Lim, Vincent W. Freeh, and David K. Lowenthal. “Adaptive, Trans- parent CPU Scaling Algorithms Leveraging Inter-node MPI Communication Regions”. In: Parallel Comput. 37.10-11 (Oct. 2011), pp. 667–683. ISSN: 0167- 8191. DOI: 10.1016/j.parco.2011.07.001. URL: http://dx.doi. org/10.1016/j.parco.2011.07.001.

[80] Linux Kernel. 2017. URL: https://www.kernel.org/. [81] Qining Lu et al. “SDCTune: a model for predicting the SDC proneness of an application for configurable protection”. In: Compilers, Architecture and Syn- thesis for Embedded Systems (CASES), 2014 International Conference on. IEEE. 2014, pp. 1–10. [82] Henrique Madeira et al. “RIFLE: A general purpose pin-level fault injector”. In: Proc. of the European Dependable Computing Conference (EDCC). 1994. [83] Jackson Marusarz, Shannon Cepeda, and Ahmad Yasin. How to Tune Applica- tions Using a Top-Down Characterization of Microarchitectural Issues. Tech. rep. Technical report, Intel, 2013. [84] Abdelhafid Mazouz et al. “Evaluation of CPU Frequency Transition Latency”. In: Comput. Sci. 29.3-4 (2014). [85] Lawrence McAfee and Kunle Olukotun. “EMEURO: A Framework for Gen- erating Multi-purpose Accelerators via Deep Learning”. In: Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Op- timization. CGO ’15. Washington, DC, USA: IEEE Computer Society, 2015, pp. 125–135. ISBN: 978-1-4799-8161-8. URL: http://dl.acm.org/citation. cfm?id=2738600.2738616. [86] T. N. Miller et al. “VRSync: Characterizing and Eliminating Synchronization- induced Voltage Emergencies in Many-core Processors”. In: 2012 39th Annual International Symposium on Computer Architecture (ISCA). 2012, pp. 249–260. DOI: 10.1109/ISCA.2012.6237022. [87] Sasa Misailovic, Deokhwan Kim, and Martin Rinard. “Parallelizing Sequen- tial Programs with Statistical Accuracy Tests”. In: ACM Trans. Embed. Com- put. Syst. 12.2s (2013), 88:1–88:26. ISSN: 1539-9087. DOI: 10.1145/2465787. 2465790. URL: http://doi.acm.org/10.1145/2465787.2465790. [88] Sasa Misailovic et al. “Chisel: Reliability- and Accuracy-aware Optimization of Approximate Computational Kernels”. In: SIGPLAN Not. 49.10 (2014), pp. 309– 328. ISSN: 0362-1340. DOI: 10.1145/2714064.2660231. URL: http:// doi.acm.org/10.1145/2714064.2660231. [89] Sparsh Mittal. “A survey of techniques for approximate computing”. In: ACM Computing Surveys (CSUR) 48.4 (2016), p. 62.


[90] Debabrata Mohapatra, Georgios Karakonstantis, and Kaushik Roy. “Signif- icance Driven Computation: A Voltage-scalable, Variation-aware, Quality- tuning Motion Estimator”. In: Proceedings of the 2009 ACM/IEEE International Symposium on Low Power Electronics and Design. ISLPED ’09. San Fancisco, CA, USA: ACM, 2009, pp. 195–200. ISBN: 978-1-60558-684-7. DOI: 10.1145/ 1594233.1594282. URL: http://doi.acm.org/10.1145/1594233. 1594282. [91] G. E. Moore. “Cramming more components onto integrated circuits, Reprinted from Electronics, volume 38, number 8, April 19, 1965, pp.114 ff.” In: IEEE Solid-State Circuits Society Newsletter 11.3 (2006), pp. 33–35. ISSN: 1098-4232. DOI: 10.1109/N-SSC.2006.4785860. [92] Ramon E. Moore, R. Baker Kearfott, and Michael J. Cloud. Introduction to In- terval Analysis. 1st ed. Society for Industrial and Applied Mathematics, 2009. ISBN: 9780898716696. URL: http://amazon.com/o/ASIN/0898716691. [93] U. Mukhopadhyay et al. “A Brief Survey of Cryptocurrency Systems”. In: In Proceedings of 14th Annual Conference on Privacy, Security and Trust (PST). 2016, pp. 745–752. DOI: 10.1109/PST.2016.7906988. [94] G. Papadimitriou et al. “Voltage Margins Identification on Commercial x86- 64 Multicore Microprocessors”. In: 2017 IEEE 23rd Int, Symposium on On-Line Testing and Robust System Design (IOLTS). 2017. [95] George Papadimitriou et al. “Harnessing Voltage Margins for Energy Effi- ciency in Multicore CPUs”. In: (2017). [96] K. Parasyris et al. “GemFI: A Fault Injection Tool for Studying the Behavior of Applications on Unreliable Substrates”. In: Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP Int. Conference on. 2014. [97] Konstantinos Parasyris et al. “A Framework for Evaluating Software on Re- duced Margins Hardware”. In: Proceedings of the 2018 International Conference on Dependable Systems and Networks (DSN2018). [98] Konstantinos Parasyris et al. “Exploring the Effects of Code Optimizations on CPU Frequency Margins”. In: Proceedings of the 1st Workshop on Approximate and Transprecision Computing on Emerging Technologies, ATCET. 2018. [99] Konstantinos Parasyris et al. “Significance-Aware Program Execution on Un- reliable Hardware”. In: ACM Trans. Archit. Code Optim. 14.2 (2017), 12:1–12:25. ISSN: 1544-3566. DOI: 10.1145/3058980. URL: http://doi.acm.org/ 10.1145/3058980. [100] Fabian Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: J. Mach. Learn. Res. 12 (Nov. 2011), pp. 2825–2830. ISSN: 1532-4435. URL: http: //dl.acm.org/citation.cfm?id=1953048.2078195.

[101] Perf: Linux Profiling with Performance Counters. 2017. URL: https://perf.wiki.kernel.org/index.php/Main_Page.


[102] L-N Pouchet. PolyBench/C 3.2. URL: https://sourceforge.linuxKernelnet/ projects/polybench/. [103] Abbas Rahimi, Luca Benini, and Rajesh K Gupta. “Analysis of instruction- level vulnerability to dynamic voltage and temperature variations”. In: De- sign, Automation & Test in Europe Conference & Exhibition (DATE), 2012. 2012. [104] Abbas Rahimi et al. “A Variability-aware OpenMP Environment for Efficient Execution of Accuracy-configurable Computation on shared-FPU Processor Clusters”. In: Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. CODES+ISSS ’13. Piscat- away, NJ, USA: IEEE Press, 2013, 35:1–35:10. ISBN: 978-1-4799-1417-3. URL: http://dl.acm.org/citation.cfm?id=2555692.2555727. [105] Abbas Rahimi et al. “Variation-tolerant OpenMP Tasking on Tightly-coupled Processor Clusters”. In: Proceedings of the Conference on Design, Automation and Test in Europe. DATE ’13. San Jose, CA, USA: EDA Consortium, 2013, pp. 541– 546. ISBN: 978-1-4503-2153-2. URL: http://dl.acm.org/citation.cfm? id=2485288.2485422. [106] V. J. Reddi et al. “Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-Guided Thread Scheduling”. In: In Proceedings of 43rd Annual IEEE/ACM International Symposium on Microar- chitecture. 2010, pp. 77–88. DOI: 10.1109/MICRO.2010.35. [107] Vijay Janapa Reddi et al. “Voltage Emergency Prediction: Using Signatures to Reduce Operating Margins”. In: In Proceedings of the 15th International Sympo- sium on High Performance Computer Architecture, (HPCA). 2009, pp. 18–29. [108] Vijay Janapa Reddi et al. “Voltage smoothing: Characterizing and mitigating voltage noise in production processors via software-guided thread schedul- ing”. In: 2010 43rd Annual IEEE/ACM Int. Symposium on Microarchitecture. 2010. [109] Semeen Rehman et al. “Cross-layer software dependability on unreliable hard- ware”. In: IEEE Trans. on Computers (2016). [110] Semeen Rehman et al. “Reliable Software for Unreliable Hardware: Embed- ded Code Generation Aiming at Reliability”. In: Proc. of the 7th IEEE/ACM/I- FIP Int. Conference on Hardware/Software Codesign and System Synthesis. 2011. [111] George A. Reis et al. “SWIFT: Software Implemented Fault Tolerance”. In: Proceedings of the International Symposium on Code Generation and Optimization. CGO ’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 243–254. ISBN: 0-7695-2298-X. DOI: 10.1109/CGO.2005.34. URL: http://dx. doi.org/10.1109/CGO.2005.34. [112] Martin Rinard. “Probabilistic Accuracy Bounds for Fault-tolerant Computa- tions That Discard Tasks”. In: ICS ’06. ACM, 2006, pp. 324–334.


[113] Michael Ringenburg et al. “Monitoring and Debugging the Quality of Results in Approximate Programs”. In: ASPLOS ’15. ACM, 2015, pp. 399–411. [114] Pooja Roy et al. “ASAC: Automatic Sensitivity Analysis for Approximate Computing”. In: Proceedings of the 2014 SIGPLAN/SIGBED Conference on Lan- guages, Compilers and Tools for Embedded Systems. LCTES ’14. New York, NY, USA: ACM, 2014, pp. 95–104. ISBN: 978-1-4503-2877-7. DOI: 10.1145/2597809. 2597812. URL: http://doi.acm.org/10.1145/2597809.2597812. [115] rsta2. Circle: A C++ bare metal programming environment for the Raspberry PI. URL: https://github.com/rsta2/circle. [116] S. T. Kim and Y. C. Shih and K. Mazumdar and R. Jain and J. F. Ryan and C. Tokunaga and C. Augustine and J. P. Kulkarni and K. Ravichandran and J. W. Tschanz and M. M. Khellah and V. De. “Enabling Wide Autonomous DVFS in a 22 nm Graphics Execution Core Using a Digitally Controlled Fully Inte- grated Voltage Regulator”. In: IEEE Journal of Solid-State Circuits 51.1 (2016), pp. 18–30. ISSN: 0018-9200. DOI: 10.1109/JSSC.2015.2457920. [117] Mastooreh Salajegheh et al. “Half-Wits: Software Techniques for Low-Voltage Probabilistic Storage on Microcontrollers with NOR Flash Memory”. In: ACM Trans. Embed. Comput. Syst. 12.2s (May 2013), 91:1–91:25. ISSN: 1539-9087. DOI: 10.1145/2465787.2465793. URL: http://doi.acm.org/10.1145/ 2465787.2465793. [118] Mohammad Salehi et al. “DRVS: Power-efficient reliability management through Dynamic Redundancy and Voltage Scaling under variations”. In: Proc. of the IEEE/ACM Int. Symposium on Low Power Electronics and Design. 2015. [119] Mehrzad Samadi et al. “Paraprox: Pattern-based Approximation for Data Parallel Applications”. In: Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems. ASP- LOS ’14. New York, NY, USA: ACM, 2014, pp. 35–50. ISBN: 978-1-4503-2305-5. DOI: 10.1145/2541940.2541948. URL: http://doi.acm.org/10. 1145/2541940.2541948. [120] Mehrzad Samadi et al. “SAGE: Self-tuning Approximation for Graphics En- gines”. In: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-46. New York, NY, USA: ACM, 2013, pp. 13–24. ISBN: 978-1-4503-2638-4. DOI: 10.1145/2540708.2540711. URL: http: //doi.acm.org/10.1145/2540708.2540711. [121] Adrian Sampson et al. “EnerJ: Approximate Data Types for Safe and General Low-power Computation”. In: Proceedings of the 32Nd ACM SIGPLAN Confer- ence on Programming Language Design and Implementation. PLDI ’11. New York, NY, USA: ACM, 2011, pp. 164–174. ISBN: 978-1-4503-0663-8. DOI: 10.1145/ 1993498.1993518. URL: http://doi.acm.org/10.1145/1993498. 1993518.


[122] Smruti R. Sarangi et al. “EVAL: Utilizing Processors with Variation-induced Timing Errors”. In: In Proceedings of the 41st Annual International Symposium on Microarchitecture (MICRO). 2008, pp. 423–434. [123] Florian Schmoll et al. “Improving the Fault Resilience of an H.264 Decoder Using Static Analysis Methods”. In: ACM Trans. Embed. Comput. Syst. 13.1s (2013), 31:1–31:27. ISSN: 1539-9087. DOI: 10.1145/2536747.2536753. URL: http://doi.acm.org/10.1145/2536747.2536753. [124] P. Shivakumar et al. “Modeling the effect of technology trends on the soft error rate of combinational logic”. In: Proceedings International Conference on Dependable Systems and Networks. 2002, pp. 389–398. DOI: 10.1109/DSN. 2002.1028924. [125] Stelios Sidiroglou-Douskos et al. “Managing Performance vs. Accuracy Trade- offs with Loop Perforation”. In: Proceedings of the 19th ACM SIGSOFT Sym- posium and the 13th European Conference on Foundations of Software Engineer- ing. ESEC/FSE ’11. ACM, 2011, pp. 124–134. ISBN: 978-1-4503-0443-6. DOI: 10.1145/2025113.2025133. URL: http://doi.acm.org/10.1145/ 2025113.2025133. [126] Volkmar Sieh, Oliver Tschäche, and Frank Balbach. “VERIFY: Evaluation of Reliability Using VHDL-Models with Embedded Fault Descriptions”. In: Proc. of the Symposium on Fault-Tolerant Computing (FTCS). 1997. [127] Teja Singh et al. “Zen: An Energy-Efficient High-Performance \×86 Core”. In: IEEE Journal of Solid-State Circuits 53.1 (2018), pp. 102–114. [128] Filippo Sironi et al. “Metronome: Operating System Level Performance Man- agement via Self-adaptive Computing”. In: Proceedings of the 49th Annual De- sign Automation Conference. DAC ’12. New York, NY, USA: ACM, 2012, pp. 856– 865. ISBN: 978-1-4503-1199-1. DOI: 10.1145/2228360.2228514. URL: http: //doi.acm.org/10.1145/2228360.2228514. [129] Athanassios Skodras, Charilaos Christopoulos, and Touradj Ebrahimi. “The JPEG 2000 still Image Compression Standard”. In: Signal Processing Magazine, IEEE 18.5 (Sept. 2001), pp. 36–58. DOI: 10.1109/79.952804. URL: http: //doi.org/10.1109/79.952804. [130] Joseph Sloan, John Sartori, and Rakesh Kumar. “On Software Design for Stochas- tic Processors”. In: Proceedings of the 49th Annual Design Automation Confer- ence. DAC ’12. New York, NY, USA: ACM, 2012, pp. 918–923. ISBN: 978-1- 4503-1199-1. DOI: 10.1145/2228360.2228524. URL: http://doi.acm. org/10.1145/2228360.2228524. [131] Srinath Sridharan, Gagan Gupta, and Gurindar S. Sohi. “Adaptive, Efficient, Parallel Execution of Parallel Programs”. In: Proceedings of the 35th ACM SIG- PLAN Conference on Programming Language Design and Implementation. PLDI ’14. New York, NY, USA: ACM, 2014, pp. 169–180. ISBN: 978-1-4503-2784-8.


DOI: 10.1145/2594291.2594292. URL: http://doi.acm.org/10. 1145/2594291.2594292.

[132] Stress-NG. URL: http://kernel.ubuntu.com/~cking/stress-ng/. [133] Adrian Tang, Simha Sethumadhavan, and Salvatore Stolfo. “CLKSCREW: Exposing the Perils of Security-Oblivious Energy Management”. In: In Pro- ceedings of 26th Security Symposium (USENIX Security). 2017. [134] Abhishek Tiwari, Smruti R. Sarangi, and Josep Torrellas. “ReCycle: Pipeline Adaptation to Tolerate Process Variation”. In: In Proceedings of the 34th Inter- national Symposium on Computer Architecture, (ISCA). 2007, pp. 323–334. [135] Jan Treibig, Georg Hager, and Gerhard Wellein. “LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments”. In: Proc. of the 2010 39th Int. Conference on Parallel Processing Workshops. 2010. [136] George Tzenakis et al. “BDDT:: Block-level Dynamic Dependence Analysis for Deterministic Task-based Parallelism”. In: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP ’12. New York, NY, USA: ACM, 2012, pp. 301–302. ISBN: 978-1-4503-1160-1. DOI: 10.1145/2145816.2145864. URL: http://doi.acm.org/10. 1145/2145816.2145864. [137] G. Tziantzioulis et al. “b-HiVE: A Bit-level History-based Error Model with Value Correlation for Voltage-scaled Integer and Floating Point Units”. In: Proc. of the 52Nd Annual Design Automation Conference. 2015. [138] Vassilis Vassiliadis et al. “A programming model and runtime system for significance-aware energy-efficient computing”. In: ACM SIGPLAN Notices. Vol. 50. 8. ACM. 2015, pp. 275–276. [139] Vassilis Vassiliadis et al. “A significance-driven programming framework for energy-constrained approximate computing”. In: Proceedings of the 12th ACM International Conference on Computing Frontiers. ACM. 2015, p. 9. [140] Vassilis Vassiliadis et al. “Exploiting significance of computations and profile- driven regression for energy-constrained approximate computing”. In: Inter- national Journal of Parallel Programming 44.5 (2016), pp. 1078–1098. [141] Vassilis Vassiliadis et al. “Towards Automatic Significance Analysis for Ap- proximate Computing”. In: Proceedings of the 2016 International Symposium on Code Generation and Optimization. CGO 2016. New York, NY, USA: ACM, 2016, pp. 182–193. ISBN: 978-1-4503-3778-6. DOI: 10.1145/2854038.2854058. URL: http://doi.acm.org/10.1145/2854038.2854058. [142] Swagath Venkataramani et al. “Quality Programmable Vector Processors for Approximate Computing”. In: Proceedings of the 46th Annual IEEE/ACM In- ternational Symposium on Microarchitecture. MICRO-46. New York, NY, USA: ACM, 2013, pp. 1–12. ISBN: 978-1-4503-2638-4. DOI: 10.1145/2540708. 2540710. URL: http://doi.acm.org/10.1145/2540708.2540710.


[143] Stephen T Welstead. Fractal and wavelet image compression techniques. SPIE Op- tical Engineering Press, 1999. ISBN: 9780819435033. [144] C. Wilkerson et al. “Trading off Cache Capacity for Reliability to Enable Low Voltage Operation”. In: In Proceedings of the 35th International Symposium on Computer Architecture (ISCA). 2008, pp. 203–214. DOI: 10.1109/ISCA.2008. 22. [145] G. Woltman and S Kurowski. GIMPS, The Great Internet Mersenne Prime Search. 2008. URL: https://www.mersenne.org/. [146] Amir Yazdanbakhsh et al. “AXBENCH: A Multi-Platform Benchmark Suite for Approximate Computing”. In: IEEE Design & Test (2016). [147] Charles R Yount and Daniel P Siewiorek. “A methodology for the rapid in- jection of transient hardware errors”. In: IEEE Trans. on Computers (1996). [148] Foivos S. Zakkak et al. Inference and Declaration of Independence: Impact on Deterministic Task Parallelism. New York, NY, USA, 2012. DOI: 10 . 1145 / 2370816.2370892. URL: http://doi.acm.org/10.1145/2370816. 2370892. [149] Qian Zhang et al. “ApproxIt: An Approximate Computing Framework for Iterative Methods”. In: Proceedings of the The 51st Annual Design Automation Conference on Design Automation Conference. DAC ’14. New York, NY, USA: ACM, 2014, 97:1–97:6. ISBN: 978-1-4503-2730-5. DOI: 10.1145/2593069. 2593092. URL: http://doi.acm.org/10.1145/2593069.2593092. [150] Zeyuan Allen Zhu et al. “Randomized Accuracy-aware Program Transfor- mations for Efficient Approximate Computations”. In: Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Lan- guages. POPL ’12. New York, NY, USA: ACM, 2012, pp. 441–454. ISBN: 978-1- 4503-1083-3. DOI: 10.1145/2103656.2103710. URL: http://doi.acm. org/10.1145/2103656.2103710. [151] Y. Zu et al. “Adaptive Guardband Scheduling to Improve System-Level Effi- ciency of the POWER7”. In: In Proceedings of the 48th Annual IEEE/ACM Inter- national Symposium on Microarchitecture (MICRO). 2015, pp. 308–321.
