Cognitive Computing Safety: The New Horizon for Reliability

Expert Opinion

Yuhao Zhu and Vijay Janapa Reddi, University of Texas at Austin

Published by the IEEE Computer Society. 0272-1732/17/$33.00 © 2017 IEEE. January/February 2017.

Editor's note: We invited some domain experts to discuss the opportunities and challenges surrounding cognitive architectures. The following are their views on the topic.
-- Alper Buyuktosunoglu and Pradip Bose

Recent advances in cognitive computing have brought widespread excitement for various machine learning-based intelligent services, ranging from autonomous vehicles to smart traffic-light systems. To push such cognitive services closer to reality, recent research has focused extensively on improving the performance, energy efficiency, privacy, and security of cognitive computing platforms.

Among all the issues, a rapidly rising and critical challenge to address is the practice of safe cognitive computing: that is, how to architect machine learning-based systems to be robust against uncertainty and failure, to guarantee that they perform as intended without causing harmful behavior. Addressing the safety issue will involve close collaboration among different computing communities, and we believe computer architects must play a key role. In this position paper, we first discuss the meaning of safety and the severe implications of the safety issue in cognitive computing. We then provide a framework to reason about safety, and we outline several opportunities for the architecture community to help make cognitive computing safer.

Safety in Cognitive Computing Systems

Safety in cognitive computing is inherently associated with the accuracy issue in machine learning. Just as with any system-level behavior, accuracy can be classified into two categories: average-case and worst-case accuracy. The latter determines a cognitive computing system's safety. Improving the worst-case accuracy, however, is hard due to the lack of interpretability in machine learning. Machine learning algorithms are inherently stochastic approximations of the ground truth.

Meaning of Safety

Most machine learning systems today focus on the average-case accuracy. In mature machine learning-based application domains, such as image classification, the average accuracy of well-trained machine learning models could be 99 percent. However, just as datacenter systems suffer from the long-tail latency issue,[1] machine learning systems suffer from tail accuracy, in which a few requests exhibit poor accuracy due to uncertainties in the machine learning models, and form a long accuracy tail distribution that is hard to curtail.

Tail accuracy leads to poor worst-case accuracy guarantees. In mission-critical systems, poor worst-case accuracy is known to raise serious safety concerns. For example, the autopilot system in a self-driving car might work ideally for millions of miles, but it takes only one life-threatening incident to cause significant distrust in such autonomous systems.[2]

Worst-Case Accuracy

The issue of tolerating worst-case accuracy is not unique to the cognitive computing domain. For instance, in the aviation industry, engineers constantly address the issue of safety. But there is a notable difference between cognitive systems and today's prevalent aviation systems that makes it fundamentally hard to guarantee safety in cognitive computing: system interpretability.

System interpretability means the ability to rationalize and identify the root cause of a system failure. From time to time, the aviation industry suffers from mid-air tragedies, but for every such tragedy, investigators can understand what exactly went wrong, how to fix it, and, most importantly, how to prevent such incidents from happening again in the future. This is because flight control is based on interpretable control theory, physics, and aerodynamics. For example, in the case of Air France Flight 447, investigators were able to accurately attribute the crash to an aerodynamic stall, which was then attributed to the mishandling of the pitot tubes being obstructed by ice crystals. The whole accident was a pivot point for the civil aviation industry, leading to an overhaul of how measurement devices are installed and how pilots are trained to handle instrument anomalies.[3]

In contrast, the same ability to find an incident's root cause and learn from it does not exist for cognitive systems because, at least at present, we lack a deep understanding of how and why a machine learning model works in the first place. The lack of system interpretability will only worsen as we begin to compose many such systems together, effectively increasing the system's opaqueness. If we cannot explain how each component works, we cannot attribute the root cause of a whole system failure to a particular component. For example, it is unclear how we would improve a self-driving car (a multicomponent system involving several decision-making stages) to prevent even the same incident that led to a particular crash from recurring.

Bridging the Safety Gap

Safety in cognitive computing systems is a multidisciplinary problem. However, computer architects are no strangers to building safe systems. We commonly refer to it as "reliability." For decades, the computer architecture community has managed to build highly resilient and fault-tolerant architectures. For instance, we allocate a large operating guardband, or margin, to tolerate process, thermal, and voltage variations.[4] We have also established fundamental redundancy-based techniques, such as triple modular redundancy (TMR) and instruction duplication, to detect and recover from errors.[5]

Resilient architectures and safe cognitive systems share a similar goal: guaranteeing expected system behaviors in the event of a failure. Therefore, architects have a unique opportunity to transfer the experience of building resilient architectures to building safe cognitive computing architectures, and to design new solutions to overcome new safety challenges. To that end, we introduce the notion of the safety gap, a framework for computer architects to reason about the safety issue in cognitive computing.

The Safety Gap

A cognitive system's safety gap refers to the gap between the worst-case accuracy guarantee that an oracle system demands and the actual accuracy that a particular implementation provides. The safety gap is the composition of two gaps: learning and execution (see Figure 1).

[Figure 1. The safety gap in cognitive computing. The learning gap separates the oracle model from the best-trained machine learning model; the execution gap separates the best-trained machine learning model from the hardware-accelerated machine learning model. Together they form the safety gap.]

The learning gap refers to the gap between the oracle and the best-trained machine learning model. The learning gap exists because even the best machine learning model is likely not a perfect representation of the objective reality, and thus introduces error in certain scenarios. For instance, an image classifier mistakenly recognizing a cat in an image as a dog is the result of the learning gap. The learning gap is typically introduced during the model training stage.

Using reliability lingo, errors introduced by the learning gap can be regarded as a form of permanent (as opposed to transient) faults caused by design bugs. Therefore, they are not amenable to traditional backward error-recovery techniques (such as checkpointing), nor to forward error-recovery techniques (such as TMR or execution duplication[6]).

The execution gap, on the other hand, is introduced during the model execution (inference) stage. The execution gap is introduced in two forms. First, to improve the performance and energy efficiency of model inference, hardware architects often build specialized accelerators that rely on various optimizations, such as weight compression, data quantization, and static RAM fault injection.[7,8] These optimizations are unsafe because they intentionally trade accuracy for execution efficiency. Second, model execution could also suffer from traditional reliability emergencies, such as memory failures, circuit defects, and real-time violations.[9,15] Overall, hardware execution introduces additional sources of error, further degrading the worst-case accuracy.

To improve the safety of cognitive computing, computer architects should have two objectives. First, we must ensure that in building hardware accelerators, we do not introduce an execution gap, while still providing the performance and energy benefits of hardware specialization. Second, we must provide vehicles that help improve model learning accuracy and thereby minimize the learning gap.

To that end, we discuss a few potential directions under both objectives to foster research in this emerging and important topic. Broadly, our suggestions involve tools, architecture-level design, and principles that dictate accountable system operation.
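The safety-gap framework above can be illustrated with a small sketch. This is not code from the article: the function names, the toy classifier, and the accuracy figures (99, 90, and 85 percent) are hypothetical and chosen only for illustration. The first function treats the safety gap as the sum of the learning and execution gaps, using worst-case accuracies in whole percentage points. The remaining functions show why a redundancy scheme such as TMR can mask a transient fault in one replica but cannot correct a learning-gap error, since all three replicas run the same model and agree on the same wrong answer.

```python
from collections import Counter


def safety_gap(oracle_pct, trained_pct, deployed_pct):
    """Split the safety gap into its learning and execution parts.

    All arguments are worst-case accuracies in whole percentage points
    (hypothetical values, not measurements from the article).
    """
    learning_gap = oracle_pct - trained_pct      # oracle vs. best-trained model
    execution_gap = trained_pct - deployed_pct   # best-trained vs. accelerated model
    return learning_gap, execution_gap, learning_gap + execution_gap


def toy_classifier(image):
    """Hypothetical trained model; its 'tabby_cat' entry is a learning-gap error."""
    learned = {"siamese_cat": "cat", "beagle": "dog", "tabby_cat": "dog"}
    return learned[image]


def run_replicas(image, faulty_replica=None):
    """Run three identical replicas; optionally corrupt one to mimic a transient fault."""
    outputs = [toy_classifier(image) for _ in range(3)]
    if faulty_replica is not None:
        outputs[faulty_replica] = "corrupted"
    return outputs


def tmr_vote(outputs):
    """Return the label that at least two of the three replicas agree on."""
    label, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority among replicas")
    return label


if __name__ == "__main__":
    print(safety_gap(99, 90, 85))                                  # (9, 5, 14)
    print(tmr_vote(run_replicas("siamese_cat", faulty_replica=1)))  # cat: fault masked
    print(tmr_vote(run_replicas("tabby_cat")))                      # dog: wrong, yet unanimous
```

In this toy setting the vote recovers the correct label when one replica is corrupted, but it returns "dog" for the tabby cat because the error lives in the model itself, which matches the article's point that learning-gap errors behave like permanent design faults.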
Avoiding the Execution Gap

We must build ...

... and speed up the process of closing the ...

... research in building safe cognitive ...
