Reliability Analysis of Artificial Intelligence Systems Using Recurrent Events Data from Autonomous Vehicles

Yili Hong1, Jie Min1, Caleb B. King2, and William Q. Meeker3
1Department of Statistics, Virginia Tech, Blacksburg, VA 24061
2JMP Division, SAS, Cary, NC 27513
3Department of Statistics, Iowa State University, Ames, IA 50011

arXiv:2102.01740v1 [cs.AI] 2 Feb 2021

Abstract

Artificial intelligence (AI) systems have become increasingly common and the trend will continue. Examples of AI systems include autonomous vehicles (AV), computer vision, natural language processing, and AI medical experts. To allow for safe and effective deployment of AI systems, the reliability of such systems needs to be assessed. Traditionally, reliability assessment is based on reliability test data and the subsequent statistical modeling and analysis. The availability of reliability data for AI systems, however, is limited because such data are typically sensitive and proprietary. The California Department of Motor Vehicles (DMV) oversees and regulates an AV testing program, in which many AV manufacturers are conducting AV road tests. Manufacturers participating in the program are required to report recurrent disengagement events to the California DMV, and this information is made available to the public. In this paper, we use recurrent disengagement events as a representation of the reliability of the AI system in AV, and propose a statistical framework for modeling and analyzing the recurrent events data from AV driving tests. We use traditional parametric models in software reliability and propose a new nonparametric model based on monotonic splines to describe the event process. We develop inference procedures for selecting the best models, quantifying uncertainty, and testing heterogeneity in the event process. We then analyze the recurrent events data from four AV manufacturers, and make inferences on the reliability of the AI systems in AV. We also describe how the proposed analysis can be applied to assess the reliability of other AI systems.

Key Words: Disengagement Events; Fractional Random Weight Bootstrap; Gompertz Model; Monotonic Splines; Software Reliability; Self-driving Cars.

1 Introduction

1.1 The Problem

With the rapid development of new technology, artificial intelligence (AI) systems are emerging in many areas. Typical applications of AI systems include autonomous vehicles (AV), computer vision, speech recognition, and AI medical experts. The reliability and safety of AI systems need to be assessed before massive deployment in the field. For example, the reliability of AV needs to be demonstrated so that people can use them with confidence.

Traditionally, reliability assessment of products and systems is based on reliability test data collected from the laboratory and the field. Reliability information is then extracted through statistical modeling and analysis of the data. Commonly used data types for reliability analysis are time-to-event data, degradation data, and recurrent events data. Reliability data collected by manufacturers are highly sensitive and are usually not publicly available. In the area of AV, however, the California Department of Motor Vehicles (DMV) has launched an AV driving program. More details of the study are given in Section 2. Many AV manufacturers are participating in the program and are conducting AV road tests in California. Manufacturers are required to report disengagement events and mileage information to the California DMV.
The reported events are available for public access. Because of the availability of these recurrent events data, we focus on the reliability analysis of the AI systems in AV in this paper. A disengagement event happens when the AI system and/or the backup driver determines that the human driver needs to take over the driving. The recurrence rates of disengagement events can be used as a proxy for the reliability of the AI system in AV: a lower occurrence rate (event rate) of disengagement events indicates a more reliable AI system in the AV. In the reliability literature, parametric forms have been used to describe the event rate through a nonhomogeneous Poisson process (NHPP) model. In practice, some specific questions arise in the analysis of the recurrent disengagement events data:

• How should the event process be modeled, and what kinds of parametric forms should be used?
• Does the parametric form provide an adequate fit to the data, and are there other, more flexible forms for modeling?
• Is there any population heterogeneity in the event processes from multiple test units?

To answer those practical questions, we develop a statistical framework for modeling and analyzing the recurrent events data from AV driving tests. Specifically, we use the NHPP model to describe the disengagement event process, with adjustment for the time-varying mileage data, using traditional parametric models from software reliability for the intensity functions (an illustrative NHPP formulation is sketched below). We also propose a new nonparametric model based on monotonic splines to describe the event process. The spline model is flexible enough to fit various patterns in the event process and can also be used to assess whether the fully parametric model provides an adequate fit. We develop inference procedures for selecting the best models and quantifying uncertainty using the fractional-random-weight bootstrap. The parametric models and spline models can be used as complementary tools. In addition, we use the gamma frailty model to quantify and assess heterogeneity in the event process.

From the California driving study, we use data from four manufacturers that have been conducting extensive AV driving tests in California. We apply the developed methods to analyze the recurrent events data from the four AV manufacturers. Based on the modeling, we make inferences on the reliability of the AI systems in AV and summarize interesting findings. Although our analysis focuses on the AI systems in AV, the statistical analysis can also be applied to assess the reliability of other AI systems.

1.2 Literature Review

There is only a small amount of literature on reliability analysis of AI systems. Amodei et al. (2016) provided a general discussion on AI safety and outlined five concrete areas for AI safety research. Bosio et al. (2019) conducted a reliability analysis of deep convolutional neural networks (CNNs) developed for automotive applications using fault injection. Goldstein et al. (2020) investigated the impact of transient faults on the reliability of compressed deep CNNs. Zhao et al. (2020) proposed a safety framework based on Bayesian inference for critical systems using deep learning models. Alshemali and Kalita (2020) provided a review on improving the reliability of natural language processing. Due to the limited availability of test data, statistical analysis of AI reliability and safety is still at an emerging stage.
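For concreteness, the following is a minimal sketch of the NHPP formulation referenced in Section 1.1; the notation is introduced here for illustration, is not taken from the paper, and omits the paper's adjustment for time-varying mileage. Writing $N(t)$ for the cumulative number of disengagement events by test time $t$, $\lambda(t)$ for the event intensity, and $\Lambda(t)$ for the mean cumulative function,
\[
  N(t) \sim \mathrm{Poisson}\{\Lambda(t)\}, \qquad \Lambda(t) = \int_0^t \lambda(u)\,du .
\]
One common parametric choice from software reliability, in one standard parameterization, is the Musa-Okumoto (logarithmic Poisson) form
\[
  \Lambda(t;\beta_0,\beta_1) = \beta_0 \log(1+\beta_1 t), \qquad
  \lambda(t;\beta_0,\beta_1) = \frac{\beta_0\beta_1}{1+\beta_1 t},
\]
whose decreasing intensity represents reliability growth: a lower intensity at later test times corresponds to a more reliable AI system, which is the sense in which the disengagement event rate serves as a reliability proxy.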
As examples of modeling the reliability and safety of AV, Kalra and Paddock (2016) used a statistical hypothesis testing approach to determine the miles of driving needed to demonstrate AV safety. Åsljung, Nilsson, and Fredriksson (2017) used extreme value theory to model the safety of AV. Michelmore et al. (2019) designed a statistical framework to evaluate the safety of deep neural network controllers and assessed the safety of AV. Burton et al. (2020) provided a multi-disciplinary perspective on the safety of AV from the engineering, ethics, and law aspects. Most existing modeling frameworks, however, do not involve large-scale field-testing data, which are essential for reliability assessment. The California driving test data provide unique opportunities for data analysis.

Regarding the analysis of the California driving data, Dixit, Chand, and Nair (2016) and Favarò, Eurich, and Nader (2018) analyzed the causes of disengagement events using the California driving test data up to 2017 and showed the relationship between disengagement events per mile and cumulative miles. Lv et al. (2018) performed a descriptive analysis of the causes of disengagement events using the California driving test data from 2014 to 2015, and concluded that software issues and limitations were the most common reasons for disengagement events. Banerjee et al. (2018) used linear regressions to model the relationship between disengagements per mile and cumulative miles using the data from 2016 to 2017. Merkel (2018) conducted data analysis on the California driving data from 2015 to 2017 with aggregated count data and least-squares fits. Zhao et al. (2019) proposed a general conservative Bayesian inference method to estimate the rate of events (crashes and fatalities) and illustrated it with the California driving test data. Boggs, Wali, and Khattak (2020) conducted an exploratory analysis of AV crashes from the California driving study. So far, there has been no comprehensive statistical treatment of AI reliability, and especially of AV reliability. Starting in 2018, the exact disengagement event times can be extracted from the California DMV reports, which makes it possible to apply recurrent events modeling techniques.

In the reliability literature, NHPP models are widely used to analyze recurrent events. Zuo, Meeker, and Wu (2008), and Hong, Li, and Osborn (2015) analyzed recurrent events data with window observations, which have data types similar to the disengagement events data from the California driving study. Parametric models, such as the Musa-Okumoto model (e.g., Musa and Okumoto 1984) and the