Reliability Analysis of Artificial Intelligence Systems Using Recurrent Events Data from Autonomous Vehicles

Yili Hong1, Jie Min1, Caleb B. King2, and William Q. Meeker3

1 Department of Statistics, Virginia Tech, Blacksburg, VA 24061
2 JMP Division, SAS, Cary, NC 27513
3 Department of Statistics, Iowa State University, Ames, IA 50011

Abstract

Artificial intelligence (AI) systems have become increasingly common and the trend will continue. Examples of AI systems include autonomous vehicles (AV), computer vision, natural language processing, and AI medical experts. To allow for safe and effective deployment of AI systems, the reliability of such systems needs to be assessed. Traditionally, reliability assessment is based on reliability test data and the subsequent statistical modeling and analysis. The availability of reliability data for AI systems, however, is limited because such data are typically sensitive and proprietary. The California Department of Motor Vehicles (DMV) oversees and regulates an AV testing program, in which many AV manufacturers are conducting AV road tests. Manufacturers participating in the program are required to report recurrent disengagement events to the California DMV. This information is being made available to the public. In this paper, we use recurrent disengagement events as a representation of the reliability of the AI system in AV, and propose a statistical framework for modeling and analyzing the recurrent events data

from AV driving tests. We use traditional parametric models in software reliability and propose a new nonparametric model based on monotonic splines to describe the event process. We develop inference procedures for selecting the best models, quantifying uncertainty, and testing heterogeneity in the event process. We then analyze the recurrent events data from four AV manufacturers, and make inferences on the reliability of the AI systems in AV. We also describe how the proposed analysis can be applied to assess the reliability of other AI systems.

Key Words: Disengagement Events; Fractional Random Weight Bootstrap; Gompertz Model; Monotonic Splines; Software Reliability; Self-driving.

1 Introduction

1.1 The Problem

With the rapid development of new technology, artificial intelligence (AI) systems are emerging in many areas. Typical applications of AI systems include autonomous vehicles (AV), speech recognition, and AI medical experts. The reliability and safety of AI systems need to be assessed before massive deployment in the field. For example, the reliability of AV needs to be demonstrated so that people can use them with confidence. Traditionally, reliability assessment of products and systems is based on reliability test data collected from the laboratory and the field. Reliability information is then extracted from statistical modeling and analysis of the data. Commonly used data types for reliability analysis are time-to-event data, degradation data, and recurrent events data. Reliability data collected by manufacturers are highly sensitive and are usually not publicly available. In the area of AV, however, the California Department of Motor Vehicles (DMV) has launched an AV driving program. More details of the study are given in Section 2. Many AV manufacturers are participating in the program and are conducting AV road tests in California. Manufacturers are required to report disengagement events and mileage information to the California DMV. The reported events are available for public access. Because of the availability of these recurrent events data, we focus on the reliability analysis of the AI systems in AV units in this paper.

A disengagement event happens when the AI system and/or the backup driver determines that the driver needs to take over the driving. The recurrence rate of disengagement events can be used as a proxy for the reliability of the AI system in AV: a lower occurrence rate (event rate) of disengagement events indicates a more reliable AI system in the AV. In the reliability literature, parametric forms have been used to describe the event rate through a nonhomogeneous Poisson process (NHPP) model. In practice, some specific questions arise in the analysis of the recurrent disengagement events data:

• How to model the event process, and what kind of parametric forms should be used?

• Does the parametric form provide an adequate fit to the data, and are there any other flexible forms for modeling?

• Is there any population heterogeneity in the event processes from multiple test units?

To answer those practical questions, we develop a statistical framework for modeling and analyzing the recurrent events data from AV driving tests. Specifically, we use the NHPP model to describe the disengagement event process, with adjustment for the time-varying mileage data, using traditional parametric models in software reliability for the intensity functions. We also propose a new nonparametric model based on monotonic splines to describe the event process.

The spline model is flexible enough to fit various patterns in the event process and can also be used to assess whether the fully parametric model provides an adequate fit. We develop inference procedures for selecting the best models and quantifying uncertainty using the fractional-random-weight bootstrap. The parametric models and spline models can be used as complementary tools. In addition, we use the gamma frailty model to quantify and assess heterogeneity in the event process.

From the California driving study, we use data from four manufacturers that have been conducting extensive AV driving tests in California. We apply the developed methods to analyze the recurrent events data from the four AV manufacturers. Based on the modeling, we make inferences on the reliability of the AI systems in AV and summarize interesting findings. Although our analysis focuses on the AI systems in AV, the statistical analysis can also be applied to assess the reliability of other AI systems.

1.2 Literature Review

There is only a small amount of literature on reliability analysis of AI systems. Amodei et al. (2016) provided a general discussion on AI safety and outlined five concrete areas for AI safety research. Bosio et al. (2019) conducted a reliability analysis of a deep convolutional neural network (CNN) developed for automotive applications using fault injections. Goldstein et al. (2020) investigated the impact of transient faults on the reliability of compressed deep CNN. Zhao et al. (2020) proposed a safety framework based on Bayesian inference for critical systems using deep learning models. Alshemali and Kalita (2020) provided a review on improving the reliability of natural language processing. Due to the limited availability of test data, statistical analysis of AI reliability and safety is still in an emerging stage.

As an example of the modeling of the reliability and safety of AV, Kalra and Paddock (2016) used a statistical hypothesis testing approach to determine the miles of driving needed to demonstrate AV safety. Åsljung, Nilsson, and Fredriksson (2017) used extreme value theory to model the safety of AV. Michelmore et al. (2019) designed a statistical framework to evaluate the safety of deep neural network controllers and assessed the safety of AV. Burton et al. (2020) provided a multi-disciplinary perspective on the safety of AV from engineering, ethics, and law aspects. Most existing modeling frameworks, however, do not involve large-scale field-testing data, which are essential for reliability assessment. The California driving test data provide unique opportunities for data analysis.

Regarding the analysis of the California driving data, Dixit, Chand, and Nair (2016) and Favarò, Eurich, and Nader (2018) analyzed the causes of disengagement events using the California driving test data up to 2017 and showed the relationship between disengagement events per mile

and cumulative miles. Lv et al. (2018) performed a descriptive analysis of the causes of disengagement events using the California driving test data from 2014 to 2015, and concluded that software issues and limitations were the most common reasons for disengagement events. Banerjee et al. (2018) used linear regressions to model the relationship between disengagements per mile and cumulative miles using the data from 2016 to 2017. Merkel (2018) conducted data analysis on the California driving data from 2015 to 2017 with aggregated counts data and least-squares fits. Zhao et al. (2019) proposed a general conservative Bayesian inference method to estimate the rate of events (crashes and fatalities) and illustrated it with the California driving test data. Boggs, Wali, and Khattak (2020) did an exploratory analysis of AV crashes from the California driving study. So far, there is no comprehensive statistical treatment of the analysis of AI reliability, and especially of AV reliability. Starting in 2018, the exact disengagement event times can be extracted from the California DMV reports, which makes it possible to apply recurrent events modeling techniques.

In the reliability literature, NHPP models are widely used to analyze recurrent events. Zuo, Meeker, and Wu (2008), and Hong, Li, and Osborn (2015) analyzed recurrent events data with window observations, which have a data structure similar to that of the disengagement events data from the California driving study. Parametric models, such as the Musa-Okumoto model (e.g., Musa and Okumoto 1984) and the Gompertz model (e.g., Huang, Lyu, and Kuo 2003), are commonly used in software reliability applications (e.g., Wood 1996). Ehrlich et al. (1998) used accelerated testing methods to study software reliability. Burke, Jones, and Noufaily (2020) proposed flexible parametric models for time-to-event data analysis. Useful reference books on reliability data analysis for researchers in the AI reliability area include Meeker and Escobar (1998), and Lawless (2003). Overall, parametric models are common in analyzing recurrent events data in the context of reliability studies.

We propose to use monotonic splines (Ramsay 1988, and Meyer 2008) as a nonparametric method to model the event process. Although monotonic splines are used in some degradation settings (Xie et al. 2018), the application of monotonic splines to recurrent events in reliability is new. To model population heterogeneity, the gamma frailty model is used. An early use of the gamma frailty model in reliability is found in Lawless (1995). More recently, Shan, Hong, and Meeker (2020) used the gamma frailty model to describe seasonal warranty return data. Duchateau and Janssen (2008) provide a comprehensive review of frailty models.

Due to the complicated structure of the window-observed recurrent events data in our study, we use the fractional-random-weight bootstrap as a convenient way to generate bootstrap samples for statistical inference. The idea of the fractional-random-weight bootstrap was introduced in Rubin (1981), and some theoretical properties are shown in Jin, Ying, and Wei (2001). Xu et al. (2020) demonstrated the use of the fractional-random-weight bootstrap in many complex applications in reliability, survival analysis, and logistic regression. A simultaneous confidence

band (SCB) can be used to assess whether a parametric model is adequate for fitting the data. Hong, Escobar, and Meeker (2010) showed that an SCB can be obtained from a simultaneous confidence region (SCR) for the parameters. In the context of the bootstrap, however, it is not straightforward to construct an SCR for parameter estimators with multiple dimensions. Hence, we use the idea of the equal-precision band in Nair (1984) and use bootstrap samples to calibrate pointwise confidence intervals so that they provide an SCB that quantifies statistical uncertainty.

In summary, we provide a general analytic framework by integrating existing methods and proposing new methods for reliability analysis of data from an AI study. The parametric and nonparametric models, and the statistical interval and testing procedures, will be useful tools for practitioners working in the area of AI reliability.

1.3 Overview

The rest of the paper is organized as follows. Section 2 describes the California autonomous vehicle driving study and introduces the data. Section 3 describes the parametric model, the spline model, and the gamma frailty model that are used to describe the recurrent disengagement events data. Section 4 describes the parameter estimation procedures and the inference procedures for the various models. Section 5 conducts a simulation study to show the statistical performance of the estimation procedures. Section 6 conducts the data analysis and summarizes interesting findings. Section 7 contains some concluding remarks and areas for future research.

2 The California Autonomous Vehicles Driving Study

2.1 The Study

This paper presents reliability modeling and analysis of autonomous vehicles (AV) using data from the California Department of Motor Vehicles (DMV) autonomous vehicle tester program, which has been in operation since 2014. The tester program allows manufacturers to test AV on California public roads with a human in the driver seat who can take control of the vehicle if necessary. As of July 1, 2020, 62 manufacturers were permitted to perform AV driving tests. Manufacturers are required to report disengagement events annually and to report any collision involving an AV within 10 days of the accident. Before December 1, 2017, only the aggregated number of disengagement events per month was reported; after that date, the exact date of each event is reported. Thus, we focus on the analysis of the data after December 1, 2017. Because it can be difficult to determine responsibility in collisions, a disengagement event is typically used in the literature as an alternative indicator of unsafe automated driving (e.g.,

Merkel 2018). During the period from December 1, 2017, to November 30, 2019, 34 manufacturers reported disengagement events. We use the disengagement events reported by Waymo, Cruise, Pony AI, and Zoox, as these four manufacturers performed extensive on-road testing during this time period. Here we provide a brief introduction to these four manufacturers. Waymo began as the Google Self-Driving Car Project in 2009, testing autonomous vehicles on public roads across six states in the United States. Waymo joined the California tester program in 2015. Cruise is an autonomous vehicle company founded in 2013. It joined the California tester program in 2016 and tested AV in the urban environment of San Francisco. Pony AI was founded in 2016 and develops autonomous driving technology globally. Pony AI started AV testing on California public roads in June 2017. Zoox is an autonomous vehicle company founded in 2013. It joined the California tester program in 2017 and tested AV in downtown San Francisco.

2.2 Disengagement Events Data

The California DMV requires manufacturers to report when their vehicles disengage from autonomous mode during tests. A disengagement event occurs when there is an autonomous technology failure, or when a situation requires the test driver to take manual control of the vehicle to operate it safely. Disengagement events can be initiated by a warning from the autonomous vehicle system or by the test driver when the driver judges that it is not safe to continue automated driving. Disengagement reports are provided annually, and include the ID numbers of the vehicles in testing, the location and date of each disengagement event, a description of the cause of the disengagement, the monthly autonomous mileage of each testing vehicle, and the annual total autonomous miles of each testing vehicle.

The starting date for the data used in this paper is December 1, 2017. The study period is from December 1, 2017, to November 30, 2019, which is a 24-month (two-year) study period. The data for the period from December 1, 2018, to November 30, 2019, are available in csv format from the California DMV website, while the data for the period from December 1, 2017, to November 30, 2018, are available in pdf format, which needs to be manually converted to csv format. After data cleaning, the event time is computed as the number of days since the starting date. The monthly mileage is divided by the number of days in that month to obtain the daily mileage, under the approximation that the daily mileage is constant over each month. The unit for the mileage is thousands of miles (k-miles).

Figure 1 shows a subset of the recurrent events data and mileage data from manufacturer Waymo. Figure 1(a) shows the recurrent events data, with the crosses representing the event times and the blue segments showing the active months (i.e., events can only be recorded within the observation windows). Figure 1(b) plots the mileage as a function of time for five representative units.

6 No. 1 No. 2 No. 7 No. 12 No. 15 Unit No. Miles Driven (k−mies) Miles Driven

0 5Obs. 10 Window 15 20 Event 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14

0 200 400 600 0 200 400 600

Days after 2017−12−01 Days after 2017−12−01

Figure 1: Visualization of a subset of the recurrent events data and mileage data from manufacturer Waymo. (a) The recurrent events data, with the crosses showing the event times and the blue segments showing the active months (observation windows). (b) The mileage as a function of time for five representative units (unit Nos. 1, 2, 7, 12, and 15). Note that the miles driven are in units of thousands of miles (k-miles).

Table 1 shows a summary of the recurrent events data and mileage data from the four manufacturers. We can see that Waymo and Cruise each drove more than one million miles and that their disengagement event rates are around 0.1 events per k-miles during the 24-month testing period. Pony AI and Zoox drove fewer miles, and their event rates are around 0.2 and 0.6 events per k-miles, respectively.

2.3 Notation for Data

The number of AV testing units (vehicles) in the fleet is denoted by n. The total follow-up time is τ = 730 days (i.e., two years). Let tij, i = 1, . . . , n, j = 1, . . . , ni, be event time j for unit i. Here tij records the number of days since December 1, 2017, and ni is the number of events for unit i. Note that it is possible that ni = 0, indicating that no events were observed for unit i.

Let xi(t), 0 < t ≤ τ, be the mileage driven by unit i at time (day) t. The unit of xi(t) is k-miles. Because only monthly mileage was reported, the daily average was used for xi(t).

Table 1: Summary of the recurrent events data and mileage data from the four manufacturers over the 24-month study period.

Manufacturer   No. of Vehicles   No. of Active Months   Active Months per Vehicle   No. of Events   Total in k-miles   No. of Events per k-miles
Waymo          123               1550                   12.602                      224             2710.136           0.083
Cruise         304               2079                    6.839                      154             1278.661           0.120
Pony AI         23                179                    7.783                       43              190.871           0.225
Zoox            32                280                    8.750                       58               97.780           0.593

Thus, xi(t) is a step function, which can be represented as

$$ x_i(t) = \sum_{l=1}^{n_\tau} x_{il}\, \mathbf{1}(\tau_{l-1} < t \le \tau_l). \quad (1) $$

Here, nτ = 24 is the number of months in the follow-up period, xil is the daily mileage for unit i during month l, τl is the day, counted from the start of the study, on which month l ends, and 1(·) is an indicator function. Let xi(t) = {xi(s) : 0 < s ≤ t} denote the history of the mileage driven for unit i.
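To make the construction in (1) concrete, the short R sketch below builds the daily-mileage step function xi(t) for a single unit; the monthly mileage values and month lengths are hypothetical placeholders rather than the actual DMV records.

```r
## Sketch: the daily-mileage step function x_i(t) of (1) for one unit.
## `monthly_miles` and `days_in_month` are hypothetical, not the DMV data.
monthly_miles <- c(1200, 950, 0, 1800)             # total autonomous miles per month
days_in_month <- c(31, 31, 28, 31)                 # December 2017 through March 2018
tau_l   <- cumsum(days_in_month)                   # month-ending days tau_l
tau_lm1 <- c(0, head(tau_l, -1))                   # tau_{l-1}
x_il    <- (monthly_miles / 1000) / days_in_month  # daily mileage, in k-miles

## x_i(t) = x_il for tau_{l-1} < t <= tau_l (a left-continuous step function)
x_i <- stepfun(tau_lm1[-1], x_il, right = TRUE)
x_i(15)    # daily mileage (k-miles) on day 15, i.e., during the first month
```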

3 Statistical Models

3.1 The Nonhomogeneous Poisson Process

Recurrent event processes are commonly modeled by a nonhomogeneous Poisson process (NHPP). The event intensity function for unit i is modeled as

$$ \lambda_i[t; x_i(t), \theta] = \lambda_0(t; \theta)\, x_i(t). \quad (2) $$

Here, λ0(t; θ) = λ0(t) is the baseline intensity function (BIF) with parameter vector θ. Because xi(t) is the mileage driven, λi[t; xi(t), θ] is the mileage-adjusted event intensity. The BIF can be interpreted as the event rate per k-miles at time t when xi(t) = 1. The baseline cumulative intensity function (BCIF) is

$$ \Lambda_0(t; \theta) = \Lambda_0(t) = \int_0^t \lambda_0(s; \theta)\, ds. \quad (3) $$

Note that Λ0(0; θ) = 0 and Λ0(t; θ) is a non-decreasing function of t. The BCIF Λ0(t) can be interpreted as the expected number of events from time 0 to t when x(t) = 1 for all t. The cumulative intensity function (CIF) for unit i is

$$ \Lambda_i[t; x_i(t), \theta] = \int_0^t \lambda_0(s; \theta)\, x_i(s)\, ds. \quad (4) $$

Table 2: List of commonly used parametric models and their BCIFs, BIFs, and parameters.

Musa-Okumoto: BCIF $\Lambda_0(t;\theta) = \theta_1^{-1}\log(1+\theta_2\theta_1 t)$; BIF $\lambda_0(t;\theta) = \theta_2(1+\theta_2\theta_1 t)^{-1}$; parameters $\theta = (\theta_1, \theta_2)'$ with $\theta_1 > 0$, $\theta_2 > 0$.
Gompertz: BCIF $\Lambda_0(t;\theta) = \theta_1\theta_3^{\theta_2^{t}} - \theta_1\theta_3$; BIF $\lambda_0(t;\theta) = \theta_1\theta_2^{t}\theta_3^{\theta_2^{t}}\log(\theta_2)\log(\theta_3)$; parameters $\theta = (\theta_1, \theta_2, \theta_3)'$ with $\theta_1 > 0$, $0 < \theta_2, \theta_3 < 1$.
Weibull: BCIF $\Lambda_0(t;\theta) = \theta_1[1 - \exp(-\theta_2 t^{\theta_3})]$; BIF $\lambda_0(t;\theta) = \theta_1\theta_2\theta_3 t^{\theta_3 - 1}\exp(-\theta_2 t^{\theta_3})$; parameters $\theta = (\theta_1, \theta_2, \theta_3)'$ with $\theta_1 > 0$, $\theta_2 > 0$, $\theta_3 > 0$.

In software reliability, the BCIF and the BIF are used as reliability metrics when recurrent events can be collected from repairable systems (e.g., Wood 1996). Trends in the BIF can indicate the evolution of the reliability of the underlying AV system. For example, improvement of the autonomous technology in AV can lead to a decreasing trend in the BIF (i.e., a BCIF increasing at a decreasing rate).

Typically in software reliability, one specifies a parametric form for the BCIF Λ0(t; θ). Note that the BIF can be obtained by taking the derivative of the BCIF with respect to t; that is, λ0(t; θ) = dΛ0(t; θ)/dt. The commonly used models for Λ0(t; θ) are the Musa-Okumoto, Gompertz, and Weibull models (e.g., Merkel 2018). Table 2 lists their BCIFs, BIFs, and parameters. Note that the Weibull BCIF is similar to the Weibull cumulative distribution function (cdf) but with an asymptote θ1 as t goes to ∞.
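To make the forms in Table 2 concrete, the following R sketch codes the three BCIFs and, for the Gompertz model, checks that the BIF matches a numerical derivative of the BCIF; the parameter values used in the check are hypothetical, not fitted estimates.

```r
## Sketch: the three BCIFs of Table 2; theta is (theta1, theta2[, theta3]).
Lambda0_musa <- function(t, theta) log(1 + theta[2] * theta[1] * t) / theta[1]
Lambda0_gomp <- function(t, theta) theta[1] * theta[3]^(theta[2]^t) - theta[1] * theta[3]
Lambda0_weib <- function(t, theta) theta[1] * (1 - exp(-theta[2] * t^theta[3]))

## The BIF is the derivative of the BCIF; for example, for the Gompertz model:
lambda0_gomp <- function(t, theta)
  theta[1] * theta[2]^t * theta[3]^(theta[2]^t) * log(theta[2]) * log(theta[3])

## Check the Gompertz BIF against a numerical derivative (hypothetical parameters)
theta <- c(100, 0.995, 0.2); t0 <- 300; h <- 1e-4
c(analytic  = lambda0_gomp(t0, theta),
  numerical = (Lambda0_gomp(t0 + h, theta) - Lambda0_gomp(t0 - h, theta)) / (2 * h))
```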

3.2 Spline Models

Although parametric models can fit certain trends in the event process, they may not be flexible enough to describe the event process for AV testing, because the evolution of the AI technology in an AV system can be complicated. This motivates us to propose the spline model for describing the BCIF. In the spline model, the BCIF is represented as a linear combination of spline bases. That is,

$$ \Lambda_0(t; \theta) = \sum_{l=1}^{n_s} \beta_l\, \gamma_l(t), \quad \beta_l \ge 0, \quad l = 1, \ldots, n_s, \quad (5) $$

which provides a nonparametric method to describe the BCIF. Here θ = (β1, . . . , βns)' is the vector of spline coefficients, the γl(t)'s are the spline bases, and ns is the number of spline bases. The BIF can be obtained by taking the derivative with respect to t. That is,

$$ \lambda_0(t; \theta) = \frac{d\Lambda_0(t; \theta)}{dt} = \sum_{l=1}^{n_s} \beta_l\, \frac{d\gamma_l(t)}{dt}. \quad (6) $$


Figure 2: Examples of I-spline basis functions.

Because of the constraints that Λ0(0; θ) = 0 and that Λ0(t; θ) is a non-decreasing function of t, some special considerations are needed in the spline model. We use the I-splines in Ramsay (1988). Figure 2 shows examples of I-spline basis functions (i.e., the γl(t)). We can see that each spline basis takes the value zero at t = 0 and is monotonically increasing. By taking non-negative coefficients (i.e., βl ≥ 0), a non-decreasing Λ0(t; θ) is obtained.

A brief introduction to the construction of the I-spline basis is as follows. The boundary knots are 0 and τ. The b interior knots are denoted by $t_{h+1}, \ldots, t_{h+b}$ for splines of order h. The complete knot sequence is denoted by $0 = t_1 = \cdots = t_h < t_{h+1} < \cdots < t_{h+b} < t_{h+b+1} = \cdots = t_{2h+b} = \tau$. The total number of spline bases is ns = h + b. The I-splines are obtained by integrating the M-splines; that is,

$$ I_q^{(h)}(t) = \int_0^t M_q^{(h)}(u)\, du, \quad q = 1, \ldots, b + h, \quad t \in [0, \tau], $$

and the M-spline bases of order h are defined recursively. The M-splines of order 1 are

$$ M_q^{(1)}(t) = \mathbf{1}(t_q \le t < t_{q+1})\, (t_{q+1} - t_q)^{-1}, \quad q = 1, \ldots, b + 1. $$

The M-splines of order h are obtained as

$$ M_q^{(h)}(t) = \frac{h\left[(t - t_q)\, M_q^{(h-1)}(t) + (t_{q+h} - t)\, M_{q+1}^{(h-1)}(t)\right]}{(h - 1)(t_{q+h} - t_q)}\, \mathbf{1}(t_q \le t < t_{q+h}), \quad q = 1, \ldots, b + h. $$
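As an illustration of this construction, the R sketch below builds a degree-3 I-spline basis like the one in Figure 2 using the splines2 package; the package choice and the interior-knot locations are assumptions made for illustration, not details given by the authors.

```r
## Sketch: a degree-3 I-spline basis on [0, tau]; each column is one gamma_l(t).
library(splines2)

tau   <- 730
tgrid <- seq(0, tau, by = 1)                  # evaluation grid (days)
interior_knots <- c(180, 365, 550)            # hypothetical interior knots (b = 3)

basis <- iSpline(tgrid, knots = interior_knots, degree = 3,
                 intercept = TRUE, Boundary.knots = c(0, tau))

dim(basis)                   # columns = number of spline bases n_s = b + h
basis[1, ]                   # each basis function equals zero at t = 0
all(diff(basis[, 1]) >= 0)   # and each basis is monotonically non-decreasing

## A non-decreasing BCIF is a non-negative combination of the columns:
beta        <- runif(ncol(basis))             # hypothetical non-negative coefficients
Lambda0_hat <- as.vector(basis %*% beta)
```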

3.3 Modeling Heterogeneity

It is possible that some AV units are more likely than others to generate events, even after accounting for the mileage driven, resulting in heterogeneity in the event process. Similar to the approach in Chapter 3 of Cook and Lawless (2007), a frailty term can be added to the parametric intensity function to model this extra heterogeneity. The gamma frailty model is

$$ \lambda_i[t; u_i, x_i(t), \theta] = u_i\, \lambda_0(t; \theta)\, x_i(t). \quad (7) $$

Here, the frailty term ui is a random variable that follows a gamma distribution with mean one and variance φ. The probability density function (pdf) of ui is

$$ f(u_i) = \frac{1}{\Gamma(1/\phi)\, \phi^{1/\phi}}\, u_i^{1/\phi - 1} \exp(-u_i/\phi). $$

The gamma frailty model is popular in reliability and survival applications because there is a closed-form expression for the marginal likelihood for recurrent events data based on the model in (7).

4 Model Estimation and Inference

This section presents parameter estimation procedures for the parametric and spline models, and the corresponding statistical inference procedures. This section also contains parameter estimation and hypothesis testing for the gamma frailty model.

4.1 Parameter Estimation

We use the maximum likelihood (ML) method for parameter estimation. The likelihood function is

$$ L(\theta) = \prod_{i=1}^{n} \left\{ \prod_{j=1}^{n_i} \lambda_i[t_{ij}; x_i(t_{ij}), \theta] \right\} \times \exp\{-\Lambda_i[\tau; x_i(\tau), \theta]\}, \quad (8) $$

with the convention that $\prod_{j=1}^{0}(\cdot) = 1$. Here, the intensity function and the CIF are defined in (2) and (4), respectively. The log-likelihood function is obtained by taking the logarithm of L(θ) in (8). That is,

$$ l(\theta) = \sum_{i=1}^{n} \left( \sum_{j=1}^{n_i} \{\log[x_i(t_{ij})] + \log[\lambda_0(t_{ij}; \theta)]\} - \Lambda_i[\tau; x_i(\tau), \theta] \right) \quad (9) $$
$$ = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \{\log[x_i(t_{ij})] + \log[\lambda_0(t_{ij}; \theta)]\} - \sum_{i=1}^{n} \sum_{l=1}^{n_\tau} \{x_{il} \cdot [\Lambda_0(\tau_l; \theta) - \Lambda_0(\tau_{l-1}; \theta)]\}, $$

with the convention that $\sum_{j=1}^{0}(\cdot) = 0$. The last step in (9) is obtained by substituting the expression for xi(t) in (1).

For parametric models, the functional forms of Λ0(t) and λ0(t) in Table 2 can be substituted into (9) to evaluate the log-likelihood function. The ML estimate of θ, denoted by θ̂, can be obtained by finding the value of θ that maximizes l(θ). For parametric models, the dimension of θ is typically 2 or 3. We used the R optim() function with the "Nelder-Mead" option to do the optimization.
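As a minimal sketch of this step, the R code below maximizes the log-likelihood (9) for the Gompertz model with optim() and the "Nelder-Mead" option; the event times and monthly mileages are small synthetic examples (with simplified 30-day months), not the DMV data.

```r
## Sketch: ML estimation of the Gompertz model by maximizing (9) with optim().
set.seed(1)
tau_l   <- seq(30, 720, by = 30)               # 24 simplified 30-day "months"
tau_lm1 <- c(0, head(tau_l, -1))

units <- list(                                 # synthetic event times and daily mileages
  list(t = c(45, 160, 400), x = runif(length(tau_l), 0.02, 0.08)),
  list(t = c(210, 580),     x = runif(length(tau_l), 0.01, 0.05)),
  list(t = numeric(0),      x = runif(length(tau_l), 0.01, 0.03))   # unit with no events
)

Lambda0 <- function(t, th) th[1] * th[3]^(th[2]^t) - th[1] * th[3]  # Gompertz BCIF
lambda0 <- function(t, th) th[1] * th[2]^t * th[3]^(th[2]^t) * log(th[2]) * log(th[3])

loglik <- function(th, units) {                # log-likelihood (9)
  if (th[1] <= 0 || any(th[2:3] <= 0) || any(th[2:3] >= 1)) return(-1e10)
  sum(sapply(units, function(u) {
    ev <- if (length(u$t) > 0) {
      xt <- u$x[findInterval(u$t, tau_lm1, left.open = TRUE)]      # x_i(t_ij)
      sum(log(xt) + log(lambda0(u$t, th)))
    } else 0
    ev - sum(u$x * (Lambda0(tau_l, th) - Lambda0(tau_lm1, th)))
  }))
}

fit <- optim(c(50, 0.99, 0.3), loglik, units = units,
             method = "Nelder-Mead", control = list(fnscale = -1))
fit$par    # ML estimates of (theta1, theta2, theta3)
```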

For the spline model, the functional forms of Λ0(t) and λ0(t) in (5) and (6), respectively, can be substituted into (9) to evaluate the log-likelihood function. To estimate the parameter θ for the spline model, one needs to specify the number and locations of the knots and to address the non-negativity constraints on θ indicated in (5). We use I-splines of degree 3, which are generally smooth enough to fit the BCIF. The boundary knots are set to 0 on the left and τ = 730 on the right. The interior knots are set to equally spaced sample quantiles of the observed event times. For example, if three interior knots are needed, they are set to the 0.25, 0.5, and 0.75 sample quantiles of the observed event times. After setting the knot locations, the spline bases can be computed. To select the best number of knots, we use the Akaike information criterion (AIC), which is computed as AIC = −2l(θ̂) + 2df. Here, df is the number of degrees of freedom for the model; for the spline model, df is the number of non-zero coefficients. To address the non-negativity constraints on θ, we use the "L-BFGS-B" option in the R optim() function. The "L-BFGS-B" algorithm allows users to specify an interval for each variable being optimized, and we set the interval to [0, ∞) for the spline coefficients. With the "L-BFGS-B" algorithm, some elements of the estimate θ̂ can be set exactly to zero.
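The sketch below illustrates this fitting procedure, using the splines2 package for the I-spline basis and its M-spline derivative (an implementation assumption on our part), together with the "L-BFGS-B" constraint handling and the AIC rule for choosing the number of interior knots; the data are synthetic placeholders with simplified 30-day months.

```r
## Sketch: fitting the I-spline model for the BCIF with non-negativity constraints
## and selecting the number of interior knots by AIC. Synthetic data, not the DMV data.
library(splines2)
set.seed(1)

tau     <- 720                                 # 24 simplified 30-day months
tau_l   <- seq(30, tau, by = 30)
tau_lm1 <- c(0, head(tau_l, -1))
units   <- replicate(20, list(t = sort(runif(rpois(1, 2), 0, tau)),
                              x = runif(length(tau_l), 0.01, 0.08)),
                     simplify = FALSE)
all_events <- unlist(lapply(units, `[[`, "t"))

fit_spline <- function(n_knots) {
  knots  <- quantile(all_events, probs = seq_len(n_knots) / (n_knots + 1))
  bI_l   <- iSpline(tau_l,   knots = knots, degree = 3, intercept = TRUE,
                    Boundary.knots = c(0, tau))
  bI_lm1 <- iSpline(tau_lm1, knots = knots, degree = 3, intercept = TRUE,
                    Boundary.knots = c(0, tau))
  incr   <- bI_l - bI_lm1                      # gamma_l(tau_l) - gamma_l(tau_{l-1})

  negloglik <- function(beta) {
    ll <- 0
    for (u in units) {
      if (length(u$t) > 0) {
        bM <- mSpline(u$t, knots = knots, degree = 3, intercept = TRUE,
                      Boundary.knots = c(0, tau))   # derivative of the I-spline basis
        xt <- u$x[findInterval(u$t, tau_lm1, left.open = TRUE)]
        ll <- ll + sum(log(xt) + log(as.vector(bM %*% beta) + 1e-12))
      }
      ll <- ll - sum(u$x * as.vector(incr %*% beta))
    }
    -ll
  }

  ns  <- ncol(bI_l)                            # number of spline bases n_s
  opt <- optim(rep(0.1, ns), negloglik, method = "L-BFGS-B", lower = rep(0, ns))
  df  <- sum(opt$par > 1e-8)                   # number of non-zero coefficients
  list(aic = 2 * opt$value + 2 * df, beta = opt$par, knots = knots)
}

fits <- lapply(2:5, fit_spline)                # candidate numbers of interior knots
best <- fits[[which.min(sapply(fits, `[[`, "aic"))]]
```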

4.2 Statistical Inference

For inference based on parametric models, the normal approximation based on large-sample theory can be used. Because inference for parametric models is relatively straightforward, the focus of this section is on inference for the spline model. For the spline model, the normal approximation is inappropriate because the parameter estimates can occur on the boundary of the parameter space (i.e., β̂l can be zero). Thus, we use the fractional-random-weight bootstrap (e.g., Xu et al. 2020). Different from other bootstrap methods, the fractional-random-weight bootstrap provides a convenient way to quantify uncertainty for data with a complex structure. In our case, we need to handle recurrent events with window observations and time-varying mileage information, which complicate the data structure. The implementation of the fractional-random-weight bootstrap is straightforward. The log-likelihood function in (9) is re-weighted as

$$ l^*(\theta) = \sum_{i=1}^{n} w_i \sum_{j=1}^{n_i} \{\log[x_i(t_{ij})] + \log[\lambda_0(t_{ij}; \theta)]\} - \sum_{i=1}^{n} w_i \sum_{l=1}^{n_\tau} \{x_{il} \cdot [\Lambda_0(\tau_l; \theta) - \Lambda_0(\tau_{l-1}; \theta)]\}, \quad (10) $$

where the random weights wi are independently generated from an exponential distribution with mean one. The bootstrap algorithm for generating bootstrap versions of the BCIF estimate $\widehat{\Lambda}_0^{*}(t)$ is summarized as follows.

Algorithm 1: Bootstrap Algorithm for Generating $\widehat{\Lambda}_0^{*}(t)$.

1. Generate random weights wi, i = 1, . . . , n, independently from an exponential distribution with mean one.
2. Construct the re-weighted log-likelihood function as in (10).
3. Use AIC to select the best number of knots based on the log-likelihood function in (10).
4. Based on the best number of knots chosen in step 3, the corresponding spline bases, and the estimated coefficients, compute the BCIF estimate $\widehat{\Lambda}_0^{*}(t)$.
5. Repeat steps 1 to 4 to obtain B copies of $\widehat{\Lambda}_0^{*}(t)$, denoted by $\widehat{\Lambda}_0^{*b}(t)$, b = 1, . . . , B.
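The sketch below illustrates the re-weighted log-likelihood (10) and the bootstrap loop of Algorithm 1. To keep the example short, the Gompertz model is refit in each bootstrap replicate instead of the spline model with AIC knot selection, the number of bootstrap samples is reduced, and the data are synthetic; the structure of the loop, however, follows steps 1 to 5 above.

```r
## Sketch: fractional-random-weight bootstrap copies of the BCIF estimate.
set.seed(3)
tau     <- 720                                  # 24 simplified 30-day months
tau_l   <- seq(30, tau, by = 30)
tau_lm1 <- c(0, head(tau_l, -1))
units   <- replicate(50, list(t = sort(runif(rpois(1, 2), 0, tau)),
                              x = runif(length(tau_l), 0.01, 0.08)),
                     simplify = FALSE)

Lambda0 <- function(t, th) th[1] * th[3]^(th[2]^t) - th[1] * th[3]  # Gompertz BCIF
lambda0 <- function(t, th) th[1] * th[2]^t * th[3]^(th[2]^t) * log(th[2]) * log(th[3])

## Re-weighted log-likelihood l*(theta) in (10): unit i's contribution is multiplied by w_i
loglik_w <- function(th, units, w) {
  if (th[1] <= 0 || any(th[2:3] <= 0) || any(th[2:3] >= 1)) return(-1e10)
  sum(mapply(function(u, wi) {
    ev <- if (length(u$t) > 0) {
      xt <- u$x[findInterval(u$t, tau_lm1, left.open = TRUE)]
      sum(log(xt) + log(lambda0(u$t, th)))
    } else 0
    wi * (ev - sum(u$x * (Lambda0(tau_l, th) - Lambda0(tau_lm1, th))))
  }, units, w))
}

B     <- 200                                    # reduced for illustration (the paper uses B = 5000)
tgrid <- seq(0, tau, by = 10)
boot_Lambda0 <- matrix(NA_real_, B, length(tgrid))
for (b in seq_len(B)) {
  w   <- rexp(length(units), rate = 1)          # step 1: random weights with mean one
  est <- optim(c(50, 0.99, 0.3), loglik_w, units = units, w = w,
               method = "Nelder-Mead", control = list(fnscale = -1))$par
  boot_Lambda0[b, ] <- Lambda0(tgrid, est)      # one bootstrap copy of the BCIF
}
```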

Based on the bootstrap estimates $\widehat{\Lambda}_0^{*b}(t)$, b = 1, . . . , B, one can construct an approximate 100(1 − αp)% pointwise confidence interval (PCI) for Λ0(t) at a given t,

$$ \left[\, \widehat{\Lambda}_0^{*([B\alpha_p/2])}(t),\; \widehat{\Lambda}_0^{*([B(1-\alpha_p/2)])}(t) \,\right]. $$

Here, $\widehat{\Lambda}_0^{*(b)}(t)$ is the ordered version of $\widehat{\Lambda}_0^{*b}(t)$, and [ · ] is the rounding function. Based on the spline model, one can also construct a simultaneous confidence band (SCB) for the BCIF, which can be used to assess the adequacy of the parametric models in fitting the data. An approximate 100(1 − α)% SCB for Λ0(t), tL ≤ t ≤ tU, is

$$ \left[\, \underset{\sim}{\Lambda}_0(t),\; \widetilde{\Lambda}_0(t) \,\right], \quad t_L \le t \le t_U, \quad (11) $$

where tL and tU are boundaries to be specified. If an estimated parametric model is contained in the SCB in (11), the parametric model is statistically consistent with the data (i.e., there is no statistical evidence to reject the parametric model). Thus, the SCB constructed from the spline model provides a tool to check whether a parametric model is adequate.

The 100(1 − αp)% PCIs provide the building blocks for computing an SCB for Λ0(t) over t ∈ [tL, tU]. We use the bootstrap samples to calibrate the PCIs so that they approximately provide the nominal 100(1 − α)% coverage probability (CP), similar to the idea of the equal-precision SCB for a cdf in Nair (1984). The CP can be estimated as

$$ \mathrm{CP}(\alpha_p) = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\left( \widehat{\Lambda}_0^{*([B\alpha_p/2])}(t) \le \widehat{\Lambda}_0^{*b}(t) \le \widehat{\Lambda}_0^{*([B(1-\alpha_p/2)])}(t), \ \text{for all } t \in [t_L, t_U] \right). $$

By setting CP(αp) = 1 − α, one can solve for the calibrated value αc. Thus, the SCB in (11) can be computed as

$$ \left[\, \widehat{\Lambda}_0^{*([B\alpha_c/2])}(t),\; \widehat{\Lambda}_0^{*([B(1-\alpha_c/2)])}(t) \,\right], \quad t_L \le t \le t_U, $$

which is computationally efficient because the bootstrap samples have already been obtained.
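A sketch of the calibration is given below, assuming the matrix boot_Lambda0 of bootstrap BCIF curves from the sketch after Algorithm 1; quantile() stands in for the order-statistic notation above, and the calibration is carried out by a simple grid search over αp.

```r
## Sketch: pointwise intervals and the calibrated SCB from the bootstrap copies.
## `boot_Lambda0` is the B x length(tgrid) matrix produced by the previous sketch.
pci <- function(alpha_p)                        # pointwise interval at each grid time
  apply(boot_Lambda0, 2, quantile, probs = c(alpha_p / 2, 1 - alpha_p / 2))

cp_hat <- function(alpha_p) {                   # estimated simultaneous coverage CP(alpha_p)
  band <- pci(alpha_p)
  mean(apply(boot_Lambda0, 1,
             function(curve) all(curve >= band[1, ] & curve <= band[2, ])))
}

## Calibrate alpha_p so that the pointwise band has simultaneous coverage 1 - alpha = 0.95
alphas  <- seq(0.001, 0.05, by = 0.001)
alpha_c <- alphas[which.min(abs(sapply(alphas, cp_hat) - 0.95))]
scb     <- pci(alpha_c)                         # row 1: lower band; row 2: upper band
```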

4.3 The Frailty Model

To estimate the gamma frailty model in (7), one needs to calculate the marginal likelihood function. The marginal likelihood function is

$$ L(\theta, \phi) = \prod_{i=1}^{n} \int_0^\infty \left\{ \prod_{j=1}^{n_i} u_i\, \lambda_i[t_{ij}; x_i(t_{ij}), \theta] \right\} \times \exp\{-u_i \Lambda_i[\tau; x_i(\tau), \theta]\}\, f(u_i)\, du_i \quad (12) $$
$$ = \prod_{i=1}^{n} \left\{ \prod_{j=1}^{n_i} \lambda_i[t_{ij}; x_i(t_{ij}), \theta] \right\} \times \int_0^\infty u_i^{n_i} \exp(-u_i c_i)\, f(u_i)\, du_i $$
$$ = \prod_{i=1}^{n} \left\{ \prod_{j=1}^{n_i} \lambda_i[t_{ij}; x_i(t_{ij}), \theta] \right\} \times \frac{\phi^{-1/\phi}\, \Gamma(n_i + 1/\phi)}{\Gamma(1/\phi)\, (c_i + 1/\phi)^{n_i + 1/\phi}}, $$

where $c_i = \sum_{l=1}^{n_\tau} x_{il} \cdot [\Lambda_0(\tau_l; \theta) - \Lambda_0(\tau_{l-1}; \theta)]$. The ML estimates of θ and φ are obtained by finding the values that maximize the log-likelihood function, log[L(θ, φ)]. We again use the R optim() function with the "Nelder-Mead" option to do the optimization. To check whether there is population heterogeneity in the event process, one can use the likelihood ratio test. The test statistic is constructed as

$$ -2\{l(\widehat{\theta}) - \log[L(\widehat{\theta}, \widehat{\phi})]\}, \quad (13) $$

which has a $\chi^2_1$ distribution under the null hypothesis that φ = 0. Here, l(θ̂) is obtained by evaluating (9) at θ̂. The null hypothesis is rejected if the test statistic is larger than $\chi^2_{1,\, 1-\alpha}$, indicating that there is population heterogeneity.
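The sketch below codes the marginal log-likelihood (12) and the likelihood ratio test (13) for the Gompertz model, reusing the synthetic units, tau_l, tau_lm1, Lambda0(), and lambda0() objects from the bootstrap sketch in Section 4.2; optimizing φ on the log scale is our implementation choice, not a detail from the paper.

```r
## Sketch: gamma frailty model estimation and the likelihood ratio test.
frailty_loglik <- function(par, units) {        # marginal log-likelihood from (12)
  th <- par[1:3]; phi <- exp(par[4])            # phi kept positive via the log scale
  if (th[1] <= 0 || any(th[2:3] <= 0) || any(th[2:3] >= 1)) return(-1e10)
  sum(sapply(units, function(u) {
    ni <- length(u$t)
    ev <- if (ni > 0) {
      xt <- u$x[findInterval(u$t, tau_lm1, left.open = TRUE)]
      sum(log(xt) + log(lambda0(u$t, th)))
    } else 0
    ci <- sum(u$x * (Lambda0(tau_l, th) - Lambda0(tau_lm1, th)))
    ev - log(phi) / phi + lgamma(ni + 1 / phi) - lgamma(1 / phi) -
      (ni + 1 / phi) * log(ci + 1 / phi)
  }))
}

loglik0 <- function(th, units) {                # log-likelihood (9): no frailty
  if (th[1] <= 0 || any(th[2:3] <= 0) || any(th[2:3] >= 1)) return(-1e10)
  sum(sapply(units, function(u) {
    ev <- if (length(u$t) > 0) {
      xt <- u$x[findInterval(u$t, tau_lm1, left.open = TRUE)]
      sum(log(xt) + log(lambda0(u$t, th)))
    } else 0
    ev - sum(u$x * (Lambda0(tau_l, th) - Lambda0(tau_lm1, th)))
  }))
}

fit0 <- optim(c(50, 0.99, 0.3), loglik0, units = units,
              method = "Nelder-Mead", control = list(fnscale = -1))
fit1 <- optim(c(fit0$par, log(0.1)), frailty_loglik, units = units,
              method = "Nelder-Mead", control = list(fnscale = -1))

lrt  <- -2 * (fit0$value - fit1$value)          # test statistic (13)
pval <- 1 - pchisq(lrt, df = 1)                 # compared to a chi-square(1) distribution
```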

5 Simulation Study

The purpose of the simulation study is to show the properties of the ML estimator for the spline model, to check the CP of the SCB procedure based on the spline model, and to see whether the SCB can correctly accept or reject a particular parametric model.

5.1 Setting

In the simulation study, the spline model is considered as the true underlying model. The spline bases are shown in Figure 2. We consider three scenarios. Figure 3 shows the true BCIFs used in the three simulation scenarios. To simplify the setting, we only consider one parametric model, which is the Gompertz model.

• Scenario 1: we choose θ = (6, 16, 23, 11, 4)0 as the coefficients. Using the spline bases in Figure 2, the true BCIF is shown as the black/solid line in Figure 3. The Gompertz model fits this BCIF perfectly.

• Scenario 2: we choose θ = (8, 12, 28, 0, 12)0 as the coefficients. The corresponding true BCIF is shown as the red/dash line in Figure 3. The Gompertz model fits this BCIF well except for the late stage.

• Scenario 3: we choose θ = (5, 25, 0, 30, 0)0 as the coefficients. The corresponding true BCIF is shown as the green/dot-dash line in Figure 3. The Gompertz model does not fit this BCIF well due to various changes in slope over time.

The number of bootstrap samples is B = 5000 and the number of simulated datasets (i.e., repeats) is N = 1000. The mileage driven history is sampled with replacement from the historical data from Waymo. The average number of events per unit is around 1.8. The sample sizes (i.e., the number of AV units) considered in the simulation are 50, 100, 200, 500, and 1000. We evaluate the relative root mean squared errors (RMSE) for the BCIF estimator, the CP for the SCB, and the acceptance probability for the parametric model. In particular, the relative RMSE (RelRMSE) is computed as

$$ \mathrm{RelRMSE} = \frac{\left\{ \sum_{l=1}^{N} \left[ \widehat{\Lambda}_{0l}(t) - \Lambda_0(t) \right]^2 / N \right\}^{1/2}}{\Lambda_0(t)}, $$

where $\widehat{\Lambda}_{0l}(t)$ is the estimated BCIF using the spline model based on the lth simulated dataset, and Λ0(t) is the true BCIF. We use the relative RMSE to remove the scale effect of Λ0(t).


Figure 3: Plot of the true BCIFs used in the three simulation scenarios.

That is, the RMSE tends to be large if Λ0(t) is large at a particular t. The CP is estimated by the proportion of times that the SCB constructed using the spline model captures the true BCIF. The acceptance probability is estimated by the proportion of times that the spline-based SCB captures the estimated BCIF from the Gompertz model.
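As a small sketch, the RelRMSE can be computed over a time grid as follows; Lambda0_true() and Lambda0_sim are hypothetical stand-ins for the true BCIF and for the N by (number of grid points) matrix of estimated BCIF curves from the simulated datasets.

```r
## Sketch: relative RMSE of the BCIF estimator over a time grid.
tgrid   <- seq(10, 730, by = 10)       # exclude t = 0, where Lambda0(0) = 0
truth   <- Lambda0_true(tgrid)         # hypothetical true BCIF function
relrmse <- sqrt(colMeans(sweep(Lambda0_sim, 2, truth)^2)) / truth
plot(tgrid, relrmse, type = "l", xlab = "Time in Days", ylab = "Relative RMSE")
```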

5.2 Results

Figure 4 shows plots of the RelRMSE as a function of time using the spline model and the Gompertz model to fit the data under the three scenarios. As we can see from the figure, the RelRMSE for the spline model generally decreases as the sample size increases across all three scenarios. When the sample size is large, the RelRMSE for the spline model estimator is within a small range. The behavior of the RelRMSE for the Gompertz model depends on the scenario. For Scenario 1, in which the Gompertz model fits the true model well, the spline model tends to have a higher RelRMSE because there are more parameters in the spline model (and thus more variability in the estimates) than in the parametric model. For Scenario 2, in which there is some departure in the late stage, the Gompertz model has a smaller RelRMSE than the spline model, but the advantage diminishes as the sample size increases because bias begins to dominate variance. For Scenario 3, in which there is a large difference from the true model, the Gompertz model tends to have a larger RelRMSE than the spline model, and the RelRMSE does not decrease much as the sample size increases due to the effect of large bias.

In summary, the ML estimator for the spline model works as expected. When a parametric BCIF is adequate, it tends to have higher statistical efficiency. When the parametric model is not adequate, there can be a large RelRMSE due to bias in the estimation. These results show that it is important to assess the adequacy of parametric models. The spline model is flexible, at the price of losing some statistical efficiency.

Figure 5 shows plots of the CP and the acceptance probability as a function of sample size under the three scenarios. From Figure 5(a), the CP for the SCB procedure based on the spline model and the bootstrap is quite similar across the three scenarios. The CP improves as the sample size increases; it is less than nominal when the sample size is small and gets closer to the nominal level when the sample size is larger than 200. From the plot of the acceptance probability in Figure 5(b), the parametric model is generally accepted when the sample size is small and there is little or no departure from the true model. When there is a large departure from the true model, as in Scenario 3, the SCB does not capture the estimated Gompertz model with high probability when the sample size is large.

6 Data Analysis

In this section, we present the data analysis for the recurrent disengagement events data.

6.1 Model Fitting

We start by fitting the parametric models in Table 2 and the spline model to the disengagement-events data from the four manufacturers: Waymo, Cruise, Pony AI, and Zoox. The model fitting uses the ML estimation procedures described in Section 4. Table 3 shows the AIC values for the models fitted to the data from the four manufacturers; the values marked with an asterisk indicate the lowest AIC among the parametric models. The spline model, in general, results in much lower AIC values, except for the Zoox data. This shows that the spline model is flexible in fitting the recurrent events data. For Zoox, the AIC value for the Musa-Okumoto model is small because the model has only two parameters. Based on the AIC values, the best parametric model for Waymo and Cruise is the Gompertz model, the best parametric model for Pony AI is the Weibull model, and the best parametric model for Zoox is the Musa-Okumoto model.

To visualize the model estimation results, Figure 6 shows plots of the estimated BCIFs for the four manufacturers based on the spline model and the parametric models, together with the 95% SCB based on the spline model. For Waymo and Cruise, all the parametric models are within the SCB and agree quite well with the estimated BCIF from the spline model. The

Figure 4: Plots of relative RMSE as a function of time using the spline model and the Gompertz model to fit the data under the three scenarios. Panels: (a) Scenario 1, spline model; (b) Scenario 1, Gompertz; (c) Scenario 2, spline model; (d) Scenario 2, Gompertz; (e) Scenario 3, spline model; (f) Scenario 3, Gompertz.

Figure 5: Plots of coverage probability and acceptance probability as a function of sample size under the three scenarios. Panels: (a) coverage probability; (b) acceptance probability.

Table 3: The values of AIC for fitting various models to the data from the four manufacturers. The values marked with an asterisk indicate the lowest AIC values among the parametric models.

                 --------------- Parametric Models ---------------
Manufacturer     Musa-Okumoto     Gompertz     Weibull      Spline Model
Waymo            2769.78          2769.70*     2770.60      2756.21
Cruise           2051.27          2047.42*     2048.28      2046.09
Pony AI           499.79           504.56       498.73*      479.73
Zoox              687.69*          689.55       689.38       688.78

Gompertz model agrees with the observed number of events better than the other models over certain time ranges.

For Pony AI, the SCB is wide, indicating large variability in the estimation. In the early stage of the testing (i.e., from day 0 to day 200), all the events came from two units with a low mileage driven at xi(t) = 0.01. A high number of events with a low mileage driven leads to a high event rate. The SCB is asymmetric because it was built using the fractional-random-weight bootstrap and the distribution of Λ̂0(t) is heavily skewed. The fact that all events come from two units leads to a large amount of variability in the SCB, because changing the weights of these two units has a large influence on the re-weighted log-likelihood. We also note that the best parametric model (the Weibull) does not agree well with the spline model, although it tracks the trend. For Zoox, the SCB is also relatively wide due to the small number of test units. All parametric models are within the SCB, indicating that all parametric models are statistically acceptable.

To further check how well the models fit the data, Figure 7 shows plots of the expected versus the observed number of events for the four manufacturers based on the spline model and parametric models, together with the 95% PCIs based on the spline model. The expected number of events is computed based on the specific model with adjustment for the mileage history from all units. The PCIs for the expected number of events are based on the bootstrap samples. The shape of the curve of the cumulative number of events differs from the shape of the BCIF because the cumulative number of events is adjusted by the mileage xi(t), which is time-varying and depends on the driving pattern, while the BCIF corresponds to the case in which xi(t) = 1. In all cases, the spline model tracks the cumulative number of observed events well. For Waymo, Cruise, and Zoox, the Gompertz and Weibull models also track the counts well, but visually we can see some departures for the Musa-Okumoto model. For Pony AI, all three parametric models show significant departures in the plot, indicating that they are not flexible enough to describe that event process.
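A sketch of this computation is given below; it assumes the units, tau_l, and tau_lm1 objects of the earlier sketches and a fitted BCIF function Lambda0_fit(), which are stand-ins for the actual fitted models rather than the authors' code.

```r
## Sketch: model-based expected cumulative number of events by day t, adjusted
## for every unit's mileage history, as used for the comparisons in Figure 7.
expected_events <- function(t, units, Lambda0_fit) {
  sum(sapply(units, function(u)
    sum(u$x * (Lambda0_fit(pmin(t, tau_l)) - Lambda0_fit(pmin(t, tau_lm1))))))
}
## e.g., expected_events(365, units, function(t) Lambda0(t, theta_hat)),
## where theta_hat is a (hypothetical) fitted Gompertz parameter vector.
```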

6.2 Results Interpretation

Table 4 shows the parameter estimates, standard errors, and approximate 95% confidence intervals (CIs) for the best parametric models for the four manufacturers. The CIs for the parameters are based on a normal (Wald) approximation, and some transformations (e.g., a logarithm transformation for positive parameters) are used to improve the performance. Note that the estimate of θ1 for the Zoox/Musa-Okumoto model is set to zero because the model is degenerate at θ1 = 0, indicating a constant-rate situation. We also tested for population heterogeneity using the procedure in Section 4.3. Table 5 shows a summary of the gamma frailty models for the four manufacturers. All of the p-values are

Figure 6: Plots of the estimated BCIFs for the four manufacturers based on the spline model and parametric models, together with the 95% SCB based on the spline model. Panels: (a) Waymo; (b) Cruise; (c) Pony AI; (d) Zoox.

Figure 7: Plots of the expected versus the observed number of events for the four manufacturers based on the spline model and parametric models, together with the 95% PCIs based on the spline model. Panels: (a) Waymo; (b) Cruise; (c) Pony AI; (d) Zoox.

close to 1, indicating little population heterogeneity among the event processes for the four manufacturers. Hence, it is reasonable to use a model in which all units have the same event process with the same BCIF. Table 5 also lists the best parametric model for each manufacturer.

To further visualize the results, Figure 8 plots the BIFs from the best parametric models for each of the four manufacturers. From the plot, we see the trends for the different manufacturers. A decreasing trend means an improvement in AI technology and an increase in AV reliability. Waymo, Cruise, and Pony AI display a decreasing trend, while Zoox displays a constant rate of about 0.6 events per k-miles. Waymo starts at an event rate near 0.1 events per k-miles and decreases over the two-year period. Cruise starts at 0.2 events per k-miles and shows a decreasing trend. Pony AI starts at a high rate of 10 events per k-miles and shows a rapidly improving rate. By the end of the study period (i.e., November 30, 2019), the event rate for Waymo and Cruise is around 0.05 events per k-miles, and the event rate for Pony AI is around 0.1 events per k-miles. This pattern indicates substantial improvement in the reliability of the Pony AI driving system. From the analysis conducted in this section for the different manufacturers, we can see that the overall AV reliability is improving over the two-year period.

As a comparison, Figure 9 shows the estimated BIFs based on the spline model and the parametric models, together with the 95% PCIs based on the spline model. We can see that the spline model shows more variation, but the general trends are the same as those from the parametric models. The estimates for Zoox have a large amount of variability due to the limited sample size, as discussed previously.

7 Concluding Remarks

This paper focuses on the reliability analysis of AV technology using the recurrent disengagement events from the California driving study. We propose a statistical framework for modeling and analyzing the recurrent events data from AV driving tests. We use both parametric models and a nonparametric model to describe the event processes. Based on the spline model, we can select the best model, quantify uncertainty, and test for heterogeneity in the event process. We want to point out that the parametric models and spline models are proposed as complementary tools for modeling and inference.

Through the simulation and the data analysis, we show that the proposed spline model is flexible for describing the recurrent events data from the four AV manufacturers, and that parametric models are adequate for the data from most manufacturers. It is worth noting that the best parametric models can be different for different manufacturers. The population heterogeneity in the event process is also low. From the data analysis, we found that the overall AV reliability is improving over the two-year study period.

Table 4: Parameter estimates, standard errors, and approximate 95% CIs for the best parametric models for the four manufacturers.

Manufacturer/Model    Parameter   Estimate    Std. Err.   95% CI Lower   95% CI Upper
Waymo/Gompertz        θ1          102.2539     31.7600       55.6278        187.9610
                      θ2            0.9975      0.0009        0.9951          0.9987
                      θ3            0.1623      0.1229        0.0319          0.5326
Cruise/Gompertz       θ1          171.3352     66.1936       80.3503        365.3472
                      θ2            0.9963      0.0009        0.9941          0.9977
                      θ3            0.3064      0.2285        0.0510          0.7842
Pony AI/Weibull       θ1          817.203     273.828       423.744        1575.997
                      θ2            0.0474      0.0474        0.0067          0.3363
                      θ3            0.6304      0.1615        0.3815          1.0416
Zoox/Musa-Okumoto     θ1            0.0000      0.0000        0.0000          0.0000
                      θ2            0.5933      0.0779        0.4587          0.7674

Table 5: Summary of the gamma frailty models for the four manufacturers.

Manufacturer   Best Parametric Model   Variance (φ̂)   p-value
Waymo          Gompertz                0.0001          0.9721
Cruise         Gompertz                0.0000          0.9972
Pony AI        Weibull                 0.0000          0.9916
Zoox           Musa-Okumoto            0.0197          0.8400


Figure 8: Plot of the BIFs from the best parametric models for the four manufacturers (Waymo: Gompertz; Cruise: Gompertz; Pony AI: Weibull; Zoox: Musa-Okumoto).

The currently available data do not include covariates, such as the driving speed when the event occurred, the test environment (e.g., busy street versus freeway), and vehicle models. In the future, it would be interesting and useful to collect more covariate information. Our proposed modeling framework can be extended to analyze recurrent disengagement events data with covariates. The California DMV recently started a driverless testing program in which cars on the road are not required to have a driver. It would be interesting to analyze the driverless study data in the future when enough event data are available.

Because reliability is a property that evolves over time, all kinds of AI systems need to be tested over time to quantify their reliability. Although our analysis focuses on AV reliability, our proposed data analytic framework can also be applied to assess the reliability of other AI systems where departures from desired behavior provide recurrent events data. With an appropriate definition of the time scale and events, the parametric and spline models discussed in this paper can be extended to analyze data from the reliability testing of other AI systems.

Computing hardware reliability is also an important aspect of AI reliability. For example, GPUs are widely used in AI model computing. Ostrouchov et al. (2020) considered the lifetimes of GPUs used in supercomputers. It would be interesting to study GPU reliability, or more broadly computing hardware reliability, and its relationship to AI reliability.

Figure 9: Plots of the estimated BIFs for the four manufacturers based on the spline model and parametric models, together with the 95% PCIs based on the spline model. Panels: (a) Waymo; (b) Cruise; (c) Pony AI; (d) Zoox.

Acknowledgments

The authors acknowledge the Advanced Research Computing program at Virginia Tech for providing computational resources. The work by Hong was partially supported by National Science Foundation Grant CMMI-1904165 to Virginia Tech.

References

Alshemali, B. and J. Kalita (2020). Improving the reliability of deep neural networks in NLP: A review. Knowledge-Based Systems 191, 105210.

Amodei, D., C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané (2016). Concrete problems in AI safety. arXiv: 1606.06565.

Åsljung, D., J. Nilsson, and J. Fredriksson (2017). Using extreme value theory for vehicle level safety validation and implications for autonomous vehicles. IEEE Transactions on Intelligent Vehicles 2, 288–297.

Banerjee, S. S., S. Jha, J. Cyriac, Z. T. Kalbarczyk, and R. K. Iyer (2018). Hands off the wheel in autonomous vehicles?: A systems perspective on over a million miles of field data. In 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 586–597. IEEE.

Boggs, A. M., B. Wali, and A. J. Khattak (2020). Exploratory analysis of automated vehicle crashes in California: A text analytics & hierarchical Bayesian heterogeneity-based approach. Accident Analysis and Prevention 135, 105354.

Bosio, A., P. Bernardi, A. Ruospo, and E. Sanchez (2019). A reliability analysis of a deep neural network. In 2019 IEEE Latin American Test Symposium (LATS), pp. 1–6.

Burke, K., M. C. Jones, and A. Noufaily (2020). A flexible parametric modelling framework for survival analysis. Journal of the Royal Statistical Society: Series C 69, 429–457.

Burton, S., I. Habli, T. Lawton, J. McDermid, P. Morgan, and Z. Porter (2020). Mind the gaps: Assuring the safety of autonomous systems from an engineering, ethical, and legal perspective. Artificial Intelligence 279, 103201.

California Department of Motor Vehicles. Autonomous vehicle tester program. [Online]. Available: https://www.dmv.ca.gov/portal/vehicle-industry-services/autonomous-vehicles/, accessed: September 01, 2020.

Cook, R. J. and J. F. Lawless (2007). The Statistical Analysis of Recurrent Events. New York: Springer-Verlag.

Cruise. [Online]. Available: https://www.getcruise.com/, accessed: September 01, 2020.

Dixit, V. V., S. Chand, and D. J. Nair (2016). Autonomous vehicles: disengagements, accidents and reaction times. PLoS One 11, e0168054.

Duchateau, L. and P. Janssen (2008). The Frailty Model. New York: Springer-Verlag.

Ehrlich, W. K., V. N. Nair, M. S. Alam, W. H. Chen, and M. Engel (1998). Software reliability assessment using accelerated testing methods. Journal of the Royal Statistical Society: Series C 47, 15–30.

Favarò, F., S. Eurich, and N. Nader (2018). Autonomous vehicles disengagements: Trends, triggers, and regulatory limitations. Accident Analysis & Prevention 110, 136–148.

Goldstein, B. F., S. Srinivasan, D. Das, K. Banerjee, L. Santiago, V. C. Ferreira, A. S. Nery, S. Kundu, and F. M. G. França (2020). Reliability evaluation of compressed deep learning models. In 2020 IEEE 11th Latin American Symposium on Circuits & Systems (LASCAS), pp. 1–5.

Hong, Y., L. A. Escobar, and W. Q. Meeker (2010). Coverage probabilities of simultaneous confidence bands and regions for log-location-scale distributions. Statistics & Probability Letters 80, 733–738.

Hong, Y., M. Li, and B. Osborn (2015). System unavailability analysis based on window-observed recurrent event data. Applied Stochastic Models in Business and Industry 31, 122–136.

Huang, C.-Y., M. R. Lyu, and S.-Y. Kuo (2003). A unified scheme of some nonhomogeneous Poisson process models for software reliability estimation. IEEE Transactions on Software Engineering 29, 261–269.

Jin, Z., Z. Ying, and L. J. Wei (2001). A simple resampling method by perturbing the minimand. Biometrika 88, 381–390.

Kalra, N. and S. M. Paddock (2016). Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transportation Research Part A: Policy and Practice 94, 182–193.

Lawless, J. F. (1995). The analysis of recurrent events for multiple subjects. Journal of the Royal Statistical Society: Series C 44, 487–498.

Lawless, J. F. (2003). Statistical Models and Methods for Lifetime Data (2nd ed.). Hoboken, New Jersey: John Wiley & Sons, Inc.

Lv, C., D. Cao, Y. Zhao, D. J. Auger, M. Sullman, H. Wang, L. M. Dutka, L. Skrypchuk, and A. Mouzakitis (2018). Analysis of autopilot disengagements occurring during autonomous vehicle testing. IEEE/CAA Journal of Automatica Sinica 5, 58–68.

Meeker, W. Q. and L. A. Escobar (1998). Statistical Methods for Reliability Data. New York: John Wiley & Sons, Inc.

Merkel, R. (2018). Software reliability growth models predict autonomous vehicle disengagement events. arXiv: 1812.08901.

Meyer, M. C. (2008). Inference using shape-restricted regression splines. The Annals of Applied Statistics 2, 1013–1033.

Michelmore, R., M. Wicker, L. Laurenti, L. Cardelli, Y. Gal, and M. Kwiatkowska (2019). Uncertainty quantification with statistical guarantees in end-to-end autonomous driving control. arXiv: 1909.09884.

Musa, J. D. and K. Okumoto (1984). A logarithmic Poisson execution time model for software reliability measurement. In Proceedings of the 7th International Conference on Software Engineering, pp. 230–238. IEEE Press.

Nair, V. N. (1984). Confidence bands for survival functions with censored data: A comparative study. Technometrics 26, 265–275.

Ostrouchov, G., D. Maxwell, R. A. Ashraf, C. Engelmann, M. Shankar, and J. H. Rogers (2020). GPU lifetimes on Titan supercomputer: Survival analysis and reliability. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20), New York, NY. Association for Computing Machinery.

Pony AI. [Online]. Available: https://www.pony.ai/en/index.html, accessed: September 01, 2020.

Ramsay, J. O. (1988). Monotone regression splines in action. Statistical Science 3, 425–441.

Rubin, D. B. (1981). The Bayesian bootstrap. Annals of Statistics 9, 130–134.

Shan, Q., Y. Hong, and W. Q. Meeker (2020). Seasonal warranty prediction based on recurrent event data. Annals of Applied Statistics 14, 929–955.

Waymo. [Online]. Available: https://waymo.com/, accessed: September 01, 2020.

Wood, A. (1996). Software reliability growth models. Tandem Technical Report 96.1, Tandem Computers, Cupertino, CA.

Xie, Y., C. B. King, Y. Hong, and Q. Yang (2018). Semi-parametric models for accelerated destructive degradation test data analysis. Technometrics 60, 222–234.

Xu, L., C. Gotwalt, Y. Hong, C. B. King, and W. Q. Meeker (2020). Applications of the fractional-random-weight bootstrap. The American Statistician 74, 345–358.

Zhao, X., A. Banks, J. Sharp, V. Robu, D. Flynn, M. Fisher, and X. Huang (2020). A safety framework for critical systems utilising deep neural networks. arXiv: 2003.05311.

Zhao, X., V. Robu, D. Flynn, K. Salako, and L. Strigini (2019). Assessing the safety and reliability of autonomous vehicles from road testing. In 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE), pp. 13–23.

Zoox. [Online]. Available: https://zoox.com/, accessed: September 01, 2020.

Zuo, J., W. Q. Meeker, and H. Wu (2008). Analysis of window-observation recurrence data. Technometrics 50, 128–143.
