INFORMS JOURNAL ON COMPUTING Articles in Advance,pp.1–14 http://pubsonline.informs.org/journal/ijoc/ ISSN1091-9856(print),ISSN1526-5528(online)

Robust Maximum Likelihood Estimation

Dimitris Bertsimas,a Omid Nohadanib a Operations Research Center and Sloan School of Management, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139; b Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois 60208 Contact: [email protected], http://orcid.org/0000-0002-1985-1003 (DB); [email protected], http://orcid.org/0000-0001-6332-3403 (ON)

Received: January 10, 2016 Abstract. In many applications, statistical serve to derive conclusions from Revised: February 12, 2017; December 19, 2017 , for example, in finance, medical decision making, and clinical trials. However, the Accepted: April 30, 2018 conclusions are typically dependent on uncertainties in the data. We use robust opti- Published Online in Articles in Advance: mization principles to provide robust maximum likelihood estimators that are protected April 26, 2019 against data errors. Both types of input data errors are considered: (a) the adversarial https://doi.org/10.1287/ijoc.2018.0834 type, modeled using the notion of uncertainty sets, and (b) the probabilistic type, modeled by distributions. We provide efficient local and global search algorithms to compute the robust Copyright: © 2019 INFORMS estimators and discuss them in detail for the case of multivariate normally distributed data. The performance is demonstrated on two applications. First, using computer simulations, we demonstrate that the proposed estimators are robust against both types of data uncertainty and provide more accurate estimates compared with classical estimators, which degrade significantly, when errors are encountered. We establish a of uncer- tainty sizes for which robust estimators are superior. Second, we analyze deviations in cancer radiation therapy planning. Uncertainties among plans are caused by patients’ individual anatomies and the trial-and-error nature of the process. When analyzing a large set of past clinical treatment data, robust estimators lead to more reliable decisions when applied to a large set of past treatment plans.

History: Accepted by Karen Aardal, Area Editor for Design and Analysis of Algorithms. Funding: O. Nohadani received support from the National Science Foundation [Grant CMMI-1463489]. Supplemental Material: The online supplement is available at https://doi.org/10.1287/ijoc.2018.0834.

Keywords: optimization • robust optimization • robust • maximum likelihood estimator • radiation therapy

1. Introduction of maximum likelihood estimators completely. There- Maximum likelihood is widely used to successfully con- fore, it is instrumental to the success of the application to struct statistical estimators for parameters of a proba- construct estimators that are intrinsically robust against bility distribution. This method is prevalent in many possible sources of uncertainty. applications, ranging from and machine In this work, we seek to estimate the parameter θ for learning to many areas of science and engineering, the probability density function f θ, x of an ensemble m ( ) where the nature of the data motivates a functional of n data points xi R , which can be contaminated by ∆ m ∈ form of the underlying distribution. The parameters of errors xi R in some uncertainty set 8. The robust ∈ this distribution remain to be estimated (Pfanzagl 1994). MLE maximizes the worst-case likelihood via For uncertain distributions, the estimators can be lo- n max min f θ; x ∆x , cated in a family of them by leveraging the minimax ∆ ∏ i i (1) θ X 8 i 1 ( − ) asymptotic (Huber et al. 1964, Huber 1996). ∈ ! This is also possible in the common case of symmetric following the robust optimization (RO) paradigm. contamination (Jaeckel 1971). Maximum likelihood Robust optimization has increasingly been used as estimator (MLE) methods typically assume data to be an effective way to immunize solutions against data complete, precise, and free of errors. In reality, however, uncertainty. In principle, if errors are not taken into data are often insufficient. Moreover, any input data account, an otherwise optimal solution may turn out can be subject to errors and perturbations. These can to be suboptimal, or even in some cases infeasible. RO, stem from (a) measurement errors, (b) input errors, however, considers errors to reside within an uncer- (c) implementation errors, (d) numerical errors, or (e) model tainty set and aims to calculate solutions that are robust errors. These sources of uncertainty affect the quality to such uncertainty. There is a sizable body of literature of the estimators and can degrade outcomes signifi- on various aspects of RO, and we refer to Ben-Tal et al. cantly, so much so that we might lose the advantages (2009) and Bertsimas et al. (2011). In the context of

1 Bertsimas and Nohadani: Robust Maximum Likelihood Estimation 2 INFORMS Journal on Computing, Articles in Advance, pp. 1–14, © 2019 INFORMS simulation-based problems, that is, problems not given data, where errors are observed on the data and a priori by a closed form solution, a local search algorithm was information on distributions is not available. proposed that provides robust solutions to unconstrained In principle, errors may be of an adversarial nature, (Bertsimas et al. 2010b)andconstrained(Bertsimasetal. where no probabilistic information about their source is 2010a)problemswithoutexploitingthestructureofthe known, or of a distributional nature, where the source is problem or the uncertainty set. known to be probabilistic. Correspondingly, we discuss The effects of measurement errors on statistical es- two kinds of robust maximum likelihood estimators: timators have been addressed extensively in the liter- 1. Adversarially robust: The worst-case scenario is ature, for example, by Buonaccorsi (2010) and the calculated among possible errors that reside in some references within. Measurement error models assume uncertainty set. To compute these estimators, we pro- a distribution of the observed values given the true pose two methods: a first-order gradient descent al- values of a certain quantity (Fuller 2009). This is the gorithm, which is highly efficient and warrants local reverse of the Berkson error model, which assumes optimal robust estimators, and a global search method a distribution on the true values given the observed based on robust simulated annealing, which provides values (Berkson 1950). The rich literature for error cor- global optimal robust estimators at higher computa- rection provides a plethora of techniques for correcting tionally expense. additive errors in (Cheng et al. 1999). 2. Distributionally robust: The worst-case scenario is El Ghaoui and Lebret (1997) showed that robust least evaluated among errors that are independent and fol- squares problems for erroneous but bounded data low some distribution residing in some set of distri- can be formulated as second-order cone or semidefi- butions. Such errors resemble persistent errors. Using nite optimization problems, and thus become effi- distributional robust optimization techniques, we show ciently solvable. In the context of maximum likelihood, that their estimators are a particular case of adversa- Calafiore and El Ghaoui (2001) elaborated on estima- rially robust estimators. tors in linear models in the presence of Gaussian noise To demonstrate the performance of the methods, we whose parameters are uncertain. The proposed esti- apply our methods to two types of data sets. First, we mators maximize a lower bound on the worst-case conduct numerical on simulated data to likelihood using semidefinite optimization. ensure a controlled setting and to be able to determine In the context of robust statistics, Huber (1980) in- the from true data. We show that for small- troduced estimators that are insensitive to perturba- sized errors, both the local and the global RO methods tions. The robustness of estimators is measured in yield comparable estimates. For larger errors, how- different ways: For instance, the breakdown point is ever, we observe that the robust simulated annealing defined as the minimum amount of contamination that method outperforms the local search method. Moreover, causes the estimator to become unreliable. Another we show that the proposed estimators are also immune measure is the influence curve that describes the impact against the source of uncertainty; that is, even if the of to an estimator (Hampel 1974). errors follow a different distribution than anticipated, In this paper, we introduce a robust MLE method to the estimators remain robustly optimal. Furthermore, produce estimators that are robust to data uncertainty. the proposed robust estimators turn out to be signifi- Our approach differs from Huber’s(1980) in multiple cantly more accurate compared with classical maxi- facets, as summarized in Table 1. Because the likeli- mum likelihood estimators, which degrade sizably, hood is not proportional to the error in the estimation, when errors are encountered. Finally, we establish the our proposed method considers the worst case directly range within which the robust estimators are most in the likelihood. Correspondingly, we believe that our effective. This range can inform practitioners about proposed approach is directly relevant to real-world the appropriate size of the uncertainty set. The error size–dependent observation can be generalized to a broader range of RO approaches. Table 1. Comparison of Huber’s(1980) Approach to the In the second application, past patient data for cancer Proposed RO Approach radiation therapy serve to probe the method. In clini- Comparison Huber’s approach Our approach cal practice, the quality of treatment plans is typically examined based on the spatial dose distribution, sum- Estimators Functionals on the space Functions on the marized in five specific observable dose points and of distributions observed data Worst case In the value of the In the value of the compared with internally recommended criteria. How- estimator likelihood ever, the recording and evaluation of these criteria are Observables Observed distributions True data reside in subject to human uncertainty. To evaluate the overall reside in a neighborhood a neighborhood performance of an individual clinician, a team, or an around the true around the entire institution, statistical estimators are employed. distribution observed data The resulting decisions remain highly sensitive to the Bertsimas and Nohadani: Robust Maximum Likelihood Estimation INFORMS Journal on Computing, Articles in Advance, pp. 1–14, © 2019 INFORMS 3 uncertainty in data. We analyze 491 treatment plans uncertainty set, which leads to adversarially robust for various tumor sites that have already undergone estimators (AREs), introduced in Section 2.1; radiation therapy and compute robust estimators to (b) errors can be considered as random variables support physicians in arriving at more dependable with known support, which leads to designing dis- conclusions. We show that robust estimators have a tributionally robust estimators (DREs), introduced in significantly reduced and stable spread over different Section 2.3. samples, when compared with nominal estimators, In both cases, we assume the magnitude of ∆X to offering more reliable and sample-independent decision be sizably smaller than that of Xtrue, making the esti- making. mators identifiable. When they are of comparable sizes, The structure of this paper is as follows: In Section 2, the distinction of their parameters fades. However, we define the robust estimators and introduce the RO a discussion on identifiability for general cases is be- problem for robust estimators. In Section 2.1, we dis- yond the scope of this work. With respect to likelihood cuss the robust maximum likelihood estimators, along methods, our approach can be regarded as semiparametric with a corresponding local as well as a global search because we specify only the distribution of Xtrue para- algorithm. Section 2.1 also details the method to com- metrically, and not that of ∆X. pute the corresponding robust estimates when the data follow a specific distribution, namely, the multivariate 2.1. Adversarial Robust Maximum . In Section 3, we report on the per- Likelihood Estimators formance of the proposed robust estimators on simu- In most real-world applications, the knowledge about lated data. In Section 4, we compute robust estimators the error distribution is not accessible and at times not for a large set of clinical cancer radiation therapy data. even existent. For example, medical records are often In Section 5, we conclude our findings. manually transferred from a diagnostic device onto paper and later into an electronic form, and during each 2. Robust Maximum Likelihood Estimators step, copying errors and uncertainties from unit con- To introduce the robust estimators, we first differentiate version may occur. The nature of these errors cannot between observed and true data. Consider the following be associated to a known distribution. Therefore, fol- setting: we can only observe samples xobs, i 1, 2, ..., n, ∆ i ! lowing the RO paradigm, we model such errors X which may include errors. This is expressed via as belonging to an uncertainty set, which is assumed to be a convex set and denoted by 8. The set 8 is xobs xtrue ∆x , i 1, 2, ..., n, i ! i + i ! typically determined by the underlying application; for true example, the accuracy of a measurement device de- where xi is the error-free (but not observable) data, and ∆xi is the error in the ith sample. The error-free termines the size of measurement errors in data. When true the Euclidean norm of the errors is bounded by a pa- data xi are assumed to be distributed according to a distribution 0 θ , with probability density function rameter Γ > 0, the corresponding uncertainty set can be f θ; x , where θ is( the) parameter we wish to estimate. expressed as ( Maximum) likelihood estimator seeks to find θ that u maximizes the probability density function 8 ∆X ∆x1, ∆x2, ..., ∆xn ∆xi Γ, ! ![ ] ∥ ∥2 ≤ n n % i n & true obs ∆ 1, 2, ..., . & (4) ∏ f θ; xi ∏ f θ; xi xi , (2) ! & i 1 ≡ i 1 − ' ! ! or equivalently! maximizes" ! the log-likelihood" density Although this set serves to clarify the exposition, our function gradient descent approach does not this spe- cific structure, as will be discussed. We seek to find θ n obs obs that maximizes the log-likelihood in Equation (3) against ψ θ; X ∆X log ∏ f θ; x ∆Xi , (3) − ≡ i − all errors (in particular the worst-case one) in 8. #$i 1 ! " ! ! " Therefore, the ARE is the solution to where Xobs denotes the ensemble of the observed data obs ∆ ∆ n xi , and X the ensemble of xi as obs ∆ max min log f θ; xi xi . (5) θ ∆X 8 − u ∈ i 1 Xobs xobs, xobs, ..., xobs , and (! ![ 1 2 n ] ! ! "" u 8 ∆X ∆x1, ∆x2, ..., ∆xn . Note that if there are no errors, 0 , and prob- ![ ] lem (5) corresponds to classical maximum!{ } likelihood In what will follow, the errors ∆xi, i 1, 2, ..., n, are estimation. As the size of 8 increases, the ARE may modeled in two different ways: ! become more conservative, that is, we attempt to im- (a) no further knowledge about the nature of errors munize against larger errors, potentially at the expense is available, and we consider them to reside within an of lower likelihood. Bertsimas and Nohadani: Robust Maximum Likelihood Estimation 4 INFORMS Journal on Computing, Articles in Advance, pp. 1–14, © 2019 INFORMS

2.2. Computation of AREs In summary, if we can compute the directional de- To compute adversarial maximum likelihood estimates, rivative of the inner function of a robust optimization we first discuss the gradient-based robust optimization problem, the above method can efficiently provide method for nonconvex cost functions before extending robust solutions. We now extend this approach to it to compute local optimal estimators. We then intro- compute robust estimators. duce a global search method based on robust simulated annealing that warrants global robust optimality. 2.2.2. Local Optimal Estimator. Let φ θ be the solution to the inner minimization problem ((5)) as 2.2.1. Robust Nonconvex Optimization. In general, for obs a continuously differentiable and possibly nonconvex φ θ min ψ θ; X ∆X (8) ( ) ≡ ∆X 8 − cost function f x , where x Rn is the decision or data ∈ ! " vector, the robust( ) optimization∈ problem is given through with ∆ ∆ obs ∆ min g x min max f x x . (6) X∗ θ arg min ψ θ; X X . (9) x ( ) ≡ x ∆x 8 ( + ) ( ) ≡ ∆X 8 − ∈ ∈ ! " Here, the errors ∆x directly affect the decision variables By applying Danskin’s(1966) theorem, we have and are bound in an uncertainty set 8. The robust obs problem can be solved by updating x along descent =θφ θ =θψ θ; X ∆X∗ θ0 . (10) ( ) θ θ0 ! ( − ( )) θ θ0 directions that reduce g by excluding worst errors. In & ! & ! & & other words, d is a descent direction for the robust Note that we& do not need to calculate& the gradient obs optimization problem (6) at x, if the directional de- at X ∆X∗ θ . Given the ability to calculate Equa- rivative in direction d satisfies the following condition: tion( (10),− we can( )) construct a gradient descent algorithm g′ x < 0. (7) with diminishing step size, which has been shown to ( ) converge to a local minimum (Bertsimas et al. 2010b). Note that for problem (6), it may not be possible to find ∆x∗ arg max∆x 8 f x ∆x or the solution may not be 2.2.3. The Case of Multivariate Normal Distribution. To ! ∈ ( + ) unique. However, it has been shown that when it is demonstrate the performance of this approach, some ∆ ∆ possible to provide a collection of } x1, ..., xm specifications on the underlying distribution of the data ∆ ∆ ∆ !{ } with xi 8 and x∗ i ∆xi } λi xi for some λi 0, are necessary. We use the example of the multivariate ∈ ! | ∈ u ≥ then d is a descent direction for g x; d ,ifd ∆xi < 0 normal distribution of the observed data. In particular, ) ( ) ∀∆xi } (Bertsimas et al. 2010b). Furthermore, such the framework for computing AREs is employed to ∈ a descent direction points away from all the worst estimate the µ Rm and the matrix ∈ implementation errors in 8, as shown by the following Σ Sm+ , where Sm+ is the set of m × m symmetric and theorem: positive∈ semidefinite matrices. The probability density function for some observed data xobs is Theorem 1. Suppose that f x is continuously differentiable, 8 ( ) 8 the uncertainty set is defined as in (4), and ∗ x ≔ obs 1 n ( ) f µ, Σ; x ∆x∗ ∆x∗ arg max f x ∆x . Then, d R is a descent m/2 1/2 | ∈ ∆x 8 ( + ) ∈ ! 2π |Σ| %direction for the worst-case∈ cost' function g x at x xˆ if ! " ( ) (11) ( ) ! ∆ 1 obs u 1 obs and only if for all x∗ 8∗ xˆ , · exp x µ Σ− x µ . ∈ ( ) −2( − ) ( − ) u #$ d ∆x∗ < 0, Using the uncertainty set defined in (4), problem (8) for =x f x xˆ ∆x∗ ≠ 0. ( ! + ) the estimators θ µ, Σ becomes !( ) Note that all descent directions d reside in the strict in- obs obs φ µ, Σ; X minψ µ, Σ; X ∆X (12) terior of a cone, which is normal to the cone spanned by all ( )!∆X 8 − ! ∆ ∈ the vectors x∗ 8∗ xˆ .Consequently,theworst-casecost nm !n " ∈ ( ) min log 2π log|Σ| at xˆ can be strictly decreased, if a sufficiently small step is ∆ xi 2 Γ − 2 ( )−2 taken along any directions within this cone, leading to ∥ ∥ ≤n 1 obs ∆ u 1 obs ∆ solutions that are more robust. All worst solutions, xi xi µ Σ− xi xi µ . + i 1 − 2 ( − − ) ( − − ) xˆ ∆x∗,wouldalsolieoutsidetheneighborhoodofthe + (! updated solution. Therefore, x∗ can be considered a robust Note, the objective function and constraints are sepa- local minimum, if there exists no descent direction for the rable in ∆xi. Thus, it suffices to solve robust problem at x x∗.TheproofofTheorem1aswell ! as empirical evidence of the robust optimization method’s 1 obs u 1 obs min x ∆x µ Σ− x ∆x µ (13) ∆ i i i i behavior is discussed in Bertsimas et al. (2010b). xi Γ − 2( − − ) ( − − ) ∥ ∥2≤ Bertsimas and Nohadani: Robust Maximum Likelihood Estimation INFORMS Journal on Computing, Articles in Advance, pp. 1–14, © 2019 INFORMS 5 for each i 1, 2, ..., n. The objective function of prob- from the nominal optimum, this iterative algorithm lem (13) can! be written as lowers the worst-case performance g successively. At each step, g is computed for the corresponding estimates 1 obs u 1 obs within the uncertainty set, and the inverse temperature x ∆xi µ Σ− x ∆xi µ − 2( i − − ) ( i − − )! is determined. The Boltzmann weight assigned to the u current estimates are then compared for a trial estimate. 1 u 1 1 obs ∆x Σ− ∆xi Σ− x µ ∆xi − 2 i +[ ( i − )] If the trial estimates lead to a lower g, they will become the estimates of the next step. Otherwise, they will most

1 obs u 1 obs likely be rejected and new trial estimates will be gen- x µ Σ− x µ . − 2( i − ) ( i − ) erated and compared. We refer interested readers to Bertsimas and Nohadani (2010) for further details. For Problem (13) is a trust region problem that has been completeness, however, we summarize the steps of the solved in the literature, as by Boyd and Vandenberghe algorithm along with the notation in Section 2 of the (2004)andRendlandWolkowicz(1997). In another online supplement. work, Ye (1992)demonstratedthatforsuchprob- lems, a hybrid algorithm that combines Newton’s 2.3. Distributional Robust Maximum method and a binary search can solve the problem in Likelihood Estimators O log log 1/ϵ iterations, with error ϵ. In the online ∆ ( ( ( ))) In some cases, the errors xi are independent of the supplement, Section 1, we briefly describe the steps for samples and among each other, and they follow the computing this inner problem. Overall, the algorithm same distribution P.Forexample,whenthereisa to compute the robust and normal distributed esti- fixed error caused by misalignments of the equip- mators, as defined in problem (5), can be summarized ment, we can consider them as persistent errors that as follows: lend themselves to a distributional description. For 1. Initialize with some estimators θ µ , Σ u. this, let fP ∆x be the probability density function for ∆![ ] ( ) 2. Solve problem (13) to obtain xi∗ θ for each P.Then,xobs follows a distribution, whose density is i 1, 2, ..., n. ( ) i ! ∆ ∆ the convolution 3. Use the worse-case errors X∗ θ x1∗ θ , u ( )![ obs( ) ∆x∗ θ , ..., ∆x∗ θ to calculate φ θ ψ θ; X 2( ) n( )] ( )! ( − obs ∆ ∆ ∆ obs ∆ ∆ ∆X = f θ; xi x fP x d x f θ; xi x dP x ∗ θ and Equation (10) to compute its derivative φ. ( − ) ( ) ! ( − ) ( ) 4.( Construct)) a Q such that Q · θ 0. + + ! obs ∆ 5. Compute =φ as the projection of =φ onto the E∆x~P f θ; xi x . ! [ ( − )] · subspace Q θ 0 using the kernel of Q. (14) 6. Update θ !using* the descent direction given by =φ (preserves Σ S+). ∈ Following the paradigm of distributional robust opti- 7. Stop when the norm of the derivative is smaller* mization (Delage and Ye 2010), we consider all possible than some tolerance parameter ϵ; otherwise, iterate distributions to have a bounded support 6 and the back to Step 2. DRE to be the solution to Following the result of Theorem 1, this algorithm provides the local robust maximum likelihood estima- n obs ∆ ∆ tors. Furthermore, the convergence of the algorithm is max min logE xi~P f θ; xi xi . (15) θ P: supp P 6 [ ( − )] linear, because it is a first-order method using gradient ( )! i 1 (! descent. Second-order methods are computationally ∆ Because all distributions share 6, problem (15) can be inefficient, because the inner derivative =θ xi∗ θ com- plicates the calculation of the second derivate( of φ(.Wenow)) reformulated as discuss an alternative method to obtain the global optimal n estimators in the presence of errors in the input data. max min log f θ; xi ∆x , (16) ∆x 6 θ i 1 ( − ) ∈ ! 2.2.4. Global Optimal Estimator. The robust normal ( distribution estimators can also be calculated using which has the same structure as problem (5), and thus a global search method. The global search method is can be solved in a similar fashion as that for the ARE, based on the robust simulated annealing algorithm using a gradient descent algorithm. introduced by Bertsimas and Nohadani (2010). To fol- low the original robust simulated annealing methods, 2.3.1. The Case of Multivariate Normal Distribution. we recast the maximization problem max φ θ; Xobs as Analogous to AREs, we demonstrate the performance ( ) a minimization problem, ming, with g φ. Starting of DREs by the specific assumption of a multivariate ≡ − Bertsimas and Nohadani: Robust Maximum Likelihood Estimation 6 INFORMS Journal on Computing, Articles in Advance, pp. 1–14, © 2019 INFORMS normal distribution. The inner minimization of prob- in the samples, we generate errors following both a lem (16) is normal and a uniform distribution, and use them to contaminate our samples. min ψ µ, Σ; Xobs 1 ∆xu ∆ n More specifically, the experiments are conducted in x 2 Γ ( − ∥ ∥ ≤ " the following fashion. A number of n 400 samples in nm n 4 ! min log 2π log Σ R are generated randomly, following the multivariate ! ∆x Γ − 2 ( )−2 | | ∥ ∥2≤ normal distribution with some random mean and some n (17) random . Let Xtrue be the 400 × 4 1 obs ∆ u 1 obs ∆ xi x µ Σ− xi x µ matrix containing the samples xtrue, i 1, 2, ..., 400, in + i 1 − 2( − − ) ( − − ) i ! true ! nm( n its rows. The vectors xi are the true samples and are log 2π log Σ not affected by errors. !− 2 ( )−2 | | u Furthermore, we generate errors on the samples in n ∆ u n 1 1 obs the following way: Xk, k 1, 2, ..., 40, is a 400 × 4 min ∆x Σ− ∆x Σ− x nµ ∆x ∆ i ! + x 2 Γ. − 2 + 01./i 1 − ) matrix containing errors corresponding to the samples ∥ ∥ ≤ ,- ! true ∆ ( in the 400 × 4 matrix X . The errors in Xk follow a n 1 obs u 1 obs normal distribution with mean 0 and covariance matrix xi µ Σ− xi µ , 4 + i 1 − 2 ( − ) ( − ) I4, where 0 is the zero vector in R , and I4 is the 4 × 4 (! identity matrix. In the experiments, we will use the where 1n is a size n vector of ones. Note that prob- parameter ρ to scale the magnitude of contamination. lem (17) is a trust region problem (online supplement, Correspondingly, we also employ ρ to tune the un- Section 1.2). In the maximization of problem (16), the certainty set size. In this context, Γ is used for a constant descent direction is projected into the space of positive set size, as will be discussed in Section 3.2. semidefinite matrices. Furthermore, problem (17) is a For the uncertainty set that contains the simulated ∆ special case of problem (12) in that all xi are restricted errors, we evaluate the performance of the estimators to be the same. This implies that DREs are less con- using the worst-case and average values of the prob- servative than AREs. ability density, as well as their distance from the value of the nominal estimator on the true data. Initially, we 2.4. Discussion use the normally distributed errors. Later, we compare When using data for , overfitting the results with the case of uniformly distributed errors. has been a central challenge. A variety of machine In each case, we consider both AREs and DREs. learning algorithms have been developed to overcome The experimental section is organized as follows. these issues, particularly via regularization techniques. In Section 3.1,weevaluatetheestimatorsbasedon The paradigm of robust optimization provides a uni- the worst-case and average values of the probability fying justification for the success of many of these density. In Section 3.2,weevaluatetheestimatorsbased methods. The robustness of a result can be associated on their distance from the nominal estimator on the true with problem attributes such as consistency and spar- data. In Section 3.3,wecomparetherobustestimatorsto sity (see Sra et al. 2012 and references within). In this the nominal one, using the local and the global search vein, our presented robust maximum likelihood esti- methods for the ARE and DRE cases. In Section 3.4,we mators are also robust against overfitting, as the inner discuss the effects of different distributions for the errors, minimization problem inherently reduces the complex- in particular, when we have uniformly distributed errors. ity and improves the predicative power of the esti- mator. Both of the following numerical experiments 3.1. Worst-Case and Average Probability Density support this observation, as the resulting estimators are To probe the efficiency of the proposed robust estimators, broadly insensitive to error size and sample size. we will check the worst-case and average values of the probability density, as we add properly scaled errors ∆ 3. Computational Results with from the set of errors Xk to the true values of the data. Simulation Data In particular, we calculate the AREs for the true data To evaluate the robust estimators, we conduct exper- Xtrue, for varying sizes of the assumed uncertainty set, iments using computer-generated random data. The as defined in (4), between ρ 0 and 3 with a step size of ! true ˆ true purpose of using simulated data is that we have ac- 0.1. We denote them by µˆ rob.a. X , ρ and Σrob.a. X ,ρ . curate information about the true data that can serve as Moreover, we calculate the( DREs) in the same( cases) true ˆ true a reference, allowing us to directly measure the quality and denote them by µˆ rob.d. X ,ρ and Σrob.d. X ,ρ . of robust estimators on observed data. We generate For ρ 0, we have the nominal( estimates,) which( co-) ! true samples following a multivariate normal distribution. incide with the true estimators, denoted by µˆ true X As our estimators are designed to deal with errors and Σˆ Xtrue . To calculate the robust estimators,( ) true( ) Bertsimas and Nohadani: Robust Maximum Likelihood Estimation INFORMS Journal on Computing, Articles in Advance, pp. 1–14, © 2019 INFORMS 7 we use a first-order gradient descent method. The initial holds for both the worst-case value and the average point is the robust estimate for the previous value of ρ, value of the probability density. This is because the within the considered ρ sequence. For each computed robust estimator always protects against the worst- estimate µˆ , Σˆ , we determine the log-likelihood density case scenario; thus, it does not degrade as fast as the true ∆ of the observed samples ψ µˆ ,Σˆ ,X αρ Xk , k 1, nominal estimators for increasing error sizes. 2,...,40. We record the worst-case( value+ as well) as! the average value over the set of errors indexed by k to rule 3.1.2. DRE Case. Figure 1 (bottom row) shows the out data-specificartifactsinourobservations.We performance of log-likelihood density function ψ as a consider the cases α 0.5, α 1.0, and α 1.5. function of the error size in the DRE case. We observe ! ! ! a similar behavior as in the ARE case. The difference 3.1.1. ARE Case. Figure 1 (top row) shows the results between the DRE and the nominal grows at a higher in the ARE case. The worst-case and average values of rate than the difference between the ARE and the nom- the log-likelihood density function are plotted over the inal. In all cases, the superiority of the robust estimator size of the perturbation ρ. For better comparison, these is detected for values of ρ greater than or equal to 1. The values are normalized to the corresponding unper- ARE is better than the nominal up to a factor of 10%, turbed values. We observe that for small values of ρ, the and the DRE is better than the nominal up to a factor of nominal and robust estimators depict the same per- 15%. As α increases, both nominal and robust per- formance. As ρ grows, however, the difference between formances deteriorate at a higher rate. them increases, with the robust always showing a Note the smooth transition between the nominal better performance than the nominal. This observation (i.e., ρ 0 in ψ µˆ , Σˆ , Xtrue αρ∆X ) and the robust ! ( + ) Figure 1. (Color online) Comparison: (Left) Worst-Case and (Right) Average Log-Likelihood ψ for Changing Perturbation Size ρ for the (Top) ARE and (Bottom) DRE Cases Bertsimas and Nohadani: Robust Maximum Likelihood Estimation 8 INFORMS Journal on Computing, Articles in Advance, pp. 1–14, © 2019 INFORMS log-likelihood (ρ > 0) for both the ARE and DRE cases, Note that A fro is the Frobenius norm of an n × m suggesting the identifiability of estimators. matrix A de∥fined∥ by

3.2. Distance From Nominal Estimators n m 2 To evaluate how robust estimators perform in the A fro Ai, j, ∥ ∥ ! 433333333333333i 1 j 1 presence of errors, we compute both the nominal and (! (! the robust estimators on contaminated data with errors where Ai j is the i, j element of matrix A. We av- of different sizes and compare them to nominal esti- , ( ) mators computed on the true data. erage the calculated distances over the error sets k, ˆ true k 1, 2, ..., 40. We use the Frobenius norm because it More specifically, we compute the AREs µrob.a. X ! ∆ true ∆ ( + takes into consideration the differences of the δ Xk, ρ , Σˆ rob.a. X δ Xk, ρ on the contaminated data for) error sets( k + 1, 2, ...), 40, for the values of δ of the variables as well as their cross correlation terms. 0, 1 with 0.05 steps,! and for the value of ρ 0, 1 with! The range of uncertainty size (ρ) where the robust 0[ .1] steps. We also compute the DREs in![ the] same estimator outperforms the nominal (true) depends on δ, fashion. For ρ 0, we have the nominal estimators. For the size of the errors, illustrated in Figure 2. When each estimate,! Figure 2 shows the calculated distances encountered errors are small sized, the robust solutions coincide with their nominal counterparts; that is, the true ∆ true Errorµ µˆ rob X δ Xk, ρ µˆ true X 2, (18) difference between the robust and true estimators is in- ! + − dependent of ρ. For a range of error sizes, the observ- 2ˆ ! true ∆ " ˆ ! true "2 ErrorΣ Σrob X δ Xk, ρ Σtrue X . (19) able dip demonstrates that robust estimators clearly ! 2 + − 2fro 2 2 2 ! " ! "2 Figure 2. (Color online) Range of Effectiveness of Robust Estimators as the Distance to True Estimators: (Left) Equation (18) for µ and (Right) Equation (19) for Σ for the (Top) ARE and (Bottom) DRE Cases Bertsimas and Nohadani: Robust Maximum Likelihood Estimation INFORMS Journal on Computing, Articles in Advance, pp. 1–14, © 2019 INFORMS 9 improve in accuracy. As δ increases, the interval with that covers the entire parameter space, belonging to the improved performance moves to higher ρ. This is at hand. For each run, the performance is again explained by the fact that the robust estimator is se- evaluated by measuring the distance to the nominal cured against errors with the norm up to some ρ, and estimators on the true data. Furthermore, for each thus it cannot deal with higher errors. In data-driven initial point, all performances are averaged over the applications, the analysis of the dip can be used re- set of the errors used. versely to motivate the appropriate size of the uncer- Figure 3 depicts the comparison between the local and tainty set Γ ρ∗. the global search algorithms in the ARE case. For ! completeness, we also show the DRE case, which fol- 3.3. Comparison of the Local and Global Estimators lows the same trend. These results show that as δ grows, In this section, we compare our gradient-based method the performance of all estimators deteriorates, but the with the global search algorithm. Both methods are deterioration has a smaller rate in the case of the robust employed to evaluate the ARE, as defined in prob- estimators. The DREs show a slightly better perfor- lem (5). Because the inner problem is the same, we con- mance than the AREs. This is because the errors for the tinue using the same algorithm for the inner problem, samples in the ARE case are not correlated, whereas all as described in Section 2.2, to enable a quantifiable sample errors in the DRE case follow the same perturbed comparison. distribution and, thus, are less conservative. For small Using the same data set as before, we compare the size errors, both the local and the global as well as the corresponding robust estimators to the nominal (true) ARE and DRE cases perform similarly. On the other counterparts for various values of δ, analogous to the hand, the robust simulated annealing algorithm out- previous . We evaluate both estimators on performs the gradient-based local search algorithm for data contaminated with errors that are scaled with δ. δ > 0.3, despite the 100 initial points for the local search. For values of δ ranging between 0 and 1 with a step However, the global estimators come at a higher com- of 0.1, we compute the nominal (true) and the robust putational expense. Whereas the gradient ascent method true ∆ estimates on the contaminated data X δ Xk . From requires approximately 100 seconds on a Intel Xeon 3.4 the region of best performance of the( robust+ estimator,) GHz desktop machine, the simulated annealing method as illustrated in Figure 2, we extract the best parameter terminated after approximately 215 seconds for one value ρ∗ that yields the lowest robust estimate errors for a particular instance of the samples. Therefore, for small- particular δ,asdefined in Equations (18)and(19). This ρ∗ sized errors, it is more meaningful to employ the gradient is used in the experiments to evaluate the performance. ascent method to evaluate local robust estimators. On the To come close to a global optimum, we conduct the other hand, when errors are larger, using the more costly gradient-based local search from 100 different initial simulated annealing method leads to better-performing estimates. These are chosen as nodes of a uniform grid global robust estimators.

Figure 3. (Color online) Comparison of the Errors in (Left) µ and (Right) Σ in the ARE and DRE Cases Using the Local and Global Robust Algorithms for Different Levels of Contamination δ

Note. All errors follow a normal distribution. Bertsimas and Nohadani: Robust Maximum Likelihood Estimation 10 INFORMS Journal on Computing, Articles in Advance, pp. 1–14, © 2019 INFORMS

3.4. Comparison Between the Error Distributions and equal to zero. For this strictly pa- In this section, we investigate whether the above ob- rametrized setting, we estimate the parameters by the servations and, thus, the quality of the proposed robust proposed robust MLE method and by conventional estimators depend on the source of errors. For this, we . In this experiment, we scale the results conduct the same experiments using a different error in terms of b for a direct comparison. distribution, namely, when errors are uniformly dis- We observe that both methods comparably estimate tributed. Now, ∆X , k 1, 2, ..., 40, is a 400 × 4 matrix, the variances to an accuracy of ± b/10. In fact, the es- k ! where each of its rows follows the uniform distribution timator provided by robust maximum likelihood esti- in a ball with radius 1. mation improves slightly over factor analysis within an When measuring the distances to the true estimators, uncertainty size window, similar to the observations in as in (18) and (19), we observe that for the same δ, the Section 3.2 and displayed in Figure 2. On the other hand, region where the robust estimator is superior is shifted factor analysis outperforms robust maximum likelihood to smaller values of ρ (not shown here, but comparable estimation in estimating the covariances. The Frobenius to Figure 2). This is because the samples from a uniform norm of the difference between the estimated and gen- distribution are contained in the ball with radius δ, erated covariance matrices is 0.6b smaller for factor whereas normally distributed samples can reside outs- analysis than robust maximum likelihood estimation, ide this region. Overall, we observe similar trends for demonstrating the advantage of the parametric the ARE and DRE cases as in the case of the normally model. However, this advantage deteriorates when an distributed errors, as exemplified in Figure 4. There- incorrect parametric factor analysis model is used. fore, we can conclude that robust estimators are also robust to the source of uncertainty. Because we do not 3.6. Summary of Simulation Results expect any additional insight by a comparison with In our computer simulations, we applied the local and global estimates, we omit the discussion here. the global RO methods to randomly generated data sets that were contaminated in two different ways: 3.5. Comparison with Factor Analysis the ARE case and the DRE case. First, we investigated In this section, we compare the proposed method the performance of the log-likelihood density function with an existing method. We select an extreme case, values ψ as the error size increases. Using the local where the distribution of the data uncertainty is fully gradient-based robust method to estimate the param- accounted for. Consider the observed data Xobs x , x eters, we observed that ψ degrades at a much lower rate !( 1 2) to be generated by the following factor analysis model: than when using the nominal estimators, demonstrating that the robust method protects against worst-case sce- x xtrue ∆x and x b · xtrue ∆x , 1 ! + 1 2 ! + 2 narios. This behavior was confirmed by both the worst- where xtrue is normal distributed 1 0, 1 . The uncertain case performance and the average performance of ψ,as ( ) true ∆x1 and ∆x2 are independent of each other and of x well as for both cases, the ARE and DRE cases. and are also normal distributed but with variances < 1, Next, we compared the local and global robust al- respectively. This Xobs follows a bivariate nor- gorithms by computing the distance of the robust es- mal distribution with zero mean, variances of 1 and b2, timates to their “true” and unperturbed counterparts.

Figure 4. (Color online) Comparison of the Errors in (Left) µ and (Right) Σ for Uniformly Distributed Errors Bertsimas and Nohadani: Robust Maximum Likelihood Estimation INFORMS Journal on Computing, Articles in Advance, pp. 1–14, © 2019 INFORMS 11

Obviously, as the size of error increases, this distance the conclusions less dependable. Our goal is to grows for all methods. However, because this distance demonstrate that robust estimators are more accurate in grows linearly for the nominal method, we observe the presence of uncertainties. They can produce more a clear advantage using the robust methods. Because reliable guidelines that can be followed in practical set- the DRE case is less conservative, it shows a slightly tings and prevent undesirable deviations. better performance than the ARE. As we encounter In this section, we focus on radiation dose to tumor larger errors, the robust simulated annealing method structures and defer the discussion on other organs to finds the global robust optima and, thus, outperforms all a more dosimetric study. In clinical practice, spatial other local methods. This comes at a higher computa- dose distributions are evaluated with a cumulative tional cost. Independent of approach, we also studied dose–volume (DVH). It measures the por- the range of effectiveness for robust estimators, because tion of the volume (e.g., of tumor) that receives a certain for small-sized errors, the robust and nominal estimates fraction of the prescribed dose. Figure 5 illustrates two coincide, and for too large errors, any method fails. acceptable tumor DVHs. The ideal distribution is to We also probed the sensitivity of the estimators deliver 100% of the prescribed dose to the entire vol- with respect to the sources of errors. Using the local ume (dotted step function). Note that the DVH, by robust method, we compared two DRE cases: (a) when design, normalizes over anatomical sites and volume the errors follow a normal distribution and (b) when differences, enabling direct comparisons across pa- the errors follow a uniform distribution. We show that the tients. Guidelines are typically imposed as DVH con- proposed estimators remain robust even when the trol points, for example, for the dose Dx to a tumor. The errors follow a different distribution than the one we quantity Dx measures the percentage of the prescribed prepared for. dose that was received by x% of the volume and is Last, we compared the robust MLE method against typically implemented as a soft constraint during plan a fully parametric model of factor analysis. We observe optimization. Figure 5 also shows three such control that the semiparametric robust MLE method compares points. Despite international and institutional recom- well with factor analysis for estimating variances but mendations, the delivered values often differ from underperforms for covariances. This demonstrates that protocols (Nohadani et al. 2015). although robust maximum likelihood estimation of- As input data, we employ Dx control points of 491 fers advantages to estimates of uncertain input data, treatment plans that have already been delivered. it underperforms when the data-generating process is Specifically, the treated tumors are in the abdomen, known and can be leveraged. brain, bladder, breast, eyelid, esophagus, head and neck, liver, lung, pancreas, parotid gland, pelvis, 4. Cancer Radiation Therapy prostate, rectum, thyroid, tongue, tonsil, and vagina. Plan Evaluation This range serves to limit site-specific bias, and the Intensity-modulated radiation therapy has the advan- use of the DVH enables comparability. Note that all tage of delivering highly conformal dose distribu- treatments followed the same clinical protocols and tions to tumors of geometrically complex shapes. The were optimized with the same commercial software treatment-planning process involves iterative interac- tions between a commercial software product and a Figure 5. (Color online) Dose–Volume Histogram of Two group of planners, dosimetrists and medical physicists. Clinically Acceptable Treatment Plans for Tumor Therefore, the final product is the unpredictable outcome of a series of trial-and-error attempts at meeting com- peting clinical objectives.Toguidethedecisionmaking and to assure plan quality, institutional and interna- tional recommendations are followed (International Commission on Radiation Units and Measurements 2010). Even though these constraints are rigorously imposed, substantial deviations can occur (Das et al. 2008, Roy et al. 2013). These often occur because some of the guidelines cannot be followed because some of them are competing or infeasible in certain cases. In practice, these deviations are statistically analyzed for both reporting and process control purposes. For this, conventional statistical estimators (sample mean and covariances) are used. However, these estimators are Note. The dotted line is a guide for the eye for the ideal plan, and the sensitive to sample quality and sample size, rendering dashed lines mark three dose control points Dx. Bertsimas and Nohadani: Robust Maximum Likelihood Estimation 12 INFORMS Journal on Computing, Articles in Advance, pp. 1–14, © 2019 INFORMS

(Pinnacle 2012). Therefore, the remaining sources of displays sample dependence. In both experiments, the uncertainty are in the variations among the patients independence among patients justifies the assumption and the path of decision making that clinicians have of a normal distribution for the deviations from pro- taken to arrive at these final plans. We compute the tocol. Therefore, the corresponding uncertainty set can corresponding AREs for these plans. be modeled as in (4). Figure 6 (left) shows that for the three key clinical

4.1. Summary of Results control points, the spread of µrob.a. is stable and at the We evaluate DVH dose points D , D , D , D , and same level as the nominal estimator (at ρ 0) up to 100 98 95 50 ! D2 on the treated tumors. Specifically, we simulate ρ 2 Gy. Beyond this point, the performance de- a typical institutional analysis, where means and co- teriorates≤ (also observed in Section 3.2). However, with variances of Dx are reported. A report on a protocol (not regard to sample-size dependence, µnom exhibits a sizable ∆s data) is considered reliable when the estimator values sensitivity (seen by the difference |std µnom s ! ( )− remain stable for differing sets of data. In other words, std µnom | > 0), as opposed to µrob.a., which remains insensitivity to sample and sample size is given, when unchanged( ) (hence, not plotted). This observation applies estimators exhibit a narrow spread over the sampled to all DVH control points Dx. Furthermore, a deviation data (and size). In the following, we describe two ex- beyond 2 Gy is typically considered clinically unac- periments, probing the dependence on samples and on ceptable during planning. Therefore, it is justified to sample size. state that the stability of the robust estimator is superior To simulate semiannual report scenarios, we ran- over the clinically relevant range. domly divide the 491 five-dimensional plan data into The true advantage of the robust method becomes a subset of 50. For each set, we evaluate estimators for apparent when comparing higher estimators. Figure 6 µ and Σ. First, we compute the corresponding AREs (right panels) illustrates the coefficient of variation σ of 3,000 such samples. Next, the standard (nominal) (cv µ) that measures the normalized spread of the estimators on the same samples are calculated for covariances.! With regard to the sample dependence, the comparison. In the first experiment, an estimator is spread of Σ D , D and Σ D , D is com- rob.a.( 50 2) rob.a.( 95 50) considered superior when its value is not sizably af- parable to Σnom over all samples (see Figure 6, (d)and(f)). fected by the choice of the sample, that is, the estimator However, Σrob.a. D95, D2 ,whichmeasurestherelation spread is narrower. In the second experiment, only 80 across the full DVH( range) and is clinically most mean- randomly chosen samples (each having 50 plans) are ingful, exhibits a narrower spread over the entire region, considered, and the results are denoted by superscript “s.” as shown in Figure 6(e)), demonstrating the superi- Therefore, the difference fromthesubsampledestimator ority of Σrob.a. for this key metric. Note that only

Figure 6. (Color online) Stability of Estimators: Dependence on the Size of Uncertainty Set ρ

σ Note. The figure shows the of the mean (left) and coefficient of variation (cv µ) of the covariance matrix elements over the samples (right). ! Bertsimas and Nohadani: Robust Maximum Likelihood Estimation INFORMS Journal on Computing, Articles in Advance, pp. 1–14, © 2019 INFORMS 13

Σrob.a. D95, D2 (Figure 6(e)) is inferior for ρ 0.8, which the robust estimators are significantly more protected is expected,( as) the performance of any robust≤ quantity against errors than nominal ones. For small errors, the sets in beyond a specific uncertainty size, also demon- local and global robust estimators are comparable. strated in Figure 2 with computer-simulated data. When However, for larger errors, the global estimators out- probing the sample-size dependence, a sizable variation perform the local ones. Moreover, we show that the is observed for the nominal estimator. Furthermore, all proposed estimators remain stable even when errors matrix elements of Σrob.a. display a near constant spread follow a different distribution than assumed. over an extended range of uncertainty size ρ. Also in this In the second application, a large data set of cancer application, we observe a smooth transition between the radiation therapy plans that have been delivered serve nominal and the robust estimators, supporting the to probe the estimators in clinical decision making. We identifiability of estimators. show that whereas conventional estimators lead to heavily sample-dependent conclusions, the robust es- 4.2. Clinical Implications timators exhibit a narrow spread across samples. This The mean of dose points Dx is often used to track both sample independence allows for reliable decisions. planners and the institutional performance. It is also Given the independence of possible data structure and used to analyze outcome data (survival, toxicity, and the generic error models, we believe that our proposed pain levels), both for monitoring and for reporting. methods are directly applicable to a wide variety of Here, we presented three dose points for the sake of real-world maximum likelihood settings. exposition (the other two revealed qualitatively com- parable results). The covariance between dose points is Acknowledgments often recorded and analyzed to track the efficacy of the The authors thank Apostolos Fertis for insightful discussions guidelines. In fact, in a recent study using nominal and Indra Das for generously providing the clinical data. estimators on 100 head-and-neck cases, D95 was found to be negatively correlated to D50 (Nohadani et al. 2015, References Roy et al. 2016). This means that these two criteria, Ben-Tal A, El Ghaoui L, Nemirovski A (2009) Robust Optimization although recommended, cannot be satisfied concur- (Princeton University Press, Princeton, NJ). rently. Robust estimators are suited to providing more Berkson J (1950) Are there two regressions? J. Amer. Statist. Assoc. reliable recommendations. 45(250):164–180. In medical research and practice, statistical estima- Bertsimas D, Nohadani O (2010) Robust optimization with simulated tors are arguably among the most prevalent quanti- annealing. J. Global Optim. 48(2):323–334. Bertsimas D, Brown D, Caramanis C (2011) Theory and applications tative tools. These results show that taking errors into of robust optimization. SIAM Rev. 53(3):464–501. account with a robust approach can significantly ad- Bertsimas D, Nohadani O, Teo K (2010a) Nonconvex robust opti- vance the reliability of conclusions. Furthermore, in mization for problems with constraints. INFORMS J. Comput. clinical trials, large samples are required to control 22(1):44–58. errors. The presented robust approach can significantly Bertsimas D, Nohadani O, Teo K (2010b) Robust optimization for mediate the sample size and cost while ensuring the unconstrained simulation-based problems. Oper. Res. 58(1): 161–178. same, if not a better, level of reliability of results. Boyd S, Vandenberghe L (2004) Convex Optimization (Cambridge University Press, Cambridge, UK). 5. Conclusions Buonaccorsi JP (2010) Measurement Error; Models, Methods and Ap- In this work, we extend the method of maximum plications (CRC Press, Boca Raton, FL). likelihood estimation to also account for errors in input Calafiore G, El Ghaoui L (2001) Robust maximum likelihood esti- data, using the paradigm of robust optimization. We mation in the . Automatica 37(4):573–580. Cheng CL, Van Ness JW, et al. (1999) Statistical Regression with introduce two types of robust estimators to cope with Measurement Error (Oxford University Press, New York). adversarial and distributional errors. We show that Danskin JM (1966) The theory of max-min, with applications. adversarial robust estimators can be efficiently com- SIAM J. Appl. Math. 14(4):641–664. puted using a directional gradient–based algorithm. In Das I, Desrosiers C, Srivastava S, Chopra K, Khadivi K, Taylor M, the distributional case, we show that the inner infinite Zhao Q, Johnstone P (2008) Dosimetric variability in lung cancer IMRT/SBRT among institutions with identical treatment plan- dimensional optimization problem can be solved via a ning systems. Internat. J. Radiation Oncology Biol. Phys. 72(1): finite dimensional problem, constituting a special case 604–605. of the adversarial estimators. For multivariate normally Delage E, Ye Y (2010) Distributionally robust optimization under distributed errors, arising in many practical cases, we uncertainty with application to data-driven problems. develop local and global search algorithms to effi- Oper. Res. 58(3):595–612. ciently calculate the robust estimators. El Ghaoui L, Lebret H (1997) Robust solutions to least-squares prob- lems with uncertain data. SIAM J. Matrix Anal. Appl. 18(4): We demonstrate the performance of the robust es- 1035–1064. timators in two types of data sets. Computer-simulated Fuller WA (2009) Measurement Error Models, vol. 305 (John Wiley & data serve for a controlled experiment. We observe that Sons, Hoboken, NJ). Bertsimas and Nohadani: Robust Maximum Likelihood Estimation 14 INFORMS Journal on Computing, Articles in Advance, pp. 1–14, © 2019 INFORMS

Hampel FR (1974) The influence curve and its role in robust esti- Pfanzagl J (1994) Parametric (Walter de Gruyter, mation. J. Amer. Statist. Assoc. 69(346):383–393. Berlin). Huber PJ (1964) Robust estimation of a . Ann. Rendl F, Wolkowicz H (1997) A semidefinite framework for trust Math. Statist. 35(1):73–101. region subproblems with applications to large scale minimiza- Huber PJ (1980) Robust Statistics (John Wiley & Sons, Cambridge, MA). tion. Math. Programming 77(1):273–299. Huber PJ (1996) Robust Statistical Procedures (SIAM). Roy A, Das I, Srivastava S, Nohadani O (2013) Analysis of planner International Commission on Radiation Units and Measurements dependence in IMRT. Medical Phys. 40(6):260. (2010) Prescribing, recording and reporting photon-beam intensity- Roy A, Das IJ, Nohadani O (2016) On correlations in IMRT planning modulated radiation therapy (IMRT). ICRU Report-83, Inter- aims. J. Appl. Clinical Medical Phys. 17(6):44–59. national Commission on Radiation Units and Measurements Sra S, Nowozin S, Wright S (2012) Optimization for Machine Learning 10(1). (MIT Press, Cambridge, MA). Jaeckel LA (1971) Robust estimates of location: Symmetry and Ye Y (1992) A new complexity result on minimization of a quadratic asymmetric contamination. Ann. Math. Statist. 42(3):1020–1034. function with a sphere constraint. Floudas C, Pardalos P, eds. Nohadani O, Roy A, Das I (2015) Large-scale DVH quality study: Recent Advances in Global Optimization (Princeton University Press, Correlated aims lead relaxations. Medical Phys. 42(6):3457–3457. Princeton, NJ), 19–31.