Presented at the 2017 ICEAA Professional Development & Training Workshop www.iceaaonline.com/portland2017
Data Driven Confidence Regions for Cost Estimating Relationships
Christopher Jarvis USAF AFLCMC EGOL/FZC [email protected]
May 2, 2017
Abstract

Typically, cost estimating models contain many uncertain parameters which drive the model output results. To characterize the model output distributions, Monte Carlo methods are employed, which require an accurate description of model parameter distributions and dependencies. In this paper we review techniques to compute the parameter confidence regions for linear and nonlinear regression methods that can be used for this purpose. Finally, comments are made regarding current practical implementations and limitations.
1 Introduction
When building models in cost estimating, as in many other scientific fields, an analyst begins with observed data and an assumed model form that governs the relationship of the independent and dependent variables. These models contain parameters that often represent specific physical properties, and the model form of the equations can be dictated by underlying dynamical processes. The desire is to determine the values of the parameters associated with the assumed model that provide the best fit, in some measure, between the model predictions and the observed data.

Regression analysis, which is a ubiquitous tool for parameter estimation, identifies the parameters that minimize the sum of squared errors between the observed and predicted data points. In theory, there exist population parameters for the model type that determine all possible observations up to the random error terms. Since the observed data is only a subset of the total population, the computed parameters are reflective of the sample and are not necessarily the true population parameters. Thus, given a set of parameters, the secondary task is to quantify the uncertainty associated with the parameters computed from the sample data.

Furthermore, in cost estimating applications, a model that encompasses all costs for a program may contain many uncertain parameters. The uncertainty of the model outputs can be assessed using a Monte Carlo simulation to numerically sample the inputs and capture the output distributions. The Monte Carlo process and the accuracy of the results rely on quantification of the parameter distributions and the dependencies between them. Typically, the distributions for all parameters are one-dimensional and some measure of correlation is used to force pairwise relationships between parameter inputs. This process may not accurately capture the underlying multidimensional distribution of parameters and could consequently produce incorrect output distributions and, ultimately, misleading results.
The purpose of this paper is to review the methods for finding the parameter confidence regions for linear and nonlinear regression models. The methods will be applied to equations typical in cost estimating applications. To support that goal, in section 2 an overview of nonlinear regression is presented. Joint confidence region computations are developed in section 3. In section 4 the cost estimating relationship models are introduced, along with numerical results. Finally, in section 5 the results of this paper and some potential ideas for future work are summarized.
2 Review of Nonlinear Regression
To facilitate the development of the later equations, a review of nonlinear regression is presented using a matrix approach. For additional introduction and results see [1, 2]. The general form for all models considered here is

$$y_i = f(x_i, \beta) + \varepsilon_i \qquad (1)$$

where $x_i$ is a vector of variables from $\mathbb{R}^N$, $\beta = (\beta_0, \ldots, \beta_{p-1})^T$ is a vector of parameters from $\mathbb{R}^p$, and $f(x_i, \beta)$ is the function relating the input variables to the output data $y_i$ using the parameters $\beta$ and a random error term $\varepsilon_i$. When the function $f(x_i, \beta)$ is linear in $\beta$, the model is called linear. In some cases, the function is not linear in $\beta$ but can be made linear through a nonlinear transformation; applying a logarithm to both the input variables and possibly the output data is one common transformation. When the function is linear or transformably linear in the parameters, Ordinary Least Squares (OLS) regression is used to find the parameter values.

When the function $f(x_i, \beta)$ is nonlinear in $\beta$ or has no linearizing transformation, an approach other than OLS must be used to solve for the parameter vector $\beta$. The most common approach of Nonlinear Regression (NLR) involves iterated linearization. The linearization process, as the name suggests, creates a linear model that under the right circumstances closely approximates the nonlinear model around a specified point. The process is described below.

To find a least squares solution, recall the error equation is given by the difference between the observed data $y_i$ and the model predictions $\hat{y}_i$, referred to as the fitted data, which are based on the parameter values. Hence, the error equation can be thought of as a function of the independent variables and the parameter values given by
$$\varepsilon_i(x_i, \beta) = y_i - \hat{y}_i = y_i - f(x_i, \beta)$$

and the total sum of squared errors is
$$S(x_i, \beta) = \sum_{i=1}^{N} \left( \varepsilon_i(x_i, \beta) \right)^2 \qquad (2)$$

$$= \sum_{i=1}^{N} \left( y_i - f(x_i, \beta) \right)^2. \qquad (3)$$
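For later reference, the sum of squared errors in (2)-(3) translates directly into code. This is a minimal sketch assuming a generic model function f; it is included only to fix notation for the examples that follow:

```python
import numpy as np

def sse(f, x, y, beta):
    """Total sum of squared errors S(x, beta), eqs. (2)-(3)."""
    residuals = y - f(x, beta)          # epsilon_i = y_i - f(x_i, beta)
    return float(np.sum(residuals ** 2))
```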
Assuming that $f(x_i, \beta)$ is sufficiently regular, we can expand $f(x_i, \beta)$ about a point $\beta^{(k)}$ in the parameter space using a Taylor series expansion as

$$f(x_i, \beta) = f\left(x_i, \beta^{(k)}\right) + \sum_{j=0}^{p-1} \left. \frac{\partial f(x_i, \beta)}{\partial \beta_j} \right|_{\beta = \beta^{(k)}} \left( \beta_j - \beta_j^{(k)} \right) + LTE$$

where $LTE$ is the local truncation error associated with the first order Taylor series approximation. The term $\beta^{(k)}$ in this context represents an iterative approximation for the true parameters.
Using the Taylor series approximation above and neglecting the higher order terms, $f(x_i, \beta)$ can be locally approximated as a linear function in $\beta$, and (1) can be written using matrix operations as

$$y_i \approx f\left(x_i, \beta^{(k)}\right) + \sum_{j=0}^{p-1} \left. \frac{\partial f(x_i, \beta)}{\partial \beta_j} \right|_{\beta = \beta^{(k)}} \left( \beta_j - \beta_j^{(k)} \right) + \varepsilon_i$$

$$= f\left(x_i, \beta^{(k)}\right) + \left[ \frac{\partial f(x_i,\beta)}{\partial \beta_0} \; \frac{\partial f(x_i,\beta)}{\partial \beta_1} \; \cdots \; \frac{\partial f(x_i,\beta)}{\partial \beta_{p-1}} \right]_{\beta = \beta^{(k)}} \begin{pmatrix} \beta_0 - \beta_0^{(k)} \\ \beta_1 - \beta_1^{(k)} \\ \vdots \\ \beta_{p-1} - \beta_{p-1}^{(k)} \end{pmatrix} + \varepsilon_i. \qquad (4)$$
Stacking the equations for each of the data points $y_i$, we obtain the matrix system equation given by

$$Y = F^{(k)} + D^{(k)} \left( \beta - \beta^{(k)} \right) \qquad (5)$$

or equivalently

$$D^{(k)} \left( \beta - \beta^{(k)} \right) = Y - F^{(k)} \qquad (6)$$

where the error terms are included in the data vector $Y$, and the model approximation $F$ and the matrix $D$ of partial derivatives, called the Jacobian, are given by

$$F^{(k)} = \begin{pmatrix} f\left(x_1, \beta^{(k)}\right) \\ f\left(x_2, \beta^{(k)}\right) \\ \vdots \\ f\left(x_N, \beta^{(k)}\right) \end{pmatrix} \qquad D^{(k)} = \begin{pmatrix} \frac{\partial f(x_1,\beta)}{\partial \beta_0} & \frac{\partial f(x_1,\beta)}{\partial \beta_1} & \cdots & \frac{\partial f(x_1,\beta)}{\partial \beta_{p-1}} \\ \frac{\partial f(x_2,\beta)}{\partial \beta_0} & \frac{\partial f(x_2,\beta)}{\partial \beta_1} & \cdots & \frac{\partial f(x_2,\beta)}{\partial \beta_{p-1}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(x_N,\beta)}{\partial \beta_0} & \frac{\partial f(x_N,\beta)}{\partial \beta_1} & \cdots & \frac{\partial f(x_N,\beta)}{\partial \beta_{p-1}} \end{pmatrix}_{\beta = \beta^{(k)}}$$

When the number of data points is not equal to the number of parameters, i.e. $N \neq p$, the matrix $D$ is non-square and hence not invertible. When there are fewer data points than parameters ($N < p$), the system (5) does not have a unique parameter solution but rather an infinite number of solutions through a $(p - N)$-dimensional subspace of $\mathbb{R}^p$. When there are more data points than parameters ($N > p$), the system is overdetermined and has either a single solution (in the case of perfect data) or no solution. When there are equal numbers of distinct data points and parameters, the system containing the error is solved exactly. The least squares solution minimizes the sum of squared errors and gives the exact solution if one exists. For this reason, in practical applications it is typical to have more data points than parameters, with the hope that the least squares solution will provide a better approximation to the true population parameters by reducing the influence of the error.

A mathematical generalization of the matrix inverse is called the pseudoinverse. Given an $[n \times m]$ matrix $A$, the pseudoinverse of $A$ is defined by
$$A^{\dagger} = \left( A^T A \right)^{-1} A^T$$

where $A^T$ is the transpose of the matrix. Then, for the overdetermined system
$$Ax = z$$

the least squares solution using the pseudoinverse is given by

$$x = A^{\dagger} z. \qquad (7)$$
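As a quick numerical illustration (the matrix and data below are made up for this example), the pseudoinverse solution (7) can be computed directly and checked against a standard least squares solver:

```python
import numpy as np

# Overdetermined system: 5 data points, 2 parameters (intercept and slope).
A = np.column_stack([np.ones(5), np.arange(1.0, 6.0)])
z = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares solution via the pseudoinverse, eq. (7).
x_pinv = np.linalg.pinv(A) @ z

# Equivalent, numerically preferred route using a least squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, z, rcond=None)

assert np.allclose(x_pinv, x_lstsq)
print(x_pinv)  # approximately [0.14, 1.96]
```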
To understand how the pseudoinverse gives the least squares solution, note the residual (error) term for any vector $x$ is given by

$$\varepsilon = z - Ax$$

and the sum of squared errors is

$$\varepsilon^T \varepsilon = (z - Ax)^T (z - Ax).$$

The least squares solution minimizes the sum of squared errors, yielding a zero derivative,

$$0 = \frac{\partial}{\partial x} \left( \varepsilon^T \varepsilon \right) = -2 A^T (z - Ax). \qquad (8)$$

Rearranging (8) we have

$$A^T A x = A^T z$$

which is the matrix form of the normal equations. The term on the left side, $A^T A$, is a square matrix and, given at least $m$ unique data points, is invertible. The solution is

$$x = \left( A^T A \right)^{-1} A^T z$$

which agrees with (7). Using the pseudoinverse, the system linearized about the point $\beta^{(k)}$ in (6) can be solved for a correction vector

$$\delta^{(k)} = \left[ D^{(k)} \right]^{\dagger} \left( Y - F^{(k)} \right). \qquad (9)$$
Therefore, starting from an initial guess $\beta^{(k)}$, we find a new estimate $\beta^{(k+1)}$ by computing

$$\beta^{(k+1)} = \beta^{(k)} + \delta^{(k)}. \qquad (10)$$

This process is known as the Gauss-Newton step and is repeated until the termination criteria are met. A slight but common modification adds a scaling parameter $\gamma \in (0, 1]$ to the correction vector so that (10) becomes

$$\beta^{(k+1)} = \beta^{(k)} + \gamma \delta^{(k)} = \beta^{(k)} + \gamma \left[ D^{(k)} \right]^{\dagger} \left( Y - F^{(k)} \right). \qquad (11)$$
The correction vector is computed from the linearization around the point $\beta^{(k)}$, and taking a full correction step can overshoot the true solution or produce convergence issues. The scaling parameter allows for partial steps in the right direction and can increase the stability of the overall optimization process, potentially at the cost of additional iterations required to reach a given tolerance. When the problem is linear in the parameters, the Jacobian matrix is constant and the parameter vector can be found after one full step iteration. Here, the matrix $D$ is the traditional data (or design) matrix, referred to in the literature as $X$ [1]. Consequently, we have derived the well known result that the maximum likelihood unbiased estimator $b$ of the true parameter vector $\beta$ is given by

$$b = D^{\dagger} Y. \qquad (12)$$

Additionally, we reference [1] that for the linear case
$$b \sim N\left( \beta, \sigma^2 \left( D^T D \right)^{-1} \right). \qquad (13)$$
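As a concrete illustration of the iteration, the sketch below implements the damped Gauss-Newton update (11); the exponential model, data, starting point, and tolerances are illustrative assumptions introduced here, not taken from the paper:

```python
import numpy as np

def gauss_newton(f, jac, x, y, beta0, gamma=0.5, tol=1e-10, max_iter=200):
    """Damped Gauss-Newton iteration, eq. (11)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        F = f(x, beta)                       # model predictions F(k)
        D = jac(x, beta)                     # Jacobian D(k), shape (N, p)
        delta = np.linalg.pinv(D) @ (y - F)  # correction vector, eq. (9)
        beta = beta + gamma * delta          # damped step, eq. (11)
        if np.linalg.norm(delta) < tol:      # stop when the correction is small
            break
    return beta

# Illustrative nonlinear model y = b0 * exp(b1 * x), assumed for this sketch.
f = lambda x, b: b[0] * np.exp(b[1] * x)
jac = lambda x, b: np.column_stack([np.exp(b[1] * x),
                                    b[0] * x * np.exp(b[1] * x)])

x = np.linspace(0.0, 2.0, 10)
y = 2.0 * np.exp(0.8 * x)                    # noise-free data for a clean check
print(gauss_newton(f, jac, x, y, beta0=[1.0, 0.5]))  # approx [2.0, 0.8]
```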
3 Parameter Confidence Regions for Regression Problems
From (13) we have that the parameter estimates themselves are random variables with a distribution which depends on the error terms. What may not be as obvious is that the estimates also depend on the model and data sample points. Specifically, for the inverse of $D^T D$ to exist we require that $D$ have full column rank. Even if $D^T D$ is non-singular, the condition number may be large enough to yield untrustworthy results when computing the population parameter estimates. Ultimately, we are interested not only in the parameter values, but also in their uncertainty characterization, which is influenced by the data error and sampling.

For a single parameter of interest, confidence intervals are typically used to describe the uncertainty of the parameter estimate. These can be seen in a parameter margin of error, e.g. $\pm 3\%$, or in a range, e.g. $\mu \in [7.4, 7.6]$. When considering multiple parameters, a one-dimensional interval is no longer appropriate. The range of reasonable values is a multidimensional set; for linear problems, an ellipse for 2 parameters or a hyperellipsoid for more than 2 parameters. Simply stated, for a given confidence coefficient $\alpha$, the confidence region for a parameter $\theta$ is the set $CR \subset \mathbb{R}^p$ such that
$$\Pr(\theta \in CR) = 1 - \alpha. \qquad (14)$$
As stated above, this set $CR$ depends on the model characteristics and the particular sample. The true population parameter vector is a constant and, for any confidence region, is either contained in the region or not. So the confidence region must be interpreted as a statement about the process: the procedure and sample produce a region that contains the true population parameter with probability $(1 - \alpha)$. For a single sample this can be restated to say that the confidence region is the set of all values such that, if any were the true population parameter, it would not be statistically different at the $\alpha$ level of confidence [3]. More formally, the confidence region is defined [1, 2] as the set of all $\tilde{\beta}$ such that
$$S\left(x_i, \tilde{\beta}\right) \leq S(x_i, b) \left[ 1 + \frac{p}{N-p} F(p, N-p, 1-\alpha) \right] \qquad (15)$$

where $S(x_i, \tilde{\beta})$ is defined as in (2) and $F(p, N-p, 1-\alpha)$ is the $(1-\alpha)$ quantile of the Fisher ($F$) distribution with $p$ and $N-p$ degrees of freedom. After some rearranging, (15) can be expressed as
$$S\left(x_i, \tilde{\beta}\right) - S(x_i, b) \leq p s^2 F(p, N-p, 1-\alpha) \qquad (16)$$
where we have recognized the term $s^2 = S(x_i, b)/(N - p)$ as the mean squared error of the estimate. From these definitions, determining the boundaries of the confidence region may require evaluating the model at a potentially significant number of points in the parameter space. In the case of a linear model, by taking advantage of the constant Jacobian, we can determine the confidence region after solving the linear regression without the need for sampling. Given the parameter estimate $b$ from the linear regression, the evaluated model vector $F(x, b)$ and the matrix $D$, using the Taylor series expansion, the model evaluated at any other parameter vector $\tilde{\beta}$ is

$$F\left(x, \tilde{\beta}\right) = F(x, b) + D\left(\tilde{\beta} - b\right).$$
From this it can be shown that
$$S\left(x_i, \tilde{\beta}\right) - S(x_i, b) = \left(\tilde{\beta} - b\right)^T D^T D \left(\tilde{\beta} - b\right)$$
and combining with (16) we obtain the result

$$\left(\tilde{\beta} - b\right)^T D^T D \left(\tilde{\beta} - b\right) \leq p s^2 F(p, N-p, 1-\alpha). \qquad (17)$$
From this definition, given the regression estimate $b$, the Jacobian matrix, and $s^2$, the complete confidence region can be computed without the need for evaluating the model at additional points. The linear approximation to the confidence region for nonlinear problems can be found using (17), where the Jacobian matrix $D$ is fixed after evaluating at $\beta = b$. If the nonlinear problem is only "slightly" nonlinear, i.e. the Jacobian does not change significantly over the parameter space, then the linear approximation may provide a good approximation of the true confidence region. In the next section, the confidence regions and linear approximations will be compared for equations that are common in cost estimating. If the linear approximations are good approximations to the true parameter regions, then they could be suitable as input distributions for Monte Carlo simulations.
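To illustrate how these conditions might be evaluated in practice, the following sketch tests membership in the region via (16) and traces the boundary of the linear-approximation ellipse (17). It is a minimal Python/SciPy implementation; the function names, and the restriction of the boundary trace to $p = 2$, are choices made for this example rather than prescriptions from the paper:

```python
import numpy as np
from scipy import stats

def in_confidence_region(S, beta_tilde, b, N, p, alpha=0.05):
    """Test eq. (16): S is a callable returning the sum of squared errors."""
    s2 = S(b) / (N - p)                          # mean squared error s^2
    f_crit = stats.f.ppf(1.0 - alpha, p, N - p)  # F(p, N-p, 1-alpha)
    return S(beta_tilde) - S(b) <= p * s2 * f_crit

def linear_region_boundary(b, D, s2, N, alpha=0.05, n_pts=200):
    """Trace the boundary ellipse of (17) for a two-parameter model (p = 2)."""
    p = 2
    f_crit = stats.f.ppf(1.0 - alpha, p, N - p)
    c = p * s2 * f_crit                          # right-hand side of (17)
    # Eigen-decomposition of D'D gives the ellipse axes and orientation.
    w, V = np.linalg.eigh(D.T @ D)
    t = np.linspace(0.0, 2.0 * np.pi, n_pts)
    circle = np.stack([np.cos(t), np.sin(t)])
    # Map the unit circle onto the set where the quadratic form equals c.
    return b[:, None] + V @ (np.sqrt(c / w)[:, None] * circle)
```

For a nonlinear CER, `in_confidence_region` can be applied over a grid of candidate $\tilde{\beta}$ values to map out the exact region, while `linear_region_boundary` gives the linear-approximation ellipse for comparison.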
4 Cost Estimating Relationships
In this section some typical Cost Estimating Relationship (CER) models are described, and the joint confidence regions are computed using sample data.
4.1 Learning Curves

4.1.1 Model Description

The theory of learning, which explains the cost and quantity trends observed, is primarily attributed to two researchers, T. P. Wright [4] and J. R. Crawford. The basic learning model assumes that as the number of units produced increases, the cost (in hours or dollars) to produce those units decreases. As a simple example, consider the reduction in time required to manufacture a unit as the worker becomes familiar with the instructions. As the worker produces more and gains experience, the labor time per unit decreases and the unit cost decreases. According to both Wright and Crawford, as the total number of units doubles the cost decreases by a fixed percentage. While Wright and Crawford both developed the same model equation, their interpretations of the types of cost have spawned two learning theories, Unit and Cumulative Average (CUMAV). Here we will focus only on the Unit theory. In this theory the mathematical formulation for a learning curve based model has the form

$$y = T_1 x^{\log_2(LC)} \qquad (18)$$

where $T_1$ is defined as the cost of a theoretical first unit and $LC$ is the percentage reduction that occurs when the number of units doubles, referred to as the learning curve slope. This formulation is a two parameter model depending on the $T_1$ and $LC$ values to uniquely define the predictions for all production units $x$. Representative slopes can range from 0.75 for manual labor intensive or complex manufacturing processes to 0.95 for simple or automated processes. In many cases, units are procured as a lot and costs are not tracked on an individual basis; rather, only the total cost of the lot is known. In this case the average cost for a production lot $i$ with first unit $F_i$ and last unit $L_i$ is given by
$$\bar{y}_i = \frac{1}{L_i - (F_i - 1)} \sum_{k=F_i}^{L_i} y_k = \frac{T_1}{L_i - (F_i - 1)} \sum_{k=F_i}^{L_i} x_k^{\log_2(LC)}. \qquad (19)$$
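To ground the notation, here is a minimal sketch of the unit cost equation (18) and the lot average cost (19); the $T_1$ and $LC$ values are illustrative assumptions, not data from the paper:

```python
import numpy as np

def unit_cost(x, T1, LC):
    """Unit theory learning curve, eq. (18): cost of production unit x."""
    return T1 * x ** np.log2(LC)

def lot_average_cost(F, L, T1, LC):
    """Average unit cost of a lot spanning units F through L, eq. (19)."""
    units = np.arange(F, L + 1)
    return unit_cost(units, T1, LC).mean()

# Illustrative values: T1 = 100 hours, 80% learning curve slope.
print(unit_cost(np.array([1.0, 2.0, 4.0, 8.0]), 100.0, 0.80))  # [100. 80. 64. 51.2]
print(lot_average_cost(F=11, L=20, T1=100.0, LC=0.80))
```

Note that each doubling of the unit number multiplies the cost by $LC$, matching the fixed-percentage reduction described above.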
In order to avoid the computation of the sum, a potentially non-integer production unit $\tilde{x}_i$, called the lot midpoint for lot $i$, is defined such that the unit cost of the lot midpoint unit is the same as the average unit cost of the lot. Once $\tilde{x}_i$ is found, the average unit cost of a lot is given as

$$\bar{y}_i = \frac{1}{L_i - (F_i - 1)} \sum_{k=F_i}^{L_i} y_k = T_1 \tilde{x}_i^{\log_2(LC)}. \qquad (20)$$
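Setting (19) equal to (20) and solving shows the exact midpoint is $\tilde{x}_i = \big( \frac{1}{n} \sum_{k} x_k^{\log_2(LC)} \big)^{1/\log_2(LC)}$, where $n$ is the lot size. A short sketch of this computation (the function name is introduced here for illustration):

```python
import numpy as np

def lot_midpoint_exact(F, L, LC):
    """Exact lot midpoint: the unit whose cost equals the lot's average unit cost."""
    b = np.log2(LC)
    units = np.arange(F, L + 1)
    return (units ** b).mean() ** (1.0 / b)

print(lot_midpoint_exact(F=11, L=20, LC=0.80))  # slightly below the arithmetic midpoint 15.5
```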
To derive what the $\tilde{x}_i$ value should be, note that one approximation of the summation can be derived using calculus:

$$\sum_{k=F_i}^{L_i} x_k^{\log_2(LC)} \approx \int_{F_i - \frac{1}{2}}^{L_i + \frac{1}{2}} x^{\log_2(LC)} \, dx = \left. \frac{x^{\log_2(LC)+1}}{\log_2(LC) + 1} \right|_{F_i - \frac{1}{2}}^{L_i + \frac{1}{2}} = \frac{\left( L_i + \frac{1}{2} \right)^{\log_2(LC)+1} - \left( F_i - \frac{1}{2} \right)^{\log_2(LC)+1}}{\log_2(LC) + 1}. \qquad (21)$$
Combining the above with the average unit cost equation we obtain
$$\frac{1}{L_i - (F_i - 1)} \sum_{k=F_i}^{L_i} y_k = \frac{T_1}{L_i - (F_i - 1)} \sum_{k=F_i}^{L_i} x_k^{\log_2(LC)}$$
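The quality of the integral approximation (21) is easy to check numerically; the sketch below compares the exact lot sum against the closed form for an illustrative lot and slope (values assumed for this example):

```python
import numpy as np

def lot_sum_exact(F, L, LC):
    """Left side of (21): exact sum of x^log2(LC) over the lot."""
    b = np.log2(LC)
    return (np.arange(F, L + 1) ** b).sum()

def lot_sum_approx(F, L, LC):
    """Right side of (21): integral approximation of the lot sum."""
    b1 = np.log2(LC) + 1.0
    return ((L + 0.5) ** b1 - (F - 0.5) ** b1) / b1

# Illustrative lot: units 11 through 20 on an 80% slope.
print(lot_sum_exact(11, 20, 0.80), lot_sum_approx(11, 20, 0.80))  # nearly equal
```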