The Statistical-Model Toolbox: A SAS* Macro System

David S. Frankel, Exxon Company, U.S.A.

Abstract

The statistical-model toolbox (SMT) is a SAS macro system written in the Production Department of Exxon Company, U.S.A., that provides two powerful capabilities: a systematic way to model scattered data and to model calculated results that are based on the scattered data; and, a way to create and manipulate synthetic probability distributions in the absence of measured data. The Production Department uses these capabilities to address problems in petroleum reservoir description, where rock properties are stochastic by nature. However, the tools are completely general and can be applied to any continuous, numeric, random variables. The most frequently used tool calculates expected values of arbitrary functions of one or two random variables. Other tools display distributions, calculate statistics, and generate random samples. Because it is organized as a set of tools, the SMT provides sufficient flexibility to handle a wide range of applications. A minimum of programming expertise and effort are required to use the tools, and the SMT verifies all input specifications to provide a high level of fault tolerance.

Background

Geologists and petroleum engineers who attempt to quantitatively describe petroleum reservoirs must deal with problems similar to those in other disciplines: predicting values of one variable based on the measured value of a second variable; and, evaluating functions of the predicted variable.

The prediction problem is usually handled using regression techniques, most often ordinary least squares, which predicts the most likely value. Once regression has been completed, the raw data are discarded, leaving the regression curve as the sole model of the data. Functions of the predicted variable are evaluated by substituting unique values as calculated from the equation of the regression curve.

Figure 1 depicts this conventional approach, which has advantages and drawbacks.

The advantages are that it is easy to understand, easy to automate, and easy to display. Most people feel comfortable using this approach.

Its main drawback is that the regression equation by itself is an incomplete model of the population; it only predicts the central tendency of the data. This is not a serious drawback if the objective is to predict the expected value of the variable or of a linear function of the variable.

However, if the objective is to estimate the expected value of a nonlinear function of the predicted variable and if the scatter in the sample is significant, the conventional approach can lead to significant errors. In this case, it is preferable to model the population in terms of conditional probability-density functions (PDF's) that are determined by central tendency ("location") and by variance ("scale"). PDF's are also referred to as distributions or statistical models.

Figure 2 depicts this statistical-model approach. In a procedure analogous to regression, the location and scale parameters are estimated for normal (Gaussian) PDF's. The expected value of any function of the variable is calculated by integrating the product of the function and the PDF over the entire range of the variable. Computationally, this is simply the probability-weighted average value of the function.

The statistical-model approach is especially valuable in petrophysics because key physical variables are lognormally distributed. That is, in the conventional approach, the most likely value of the logarithm of the variable is predicted. It is not adequate to directly substitute the antilog of this prediction into expressions for functions that depend on the variable.

The statistical-model method also has advantages and drawbacks. Its main advantages are that it provides a more complete description and more accurate expected values of arbitrary functions of the variable compared with the conventional approach. This is to be expected because it retains more information about the sample.

Its drawbacks are that it assumes the PDF's are normal, that these normal PDF's are unbounded in the range of values accessible to the variable, and that the method is difficult to understand and implement.

The assumption that samples of random-variable populations approximately conform to normal PDF's is generally a good one. In most cases where the assumption is not adequate, a simple transformation can be found to render the PDF's of the transformed variable approximately normal. In the rare case where even this is not possible, the statistical-model approach could be handled in terms of a specific non-normal type of probability density function.

The SAS macro system called the Statistical-Model Toolbox (SMT) successfully overcomes the other two drawbacks. The SMT does provide for truncation of the range of the random variable, and it makes the method simple to implement. The user need not be a statistician, although he must reorient his mental image of the process from predicting unique values to integrating over PDF's.

The Statistical-Model Toolbox (SMT)

The SMT is a system for managing univariate PDF's for continuous, numeric, random variables. The SMT is a set of SAS version 5.16 name-style macros that uses only keyword parameters. The macros were written for IBM/OS and are normally invoked in the background environment. No user-interface has yet been provided to construct the macro invocations, but a high degree of fault tolerance is provided by input verification utilities within the macro system.

Figure 3 is a conceptual flowchart for the SMT. The main elements of the system are: macros that make PDF's; statistical-model datasets (SMDS's), which serve as databases from which PDF's can be extracted and applied; and, application macros that use extracted PDF's to perform specific tasks.

The SMT also includes auxiliary macros that are used at either end of this logical flowchart, to aid in the description of samples, or to look up answers in output datasets created by the application macros.

Figure 4 is a detailed flowchart that shows the actual macros. The following discussion summarizes key macros and concepts.

Making Models

Models (PDF's) can be made by performing parameter estimation on samples or by creating hypothetical models in the absence of data. Both classes of models accommodate double truncation, single truncation, or no truncation in the permissible range of the random variable. This is an important consideration when dealing with physical variables or with constructed variables (e.g., ratios and percentages) whose values must be confined to known intervals.

The PARMEST macro estimates parameters for normal PDF's by applying maximum likelihood estimation to samples. Both truncated and censored samples can be handled. A truncated sample has no values in a certain range (usually due to measurement or reporting problems) although these values are known to exist in the population. Censored samples contain inexact observations, which are effectively error bars rather than exact data. PARMEST can derive a marginal PDF from a univariate sample, or it can derive a class of conditional PDF's from a bivariate sample. For the latter case, either a linear or quadratic relation between the regressor and the location of the conditional PDF may be chosen, and the location may optionally be restricted to match a specified value.

PARMEST embodies proprietary parameter estimation techniques; it is not just a shell that calls SAS statistics procedures.

... grid-cells into trapezoidal PDF's that approximate the distribution of elevations above a datum in each cell. These PDF's are useful in integrating functions that depend on depth.

GENNORM and GENLNORM build hypothetical PDF's (marginal or conditional) using specified statistics for normal and lognormal populations, respectively. The statistics may be percentiles, the arithmetic mean, or (for GENLNORM) any arbitrary power-mean (e.g., harmonic mean, geometric mean, root-mean-square, etc.).

The GEN macro makes PDF's using primitive model parameters. Whereas PARMEST, GENNORM, and GENLNORM make normal (Gaussian) PDF's, GEN can also make uniform (rectangular), generalized triangular (i.e., symmetric or asymmetric), trapezoidal, and spike (delta-function) PDF's. These non-normal PDF's may be used on an equal basis with normal PDF's that are derived by parameter estimation.

Statistical-Model Datasets (SMDS's)

The core of the SMT is one or more statistical-model datasets (SMDS). An SMDS is a SAS dataset (normally permanent) that contains in canonical form all of the essential details about one or more statistical models. The user need not be concerned with the detailed contents of an SMDS. It is sufficient to know that PDF's can (and must) be extracted from an SMDS to be used by an application macro.

Each observation in an SMDS represents a marginal PDF or a class of conditional PDF's. Stored details include: the type (i.e., one of the five available standard forms); the location parameter or an equation to generate it; the scale parameter; the shape parameter (triangular and trapezoidal types only); lower and upper truncation limits; and a parameter covariance matrix if the PDF resulted from parameter estimation.

One detail not stored by an SMDS is what transformation, if any, was applied to the sample (or implied for the population) before capturing its description in the SMDS. The user must keep track of this transformation and be able to invert it.

An SMDS can be the target of two macros: the QUERY macro, which lists relevant details to allow the user to identify the contents of the SMDS; and, the EXTRACT macro, which the user invokes to prepare a specific univariate model for subsequent use by an application macro.

Application Macros

The application macros do calculations or make plots for either functions of one or two random variables or for the PDF itself.
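The expected-value calculation that the paper describes (integrating the product of a function and a PDF over the range of the variable) can be illustrated with a short sketch. The Python code below is not part of the SMT and does not reproduce any SMT macro. The names `normal_pdf` and `expected_value`, the location and scale values, the truncation limit, and the choice of trapezoidal integration are all assumptions made for illustration. It treats a hypothetical lognormal variable k whose logarithm is normal, and compares the antilog of the predicted logarithm with the probability-weighted average of k.

```python
import math

def normal_pdf(x, loc, scale):
    """Density of a normal PDF with the given location and scale."""
    z = (x - loc) / scale
    return math.exp(-0.5 * z * z) / (scale * math.sqrt(2.0 * math.pi))

def expected_value(func, loc, scale, lower=None, upper=None, n=20001):
    """Probability-weighted average of func over a normal PDF.

    Optional lower/upper limits truncate the range of the variable.
    Dividing by the integrated density renormalizes the truncated PDF.
    Untruncated sides are cut off 10 scale units from the location
    (an assumption that is adequate for the smooth functions used here).
    """
    lo = loc - 10.0 * scale if lower is None else lower
    hi = loc + 10.0 * scale if upper is None else upper
    step = (hi - lo) / (n - 1)
    weighted_sum = 0.0
    density_sum = 0.0
    for i in range(n):
        x = lo + i * step
        end_weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoidal rule
        p = end_weight * normal_pdf(x, loc, scale)
        weighted_sum += func(x) * p
        density_sum += p
    return weighted_sum / density_sum

# Hypothetical model: y = ln(k) is normal with these illustrative parameters.
LOC, SCALE = 2.0, 1.0

# Conventional approach: substitute the antilog of the predicted logarithm.
naive = math.exp(LOC)

# Statistical-model approach: integrate exp(y) over the PDF of y.
expected = expected_value(math.exp, LOC, SCALE)

# Same calculation with the range of y truncated above at 3.0.
truncated = expected_value(math.exp, LOC, SCALE, upper=3.0)

print(f"antilog of location:      {naive:.3f}")
print(f"expected value of k:      {expected:.3f}")
print(f"expected value, y <= 3.0: {truncated:.3f}")
```

For an untruncated lognormal variable the expected value has the closed form exp(loc + scale^2/2), so with these illustrative parameters the integral should approach exp(2.5), about 12.18, while the antilog substitution gives exp(2), about 7.39. The gap between the two is the error the paper attributes to the conventional approach when scatter is significant and the function is nonlinear.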