The Statistical-Model Toolbox: A SAS* Macro System
David S. Frankel, Exxon Company, U.S.A.

Abstract

The statistical-model toolbox (SMT) is a SAS macro system written in the Production Department of Exxon Company, U.S.A., that provides two powerful capabilities: a systematic way to model scattered data and to model calculated results that are based on the scattered data; and a way to create and manipulate synthetic probability distributions in the absence of measured data. The Production Department uses these capabilities to address problems in petroleum reservoir description, where rock properties are stochastic by nature. However, the tools are completely general and can be applied to any continuous, numeric, random variables. The most frequently used tool calculates expected values of arbitrary functions of one or two random variables. Other tools display distributions, calculate statistics, and generate random samples. Because it is organized as a set of tools, the SMT provides sufficient flexibility to handle a wide range of applications. A minimum of programming expertise and effort is required to use the tools, and the SMT verifies all input specifications to provide a high level of fault tolerance.

Background

Geologists and petroleum engineers who attempt to quantitatively describe petroleum reservoirs must deal with problems similar to those in other disciplines: predicting values of one variable based on the measured value of a second variable; and evaluating functions of the predicted variable.

The prediction problem is usually handled using regression techniques, most often ordinary least squares, which predicts the most likely value. Once regression has been completed, the raw data are discarded, leaving the regression curve as the sole model of the data. Functions of the predicted variable are evaluated by substituting unique values as calculated from the equation of the regression curve.

Figure 1 depicts this conventional approach, which has advantages and drawbacks.

The advantages are that it is easy to understand, easy to automate, and easy to display. Most people feel comfortable using this approach.

Its main drawback is that the regression equation by itself is an incomplete model of the population; it only predicts the central tendency of the data. This is not a serious drawback if the objective is to predict the expected value of the variable or of a linear function of the variable.

However, if the objective is to estimate the expected value of a nonlinear function of the predicted variable and if the scatter in the sample is significant, the conventional approach can lead to significant errors. In this case, it is preferable to model the population in terms of conditional probability-density functions (PDF's) that are determined by central tendency ("location") and by variance ("scale"). PDF's are also referred to as distributions or statistical models.

Figure 2 depicts this statistical-model approach. In a procedure analogous to regression, the location and scale parameters are estimated for normal (Gaussian) PDF's. The expected value of any function of the variable is calculated by integrating the product of the function and the PDF over the entire range of the variable. Computationally, this is simply the probability-weighted average value of the function.

The statistical-model approach is especially valuable in petrophysics because key physical variables are lognormally distributed. That is, in the conventional approach, the most likely value of the logarithm of the variable is predicted. It is not adequate to directly substitute the antilog of this prediction into expressions for functions that depend on the variable.

The statistical-model method also has advantages and drawbacks. Its main advantages are that it provides a more complete description and more accurate expected values of arbitrary functions of the variable compared with the conventional approach. This is to be expected because it retains more information about the sample.

Its drawbacks are that it assumes the PDF's are normal, that these normal PDF's are unbounded in the range of values accessible to the variable, and that the method is difficult to understand and implement.

The assumption that samples of random-variable populations approximately conform to normal PDF's is generally a good one. In most cases where the assumption is not adequate, a simple transformation can be found to render the PDF's of the transformed variable approximately normal. In the rare case where even this is not possible, the statistical-model approach could be handled in terms of a specific non-normal type of probability density function.

The SAS macro system called the Statistical-Model Toolbox (SMT) successfully overcomes the other two drawbacks. The SMT does provide for truncation of the range of the random variable, and it makes the method simple to implement. The user need not be a statistician, although he must reorient his mental image of the process from predicting unique values to integrating over PDF's.
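
The gap between substituting the antilog of a predicted logarithm and taking a probability-weighted average can be illustrated with a short sketch. It is written in Python rather than SAS, it is not the SMT's code, and every name and number in it (`normal_pdf`, `expected_value`, the location 2.0 and scale 0.5) is a hypothetical choice made only for illustration. It assumes the base-10 logarithm of the variable is normally distributed and untruncated.

```python
import math

def normal_pdf(z, mu, sigma):
    """Normal (Gaussian) density with location mu and scale sigma."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def expected_value(func, mu, sigma, n_steps=4000, half_width=8.0):
    """Probability-weighted average of func(z) for z ~ Normal(mu, sigma).

    Trapezoidal rule over mu +/- half_width * sigma (an assumed, generous range).
    """
    lo = mu - half_width * sigma
    step = 2.0 * half_width * sigma / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        z = lo + i * step
        end_weight = 0.5 if i in (0, n_steps) else 1.0
        total += end_weight * func(z) * normal_pdf(z, mu, sigma)
    return total * step

def antilog(z):
    """The nonlinear function of interest: the variable itself, given its log10."""
    return 10.0 ** z

# Hypothetical model: log10 of a rock property has location 2.0 and scale 0.5.
mu, sigma = 2.0, 0.5

naive = antilog(mu)                            # conventional: antilog of the prediction
weighted = expected_value(antilog, mu, sigma)  # statistical-model: integrate over the PDF
# Closed form for the mean of a lognormal variable, used here only as a check.
closed_form = 10.0 ** mu * math.exp(0.5 * (sigma * math.log(10.0)) ** 2)

print(naive, weighted, closed_form)
```

With these made-up numbers the probability-weighted value comes out nearly twice the naive antilog, which is the kind of error the Background section warns about when scatter is significant.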

The Statistical-Model Toolbox (SMT)

The SMT is a system for managing univariate PDF's for continuous, numeric, random variables. The SMT is a set of SAS version 5.16 name-style macros that uses only keyword parameters. The macros were written for IBM/OS and are normally invoked in the background environment. No user-interface has yet been provided to construct the macro invocations, but a high degree of fault tolerance is provided by input verification utilities within the macro system.

Figure 3 is a conceptual flowchart for the SMT. The main elements of the system are: macros that make PDF's; statistical-model datasets (SMDS's), which serve as databases from which PDF's can be extracted and applied; and application macros that use extracted PDF's to perform specific tasks.

The SMT also includes auxiliary macros that are used at either end of this logical flowchart, to aid in the description of samples, or to look up answers in output datasets created by the application macros.

Figure 4 is a detailed flowchart that shows the actual macros. The following discussion summarizes key macros and concepts.

Making Models

Models (PDF's) can be made by performing parameter estimation on samples or by creating hypothetical models in the absence of data. Both classes of models accommodate double truncation, single truncation, or no truncation in the permissible range of the random variable. This is an important consideration when dealing with physical variables or with constructed variables (e.g., ratios and percentages) whose values must be confined to known intervals.

The PARMEST macro estimates parameters for normal PDF's by applying maximum likelihood estimation to samples. Both truncated and censored samples can be handled. A truncated sample has no values in a certain range (usually due to measurement or reporting problems) although these values are known to exist in the population. Censored samples contain inexact observations, which are effectively error bars rather than exact data. PARMEST can derive a marginal PDF from a univariate sample, or it can derive a class of conditional PDF's from a bivariate sample. For the latter case, either a linear or quadratic relation between the regressor and the location of the conditional PDF may be chosen, and the location may optionally be restricted to match a specified value.

PARMEST embodies proprietary parameter estimation techniques; it is not just a shell that calls SAS statistics procedures.

The DEPTH macro is in a sense a parameter-estimation tool. It converts the spatial coordinates of geological-model grid-cells into trapezoidal PDF's that approximate the distribution of elevations above a datum in each cell. These PDF's are useful in integrating functions that depend on depth.

GENNORM and GENLNORM build hypothetical PDF's (marginal or conditional) using specified statistics for normal and lognormal populations, respectively. The statistics may be percentiles, the arithmetic mean, or (for GENLNORM) any arbitrary power-mean (e.g., harmonic mean, geometric mean, root-mean-square, etc.).

The GEN macro makes PDF's using primitive model parameters. Whereas PARMEST, GENNORM, and GENLNORM make normal (Gaussian) PDF's, GEN can also make uniform (rectangular), generalized triangular (i.e., symmetric or asymmetric), trapezoidal, and spike (delta-function) PDF's. These non-normal PDF's may be used on an equal basis with normal PDF's that are derived by parameter estimation.

Statistical-Model Datasets (SMDS's)

The core of the SMT is one or more statistical-model datasets (SMDS). An SMDS is a SAS dataset (normally permanent) that contains in canonical form all of the essential details about one or more statistical models. The user need not be concerned with the detailed contents of an SMDS. It is sufficient to know that PDF's can (and must) be extracted from an SMDS to be used by an application macro.

Each observation in an SMDS represents a marginal PDF or a class of conditional PDF's. Stored details include: the type (i.e., one of the five available standard forms); the location parameter or an equation to generate it; the scale parameter; the shape parameter (triangular and trapezoidal types only); lower and upper truncation limits; and a parameter covariance matrix if the PDF resulted from parameter estimation.

One detail not stored by an SMDS is what transformation, if any, was applied to the sample (or implied for the population) before capturing its description in the SMDS. The user must keep track of this transformation and be able to invert it.

An SMDS can be the target of two macros: the QUERY macro, which lists relevant details to allow the user to identify the contents of the SMDS; and the EXTRACT macro, which the user invokes to prepare a specific univariate model for subsequent use by an application macro.

Application Macros

The application macros do calculations or make plots for either functions of one or two random variables or for the PDF itself. Two univariate PDF's may be combined to form a bivariate joint PDF for the two random variables.
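
To make the idea of combining two univariate PDF's concrete, the following Python sketch draws a synthetic sample from a bivariate joint PDF built from a marginal normal PDF and a conditional normal PDF whose location is a linear function of the first variable. This is only an illustration under stated assumptions: the function name `sample_joint`, its parameters, and the numbers are hypothetical, truncation is ignored, and nothing here reflects how the SMT macros are actually coded.

```python
import random

def sample_joint(n, outer_mu, outer_sigma, intercept, slope, inner_sigma, seed=0):
    """Draw n (x, y) pairs from an assumed joint model.

    x ~ Normal(outer_mu, outer_sigma)                  (marginal PDF)
    y | x ~ Normal(intercept + slope * x, inner_sigma) (conditional PDF)
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(outer_mu, outer_sigma)
        y = rng.gauss(intercept + slope * x, inner_sigma)
        pairs.append((x, y))
    return pairs

# Hypothetical parameters chosen only for illustration.
pairs = sample_joint(20000, outer_mu=10.0, outer_sigma=2.0,
                     intercept=1.0, slope=0.5, inner_sigma=0.3, seed=1)

mean_x = sum(x for x, _ in pairs) / len(pairs)
mean_y = sum(y for _, y in pairs) / len(pairs)

# Least-squares slope of y on x; it should be close to the slope used above.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
slope_estimate = cov_xy / var_x

print(mean_y, slope_estimate)
```

The fitted slope recovers the linear relation that ties the location of the conditional PDF to the other variable, while the scatter around that line is carried by the scale of the conditional PDF.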

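
The hypothetical lognormal models described under Making Models can also be sketched briefly. The Python fragment below derives the location and scale of the logarithm from two specified percentiles and then evaluates power means analytically. The helper names (`lognormal_from_percentiles`, `power_mean`) and the percentile values are assumptions for illustration, the model is untruncated, and the paper does not say how GENLNORM performs the equivalent step.

```python
import math
from statistics import NormalDist

def lognormal_from_percentiles(p_lo, x_lo, p_hi, x_hi):
    """Return (mu, sigma) of ln(X) from two percentiles of X.

    p_lo and p_hi are cumulative probabilities in (0, 1); x_lo and x_hi are
    the corresponding positive values of X.
    """
    z_lo = NormalDist().inv_cdf(p_lo)
    z_hi = NormalDist().inv_cdf(p_hi)
    sigma = (math.log(x_hi) - math.log(x_lo)) / (z_hi - z_lo)
    mu = math.log(x_lo) - sigma * z_lo
    return mu, sigma

def power_mean(mu, sigma, order):
    """Power mean (E[X**order])**(1/order) of an untruncated lognormal X.

    Uses E[X**p] = exp(p*mu + p**2 * sigma**2 / 2); order 0 gives the
    geometric mean exp(mu) as the limiting case.
    """
    return math.exp(mu + 0.5 * order * sigma ** 2)

# Hypothetical specification: 10th percentile = 10, 90th percentile = 1000.
mu, sigma = lognormal_from_percentiles(0.10, 10.0, 0.90, 1000.0)

harmonic = power_mean(mu, sigma, -1)
geometric = power_mean(mu, sigma, 0)
arithmetic = power_mean(mu, sigma, 1)
root_mean_square = power_mean(mu, sigma, 2)

print(harmonic, geometric, arithmetic, root_mean_square)
```

Because the two percentiles are symmetric about the median, the geometric mean lands at 100, and the other power means spread out on either side of it in proportion to the variance of the logarithm.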
The INT1 and INT2 macros integrate arbitrary functions of 1 or 2 random variables, respectively, over specified ranges of the random variables using probability weighting. These tools let the user calculate expected values of functions and cumulative frequencies within specific ranges, both major quantitative applications in the SMT.

The CDF and HISTGRAM macros let you display a cumulative frequency plot and a frequency histogram, respectively, for arbitrary functions of 1 or 2 random variables. The cumulative frequency plot is useful for certain quantitative determinations. Either the PROB1 or PROB2 macro must be invoked to calculate local probabilities before CDF or HISTGRAM can be used.

The UNISTAT macro analytically calculates most statistics of univariate normal, uniform, triangular, or trapezoidal distributions. The available statistics are percentiles, cumulative probabilities, and power means. It can also handle the inversion of common and natural logarithm transformations. Because the calculations are analytic, the resulting statistics are more accurate than those that could be produced using the INT1 macro.

The SAMPLE1 and SAMPLE2 macros generate synthetic samples of specified size based on selected distributions of 1 or 2 variables, respectively.

The CPLOT macro plots selected "ellipses of concentration" for bivariate distributions. This lets the user replace clouds of scattered data with elliptical outlines that indicate where a specified percentage of the population is concentrated. CPLOT is a useful qualitative tool for comparing multiple classes of similar data. The PDF2 and LEVEL2 macros must be invoked ahead of CPLOT.

Auxiliary Macros

There are 5 other tools that are used either to look up answers from the output datasets of the application tools or to characterize samples of exact data. These 5 are called auxiliary tools.

LOOKUP1 and LOOKUP2 interpolate values of one or more numeric variables from a SAS dataset that is viewed as a table with 1 or 2 numeric arguments, respectively.

In a typical application, INT1 or INT2 is used to produce a "table" (dataset) of answers. The lookup macros can be used to interpolate answers for each observation of a dataset that has arbitrary values for the argument variables. This process is demonstrated in the Examples section.

PROBPLOT is a very flexible tool that calculates cumulative frequency and first-moment distribution functions for samples of exact data and plots them on a normal-probability scale. It accommodates BY-variable processing, and it can make an output dataset. In effect it is the visual analog of PROC UNIVARIATE. Its ability to overlay curves for multiple BY-values makes it useful for comparing multiple samples.

LORENZ draws the Lorenz curve and calculates the Gini coefficient of non-uniformity for samples of exact data. It also accommodates BY-variable processing.

DP is a conversational macro that calculates Dykstra-Parsons coefficients of variation for approximately lognormal samples. This coefficient of variation is a normalized measure of the variance of the logarithm of a lognormal variable. The user invokes DP from within SAS Display Manager. The interactive aspect is that he sees a log-probability plot of the sample, and he may trim off arbitrary amounts of the low- and high-percentile tails before the coefficient is calculated.

Benefits of Programming the SMT using the SAS System

The SAS System made it possible to program the SMT in a single environment that provides a database facility to store statistical models, analytical capabilities to handle the applications, and device-independent graphics capabilities to display samples and distributions.

This single environment makes it possible to assemble a collection of advanced applications that presents a simple and uniform appearance to the user.

For users who are statistically inclined and familiar with the base SAS software, the SMT is open-ended. Two of the application macros create datasets that represent univariate and bivariate joint PDF's, respectively. These could be used in applications that are not currently available as macros, e.g., making perspective views of the bivariate joint PDF surface that show ellipses of concentration as level curves on that surface.

Benefits of Programming the SMT as SAS Macros

SAS macros permit complete flexibility in defining to the application tools the functions of one or two random variables that are to be integrated or displayed.

Those application macros that expect the user to identify such a function or functions use two keyword parameters for this purpose. One of the parameters is a list of the variable names of the functions. The second parameter is the name of a "function macro" that the user must code to assign values to the functions in terms of the values of the random variables.

The function macro serves as a Trojan horse to carry the assignment statements into the DATA step within the application macro.
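
The pattern of handing a user-coded function to an integration tool, building a table of answers, and then interpolating in that table can be sketched as follows. The sketch uses Python callables in place of SAS function macros. The names `user_function`, `int1`, and `lookup1` are hypothetical stand-ins that only mimic the roles described in the text; they are not the SMT's interfaces, and the numbers are invented.

```python
import math
from bisect import bisect_right

def normal_pdf(z, mu, sigma):
    """Normal (Gaussian) density with location mu and scale sigma."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def int1(func, mu, sigma, lo, hi, n_steps=2000):
    """Probability-weighted integral of func(z) over [lo, hi] (trapezoidal rule)."""
    step = (hi - lo) / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        z = lo + i * step
        end_weight = 0.5 if i in (0, n_steps) else 1.0
        total += end_weight * func(z) * normal_pdf(z, mu, sigma)
    return total * step

def lookup1(table, x):
    """Linearly interpolate an answer from a sorted list of (argument, answer) pairs."""
    args = [a for a, _ in table]
    if not args[0] <= x <= args[-1]:
        raise ValueError("argument outside the range of the table")
    j = min(bisect_right(args, x), len(args) - 1)
    (x0, y0), (x1, y1) = table[j - 1], table[j]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def user_function(z):
    """User-coded function of the random variable (plays the role of a function macro)."""
    return 10.0 ** z

# Build a "table" of expected values for several hypothetical location values,
# holding the scale at 0.3 and integrating over the assumed range [-1, 6].
table = [(mu, int1(user_function, mu, 0.3, -1.0, 6.0))
         for mu in (1.0, 1.5, 2.0, 2.5, 3.0)]

# Interpolate an answer for an argument value that is not in the table.
answer = lookup1(table, 1.8)

# Integrating the constant 1 over a sub-range gives a cumulative frequency.
fraction_below_2 = int1(lambda z: 1.0, 2.0, 0.3, -1.0, 2.0)

print(answer, fraction_below_2)
```

Passing the function as an argument keeps the integration routine generic, which is the same benefit the text attributes to the function macro. Linear interpolation is only as good as the spacing of the table, so a finer grid of arguments would be advisable where the answers change rapidly.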

The function macro may be as simple as a single assignment statement, or as complex as a lookup into a multi-argument table (SAS dataset). The function macro may itself invoke other macros, as long as they make sense within a DATA step.

A user-coded macro may also be used to define the relationship (if any) between two univariate PDF's that are combined to generate a bivariate joint PDF. In the simplest case, the value of one random variable (the outer variable) controls the location of the conditional PDF of the second (inner) random variable through a linear or quadratic equation. In SMT terminology, the outer variable serves as the "driver" for the inner PDF.

In some circumstances the inner driver needs to be a function of the outer random variable. This is accommodated with a user-coded macro.

Another benefit facilitated by coding the SMT as a SAS macro system is the high degree of fault tolerance that is built into the macros. The SMT verifies syntax and logic within the macros because no foreground user-interface has been written to date.

The SMT achieves its great flexibility by allowing a large number of input specifications via keyword parameters. Wherever possible, standard default values are assigned to the parameters. The flexibility also increases the chances for the user to violate rules of syntax or logic. For this reason, the macros verify every input item and terminate with an informative diagnostic message when a problem is identified.

A final benefit of coding the SMT as a SAS macro system is the ease with which a user can create personalized extensions of the system. SAS macro language is easy to read and easy to learn. A reasonably skilled user can copy any of the application macros and customize it to make new plots or different calculations.

For example, the current application macros are intended to handle one statistical model (either univariate or bivariate) per invocation. There are applications where it is necessary to handle hundreds of thousands of statistical models. CPU time would be substantially reduced if an application macro could accept as input a large SMDS (one model per observation) instead of a pointer to a single observation. This modification could be implemented with relative ease.

User Acceptance

The SMT is currently installed at 5 Exxon computing sites. A user's guide provides extensive documentation. Cataloged example jobstreams serve as templates that users copy and customize for their own use.

Most users intuitively grasp the advantages of the SMT approach relative to conventional prediction methods. However, the users often need training in the concepts of probability density functions if they are to realize the full power of the SMT. Exxon is addressing this training need by providing 1-day workshops at each computing site.

Future Directions

Planned enhancements to the SMT include: tools that quantify error propagation; a Monte Carlo tool; additional graphics; and a tool to estimate parameters for a statistical model in which the distribution of a random variable is conditional upon the values of multiple variables.

Coding the tools as SAS procedures is also being considered.

Conclusions

The SMT offers an easy way to make, catalog, and use analytic models of real populations of numeric random variables. Programming the statistical-model techniques using SAS macros makes them accessible to a large user community. These tools let you make better use of scattered data.

David S. Frankel
Exxon Company, U.S.A.
P.O. Box 1600
Midland, Texas 79702-1600
(915) 688-7790

* SAS is the registered trademark of SAS Institute Inc., Cary, NC, USA.

Figure 1: Conventional Prediction Is Based on Regression (plot not reproduced; axis label: x)

Figure 2: Population Models Capture More Information (plot not reproduced; axis labels: x, y, Probability Density)

Figure 3: Simplified Flowchart of SMT (flowchart not reproduced; labels: Making Models, Applying Models, Sample-Based, Synthetic Populations, Catalog of Models, Functions)

Figure 4: Detailed Flowchart of SMT (flowchart not reproduced; Making Models: PARMEST, DEPTH, GENNORM, GENLNORM, GEN; Statistical-Model Datasets: QUERY, EXTRACT; Applying Models: INT1, INT2, CDF, HISTGRAM, UNISTAT, PDF1, PDF2, CPLOT (Ellipses of Concentration), LOOKUP1, LOOKUP2; other labels as extracted: QUERYC, Sample)
