
DENSITY ESTIMATION FOR FUNCTIONS OF CORRELATED

RANDOM VARIABLES

A Thesis Presented to

The Faculty of the

Fritz J. and Dolores H. Russ College of Engineering and Technology

Ohio University

In Partial Fulfillment

of the Requirements for the Degree

Master of Science

by

Jeffrey P. Kharoufeh

June 1997

ACKNOWLEDGMENTS

The author would like to express his gratitude and thanks to Dr. Helmut Zwahlen for his assistance and guidance in this work, to Dr. Thomas Lacksonen for his helpful suggestions and encouragement, to Dr. David Keck for his mathematical input, and to

Dr. Hollis Chen who served as College representative on the Committee. The author would like to give special thanks to Thomas Schnell and Dr. Richard Gerth for their time, input, and helpful suggestions. Chad Johnson, Chandrasekar Subramanian, and

Julie Kocsis also gave helpful suggestions for coding. I especially thank my parents,

Towfiq and Janette, for their continued love, support, and patience.

CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1. INTRODUCTION
1.1. Background
1.2. Statement of the Problem
1.3. Review of the Literature
1.3.1. Functions of Independent Random Variables
1.3.2. Functions of Dependent Random Variables
1.4. Approach

CHAPTER 2. ESTIMATING THE JOINT PROBABILITY DENSITY FUNCTION
2.1. Approaches to Density Estimation
2.1.1. The Histogram Estimator
2.1.2. The Kernel Density Estimator for Univariate Data
2.2. The Multivariate Kernel Estimator
2.2.1. Introduction to Multivariate Kernel Estimates
2.2.2. Selection of the Appropriate Smoothing Parameter
2.2.3. Selection of the Appropriate Kernel Function

CHAPTER 3. FROM THE JOINT DENSITY TO FUNCTIONS OF R.V.'s
3.1. The Transformation of Variables for Discrete Random Variables
3.1.1. A Small Example
3.2. The Transformation of Variables for Continuous R.V.'s
3.2.1. Derivations for Continuous R.V.'s Using the Transformation of Variables
3.2.2. A Small Example

CHAPTER 4. IMPLEMENTATION OF THE APPROACH
4.1. Discretization of the Problem
4.2. Some Notation and Definitions
4.3. Construction of the Joint Density Estimate
4.3.1. Selection of the Grid, G
4.4. Implementation of the Transformation of Variables

CHAPTER 5. EMPIRICAL RESULTS AND VALIDATION
5.1. Experimental Method
5.1.1. Experimental Design
5.1.2. Experimental Procedure
5.2. Evaluation of Empirical Results
5.2.1. Tests of Goodness-of-Fit
5.2.2. The Analysis of Variance
5.2.3. The Estimator Versus Enumeration

CHAPTER 6. SOFTWARE IMPLEMENTATION
6.1. Requirements of the Software Implementation
6.2. Using the Software Package
6.3. Limitations of the Program

CHAPTER 7. CONCLUSIONS, RECOMMENDATIONS, AND FUTURE WORK
7.1. Conclusions
7.2. Recommendations
7.3. Improvements and Future Research
7.3.1. Improvements to Current Research
7.3.2. Future Research

REFERENCES

APPENDICES
APPENDIX A.1 Experimental Results
APPENDIX A.2 Comparison of Cumulative Distribution Functions
APPENDIX A.3 Means Tables and Scheffe Post-Hoc Tests
APPENDIX A.4 Source Code Listing/Sample Input File

LIST OF TABLES

1. A(K) For Normal and Epanechnikov Kernels
2. Factors and Their Respective Levels For the Designed Experiment
3. ANOVA Table, F-Values Given at the 0.05 Level

LIST OF FIGURES

1. Univariate Histogram for Old Faithful Geyser Data with 107 Observations
2. Kernel Estimate with Varying Window Width
3. Gaussian Kernel Estimate of Old Faithful Geyser Data
4. Comparison of Histogram and Kernel C.D.F.s for Old Faithful Geyser Data
5. Comparison of C.D.F.s for Two Kernels and Histogram Estimates
6. Probability Distribution of Z = X + Y, Discrete Transformation of Variables Example
7. Probability Distribution of Z = X + Y, Continuous Transformation of Variables Example
8. Sample Scatterplot of Correlated Data (X ~ U(0,1), n = 100)
9. Discretization of the Region Containing R for Construction of G
10. Grid Constructed on Correlated Data From Figure 8
11. Sample Input Data File for Evaluation of Estimator
12. Sample Output Data File output.out From Evaluation Program
13. Comparison Between Estimator C.D.F. and Simulated C.D.F. for Product When X is Distributed Normally and R² = 0.90
14. Pareto Chart of Mean Squares for Each Effect in ANOVA Model
15. The Effect of Correlation on Maximum Absolute Error
16. The Effect of Distribution Type on Maximum Absolute Error
17. Interaction Plot for Correlation and Distribution Type
18. Interaction Plot for Correlation and Operation
19. Interaction Plot for Correlation and Number of Observations
20. The Effect of Number of Observations (n) on Maximum Absolute Error
21. Interaction Plot for Number of Observations (n) and Distribution Type
22. Interaction Plot for Number of Observations (n) and Operation
23. Interaction Plot for Distribution Type and Operation
24. The Effect of Operation on Maximum Absolute Error
25. Comparison of MAE C.D.F.s for Estimator and Enumeration (n = 50, X Normally Distributed, R² = 0.90, Sum)
26. Comparison of MAE C.D.F.s for Estimator and Enumeration (n = 400, X Normally Distributed, R² = 0.90, Sum)
27. Sample Input Data File for Program CORREST
28. Sample Output Data File From Program CORREST
29. Flow of Program CORREST
30. Flow of Calculations for Program CORREST

CHAPTER 1. INTRODUCTION

1.1. Background

The theory of probability and statistics focuses primarily on random variables (r.v.'s) and their associated probability distributions. Most random variables may be considered discrete or continuous, while others are neither discrete nor continuous. A discrete random variable is one for which the set of all possible values is finite or countable. In contrast, a continuous random variable is one for which the set of all possible values is uncountably infinite. In either case, it is desirable to know the probability that the random variable takes on a single value or a range of values.

Suppose X is a discrete random variable. The function

p_X(x) = P(X = x)     (1.1)

gives the probability that the random variable X will take on the value x. This function, p_X, is referred to as the probability mass function (p.m.f.) for the random variable X.

When X is a continuous random variable, the cumulative distribution function (C.D.F.) of X is defined as

F(x) = P(X ≤ x) = ∫_{-∞}^{x} f(t) dt,   -∞ < x < ∞     (1.2)

The function f is referred to as the probability density function (p.d.f.) for a continuous random variable and is analogous to the probability mass function for a discrete random variable. From (1.2) it is seen that the probability density function can be obtained as

f(x) = dF(x)/dx,   -∞ < x < ∞     (1.3)

Thus, we may determine the probability that a random variable is less than or

equal to some value by summing the p.m.f. up to that value for a discrete r.v. or by

integrating the p.d.f. for a continuous random variable. Oftentimes, the probability mass function is also referred to as the probability density function.

Consider now two random variables X and Y. These two r.v.'s may be labeled as either independent or dependent. Two random variables X and Y are said to be independent if and only if

f(x, y) = f_X(x) f_Y(y)     (1.4)

where f(x, y) is the joint probability density function and f_X(x) and f_Y(y) are the marginal densities for X and Y, respectively. If two random variables are not independent, then they are said to be dependent. The difference between these two types of random variables considerably complicates the problem of determining the probability density function for any function of these random variables.

In this research, a function of random variables will refer to simple arithmetic combinations, such as addition, subtraction, multiplication, and division, performed on two random variables. The algebraic manipulation of r.v.'s is an area which is largely uncharted in terms of practical results. Such manipulations are extremely useful for practitioners and researchers in the pure sciences, engineering, operations research, and economics. Although some work has been done in the area of independent r.v.'s, the dependent case lags far behind. Dependencies which exist between random variables may be nonlinear; however, this thesis shall focus on the case when two random variables, X and Y, are linearly dependent, or correlated.

1.2. Statement of the Problem

Engineers use statistical data in a variety of settings. For instance, an industrial and systems engineer may be interested in determining the overall throughput for a certain manufacturing process; a chemical engineer may be interested in determining the expected yield in a certain chemical reaction; while a transportation engineer may be interested in determining the average volume of traffic through an intersection in order to increase driver safety by improving traffic control devices. In each of these settings, the engineer is required to physically collect and analyze data which consists of observations of the system. It can be argued that the behavior of most real-world systems is stochastic rather than deterministic. Therefore, a responsible analysis of any phenomenon should include the inherent probabilistic behavior of the system. Measures such as the mean and variance are certainly useful in describing the main features of a distribution; however, in decision-making processes, the most valuable pieces of information available are the probability density function or the cumulative distribution function.

When the probability distribution of a given phenomenon is known, more intelligent decisions can be made regarding the performance of a given system. For example, rather than stating that the average throughput of a given system is 20 units per day, the industrial and systems engineer may report that with probability 0.025, the throughput of the system will be greater than 25 units. For these reasons, probability density estimation has become an important tool to the engineering analyst. Many methods exist for estimating the probability density

function of a single random variable; however, in real-world systems, multiple random variables are usually involved which are interacting in some way to produce a random response. Hence, it is expected that this random response follows some probability distribution. For example, the throughput time of a product can be represented by a

random variable which is the sum of the completion times of each operation in the manufacturing process. The completion times for these operations are also random variables which may be correlated, thus complicating the analysis. For instance, the time to complete the ith operation may depend on the type of work which was done on the part in the (i-1)st operation. In this particular example, one would seek the probability

density function for the sum of multiple correlated random variables. Thus, if

X_1, X_2, ..., X_n are n correlated random variables describing the operation times for the n stages of a manufacturing process, then the resulting random variable may be given as

Z = X_1 + X_2 + ... + X_n     (1.5)

Consider now an example from the field of human factors engineering.

Anthropometry deals with the measurement of dimensions of the human body. Most

often, human factors engineers are interested in designing products with one anthropometric dimension involved, such as the height of an individual. Designers choose dimensions for their designs based upon the 95th percentile of males and the 5th percentile of females. However, when multiple dimensions must be considered, the problem becomes more complicated. Specifically, if the percentile rules are applied, for example, to determine the 5th percentile height of females, the true measure is a sum of bone lengths between several joints. Summing the various lengths for each 5th percentile value does not yield the 5th percentile female [29]. These lengths are actually correlated, but when treated as uncorrelated, yield poor estimates of the true height of the 5th percentile female. A more accurate analysis would include the inherent correlation between the different bone lengths in the human body.

Many responses may be modeled as the sum, difference, product or quotient of random variables, or combinations of some or all of these. In general, it is difficult, if not impossible, to solve for a closed-form, analytical solution for such a density. For these major reasons, it is desirable to develop a practical technique by which an engineer may closely approximate the probability distribution for the sum, difference, product or quotient of two correlated random variables. Stated mathematically, let X and Y be two correlated random variables and let {(x_i, y_i)}, i = 1, 2, ..., n, be a set of ordered pairs representing n observations of X and Y. It is desired to construct an approximation of the probability density and cumulative distribution functions for the following random variables:

Z_1 = X + Y,   Z_2 = X - Y,   Z_3 = XY,   Z_4 = X/Y     (1.6)

1.3. Review of the Literature

1.3.1. Functions of Independent Random Variables

Since the early part of this century, much attention has been given to the derivation of the probability density function for the sum of random variables. Several authors have considered the sum of independent random variables, such as Aroian [1], Cramer [8], Springer [31] and Giffin [13]. It is well-documented that the distribution of the sum of two independent random variables is given by the convolution of their respective density functions. Thus, if X and Y are two independent random variables, then the distribution of their sum, Z = X + Y, is given by

h(z) = Σ_x f(x) g(z - x) = f(z) * g(z)     (1.7)

if X and Y are discrete random variables and

h(z) = ∫_{-∞}^{∞} f(x) g(z - x) dx = f(z) * g(z)     (1.8)

if X and Y are continuous random variables, where f and g are the p.d.f.'s for X and Y, respectively. In the above equations, (*) represents the convolution operator, or "folding", of two functions. These relationships are easily obtained due to the assumption of independence of the two random variables. At times, the evaluation of such summations or integrals is tedious and time-consuming. Giffin [13] presents Borel's Theorem for dealing with this problem. Essentially, Borel's Theorem states that the convolution of two functions is the inverse of the product of their transforms. Thus, if T represents an appropriate transformation, then

T(h(z)) = T(f(x)) · T(g(y))     (1.9)

From (1.9), it is seen that recovery of the density function for the random variable Z may be achieved by inverting T(h(z)), where

h(z) = T^{-1}[T(f(x)) · T(g(y))].

Giffin [13] goes on to show that this result can be extended to sums which involve more than two independent random variables. Specifically, if one seeks to determine the probability density function for the sum of n independent random variables given by

Z = X_1 + X_2 + ... + X_n     (1.10)

then one must invert the transform

T(h(z)) = T(f_1(x_1)) · T(f_2(x_2)) ··· T(f_n(x_n))     (1.11)

where f_i is the density function for the random variable X_i.

Hence, (1.10) and (1.11) show that the distribution for the sum of two or more independent random variables may be obtained directly through the convolution sum

(integral) or through the transform approach. However, inversion of the transform result may be difficult, if not impossible. It is at this stage that the technique becomes impractical for the engineer. Yu [38] and Hu [15] demonstrate that much of the tedium may be avoided by accepting a very close approximation of the resulting density. Such an approximation is obtained by representing each of the n independent distributions by a histogram estimator. Silverman [30] points out that such an estimator can be extremely effective in approximating univariate distributions due to its simplicity. Using the histogram, each distribution is replaced by a series of adjacent uniform distributions defined over a specified interval. When the r.v.'s are continuous, the Laplace transform may be used. This transform and its properties allow for quick transformations of each uniform density, while tabulated results for the Laplace allow for relatively straightforward inversion of the final product. When the random variables are discrete, the analogous transform to the Laplace, the Z-transform, may be used; a small numerical sketch of the underlying convolution idea is given below.
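To make the idea concrete, the following minimal sketch (in Python, with illustrative data and grid choices that are not taken from the cited works) approximates the density of Z = X + Y for two independent r.v.'s by discretizing each marginal with a histogram and convolving the two estimates directly; the direct convolution used here is the time-domain equivalent of the Laplace-transform route described above.

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(1.0, 5000)              # sample of X (illustrative)
y = rng.uniform(0.0, 2.0, 5000)             # sample of Y, independent of X

step = 0.05
bins = np.arange(0.0, 10.0 + step, step)
fx, _ = np.histogram(x, bins=bins, density=True)   # histogram estimate of f
gy, _ = np.histogram(y, bins=bins, density=True)   # histogram estimate of g

# Discrete convolution approximates h(z) = integral of f(x) g(z - x) dx.
hz = np.convolve(fx, gy) * step
z = np.arange(len(hz)) * step
print("total probability:", round(hz.sum() * step, 4))   # should be near 1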

The distribution for the difference of independent random variables is easily attainable since the operations of addition and subtraction are analytically equivalent.

More specifically, the difference, Z = X - Y may be rewritten as Z = X + (-Y), and then the addition result applied. An obvious hindrance to this approach is that one must account for values which are less than zero. Hu [15] utilizes the bilateral Laplace transform to account for this complication, showing that the result is easily obtained.

Determining the distribution of the product or quotient of random variables is a problem which has received far less attention. Craig [7] considered the distribution of the product and quotient of two normal random variables. In fact, most researchers working in this area concentrated their efforts on random variables with known probability density functions. Epstein [10] first demonstrated that the Mellin transformation was the prominent tool to be used in determining the product and quotient p.d.f.'s for two random variables. Springer and Thompson [32] derived results for the distribution of the product of n independent random variables. Although these results are significant, they are of little practical value to the practicing engineer.

As shown by Yu [38], the p.d.f.'s for the product and quotient of independent random variables may be closely approximated in a manner similar to that of the sum and difference results. However, for such densities, the Mellin transformation is used rather than the Laplace, assuming that the random variables are continuous. Since the operations of multiplication and division are analytically equivalent, the quotient can be obtained by recognizing that the quotient, Z = X/Y, may be rewritten as Z = X(1/Y). The reader is referred to Hu [15] for more on this approach.

1.3.2. Functions of Dependent Random Variables

It is apparent that much work has been done in the combination of independent random variables. However, in many real-world systems, the assumption of independence of the random variables is not a valid one. Far less work has been done in the area of determining the probability density function for the combination of dependent random variables. Giffin [14] presents results for means and for some dependent combinations, but no practical approach to solving for the overall distributions.

Springer [31] points out that when the assumption of independence is removed, the transform of the p.d.f. of the sum of dependent random variables is no longer equivalent to the product of the transforms of each r.v., thus complicating the matter.

The author notes that in order to determine the sum (or difference) of n dependent random variables, one must first solve the multivariate Fourier transform given by

F(t_1, t_2, ..., t_n) = ∫_{-∞}^{∞} ··· ∫_{-∞}^{∞} exp{i(t_1 x_1 + ··· + t_n x_n)} f(x_1, ..., x_n) dx_1 ··· dx_n     (1.12)

where f represents the joint density of the n random variables. Now, letting t_i = t, i = 1, 2, ..., n, and evaluating the associated inversion integral yields the density of the sum of the dependent r.v.'s given by

h(z) = (1/2π) ∫_{-∞}^{∞} e^{-itz} F(t, t, ..., t) dt     (1.13)

From (1.13) it is seen that the density of the sum (or difference) may only be obtained if the joint density is known. In such a case, however, the inversion integral may be difficult, if not impossible, to obtain. The author proceeds to demonstrate the sum of two dependent normal r.v.'s and of n dependent normal r.v.'s. The multivariate Fourier transform, though providing a closed-form solution under the right conditions, does not readily allow for the analysis of real-world data. Springer concludes his discussion of dependent random variables by presenting some results for the product and quotient of two dependent random variables. The author presents the two-dimensional Mellin transform and its inverse, which exist only under the stringent conditions provided by two theorems. If (X, Y) constitutes a random vector such that X and Y are not independent, then denote their joint density by f(x, y).

Springer limits the two random variables to only positive values. Thus, the two-dimensional Mellin transform is given by

M(s_1, s_2) = ∫_0^∞ ∫_0^∞ x^{s_1 - 1} y^{s_2 - 1} f(x, y) dx dy     (1.14)

which has the inverse

f(x, y) = (1/(2πi)²) ∫_{h - i∞}^{h + i∞} ∫_{k - i∞}^{k + i∞} x^{-s_1} y^{-s_2} M(s_1, s_2) ds_1 ds_2     (1.15)

This transform and its inverse exist only when the conditions of two stated theorems are satisfied. The reader is referred to Springer [31] for these theorems. These results, though interesting, provide no means by which the practitioner can obtain the resulting product or quotient of two or more dependent random variables. The recent literature also indicates the need for a pragmatic approach to solving for the probability distribution of functions of correlated random variables. Many authors solve applied problems which inherently contain correlation, but circumvent the handling of this dependency. For instance, Martel and Ouellet [20] present a problem in which a resource is allocated among partially interchangeable activities for which the demand levels of the activities are highly correlated, continuous r.v.'s. The authors reformulate the problem as a deterministic convex allocation problem through parametric programming, thus eliminating the need to consider dependencies among the demand levels.

Others provide some properties of combinations of dependent r.v.'s, but offer no technique for obtaining such combinations. Cover [6] provides results for the maximum of the sum of two dependent random variables, while Rinott [25] gives results for normal approximation rates for sums of dependent random variables.

The latter author investigates a method by which the order of normal approximation errors is improved for a limited class of bounded random variables. Neither author provides a description of how one would go about describing the distribution of such sums of dependent random variables.

Finally, Curran [9] presents one of few recent works in which a real-world problem involving the sum of correlated lognormal variables is solved. The problem involves determining the expected value for the option payoff when evaluating Asian or European options on portfolios. The author points out that the option payoff is a sum of correlated lognormal random variables. Curran computes the expected option payoff by conditioning on the geometric mean price. However, the approach only computes expected values for such option payoffs and does not consider the overall probability distribution of the resulting random variable.

It is evident that there does not exist a unified approach by which the practicing engineer may obtain the resulting probability distribution for a function of correlated random variables.

1.4. Approach

As shown by Giffin [13] and Hu [15], the approach to solving the independent case involves the use of the Laplace and Mellin convolutions of two (or more) densities and inversion to obtain the desired result. However, when the two variables cannot be assumed to be independent, these techniques are no longer appropriate.

It is the purpose of this research to provide a means by which an engineer may approximate the resulting probability density function and cumulative distribution function for a function of two correlated random variables, given a finite sample of observations from each of the two populations. The approach used to obtain this approximation is given herein.

Initially, the approach relies on the collection of data which is obtained from observation of some system. Specifically, real-world data is physically collected, or sampled from an assumed theoretical distribution, to obtain a set of ordered pairs. If the joint probability density function is known for the two random variables, then the analyst

essentially has all of the components necessary to construct the probability distribution of some function of the random variables. However, in practice, the joint probability distribution is seldom known. Thus, using the collected data, an estimate of the joint probability density function is constructed using the kernel density estimator.

Once the estimate of the joint probability distribution is known, functions of random variables, whether discrete or continuous, may be obtained using the transformation of variables technique, which is well documented in any text on mathematical statistics. A description of the technique is given in later chapters. The reader is referred to [11] for a more thorough treatment of the technique.

In order to demonstrate the capabilities of the approach, a small software package was developed which provides for a comparison between the approximation technique and a Monte Carlo simulation of several test problems. The technique was tested on problems with varying degrees of correlation, different sample sizes, and different input

distributions.

This technique will benefit industrial and systems engineers by 1) allowing them to include in their analyses the inherent linear dependencies in real-world systems, 2) providing a practical solution to a rather complicated problem, and 3) providing a foundation with which the technique may be expanded to higher dimensions.

CHAPTER 2. ESTIMATING THE JOINT PROBABILITY DENSITY FUNCTION

2.1. Approaches to Density Estimation

As defined by Silverman [30], density estimation is the attempt to construct the

probability density function for a set of observed data. There essentially exist two major

approaches for performing density estimation: parametric and nonparametric. In

parametric density estimation, the focus is on fitting a known probability distribution,

such as the normal or exponential, to collected data. The analyst is interested in finding

good estimates for the parameters of the distribution so that the data may be modeled

ideally by the theoretical distribution.

In nonparametric density estimation, the goal is to allow the data to stand on its own.

That is, an attempt is not made to fit the data to a well-known parametric distribution, but rather, alternative methods are used to determine the form of the density function.

The two major nonparametric approaches are the well-known probability histogram and the kernel estimator. In this work, it was desirable to obtain a nonparametric estimator

for the joint density function which would be somewhat insensitive to the inputs and

provide an adequate estimate of the true density. Furthermore, due to the fact that some

phenomena simply cannot be modeled by an idealized distribution, nonparametric estimation becomes a much more appealing approach. Both the histogram estimator and the kernel estimator will be described below.

2.1.1. The Histogram Estimator

The most well-known and widely used density estimator is the histogram estimator. Given a set of n observations for a single variable, denoted X_i, i = 1, 2, ..., n, the histogram is constructed by selecting an origin, x_0, and a bin width, h. Thus the bins of the histogram are the intervals [x_0 + kh, x_0 + (k + 1)h), for k ∈ Z, the set of integers. For a single variable, the density estimate is given by

f̂(x) = N_x / (nh)     (2.1)

where N_x is the number of observations falling in the same interval as x. The "smoothness" of the histogram is dictated primarily by the choice of the smoothing parameter, or bin width, h. This parameter may vary from bin to bin or be fixed at a given value. Figure 1 is a histogram estimate for a benchmark univariate data set, the

Old Faithful Geyser data. The Old Faithful Geyser at Yellowstone National Park is one of the most famous in the world. On occasion, the geyser erupts, shooting steam and hot water into the air. The data represent eruption lengths measured in minutes for the geyser.

Figure 1. Univariate Histogram for Old Faithful Geyser Data [30] with 107 Observations. (Horizontal axis: eruption length in minutes.)
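The histogram estimator (2.1) is simple enough to state in a few lines of code. The sketch below is a direct reading of (2.1): count the observations falling in the same bin as the evaluation point and divide by nh. The bimodal sample merely stands in for the eruption-length data; it is not the actual data set.

import numpy as np

def histogram_estimate(x, data, x0, h):
    # Density estimate at x: N_x / (n h), with bins [x0 + kh, x0 + (k+1)h).
    k = np.floor((x - x0) / h)                  # index of the bin containing x
    in_bin = np.floor((data - x0) / h) == k     # observations sharing that bin
    return in_bin.sum() / (len(data) * h)

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(2.0, 0.3, 60), rng.normal(4.3, 0.4, 47)])
print(histogram_estimate(3.0, data, x0=data.min(), h=0.33))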

The histogram estimator provides a few advantages, such as ease of computation and a simple graphical representation which lends itself to meaningful data exploration. The major drawbacks of the histogram are its lack of precision and its discontinuities, particularly if derivative information is required from the density estimate. Furthermore, the histogram requires specification of an origin and bin width. According to Silverman [30], when density estimates are needed as a component for other techniques, as is the case in this research, alternatives to the histogram may be more appropriate.

2.1.2. The Kernel Density Estimator for Univariate Data

An alternative nonparametric density estimator is the kernel density estimator

(KDE). The kernel estimator is best explained by first introducing the "naive" estimator.

Given a set of observations for one variable, X_i, i = 1, 2, ..., n, it is known that the probability density function for a continuous random variable X may be defined as

f(x) = lim_{h→0} (1/2h) P(x - h < X < x + h)     (2.2)

Thus, an approximator for the density function given above may be obtained by allowing h to be very small and using the relative frequency definition for the numerator, yielding

f̂(x) = (1/2hn) · [no. of observations in (x - h, x + h)]     (2.3)

Thus, it is seen that if an arbitrary observation, X_i, is within h units of the fixed point of interest, x, then the observation has an influence on the density estimate at x. Otherwise, the observation has no influence. In a similar manner, by applying a weight function rather than the relative frequency, one may construct the "naive" density estimator. Let

w(u) = 1/2 if |u| < 1, and 0 otherwise     (2.4)

The fixed point is tested against each observation to determine the observation's influence on the density estimate at that point. Thus, by (2.4), the influence of a given observation, X_i, is given by

(1/n)(1/h) w((x - X_i)/h)

It is seen that if an observation contributes to the density estimate at x, then the estimator centers a box of width 2h and height (2hn)^{-1} over the observation. Now, by summing vertically the "influence" of each observation, the naive density estimator is given by

f̂(x) = (1/n) Σ_{i=1}^{n} (1/h) w((x - X_i)/h)     (2.5)

Consider now replacing the weight function of (2.4) with a different function which places "bumps" at each observation, rather than boxes. Such a function will be referred to as the kernel function, K, thus defining the kernel estimate as

f̂(x) = (1/nh) Σ_{i=1}^{n} K((x - X_i)/h)     (2.6)

Here, the kernel function is usually, but not necessarily, a probability density function itself, thus satisfying the condition

∫_{-∞}^{∞} K(u) du = 1     (2.7)

The shape of the kernel function dictates the shape of each bump, while the smoothing parameter determines its width. As expected, when the window width increases, the density estimate becomes smoother and less sensitive to the curvature of the true density. When the smoothing parameter is small, fine details are given in the density estimate, which may or may not be desirable. Figure 2, adopted from [30], demonstrates the influence of the smoothing parameter on the kernel estimate; a small numerical sketch of the same effect follows.
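The window-width effect is easy to reproduce numerically. The sketch below evaluates the univariate kernel estimate (2.6) with a Gaussian kernel at a small and a large value of h; the data and the two bandwidths are illustrative stand-ins, chosen only to mimic the contrast in Figure 2.

import numpy as np

def kde(points, data, h):
    # f_hat(x) = (1/nh) * sum_i K((x - X_i)/h) with the Gaussian kernel.
    u = (points[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(2.0, 0.3, 60), rng.normal(4.3, 0.4, 47)])
grid = np.linspace(0.0, 6.0, 301)
rough = kde(grid, data, h=0.2)     # small h: fine detail, noisy estimate
smooth = kde(grid, data, h=0.8)    # large h: smooth, curvature washed out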

Silverman [30] and Scott [29] have shown that the kernel method is superior when estimating the density of a single random variable, X. Recall from the previous section, the histogram constructed from 107 observations of the Old Faithful Geyser

data, which is a common benchmark for the evaluation of univariate density estimators.

A density estimate was constructed using the Gaussian kernel function based on these

107 observations. This kernel is given by

K(u) = (1/√(2π)) exp(-u²/2)

The Gaussian kernel was chosen due to the fact that it is a smooth, symmetric, unimodal probability density function. Figure 3 shows the kernel density estimate.

Figure 2. Kernel Estimate with (a) Small Window Width (0.2) and (b) Large Window Width (0.80), Adopted from Silverman [30].

Figure 3. Gaussian Kernel Estimate of Old Faithful Geyser Data.

A comparison of histogram and kernel C.D.F.'s is given in Figure 4.


Figure 4. Comparison of Histogram and Kernel C.D.F.'s for Old Faithful Geyser Data.

Even with the bimodal structure of the original data, it is clearly seen that the cumulative distribution functions coincide, confirming that the kernel provides a good estimate of the true density function.

The advantages of using the KDE are quite obvious in the univariate case. In contrast to the histogram, which requires specification of the origin and bin width of its cells, the kernel density estimate requires only specification of the smoothing parameter, h. Furthermore, in the univariate case, the KDE provides a more appealing graphical display of the distribution for interpretation. In the following section, it will be seen that most of the subjectivity is removed in selection of the smoothing parameter by selecting the appropriate value for a chosen kernel function and sample size (n).

In the multivariate case, the histogram has major problems. In addition to its awkward physical interpretation, the multivariate histogram requires specification of an origin and bin width in all coordinate directions. By contrast, a fixed smoothing parameter may be used in all coordinate directions with the KDE. These issues will be discussed further in the following sections.

2.2. The Multivariate Kernel Estimator

2.2.1. Introduction to Multivariate Kernel Estimates

When considering functions of random variables, one is interested not in estimates of univariate distributions, but rather the joint probability density function.

The kernel method's superiority to the histogram is best realized in the multivariate setting. The multivariate form of the kernel estimator is given by

f̂(X) = (1/(n h^d)) Σ_{i=1}^{n} K((X - X_i)/h)     (2.8)

where X is a row vector and d is the number of variables being considered. The emphasis of this work will focus on the case d = 2. Thus, X_i = (x_i, y_i), i = 1, 2, ..., n, constitute a set of ordered pairs which are the empirical observations. The smoothing parameter is again given by h. It is seen that the kernel estimate for multivariate data is analogous to the univariate case in that we consider the "distance" between our point of consideration, X, and the observations X_i, i = 1, 2, ..., n. Furthermore, the kernel function will now satisfy

∫_{R^d} K(X) dX = 1

and possibly be a unimodal, symmetric probability density function.

Some concerns arise when using the kernel estimate given in (2.8). If there exist extreme differences in variance for the component column vectors, then one cannot expect a good estimate for the overall joint density. Silverman [30] suggests a method for "pre-scaling" the data so that this complication may be alleviated. More specifically, the author suggests that one linearly transform the data to have unit covariance matrix according to the technique suggested in [12], smooth with a unimodal, symmetric kernel, and then transform the data back. These three steps are all implemented by using a density estimate of the form

f̂(X) = (1/(n h^d |Σ|^{1/2})) Σ_{i=1}^{n} K(h^{-2} (X - X_i)ᵀ Σ^{-1} (X - X_i))     (2.9)

where Σ represents the covariance matrix for the data, |Σ| is its determinant, and Σ^{-1} is its inverse. All other variables are defined as previously. By inspection of (2.9), it is seen that the quantity h^{-2} (X - X_i)ᵀ Σ^{-1} (X - X_i) reduces to a scalar value for i = 1, 2, ..., n. Thus, using a substitution of variable, let

u_i = h^{-2} (X - X_i)ᵀ Σ^{-1} (X - X_i),   i = 1, 2, ..., n     (2.10)

Each "bump" is obtained by evaluating K at the point u_i. Two of the more common kernel functions in the multivariate case are the multivariate normal and the multivariate Epanechnikov. The multivariate normal is given by

K(u) = (2π)^{-d/2} exp(-u/2)

and the multivariate Epanechnikov is given by

K_e(u) = ((d + 2)/(2c_d))(1 - u) if u < 1, and 0 elsewhere

where c_d is the volume of the d-dimensional unit sphere. Thus, for example, if the Epanechnikov kernel is being used for estimation of the joint density for two random variables, the value of the kernel function would be given by

K_e(u) = (2/π)(1 - u) if u < 1, and 0 otherwise

since c_2 = π. A minimal numerical sketch of (2.9) and (2.10) is given below.
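The sketch below implements (2.9) and (2.10) for the bivariate case: the sample covariance matrix supplies the pre-scaling, the scalar u_i is formed for each observation, and the d = 2 Epanechnikov kernel (2/π)(1 - u) is applied. The data generation and bandwidth are illustrative only.

import numpy as np

def kde2(point, obs, h):
    # Bivariate kernel estimate at one fixed point; obs has shape (n, 2).
    n = len(obs)
    S = np.cov(obs, rowvar=False)            # sample covariance matrix
    Sinv = np.linalg.inv(S)
    diff = point - obs                       # (n, 2) differences X - X_i
    u = np.einsum('ij,jk,ik->i', diff, Sinv, diff) / h**2   # u_i of (2.10)
    K = np.where(u < 1.0, (2.0 / np.pi) * (1.0 - u), 0.0)   # Epanechnikov, d = 2
    return K.sum() / (n * h**2 * np.sqrt(np.linalg.det(S)))

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 200)
obs = np.column_stack([x, x + rng.normal(0.0, 0.1, 200)])  # correlated pairs
print(kde2(np.array([0.5, 0.5]), obs, h=0.3))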

A discussion of selection of the kernel function and smoothing parameter is given in the following sections.

2.2.2. Selection of the Appropriate Smoothing Parameter

In the multivariate case, selection of the smoothing parameter is of great

importance. For the purposes presented herein, a fixed smoothing parameter will be used in each coordinate direction; however, it should be noted that a vector of different

smoothing parameters could be used. The reader is referred to [30] or [29] for details on

a variable smoothing parameter. The use of a single smoothing parameter is in

accordance with (2.9) to construct the density estimate.

The smoothing parameter is chosen to minimize the mean integrated squared error (MISE), as per Silverman [30] and Scott [29]. Thus, to obtain the smoothing

parameter which minimizes MISE, hereafter referred to as h_opt, calculate

h_opt = A(K) · n^{-1/(d+4)}     (2.11)

where A(K) is a constant value which depends on the type of kernel used, n is the number of real-world observations, and d is the number of variables under consideration. The values for A(K) are adopted from [30] and are given below in Table 1 for the Normal and Epanechnikov kernel functions. Hence, a choice of the kernel function is made and the optimum smoothing parameter can be computed as a function of this choice, the number of real-world observations taken, and the number of variables. In this research, the smoothing parameter remained fixed at the value h_opt. However, it is possible to select the smoothing parameter subjectively. Oftentimes in practice, the optimum smoothing parameter is selected as a starting point, and then "fine-tuned" until the desired smoothing is achieved.

Table 1. A(K) for Normal and Epanechnikov Kernels. (Adopted from [30], pg. 87.)

Kernel                 Number of Variables    A(K)
Multivariate Normal    2                      0.96
                       d                      {4/(2d + 1)}^{1/(d+4)}
Epanechnikov Kernel    2                      1.77
                       3                      2.78
                       d                      {8d(d + 2)(d + 4)(2√π)^d / ((2d + 1)c_d)}^{1/(d+4)}

*Note: c_d is the volume of the unit d-dimensional sphere.
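Computing the optimum window width is then a one-line affair. The sketch below evaluates (2.11) for the bivariate Epanechnikov kernel, A(K) = 1.77, at the sample sizes used later in the experiments.

def h_opt(n, d=2, A=1.77):
    # MISE-minimizing smoothing parameter for the chosen kernel constant A(K).
    return A * n ** (-1.0 / (d + 4))

for n in (50, 100, 200, 400):
    print(n, round(h_opt(n), 4))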

2.2.3. Selection of the Appropriate Kernel Function

One of the major advantages of using the KDE is that the resulting probability distribution inherits the properties of the kernel function chosen. For instance, if the normal kernel is chosen, the resulting density will be a "smooth" function possessing derivatives of all orders. In some instances, this may be a desired property; however, in other applications, it may not be a significant factor. Another factor in choosing the appropriate kernel function is the question of computational complexity. In years past, researchers were limited by the computational effort required to implement such techniques as the kernel estimate. However, with the power and efficiency of modern workstations and personal computers, implementation of the kernel estimator is trivial. Although it is seen that the normal kernel of the previous section requires more computational effort (due to evaluation of the exponential function at each call) than the Epanechnikov kernel, either could easily be implemented. It is concluded that, given the computational power of modern personal computers, the selection of the kernel function should not be based upon computational complexity of the function.

Silverman [30] and Scott [29] both point out that the quality of the resulting density estimate is primarily dictated by the choice of smoothing parameter rather than the kernel function itself. Since the appropriate smoothing parameter is obtained directly from (2.11), the choice of the kernel function is arbitrary.

Throughout the literature, the Epanechnikov kernel is given a great deal of attention. In the univariate case, Silverman [30] demonstrates that on the basis of the mean integrated squared error, the Epanechnikov kernel is the most efficient kernel function to use. For this reason, the Epanechnikov kernel is used to construct the joint probability density estimates in this research. Recall this kernel function is given by

K_e(u_i) = (2/π)(1 - u_i) if u_i < 1, and 0 otherwise

where u_i is obtained as in (2.10).

The Epanechnikov and normal kernels were implemented on a test problem consisting of bivariate data. Both X and Y were distributed uniformly on (0,1) and the sample correlation coefficient was found to be 0.9211 (R² = 0.85). The kernel estimates were compared to a histogram estimate by means of the C.D.F.'s for each. Figure 5 demonstrates that the kernel estimates are very similar, confirming that the choice of kernel function is really irrelevant.

Figure 5. Comparison of C.D.F.'s for Two Kernels and Histogram Estimates.

Thus, good density estimates can be obtained by using the kernel estimate in place of the more traditional histogram estimator. Particularly in the multivariate case, the kernel estimate has the advantage of using a fixed smoothing parameter in all dimensions and a means by which the variance in different coordinate directions may be accounted for. It will be seen later that although the procedure for determining the smoothing parameter is automated, the analyst is still required to determine at which points the density estimate needs to be evaluated. In the next chapter, the link between the density estimator and functions of random variables will be presented.

CHAPTER 3. FROM THE JOINT DENSITY TO FUNCTIONS OF R.V.'s

Distributions of functions of random variables are readily attainable once the joint probability density function (or a good estimate of it) is known. The means by which this may be achieved is the transformation of variables technique. The technique will be utilized in transforming the density estimates obtained in the previous chapter into density estimates for functions of correlated random variables. The discrete case will first be considered, followed by the continuous case.

3.1. The Transformation of Variables for Discrete Random Variables

Let X and Y be two discrete random variables and denote their joint probability density function by f. Recall that the objective is to find the distribution for the random variable Z = u(X, Y), where u is some function of X and Y. The transformation of variables technique transforms the joint p.d.f. of X and Y to the joint p.d.f. of X and Z. It should be noted that the joint p.d.f. of X and Y could also be transformed to the joint p.d.f. of Y and Z, but the former is chosen in this work arbitrarily. The procedure for obtaining the probability density function for the random variable Z = u(X, Y) is as follows:

i.) Using the function u, solve for y in terms of x and z, say y = w(x, z).
ii.) Now, g(x, z) = f(x, w(x, z)).
iii.) Sum g(x, z) over all x to obtain the p.d.f. for Z, call it h(z).

Stated more precisely, for the value z,

h(z) = Σ_x g(x, z)     (3.1)

Obviously, the function g will often take on the value 0, since Z assumes the value z only when z = u(x, y). The summation of (3.1) is equivalent to summing P(X = x, Y = y) for all (x, y) such that z = u(x, y). This probability is precisely f(x, y) since the random variables are discrete. Define S_z = {(x, y) | z = u(x, y)}. The probability that Z assumes the value z is given by

h(z) = P(Z = z) = Σ_{(x,y)∈S_z} f(x, y)     (3.2)

Using this approach, the distribution of the sum, difference, product, and quotient of two discrete random variables can easily be obtained given their joint probability density function. Furthermore, expressing the probability distribution in the form of

(3.2) provides a simpler means by which a software implementation can be achieved. Of course, if the random variable Y may assume the value 0, then the quotient Z = X/Y is not defined. Examples throughout this work assume nonzero random variables for the purpose of demonstrating the technique; a small code sketch of (3.2) is given below.
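The code sketch referred to above shows how directly (3.2) translates into software: the joint p.m.f. is swept once, and each probability f(x, y) is accumulated into the bucket for z = u(x, y). The joint p.m.f. shown is a made-up example, and u may be any of the four operations.

from collections import defaultdict

def transform(joint, u):
    # joint maps (x, y) -> probability; returns h(z) = P(Z = z) per (3.2).
    h = defaultdict(float)
    for (x, y), p in joint.items():
        h[u(x, y)] += p                      # accumulate f(x, y) over S_z
    return dict(h)

joint = {(1, 1): 0.25, (1, 2): 0.25, (2, 1): 0.25, (2, 2): 0.25}
print(transform(joint, lambda x, y: x + y))  # {2: 0.25, 3: 0.5, 4: 0.25}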

3.1.1. A Small Example

Consider an example problem to demonstrate the transformation of variables for two discrete random variables. The joint probability density function for two random variables, X and Y, is given in tabular form below. Suppose the probability distribution for the random variable Z = X + Y is sought.

It is first noted that the finite sample space for Z is given by S = {2,3,4,5,6,7,8,9,10}.

Thus, the probability distribution of Z is obtained by finding h(z) for all z ∈ S. By (3.2), the joint density may be summed over the set of all ordered pairs whose elements sum to the z under consideration. Stated more precisely,

h(z) = P(Z = z) = Σ_{(x,y)∈S_z} f(x, y)

where S_z = {(x, y) | z = x + y}. The calculations for a few of the values in the sample space are given below.

Checking to see that the results give a valid probability density function, it is found that Σ_{z∈S} h(z) = 1.

Figure 6 shows the distribution of the discrete random variable Z.

Figure 6. Probability Distribution for Z = X + Y, Discrete Transformation of Variables Example.

3.2. The Transformation of Variables for Continuous R.V.'s

Now suppose X and Y are two continuous random variables with joint probability density function f. The transformation of variables technique for continuous random variables allows one to obtain the joint p.d.f. of X and Z and then to integrate out X to arrive at the marginal density of Z. It should be noted again that the joint p.d.f. of X and Y could also be transformed to the joint p.d.f. of Y and Z, but the former is chosen in this work arbitrarily. The procedure for obtaining the probability density function for the random variable Z = u(X, Y) is as follows:

i.) Using the function u, solve for y in terms of x and z, say y = w(x, z).
ii.) Now, g(x, z) = f(x, w(x, z)) |∂w(x, z)/∂z|.
iii.) Integrate g(x, z) over all x to obtain the p.d.f. for Z, call it h(z).

From any standard text on mathematical statistics, such as [11], the joint p.d.f. for X and Z is easily obtained from the joint density f by

g(x, z) = f(x, w(x, z)) |∂w(x, z)/∂z|     (3.3)

Hence, the probability density function for the random variable Z, which is a function of the continuous random variables X and Y, is given by

h(z) = ∫_{-∞}^{∞} g(x, z) dx     (3.4)

In an ideal setting where the true joint density function for the correlated random variables is known, a closed-form, analytical solution for the p.d.f. of Z is attainable. In the two-variable case, the partial derivative of (3.3) is trivial for the purposes of this research, but may become more complicated as the function u becomes more complicated. It is seen that in the case of the product Z = XY or the quotient Z = X/Y, the function y = w(x, z) places a restriction that neither X nor Y may assume the value of zero. This will be made clear in the following section, which provides the specific means by which one would obtain the distribution of the sum, difference, product, and quotient of two continuous random variables, whether independent or dependent.

3.2.1. Derivations for Continuous R.V.'s Using the Transformation of Variables

Let X and Y be defined as above and let Z = u(X, Y) = X + Y. To find the probability distribution of Z, it is noted that

y = w(x, z) = z - x   and   ∂w/∂z = 1

This will always be true when solving for the simple sum of X and Y. Hence, to obtain the joint p.d.f. of X and Z, it follows from (3.3) that

g(x, z) = f(x, z - x) · |1| = f(x, z - x)

Thus, the p.d.f. of Z will be obtained by

h(z) = ∫_{-∞}^{∞} f(x, z - x) dx

In the case of addition, no restrictions are placed on the random variables, as the partial derivative is simply 1.

Now consider the difference of the two random variables, namely Z = X - Y. Proceeding in a manner similar to that for the sum,

y = w(x, z) = x - z   and   ∂w/∂z = -1

Thus, the joint p.d.f. of X and Z is given by

g(x, z) = f(x, x - z) · |-1| = f(x, x - z)

and the marginal of Z is obtained as

h(z) = ∫_{-∞}^{∞} f(x, x - z) dx

Again, it is seen that in the case of subtraction, no restrictions exist on X and Y.

For the product and quotient of the two random variables, one must be concerned about the case in which X = 0. Although it is true that P(X = x) = 0 for all x when X is continuous, this does not imply that it cannot occur.

Thus, consider the product of the two random variables, namely Z = XY. Solving for y as a function of x and z yields

y = w(x, z) = z/x   and   ∂w/∂z = 1/x,   x ≠ 0

Thus, the joint p.d.f. of X and Z is given by

g(x, z) = f(x, z/x) · |1/x|

and the marginal of Z is obtained as

h(z) = ∫_{-∞}^{∞} f(x, z/x) (1/|x|) dx

Finally, consider the quotient of the two random variables, Z = X/Y. Obtaining y as a function of x and z and proceeding as previously gives

y = w(x, z) = x/z   and   ∂w/∂z = -x/z²,   z ≠ 0

Thus, the joint p.d.f. of X and Z is given by

g(x, z) = f(x, x/z) · |x/z²|

It is noted here that Z may assume the value 0 only if X assumes the value zero. Hence, the restriction shall be placed on X, yielding the marginal of Z as

h(z) = ∫_{-∞}^{∞} f(x, x/z) |x/z²| dx,   x ≠ 0

3.2.2. A Small Example

Consider now an example to demonstrate use of the transformation of variables technique for two continuous random variables. Suppose X and Y are two continuous random variables with joint probability density function given by

f(x, y) = 24xy,   0 ≤ x ≤ 1, 0 ≤ y ≤ 1, x + y ≤ 1

and the probability density function of Z = X + Y is sought. Begin by solving for the joint probability density function of X and Z; thus

g(x, z) = f(x, z - x) = 24x(z - x),   0 < x < z ≤ 1

Now, the joint density of X and Z may be integrated with respect to x, leaving the marginal distribution of the random variable Z:

h(z) = ∫_0^z 24x(z - x) dx = 4z³

Finally, the probability density function for Z = X + Y is given by

h(z) = 4z³,   0 ≤ z ≤ 1, and 0 elsewhere

Figure 7 gives the probability distribution for the continuous random variable, Z = X + Y.

Figure 7. Probability Distribution for Z = X + Y, Continuous Transformation of Variables Example.
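As a quick numerical check of the worked example (not part of the original derivation), the marginal h(z) = ∫ 24x(z - x) dx over 0 < x < z can be integrated by the trapezoid rule and compared against the closed form 4z³.

import numpy as np

for z in (0.25, 0.5, 0.75, 1.0):
    xs = np.linspace(0.0, z, 10001)
    vals = 24.0 * xs * (z - xs)                   # integrand g(x, z)
    dx = xs[1] - xs[0]
    h = (vals[:-1] + vals[1:]).sum() * dx / 2.0   # trapezoid rule
    print(z, round(h, 6), round(4.0 * z**3, 6))   # the two columns agree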

It is seen that for the functions of random variables being considered here, the transformation of variables is somewhat practical since the partial derivatives are easily attainable. In the next chapter, it will be shown that most of the tedium involved in obtaining the distribution for functions of two correlated, continuous random variables may be circumvented by discretizing the region on which X and Y are defined. Thus, this research will utilize the equations of Section 3.1 and the kernel estimate of the previous chapter to obtain overall estimates for the sum, difference, product, and quotient of two correlated random variables.

CHAPTER 4. IMPLEMENTATION OF THE APPROACH

The two previous chapters presented the major components necessary for the construction of the distribution of functions of random variables. First, it is necessary to know or estimate the joint probability distribution of the two random variables. The transformation of variables approach may then be applied, whether the random variables are correlated or uncorrelated, to obtain the desired results. In this chapter, the exact procedure for integration of the kernel estimate and the transformation of variables is presented.

4.1. Discretization of the Problem

If the random variables under consideration are discrete, the joint probability distribution is easily obtained by using the relative frequency definition of probability and applying the discrete transformation of variables directly. For this reason, the focus of this research is on continuous random variables on some region of the xy-plane.

Although the transformation of variables technique for continuous random variables was presented in the previous chapter, the problem will be discretized so that the discrete transformation may be applied. This simply involves choosing a large (but finite) number of points on the region at which to evaluate the density estimate. Using the discrete form enables the analyst to avoid the problems of division by zero in the case of the product of two random variables previously mentioned, while providing a convenient means for software implementation. Figure 8 depicts 100 observations of two correlated random variables on the xy-plane. It is necessary to discretize this region for implementation of the procedure.

Figure 8. Sample Scatterplot of Correlated Data (X ~ U(0,1), n = 100).

Details on construction of the density estimate will be given; however, some useful notation and definitions are first introduced in the following section. 4.2. Some Notation and Definitions

The technique begins with the input of data in one of two forms: either 1) real-world observations, or 2) sampled observations from known theoretical distributions. In either case, the data consists of an equal number of observations of the two variables.

Denote the number of observations by n.

Let R = {(x_i, y_i) | i = 1, 2, ..., n} denote the set of n observations and denote the i-th observation by X_i = (x_i, y_i). It should be noted that R is contained in some region of the xy-plane. Recall that in order to construct a joint density estimate for the random variables X and Y, each observation X_i = (x_i, y_i) must be compared with a set of fixed points which comprise a grid. Let m be a positive integer which equals the total number of fixed points at which the density estimate will be constructed. Define

G = {(x'_j, y'_j) | j = 1, 2, ..., m}

as the set of all fixed points on this grid. Note, G will always be finite since the region upon which X and Y are defined is discretized, but will contain a large number of points.

Finally, recall the function u, which describes the way in which X and Y are to be combined. For example, Z = u(X, Y) = X + Y describes the sum of the two random variables. The sample space for the resulting random variable Z shall be denoted by

Q = {r | r = u(x'_j, y'_j), j = 1, 2, ..., m}

The set Q will contain at most m elements. Thus, the function u maps points in G into Q, and u is not necessarily a one-to-one function.

4.3. Construction of the Joint Density Estimate

4.3.1. Selection of the Grid, G

In order for the kernel density estimate to be constructed, many points in the region containing the x-y pairs must be evaluated by the equation given in (2.9). By inspection of Figure 8, for example, it is seen that many points on the region will yield a density of 0; thus, it is desirable to construct as good an estimate as possible while eliminating evaluation of unnecessary points. It is recommended that the user simply identify the minimum and maximum values of the observations in both directions and divide the respective ranges into equally-spaced cells, with the minimum observation serving as the origin. Let x_min, x_max, y_min, y_max denote the minimum and maximum observations for X and Y, respectively. Define the ranges of values for X and Y to be

R_X = x_max - x_min
R_Y = y_max - y_min

respectively. The number of cells chosen in this work was 50 in each coordinate direction, which yields 2401 points for evaluation of the density estimate. This value was chosen because it provides a large number of points of evaluation (2401) while avoiding the memory complications a PC implementation may encounter with a "tighter" grid.

Hence, the step sizes of the grid in the respective coordinate directions are given by

h_x = R_X / 50
h_y = R_Y / 50

Figure 9 gives a graphical depiction of the way that the region is divided into equally-spaced cells; a small sketch of the construction follows the figure.

Figure 9. Discretization of the Region Containing R for Construction of G.
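The sketch below constructs such a grid, reading the text as placing 50 equal cells between the extreme observations in each direction and evaluating at the 49 interior cell boundaries, which gives the 49 × 49 = 2401 fixed points cited above; that reading of the grid placement is an assumption.

import numpy as np

def build_grid(xobs, yobs, cells=50):
    hx = (xobs.max() - xobs.min()) / cells        # step size in x
    hy = (yobs.max() - yobs.min()) / cells        # step size in y
    gx = xobs.min() + hx * np.arange(1, cells)    # 49 interior x-coordinates
    gy = yobs.min() + hy * np.arange(1, cells)    # 49 interior y-coordinates
    X, Y = np.meshgrid(gx, gy)
    return np.column_stack([X.ravel(), Y.ravel()])

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, 100)
G = build_grid(x, x + rng.normal(0.0, 0.1, 100))
print(G.shape)                                    # (2401, 2)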

Thus, construction of the KDE, f̂, may now be achieved by evaluating (2.9) at each of the m fixed points in G.

4.4. Implementation of the Transformation of Variables

In the previous section, it was seen that the continuous region containing R may be discretized in order to determine the fixed points at which a kernel density estimate is constructed. The set G of all fixed points is a discrete set, allowing for use of the discrete transformation of variables technique.

Recalling equation (3.2) from Chapter 3, and replacing the true joint density, f, with the kernel estimator, f̂, which has been evaluated for all (x', y') ∈ G, we obtain the probability distribution for Z = u(X, Y) by

ĥ(z) = Σ_{(x',y')∈S_z} f̂(x', y')     (4.3)

where S_z = {(x', y') | z = u(x', y')}. In order to facilitate a convenient means for comparing the estimated C.D.F. of Z with simulations of the true C.D.F., the kernel density estimate values are converted into relative frequencies before (4.3) is applied. Relative frequencies are obtained by multiplying each functional estimate, f̂(x', y'), by the area of a grid cell, h_x h_y, so that the resulting values behave as cell probabilities.

CHAPTER 5. EMPIRICAL RESULTS AND VALIDATION

A full-factorial, fixed effects experiment was designed to evaluate the numerical technique presented in the previous chapters. The objective of the empirical test was to evaluate the performance of the methodology under a number of different conditions.

Specifically, the factors chosen for evaluation were degree of correlation, type of input distribution, and number of observations taken by the analyst. In addition, the operations of addition, subtraction, multiplication, and division of the two correlated random variables were also considered as a factor.

5.1. Experimental Method

The experiment consisted of 144 test problems in which the levels of the aforementioned factors were varied. The resulting cumulative distribution functions of the estimation procedure were compared with the cumulative distribution functions of

Monte Carlo simulations of the sum, difference, product, and quotient of two correlated random variables. The Monte Carlo simulations consisted of populations of 10,000 data points, providing an adequate approximation of the true population. The primary measure of performance was the maximum absolute error (MAE) between the cumulative distribution functions of the estimator and the simulated distribution; a small sketch of this measure is given below. The Cramer-von Mises test [5] was employed to test the hypothesis that the two distribution functions were equal, while an analysis of variance (ANOVA) was used to determine which factors significantly affected the quality of estimation.

5.1.1. Experimental Design

A full-factorial, fixed effects design was used in this experiment. The dependent variable chosen was the maximum absolute error (MAE) in vertical distance between the cumulative distribution function of the estimator and the true (simulated) distribution.

This measure was chosen since it was desired to estimate the worst performance expected by the estimator. Furthermore, the mean squared error (MSE) was calculated to give some measure of the global discrepancy between the two cumulative distribution functions.

The four independent variables were the degree of correlation, the number of observations or sample size, the type of input distribution for X, and the type of arithmetic operation being performed on the two correlated random variables. Table 2 below gives a summary of each of the factors considered and their respective levels.

Table 2. Factors and Their Respective Levels for the Designed Experiment

Factor                    Levels
Correlation (R²)          0.90, 0.60, 0.20
X Distribution            Uniform, Normal, Triangular
Number of Observations    50, 100, 200, 400
Operation on X and Y      Sum, Difference, Product, Quotient

The grid chosen to cover the region R could also have been considered an independent variable, but it remained fixed throughout testing in order to facilitate a software implementation of the procedure. Since the kernel estimator is based on the "distance" between the fixed point of consideration and each of the observations, altering the grid size could possibly improve the density estimate.

However, all results obtained herein are based upon a fixed grid.

The two random variables were both defined on the interval (0,1), and a positive correlation (high, medium, or low) was induced in each case. The random variables were defined on a positive, nonzero interval in order to demonstrate the performance of the technique on all four operations: addition, subtraction, multiplication, and division.

The grid was fixed on [0.02,0.98] in both coordinate directions in advance. The step size for the grid in both directions was 0.02. Thus, the density estimate was constructed with the 2401 points constituting the grid. Figure 10 gives a sample problem demonstrating

100 observations and the grid over the region. It should be noted that only the points on the interval [0.02, 0.98] were considered in both coordinate directions.

Figure 10. Grid Constructed on Correlated Data From Figure 8 ([0.02, 0.98], X ~ U(0,1), n = 100 observations, R² = 0.90).

5.1.2. Experimental Procedure

When two random variables are correlated to some degree, a simple scatterplot of the data will indicate a linear relationship (either positive or negative). Consider the previous data in Figure 10, which are highly positively correlated (R² = 0.90). The plot indicates a linear relationship which is positively sloped. Thus, it is evident that one may generate correlated data sets by first creating a set of X-values, and then generating the Y-values as a linear function of the X-values. Such Y-values may be generated as

\[ y_i = a x_i + \varepsilon_i \tag{5.1} \]

where a is the slope of some line and ε is a random error term. In this work, the slope of the line was fixed at 1, and the random error term was assumed to be normally distributed with a mean of zero and a variance which will be discussed later.

It was desired to generate Y-values as a linear function of the X-values with a given correlation, R². Since the random error term is assumed to be normally distributed with a mean of zero, it was only necessary to determine the variance of the error term in order to obtain a desired correlation. A technique used by Bakshi [2] was employed to determine the appropriate value for this variance. The technique is as follows:

1. Generate X-values according to the desired distribution.

2. Calculate the sample variance, s², of the X-values generated.

3. Based upon s², calculate the variance for the error term to obtain the desired

correlation.

4. Generate Y-values by equation (5.1).

Let R² denote the desired level of correlation between the random variables X and Y, and let k² denote the variance of the error term, so that ε ~ N(0, k²). Then this variance can be obtained by solving the following equation for k²:

\[ R^2 = \frac{s^2}{s^2 + k^2} \tag{5.2} \]

Simplifying (5.2) yields

\[ k^2 = \frac{s^2 (1 - R^2)}{R^2} \tag{5.3} \]

The reader should consult [2] for details on the validity of (5.3). Using this relationship, it was possible to generate data sets in which X follows three different distributions

(uniform, normal, and triangular) and the Y's were generated as a linear function of X with a given level of correlation per (5.1). As previously mentioned, all variables were generated on the interval (0,1) and with positive correlation for consistency.
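A minimal C sketch of this generation scheme follows, assuming the relationship (5.3) reconstructed above. The uniform generator shown is a placeholder for a library routine such as the shuffled generator of [24], and the Box-Muller transform is an assumption for sampling the normal error term, since the thesis does not state how the error deviates were drawn.

    #include <math.h>
    #include <stdlib.h>

    /* Placeholder uniform deviate on (0,1); in practice a generator
       such as ran1 from Numerical Recipes [24] would be used here. */
    static double unif01(void)
    {
        return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    }

    /* Standard normal deviate via the Box-Muller transform (an
       assumption; any N(0,1) sampler would serve). */
    static double std_normal(void)
    {
        const double PI = 3.14159265358979323846;
        double u1 = unif01(), u2 = unif01();
        return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
    }

    /* Generate y_i = x_i + e_i with e_i ~ N(0, k^2), where k^2 is
       computed from the sample variance s^2 of x to achieve the
       target R^2 per (5.3); the slope is fixed at 1 per (5.1). */
    void make_correlated_y(const double *x, double *y, int n, double r2)
    {
        double mean = 0.0, s2 = 0.0;
        for (int i = 0; i < n; i++) mean += x[i];
        mean /= n;
        for (int i = 0; i < n; i++) s2 += (x[i] - mean) * (x[i] - mean);
        s2 /= (n - 1);

        double k = sqrt(s2 * (1.0 - r2) / r2);   /* equation (5.3) */
        for (int i = 0; i < n; i++)
            y[i] = x[i] + k * std_normal();
    }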

Case 1: X ~ U(0,1)

In the first case, it was desired to create (continuous) X-values which were uniformly distributed on the interval (0,1). Three populations of 10,000 X-values were generated, and the corresponding Y-values calculated for each of three levels of correlation: 0.90, 0.60, and 0.20. From these populations of 10,000, random samples of

50, 100, 200, and 400 were drawn. Thus, for X ~ U(0,1), there exist 12 data sets, four corresponding to each level of correlation. The X-values were generated using a linear congruential random number generator with internal shuffling, which produces longer sequences without repetition. Details on the random number generator may be found in

[24].

Case 2: X ~ N(0.50, 0.0049)

In the second case, the X-values were drawn from a normal population. Because it was desired to have X and Y values on the interval (0,1), the normal population was generated with a mean of 0.50 and a very small variance of 0.0049. This was achieved by first generating 10,000 random numbers which are U(0,1) and then using the

NORMINV function provided in Microsoft Excel version 7.0. The NORMINV function takes the uniform random number on (0,1), a desired mean, and standard deviation as its arguments and returns the inverse of the normal cumulative distribution for the specified mean and standard deviation. Three populations of 10,000 X-values were generated and the corresponding Y-values calculated for each of three levels of correlation: 0.90, 0.60, and 0.20. From these populations of 10,000, random samples of

50, 100, 200, and 400 were drawn. Thus, for X ~ N(0.5, 0.0049), there exist 12 data sets, four corresponding to each level of correlation.
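Outside of a spreadsheet, the same population can be produced directly in C. The sketch below substitutes the Box-Muller sampler from the previous fragment for NORMINV (an assumption; both yield N(0.5, 0.0049) variates, since 0.07² = 0.0049), and the rejection of draws outside (0,1) is an added safeguard not described in the text.

    /* Fill x with n draws from N(0.5, 0.0049), i.e., sigma = 0.07.
       Draws falling outside (0,1) are redrawn so the sample stays on
       the unit interval; at sigma = 0.07 such draws are very rare. */
    void make_normal_x(double *x, int n)
    {
        const double mu = 0.50, sigma = 0.07;
        for (int i = 0; i < n; i++) {
            double v;
            do {
                v = mu + sigma * std_normal();   /* std_normal as above */
            } while (v <= 0.0 || v >= 1.0);
            x[i] = v;
        }
    }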

Case 3: X ~ Tr(0.0, 0.5, 1.0)

In the third case, the X-values were drawn from a triangular population with minimum value 0.0, mode 0.50, and maximum value 1.0. The technique for generating a triangular distribution is given in [17] and is outlined here. A right-triangular distribution on an interval (a, b), denoted RT(a, b), has probability density function given by

\[ f(x) = \begin{cases} \dfrac{2(x - a)}{(b - a)^2}, & a \le x \le b \\ 0, & \text{otherwise} \end{cases} \tag{5.4} \]

and a left-triangular distribution on (a, b), denoted LT(a, b), has probability density function given by

\[ f(x) = \begin{cases} \dfrac{2(b - x)}{(b - a)^2}, & a \le x \le b \\ 0, & \text{otherwise} \end{cases} \tag{5.5} \]

It can be shown that if X ~ RT(0,1) or X ~ LT(0,1), then the new random variable X' given by

\[ X' = c + (d - c)X \tag{5.6} \]

is RT(c, d) or LT(c, d), respectively. Thus, by simply generating two samples, one which was RT(0,1) and one which was LT(0,1), two new random variables were obtained by

(5.6). The first generated was RT(0, 0.5) and the second generated was LT(0.50, 1). By combining these two new populations, the overall population of X-values obtained is in fact triangular with minimum value 0.0, mode 0.5, and maximum value 1.0. The two distributions RT(0,1) and LT(0,1) were generated from (5.4) and (5.5) above using the inverse transform technique. Three populations of 10,000 X-values were generated and

the corresponding Y-values calculated for each of three levels of correlation: 0.90, 0.60,

and 0.20. From these populations of 10,000, random samples of 50, 100, 200, and 400 were drawn. Thus, for X ~ Tr(0.0, 0.5, 1.0), there exist 12 data sets, four corresponding to each level of correlation.
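A C sketch of this construction is given below, reusing the unif01 placeholder from the earlier fragment. The inverse transforms follow from (5.4) and (5.5): for RT(0,1), F(x) = x², so X = √U; for LT(0,1), F(x) = 1 − (1 − x)², so X = 1 − √(1 − U). Drawing from the two halves with equal probability is an assumption equivalent to combining two equally sized populations.

    #include <math.h>

    /* RT(0,1) via inverse transform: F(x) = x^2, so x = sqrt(u). */
    static double rt01(void) { return sqrt(unif01()); }

    /* LT(0,1) via inverse transform: F(x) = 1 - (1 - x)^2. */
    static double lt01(void) { return 1.0 - sqrt(1.0 - unif01()); }

    /* Draw from Tr(0.0, 0.5, 1.0) by mixing RT(0, 0.5) and LT(0.5, 1)
       with equal probability, using the rescaling X' = c + (d - c)X
       of equation (5.6). */
    double triangular(void)
    {
        if (unif01() < 0.5)
            return 0.0 + 0.5 * rt01();   /* RT(0, 0.5) */
        else
            return 0.5 + 0.5 * lt01();   /* LT(0.5, 1) */
    }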

For each data set generated, the performance of the technique was measured by

running a small software implementation of the procedure. A detailed description of the

software implementation and its use is given in Chapter 6. The user creates a data set of

given sample size which is in the form of two column vectors. The first line of such a

data file must contain the sample size itself (n) in order for the implementation to

function properly. Figure 11 gives the appropriate format for a data file used as input to the program for evaluation. As indicated by the first line of the data file, 50

observations were taken for this data set.

Figure 11. Sample Data File Used As Input For Evaluation Of Estimator

The user types, in full, the name of the data file and then selects the type of operation to

perform on the two random variables from a menu. The resulting output, containing the finite set Q of possible z-values and their corresponding probabilities and cumulative probabilities, is written to an output file entitled output.out. Figure 12 is a sample output taken from the program showing the results of the estimator.

Output for file: tdat9a.dat
Number of Observations: 50

Figure 12. Sample Output File output.out From Evaluation Program

The cumulative distribution functional values were then compared with the cumulative distribution functional values of the simulated population at each z ∈ Q.

The absolute error was calculated for each point, and the maximum of this set was the major performance measure of concern. Let S_1 be the empirical distribution function for the simulated population and S_2 be the empirical distribution function for the estimated population. The performance measure of maximum absolute error (MAE) is given by

\[ \mathrm{MAE} = \max_{z \in Q} \left| S_1(z) - S_2(z) \right| \tag{5.7} \]

while the performance measure of mean squared error (MSE) is given by

\[ \mathrm{MSE} = \frac{1}{n_z} \sum_{z \in Q} \left\{ S_1(z) - S_2(z) \right\}^2 \tag{5.8} \]

where n_z is the number of points in the set Q.
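Both measures are straightforward to compute once the two C.D.F.'s have been evaluated at the points of Q; the following C fragment is a sketch (the function name cdf_errors is illustrative):

    #include <math.h>

    /* Given the two empirical C.D.F.'s evaluated at the nz points of Q,
       compute the maximum absolute error (5.7) and the mean squared
       error (5.8). */
    void cdf_errors(const double *S1, const double *S2, int nz,
                    double *mae, double *mse)
    {
        double max = 0.0, sum = 0.0;
        for (int i = 0; i < nz; i++) {
            double d = fabs(S1[i] - S2[i]);
            if (d > max) max = d;
            sum += d * d;
        }
        *mae = max;
        *mse = sum / nz;
    }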

5.2. Evaluation of Empirical Results

5.2.1. Tests of Goodness-of-Fit

Goodness-of-fit tests were necessary in this research to determine how "well" the estimated probability distributions fit the simulated (true) distribution. The standard technique for performing this type of analysis is to utilize goodness-of-fit tests such as the Kolmogorov-Smirnov test or the Cramer-von Mises test. Both are two-sample, nonparametric tests for the goodness-of-fit of distributions; however, there do exist some subtle differences. The Kolmogorov-Smirnov two-sample test (Smirnov test) is used to determine if two independent samples have been drawn from the same population [29]. When the two-tailed test is employed, it is sensitive to any differences between the distributions and the populations from which they originated. In a sense, if it is found that two samples come from the same population, then they must have the same distribution.

The Cramer-von Mises two-sample test directly determines if the two distributions are the same, and was therefore used to test the goodness-of-fit of the distribution functions for the 144 test cases. Let F denote the true distribution function for the simulated random variable Z = u(X, Y) and let G denote the true distribution function for the estimator. The hypothesis is that the two distribution functions are the same for all z, and the alternative is that they are not the same for at least one z.

Formally stated, the following hypothesis was tested for each case:

H_0: F(z) = G(z) for all z
H_1: F(z) ≠ G(z) for at least one z

The data consist of two independent random samples with unknown distribution functions F and G. Denote these two random samples by Z_11, Z_12, ..., Z_1m and Z_21, Z_22, ..., Z_2n, and let S_1(z) and S_2(z) denote the empirical distribution functions for the two samples. Then the two-tailed test statistic for Cramer-von Mises [5] is given by

\[ T = \frac{mn}{(m+n)^2} \left[ \sum_{i=1}^{m} \left\{ S_1(Z_{1i}) - S_2(Z_{1i}) \right\}^2 + \sum_{j=1}^{n} \left\{ S_1(Z_{2j}) - S_2(Z_{2j}) \right\}^2 \right] \tag{5.9} \]

Since both distribution functions are evaluated at each z ∈ Q (the finite set of all possible z-values), and the number of z-values is fixed at n_z, it is clear that m = n = n_z and, furthermore, Z_1i = Z_2i, i = 1, 2, ..., n_z. Thus (5.9) simplifies with some algebra to

\[ T_2 = \frac{1}{2} \sum_{i=1}^{n_z} \left\{ S_1(z_i) - S_2(z_i) \right\}^2 \tag{5.10} \]

The decision rule is to reject the null hypothesis at the approximate level α if T_2 exceeds the 1 − α quantile, w_{1−α}. Throughout the analysis, the null hypothesis was tested at the 0.05 level; the critical value for this level is w_{0.95} = 0.461. For the exact distribution of the Cramer-von Mises test statistic, the reader is referred to Conover [5].
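Under the simplification (5.10) reconstructed above, the test reduces to a few lines of C; this sketch (the name cvm_reject is illustrative) returns whether H_0 is rejected at the 0.05 level:

    /* Two-sample Cramer-von Mises statistic in the simplified form
       (5.10), T2 = (1/2) * sum_i {S1(z_i) - S2(z_i)}^2, valid here
       because both C.D.F.'s are evaluated at the same nz points of Q.
       Returns 1 if T2 exceeds the 0.95 quantile w = 0.461. */
    int cvm_reject(const double *S1, const double *S2, int nz)
    {
        double t2 = 0.0;
        for (int i = 0; i < nz; i++) {
            double d = S1[i] - S2[i];
            t2 += d * d;
        }
        t2 *= 0.5;
        return t2 > 0.461;
    }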

Of the 144 cases, 11 were found to be significant at the 0.05 level. These cases occurred only for the operations of multiplication and division of the random variables. Results of all cases can be found in Appendix A.1, and plots of the estimated versus actual C.D.F.'s are given in Appendix A.2. Figure 13 demonstrates the

accuracy of the approach on a sample problem.

Figure 13. Comparison Between Estimator C.D.F. and Simulated C.D.F. for Product When X is Distributed Normally and R² = 0.90.

5.2.2. The Analysis of Variance

An analysis of variance (ANOVA) was performed to determine which of the four factors (A = degree of correlation, B = number of observations, C = distribution type, and D = operation) contributed significantly to the error in estimation. The designed experiment was a full-factorial, fixed effects model with a single replicate. Due to the fact that only a single replication was performed, no degrees of freedom remain in the

ANOVA model for an estimate of the error variance. One procedure for dealing with this problem is to pool the higher-order interactions [22]. This may be done if the calculated mean squares of the higher-order interactions are small relative to the main and second-order interactions. By running a full factorial experiment, the mean squares for each factor and all interactions were determined and plotted on the Pareto chart given in Figure 14.

Figure 14. Pareto Chart of Mean Squares for Each Effect in ANOVA Model (A = Correlation, B = No. of Observations, C = Distribution Type, D = Operation).

It is noted that the fourth-order interaction may be considered as negligible when compared with the first- and second-order interactions. The resulting degrees of freedom for estimation of the error variance is 36. One potential problem in pooling the higher-order interactions is that this practice may lead to an underestimation of the mean squares of the residuals. Thus, the probability of a Type I error may increase when testing the main and second-order effects. In this experiment, only the fourth-order interaction was used for estimation of the error variance.

The ANOVA model given in Table 3 indicates that three of the four main effects

(degree of correlation, number of observations, and distribution type) are highly significant (P-value ≤ 0.0001) at the 0.05 level. All second-order interactions were found to be significant with the exception of the interaction between number of observations and type of operation. Two of the four third-order interactions were found to be significant: the interaction between correlation, observations, and distribution type, and the interaction between correlation, distribution type, and operation.

Table 3. ANOVA Table, F-Values Given at the 0.05 Level.

Source    df    SS      MS        F        F-critical   P-Value   % Variation
A          2    0.017   0.00850   25.500   3.266        0.0001    10.897
B          3    0.012   0.00400   12.000   2.872        0.0001     7.692
C          2    0.017   0.00850   25.500   3.266        0.0001    10.897
D          3    0.006   0.00200    6.000   2.872        0.0022     3.846
AB         6    0.007   0.00117    3.500   2.372        0.0073     4.487

Total    143    0.156

Each of the four main effects will now be discussed in greater detail.

Factor A: The Effect of Correlation

As previously indicated, the factor of correlation was found to be highly significant in the ANOVA model. Thus, the degree of correlation has a statistically significant effect on the maximum absolute error of estimation of the true "population"

C.D.F. Figure 15 shows the effect of increasing the level of correlation on the average maximum absolute error.

Figure 15. The Effect of Correlation on Maximum Absolute Error.

It is evident from Figure 15 that as the correlation increases, the maximum absolute error also increases. A Scheffe post-hoc test (Appendix A.3) indicates a statistically significant difference between the high correlation of 0.90 and the mid and low correlation levels of 0.60 and 0.20, respectively, at the 0.05 level.

Some significant interaction effects between correlation and several other factors appear to be present as well. Specifically, there appears to be a significant interaction between the correlation and the distribution. Figure 16 plots the average maximum absolute error versus the distribution type. It is evident that overall, the estimator performed worst on the normal cases and best on the triangular cases. It is further noted that there is little difference between the performance on the triangular and uniform cases.


Figure 16. The Effect of Distribution Type on Maximum Absolute Error.

This trend is apparent when the interaction between correlation and distribution is considered. Figure 17 gives the interaction plot for the correlation and distribution type interaction.

Figure 17. Interaction Plot for Correlation and Distribution Type.

Clearly, the estimator performs most poorly on the normal distribution at higher correlations.

The interaction between correlation and operation was also found to be significant at the 0.05 level. Specifically, at the high correlation level, the operation type tends to have a significant effect on the estimate, while at the medium and low correlation levels, the estimate is fairly invariant. Figure 18 demonstrates these observations.

Figure 18. Interaction Plot for Correlation and Operation.

The effect of operation will be considered fully in the following sections; however, it should be noted here that the trend seen in Figure 18 for the high correlation level is typical of the overall effect of the operation. That is, the estimator demonstrates more accuracy on the sum and product estimates versus the estimates of the difference and quotient.

The interaction between correlation and sample size also contributes significantly to the error in estimation. When the correlation is fixed at the medium level

(R² = 0.60), the error is nearly invariant across all sample sizes. For the other two levels of correlation, the error slightly decreases as n increases. Furthermore, an interesting observation is made. The error actually increases when n is increased from

50 to 100 observations. Figure 19 demonstrates this odd phenomenon which is apparent for the sample size in general.


Figure 19. Interaction Plot for Correlation and Number of Observations (n).

In general, taking into account all interactions, the performance of the estimator, based on maximum absolute error, worsens as the degree of correlation is increased. However, as indicated by the goodness-of-fit tests, the estimation remains acceptable in nearly all cases.

Factor B: The Effect of Sample Size

The previous figure demonstrated the effect of the correlation and sample size interaction. Here, the effect of sample size will be considered independently initially and then its interaction with the distribution type and operation will be considered. First consider the plot of average maximum absolute error versus sample size in Figure 20.


Figure 20. The Effect of Number of Observations (n) on Maximum Absolute Error.

The estimator exhibits a gradual decrease in average maximum absolute error as the number of observations increases. It was hypothesized that the estimation error would decrease monotonically with the sample size; however, Figure 20 shows a slight increase in the error at 100 observations. This jump may occur due to an interaction between the grid size, which was fixed, and the number of observations taken.

The estimator performed best with 400 observations when X followed a triangular distribution. For both the uniform and triangular distributions, the error tended to decrease as sample size increased; however, this was not true for the normal distribution, as indicated in Figure 21.


Figure 21. Interaction Plot for Number of Observations (n) and Distribution Type.

When considered against all operations, the effect of sample size on maximum absolute error is fairly unchanged at 50 and 100 observations, while the MAE exhibits sporadic behavior across operations at 200 and 400 observations. The behavior is typical of the trend seen for the operations in general. More specifically, the estimator consistently performs better on the sums and products of the random variables than on the differences and quotients. Consider Figure 22, which shows the behavior of the estimator under the different operations with varying sample sizes.


Figure 22. Interaction Plot for Number of Observations (n) and Operation.

A Scheffe post-hoc test (Appendix A.3) shows a significant difference in the

MAE between 50 and 400 observations, 100 and 400 observations, 50 and 200 observations, and finally, 100 and 200 observations.

Factor C: The Effect of Distribution Type

Figure 16 showed the overall effect of distribution on MAE. It is clear that the estimator performs best on the triangular distribution and worst on the normal distribution. It was seen that this trend of poor performance on the normal distribution was consistent when its interaction with the sample size and correlation were considered.

Figure 23 demonstrates that when operations are considered, the estimator still performs worst on the normal distribution.


Figure 23. Interaction Plot for Distribution Type and Operation. A Scheffe post-hoc test indicates a statistically significant difference between the performance of the estimator on the normal distribution and the uniform and triangular distributions, at the 0.05 level.

Factor D: The Effect of Operation

The factor of operation type (sum, difference, product, or quotient) was significant in the experiment and was also shown to be significant when interacting with some of the other main effects such as correlation and distribution type. Figure 24 plots the average MAE against the main effect of operation type.


Figure 24. The Effect of Operation on Maximum Absolute Error. The estimator exhibits slightly better performance on the sum and product of the two

random variables versus the difference and quotient. Scheffe post-hoc tests indicate a

statistically significant difference between the sum and quotient and between the sum and difference at the 0.05 level.

Intuitively, it seems reasonable that the degree of correlation and sample size would be significant factors. However, in terms of the distribution type, it was not expected that the estimator would perform worst on the normal distribution. The relatively poor performance on the normal cases may be attributed to two potential problems. First, the generation of normal variates on the interval (0,l) was achieved by setting the mean of the distribution to 0.50 and shrinking the variance such that all samples fell in the chosen interval. This small variance term implies a small variance for the normal error term used to generate the Y observations. Thus, the resulting scatterplot of observations appears very clustered. Secondly, it is noted that one slight drawback in using the KDE is its problems with long-tailed distributions. The estimator tends to cause noise in the tails since the smoothing parameter is fixed throughout the estimate.

If the smoothing parameter is adjusted so that this noise is "smoothed out" in the tails, then it is possible that the details in the main portion of the distribution will be hidden.

5.2.3. The Estimator Versus Enumeration

In addition to the testing mentioned in the previous section, a small test was performed to investigate the amount of improvement gained by increasing the number of observations from 50 to 400. The problem chosen for evaluation was the sum Z = X + Y when X is normally distributed with a high correlation (R² = 0.90). From the generated population of 10,000 pairs, 100 samples of 50 pairs each were drawn for the first case, and 100 samples of 400 pairs were drawn for the second case. For each of the 100 samples drawn in both cases, the maximum absolute error was found for the estimation method and for simple enumeration. The cumulative distributions of these errors were then plotted against one another. Figure 25 shows the results for the case when n = 50.

Figure 25. Comparison of MAE C.D.F.'s for Estimator and Enumeration (n = 50, X is normally distributed, R² = 0.90, Sum).

Thus, for 100 random samples (of 50 pairs each) drawn from the original population of

10,000, it is seen that the estimation method clearly outperforms simple enumeration.

Figure 26 shows the resulting C.D.F.'s in the case when n = 400.

Figure 26. Comparison of MAE C.D.F.'s for Estimator and Enumeration (n = 400, X is normally distributed, R² = 0.90, Sum).

Obviously, when the number of observations is increased from 50 to 400, the distribution of maximum absolute error shifts to the left for both methods. Note that the "gap" between the two distributions vanishes as n increases from 50 to 400. Thus, a simple enumeration of 400 observations yields results at least as good as the results obtained by using the estimation technique.

CHAPTER 6. SOFTWARE IMPLEMENTATION

Implementation of the estimator presented in the previous chapters was achieved by a small software package developed for the purpose of testing the procedure. The program will also be used by researchers in the Human Factors and Ergonomics Laboratory at

Ohio University to obtain good estimates for the probability distribution of functions of correlated random variables. The name of the software program is CORREST. This chapter shall give an overview of the requirements, use, and limitations of CORREST.

6.1. Requirements of the Software Implementation

CORREST is a Microsoft DOS-based C program which requires the user to supply the real-world observations or samples from theoretical distributions as inputs to the program. In return, the program gives a finite set of z-values, their probabilities, and their cumulative probabilities in tabular form in an output file called output.out. The program contains 388 lines of source code. Although CORREST is not Windows-based, the program requires very little effort on the part of the user, with the exception of supplying the input data. To effectively utilize the program, the following hardware, software, and data input requirements should be met:

1. Hardware: An IBM-compatible computer with at least a 486 DX processor and a

minimum of 8 Mb of RAM.

2. Software: MS-DOS 5.0 or higher is recommended.

3. May be run through Windows, but this may cause some memory problems. It is

recommended that the program be run at the MS-DOS prompt by typing the

executable filename.

4. Must supply the program with an input file which contains the number of

observations and the actual data points.

6.2. Using the Software Package

Although CORREST is not Windows-based, the effort required by the user is minimal. Knowledge of the kernel density estimator and the transformation of discrete variables is not necessary to obtain the desired density estimates. Once the input data has been provided for CORREST, the analyst must simply enter the type of operation he/she would like to perform on the correlated random variables (i.e., addition, subtraction, multiplication, or division). The following section explicitly describes the steps involved in running program CORREST to obtain an estimate of the probability distribution of functions of two correlated random variables:

1. Create a data file according to the appropriate format: The input data file must be a

tab-delimited text file which may be created in a spreadsheet application or in MS-

DOS. The first line of the file must contain the number of observations in the data

file followed by the observations themselves which are arranged in adjacent columns.

Figure 27 gives an example of an appropriate input data file for program CORREST.

Figure 27. Sample Input Data File For Program CORREST.

The first line of the file indicates that this file contains 50 observations.
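For concreteness, a data file of this form might begin as shown below, with the observation pairs in two tab-delimited columns; the numeric values here are purely illustrative and are not the contents of any actual data file used in this work:

    50
    0.4387    0.5120
    0.7652    0.8433
    0.1866    0.2974
    ...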

2. The executable file, correst.exe, should be copied from a floppy disk to a

subdirectory on the hard drive of the computer in which the user wishes to run

CORREST.

3. From within this subdirectory, at the prompt, type correst.exe or simply correst.

Either command will start the executable file to run the program.

4. The user will now be prompted to enter the name of the input data file which must be

entered in full. For example, if the name of the file is test.dat then the user must

enter the filename and its extension (i.e., the user would type test.dat). Furthermore, the data file must be located in the same local directory as the executable file in

order for the program to function. Once the input data file name has been entered,

the program displays (on screen) the value of the optimum smoothing parameter, h,

the covariance matrix of the data, the inverse of the covariance matrix of the data,

and the determinant of the covariance matrix of the data.

5. The user is now prompted for the operation to perform on the random variables. In

order to select an operation, the user inputs the number of his/her choice and hits

enter.

6. The program will perform its calculations and write the output to a file entitled

output.out. The file can be found in the same local directory as the executable file.

Figure 28 gives a portion of an example output file from CORREST. The file may be read by Microsoft Excel or another spreadsheet package in order to plot the density and distribution functions. Program CORREST has no graphics capabilities. Figures 29 and

30 give flow charts for the program flow and calculation flow, respectively.

Figure 28. Sample Output Data File From Program CORREST (header: "Output for file: tdat9b.dat / Number of Observations: 100").

Figure 29. Flow of Program CORREST.

Figure 30. Flow of Calculations for Program CORREST.

6.3. Limitations of the Program

This program was designed for estimating the probability distributions of functions of two correlated random variables. In its current form, CORREST cannot solve a problem of higher dimension. Some limitations do exist for the program, even in the two-variable case.

More specifically, the maximum number of real observations cannot exceed 400, and the maximum number of grid points to be evaluated should not exceed 2401. This corresponds to dividing the intervals in both coordinate directions into 50 equally-spaced cells. Thus, the rule for grid selection given in Chapter 4 is used in this implementation.

The type of input data file used in the program must be of the form discussed in the previous section. That is, it must be a tab-delimited text file containing the number of observations on the first line of the file followed by two adjacent columns which correspond to the observation pairs.

No graphics capabilities are included in the program; however, if graphical displays of the resulting distributions are required, the output file may be opened and read manually in Microsoft Excel or any other common spreadsheet application.

CHAPTER 7. CONCLUSIONS, RECOMMENDATIONS, AND FUTURE WORK

The objective of this research was to develop a numerical technique to approximate the probability distribution for simple functions of two correlated random variables. Such a methodology will allow engineers to include in their analyses the inherent linear dependencies which so often exist in real-world problems. This objective was achieved through a two-phase process which consisted of (1) estimating the joint probability density function using the kernel density estimator and (2) applying the discrete transformation of variables to the density estimate to obtain an approximated distribution. The technique was tested against Monte Carlo simulations of probability distributions of 10,000 observations for the sum, difference, product, and quotient over a wide range of problems. This chapter shall summarize the results of this research, make relevant recommendations about the use of the technique, and propose future research for improvements and extensions of this work.

7.1. Conclusions

The statistical results presented in Chapter 5 demonstrate that given a good estimate of the joint probability density function, the distribution of the sum, difference, product, and quotient of two correlated random variables may be closely approximated by applying the discrete transformation of variables technique on a finite set of points in the region. The kernel density estimator provides the means by which the joint probability density function may be approximated and it has been demonstrated that even with a relatively small number of observations, good overall approximations of distributions of functions of random variables may be achieved.

The method was tested on a wide range of problems of varying correlation, sample size, and distribution type. Each problem was tested on the sum, difference, product, and quotient of the two correlated random variables for a grand total of 144 test cases. Overall, the average maximum absolute error (MAE) of estimation when compared with simulated C.D.F.'s was found to be 0.061 while the average mean- squared error (MSE) of estimation was found to be 0.000673. Furthermore, 93% of the estimates were statistically found to be the same distribution as the simulated distribution.

The major factors affecting the quality of the estimation were found to be the degree of correlation, the sample size, and the type of distribution involved. With regard to correlation, the estimate worsens as the degree of correlation increases. The performance of the technique on the uniform and triangular distributions was very similar (average MAE = 0.054 and average MAE = 0.051, respectively) while the technique performed worst when the X distribution was normal (average MAE = 0.076), possibly for the reasons stated in Chapter 5.

The estimator was found to perform best in terms of average maximum absolute error on the sum (average MAE = 0.053) and product (average MAE = 0.056) of two correlated variables while the difference (average MAE = 0.067) and quotient (average

MAE = 0.066) performed slightly more poorly. As expected, there is a general decrease in maximum absolute error when the sample size increases, however, this decrease is very slight. Furthermore, at 100 observations, there is a jump in the error. This may be attributed to a relationship with the grid size which was fixed throughout testing.

7.2. Recommendations

The numerical technique has been shown to provide good estimates for the probability distributions of the sum, difference, product, and quotient of two correlated random variables. Under conditions of low, medium, and high correlation, the estimation method provides more than adequate estimates of the overall probability distribution of functions of the two correlated random variables. It is recommended that the technique be used primarily as a tool when the

engineer has collected a set of real-world observations and wishes to determine the

overall distribution for some function of two random variables, for example their sum.

By beginning with a scatterplot of the data, it is possible to identify if a linear

dependency exists between the two variables. If a linear dependency is present, the

analyst is very much justified in using the technique.

Obviously, the number of observations taken by the engineer is an important

factor in determining whether or not the estimation method needs to be used as opposed

to simply enumerating the possible Z values and constructing a cumulative distribution.

As seen in Chapter 5, when the number of observations was as large as 400, simple

enumeration was found to perform as well as the estimation procedure. However, at a

low number of observations, the estimator clearly outperforms enumeration. Hence, it is

recommended that if fewer than 100 observations are taken, the estimation procedure

should be implemented. If more than 100 observations are available, the analyst may just as easily perform a simple enumeration of the possible z-values and construct an

estimate of the distribution. These recommendations are based upon tests using the

normal distribution with a high degree of correlation.

In some cases, engineers may need to assume a theoretical distribution for one or

more of the variables because collection of real-world data is not feasible. This

technique is not limited to real-world observations. The estimator only requires data

points and is not concerned with their origin. Furthermore, the estimator has provided good results on the uniform, normal, and triangular distributions, all of which are commonly used for the modeling of engineering data. Although the estimator exhibits relatively higher errors on the normal distribution, its performance (average MAE =

0.076) is still quite acceptable.

7.3. Improvements and Future Research

7.3.1. Improvements to Current Research

There exist a few areas in which the methodology could be improved in the two-variable case. First, the grid of points at which the joint density was estimated was fixed in each case. It is hypothesized that there is a relationship between the chosen grid size and the number of observations taken. It may prove useful in future research to investigate the effect of a "tighter" or "looser" grid with a varying number of observations.

Second, it was noted that the kernel density estimator has some difficulty on long-tailed distributions. This could be the reason that the estimator worked best on the uniform and triangular distributions and performed worse on the normal distribution. In future research, it may be worthwhile to investigate the adaptive kernel method as proposed by Silverman [30]. This method adjusts the smoothing parameter, h, throughout the estimation procedure rather than using a fixed parameter. In terms of the computer implementation, CORREST, the program has not been

analyzed to determine areas in which efficiency may be improved. Thus, in its current

form, CORREST uses a "brute force" approach for generating the probability distribution

estimates. It may be possible to analyze a larger sample size or a tighter grid if these

areas are improved. Furthermore, it would be desirable to have a Windows-based package which gives graphical output of the resulting probability density and cumulative

distribution functions.

7.3.2. Future Research

This work has provided the framework for the construction of more complicated

functions of random variables. Specifically, the work may be extended to three or more

correlated variables. The steps one would perform to arrive at the probability

distribution estimate for functions of three correlated random variables are presented here.

Let X_1, X_2, and X_3 be three random variables which are all correlated, and suppose our interest is to estimate the probability distribution of the random variable Z = u(X_1, X_2, X_3). The user will still be required to provide a set of n observations. Furthermore, a grid of points upon which the joint density function, f(x_1, x_2, x_3), is estimated must still be defined. As before, define R = {(x_1i, x_2i, x_3i) | i = 1, 2, ..., n} as the set of n real observations of the three variables, such that (x_1i, x_2i, x_3i) denotes the i-th observation. Furthermore, define G = {(x_1j, x_2j, x_3j) | j = 1, 2, ..., m} as the set of all fixed points, where m is a positive integer denoting the number of points at which the density estimate will be constructed. The sample space of the resulting random variable Z shall be denoted by Q = {z | z = u(x_1j, x_2j, x_3j), j = 1, 2, ..., m}. The set Q will contain n_z elements. Thus, the joint density estimate for a point s ∈ G is given by

\[ \hat{f}(s) = \frac{1}{n h^3 (2\pi)^{3/2} |\Sigma|^{1/2}} \sum_{i=1}^{n} \exp\left\{ -\frac{(s - \mathbf{x}_i)^T \Sigma^{-1} (s - \mathbf{x}_i)}{2h^2} \right\} \]

where Σ represents the covariance matrix for the data, |Σ| is its determinant, and Σ⁻¹ is its inverse. All other variables are defined as previously. It should be noted that Σ is now a 3 × 3 matrix. As before, define S_z = {(x_1, x_2, x_3) | z = u(x_1, x_2, x_3)} as the set of all points which give the value z when mapped into Q. Then the probability that the random variable Z assumes the value z may be given by

\[ P(Z = z) = \sum_{(x_1, x_2, x_3) \in S_z} \hat{f}(x_1, x_2, x_3) \]

Before the above summation is made, the joint density functional values should be converted to relative frequencies, as was done in the two-variable case.
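As an illustration of the computational step involved, the following C fragment sketches the evaluation of the trivariate estimate at a single grid point, following the d = 3 form reconstructed above. The function name kde3 and its interface are illustrative assumptions; CORREST itself handles only the case d = 2.

    #include <math.h>

    /* Trivariate Gaussian-kernel density estimate at grid point s.
       Sinv is the inverse covariance matrix of the data, detS its
       determinant, h the smoothing parameter, and obs the n
       observations (x1, x2, x3). */
    double kde3(const double s[3], const double obs[][3], int n,
                const double Sinv[3][3], double detS, double h)
    {
        const double PI = 3.14159265358979323846;
        double sum = 0.0;

        for (int i = 0; i < n; i++) {
            double d[3], q = 0.0;
            for (int j = 0; j < 3; j++)
                d[j] = s[j] - obs[i][j];
            for (int j = 0; j < 3; j++)        /* q = d' Sinv d */
                for (int k = 0; k < 3; k++)
                    q += d[j] * Sinv[j][k] * d[k];
            sum += exp(-q / (2.0 * h * h));
        }
        /* normalizing constant: n h^3 (2 pi)^{3/2} |Sigma|^{1/2} */
        return sum / (n * pow(h, 3) * pow(2.0 * PI, 1.5) * sqrt(detS));
    }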

Attempting an extension to three or more dimensions requires a great deal more computational effort in comparison to the two-variable case. First, it is necessary to compute the covariance matrix based on the observations of all three columns of data.

Second, once the covariance matrix has been determined, its inverse and determinant must be computed. These routines are generally made available in references such as

[24].

The real complication which may occur in the implementation of higher dimensions is in specification of the set G. Using the simplistic rule given in Chapter 4, it is seen that the number of points in the set G for estimating the joint density function is given by

\[ m = (p - 1)^d \]

where p is the number of divisions in each coordinate direction and d is the number of variables in the problem. Thus, as the number of variables grows, the number of points being evaluated for the joint density grows exponentially. In this work, p was fixed at 50, so that in the two-variable case the joint density was estimated at 2401 points. Using the same grid rule for three variables would require evaluation at 117,649 points. Currently, the CORREST program is capable of solving the case d = 2. This is an area which could be improved upon in the future.

REFERENCES

[1] Aroian, L.A., Some Methods of Evaluation of a Sum, Journal of the American Statistical Association, Vol. 39, 1944, pp. 511-515.

[2] Bakshi, Girish, Comparison of Ridge Regression and Neural Networks in Modeling Multicollinear Data, Master's Thesis, Ohio University, Athens, Ohio, Nov., 1996, pp. 40-50.

[3] Breitenberger, Ernst, Probability, Convolutions, and Distributions, Electric Power Research Institute, Palo Alto, CA, 1990, pp. 69-84.

[4] Colombo, A.G. and R.J. Jaarsma, A Powerful Numerical Method to Combine Random Variables, IEEE Transactions on Reliability, Vol. R-29, No. 2, June, 1990, pp. 126-129.

[5] Conover, W.J., Practical Nonparametric Statistics, John Wiley & Sons, Inc., New York, 1971, pp. 314-316.

[6] Cover, T.M. and Zhen Zhang, On the Maximum Entropy of the Sum of Two Dependent Random Variables, IEEE Transactions on Information Theory, Vol. 40, July, 1994, pp. 1244-1246.

[7] Craig, C.C., On the Frequency Distributions of the Quotient and of the Product of Two Statistical Variables, American Mathematical Monthly, Vol. 49, pp. 26-32.

[8] Cramer, H., Random Variables and Probability Distributions, Cambridge Tracts in Mathematics and Mathematical Physics, No. 36, Cambridge University Press, 1962.

[9] Curran, M., Valuing Asian and Portfolio Options By Conditioning on The Geometric Mean Price, Management Science, Vol. 40, December, 1994, pp. 1705-1711.

[10] Epstein, B., Some Applications of the Mellin Transform in Statistics, Annals of Mathematical Statistics, Vol. 19, 1948, pp. 370-379.

[11] Freund, John and Ronald Walpole, Mathematical Statistics, Prentice Hall, Englewood Cliffs, New Jersey, 1980, pp. 225-244, 248, 249.

[12] Fukunaga, Keinosuke, Introduction to Statistical Pattern Recognition, Academic Press, New York, 1972, pp. 26-33, 92, 93.

[13] Giffin, W.C., Transform Techniques for Probability Modeling, Academic Press, Inc., New York, 1975, pp. 22-44, 70-73.

[14] Giffin, W.C., Introduction to Operations Engineering, Richard D. Irwin, Inc., Homewood, Illinois, 1971, pp. 96-104.

[15] Hu, X., Probability Modeling of Industrial Situations Using Transform Techniques, Master's Thesis, Ohio University, Athens, Ohio, Nov., 1995.

[16] Johnson, Richard and Dean Wichern, Applied Multivariate Statistical Analysis, Prentice Hall, Englewood Cliffs, New Jersey, pp. 80, 81, 340-366.

[17] Law, A.M. and W. David Kelton, Simulation Modeling and Analysis, Second Edition, McGraw-Hill, Inc., New York, 1991, pp. 485-516.

[18] Leon, Steven J., Linear Algebra with Applications, Macmillan Publishing Co., New York, 1990, pp. 252-257, 337, 339, 394-396.

[19] Lukacs, E., Characteristic Functions, 2nd Edition, Hafner, New York, 1970.

[20] Martel, A. and J. Ouellet, Stochastic Allocation of a Resource Among Partially Interchangeable Activities, European Journal of Operational Research, Vol. 74, May 5, 1994, pp. 528-539.

[21] Martin, Clarence H., Biased Estimation of the Parameter Vector in the General Linear Model, Master's Thesis, Ohio University, Athens, Ohio, August, 1974, pp. 77-80.

[22] Neter, J., Kutner, M.H., Nachtsheim, C.J., and W. Wasserman, Applied Linear Statistical Models, Richard Irwin, Inc., Chicago, 1989, pp. 1241-1245.

[23] Pratt, John W. and Jean D. Gibbons, Concepts of Nonparametric Theory, Springer-Verlag, New York, 1981, pp. 320-334.

[24] Press, Teukolsky, Vetterling, and Flannery, Numerical Recipes in C, Second Edition, Cambridge University Press, 1992, pp. 274-286.

[25] Rinott, Y., On Normal Approximation Rates For Certain Sums of Dependent Random Variables, Journal of Computational and Applied Mathematics, Vol. 55, Nov. 21, 1994, pp. 135-143.

[26] Rothschild, V. and N. Logothetis, Probability Distributions, John Wiley & Sons, New York, 1986.

[27] Sanders, Mark S. and Ernest J. McCormick, Human Factors in Engineering and Design, Seventh Edition, McGraw-Hill, Inc., New York, 1993, pp. 415-423.

[28] Scott, David W., Multivariate Density Estimation: Theory, Practice, and Visualization, Wiley & Sons, Inc., New York, 1992, pp. 125-190.

[29] Siegel, Sidney, Nonparametric Statistics for the Behavioral Sciences, McGraw-Hill, New York, 1956, pp. 127-136.

[30] Silverman, B.W., Density Estimation for Statistics and Data Analysis, Chapman and Hall Ltd., New York, 1986, pp. 7-32.

[31] Springer, M.D., The Algebra of Random Variables, John Wiley & Sons, Inc., New York, 1979, pp. 67-72, 151-156.

[32] Springer, M.D. and W.E. Thompson, The Distribution of Products of Independent Random Variables, SIAM Journal of Applied Mathematics, Vol. 14, 1966, pp. 511-526.

[33] Trivedi, Kishor S., Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1982, pp. 54-58, 109-111.

[34] Wetherill, G.B., Regression Analysis with Applications, Chapman and Hall, New York, 1986.

[35] Wintner, A., On the Addition of Independent Distributions, American Journal of Mathematics, Vol. 56, 1934, pp. 8-16.

[36] Wolff, Ronald W., Stochastic Modeling and the Theory of Queues, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1989, pp. 23-26.

[37] Wylie, C.R., Advanced Engineering Mathematics, McGraw-Hill, New York, 1960, pp. 245-334.

[38] Yu, Jing, A Transform Technique for Obtaining Reliability Distributions for Multi-Component Systems, Master's Thesis, Ohio University, November, 1990.

APPENDICES

APPENDIX A.1

Experimental Results

Table A.1.1. Experimental Results for Z = X + Y when X ~ U(0,1), n = 50
Correlation   MSE       MAE       T2        w_0.95    Significant
0.9           0.00339   0.09408   0.16630   0.46100   NO
0.6           0.00300   0.08889   0.14680   0.46100   NO
0.2           0.00060   0.04359   0.02943   0.46100   NO

Table A.1.2. Experimental Results for Z = X + Y when X ~ U(0,1), n = 100
Correlation   MSE       MAE       T2        w_0.95    Significant
0.9           0.00101   0.04844   0.04931   0.46100   NO
0.6           0.00086   0.06343   0.04218   0.46100   NO
0.2           0.00150   0.07081   0.07364   0.46100   NO

Table A.1.3.
Correlation
0.6

Table A.1.5. Experimental Results for Z = X - Y when X ~ U(0,1), n = 50
Correlation   MSE       MAE       T2        w_0.95    Significant
0.9           0.00043   0.07609   0.02087   0.46100   NO

Table A.1.6. Experimental Results for Z = X - Y when X ~ U(0,1), n = 100
Correlation   MSE       MAE       T2        w_0.95    Significant
0.9           0.00022   0.05342   0.01044   0.46100   NO

Table A.1.7. Experimental Results for Z = X - Y when X ~ U(0,1), n = 200
Correlation   MSE       MAE       T2        w_0.95    Significant
0.9           0.00028   0.06007   0.01361   0.46100   NO
0.6           0.00073   0.07355   0.03520   0.46100   NO
0.2           0.00186   0.09358   0.09041   0.46100   NO

Table A.1.8. Experimental Results for Z = X - Y when X ~ U(0,1), n = 400
Correlation   MSE       MAE       T2        w_0.95    Significant
0.9           0.00012   0.03880   0.00567   0.46100   NO
0.6           0.00047   0.05513   0.02261   0.46100   NO
0.2           0.00015   0.02777   0.00721   0.46100   NO

Table A.1.9. Experimental Results for Z = XY when X ~ U(0,1), n = 50
Correlation   MSE       MAE       T2        w_0.95    Significant
0.9           0.00389   0.09276   1.51482   0.46100   YES
0.6           0.00350   0.09135   1.36476   0.46100   YES
0.2           0.00082   0.04528   0.31781   0.46100   NO

Table A.1.10. Experimental Results for Z = XY when X ~ U(0,1), n = 100
Correlation   MSE       MAE       T2        w_0.95    Significant
0.9           0.00137   0.05635   0.53489   0.46100   YES
0.6           0.00098   0.06208   0.38145   0.46100   NO
0.2           0.00163   0.06589   0.63623   0.46100   YES

Table A.1.11. Experimental Results for Z = XY when X ~ U(0,1), n = 200
Correlation   MSE       MAE       T2        w_0.95    Significant
0.9           0.00035   0.03380   0.13768   0.46100   NO
0.6           0.00110   0.05818   0.42785   0.46100   NO
0.2           0.00094   0.04804   0.36545   0.46100   NO

Table A.1.12. Experimental Results for Z = XY when X ~ U(0,1), n = 400
Correlation   MSE       MAE       T2        w_0.95    Significant
0.9           0.00059   0.03864   0.22982   0.46100   NO
0.6           0.00072   0.04790   0.27856   0.46100   NO
0.2           0.00101   0.04730   0.39336   0.46100   NO

Table A.1.13. Experimental Results for Z = X/Y when X ~ U(0,1), n = 50
Correlation   MSE       MAE       T2        w_0.95    Significant
0.9           0.00038   0.07609   0.28819   0.46100   NO

Table A.1.14. Experimental Results for Z = X/Y when X ~ U(0,1), n = 100
Correlation   MSE       MAE       T2        w_0.95    Significant
0.9           0.00034   0.05400   0.25593   0.46100   NO

Table A.1.15. Experimental Results for Z = X/Y when X ~ U(0,1), n = 200
Correlation   MSE       MAE       T2        w_0.95    Significant
0.9           0.00016   0.04061   0.12039   0.46100   NO
0.6           0.00047   0.06135   0.35767   0.46100   NO
0.2           0.00225   0.08548   1.70422   0.46100   YES

Table A.1.16. Experimental Results for Z = X/Y when X ~ U(0,1), n = 400
Correlation   MSE       MAE       T2        w_0.95    Significant
0.9           0.00007   0.05654   0.05315   0.46100   NO
0.6           0.00036   0.04263   0.27173   0.46100   NO
0.2           0.00021   0.03046   0.15779   0.46100   NO

Table A.1.16. Experimental Results for Z = XiY when X - U(0,1), n = 400 Correlation MSE MAE T2 w 0.95 Significant 0.9 0.00007 0.05654 0.053 15 0.46100 NO 0.6 0.00036 0.04263 0.27173 0.46100 NO 0.2 0.0002 1 0.03046 0.15779 0.461 00 NO Table A.1.17. Experimental Results for Z = X + Y, when X - N(0.5,0.0049), n = 50

Correlation MSE MAE T2 w 0,95 Signijicant 0.9 0.00036 0.05476 0.01753 0.46100 NO 0.6 0.00025 0.04985 0.01226 0.46100 NO 0.2 0.00059 0.06275 0.02870 0.46100 NO

Table A.1.18. Experimental Results for Z = X + Y, when X - N(0.5,0.0049), --n= 100--- Correlation MSE MAE T2 w 0.95 Significant 0.9 0.00034 0.05895 0.01650 0.46100 NO 0.6 0.0001 8 0.04562 0.00867 0.46100 NO 0.2 0.001 10 0.08063 0.05382 0.46100 NO

Table A.1.19. Experimental Results for Z - X + Y, when X - N(0.5,0.0049), n = 200 Correlation MSE MAE Tz w 0.95 Significant 0.9 0.00006 0.02026 0.00285 0.46100 NO 0.6 0.00080 0.07441 0.0391 1 0.46100 NO 0.2 0.00065 0.05553 0.03 180 0.46100 NO

Table A.1.20. Experimental Results for Z = X + Y, when X - N(0.5,0.0049), n = 400 Correlation MSE MAE T2 w 0.95 Signijicant 0.9 0.0003 1 0.05092 0.01 525 0.46100 NO 0.6 0.00012 0.0251 8 0.00583 0.461 00 NO 0.2 0.00026 0.03962 0.01293 0.46100 NO Table A.1.21. Experimental Results for Z = X - Y when X - N(0.5,0.0049), n = 50 Correlation MSE MAE T2 w 0.95 Significant 0.9 0.00055 0.15378 0.02683 0.46100 NO 0.6 0.00044 0.09795 0.021 17 0.46100 NO 0.2 0.00061 0.07561 0.02942 0.46100 NO

Table A.1.22. Experimental Results for Z = X - Y when X - N(0.5,0.0049), n = 100 1 Correlation MSE MAE T2 w 0.95 Significant 0.9 0.00099 0.20435 0.04798 0.46100 NO 0.6 0.00022 0.07686 0.01051 0.46100 NO 0.2 0.00066 0.08632 0.03 187 0.46100 NO

Table A.1.23. Experimental Results for Z = X - Y when X - N(0.5,0.0049), n = 200

Correlation MSE MAE T2 w 0.95 Significant 0.9 0.00042 0.12538 0.02038 0.46100 NO 0.6 0.00004 0.02433 0.00175 0.46100 NO 0.2 0.00007 0.03 183 0.00335 0.46100 NO 7

Table A.1.24. Experimental Results for Z = X - Y when X - N(0.5,0.0049), n = 400..

Correlation MSE ME T2 w 0.95 Significant 0.9 0.00040 0.12290 0.01932 0.46100 NO Table A.1.25. Experimental Results for Z = XY when X - N(0.5,0.0049), n-50 Correlation MSE MAE T2 w 0.95 Significant 0.9 0.00029 0.04937 0.1 1 187 0.461 00 NO 0.6 0.00021 0.04595 0.08235 0.46100 NO 0.2 0.00086 0.07293 0.33352 0.46100 NO

Table A.1.26. Experimental Results for Z = XY when X - N(0.5,0.0049), = n 100 - Correlation MSE MAE T2 w 0.95 Signzficant 0.9 0.001 16 0.10938 0.45277 0.46100 NO 0.6 0.00102 0.10003 0.39729 0.46100 NO 0.2 0.00108 0.07397 0.42182 0.46100 NO

Table A.1.27. Experimental Results for Z = XY when X - N(0.5,0.0049), n = 200 Correlation MSE MAE T2 w 0.95 SigniJicant 0.9 0.0001 1 0.04385 0.04144 0.46100 NO 0.6 0.00060 0.06959 0.23461 0.46100 NO

Table A.1.28. Experimental Results for Z = XY when X - N(0.5,0.0049), n = Ann

Correlation MSE ME T2 w 0.95 Signrjkant 0.9 0.00015 0.04582 0.05764 0.461 00 NO 0.6 0.00030 0.04780 0.1 1833 0.46100 NO 0.2 0.00013 0.03 164 0.05006 0.46100 NO Table A.1.29. Experimental Results for Z = XPI' when X - N(0.5,0.0049), n=50 Correlation MSE MAE T2 w 0.9, Signzjkant 0.9 0.00017 0.12944 0.13119 0.46100 NO 0.6 0.00030 0.06630 0.22547 0.46100 NO 0.2 0.00067 0.05734 0.50844 0.461 00 YES

Table A.1.30. Experimental Results for Z = X/Y when X - N(0.5,0.0049), n = 100 Correlation MSE MAE 7'2 w 0.95 SigniJicant 0.9 0.00019 0.20435 0.14335 0.46100 NO 0.6 0.00017 0.05672 0.12552 0.46100 NO 0.2 0.00097 0.08094 0.73757 0.46100 YES

Table A.1.31. Experimental Results for Z = X/Y when X - N(0.5,0.0049), n = 200 Correlation MSE MAE T2 w o.95 Signijicant 0.9 0.00009 0.12538 0.06905 0.46100 NO 0.6 0.00037 0.0771 9 0.27870 0.46100 NO 0.2 0.00027 0.04961 0.20800 0.46100 NO

Table A.1.32. Experimental Results for Z = X/Y when X - N(0.5,0.0049), n = 400 Correlation MSE MAE T2 w 0.95 Significant 0.9 0.00014 0.12171 0.10328 0.46100 NO 0.6 0.00037 0.1 1 198 0.27880 0.46 100 NO 0.2 0.00037 0.06081 0.27776 0.46100 NO Table A.1.33. Experimental Results for Z = X + Y when X - Tr(0.0,0.5,1.0), n = 50 Correlation MSE MAE T2 w 0.95 Signljcant 0.9 0.00579 0.13257 0.28392 0.461 00 NO 0.6 0.00071 0.05383 0.03483 0.46100 NO 0.2 0.00023 0.03005 0.01 130 0.46100 NO

Table A.1.34. Experimental Results for Z = X + Y when X - Tr(0.0,0.5,1.0), n.. = 100.. - Correlation MSE MAE T2 w 0.95 Signijcant 0.9 0.00067 0.04396 0.03291 0.46100 NO 0.6 0.00233 0.09593 0.1 1408 0.46100 NO 0.2 0.00091 0.06296 0.04447 0.46100 NO

Table A.1.35. Experimental Results for Z = X + Y when X - Tr(0.0,0.5,1.0), n = 200 Correlation MSE MAE T2 w 0.95 SigniJicant 0.9 0.00024 0.03 159 0.01 173 0.46100 NO 0.6 0.00005 0.01675 0.00253 0.46100 NO 0.2 0.00002 0.00986 0.00083 0.461 00 NO

Table A.1.36. Experimental Results for Z = X + Y when X - Tr(0.0,0.5,1.0), n = 400 Correlation MSE MAE T2 w 0.95 Significant 0.9 0.00057 0.03797 0.02797 0.46100 NO 0.6 0.00048 0.04175 0.02365 0.46100 NO 0.2 0.00041 0.03907 0.0201 7 0.46100 NO Table A.1.37. Experimental Results for Z = X - Y when X - Tr(0.0,0.5,1.0), n = 50

Correlation MSE ME T2 w 0.95 SigniJficant 0.9 0.00009 0.04210 0.0045 1 0.46100 NO 0.6 0.00012 0.03584 0.00593 0.46100 NO 0.2 0.00023 0.03927 0.01 139 0.46100 NO

Table A.1.38. Experimental Results for Z = X - Y when X - Tr(0.0,0.5,1.0), n = 100 Correlation MSE MAE T2 w ,,, Signzjkant 0.9 0.00057 0.09770 0.02769 0.46100 NO 0.6 0.00006 0.02588 0.00302 0.461 00 NO 0.2 0.00057 0.05953 0.02747 0.46100 NO

Table A.1.39. Experimental Results for Z = X - Y when X - Tr(0.0,0.5,1.0), n = 200 Correlation MSE MAE T2 w 0.95 SignlJicant 0.9 0.00037 0.07169 0.01771 0.46100 NO 0.6 0.00054 0.05895 0.02630 0.46100 NO 0.2 0.00039 0.05221 0.01 890 0.46100 NO

Table A.1.40. Experimental Results for Z = X - Y when X - Tr(0.0,0.5,1.0), n = 400 Correlation MSE MAE T2 w 0.95 Signzjicant 0.9 0.00037 0.087 12 0.01809 0.46100 NO 0.6 0.00008 0.02628 0.00401 0.46100 NO 0.2 0.00010 0.02556 0.00489 0.46100 NO Table A.1.41. Experimental Results for Z = XY when X - Tr(0.0,0.5,1.0), n=50 Correlation MSE MAE T2 w~.~~SigniJicant 0.9 0.00655 0.13087 2.55315 0.46100 YES 0.6 0.00069 0.04821 0.26906 0.46100 NO 0.2 0.00017 0.02555 0.06767 0.461 00 NO

Table A.1.42. Experimental Results for Z = XY when X ~ Tr(0.0,0.5,1.0), n = 100

    Correlation    MSE        MAE        T2         w(0.95)    Significant
    0.9            0.00128    0.05485    0.49809    0.46100    YES
    0.6            0.00249    0.09180    0.96885    0.46100    YES
    0.2            0.00083    0.05816    0.32352    0.46100    NO

Table A.1.43. Experimental Results for Z = XY when X ~ Tr(0.0,0.5,1.0), n = 200

    Correlation    MSE        MAE        T2         w(0.95)    Significant
    0.9            0.00016    0.02992    0.06216    0.46100    NO
    0.6            0.00015    0.02718    0.05737    0.46100    NO
    0.2            0.00006    0.01441    0.02495    0.46100    NO

Table A.1.44. Experimental Results for Z = XY when X ~ Tr(0.0,0.5,1.0), n = 400

    Correlation    MSE        MAE        T2         w(0.95)    Significant
    0.9            0.00045    0.03604    0.17468    0.46100    NO

Table A.1.45. Experimental Results for Z = X/Y when X ~ Tr(0.0,0.5,1.0), n = 50

    Correlation    MSE        MAE        T2         w(0.95)    Significant
    0.9            0.00059    0.08163    0.44459    0.46100    NO

Table A.1.46. Experimental Results for Z = X/Y when X ~ Tr(0.0,0.5,1.0), n = 100

    Correlation    MSE        MAE        T2         w(0.95)    Significant
    0.9            0.00016    0.09770    0.12269    0.46100    NO
    0.6            0.00040    0.03980    0.29952    0.46100    NO
    0.2            0.00124    0.08970    0.94242    0.46100    YES

Table A.1.47. Experimental Results for Z = X/Y when X ~ Tr(0.0,0.5,1.0), n = 200

    Correlation    MSE        MAE        T2         w(0.95)    Significant
    0.9            0.00008    0.07092    0.06136    0.46100    NO
    0.6            0.00039    0.05081    0.29486    0.46100    NO
    0.2            0.00053    0.04293    0.40385    0.46100    NO

Table A.1.48. Experimental Results for Z = X/Y when X ~ Tr(0.0,0.5,1.0), n = 400

    Correlation    MSE        MAE        T2         w(0.95)    Significant
    0.9            0.00020    0.05948    0.15268    0.46100    NO
    0.6            0.00010    0.02321    0.07419    0.46100    NO
    0.2            0.00009    0.02266    0.06538    0.46100    NO

APPENDIX A.2

Comparison of Cumulative Distribution Functions

Figure A.2.1. C.D.F. Comparisons for (a) Sum and (b) Difference (R' = 0.90, X ~ U(0,1)).
[Figure: cumulative probability (0.0 to 1.0) plotted against z (0.0 to 4.0); legend includes "Simulation".]

Figure A.2.2. C.D.F. Comparisons for (a) Product and (b) Quotient (R' = 0.90, X ~ U(0,1)).

Figure A.2.3. C.D.F. Comparisons for (a) Sum and (b) Difference (R' = 0.60, X ~ U(0,1)).

Figure A.2.4. C.D.F. Comparisons for (a) Product and (b) Quotient (R' = 0.60, X ~ U(0,1)).

Figure A.2.5. C.D.F. Comparisons for (a) Sum and (b) Difference (R' = 0.20, X ~ U(0,1)).

Figure A.2.6. C.D.F. Comparisons for (a) Product and (b) Quotient (R' = 0.20, X ~ U(0,1)).

Figure A.2.7. C.D.F. Comparisons for (a) Sum and (b) Difference (R' = 0.90, X ~ N(0.5,0.0049)).

Figure A.2.8. C.D.F. Comparisons for (a) Product and (b) Quotient (R' = 0.90, X ~ N(0.5,0.0049)).

[Figure legend: Simulation; n = 200; n = 400.]

Figure A.2.9. C.D.F. Comparisons for (a) Sum and (b) Difference (R' = 0.60, X ~ N(0.5,0.0049)).

Figure A.2.10. C.D.F. Comparisons for (a) Product and (b) Quotient (R' = 0.60, X ~ N(0.5,0.0049)).

Figure A.2.11. C.D.F. Comparisons for (a) Sum and (b) Difference (R' = 0.20, X ~ N(0.5,0.0049)).

Figure A.2.12. C.D.F. Comparisons for (a) Product and (b) Quotient (R' = 0.20, X ~ N(0.5,0.0049)).

Figure A.2.13. C.D.F. Comparisons for (a) Sum and (b) Difference (R' = 0.90, X ~ Tr(0.0,0.50,1.0)).

Figure A.2.14. C.D.F. Comparisons for (a) Product and (b) Quotient (R' = 0.90, X ~ Tr(0.0,0.50,1.0)).

Figure A.2.15. C.D.F. Comparisons for (a) Sum and (b) Difference (R' = 0.60, X ~ Tr(0.0,0.50,1.0)).

Figure A.2.16. C.D.F. Comparisons for (a) Product and (b) Quotient (R' = 0.60, X ~ Tr(0.0,0.50,1.0)).

Figure A.2.18. C.D.F. Comparisons for (a) Product and (b) Quotient (R' = 0.20, X ~ Tr(0.0,0.50,1.0)).

APPENDIX A.3

Means Tables and Scheffe Post-Hoc Tests

Table A.3.1. Table of Means for Correlation

    Correlation    Average Maximum Absolute Error

Table A.3.2. Table of Means for Observations

    Number of Observations (n)    Average Maximum Absolute Error
    50                            0.067
    100                           0.072
    200                           0.053
    400                           0.050

Table A.3.3. Table of Means for Distribution

    X Distribution    Average Maximum Absolute Error
    Uniform           0.054
    Normal            0.076
    Triangular        0.051

Table A.3.4. Table of Means for Operation

    Operation on X and Y    Average Maximum Absolute Error
    Sum                     0.053
    Difference              0.067
    Product                 0.056
    Quotient                0.066

Table A.3.5. Table of Means, Correlation/Observations Interaction (Avg. MAE)

                        Correlation
    Observations     0.2      0.6      0.9
    50               0.047    0.060    0.093
    100              0.067    0.061    0.090
    200              0.049    0.054    0.057
    400              0.036    0.052    0.061

Table A.3.6. Table of Means, Correlation/Distribution Interaction (Avg. MAE)

                        Correlation
    Distribution     0.2      0.6      0.9
    Uniform          0.051    0.057    0.055
    Normal           0.059    0.068    0.101
    Triangular       0.039    0.045    0.069

Table A.3.7. Table of Means, Correlation/Operation Interaction (Avg. MAE)

                     Correlation
    Operation     0.2      0.6      0.9

Table A.3.8. Table of Means, Observations/Operation Interaction (Avg. MAE)

                     Number of Observations (n)

Table A.3.9. Table of Means, Observations/Distribution Interaction (Avg. MAE)

                      Number of Observations (n)
    Distribution     50       100      200      400
    Uniform          0.065    0.051    0.058    0.043
    Normal           0.076    0.098    0.062    0.067
    Triangular       0.058    0.068    0.040    0.039

Table A.3.10. Table of Means, Operation/Distribution Interaction (Avg. MAE)

                               Operation
    Distribution     Sum      Difference    Product    Quotient
    Uniform          0.057    0.053         0.057      0.050
    Normal           0.052    0.095         0.062      0.095
    Triangular       0.051    0.052         0.049      0.054

Table A.3.11. Table of Means, Correlation/Observations/Distribution Interaction

    Correlation    n      Distribution    Avg. MAE
    0.90           100    Normal          0.144
    0.90           100    Triangular      0.074
    0.90           200    Uniform         0.040
    0.90           200    Normal          0.079
    0.90           200    Triangular      0.051
    0.90           400    Uniform         0.042
    0.90           400    Normal          0.085
    0.90           400    Triangular      0.055

Table A.3.12. Table of Means for Correlation/Observations/Operation Interaction

    Correlation    n      Operation     Avg. MAE
    0.60           200    Quotient      0.063
    0.60           400    Sum           0.037
    0.60           400    Difference    0.066
    0.60           400    Product       0.045
    0.60           400    Quotient      0.059

Table A.3.12. (Continued)

Table A.3.13. Table of Means for Correlation/Distribution/Operation Interaction

Table A.3.14. Table of Means for Observations/Distribution/Operation Interaction

Table A.3.14. (Continued)

    n      Distribution    Operation     Avg. MAE
    400    Normal          Quotient      0.098
    400    Triangular      Sum           0.040
    400    Triangular      Difference    0.046
    400    Triangular      Product       0.036
    400    Triangular      Quotient      0.035

Table A.3.15. Scheffe Post-Hoc Test for Correlation

    0.6    0.9    0.018

Table A.3.16. Scheffe Post-Hoc Test for No. of Observations

Table A.3.17. Scheffe Post-Hoc Test for Distribution

Table A.3.18. Scheffe Post-Hoc Test for Operation

APPENDIX A.4

Source Code Listing/Sample Input File

// Program Name:  CORREST
// Function:      Generate density estimates for functions of two correlated
//                random variables, X and Y. The procedure uses a kernel
//                density estimate (KDE) for the estimation of joint probability
//                density functions and then applies the transformation of
//                variables technique to obtain overall probability distributions
//                for the random variable Z = u(X,Y).
// Version:       1.0
// Date:          February 8, 1997
// Last Revision: May 20, 1997
// Author:        Jeffrey P. Kharoufeh, Industrial and Systems Engineering.
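For reference, the density value computed in the main kernel loop below is consistent with the covariance-scaled multivariate kernel estimator of Chapter 2: with S the sample covariance matrix, h the smoothing parameter, and d = 2, the estimate at a grid point x is

    \hat{f}(\mathbf{x}) = \frac{1}{n\,h^{d}\,|\mathbf{S}|^{1/2}} \sum_{i=1}^{n} K\!\left( \frac{(\mathbf{x}-\mathbf{X}_{i})^{T}\mathbf{S}^{-1}(\mathbf{x}-\mathbf{X}_{i})}{h^{2}} \right),

which accounts for the factor |S|^(-1/2) n^(-1) h^(-d) applied near the end of the loop.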

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <conio.h>

#define pi 3.1415927
#define MAX_POINTS 2401

void sorter(void);
void sort(int, int, int);
void swap(int, int, int);

float S[MAX_POINTS][3];

void main()
{   // Begin main program

    int n;                   // The number of real-world observations
    int d = 2;               // The number of columns (variables) in the experiment
    int no = MAX_POINTS;     // Number of density evaluation points
    int i,j,k,l,m,p,cntz;    // Several counters

    float avgx,avgy;         // Variables for the calculation of covariance
    float varx,vary,varxy;   // Variance of x, variance of y, covariance of xy
    float dets;              // Determinant of covariance matrix
    float h,sum,factor;      // The smoothing parameter
    float a,b,c,D;           // Variables for calculating inverse of the (2x2)
    float tempx,tempy;       // Variables for finding min and max values of set

    float xmin,xmax;         // Minimum and maximum x-values
    float ymin,ymax,hx,hy;   // Minimum and maximum y-values/step size for x and y
    double np,dp;            // Variables used for type-casting
    double hp,detsp;

    char filename[25];       // Name of input file given by user

    float X[400][2];         // Matrix of real observations
    float grid[49][2];       // Contains x-prime and y-prime values
    float f[MAX_POINTS][4];  // Matrix of x, y, z, f(x,y)
    float COV[2][2];         // Will be used for other purposes as well
    float SINV[2][2];        // Inverse of the covariance matrix
    float U[400][2];         // U, V, and W are matrices (or vectors) for
    float V[400][2];         // some intermediate calculations
    float W[400];
    float cdf[MAX_POINTS];   // Vector of distribution function values

    FILE *fp;                // A pointer to a file

    // Get real-world observations and validate input data file
    clrscr();
    printf("                    CORREST\n\n");
    printf("        Density Estimation for Functions of\n");
    printf("           Correlated Random Variables\n\n\n");
    printf("Please enter the name of the input data file (lower case): \n");
    scanf("%s", filename);
    if((fp = fopen(filename,"r")) == NULL) {
        fprintf(stderr,"DATA FILE %s NOT FOUND\n",filename);
        exit(1);
    }

    if(fscanf(fp,"%d\n",&n) == 1) {      // First entry is the number of observations
        for(i=0; i<n; i++)               // Read the n (x,y) observation pairs
            fscanf(fp,"%f %f\n",&X[i][0],&X[i][1]);
    }
    fclose(fp);

    np = (double)n;                      // Type cast for pow()
    dp = (double)d;
    h = 1.77*pow(np,-1.0/(dp+4.0));      // Calculate the optimum window width
    clrscr();
    printf("\n\n");
    printf("Smoothing Parameter For Kernel = %f\n\n",h);
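    // (The constant 1.77 appears to be Silverman's optimal window-width
    // constant A(K) for the Epanechnikov kernel with d = 2, giving
    // h = 1.77*n^(-1/6); see the kernel-selection discussion in Chapter 2.)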

    // Get the covariance matrix
    avgx = 0.0; avgy = 0.0;              // Initialize variables to zero

    for(i=0; i<n; i++) {                 // Accumulate the sums of the observations
        avgx += X[i][0];
        avgy += X[i][1];
    }

    avgx = avgx/n;                       // Sample mean of X
    avgy = avgy/n;                       // Sample mean of Y

    varx = 0.0; vary = 0.0; varxy = 0.0;
    for(i=0; i<n; i++) {                 // Accumulate sums of squared deviations
        varx  += (X[i][0]-avgx)*(X[i][0]-avgx);
        vary  += (X[i][1]-avgy)*(X[i][1]-avgy);
        varxy += (X[i][0]-avgx)*(X[i][1]-avgy);
    }

    varx  = varx/(n-1);                  // Variance of X
    vary  = vary/(n-1);                  // Variance of Y
    varxy = varxy/(n-1);                 // Cov(X,Y)

    COV[0][0] = varx;  COV[0][1] = varxy;
    COV[1][0] = varxy; COV[1][1] = vary;

printf("The Covariance Matrix ishh"); for(i=O;i<2;i++) { for(j=O;j<2;j++) { printf("%f ",COV[i]lj]); 1 printf("\n"); ) I/ End i & j loops

printf("\nfl); dets = ((COV[O][O]*COV[l][l])-(COV[l][O]*COV[O][l])); printf("Determinant of Covariance Matrix = %fh\nV,dets);

    // Invert the covariance matrix to obtain S-inverse (SINV[2][2])
    a = COV[0][0];
    b = COV[0][1];
    c = COV[1][0];
    D = COV[1][1];

    // Check division by zero
    if((a*D - b*c) == 0.0) {
        printf("MATRIX CANNOT BE INVERTED!!!\n\n");
        exit(1);
    }

else {

        SINV[0][0] = D/(a*D-b*c);        // A simple inverter for a 2 x 2 matrix
        SINV[0][1] = b/(b*c-a*D);
        SINV[1][0] = c/(b*c-a*D);
        SINV[1][1] = a/(a*D-b*c);

printf("The Inverse of the Covariance Matrix 1s\n\nn); for(i=O;i<2;i++) { for(j=Od<2;jtt-) { printf("%f ",SINV[i] u]); 1 printf("\nM); ) /I End i & j loops

    }   // End else

    // Create grid of points, G, for evaluation of the KDE.
    // First arrange each column of X in ascending order so the
    // extreme x and y values can be located.
    for(i=0; i<n; i++) {
        for(j=i+1; j<n; j++) {
            if(X[j][0] <= X[i][0]) {
                tempx = X[i][0]; X[i][0] = X[j][0]; X[j][0] = tempx;
            }
            if(X[j][1] <= X[i][1]) {
                tempy = X[i][1]; X[i][1] = X[j][1]; X[j][1] = tempy;
            }
        }   // End inner for loop
    }   // End outer for loop

    xmin = X[0][0];  xmax = X[n-1][0];
    ymin = X[0][1];  ymax = X[n-1][1];
    hx = (xmax-xmin)/50.0;
    hy = (ymax-ymin)/50.0;

    for(i=0; i<49; i++) {                // 49 interior grid points per axis
        grid[i][0] = xmin + (i+1)*hx;
        grid[i][1] = ymin + (i+1)*hy;
    }

    // Write the full 49 x 49 cross product of grid points (2401 = MAX_POINTS)
    // to grid.dat, where the sorting routine expects to find it
    fp = fopen("grid.dat","w");
    for(i=0; i<49; i++)
        for(j=0; j<49; j++)
            fprintf(fp,"%f %f\n",grid[i][0],grid[j][1]);
    fclose(fp);

    sorter();                            // Call sorting routine

    for(i=0; i<no; i++)                  // Initialize f-matrix
        for(j=0; j<4; j++)
            f[i][j] = 0.0;

    if((fp = fopen("sorter.out","r")) == NULL) {
        printf("Data file sorter.out not found\n");
        exit(0);
    }

    // Begin major loop for kernel method
    p = 0;
    while(p < no) {
        fscanf(fp,"%f %f %f\n",&f[p][0],&f[p][1],&f[p][2]);   // Grid point and z = u(x,y)

        for(i=0; i<n; i++) {             // Intermediate quantities per observation
            U[i][0] = f[p][0] - X[i][0];                       // (x - Xi)
            U[i][1] = f[p][1] - X[i][1];
            V[i][0] = SINV[0][0]*U[i][0] + SINV[0][1]*U[i][1]; // S-inverse*(x - Xi)
            V[i][1] = SINV[1][0]*U[i][0] + SINV[1][1]*U[i][1];
            W[i] = (U[i][0]*V[i][0] + U[i][1]*V[i][1])/(h*h);  // Scaled quadratic form
        }

        // Sum kernel values "vertically" to obtain the density estimate value
        f[p][3] = 0.0;
        for(i=0; i<n; i++)
            if(W[i] < 1.0)
                f[p][3] += (2.0/pi)*(1.0 - W[i]);  // Epanechnikov kernel, d = 2

        // Type cast values for function pow(x,y)
        hp = (double)h;
        detsp = (double)dets;
        dp = (double)d;
        np = (double)n;

        f[p][3] = (pow(detsp,-0.5)*pow(np,-1.0)*pow(hp,-dp))*f[p][3];
        p++;
    }   // End major while loop
    fclose(fp);

fp = fopen("f.out","w"); for(i=O; i< no; i++) I/ Convert f(x,y) to "relative frequency" values. sum = 0.0; for(i=O; i

    // Initialize z-vector to 0 (column 0 of f is reused to hold distinct z-values)
    for(i=0; i<no; i++)
        f[i][0] = 0.0;

    // Arrange the vector of possible z-values in ascending order
    f[0][0] = f[0][2];
    cntz = 1;
    for(i=1; i<no; i++) {
        if(f[i][2] != f[cntz-1][0]) {    // z-column is already sorted; keep new values
            f[cntz][0] = f[i][2];
            cntz++;
        }
    }

    // Sum f(x,y) over all (x,y) such that z = u(x,y) to obtain P(Z = z) = h(z)
    for(i=0; i<cntz; i++) {
        f[i][1] = 0.0;                   // Column 1 is reused to hold h(z)
        for(j=0; j<no; j++)
            if(f[j][2] == f[i][0])
                f[i][1] += f[j][3];
    }

    // Calculate CDF values
    cdf[0] = f[0][1];
    for(i=1; i<cntz; i++)
        cdf[i] = cdf[i-1] + f[i][1];

    for(i=0; i<cntz; i++) {              // Guard against round-off above 1.0
        if(cdf[i] > 1.0) cdf[i] = 1.0;
    }

fp = fopen("output.~ut'~~"w'~); fprintf(fp,"Output for file: %s\n",filename); fprintf(fp,"Number of observations: %d\n\nM,n); fprintf(fp,"z \t h(z) \t H(z)\nH); for(i=O; i

}   // End main program
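In the notation of Chapter 3, the two loops above labelled "Sum f(x,y) ..." and "Calculate CDF values" carry out the discrete transformation of variables: for each attainable value z,

    h(z) = \sum_{\{(x,y)\,:\,u(x,y)=z\}} \hat{f}(x,y), \qquad H(z) = \sum_{w \le z} h(w),

so that H is the estimated cumulative distribution function of Z = u(X,Y).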

// Functions for sorting routine

void sorter()
{
    int row = MAX_POINTS;
    int i,j,choice;
    int width = 3;
    int col = 2;

    FILE *fp1;

    if((fp1 = fopen("grid.dat","r")) == NULL)
        printf("Data file not found\n");
    for(i=0; i<row; i++)                 // Read the 2401 grid points into S
        fscanf(fp1,"%f %f\n",&S[i][0],&S[i][1]);
    fclose(fp1);

printf("P1ease select the operation of your choiceh"); printf("by entering the appropriate number:\n\nV); printf(" 1. u(X,Y) = X + Y'm"); printf("2. u(X,Y) = X - Mn"); printf("3. u(X,Y) = X*Y\nU); printf("4. u(X,Y) = X/Mn\nu); case 1: for(i=O; how; i++) S[i][2] = S[i][O]+S[i][l]; break; case 2: for(i=O; icrow; i++) S[i][2] = S[i][O]-S[i][l]; break; case 3 : for(i=O; i

printf("No option selected"); }

    sort(row, width, col);               // Sort rows of S in ascending order of z
    fp1 = fopen("sorter.out","w");
    for(i=0; i<row; i++) {
        for(j=0; j<width; j++)
            fprintf(fp1,"%f ",S[i][j]);
        fprintf(fp1,"\n");
    }
    fclose(fp1);

}   // End sorter

void sort(int row, int width, int col)
{
    int i,j;

    for(i=0; i<row; i++)                 // Exchange sort on column col (the z-values)
        for(j=i+1; j<row; j++)
            if(S[i][col] >= S[j][col])
                swap(i, j, width);

}   // End sort

void swap(int first, int second, int width)
{
    float temp[MAX_POINTS];
    int i;

    for(i=0; i<width; i++) temp[i] = S[first][i];
    for(i=0; i<width; i++) S[first][i] = S[second][i];
    for(i=0; i<width; i++) S[second][i] = temp[i];

}   // End swap

Table A.4.1. Sample Input Data File
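Based on the read loop at the top of main() (an observation count n on the first line, followed by n rows of x- and y-values), an input file would plausibly take the following form; the values shown are illustrative only, not the data used in the experiments:

    200
    0.4271 0.5132
    0.3856 0.4790
    0.5510 0.6024
    ...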