
Basic Concepts of Statistics

In statistics, a population is a set of similar items or events which is of interest for some question or experiment.

A population can be a group of existing objects (e.g. the set of all stars within the Milky Way galaxy) or a hypothetical and potentially infinite group of objects conceived as a generalization from experience (e.g. the set of all possible hands in a game of poker).

A common aim of statistical analysis is to produce information about some chosen population.

A data sample (dataset, sample) is a set of data collected from a statistical population by a defined procedure.

The elements of a sample are known as sample points, units or observations.

The data sample may be drawn from a population

• without replacement (i.e. no element can be selected more than once in the same sample), in which case it is a subset of a population;

• with replacement (i.e. an element may appear multiple times in the one sample), in which case it is a multisubset.

A statistic (singular) or sample statistic is a single measure of some attribute of a sample. It is calculated by applying a function (statistical algorithm) to the values of the items of the dataset. More formally, statistical theory defines a statistic as a function of a sample where the function itself is independent of the unknown estimands; that is, the function is strictly a function of the data. The term statistic is used both for the function and for the value of the function on a given sample.

A statistic is distinct from a statistical parameter, which is not computable in cases where the population is infinite, since it is then impossible to examine and measure all of its items. A statistic, when used to estimate a population parameter, is called an estimator. Samples are collected and statistics are calculated from the samples, so that one can make inferences or extrapolations from the sample to the population.

For example, the arithmetic mean of a sample is called a statistic; its value is frequently used as an estimate of the mean value of all items comprising the population from which the sample is drawn.

However, the population mean is not called a statistic, because it is not obtained from a sample; instead it is called a population parameter, because it is obtained from the whole population.

Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Sample variance (unbiased): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $n$ is the sample size and $x_i$ are the elements of the sample.

A histogram is an estimate of the probability distribution of a continuous variable. To construct a histogram, the first step is to "bin" the range of values, that is, divide the entire range of values into a series of intervals, and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins must be adjacent, and are often (but are not required to be) of equal size.

If the bins are of equal size, a rectangle is erected over each bin with height proportional to the frequency, the number of cases in each bin. A histogram may also be normalized to display "relative" frequencies: it then shows the proportion of cases that fall into each of several categories, with the sum of the heights equaling 1. Histograms give a rough sense of the density of the underlying distribution. The total area of a histogram used for probability density is always normalized to 1.

In statistics, an empirical distribution function (EDF) is the distribution function associated with the empirical measure of a sample. This cumulative distribution function is a step function that jumps up by 1/n at each of the n data points. Its value at any specified value of the measured variable is the fraction of observations of the measured variable that are less than or equal to the specified value.

The empirical distribution function is an estimate of the cdf that generated the points in the sample.

Let $X_1, X_2, \ldots, X_n$ be independent, identically distributed real random variables with the common cdf $F(t)$. Then the EDF is defined as
$$\hat{F}_n(t) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{X_i \le t},$$
where $\mathbf{1}_A$ is the indicator of event $A$. For a fixed $t$, $\hat{F}_n(t)$ is an unbiased estimator for $F(t)$. The mean of the empirical distribution is an unbiased estimator of the population's mean, and the variance of the empirical distribution times $\frac{n}{n-1}$ is an unbiased estimator of the population's variance.
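To make the definition concrete, here is a minimal Python sketch of an EDF (NumPy assumed; the function names are ours, not from the course materials):

```python
import numpy as np

def edf(sample):
    """Return the empirical distribution function of a 1-D sample."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)

    def F_hat(t):
        # Fraction of observations <= t; jumps by 1/n at each data point.
        return np.searchsorted(x, t, side="right") / n

    return F_hat

F = edf([16, 34, 53, 75, 93, 120])
print(F(53))   # 0.5, since 3 of the 6 observations are <= 53
```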

However, since real-life datasets have a limited number of elements and are often prone to fluctuations, determining the appropriate y-axis plotting positions requires some corrections when estimating the underlying distribution.

The most widely used method of determining these values is obtaining the median rank for each data point, as discussed next.

Median Ranks

In reliability theory, the median ranks method is used to obtain an estimate of the unreliability for each failure.

The median rank is the value that the true probability of failure should have at the j-th failure out of a sample of n units, at the 50% confidence level.

The exact median ranks are obtained by solving the following equation for $MR_j$:
$$\sum_{k=j}^{n} \binom{n}{k} MR_j^{\,k} \left(1 - MR_j\right)^{n-k} = \frac{1}{2},$$
where n is the sample size, j is the order number, and $MR_j$ is the median rank for the j-th failure.

A more straightforward and easier method of estimating median ranks is to apply the following formula:
$$MR_j = \frac{1}{1 + \frac{n - j + 1}{j}\, F^{-1}(0.5;\, m,\, k)},$$
where $m = 2(n - j + 1)$, $k = 2j$, and $F^{-1}(0.5;\, m,\, k)$ is the quantile function of the F-distribution with $m$ and $k$ degrees of freedom, evaluated at the 0.5 point.

Another quick, though less accurate, approximation of the median ranks is given by:
$$MR_j = \begin{cases} 1 - 0.5^{1/n}, & j = 1; \\[4pt] \dfrac{j - 0.3175}{n + 0.365}, & j = 2, 3, \ldots, n - 1; \\[4pt] 0.5^{1/n}, & j = n. \end{cases}$$
This formula is known as Filliben's estimate.

Benard's approximation:
$$MR_j \approx \frac{j - 0.3}{n + 0.4}$$
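Both approximations are one-liners in practice; a minimal Python sketch (the function names are ours):

```python
def benard(j, n):
    # Benard's approximation: (j - 0.3) / (n + 0.4)
    return (j - 0.3) / (n + 0.4)

def filliben(j, n):
    # Filliben's estimate: piecewise in the order number j
    if j == 1:
        return 1.0 - 0.5 ** (1.0 / n)
    if j == n:
        return 0.5 ** (1.0 / n)
    return (j - 0.3175) / (n + 0.365)

n = 6
for j in range(1, n + 1):
    print(j, round(benard(j, n), 5), round(filliben(j, n), 5))
```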

Ex.: Assume that six identical units are being reliability tested at the same operational stress levels. All of these units fail during the test after operating the following numbers of hours: 93, 34, 16, 120, 53 and 75. Calculate the median ranks for the data points using the different methods discussed in this section. Compare the results.

First, we should rank the times-to-failure in ascending order as shown next.

Time-to-failure tj, hours | Failure Order Number out of Sample Size of 6
16 | 1
34 | 2
53 | 3
75 | 4
93 | 5
120 | 6

Let's start with the simplest method, Benard's approximation:
$$MR_j = \frac{j - 0.3}{n + 0.4},$$
where n = 6 and j runs from 1 to n. Then

tj, hours | Failure Order Number | Benard's Median Rank
16 | 1 | 0.10937
34 | 2 | 0.26563
53 | 3 | 0.42188
75 | 4 | 0.57813
93 | 5 | 0.73438
120 | 6 | 0.89063

Now we proceed with Filliben’s estimate:

$$MR_j = \begin{cases} 1 - 0.5^{1/n}, & j = 1; \\[4pt] \dfrac{j - 0.3175}{n + 0.365}, & j = 2, 3, \ldots, n - 1; \\[4pt] 0.5^{1/n}, & j = n. \end{cases}$$

tj, hours | Failure Order Number | Filliben's Median Rank
16 | 1 | 0.1091
34 | 2 | 0.26434
53 | 3 | 0.42145
75 | 4 | 0.57855
93 | 5 | 0.73566
120 | 6 | 0.8909

Next, we compute median ranks using the formula with the F-distribution:
$$MR_j = \frac{1}{1 + \frac{n - j + 1}{j}\, qF(0.5,\, m,\, k)}, \qquad m = 2(n - j + 1), \quad k = 2j.$$
The values of the quantile function (inverse cdf) of the F-distribution can be determined with the Mathcad function qF.

Order Number | m | k | qF(0.5, m, k)
1 | 12 | 2 | 1.36097
2 | 10 | 4 | 1.11257
3 | 8 | 6 | 1.02975
4 | 6 | 8 | 0.97111
5 | 4 | 10 | 0.89882
6 | 2 | 12 | 0.73477

Then median ranks are:

tj, hours | Failure Order Number | Median Rank
16 | 1 | 0.1091
34 | 2 | 0.26445
53 | 3 | 0.42141
75 | 4 | 0.57859
93 | 5 | 0.73555
120 | 6 | 0.8909

Finally, we obtain exact median ranks by solving the cumulative binomial equation:

$$\sum_{k=j}^{n} \binom{n}{k} MR_j^{\,k} \left(1 - MR_j\right)^{n-k} = \frac{1}{2}$$
For j = 1 the left side of the equation can be obtained with Mathcad:

However, the equation itself is of no particular interest to us. We can obtain its solution as follows:

As you can see, only one root satisfies the conditions for a median rank: a real number between 0 and 1.
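For readers without Mathcad, the same roots are easy to obtain numerically. Below is a minimal Python sketch with SciPy; it also checks the known equivalence of the exact median rank with the median of a Beta(j, n − j + 1) distribution:

```python
from scipy.optimize import brentq
from scipy.stats import binom, beta

def exact_median_rank(j, n):
    # Root of: P(at least j failures out of n at probability p) = 1/2
    f = lambda p: binom.sf(j - 1, n, p) - 0.5
    return brentq(f, 1e-12, 1.0 - 1e-12)

n = 6
for j in range(1, n + 1):
    mr = exact_median_rank(j, n)
    # Same value as the median of a Beta(j, n - j + 1) distribution
    assert abs(mr - beta.ppf(0.5, j, n - j + 1)) < 1e-9
    print(j, round(mr, 5))   # 0.1091, 0.26445, 0.42141, ...
```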

If we continue with this algorithm for all j’s, we get:

tj, hours | Failure Order Number | Median Rank
16 | 1 | 0.1091
34 | 2 | 0.26445
53 | 3 | 0.42141
75 | 4 | 0.57859
93 | 5 | 0.73555
120 | 6 | 0.8909

Now we compare the values of the median ranks obtained by the different methods:

Benard | Filliben | F-Distr. | Binom.
0.10937 | 0.1091 | 0.1091 | 0.1091
0.26563 | 0.26434 | 0.26445 | 0.26445
0.42188 | 0.42145 | 0.42141 | 0.42141
0.57813 | 0.57855 | 0.57859 | 0.57859
0.73438 | 0.73566 | 0.73555 | 0.73555
0.89063 | 0.8909 | 0.8909 | 0.8909

As you can see, obtaining median ranks with the F-distribution gives the same result as the exact median ranks. Filliben's and Benard's formulae provide results close to the exact ones, with Benard's being slightly less accurate.

Note that median ranks are not a replacement for the EDF, since they serve different purposes.

Time-to-failure tj, hours | EDF | Median Rank
16 | 0.16667 | 0.1091
34 | 0.33333 | 0.26445
53 | 0.5 | 0.42141
75 | 0.66667 | 0.57859
93 | 0.83333 | 0.73555
120 | 1 | 0.8909

Pseudo-Random Number Sampling

Quite often in statistics and simulation we need to obtain samples of random numbers distributed according to a given probability distribution. Modern mathematical software is equipped with a pseudo-random number generator producing uniformly distributed samples. To generate samples drawn from other distributions we need to resort to pseudo-random number sampling techniques.

The most common technique is inverse transform sampling (Smirnov transform, inverse transformation method ).

Inverse transformation sampling takes uniform samples of a number u between 0 and 1, interpreted as a probability, and then returns the largest number x from the domain of the distribution such that $P(-\infty < X \le x) \le u$.

Computationally, this method involves computing the quantile function of the distribution, in other words, computing the cumulative distribution function (cdf) of the distribution and then inverting that function. For a continuous distribution we need to integrate the probability density function (pdf) of the distribution or to obtain the quantile function in explicit form, which is impossible to do analytically for most distributions (including the normal distribution).

As a result, this method may be computationally inefficient for many distributions and other methods are preferred.

On the other hand, it is possible to approximate the quantile function of the normal distribution extremely accurately using moderate-degree polynomials, and in fact the method of doing this is fast enough that inversion sampling is now the default method for sampling from a normal distribution in the statistical package R.

Let X be a random variable whose distribution can be described by the cdf $F_X$. We want to generate values of X which are distributed according to this distribution. The inverse transform sampling method works as follows:
• Generate a random number u from the standard uniform distribution on the interval [0, 1].
• Find the inverse of the desired cdf, $F_X^{-1}(u)$.
• Compute $X = F_X^{-1}(u)$. The computed random variable X has distribution $F_X(x)$.

Ex.: Suppose we have a random variable $U \sim \mathrm{Unif}(0, 1)$ and the cdf of the Weibull distribution
$$F(x) = 1 - e^{-(x/\lambda)^k}.$$
In order to perform an inversion we want to solve $F(x) = u$ for $x$:

$$u = 1 - e^{-(x/\lambda)^k}$$

$$1 - u = e^{-(x/\lambda)^k}$$

$$\left(\frac{x}{\lambda}\right)^k = -\ln(1 - u)$$

$$x = \lambda \cdot \left(-\ln(1 - u)\right)^{1/k}$$
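A minimal Python sketch of this inversion (the parameter values are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def weibull_inverse_transform(size, lam, k):
    # u ~ U(0, 1); invert the Weibull cdf F(x) = 1 - exp(-(x/lam)^k)
    u = rng.uniform(0.0, 1.0, size=size)
    return lam * (-np.log(1.0 - u)) ** (1.0 / k)

sample = weibull_inverse_transform(10_000, lam=100.0, k=1.5)
print(sample.mean())   # close to lam * Gamma(1 + 1/k), about 90.3 here
```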

The Box–Muller transform is a pseudo-random number sampling method for generating pairs of independent, standard, normally distributed random numbers, given a source of uniformly distributed random numbers. The Box–Muller transform is commonly expressed in two forms. The basic form takes two samples from the uniform distribution on the interval [0, 1] and maps them to two standard, normally distributed samples. The polar form (Marsaglia polar method) takes two samples from a different interval, [−1, +1], and maps them to two normally distributed samples without the use of sine or cosine functions.

Box–Muller transform:

Suppose u1 and u2 are independent random numbers chosen from the uniform distribution on the unit interval (0, 1).

Let
$$z_1 = \sqrt{-2\ln u_1}\,\cos(2\pi u_2) \quad \text{and} \quad z_2 = \sqrt{-2\ln u_1}\,\sin(2\pi u_2).$$

Then z1 and z2 are independent random variables with a standard normal distribution.

Marsaglia polar method: Given x and y, independent and uniformly distributed in the closed interval [−1, +1], set $s = x^2 + y^2$. If $0 < s < 1$, then
$$z_1 = x \sqrt{\frac{-2\ln s}{s}} \quad \text{and} \quad z_2 = y \sqrt{\frac{-2\ln s}{s}}$$
are independent random variables with a standard normal distribution.
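Both transforms are compact enough to state in code; a minimal Python sketch (the function names are ours):

```python
import math
import random

def box_muller():
    # Basic form: two U(0,1) samples -> two independent N(0,1) samples.
    # 1 - random() lies in (0, 1], which keeps the logarithm finite.
    u1 = 1.0 - random.random()
    u2 = random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

def marsaglia_polar():
    # Polar form: draw (x, y) uniformly from [-1, 1] x [-1, 1] and reject
    # pairs that fall outside the open unit disk (or exactly at the origin).
    while True:
        x = random.uniform(-1.0, 1.0)
        y = random.uniform(-1.0, 1.0)
        s = x * x + y * y
        if 0.0 < s < 1.0:
            factor = math.sqrt(-2.0 * math.log(s) / s)
            return x * factor, y * factor
```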

Given a sample drawn from standard normal distribution, we can obtain a normally distributed sample with parameters μ and σ by the following equation:

$$X = \mu + \sigma Z,$$
where $X \sim N(\mu, \sigma)$ and $Z \sim N(0, 1)$.

Once the samples of the components' failure times are generated, we can obtain the sample of the system failures, provided that the system configuration is known. For example, if a series system consists of m components and for each of them a failure time sample $X_i$ ($i = 1..m$) of size n was generated, then we can get the sample Y of the system's failure times as follows:
$$y_j = \min_{i = 1..m} x_{i,j}, \qquad j = 1..n.$$

For the parallel hot redundancy system of m components we get:
$$y_j = \max_{i = 1..m} x_{i,j},$$
and for the cold standby redundancy system of m components:
$$y_j = \sum_{i=1}^{m} x_{i,j}.$$

Ex.: For the system with RBD shown below provide an algorithm of generating the sample of n system failures.

We assume the failure times of components 1 – 5 are collected into samples X1 – X5, respectively.

$$y_j = \min\!\left(x_{1,j},\ \max\!\left(x_{2,j},\, x_{3,j}\right),\ \min\!\left(x_{4,j},\, x_{5,j}\right)\right), \qquad j = 1..n$$
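A minimal Python sketch of combining component samples into a system sample (the exponential component lifetimes and their parameter are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n = 1000

# Hypothetical failure-time samples for components 1..5
X1, X2, X3, X4, X5 = (rng.exponential(scale=500.0, size=n) for _ in range(5))

# Series combination of: component 1, the parallel pair (2, 3),
# and the pair (4, 5)
Y = np.minimum.reduce([X1, np.maximum(X2, X3), np.minimum(X4, X5)])
print(Y.mean())
```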

Homework assignment (2)

For a given distribution $F(t, \Theta)$ ($\Theta$ is a vector of parameters):

a. choose parameters so that the distribution's mean equals ≈1500;
b. determine the variance (or standard deviation) of the distribution with the selected parameters;
c. find the quantile function;
d. write a Mathcad program which generates a sample of n random numbers distributed with $F(t, \Theta)$;

Homework assignment (2)

e. generate samples with n = 80, 150 and 500 elements; save them into separate Excel files;
f. for each sample: retrieve the dataset from file; find the sample mean and variance (or standard deviation);
g. compare the obtained results with the distribution mean and variance.

Homework assignment (2)

# | Distribution*
1 | Kw-E
2 | GCEG
3 | GW
4 | ECEG
5 | EW
6 | CWG

The CDF $F(t, \Theta)$ and parameter constraints for each distribution are given in the assignment variant table.

Homework assignment (2)

• Kw-E – Kumaraswamy-Exponential Distribution
• GCEG – Generalized Complementary Exponential-Geometric Distribution
• GW – Generalized Weibull Distribution
• ECEG – Exponentiated Complementary Exponential-Geometric Distribution
• EW – Exponentiated Weibull Distribution
• CWG – Complementary Weibull-Geometric Distribution

Parameter Estimation

Estimation theory is a branch of statistics that deals with estimating the values of parameters based on measured empirical data that has a random component. The parameters describe an underlying physical setting in such a way that their value affects the distribution of the measured data. An estimator attempts to approximate the unknown parameters using the measurements.

The method of least squares (least squares estimation, LSE) is a standard approach to approximating the solution of overdetermined systems, i.e. sets of equations in which there are more equations than unknowns. "Least squares" means that the overall solution minimizes the sum of the squares of the residuals made in the results of every single equation.

Least-squares problems fall into two categories:
• linear (ordinary) least squares;
• nonlinear least squares,
depending on whether or not the residuals are linear in all unknowns. The linear least-squares problem has a closed-form solution. The nonlinear problem is usually solved by iterative refinement; at each iteration the system is approximated by a linear one, and thus the core calculation is similar in both cases.

The objective consists of adjusting the parameters of a model function to best fit a data set. A simple data set consists of n data pairs $(x_i, y_i)$, $i = 1, \ldots, n$. The model function has the form $f(x, \Theta)$, where the m adjustable parameters are held in the vector $\Theta$. The goal is to find the parameter values $\theta_j$, $j = 1, \ldots, m$, for the model that "best" fits the data.

The fit of a model to a data point is measured by its residual, defined as the difference between the actual value of the dependent variable and the value predicted by the model:

$$r_i(\Theta) = y_i - f(x_i, \Theta)$$
The LSE method finds the optimal parameter values by minimizing the sum of squared residuals:
$$S(\Theta) = \sum_{i=1}^{n} r_i(\Theta)^2$$

Ex.: Assume that five identical units are being reliability tested. The units fail during the test after operating the following number of hours: 20, 275, 365, 415, and 1020. Assuming that the data follow exponential distribution, estimate the value of the parameter λ.

Here, the vector $\Theta$ contains a single element, $\lambda$, and $F(t, \Theta) = 1 - e^{-\lambda t}$.

The naïve approach is to minimize the sum of squared residuals as previously specified:
$$S(\Theta) = \sum_{i=1}^{n}\left[F(t_i, \Theta) - MR_i\right]^2 = \sum_{i=1}^{n}\left[1 - e^{-\lambda t_i} - MR_i\right]^2,$$
where $MR_i$ are median ranks.

ti | MRi
20 | 0.129
275 | 0.314
365 | 0.5
415 | 0.686
1020 | 0.871

However, this would lead to a nonlinear equation with respect to Θ. We can avoid it by linearizing the cdf:

$$F(t, \Theta) = 1 - e^{-\Theta t} \;\Rightarrow\; 1 - F(t, \Theta) = e^{-\Theta t}$$

$$\ln\left[1 - F(t, \Theta)\right] = -\Theta t \;\Rightarrow\; y \equiv \ln\left[1 - F(t, \Theta)\right] = \ln(1 - MR), \quad x \equiv t, \quad k \equiv -\Theta, \quad b \equiv 0$$

Then the sum of squared residuals is
$$S(k) = \sum_{i=1}^{n}\left(y_i - k x_i\right)^2 = \sum_i y_i^2 - 2k\sum_i x_i y_i + k^2 \sum_i x_i^2.$$

To find the minimum of $S$ we set $\frac{dS}{dk} = 0$:

$$-2\sum_i x_i y_i + 2k\sum_i x_i^2 = 0$$

$$k = \frac{\sum_i x_i y_i}{\sum_i x_i^2}, \qquad \hat\lambda = -k.$$

xi | xi² | MRi | yi | xi·yi
20 | 400 | 0.129 | −0.138 | −2.762
275 | 75625 | 0.314 | −0.377 | −103.641
365 | 133225 | 0.5 | −0.693 | −252.999
415 | 172225 | 0.686 | −1.158 | −480.72
1020 | 1040400 | 0.871 | −2.048 | −2088.902
Σ | 1421875 | | | −2929.024

$$\hat\lambda = -k = \frac{2929.024}{1421875} = 2.06 \times 10^{-3}$$
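The whole computation fits in a few lines; a minimal Python sketch with SciPy that reproduces λ ≈ 2.06·10⁻³ (exact median ranks are used, matching the MRi values in the table above):

```python
import numpy as np
from scipy.stats import beta

t = np.array([20.0, 275.0, 365.0, 415.0, 1020.0])
n = len(t)
j = np.arange(1, n + 1)
mr = beta.ppf(0.5, j, n - j + 1)        # exact median ranks

y = np.log(1.0 - mr)                    # linearized cdf: y = -lambda * t
lam = -np.sum(t * y) / np.sum(t * t)    # slope of a line through the origin
print(lam)                              # ~2.06e-3
```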

The LSE method is quite good for functions that can be linearized. For such distributions the calculations are relatively easy and straightforward, with closed-form solutions that readily yield an answer without resorting to numerical techniques or tables.

LSE is generally best used with data sets containing complete data, that is, data consisting only of single times-to-failure with no censored or interval data.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model so that the observed data is most probable.

Specifically, this is done by finding the value of the parameter (or parameter vector) $\Theta$ that maximizes the likelihood function $\mathcal{L}(\Theta)$, which is the joint probability (or probability density) of the observed data, over the parameter space.

The point Θ that maximizes the likelihood function is called the maximum likelihood estimate .

The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of inference within much of the quantitative research of the social and medical sciences and engineering.

Consider X, a continuous random variable with pdf
$$f(x;\, \theta_1, \theta_2, \ldots, \theta_k),$$
where $\theta_1, \theta_2, \ldots, \theta_k$ are k unknown parameters which need to be estimated from N independent observations $x_1, x_2, \ldots, x_N$, which in the case of life data analysis correspond to failure times. The likelihood function is given by:

$$\mathcal{L}(\Theta) = \prod_{i=1}^{N} f(x_i, \Theta)$$

In practice, it is often convenient to work with the natural logarithm of the likelihood function, called the logarithmic likelihood (log-likelihood ) function :

$$\Lambda(\Theta) = \ln \mathcal{L}(\Theta) = \sum_{i=1}^{N} \ln f(x_i, \Theta)$$
The maximum likelihood estimates of the parameters $\theta_1, \theta_2, \ldots, \theta_k$ are obtained by maximizing $\mathcal{L}(\Theta)$ or $\Lambda(\Theta)$:
$$\frac{\partial \Lambda}{\partial \theta_j} = 0, \qquad j = 1, 2, \ldots, k$$

Ex.: Assume that five identical units are being reliability tested. The units fail during the test after operating the following number of hours: 20, 275, 365, 415, and 1020. Assuming that the data follow exponential distribution, estimate the value of the parameter λ.

Here, the vector $\Theta$ contains a single element, $\lambda$, and $f(t, \Theta) = \Theta\, e^{-\Theta t}$.

The log-likelihood function:
$$\Lambda(\Theta) = \sum_{i=1}^{5} \ln\left(\Theta e^{-\Theta t_i}\right) = 5\ln\Theta - \Theta \sum_{i=1}^{5} t_i$$
Substituting the failure times for $t_i$, we get:

$$\Lambda(\Theta) = 5\ln\Theta - 2095\,\Theta$$
$$\frac{d\Lambda}{d\Theta} = \frac{5}{\Theta} - 2095 = 0 \;\Rightarrow\; \hat\Theta = \frac{5}{2095} = 2.39 \times 10^{-3}$$

$$\hat\lambda = 2.39 \times 10^{-3}$$
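A minimal Python check of this closed-form result:

```python
import numpy as np

t = np.array([20.0, 275.0, 365.0, 415.0, 1020.0])
# For the exponential distribution the MLE has a closed form:
# dL/d(lambda) = n/lambda - sum(t) = 0  =>  lambda = n / sum(t)
lam = len(t) / t.sum()
print(lam)   # 5 / 2095 ~ 2.39e-3
```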

Analyzing the results of the two previous examples, you should notice that the parameter estimates differ from one another, and we cannot say outright which result is better. We can evaluate the residual sum of squares (RSS) or the mean squared error (MSE):

$$RSS(\Theta) = \sum_{i=1}^{n}\left[F(t_i, \Theta) - MR_i\right]^2, \qquad MSE(\Theta) = \frac{1}{n}\, RSS(\Theta)$$

We can also determine the likelihood of either result by calculating $\Lambda(\hat\Theta)$ or $-2\Lambda(\hat\Theta)$, the metric used in various statistical quality tests. Let's compare these metrics for the results of LSE and MLE:

λ | RSS | −2Λ
2.06×10⁻³ (LSE) | 0.035 | 70.482
2.39×10⁻³ (MLE) | 0.047 | 70.379

The results here are quite expected: $\hat\lambda_{LSE}$ is the value for which RSS is minimal, so any other parameter value yields a greater RSS. Likewise, $\hat\lambda_{MLE}$ is the value that maximizes $\Lambda(\Theta)$ (and minimizes $-2\Lambda(\Theta)$).

Homework assignment (3)

For the samples generated in Homework assignment (2):
a. find least squares estimates of the parameters of the distribution $F(t, \Theta)$ according to your variant in HA(2);
b. find maximum likelihood estimates of the parameters of the distribution $F(t, \Theta)$ according to your variant in HA(2).

Censoring

Statistical models rely extensively on data to make predictions. In life data analysis, the models are the statistical distributions and the data are the life data or times-to-failure data of our product. The accuracy of any prediction is directly proportional to the quality, accuracy and completeness of the supplied data. Good data, along with the appropriate model choice, usually results in good predictions. Bad or insufficient data will almost always result in bad predictions.

In the analysis of life data, we want to use all available data sets, which sometimes are incomplete or include uncertainty as to when a failure occurred.

Life data can therefore be separated into two types:
• complete data (all information is available);
• censored data (some of the information is missing).

Complete data means that the value of each sample unit is observed or known. In the case of life data analysis, our data set (if complete) would be composed of the times-to-failure of all units in our sample. For example, if we tested five units and they all failed (and their times-to-failure were recorded), we would then have complete information as to the time of each failure in the sample.

In many cases, all of the units in the sample may not have failed (i.e., the event of interest was not observed) or the exact times-to-failure of all the units are not known. This type of data is commonly called censored data. There are three types of possible censoring schemes:
• right censored;
• interval censored;
• left censored.

The most common case of censoring is what is referred to as right censored data, or suspended data. In this case the data set contains information on units that did not fail. For example, if we tested five units and only three had failed by the end of the test, we would have right censored data (or suspension data) for the two units that did not fail.

The term “right censored” implies that the event of interest (i.e., the time-to-failure) is to the right of our data point.

The second type of censoring is commonly called interval censored data. Interval censored data reflects uncertainty as to the exact times the units failed within an interval.

This type of data frequently comes from tests or situations where the objects of interest are not constantly monitored.

For example, if we are running a test on five units and inspecting them every 100 hours, we only know that a unit failed or did not fail between inspections.

Specifically, if we inspect a certain unit at 100 hours and find it operating, and then perform another inspection at 200 hours to find that the unit is no longer operating, then the only information we have is that the unit failed at some point in the interval between 100 and 200 hours.

This type of censored data is also called inspection data by some authors.

It is generally recommended to avoid interval censored data because they are less informative compared to complete data. However, there are cases when interval data are unavoidable due to the nature of the product, the test and the test equipment.

In those cases, caution must be taken to set the inspection intervals to be short enough to observe the spread of the failures.

The third type of censoring is similar to the interval censoring and is called left censored data. In left censored data, a failure time is only known to be before a certain time. For instance, we may know that a certain unit failed sometime before 100 hours but not exactly when. In other words, it could have failed any time between 0 and 100 hours.

This is identical to interval censored data in which the starting time for the interval is zero.

Additionally, we can specify another pair of censoring schemes: Type I censoring occurs if an experiment stops at a predetermined time, at which point any items remaining are right-censored. Type II censoring occurs if an experiment stops when a predetermined number of failures is observed; the remaining subjects are then right-censored.

MLE for Right Censored Data

When performing maximum likelihood analysis on data with suspended items, the likelihood function needs to be expanded to take into account the information on suspended items, namely, that at the moments when observations were terminated these items were still operational (had not failed). Beyond that, the method of solving for the parameter estimates remains the same.

For example, consider a distribution where X is a continuous random variable with pdf $f(x, \Theta)$ and cdf $F(x, \Theta)$, where $\Theta$ is the vector of unknown parameters which need to be estimated from K observed failures at $T_1, T_2, \ldots, T_K$ and M suspensions at $S_1, S_2, \ldots, S_M$. Then the log-likelihood function is formulated as follows:

$$\Lambda(\Theta) = \sum_{i=1}^{K} \ln f(T_i, \Theta) + \sum_{j=1}^{M} \ln\left[1 - F(S_j, \Theta)\right]$$

MLE for Interval and Left Censored Data

The inclusion of left and interval censored data in an MLE solution for parameter estimates involves adding a term to the likelihood equation to account for the data types in question. When using interval data, it is assumed that the failures occurred in an interval, i.e., in the interval from time A to time B (or from time 0 to time B if left censored), where A < B.

In the case of interval data, given P interval observations $(A_j, B_j)$, $j = 1, \ldots, P$, the log-likelihood function is modified by adding an extra term:

$$\sum_{j=1}^{P} \ln\left[F(B_j, \Theta) - F(A_j, \Theta)\right]$$
Note that for left censored data $A_j = 0$ and $F(0, \Theta) = 0$.

The Complete Log-Likelihood Function

After including the terms for the different types of data, the log-likelihood function can now be expressed in its complete form:

$$\Lambda(\Theta) = \sum_{i=1}^{K} \ln f(T_i, \Theta) + \sum_{j=1}^{M} \ln\left[1 - F(S_j, \Theta)\right] + \sum_{l=1}^{P} \ln\left[F(B_l, \Theta) - F(A_l, \Theta)\right]$$

Note that a suspension at time S can be represented in interval form with $(A, B) = (S, \infty)$. Taking this into account, all cases of censoring narrow down to the interval type:

$$\Lambda(\Theta) = \sum_{i=1}^{K} \ln f(T_i, \Theta) + \sum_{j=1}^{P} \ln\left[F(B_j, \Theta) - F(A_j, \Theta)\right],$$
where K is the number of observed failure times $T_i$, and P is the number of interval observations $(A_j, B_j)$, including right and left censored data.

Ex.: Assume that 15 identical units are being reliability tested. The observations began at 100 hours, by which time 3 items had already failed. Then the observations were interrupted for the period of (500; 600) hours, during which 2 items failed. Lastly, the observations were terminated at 1000 hours with 1 item still operational. Assuming that the data follow the exponential distribution, estimate the value of the parameter λ.

The failure data is tabulated below (L: left censored, F: observed failure, I: interval censored, R: right censored / suspension):

item | mode | time 1 | time 2
1 | L | 0 | 100
2 | L | 0 | 100
3 | L | 0 | 100
4 | F | 170 |
5 | F | 250 |
6 | F | 450 |
7 | F | 480 |
8 | I | 500 | 600
9 | I | 500 | 600
10 | F | 620 |
11 | F | 650 |
12 | F | 720 |
13 | F | 790 |
14 | F | 800 |
15 | R | 1000 | ∞

Nine failures were observed directly, and the rest are presented as interval censored data.

In order to perform MLE in Mathcad, first, we should prepare the failure data:

Then, specifying the pdf, cdf, guess value and log-likelihood function, we get:
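Since the Mathcad worksheet itself is not reproduced here, the following is a minimal Python sketch of the same estimation (exponential model; the data come from the table above, with left-censored and suspended points written as intervals):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Directly observed failure times (mode F)
T = np.array([170.0, 250.0, 450.0, 480.0, 620.0, 650.0, 720.0, 790.0, 800.0])
# Interval observations (A, B): left censoring as (0, B),
# right censoring (suspension) as (S, inf)
AB = [(0.0, 100.0)] * 3 + [(500.0, 600.0)] * 2 + [(1000.0, np.inf)]

def neg_log_likelihood(lam):
    ll = np.sum(np.log(lam) - lam * T)          # failure terms: ln f(T)
    for a, b in AB:
        Fa = 1.0 - np.exp(-lam * a)
        Fb = 1.0 - np.exp(-lam * b) if np.isfinite(b) else 1.0
        ll += np.log(Fb - Fa)                   # interval terms: ln[F(B) - F(A)]
    return -ll

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1e-1), method="bounded")
print(res.x)   # MLE of lambda
```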

A Note about Complete and Suspension Data

Depending on the event that we want to measure, data type classification (i.e., complete or suspension) can be open to interpretation.

For example, under certain circumstances, and depending on the question one wishes to answer, an item that has failed might be classified as a suspension for analysis purposes.

To illustrate this, consider the following times-to-failure data for a product that can fail due to modes A, B and C:

If the objective of the analysis is to determine the probability of failure of the product, regardless of the mode responsible for the failure, we would analyze the data with all data entries classified as failures (complete data).

However, if the objective of the analysis is to determine the probability of failure of the product due to Mode A only, we would then choose to treat failures due to Modes B or C as suspension (right censored) data.

Homework assignment (4)

For the sample of 150 elements generated in Homework assignment (2):

• select arbitrary suspension times (right censoring) so that 10% (20%, 30%) of data points were censored. Make sure that there are at least 3 different suspension groups;

• select arbitrary times of left censoring so that 10% (20%, 30%) of data points were censored. Make sure that there are at least 3 different left-censored groups;

• select arbitrary censoring time intervals so that 10% (20%, 30%) of data points were censored. Make sure that there are at least 3 different interval-censored groups;

Homework assignment (4)

• select arbitrary censoring times so that 10% (20%, 30%) of data points were right, left and interval censored. Make sure that there are at least 3 different censored groups;

• with each censored sample find maximum likelihood estimates of parameters for the distribution , Θ according to your variant in HA(2);

• analyze the effect of censoring on the results of parameter estimation (compared to the complete data).

Model Selection

Model selection is the task of selecting a statistical model from a set of candidate models, given data. Given candidate models of similar predictive or explanatory power, the simplest model is most likely to be the best choice (Occam's razor). In its most basic forms, model selection is one of the fundamental tasks of scientific inquiry. Determining the principle that explains a series of observations is often linked directly to a mathematical model predicting those observations.

The mathematical approach of model selection decides among a set of candidate models ; this set must be chosen by the researcher.

Once the set of candidate models has been chosen, the statistical analysis allows selecting the best of these models.

What is meant by “best” is controversial.

A good model selection technique will balance goodness of fit with simplicity. More complex models will be better able to adapt their shape to fit the data (for example, a fifth-order polynomial can exactly fit six points).

However, the additional parameters may not represent anything useful. (Perhaps those six points are really just randomly distributed about a straight line.)

The most straightforward technique of model selection is to prefer the model with maximum likelihood (log-likelihood) .

If $\Lambda(\hat\Theta_1)$ and $\Lambda(\hat\Theta_2)$ are the values of the log-likelihood functions for two candidate models with estimated parameter vectors $\hat\Theta_1$ and $\hat\Theta_2$, then the model with the greater log-likelihood value should be preferred. For various reasons, in statistics we usually evaluate $-2\Lambda(\hat\Theta)$ and select the model with the minimal score.

Though this technique accounts for goodness of fit, it doesn't take model complexity into account. In order to do that, the Akaike information criterion (AIC) was introduced, which is an estimator of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models.

Suppose that we have a statistical model of some data. Let k be the number of estimated parameters in the model.

Let Λ be the maximum value of the log-likelihood function for the model. Then the AIC value of the model is the following:

$$AIC = 2k - 2\Lambda$$

Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. Thus, AIC rewards goodness of fit (as assessed by the likelihood function), but it also includes a penalty that is an increasing function of the number of estimated parameters. The penalty discourages overfitting, because increasing the number of parameters in the model almost always improves the goodness of the fit.

Note that AIC tells us nothing about the absolute quality of a model, only the quality relative to other models.

Thus, if all the candidate models fit poorly, AIC will not give any warning of that.

Hence, after selecting a model via AIC, it is usually good practice to validate the absolute quality of the model.

Various developments of the AIC were introduced later. Based on ideas from statistics and information theory, researchers suggested different penalties for the number of parameters. Bayesian information criterion (BIC):

$$BIC = k \ln n - 2\Lambda$$
Hannan–Quinn information criterion (HQC):

$$HQC = 2k \ln(\ln n) - 2\Lambda,$$
where n is the sample size.
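A minimal Python sketch computing all three criteria from the maximized log-likelihood (the function name is ours):

```python
import math

def information_criteria(log_likelihood, k, n):
    """AIC, BIC and HQC from the maximized log-likelihood,
    k estimated parameters and sample size n."""
    aic = 2 * k - 2 * log_likelihood
    bic = k * math.log(n) - 2 * log_likelihood
    hqc = 2 * k * math.log(math.log(n)) - 2 * log_likelihood
    return aic, bic, hqc
```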

Another approach to model selection is to apply various statistical goodness-of-fit tests.

Generally, they are performed in order to validate the model (as a part of hypothesis testing), i.e. to determine whether the model is applicable to the given data.

However, these tests can be applied to assess the relative quality of models.

Cramér–von Mises Test

Let $x_1, x_2, \ldots, x_n$ be the observed values, in increasing order, and let $F$ be the CDF of the model under test. Then the statistic is:

$$CM = \frac{1}{12n} + \sum_{i=1}^{n}\left[\frac{2i - 1}{2n} - F(x_i)\right]^2$$
The smaller the value of CM, the better the quality of the model.

Anderson–Darling Test

Let $x_1, x_2, \ldots, x_n$ be the observed values, in increasing order, and let $F$ be the CDF of the model under test. Then the statistic is:

$$A^2 = -n - \sum_{i=1}^{n} \frac{2i - 1}{n}\left[\ln F(x_i) + \ln\left(1 - F(x_{n+1-i})\right)\right]$$
Again, a smaller value of $A^2$ indicates a better model.
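Both statistics are straightforward to evaluate once a model cdf has been fitted; a minimal Python sketch (the function names are ours; cdf is any callable fitted model cdf):

```python
import numpy as np

def cramer_von_mises(x, cdf):
    # x: observed values (any order); cdf: model cdf under test
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    return 1.0 / (12 * n) + np.sum(((2 * i - 1) / (2 * n) - cdf(x)) ** 2)

def anderson_darling(x, cdf):
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    F = cdf(x)
    # F[::-1] supplies F(x_{n+1-i}) for each i
    return -n - np.sum((2 * i - 1) / n * (np.log(F) + np.log(1.0 - F[::-1])))
```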

Homework assignment (5)

Consider a system with the RBD assigned by your variant, where all items are statistically identical with failure probability $F(t, \Theta)$ from your HA(2). Assuming the MTTF of a single item equals T hours, generate a sample of the system failure times (n = 200).

Homework assignment (5)

Based on the techniques mentioned above, select the best model describing your system failure data among the following:
a. $F(t, \Theta)$ from your HA(2) (candidate model 1);
b. candidate model 2;
c. 2-parameter Weibull.

Homework assignment (5)

You are recommended to follow these steps:
• Select the parameters of $F(t, \Theta)$ so that the MTTF of a single item is T hours;
• Generate the necessary number of samples of items' failure times;
• Obtain a sample of system failure times according to your RBD;
• Find the sample mean and sample variance;

Homework assignment (5)

• Save the obtained sample into an Excel file;
• For each candidate model (in a separate Mathcad document):
  – reopen the sample of system failure times;
  – perform MLE to estimate the parameters of the candidate model;
  – find the mean and variance according to the model;
  – evaluate $-2\Lambda$, AIC, BIC, HQC, CM, $A^2$;
• Compare the models using all the obtained criteria.

Homework assignment (5)

Variant | RBD | F(t, Θ) | T, hrs | Model 1 | Model 2 | Model 3
Azeyeva / Barmina | 3Hot | Kw-E | 800 | Kw-E | GCEG | Weibull
Breinert | 2Cold | Kw-E | 2000 | Kw-E | GW | Weibull
Izotova / Belikov | 2Hot | GCEG | 1000 | GCEG | GW | Weibull
Kirillov | 3Cold | GCEG | 800 | GCEG | ECEG | Weibull
Larchenko / Vasin | 2/3 | GW | 2500 | GW | ECEG | Weibull
Matyushin | 3Hot | GW | 1000 | GW | EW | Weibull
Polnikov / Gorbun | 2/4 | ECEG | 2000 | ECEG | EW | Weibull
Sergeyeva | 2Cold | ECEG | 1200 | ECEG | CWG | Weibull
Khamrayev / Metalnikov | 2Hot | ECEG | 1200 | ECEG | Kw-E | Weibull
Chernyak | 2/4 | EW | 1000 | EW | CWG | Weibull
Shvetsov / Minko | 2/3 | EW | 800 | EW | Kw-E | Weibull
Yatsenko | 2/4 | CWG | 2500 | CWG | Kw-E | Weibull
Yamkin | 2/3 | CWG | 1000 | CWG | GCEG | Weibull