Introduction to Statistics - Stat 1011

Awol S. Department of Statistics College of Computing & Informatics Haramaya University Dire Dawa, Ethiopia

© 2013/2014

Contents

1 Introduction
  1.1 Some Statistical Terms
  1.2 Definition and Classification of Statistics
    1.2.1 Definitions of Statistics
    1.2.2 Stages in Statistical Investigation
    1.2.3 Classification of Statistics
  1.3 Applications, Uses and Limitations of Statistics
    1.3.1 Applications of Statistics
    1.3.2 Uses of Statistics
    1.3.3 Limitations of Statistics
  1.4 Types of Variables and Measurement Scales
    1.4.1 Variable
    1.4.2 Scales of Measurement

2 Methods of Data Collection and Presentation
  2.1 Types of Data
  2.2 Methods of Data Collection
    2.2.1 Questionnaire
    2.2.2 Secondary data
  2.3 Data Organization
  2.4 Methods of Data Presentation
    2.4.1 Frequency Distributions
    2.4.2 Diagrammatic Display of Data
    2.4.3 Graphical Presentation of Data

3 Measures of Central Tendency
  3.1 Objectives of Measures of Central Tendency
  3.2 Characteristics of a Good Measure of Central Tendency
  3.3 Summation Notation
  3.4 ......
    3.4.1 ......
    3.4.2 ......
    3.4.3 Harmonic Mean
  3.5 ......
  3.6 Other Measures of Location: Quantiles
    3.6.1 Quartiles
    3.6.2 Deciles
    3.6.3 ......
  3.7 ......

4 Measures of Variation, Skewness and Kurtosis
  4.1 Objectives of Measures of Variation
  4.2 Types of Measures of Variation
    4.2.1 Range and Relative Range
    4.2.2 Quartile Deviation and Coefficient of Quartile Deviation
    4.2.3 Mean Deviation and Coefficient of Mean Deviation
    4.2.4 ...... and ......
    4.2.5 Coefficient of Variation
    4.2.6 ......
  4.3 Moments
  4.4 Skewness
    4.4.1 Frequency Curves
    4.4.2 Measures of Skewness
  4.5 Kurtosis

5 Elementary Probability
  5.1 Concept of Set
  5.2 Basic Probability Terms
  5.3 Counting Techniques
  5.4 Definitions of Probability
  5.5 Some Rules of Probability
  5.6 Conditional Probability and Independence
    5.6.1 Conditional Events
    5.6.2 Independent Events

6 Probability Distributions
  6.1 ......
    6.1.1 ......
    6.1.2 Expectations of a Random Variable
  6.2 Common Discrete Distributions
    6.2.1 The Binomial Distribution
    6.2.2 The Poisson Distribution
  6.3 Common Continuous Distributions
    6.3.1 The ......
    6.3.2 Other Continuous Distributions

7 Sampling Techniques
  7.1 Basic Concepts
  7.2 Reasons for Sampling
  7.3 Types of Errors
  7.4 Types of Sampling Techniques
    7.4.1 Probability Sampling Techniques
    7.4.2 Non-probability Sampling Techniques

8 ...... for a Single Population
  8.1 Estimation
    8.1.1 ......
    8.1.2 ......
  8.2 Hypothesis Testing
    8.2.1 Basic Concepts in Hypothesis Testing
    8.2.2 Hypothesis Testing for a Population Mean
    8.2.3 Confidence Interval for a Population Mean

9 Inference for Two or More Populations
  9.1 Comparison of the Population Mean in Two Groups
    9.1.1 Paired Sample
    9.1.2 Independent Samples
  9.2 Analysis of Variance (ANOVA)

10 Simple ...... and Correlation
  10.1 Correlation
    10.1.1 ......
    10.1.2 Pearson's Correlation Coefficient
    10.1.3 Spearman's ......
  10.2 ......
    10.2.1 Method of Estimation
    10.2.2 The Coefficient of Determination

Chapter 1

Introduction

1.1 Some Statistical Terms

Statistics has become an integral part of our daily lives. Every day we are confronted with some form of statistical information through newspapers, magazines and other forms of communication. Such statistical information has become highly influential in our lives. Before getting involved in the subject matter in detail, let us define some of the terms used extensively in the field of statistics.

• Datum: a single piece of information taken from an object. It is also known as an observation, an item, a case or a unit.

• Data: a collection of observed values representing one or more characteristics of some objects.

• Population: the totality of all objects under study.

• Sample: a subset of the population. Normally a sample should be selected in such a way as to be representative of the population.
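The four terms can be illustrated with a small sketch. The population below is hypothetical (ages, in years, of ten students), and the use of a simple random draw is an assumption made only for illustration; sampling techniques are treated properly in Chapter 7.

```python
import random

# Hypothetical population: the ages (in years) of all 10 students in a small class.
population = [19, 20, 20, 21, 21, 22, 22, 23, 24, 25]

# A sample is a subset of the population. Here it is drawn at random,
# without replacement; the seed is fixed only so the draw is repeatable.
rng = random.Random(0)
sample = rng.sample(population, k=4)

# A datum is one single observation; data are the collection of them.
datum = sample[0]
print("sample (data):", sample)
print("one datum:", datum)
```

Every value in the sample also belongs to the population, which is what makes it a subset.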

1.2 Definition and Classification of Statistics

1.2.1 Definitions of Statistics

Statistics can be defined in two senses: plural (as statistical data) and singular (as statistical methods).

• Plural sense: Statistics are collections of facts (figures). This meaning of the word is widely used when reference is made to facts and figures on a certain characteristic, for example sales statistics, labor statistics, employment statistics, etc. In this sense the word "statistics" serves simply as "data". But not all numerical data are statistics. For numerical data to be identified as statistics, they must possess certain identifiable characteristics. Some of these characteristics are described as follows:

1. Statistics are aggregates of facts. Single or isolated facts or figures cannot be called statistics, as they cannot be compared or related to other figures within the same framework. Accordingly, there must be an aggregate of these figures. For example, if a person says "I earn Birr 30000 per year", it would not be considered statistics. On the other hand, if we say that the average salary of a professor at our university is Birr 30000 per year, then this would be considered statistics, since the average has been computed from many related figures such as the yearly salaries of many professors.

2. Statistics are numerically expressed. All statistics are stated in numerical figures only. Qualitative statements cannot be called statistics. For example, such qualitative statements as 'Ethiopia is a developing country' or 'Jack is very tall' would not be considered statistical statements. On the other hand, comparing the per capita income of Ethiopia with that of Kenya would be considered statistical in nature. Similarly, Jack's height in numbers compared to the average height in Ethiopia would also be considered statistics.

3. Statistics must be placed in relation to each other. The main objective of statistical analysis is to facilitate a comparative and relative study of the desired characteristics of the data. The comparison of facts and figures may be made for the same characteristic over a period of time from a single source, or across various sources at one given time. For example, the prices of different items in a store would not, as such, be considered statistics. However, the prices of one product in different stores constitute statistical data, since these prices are comparable. The changes in the price of a product in one store over a period of time would also be considered statistical data, since these changes provide for comparison over time. However, these comparisons must relate to the same phenomenon or subject, so that like is compared with like and oranges are not compared with apples.

• Singular sense: Statistics is a science that deals with the methods of data collection, data organization, data presentation, data analysis and interpretation of results. It refers to a subject matter that is concerned with extracting relevant information from available data with the aim of making sound decisions. According to this meaning, statistics is concerned with the development and application of methods and techniques for collecting, organizing, presenting and analyzing data and interpreting results.

1.2.2 Stages in Statistical Investigation

According to the singular sense definition of statistics, a statistical investigation involves five stages: data collection, organization, presentation, analysis and interpretation of results.

1. Collection of data: Data collection is the first stage in any statistical investigation. It involves the process of obtaining (gathering) a set of related measurements or counts to meet predetermined objectives. Data may be available from existing published sources, which may have already been organized in some presentable form. Such information is commonly referred to as secondary data. On the other hand, the investigator may actually collect his or her own data. This is usually warranted when information about some area of inquiry has not yet been ascertained. In such cases, the data are said to be primary data.

2. Organization of data: It is usually not possible to derive any conclusion about the main features of the data from direct inspection of the observations. The second stage of a statistical investigation is therefore to describe the properties of the data in summary form. Editing is the first step in the organization of data, since there may be omissions, inconsistencies, ambiguities, irrelevant answers and recording errors. Once the data are edited, the second step is classification, that is, arranging the collected data according to some common characteristics. Such classified data can more easily be presented. The last step is tabulation: presenting the classified data in tabular form, using rows and columns.

3. Presentation of data: The purpose of data presentation is to give an overview of what the data actually look like and to facilitate statistical analysis. Data presentation can be done using diagrams and graphs, which are easy to remember and facilitate comparison.

4. Analysis of data: The analysis of data is the extraction of a summarized and comprehensive numerical description in order to reach conclusions or provide answers to a problem. That is, the basic purpose of data analysis is to make the data useful for drawing conclusions. This analysis may require anything from simple to sophisticated mathematical techniques.

5. Interpretation of results: This is the last stage of a statistical investigation. Once the data have been analyzed, some numerical value(s) are obtained. The main job consists of attaching physical meaning or interpretation to these numerical results. The interpretation must be faithful to what the results actually show. No pre-conceived ideas should be imposed on the numerical results obtained from the analysis of the data. Also, no attempt should be made to draw more conclusions than the results actually support.

1.2.3 Classification of Statistics

Based on the scope of the decision, statistics can be classified into two: descriptive and inferential statistics.

1. Descriptive statistics: It refers to the procedures used to organize and summarize masses of data. It is concerned with describing or summarizing the most important features of the collected data without attempting to conclude anything that goes beyond the data themselves. The methodologies of descriptive statistics include methods of data organization (classification, tabulation and constructing frequency distributions), methods of data presentation (diagrammatic and graphical displays) and the calculation of certain indicators of the data, such as measures of central tendency and measures of dispersion (variation).

2. Inferential statistics: Inferential statistics includes the methods used to find out something about a population based on a sample. It is concerned with drawing statistically valid conclusions about the characteristics of the population based on information obtained from a sample. In this form of statistical analysis, descriptive statistics is linked with probability theory in order to generalize the results of the sample to the population. Performing hypothesis tests, determining relationships between variables and making predictions are also part of inferential statistics.
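The difference between the two branches can be sketched on a single hypothetical sample of student ages. The first part only describes the sample. The second part goes beyond it by giving a rough range for the unknown population mean; the "plus or minus two standard errors" rule used here is a simplifying assumption for illustration, not the formal confidence interval developed in Chapter 8.

```python
import math
import statistics

# Hypothetical sample: ages (in years) of 8 students drawn from a larger class.
ages = [19, 21, 20, 23, 22, 21, 20, 22]

# Descriptive statistics: summarize the sample itself and go no further.
mean_age = statistics.mean(ages)
sd_age = statistics.stdev(ages)   # sample standard deviation

# Inferential statistics: use the sample to say something about the population.
# Assumption: a rough interval of two standard errors around the sample mean.
standard_error = sd_age / math.sqrt(len(ages))
rough_interval = (mean_age - 2 * standard_error, mean_age + 2 * standard_error)

print("sample mean:", mean_age)
print("rough range for the population mean:", rough_interval)
```

The statement "the average age in this sample is 21" is descriptive; the statement about the population mean is inferential, because it generalizes from the sample.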

Example 1.1. Classify each of the following statements as descriptive or inferential statistics.

1. The average age of the students in this class is 21 years.

2. There is a strong association between smoking and lung cancer.

3. The price of wheat will increase by 5% in the coming year.

4. Of the students enrolled in Haramaya University this year, 74% are male and 26% are female.

5. The chance of winning the Ethiopian National Lottery on any day is 1 out of 167000.

1.3 Applications, Uses and Limitations of Statistics

1.3.1 Applications of Statistics

There is hardly any walk of life which has not been affected by statistics - ranging from a simple household to big business and the government. Hence, in modern times, statistical information plays a very important role in a wide range of fields. Some of the areas where the knowledge of statistics is usually applied are as follows:

• In scientific research: There is hardly any advanced research going on without the use of statistics in one form or another. Statistics are used extensively in medical, pharmaceutical and agricultural research. The effectiveness of a new drug is determined by statistical experimentation and evaluation. In agriculture, experiments about crop yields, types of fertilizers and types of soils under different types of environments are commonly designed and analyzed through statistical methods and concepts. In marketing research, statistical tools are indispensable in studying consumer behavior, the effects of various promotional strategies and so on. In economics, statistics is used for modeling functional relationships between or among variables. In education and agricultural extension, it is used to study the effects of certain training. In decision making, statistics helps to enhance the power of decision making in the face of uncertainty by providing sufficient information.

• In quality control: Statistics are used in quality control so extensively that the practice itself is known as statistical quality control. Statistical quality control (SQC) consists of using statistical methods to gather and analyze data for the determination and control of quality. Statistical methods help to check whether a product satisfies a given standard. The technique primarily deals with samples taken randomly so as to be representative of the entire population; these samples are then analyzed and inferences are made concerning the characteristics of the population from which they were taken. The concept is similar to tasting one spoonful from a pot of stew and deciding whether it needs more salt or not. The characteristics of samples are analyzed by statistical quality control and the use of other statistical techniques.

• In natural, social and physical sciences: In the natural sciences, for example in botany, statistics are used in evaluating the effects of temperature, other climatic conditions and types of soil on the health of plants. In the social sciences, statistics are used in all areas of human and social characteristics. Similarly, in the physical sciences, meteorology for example uses statistics in analyzing the data gathered by satellites to predict weather conditions.

• In other areas: Statistics are commonly used by insurance companies, stock brokerage houses, banks, public utility companies and so on. Statistics are also immensely useful to politicians, since their chances of winning can be predicted by using sampling techniques to select random samples of voters and studying their attitudes on issues and policies.

1.3.2 Uses of Statistics

• Reduction and summarization of data: Statistics condenses and summarizes a large mass of data and presents facts in a few presentable, understandable and precise numerical figures. The raw data, as usually available, are voluminous and haphazard. It is generally not possible to draw any conclusions from the raw data as collected. Hence it is necessary and desirable to express these data in a few numerical values.

• Facilitating comparison of data: Arrangement of data with respect to different characteristics facilitates comparison. Statistical devices such as percentages, ratios, etc. are used for this purpose.

• Determining functional relationships between two or more phenomena: Statistical techniques such as correlation analysis assist in establishing the degree of association between two or more variables.

• Formulation and testing of hypotheses: For instance, hypotheses such as whether a new medicine is effective in curing a disease, or whether there is an association between variables, can be tested using statistical tools.

• Prediction: Statistical methods are highly useful tools for analyzing past data and predicting future trends.

1.3.3 Limitations of Statistics

• Statistics does not deal with a single observation; rather, as discussed earlier, it only deals with aggregates of facts. For example, the marks obtained by one student in a class do not carry any meaning in themselves unless they are compared with a set standard, with other students in the same class or with the student's own earlier marks.

• Statistical methods are not applicable to qualitative characteristics that cannot be coded as numerical values.

• Statistical results are true on average, i.e. for the majority of cases. Since statistics is not an exact science, statistical conclusions are not universally true. That is, statistical laws are not universally true in the way that the laws of physics and chemistry are.

• Statistics are liable to be misused or misinterpreted. Statistical interpretation requires a high degree of skill and understanding of the subject. Misuse of statistics may result from inadequate or faulty procedures of data collection and sample selection, and mainly from ignorance (lack of knowledge).

1.4 Types of Variables and Measurement Scales

1.4.1 Variable

A variable is a characteristic or an attribute that can assume different values, for example height, family size, gender, ···. Based on the values that they assume, variables can be classified as qualitative or quantitative.

Qualitative variables are those variables that do not assume numeric values. For example, gender is a qualitative variable. Quantitative variables, on the other hand, are those variables which assume numeric values. These variables are numeric in nature. Height and family size are examples of quantitative variables.

Quantitative variables are again classified into two types: discrete and continuous variables. Discrete variables are those variables that assume whole number values and consist of distinct and recognizable individual elements that can be counted. For example, family size, number of children in a family, number of cars at the traffic light, ··· are some discrete variables. These variables assume a finite or countable number of possible values. The values of these variables are obtained by counting (0, 1, 2, ···).

The other quantitative variables, continuous variables, take any value, including decimals. These variables can theoretically assume an infinite number of possible values. Their values are obtained by measuring. Examples of continuous variables are height, weight, time, temperature, ···

Generally the values of a variable can be obtained either by counting for discrete variables, by measuring for continuous variables or by making categories for qualitative variables.
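As a rough sketch of this classification, the hypothetical function below guesses the type of a variable from how its values are stored: whole numbers are treated as counts, decimals as measurements and anything else as categories. This is only a heuristic made up for illustration. It would misclassify, for instance, numeric codes assigned to categories, so the real decision still rests on what the variable means.

```python
def classify(values):
    """Heuristically classify a variable from the Python type of its values."""
    if all(isinstance(v, int) for v in values):
        return "quantitative, discrete"      # obtained by counting
    if all(isinstance(v, (int, float)) for v in values):
        return "quantitative, continuous"    # obtained by measuring
    return "qualitative"                     # obtained by making categories

# Hypothetical observations for three of the variables named in the text.
print(classify([4, 6, 3, 5]))             # family size
print(classify([1.72, 1.65, 1.80]))       # height in metres
print(classify(["male", "female"]))       # gender
```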

Example 1.2. Classify each of the following variables as qualitative or quantitative and, if it is quantitative, classify it as discrete or continuous.

1. Color of automobiles in a dealer’s show room.

2. Number of seats in a movie theater.

3. Classification of patients based on nursing care needed (complete, partial or safers).

4. Number of tomatoes on each plant on a field.

5. Weight of newly born babies.

1.4.2 Scales of Measurement

Consider the following two cases.

Case 1:

• Mr A wears shirt number 5 when he plays football.

• Mr B wears shirt number 6 when he plays football.

Who plays better? What is the average shirt number?

Case 2:

• Mr A scored 5 in the Stat quiz.

• Mr B scored 6 in the Stat quiz.

Who did better? What is the average score?

Based on the numbers on the shirts, it is not possible to judge whether Mr B plays better. But by using the test scores, it is possible to judge that Mr B did better in the quiz. Also, it is not possible to find a meaningful average shirt number, because the numbers on the shirts are simply codes, but it is possible to obtain the average test score.
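The two cases can be mirrored in a short sketch. Storing the shirt numbers as text rather than as numbers is a design choice assumed here to make the point that they are labels: they can be counted, but not averaged or compared.

```python
import statistics
from collections import Counter

# Case 1: shirt numbers are codes, so they are stored as text labels.
shirt_numbers = {"Mr A": "5", "Mr B": "6"}

# Case 2: quiz scores are measurements, so arithmetic on them is meaningful.
quiz_scores = {"Mr A": 5, "Mr B": 6}

# The only sensible operation on the labels is counting how often each occurs.
shirt_counts = Counter(shirt_numbers.values())

# Scores can be averaged and compared.
average_score = statistics.mean(list(quiz_scores.values()))
better_student = max(quiz_scores, key=quiz_scores.get)

print("players wearing number 5:", shirt_counts["5"])
print("average quiz score:", average_score)
print("higher quiz score:", better_student)
```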

In general, a scale of measurement shows the information contained in the value of a variable, and what mathematical operations and statistical analysis are permissible to be done on the values of the variable. There are four levels of measurement. These levels, from the weakest to the strongest, in order are: nominal scale, ordinal scale, interval scale and ratio scale.

1. Nominal variables: are those qualitative variables which show the category of individuals. They reflect classification into mutually exclusive (non-overlapping) and exhaustive categories (names of groups) without any associated ranking. Numbers may be assigned to the categories simply for coding purposes. It is not possible to compare individuals based on the numbers assigned to categories. This scale is the weakest form of measurement. The only mathematical operation permissible on these variables is counting. Some examples of nominal variables are gender, religion, ID No, ethnicity, color, ···.

2. Ordinal variables: are those qualitative variables whose values can be ordered and ranked. Ranking and counting are the mathematical operations that can be performed on the values of these variables. However, the ranks only indicate which category is greater or better; the difference between the values (categories) of the variable is not precisely defined. Examples: grade scores (A, B, C, D, F), academic qualifications (B.Sc., M.Sc., Ph.D.), strength (very weak, weak, strong, very strong), health status (very sick, sick, cured).

3. Interval variables: are those quantitative variables that identify not only which category is greater or better but also by how much. This is a stronger form of measurement, but there is no true zero. Zero indicates a low value rather than the absence of the characteristic. Example: a temperature of 0 °C does not mean that there is no temperature but, rather, that it is very cold. Similarly, if a student scores 0 in a certain course, it does not mean that the student has no knowledge of the course at all.

4. Ratio variables: This scale is the highest form of measurement. Ratio variables are quantitative variables for which, unlike interval variables, zero shows absence of the characteristic. All mathematical operations may be performed on the values of these variables. Examples: height, weight, income, amount of yield, expenditure, consumption, ···.
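The ordering of the four scales, from weakest to strongest, can be sketched as a lookup of which operations each scale permits. The operation names ("count", "rank", "difference", "ratio") and the function name are assumptions introduced for this illustration; the mapping itself follows the descriptions above, where each stronger scale allows everything the weaker ones allow.

```python
# Scales listed from the weakest to the strongest, as in the text.
SCALES = ["nominal", "ordinal", "interval", "ratio"]

# The weakest scale on which each (hypothetically named) operation makes sense.
WEAKEST_SCALE_FOR = {
    "count": "nominal",         # counting cases in each category
    "rank": "ordinal",          # ordering the values
    "difference": "interval",   # saying by how much two values differ
    "ratio": "ratio",           # saying one value is twice another (needs a true zero)
}

def permits(scale, operation):
    """Return True if the operation is meaningful on the given scale."""
    return SCALES.index(scale) >= SCALES.index(WEAKEST_SCALE_FOR[operation])

print(permits("nominal", "rank"))       # religion cannot be ranked
print(permits("ordinal", "rank"))       # grades A to F can be ranked
print(permits("interval", "ratio"))     # 20 °C is not twice as hot as 10 °C
print(permits("ratio", "ratio"))        # 60 kg is twice 30 kg
```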

Chapter 2

Methods of Data Collection and Presentation

2.1 Types of Data

The statistical data, as previously discussed, may be classified into two categories depending upon the sources utilized. These categories are:

1. Primary data: Primary data are data collected by the investigator himself or herself for the purpose of a specific inquiry or study. They are collected for the first time, either through direct observation or by questioning individuals under the direct supervision and instruction of the researcher. Such data are original in character and are generated in surveys conducted by individuals or research institutions.

2. Secondary data: When an investigator uses data which have already been collected by others, such data are called secondary data. These data are primary data for the agency that collected them and become secondary data for someone else who uses them for his or her own purposes. Secondary data can be obtained from journals, official reports, government publications, publications of professional and research organizations and so on.

Based on the role of time, data can be classified as cross-sectional and time series data.

1. Cross-sectional data: a set of observations taken at a single point in time.

2. Time series data: a set of observations collected over a sequence of time points, usually at equal intervals.
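The distinction between the two time-based types shows up in how the data are laid out. In the sketch below all prices are hypothetical: the cross-section records one product in several stores at one time, while the time series records one store over several years, and a simple check confirms that the years are equally spaced.

```python
# Cross-sectional data: several units observed at one point in time.
# Hypothetical price (in Birr) of one product in three stores on the same day.
cross_section = {"Store A": 12.0, "Store B": 12.5, "Store C": 11.8}

# Time series data: one unit observed over a sequence of time points.
# Hypothetical price (in Birr) of the product in one store, by year.
time_series = [(2010, 9.5), (2011, 10.2), (2012, 11.0), (2013, 12.0)]

# Check that consecutive observations are the same distance apart in time.
gaps = {later[0] - earlier[0] for earlier, later in zip(time_series, time_series[1:])}
equally_spaced = len(gaps) == 1

print("units in the cross-section:", len(cross_section))
print("time points in the series:", len(time_series))
print("equally spaced:", equally_spaced)
```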

2.2 Methods of Data Collection

The first and foremost task in a statistical investigation is data collection. Before the actual data collection, four important points should be considered. These are the purpose of data collection (why we need to collect data), the kind of data to be collected (what type of data to collect), the source of data (where we can get the data) and the method of data collection (how we can collect the data).

Once these questions are answered, it becomes necessary to collect the information needed. This information has to be collected from certain individuals, directly or indirectly. Such a technique is known as the survey method, which is commonly used in the social sciences, i.e., for problems related to sociology, political science, psychology and various economic studies.

Another way of collecting data is experimentation, i.e., an actual experiment is conducted and then observations (measurements and counts) are recorded. Such experimental studies are common in the natural sciences: agriculture, biology, medical science, industry, ···

2.2.1 Questionnaire

The most common methods of data collection for surveys are the personal interview and the self-administered questionnaire. In these and other methods of data collection, it is necessary to prepare a document, called a questionnaire, which contains a set of questions to be answered and is used to record the responses.

A questionnaire is a form containing a cover letter, which introduces the person conducting the survey and explains the objectives of the survey, and a set of related questions to be answered by the respondents. One of the most important points in preparing it is that all of its questions must be relevant to the objectives of the survey. In short, the following points should be kept in mind while designing a questionnaire:

• The person conducting the survey should introduce himself or herself, state the objective(s) of the survey, promise anonymity and include instructions on how to fill in the form, as these are necessary for getting correct responses (cover letter).

• The number of questions should be as few as possible. Once the objectives of the survey are clearly defined, only questions pertinent to the objectives should be included. The time of the respondent should not be wasted by asking irrelevant questions. In general, 5 to 25 questions may be regarded as a fair number. If a lengthy questionnaire is unavoidable, it should preferably be divided into two or more parts.

• Questions should be logically arranged. Questions should be in a logical order (appropriate sequence of topics) so that a natural and spontaneous reply is induced. Topics should not be mixed up and questions should not skip back and forth. For example, it is undesirable to ask a person how many children s/he has before asking whether s/he is married or not. Questions related to identification and description of the respondent should come first, followed by the major information questions. If opinions are requested, such questions should usually be placed at the end of the list.

• Questions should be simple, short and easy to understand and they should convey one and only one idea. Technical terms should be avoided.

• Sensitive questions (questions of a personal or financial nature) should be avoided if possible. Otherwise, such questions should be asked indirectly, by constructing a set of ranges, and must be placed in the last part of the questionnaire. Examples: age (0-25, 26-50, 51-75, >75), salary (below 200, 200-500, 501-1000, >1000).

• Leading questions should be completely avoided. If you ask a person "You do not smoke cigarettes?", the person will automatically say 'No, I do not'.

• Answers to the questions should not require any calculation.

• Questions should be capable of objective answers.
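The advice on asking sensitive questions through ranges can be sketched as follows. The age ranges are the ones given in the list above; the function name and the restriction to whole-number ages are assumptions made for this illustration. Note that the ranges do not overlap and together cover every possible age, so each respondent falls into exactly one of them.

```python
def age_range(age):
    """Map an age in completed years to the questionnaire range containing it."""
    if age < 0:
        raise ValueError("age cannot be negative")
    if age <= 25:
        return "0-25"
    if age <= 50:
        return "26-50"
    if age <= 75:
        return "51-75"
    return ">75"

# Hypothetical respondents: only the range, not the exact age, is recorded.
for age in (18, 25, 26, 63, 80):
    print(age, "->", age_range(age))
```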

2.2.2 Secondary data

Secondary data should be used with the utmost care. Before using such data, the following three points should be considered.

1. Whether the data are suitable for the purpose of investigation. This can be judged in the light of the nature and scope of investigation.

2. If the data obtained are suitable for our purpose, it should then be checked whether they are adequate for the purpose of the investigation. This can be judged in the light of the time and geographical area covered by the available data.

3. Whether the data are reliable. The data obtained should be checked for accuracy. If the data are based on a sample, one should check whether the sample is properly representative of the population.

Once the above points are observed in the secondary data, it is ready to be used for further statistical analysis.

2.3 Data Organization

It is almost impossible for management to deal with all the collected data in raw form, as it is haphazard and unsystematic. In order to describe situations and make inferences about the population, or even to describe the sample, the data must be organized in some meaningful way.

Editing Data

Before further analysis, the collected data should be edited for completeness, consistency, accuracy and homogeneity.

• Completeness: If the answer to some questions is missing, it becomes necessary to contact the person again and complete the missing information.

• Consistency: Some information given by the respondent may not be compatible, in the sense that one piece of information furnished by the individual either does not agree with some other piece of information or contradicts an earlier one.

• Accuracy: It is of vital importance. If the data are inaccurate, the conclusions drawn from them have no relevance. If the investigator has made a false report or the respondent has deliberately supplied wrong information, editing will be of no use. In recent times, checks have been developed to attain accuracy, for example sending supervisors to check the work of investigators or reinvestigating a few respondents after a certain gap of time.

• Homogeneity: To maintain homogeneity, the information sheets are checked to see whether the unit of information or measurement is the same in all of them. If there are differences, the values have to be converted to the same unit during editing.
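Three of these editing checks can be sketched on a single survey record. The field names, the consistency rule (years of schooling cannot exceed age) and the choice of metres as the common unit are all hypothetical, chosen only to illustrate the idea. Accuracy is left out because, as noted above, it cannot be restored by editing alone.

```python
# Hypothetical fields that every record is expected to contain.
REQUIRED_FIELDS = ("age", "years_of_schooling", "height", "height_unit")

def edit_record(record):
    """Return an edited copy of the record and a list of problems found."""
    problems = []

    # Completeness: every required answer must be present.
    missing = [field for field in REQUIRED_FIELDS if record.get(field) is None]
    if missing:
        problems.append("incomplete: missing " + ", ".join(missing))
        return dict(record), problems

    # Consistency: answers must not contradict each other.
    if record["years_of_schooling"] > record["age"]:
        problems.append("inconsistent: years of schooling exceed age")

    # Homogeneity: convert every height to the same unit (metres).
    edited = dict(record)
    if edited["height_unit"] == "cm":
        edited["height"] = edited["height"] / 100
        edited["height_unit"] = "m"
    return edited, problems

good, good_problems = edit_record(
    {"age": 30, "years_of_schooling": 12, "height": 170, "height_unit": "cm"}
)
bad, bad_problems = edit_record(
    {"age": 12, "years_of_schooling": 15, "height": 1.5, "height_unit": "m"}
)
print(good, good_problems)
print(bad, bad_problems)
```

An incomplete record is returned unchanged with a note, since the remedy described above is to contact the respondent again rather than to guess the missing answer.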

Classification of Data

The next important step towards organizing data is classification. Classification is the separation of items according to similar characteristics and grouping them into various groups. Data may be classified into four broad classes:

1. Geographical classification: This classification groups the data according to location differences (places, areas or regions) among the items. The geographical areas are usually listed in alphabetical order for easy reference.

2. Chronological classification: Chronological classification groups data according to the time period (weekly, monthly, quarterly, annually, ···) in which the items under consideration occurred.

3. Qualitative classification: In this type of classification, the data are grouped together according to some distinguishing characteristic or attribute such as religion, sex, nation and so on. This classification simply identifies whether a given attribute is present or absent in a given population.

4. Quantitative classification: It refers to the classification of data according to some characteristics that has been measured such as classification according to weight, height, income and so on.


Tabulation of Data

A table is a systematic arrangement of data in rows and columns, which is easy to understand and makes data fit for further analysis and drawing conclusions.

Tabulation should not be confused with classification, as the two differ in many ways. The purpose of classification is mainly to divide the data into homogeneous groups, whereas in tabulation the data are presented in rows and columns. Hence, classification is a preliminary step prior to tabulation.

A statistical table, in general, should have the following parts.

1. Table Number: Every table should be identified by a number. It facilitates easy reference. Whenever you refer to the table in the text, you can give the number of the table only.

2. Title: There should be a title at the top of every table. The title should be clear, concise and adequate. It should answer the questions: What are the data? Where are the data from? How are the data classified? What is the time period of the data?

3. Caption: The caption labels the data presented in a column of the table. There may be sub-captions in each caption.

4. Stub: It is a title given to each row.

5. Body: The body of the table is the most important part. The information given in the rows and columns forms the body of the table. It contains the quantitative information to be presented.

6. Footnote: Any explanatory note concerning the table itself, placed directly beneath the table, is called a footnote. The main purpose of a footnote is to clarify specific items given in the table or to explain any ambiguities or omissions in the data shown in the table.

7. Source Note: If the data is collected from secondary sources, a source note is given to disclose the sources from which the data is collected.

Though the format of a table has already been discussed, some guidelines for preparing a table are as follows:

1. The table should contain the required number of rows and columns with stubs and captions, and the whole data should be accommodated within the cells formed corresponding to these rows and columns.

2. If the quantity is zero, it should be entered as zero. Leaving a blank space or putting a dash in place of zero is confusing and undesirable.


3. The unit of measurement should be given in parentheses either just below the column's caption or along with the stub in the row.

4. If any figure in the table has to be specified for a particular purpose, it should be marked with an asterisk or another symbol. The specification of the marked figure should be explained beneath the table with the same mark.

2.4 Methods of Data Presentation

2.4.1 Frequency Distributions

The most convenient way of organizing numerical data is to construct a frequency distribution. A frequency distribution is the organization of raw data in table form, using classes and frequencies. Here the term 'class' stands for a description of a group of similar objects in a data set and 'frequency' is the number of times a variable value is repeated.

There are three types of frequency distributions: categorical, ungrouped and grouped frequency distributions.

1. Categorical frequency distribution: It is used when the variable is qualitative, i.e. either nominal or ordinal. Each category of the variable represents a single class and the number of times each category repeats represents the frequency of that class.

Example 2.1. The blood types of 25 students are: A B B AB O AB O O B B B A B B AB O A O AB A O O O AB O. Construct a categorical frequency distribution.

Class (Blood type)    Frequency (Number of students)
A                     4
B                     7
AB                    5
O                     9
Total                 25

Example 2.2. Construct a frequency distribution for the following letter grades of 25 students: A B C C C C B B A D A C C A B F C C A B.
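The tally in Example 2.1 can be reproduced with a short Python sketch. This is only an illustration, not part of the original notes; the variable names `blood_types` and `frequency` are assumptions made for the example.

```python
from collections import Counter

# Raw data from Example 2.1 (blood types of 25 students).
blood_types = "A B B AB O AB O O B B B A B B AB O A O AB A O O O AB O".split()

# Each distinct category is one class; its count is the class frequency.
frequency = Counter(blood_types)

for blood_type in ["A", "B", "AB", "O"]:
    print(blood_type, frequency[blood_type])
print("Total", sum(frequency.values()))
```

The same approach works for any qualitative variable, since `Counter` simply counts how often each category occurs.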

2. Ungrouped frequency distribution: It is also called frequency array. It is a frequency distribution of numerical data (quantitative variable) in which each value of the variable represents a single class and the number of times each value repeats represents the frequency of that class.

Example 2.3. Number of children for 21 families is: 2 3 5 4 3 3 2 3 1 0 4 3 2 2 1 1 1 4 2 2 2.


Class (No. of children)    Frequency (No. of families)
0                          1
1                          4
2                          7
3                          5
4                          3
5                          1
Total                      21

3. Grouped frequency distribution: Like the ungrouped frequency distribution, the grouped frequency distribution is used for numerical data, but here several values of a variable are grouped into one class and the number of observations belonging to the class is the frequency of that class. This frequency distribution is also called a continuous frequency distribution.

Example 2.4. Consider the following age group and number of persons:

Class Limits      Class Boundaries    Frequency
(Age in years)    (Age in years)      (No. of persons)
1-25              0.5-25.5            20
26-50             25.5-50.5           15
51-75             50.5-75.5           25
76-100            75.5-100.5          10
Total                                 70

(a) Class Limits: The lowest and highest values that can be included in a class are called class limits. The lowest value is called the lower class limit (LCL) and the highest value is called the upper class limit (UCL). For example, the class limits of the first class are 1-25, where 1 is the lower class limit and 25 is the upper class limit.

(b) Class Boundaries: Class boundaries are the class limits adjusted so that there is no gap between the upper boundary of one class and the lower boundary of the next class. The lowest value is called the lower class boundary (LCB) and the highest value is called the upper class boundary (UCB). The class boundaries of the first class are 0.5-25.5, where the lower class boundary is 0.5 and the upper class boundary is 25.5. Note that the UCB of one class is the LCB of the next class.

(c) Class Width: It is the difference between the UCB and the LCB of a class. It is also the difference between the lower limits of two consecutive classes, or the difference between the upper limits of two consecutive classes. That is, $W = UCB_i - LCB_i$ or $W = LCL_i - LCL_{i-1}$ or $W = UCL_i - UCL_{i-1}$. The class width of the above frequency distribution is W = 25.5 − 0.5 = 25 or W = 26 − 1 = 25 or W = 50 − 25 = 25.


(d) Class Mark: It is the midpoint between the class limits or the class boundaries of a class:

$$CM_i = \frac{LCL_i + UCL_i}{2} = \frac{LCB_i + UCB_i}{2}$$

Class marks of the above distribution are $CM_1 = 13$, $CM_2 = 38$, $CM_3 = 63$ and $CM_4 = 88$. Note also that $W = CM_i - CM_{i-1}$.

Relative Frequency Distribution

The absolute frequency distribution is a summary table in which the original data are condensed into groups and their frequencies. But if a researcher would like to know the proportion or percentage of cases in each group, instead of simply the number of cases, s/he can do so by constructing a relative frequency distribution table. The relative frequency distribution is formed by dividing the frequency of each class by the total number of observations. It can be converted into a percentage frequency distribution by multiplying each relative frequency by 100.

Relative frequencies are particularly helpful when comparing two or more frequency distributions in which the numbers of cases under investigation are not equal. Percentage distributions make such a comparison more meaningful, since percentages are relative frequencies and hence the total number in the sample or population under consideration becomes irrelevant.

Class Limits    Class Boundaries    Relative Frequency    Percentage Frequency
1-25            0.5-25.5            20/70 = 0.2857        28.57
26-50           25.5-50.5           15/70 = 0.2143        21.43
51-75           50.5-75.5           25/70 = 0.3571        35.71
76-100          75.5-100.5          10/70 = 0.1429        14.29
Total                               1                     100

Cumulative Frequency Distribution

The above frequency distributions tell us the actual number (percentage) of units in each class, but they do not tell us directly the total number (percentage) of units that lie below or above specified values of the classes. This can be determined from a cumulative frequency distribution. A cumulative frequency distribution displays the total number of observations above (below) a certain value. When the interest of the investigator focuses on the number of items below a specified value, this specified value is taken as the upper boundary of a class, and the result is known as a less than cumulative frequency distribution. Similarly, when the interest lies in finding the number of cases above a specified value, this value is taken as the lower boundary of a class, and the result is known as a more than cumulative frequency distribution.


Less than Cumulative Frequency
Class      F
<25.5      20
<50.5      20+15 = 35
<75.5      20+15+25 = 60
<100.5     20+15+25+10 = 70

More than Cumulative Frequency
Class      F
>0.5       10+25+15+20 = 70
>25.5      10+25+15 = 50
>50.5      10+25 = 35
>75.5      10

Construction of Grouped Frequency Distribution

1. Arrange the data in an array form (increasing or decreasing order).
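As a minimal sketch, the relative, percentage and cumulative frequencies of the age distribution in Example 2.4 can be computed as follows. The variable names are illustrative assumptions and are not taken from the notes.

```python
from itertools import accumulate

# Class frequencies from Example 2.4 (ages 1-25, 26-50, 51-75, 76-100).
frequencies = [20, 15, 25, 10]
total = sum(frequencies)

# Relative frequency = class frequency / total; percentage = relative * 100.
relative = [f / total for f in frequencies]
percentage = [100 * r for r in relative]

# "Less than" cumulative frequencies accumulate from the first class down;
# "more than" cumulative frequencies accumulate from the last class up.
less_than = list(accumulate(frequencies))
more_than = list(accumulate(reversed(frequencies)))[::-1]

print([round(r, 4) for r in relative])
print(less_than)   # [20, 35, 60, 70]
print(more_than)   # [70, 50, 35, 10]
```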

2. Find the Unit of Measurement (U). Unit of measurement is the smallest numerical difference between any two distinct values of the data.

3. Find the Range (R). Range is the maximum numerical difference in the data set, i.e. the difference between the largest and the smallest values of the variable.

4. Determine the number of classes (k) using Sturges' rule: $k = 1 + 3.322 \log_{10} N$, where N is the total number of observations.

5. Specify the class width (W): W = R/k.

6. Put the smallest value of the data set as the LCL of the first class. Then obtain the LCL of the second class by adding the class width W to the LCL of the first class. Continue adding W until you get k classes. Let X be the smallest observation. Thus, LCL1 = X and LCLi = LCLi−1 + W for i = 2, 3, ··· , k.

7. Now obtain the UCLs of the frequency distribution by adding W − U to the corresponding LCLs: $UCL_i = LCL_i + (W - U)$ for $i = 1, 2, \cdots, k$.

8. Generate the class boundaries: $LCB_i = LCL_i - \frac{1}{2}U$ and $UCB_i = UCL_i + \frac{1}{2}U$ for $i = 1, 2, \cdots, k$.

Example 2.5. Construct a grouped frequency distribution for the following scores of 56 students (out of 40). 31 33 33 34 34 35 35 17 31 36 17 18 19 25 26 27 27 19 20 22 31 36 38 13 22 22 35 36 28 28 29 30 30 36 11 13 16 17 17 22 22 23 23 23 23 24 24 24 25 27 27 28 28 30 13 16

Solution:

1. The array form of the data (increasing order): 11 13 13 13 16 16 17 17 17 17 18 19 19 20 22 22 22 22 22 23 23 23 23 24 24 24 25 25 26 27 27 27 27 28 28 28 28 29 30 30 30 31 31 31 33 33 34 34 35 35 35 36 36 36 36 38


2. U = 17 − 16 = 1

3. R = L − S = 38 − 11 = 27

4. k = 1 + 3.322 log N = 1 + 3.322 log 56 = 6.81 ≈ 7

5. W = R/k = 27/6.81 = 3.96 ≈ 4

6. W − U = 4 − 1 = 3

k    CLs      CBs          CM      Freq.   Rel. Freq.   Per. Freq.   LCF   MCF
1    11-14    10.5-14.5    12.5    4       0.0714       7.14         4     56
2    15-18    14.5-18.5    16.5    7       0.1250       12.50        11    52
3    19-22    18.5-22.5    20.5    8       0.1429       14.29        19    45
4    23-26    22.5-26.5    24.5    10      0.1786       17.86        29    37
5    27-30    26.5-30.5    28.5    12      0.2143       21.43        41    27
6    31-34    30.5-34.5    32.5    7       0.1250       12.50        48    15
7    35-38    34.5-38.5    36.5    8       0.1429       14.29        56    8
Total                              56      1            100
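The construction steps above can be sketched in Python for the scores of Example 2.5. This is an illustrative sketch under the stated rules (Sturges' rule, class width rounded to the nearest whole unit); the variable names and the dictionary layout of `table` are assumptions, not part of the notes.

```python
import math

scores = [31, 33, 33, 34, 34, 35, 35, 17, 31, 36, 17, 18, 19, 25, 26, 27, 27,
          19, 20, 22, 31, 36, 38, 13, 22, 22, 35, 36, 28, 28, 29, 30, 30, 36,
          11, 13, 16, 17, 17, 22, 22, 23, 23, 23, 23, 24, 24, 24, 25, 27, 27,
          28, 28, 30, 13, 16]

n = len(scores)
distinct = sorted(set(scores))
# U: smallest difference between any two distinct values.
unit = min(b - a for a, b in zip(distinct, distinct[1:]))
data_range = max(scores) - min(scores)          # R
k_exact = 1 + 3.322 * math.log10(n)             # Sturges' rule
k = round(k_exact)                              # number of classes
width = round(data_range / k_exact)             # class width W

table = []
cumulative = 0
for i in range(k):
    lcl = min(scores) + i * width               # lower class limit
    ucl = lcl + (width - unit)                  # upper class limit
    freq = sum(1 for x in scores if lcl <= x <= ucl)
    cumulative += freq
    table.append({
        "limits": (lcl, ucl),
        "boundaries": (lcl - unit / 2, ucl + unit / 2),
        "mark": (lcl + ucl) / 2,
        "freq": freq,
        "less_than_cf": cumulative,
    })

for row in table:
    print(row)
```

Rounding k and W to whole numbers mirrors the hand calculation; for other data sets the rounded width may need to be increased so that the k classes cover the largest value.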

Example 2.6. The birth weights (in kilograms) of 30 children were recorded as follows. 2.0 2.1 2.3 3.0 3.1 2.7 2.8 3.5 3.1 3.7 4.0 2.3 3.5 4.2 3.7 3.2 2.7 2.5 2.7 3.8 3.1 3.0 2.6 2.8 2.9 3.5 4.1 3.9 2.8 2.2 Construct a frequency distribution for these data.

Guidelines for Constructing Grouped Frequency Distributions

The number of classes and the class width are more or less arbitrary within the general guidelines established for constructing a frequency distribution. The following are some guidelines for such construction:

1. Classes should be complete (they should cover the whole data set) and non-overlapping (no observation should belong to two classes). The classes should be clearly defined so that each observation is included in exactly one class.

2. The number of classes should be neither too large nor too small. Normally, between 5 and 20 classes are considered adequate, that is, 5 ≤ k ≤ 20. The number depends on the total number of observations: the larger the number of observations, the more classes the frequency distribution needs. The aim is to condense the data set into easily manageable classes with minimum loss of information. Fewer classes mean greater class width with consequent loss of accuracy. Too many classes result in greater complexity.


3. The class width should be the same for all classes.

4. Classes should be standardized. The classes should follow a logical and chronological (increasing) order.

5. Classes should be continuous. Even if there are no values in a class the class must be included in the frequency distribution.

6. Open ended classes, where there is no lower limit of the first class or no upper limit of the last class, should be avoided since this creates difficulty in analysis and interpretation.

Advantages and disadvantages of grouped frequency distributions:

• Advantages:

– It condenses a large mass of data into a comparatively small table.

– It attracts the attention of even a layman and gives him an insight into the nature of the distribution.

– It helps in further statistical analysis of the data, such as central tendency, scatter and symmetry.

• Disadvantages:

– In grouped frequency distributions, the identity of the observations is lost. We know only the number of observations in a class and do not know what the values are.

– Because the selection of the class width and the lower class limit of the first class are to a certain extent arbitrary, different frequency distributions may be constructed for the same data and hence may give contradictory impressions.

2.4.2 Diagrammatic Display of Data

1. Bar Diagram: It is the simplest and most commonly used diagrammatic representation of a frequency distribution. It is the most common presentation for nominal, categorical or discrete data. It uses a series of separated and equally spaced bars. The heights of the bars represent the frequency or relative frequency of the classes. The width of the bars has no meaning; however, all the bars should have the same width to avoid distortion, and they should be separated by a constant distance.

(a) Simple Bar Diagram: is a diagram in which categories of a variable are marked on the X axis and the frequencies of the categories are marked on the Y axis. It is applicable for discrete variables, that is, for data given according to some period, place or timing. These periods and timings are represented on the base line (X axis) at regular intervals and the corresponding frequencies are represented on the Y axis. The width of the bar represents nothing (it is meaningless), but it should be equal for all bars. Also, the bars are separated by equal spaces.

Example 2.7. Construct a simple bar chart for the following data.

Marital Status           Number of Individuals
Single                   10
Married                  7
Divorced                 3
Others (Widowed, ···)    1
Total                    21

(b) Component Bar Diagram: is used to show how a total or aggregate is divided into its component parts. The bars represent the total value of a variable, with each total broken into its component parts, and different colors are used for identification. In such diagrams, a bar is subdivided into parts in proportion to the size of each subdivision. These subdivided rectangles are shaded differently by lines, dots and colors so that the components are easy to compare. Sometimes the volumes of different attributes may be greatly different. For making meaningful comparisons, the components of the attributes are then reduced to percentages, so that each attribute has 100 as its maximum volume. This sort of component bar diagram is known as a percentage bar diagram.

Example 2.8. Construct component bar chart for the following data.


Marital Status           Male    Female    Total
Single                   8       2         10
Married                  3       4         7
Divorced                 2       1         3
Others (Widowed, ···)    1       0         1

(c) Multiple Bars Diagram: is used to display data on more than one variable. In the multiple bars diagram, two or more sets of inter-related data are presented side by side.

Example 2.9. Construct a multiple bar chart for the following data.

Year    Coffee    Butter    Sugar
1997    12        10        7
1998    5         9         8
1999    10        12        7
2000    9         8         8


2. Pie chart: A pie chart is popularly used in practice to show the percentage breakdown of data. It is a circle representing the total, divided into slices whose sizes are proportional to the frequency or relative frequency of each class, that is, to the size of the parts that make up the total.

Example 2.10. Construct a pie chart for the following data.

Marital Status           Number of Individuals
Single                   10
Married                  7
Divorced                 3
Others (Widowed, ···)    1
Total                    21

Solution:


Marital Status    Percentage                 Degree
Single            (10 × 100)/21 = 47.62      (47.62 × 360)/100 = 171.43
Married           (7 × 100)/21 = 33.33       (33.33 × 360)/100 = 119.99
Divorced          (3 × 100)/21 = 14.29       (14.29 × 360)/100 = 51.44
Others            (1 × 100)/21 = 4.76        (4.76 × 360)/100 = 17.14
Total             100                        360
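The percentage and angle of each slice in Example 2.10 can be sketched as below. The names `counts` and `slices` are illustrative assumptions. Here the angles are computed directly from the counts, so they can differ in the second decimal place from values obtained by first rounding the percentages.

```python
# Frequencies from Example 2.10.
counts = {"Single": 10, "Married": 7, "Divorced": 3, "Others": 1}
total = sum(counts.values())

# Each slice gets a share of 100 percent and of 360 degrees
# in proportion to its frequency.
slices = {
    status: {"percent": 100 * count / total, "degrees": 360 * count / total}
    for status, count in counts.items()
}

for status, values in slices.items():
    print(f"{status:<10}{values['percent']:>7.2f}{values['degrees']:>9.2f}")
```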

2.4.3 Graphical Presentation of Data

1. Histogram: A histogram is the most common graphical presentation of a frequency distribution for numerical data. It uses a series of adjacent bars in which the width of each bar represents the class width and the height represents the frequency or relative frequency of the class. It is used for continuous (grouped) frequency distributions, with the class boundaries marked on the X axis and the frequencies marked along the Y axis.

2. Frequency polygon: It is a graph that consists of line segments connecting the intersection points of the class marks and the frequencies of a continuous frequency distribution. It can also be constructed from a histogram by joining the mid-points of the tops of the bars. It is also called a frequency curve if the points are joined by a smooth free-hand sketch.

3. Cumulative Frequency (Ogive) curves: As there are two cumulative frequency distributions, there are two ogive curves. The less than ogive is a line graph joining the intersection points of the upper class boundaries and their corresponding less than cumulative frequencies. The more than ogive is a line graph joining the intersection points of the lower class boundaries and their corresponding more than cumulative frequencies.

Chapter 3

Measures of Central Tendency

Usually the collected data are not, in raw form, suitable for drawing conclusions about the mass from which they have been taken. Even though the data are somewhat summarized after they have been organized into frequency distributions and presented using graphs and diagrams, we still cannot easily make inferences from them, since there are many groups. Hence, organizing data into a frequency distribution is not sufficient and there is a need for further condensation. In particular, to compare two or more distributions, we may reduce each entire distribution to one number that represents it. A single value which can be considered as typical or representative of a set of observations, and around which the observations can be considered as centered, is called an 'average' (or average value or center of location). Since such typical values tend to lie centrally within a set of observations when arranged according to magnitude, averages are called measures of central tendency.

3.1 Objectives of Measures of Central Tendency

1. To condense a mass of data into one single value: That is, to get a single value which best represents the data (that describes the characteristics of the entire data). By condensing masses of data into one single value, measures of central tendency enable us to get an idea of the entire data. Thus, one value can represent thousands of observations or even more.

2. To facilitate comparison: Statistical devices like averages, percentages and ratios are used for this purpose. For example, to compare the performances of two classes, A and B, instead of comparing each student's result, which is infeasible, we can compare the average marks of the two classes.

There are many types of measures of central tendency, each possessing particular properties and each being typical in some unique way. The most frequently encountered ones are:

• Computed averages: Mean (Arithmetic Mean, Geometric Mean and Harmonic Mean)

• Positional averages: Median and Quantiles (Quartiles, Deciles and Percentiles)


• Mode

3.2 Characteristics of Good Measure of Central Tendency

There are various measures of central tendency. The difficulty lies in choosing among them, as no hard and fast rule has been made for selecting any one. However, some norms have been set which work as a guideline for choosing a particular measure of central tendency. A measure of central tendency is good or satisfactory if it possesses the following characteristics, although no single measure satisfies all of these properties:

1. It should be calculated based on all observations.

2. It should not be affected by extreme values. It should be as close to the maximum number of observed values as possible.

3. It should be defined rigidly which means it should have a definite value (it should be unique).

4. It should always exist.

5. It should be easy to understand and calculate. It should not be subject to complicated and tedious calculations, although the advent of electronic calculators and computers has made such calculations feasible.

6. It should be stable with regard to sampling. This means that if a number of samples of the same size are drawn from a population, the measure of central tendency having the minimum variation among the different calculated values should be preferred.

7. It should be capable of further algebraic treatment. By algebraic treatment, we mean the measures should be used further in the formulation of other formulae or it should be used for further statistical analysis.

3.3 Summation Notation

The sum $X_1 + X_2 + \cdots + X_n$ is denoted using the Greek letter Σ (Sigma) as $\sum_{i=1}^{n} X_i$, and this is called the summation notation.

Properties of the summation notation:

• $\sum_{i=1}^{n} (X_i \pm Y_i) = \sum_{i=1}^{n} X_i \pm \sum_{i=1}^{n} Y_i$

• $\sum_{i=1}^{n} X_i Y_i = X_1Y_1 + X_2Y_2 + \cdots + X_nY_n$

• $\sum_{i=1}^{n} c = nc$, where c is a constant.

• $\sum_{i=1}^{n} (X_i \pm c) = \sum_{i=1}^{n} X_i \pm nc$

• $\sum_{i=1}^{n} cX_i = c \sum_{i=1}^{n} X_i$

• $\sum_{i=1}^{n} (X_i \pm Y_i)^2 = \sum_{i=1}^{n} X_i^2 \pm 2\sum_{i=1}^{n} X_iY_i + \sum_{i=1}^{n} Y_i^2$

• In general, $\sum_{i=1}^{n} X_iY_i \neq \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i$

• In general, $\sum_{i=1}^{n} X_i^2 \neq \left(\sum_{i=1}^{n} X_i\right)^2$

3.4 Mean

3.4.1 Arithmetic Mean

1. Simple arithmetic mean: The arithmetic mean is the simplest but most useful measure of central tendency. It is nothing but the 'average' which we compute in our high school arithmetic. It is defined as the sum of all observations divided by the total number of observations. The sample mean is denoted by $\bar{X}$ (read as X bar) while the population mean is represented by the Greek letter µ (mu).

• For a sample of n raw (individual) observations, $X_1, X_2, \cdots, X_n$:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

• For grouped data (continuous or ungrouped frequency distributions):

$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$$

where $X_i$ is the class mark of the $i$th class for grouped data or the $i$th class value for ungrouped data, and $f_i$ is the corresponding frequency.

Example 3.1. Find the arithmetic mean of 2, 4 and 8.

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{2 + 4 + 8}{3} = \frac{14}{3} = 4.67$$

Example 3.2. Find the mean for the frequency distribution of the students' score data considered in Example 2.5.

To find the mean of the frequency distribution, the necessary calculations are as follows:

Class Boundaries    Class Marks (Xi)    Frequency (fi)    fiXi
10.5-14.5           12.5                4                 50
14.5-18.5           16.5                7                 115.5
18.5-22.5           20.5                8                 164
22.5-26.5           24.5                10                245
26.5-30.5           28.5                12                342
30.5-34.5           32.5                7                 227.5
34.5-38.5           36.5                8                 292
Total                                   56                1436

Thus,

$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i} = \frac{\sum_{i=1}^{7} f_i X_i}{\sum_{i=1}^{7} f_i} = \frac{1436}{56} = 25.64$$
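The calculation in Example 3.2 can be checked with a short sketch. The variable names are illustrative; the class marks and frequencies are those of the table above.

```python
# Class marks and frequencies of the students' score distribution.
class_marks = [12.5, 16.5, 20.5, 24.5, 28.5, 32.5, 36.5]
frequencies = [4, 7, 8, 10, 12, 7, 8]

# Mean for grouped data: sum of f_i * X_i divided by sum of f_i.
weighted_sum = sum(f * x for f, x in zip(frequencies, class_marks))
mean = weighted_sum / sum(frequencies)

print(weighted_sum)      # 1436.0
print(round(mean, 2))    # 25.64
```

Because class marks stand in for the individual scores, this value is an approximation of the mean of the raw data.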

Properties of Arithmetic Mean

(a) The algebraic sum of the deviations of each value from the arithmetic mean is zero. That is, $\sum_{i=1}^{n} (X_i - \bar{X}) = 0$.

(b) The sum of the squares of the deviations from the mean is less than the sum of the squares of the deviations about any other value; that is, the sum of squared deviations is smallest when taken about the mean:

$$\sum_{i=1}^{n} (X_i - \bar{X})^2 < \sum_{i=1}^{n} (X_i - a)^2, \quad a \neq \bar{X}$$


(c) If a constant c is added to or subtracted from each value in a distribution, then the new mean will be $\bar{X}_{new} = \bar{X}_{old} \pm c$, respectively.

(d) If each value of a distribution is multiplied by a constant c, the new mean will be the original mean multiplied by c.

(e) Combined Mean: If there are g different groups having the same units of measurement, with means $\bar{X}_1, \bar{X}_2, \cdots, \bar{X}_g$ and numbers of sample observations $n_1, n_2, \cdots, n_g$ respectively, then the mean of all the groups, called the combined mean (denoted by $\bar{X}_c$), is given by:

$$\bar{X}_c = \frac{\sum_{i=1}^{g} n_i \bar{X}_i}{\sum_{i=1}^{g} n_i} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + \cdots + n_g\bar{X}_g}{n_1 + n_2 + \cdots + n_g}$$

Examples:

i. The mean weight of 50 women working in a factory is 48 kilograms. The mean weight of 75 men working in the same factory is 58 kilograms. Find the mean weight of all workers in the factory.

ii. The mean mark in statistics of 50 students in a class was 72 and that of the 35 boys was 75. Find the mean mark of the girls in the class. Ans: 65

iii. The mean salary of 100 laborers working in a factory, running in two shifts of 40 and 60 workers respectively, is birr 380. The mean salary of the 40 laborers working in the morning shift is 350. Find the mean salary of the 60 laborers working in the evening shift.

Solutions:

i. $n_w = 50$, $\bar{X}_w = 48$, $n_m = 75$, $\bar{X}_m = 58$, $\bar{X}_c = ?$

$$\bar{X}_c = \frac{n_w\bar{X}_w + n_m\bar{X}_m}{n_w + n_m} = \frac{50 \times 48 + 75 \times 58}{50 + 75} = \frac{6750}{125} = 54$$

ii. $n = 50$, $\bar{X}_c = 72$, $n_b = 35$, $\bar{X}_b = 75$, so $n_g = n - n_b = 50 - 35 = 15$, $\bar{X}_g = ?$

$$\bar{X}_c = \frac{n_b\bar{X}_b + n_g\bar{X}_g}{n} \Rightarrow \bar{X}_g = \frac{n\bar{X}_c - n_b\bar{X}_b}{n_g} = \frac{50(72) - 35(75)}{15} = 65$$

iii. $n = 100$, $\bar{X}_c = 380$, $n_m = 40$, $\bar{X}_m = 350$, $n_e = 60$, $\bar{X}_e = ?$

$$\bar{X}_e = \frac{n\bar{X}_c - n_m\bar{X}_m}{n_e} = \frac{100(380) - 40(350)}{60} = 400$$

Example 3.3. The mean of 100 values was found to be 40. It was later discovered that a value was misread as 83 instead of 53. Find the correct mean.


Example 3.4. The mean of 200 items was found to be 50. Later on it was discovered that two items were wrongly read as 92 and 8 instead of the correct values 192 and 88 respectively. Find the correct mean.
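The combined-mean formula, and the correction of a mean after misread values are discovered (the situation in Examples 3.3 and 3.4), can be sketched in Python. The function names `combined_mean` and `corrected_mean` are illustrative and not part of the notes; the correction assumes the reported mean was computed from a total that included the wrong values.

```python
def combined_mean(sizes, means):
    """Mean of several groups: sum(n_i * mean_i) / sum(n_i)."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)


def corrected_mean(n, reported_mean, wrong_values, right_values):
    """Recover the total, swap the misread values, and divide by n again."""
    total = n * reported_mean - sum(wrong_values) + sum(right_values)
    return total / n


# Women and men in the factory example: 50 at 48 kg and 75 at 58 kg.
print(combined_mean([50, 75], [48, 58]))          # 54.0

# One value among 100 misread as 83 instead of 53, reported mean 40.
print(corrected_mean(100, 40, [83], [53]))

# Two values among 200 misread as 92 and 8 instead of 192 and 88.
print(corrected_mean(200, 50, [92, 8], [192, 88]))
```

Working with totals rather than means is the key idea in both functions: a mean times its group size gives back the group total, which can then be combined or corrected.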

2. Weighted arithmetic mean: While calculating the simple arithmetic mean, we gave equal importance to all values. But there are cases where the relative importance is not the same for all values. When this is the case, it is necessary to assign them weights (that is, relative importance) and then calculate the weighted arithmetic mean. Let $X_1, X_2, \cdots, X_n$ be the values and $w_1, w_2, \cdots, w_n$ be the corresponding weights. Then the weighted arithmetic mean is denoted by $\bar{X}_w$ and is given by:

$$\bar{X}_w = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i} = \frac{w_1X_1 + w_2X_2 + \cdots + w_nX_n}{w_1 + w_2 + \cdots + w_n}$$

Example 3.5. A teacher attaches a weight of 2 to the quiz, 3 to the midterm and 5 to the final exam. If a student gets 90, 50 and 60 for the quiz, midterm and final exam respectively, what is his/her average academic performance?

Solution: $X_i = 90, 50, 60$ and $w_i = 2, 3, 5$

$$\bar{X}_w = \frac{\sum_{i=1}^{3} w_i X_i}{\sum_{i=1}^{3} w_i} = \frac{2(90) + 3(50) + 5(60)}{2 + 3 + 5} = \frac{630}{10} = 63$$

The arithmetic mean fulfils all the characteristics of a good measure of central tendency, with the exception that it is highly affected by extreme values. It also cannot be calculated for a frequency distribution with open ended classes (a frequency distribution with no lower boundary of the first class, or no upper boundary of the last class, or both).

3.4.2 Geometric Mean

In algebra, the geometric mean is calculated for a geometric progression, but in statistics there is no need to consider the progression. Here, it is the type of data that makes the geometric mean important. If the variable values are measured as ratios, proportions or percentages, and some values are much larger in magnitude than others, then the geometric mean is a better measure of central tendency than the simple arithmetic mean, because the arithmetic mean is pulled strongly toward the large numbers in the series.

Geometric mean is defined as the nth root of the product of n positive numerical values.


• For raw data, $X_1, X_2, \cdots, X_n$:

$$GM = \sqrt[n]{\prod_{i=1}^{n} X_i} = \sqrt[n]{X_1 \times X_2 \times \cdots \times X_n}$$

• For grouped data:

$$GM = \sqrt[n]{\prod_{i=1}^{k} X_i^{f_i}} = \sqrt[n]{X_1^{f_1} \times X_2^{f_2} \times \cdots \times X_k^{f_k}}$$

where $X_i$ is the class mark of the $i$th class, $f_i$ is the corresponding class frequency and $n = \sum_{i=1}^{k} f_i$.

The above formulas are convenient only if n is small. If n is large, it is difficult to calculate the nth root by hand. Thus, to facilitate the computation, we make use of logarithms:

$$GM = \text{antilog}\left(\frac{1}{n}\sum_{i=1}^{n} \log X_i\right) \text{ for ungrouped data, and}$$

$$GM = \text{antilog}\left(\frac{1}{\sum_{i=1}^{k} f_i}\sum_{i=1}^{k} f_i \log X_i\right) \text{ for grouped data.}$$

The disadvantage of the geometric mean is that it is meaningless if one or more observations are zero or negative. It is also affected by extreme values, but not to the extent of the arithmetic mean.

Example 3.6. Find the geometric mean of 2, 4 and 8.

$$GM = \sqrt[n]{\prod_{i=1}^{n} X_i} = \sqrt[3]{\prod_{i=1}^{3} X_i} = \sqrt[3]{2(4)(8)} = \sqrt[3]{64} = 4$$

Example 3.7. The price of a commodity increased by 5% from 1989 to 1990, 8% from 1990 to 1991 and by 77% from 1991 to 1992. Find the average price increase.

For increment, take the base line value as 100% and then add the % increase so as to get the values in successive years.


Year         % increase    Value X (base 100%)    log Xi
1989-1990    5             105                    2.02
1990-1991    8             108                    2.03
1991-1992    77            177                    2.25
Total                                             Σ log Xi = 6.30

Then,

$$GM = \text{antilog}\left(\frac{1}{n}\sum_{i=1}^{n} \log X_i\right) = \text{antilog}\left(\frac{1}{3}\sum_{i=1}^{3} \log X_i\right) = \text{antilog}\left(\frac{1}{3}(6.30)\right) = \text{antilog}(2.1) = 125.89$$

Therefore, the average price increase is 25.89%.

Example 3.8. A machine depreciated by 10% each in the first two years and by 40% in the third year. Find out the average rate of depreciation.

Like the previous one, take the base line value of the machine as 100% and then deduct the % of depreciation so as to get the depreciated values in successive years.

Year    % of depreciation    Value X (base 100%)    log Xi
1       10                   90                     1.95
2       10                   90                     1.95
3       40                   60                     1.79
Total                                               Σ log Xi = 5.69

Then,

$$GM = \text{antilog}\left(\frac{1}{3}(5.69)\right) = \text{antilog}(1.8967) \approx 78.83$$

Therefore, the average rate of depreciation is about 100 − 78.83 = 21.17% per year.
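The logarithm form of the geometric mean can be sketched in Python for the two examples above. The function name `geometric_mean` is an illustrative assumption. Natural logarithms are used here instead of base-10 logarithms; the result is the same for any base. Because the logarithms are not rounded to two decimals, the results (about 126.14 and about 78.62) differ slightly from values obtained with a two-decimal log table.

```python
import math


def geometric_mean(values):
    """antilog of the mean of the logs; defined only for positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Price example: yearly values relative to a base of 100.
price_gm = geometric_mean([105, 108, 177])
print(round(price_gm, 2), "-> average increase of", round(price_gm - 100, 2), "%")

# Depreciation example: value remaining each year relative to a base of 100.
machine_gm = geometric_mean([90, 90, 60])
print(round(machine_gm, 2), "-> average depreciation of", round(100 - machine_gm, 2), "%")
```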

Example 3.9. The decadal percentage growth of the population in country A is given below. Find the average rate of growth.

Year          1921    1931     1941     1951     1961     1971     1981
% increase    8.25    19.08    32.09    41.49    25.89    37.91    46.02

3.4.3 Harmonic Mean

In algebra, the harmonic mean is found in the case of a harmonic progression only. But in statistics, the harmonic mean is a suitable measure of central tendency when the data pertain to speed, rates and time. That is, the harmonic mean is another specialized average which is useful in averaging variables expressed as a rate per unit of time, such as speed or the number of units produced per day.

Harmonic mean is defined as the inverse of the arithmetic mean of the reciprocals of the values.


• For raw data, $X_1, X_2, \cdots, X_n$:

$$HM = \frac{n}{\frac{1}{X_1} + \frac{1}{X_2} + \cdots + \frac{1}{X_n}} = \frac{n}{\sum_{i=1}^{n}\frac{1}{X_i}}$$

• For grouped data,

$$HM = \frac{\sum_{i=1}^{k} f_i}{\frac{f_1}{X_1} + \frac{f_2}{X_2} + \cdots + \frac{f_k}{X_k}} = \frac{\sum_{i=1}^{k} f_i}{\sum_{i=1}^{k}\frac{f_i}{X_i}}$$

where $X_i$ is the class mark of the $i$th class, $f_i$ is the corresponding class frequency, and $n = \sum_{i=1}^{k} f_i$.

Similar to the weighted arithmetic mean, there is also a weighted harmonic mean. It is given by:

$$HM = \frac{\sum_{i=1}^{n} w_i}{\frac{w_1}{X_1} + \frac{w_2}{X_2} + \cdots + \frac{w_n}{X_n}} = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n}\frac{w_i}{X_i}}$$

The harmonic mean is not much affected by extremely large values. But it cannot be calculated when one or more observations are zero.

Example 3.10. Find the harmonic mean of 2, 4 and 8.

$X_i = 2, 4, 8$;

$$HM = \frac{3}{1/2 + 1/4 + 1/8} = \frac{3}{0.875} = 3.429$$

Example 3.11. In a factory, the first mechanic takes 15 days to fabricate a machine, the second mechanic takes 18 days, the third takes 30 days and the fourth takes 90 days. Find the average number of days taken by the workers to fabricate the machine.

$X_i = 15, 18, 30, 90$;

$$HM = \frac{4}{1/15 + 1/18 + 1/30 + 1/90} = \frac{4}{1/6} = 24 \text{ days}$$

Example 3.12. Suppose a train moves 100 km with a speed of 40 km per hour, then 150 km with a speed of 50 km per hour and the next 135 km with a speed of 45 km per hour. Calculate the average speed of the train.


$X_i = 40, 50, 45$ and $w_i = 100, 150, 135$;

$$HM = \frac{100 + 150 + 135}{100/40 + 150/50 + 135/45} = \frac{385}{8.500} = 45.294 \text{ km per hour}$$

Example 3.13. Suppose a train moves 5 hours at a speed of 40 km per hour, then 3 hours at a speed of 50 km per hour and the next 5 hours with a speed of 45 km per hour. Calculate the average speed of the train.

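Here the weights are times rather than distances, so the average speed (total distance divided by total time) is the weighted arithmetic mean of the speeds, not the weighted harmonic mean. $X_i = 40, 50, 45$ and $w_i = 5, 3, 5$;

$$\bar{X}_w = \frac{5(40) + 3(50) + 5(45)}{5 + 3 + 5} = \frac{575}{13} = 44.23 \text{ km per hour}$$

Example 3.14. A driver traveled 400 km per day for three days at a speed of 60, 50 and 40 kilometers per hour. Find the average speed of the driver.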

$X_i = 60, 50, 40$ and $w_i = 400, 400, 400$;

$$HM = \frac{400 + 400 + 400}{400/60 + 400/50 + 400/40} = \frac{1200}{24.667} = \frac{3}{1/60 + 1/50 + 1/40} = 48.648 \text{ km per hour}$$

Example 3.15. A student reads the first 100 pages of a book at a rate of 5 pages per hour and the next 100 pages at a rate of 8 pages per hour. What is the student's average reading speed?

$X_i = 5, 8$ and $w_i = 100, 100$;

$$HM = \frac{2}{1/5 + 1/8} = \frac{2}{0.325} = 6.15 \text{ pages per hour}$$
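The simple and weighted harmonic mean formulas can be checked with a short Python sketch. The helper name `harmonic_mean` and its `weights` parameter are assumptions made for illustration; with no weights supplied it reduces to the raw-data formula.

```python
def harmonic_mean(values, weights=None):
    """Weighted harmonic mean: sum(w) / sum(w / x).

    With weights omitted, every observation gets weight 1,
    which gives the simple harmonic mean n / sum(1 / x).
    """
    if weights is None:
        weights = [1] * len(values)
    if any(x == 0 for x in values):
        # The harmonic mean cannot be computed when an observation is zero.
        raise ValueError("harmonic mean is undefined when a value is zero")
    return sum(weights) / sum(w / x for w, x in zip(weights, values))

print(round(harmonic_mean([2, 4, 8]), 3))                        # Example 3.10
print(round(harmonic_mean([40, 50, 45], [100, 150, 135]), 3))    # Example 3.12
```

When all weights are equal (as with equal distances), they cancel, so the weighted and simple formulas give the same value.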

Relationships between AM, GM and HM

• For $n$ positive observations, $AM \geq GM \geq HM$.
• For two positive observations, $GM = \sqrt{AM \times HM}$.
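A short derivation shows why the two-observation relationship holds. The symbols $a$ and $b$ are introduced here only for illustration and stand for the two positive observations:

$$AM \times HM = \frac{a+b}{2} \times \frac{2}{\frac{1}{a}+\frac{1}{b}} = \frac{a+b}{2} \times \frac{2ab}{a+b} = ab = GM^2 \;\Rightarrow\; GM = \sqrt{AM \times HM}$$

For instance, for the values 2 and 8: $AM = 5$, $HM = 3.2$ and $GM = \sqrt{5 \times 3.2} = \sqrt{16} = 4$, so that $AM \geq GM \geq HM$ also holds.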

Example 3.16. The arithmetic mean of two observations is 36 and their harmonic mean is 25. What is the geometric mean of the two observations?

3.5 Median

It has been pointed out that the mean cannot be calculated for a frequency distribution with open-ended classes. Also, the mean is affected to a great extent by extreme values. For instance, suppose eight persons receive salaries of Birr 150, 225, 240, 260, 275, 290, 300 and 1500. The mean salary of the persons is Birr 405. This value is not a good measure of central tendency because, out of the eight people, seven get Birr 300 or less. Hence, a better measure is preferable, and the median is one such measure.

The median is the halfway point in a data set. It divides a data set into two equal parts such that half of the numbers have a value less than the median and half have values greater than the median. Graphically, the median is located at the intersection point of the less than and more than cumulative frequency curves.

The median (denoted by $\tilde{X}$) of a set of $n$ observations $X_1, X_2, \cdots, X_n$ arranged in ascending or descending order of magnitude is the middle value if $n$ is odd or the arithmetic mean of the two middle values if $n$ is even. That is:

• if $n$ is odd, $\tilde{X} = \left(\frac{n+1}{2}\right)^{\text{th}}$ value.

• if $n$ is even, $\tilde{X} = \dfrac{\left(\frac{n}{2}\right)^{\text{th}} \text{ value} + \left(\frac{n}{2}+1\right)^{\text{th}} \text{ value}}{2}$.

Example 3.17. Find the median of the following two data sets:

a. 180, 201, 220, 191, 219, 209 and 220.

b. 62, 63, 64, 65, 66, 66, 68 and 78

Using the formula for raw data:

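a. Arranged in ascending order, the data are 180, 191, 201, 209, 219, 220 and 220. Since n = 7, the median is the 4th value = 209.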

b. (4th value + 5th value)/2=(65+66)/2=65.5
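The raw-data rule can be sketched in Python as below. The function name `median` is a hypothetical helper written for illustration; it sorts the data first, which is the step that must not be skipped.

```python
def median(values):
    """Middle value of the sorted data (odd n) or mean of the two middle values (even n)."""
    data = sorted(values)          # the observations must be ordered first
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]           # the ((n + 1) / 2)th value, 0-based index mid
    return (data[mid - 1] + data[mid]) / 2

print(median([180, 201, 220, 191, 219, 209, 220]))   # data set (a)
print(median([62, 63, 64, 65, 66, 66, 68, 78]))      # data set (b)
```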

The median value for grouped frequency distributions is given by the formula:

$$\tilde{X} = L_{\tilde{X}} + \left(\frac{\frac{n}{2} - F_{\tilde{X}-1}}{f_{\tilde{X}}}\right) \times w$$

where $n = \sum_{i=1}^{k} f_i$ is the total number of observations, $f_{\tilde{X}}$ is the frequency of the median class, $L_{\tilde{X}}$ is the lower class boundary of the median class, $F_{\tilde{X}-1}$ is the less than cumulative frequency just before the median class (that is, the sum of all the frequencies up to but not including the median class) and $w$ is the class width of the median class. The median class is the class corresponding to the minimum less than cumulative frequency which contains the value $\frac{n}{2}$.

Example 3.18. Find the median mark of the students score data and interpret it.


First calculate less than cumulative frequency of the frequency distribution and identify the median class.

| Class Boundaries | $f_i$ | LCF ($F_i$) |
|---|---|---|
| 10.5-14.5 | 4 | 4 |
| 14.5-18.5 | 7 | 11 |
| 18.5-22.5 | 8 | 19 |
| 22.5-26.5 | 10 | 29 |
| 26.5-30.5 | 12 | 41 |
| 30.5-34.5 | 7 | 48 |
| 34.5-38.5 | 8 | 56 |
| Total | 56 | |

The median class is the class having the less than cumulative frequency containing the value n/2 = 56/2 = 28. This implies that 22.5-26.5 is the median class.

$$\tilde{X} = L_{\tilde{X}} + \left(\frac{\frac{n}{2} - F_{\tilde{X}-1}}{f_{\tilde{X}}}\right) \times w = 22.5 + \left(\frac{28 - 19}{10}\right) \times 4 = 22.5 + 3.6 = 26.1$$
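The grouped-data interpolation can be sketched in Python as follows. The function `grouped_quantile` and its `fraction` parameter are assumptions for illustration: a fraction of 0.5 gives the median, and other fractions locate other positional values with the same interpolation.

```python
def grouped_quantile(boundaries, freqs, fraction=0.5):
    """Interpolate a positional value from a grouped frequency distribution.

    boundaries: list of (lower, upper) class boundaries
    freqs:      class frequencies, in the same order
    fraction:   0.5 for the median (hypothetical parameter)
    """
    n = sum(freqs)
    target = fraction * n          # e.g. n / 2 for the median
    cumulative = 0                 # less than cumulative frequency before the class
    for (lower, upper), f in zip(boundaries, freqs):
        # First class whose cumulative frequency reaches the target.
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")

# Students score data from the table above.
bounds = [(10.5, 14.5), (14.5, 18.5), (18.5, 22.5), (22.5, 26.5),
          (26.5, 30.5), (30.5, 34.5), (34.5, 38.5)]
freqs = [4, 7, 8, 10, 12, 7, 8]
print(round(grouped_quantile(bounds, freqs), 2))   # median of the score data
```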

Example 3.19. Find the median of the following data.

| Class | 13.5-22.5 | 22.5-31.5 | 31.5-40.5 | 40.5-49.5 | 49.5-58.5 |
|---|---|---|---|---|---|
| Frequency | 3 | 9 | 12 | 20 | 3 |

The median is not influenced by extreme values. It can be calculated for a frequency distribution with open-ended classes, and it can even be located when the data are incomplete.

3.6 Other Measures of Location: Quantiles

As discussed before, median divides a given data set into two equal parts. There are also other positional measures that divide a given data set into more than two equal parts. These measures are collectively known as quantiles. Quantiles include quartiles, deciles and percentiles. For all of these measures, first, the data should be arranged in ascending order.

3.6.1 Quartiles

Quartiles are values that divide a data set into four equal parts. These values are denoted by $Q_1$, $Q_2$ and $Q_3$ such that 25% of the data fall below $Q_1$, 50% below $Q_2$ and 75% below $Q_3$.

Let $Q_i$ be the $i$th quartile ($i = 1, 2, 3$); then $Q_i = \left(\frac{i(n+1)}{4}\right)^{\text{th}}$ value.
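The positional rule, including the interpolation between neighbouring values when the position is fractional, can be sketched in Python as below. The names `value_at_position` and `quantile` are hypothetical helpers, and clamping positions that fall outside the data to the first or last value is an assumption not discussed in the notes. The same function covers deciles and percentiles by changing `parts` to 10 or 100.

```python
def value_at_position(sorted_data, position):
    """Value at a 1-based, possibly fractional, position in ordered data."""
    whole = int(position)              # integer part of the position
    frac = position - whole            # fractional part used for interpolation
    if whole < 1:                      # assumption: clamp to the smallest value
        return sorted_data[0]
    if whole >= len(sorted_data):      # assumption: clamp to the largest value
        return sorted_data[-1]
    lower = sorted_data[whole - 1]
    upper = sorted_data[whole]
    return lower + frac * (upper - lower)

def quantile(values, i, parts):
    """i-th quantile using the i(n + 1)/parts position rule."""
    data = sorted(values)              # data must be in ascending order
    return value_at_position(data, i * (len(data) + 1) / parts)

data = [420, 430, 435, 438, 441, 449, 490, 500, 510, 515]
print([quantile(data, i, 4) for i in (1, 2, 3)])   # the three quartiles
```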


Example 3.20. Given the data: 420, 430, 435, 438, 441, 449, 490, 500, 510 and 515. Find all the quartiles.

$$Q_i = \left(\frac{i(n+1)}{4}\right)^{\text{th}} \text{ value}, \quad i = 1, 2, 3$$

$Q_1 = \left(\frac{10+1}{4}\right)^{\text{th}}$ value = 2.75th value = 2nd value + 0.75 (3rd value − 2nd value) = 430 + 0.75(435 − 430) = 433.75

$Q_2 = \left(\frac{2(10+1)}{4}\right)^{\text{th}}$ value = 5.5th value = 5th value + 0.5 (6th value − 5th value) = 441 + 0.5(449 − 441) = 445

$Q_3 = \left(\frac{3(10+1)}{4}\right)^{\text{th}}$ value = 8.25th value = 8th value + 0.25 (9th value − 8th value) = 500 + 0.25(510 − 500) = 502.5

For a frequency distribution,

$$Q_i = L_{Q_i} + \left(\frac{\frac{in}{4} - F_{Q_i-1}}{f_{Q_i}}\right) \times w, \quad i = 1, 2, 3$$

where $n = \sum_{i=1}^{k} f_i$ is the total number of observations, $f_{Q_i}$ is the frequency of the $i$th quartile class, $L_{Q_i}$ is the lower class boundary of the $i$th quartile class, $F_{Q_i-1}$ is the less than cumulative frequency just before the $i$th quartile class and $w$ is the class width of the $i$th quartile class. The $i$th quartile class is the class corresponding to the minimum less than cumulative frequency which contains the value $\frac{in}{4}$.

Example 3.21. Calculate all quartiles for the students score data and interpret the results.

Calculate the less than cumulative frequencies $F_i$ first. Then apply

$$Q_i = L_{Q_i} + \left(\frac{\frac{in}{4} - F_{Q_i-1}}{f_{Q_i}}\right) \times w, \quad i = 1, 2, 3$$

$Q_1$ class: n/4 = 56/4 = 14, so the $Q_1$ class is 18.5 − 22.5.

$$Q_1 = L_{Q_1} + \left(\frac{\frac{n}{4} - F_{Q_1-1}}{f_{Q_1}}\right) \times w = 18.5 + \left(\frac{14 - 11}{8}\right) \times 4 = 18.5 + 1.5 = 20$$

$Q_2$ class: 2n/4 = 2(56)/4 = 28, so the $Q_2$ class is 22.5 − 26.5.

$$Q_2 = L_{Q_2} + \left(\frac{\frac{2n}{4} - F_{Q_2-1}}{f_{Q_2}}\right) \times w = 22.5 + \left(\frac{28 - 19}{10}\right) \times 4 = 22.5 + 3.6 = 26.1$$

$Q_3$ class: 3n/4 = 3(56)/4 = 42, so the $Q_3$ class is 30.5 − 34.5.

$$Q_3 = L_{Q_3} + \left(\frac{\frac{3n}{4} - F_{Q_3-1}}{f_{Q_3}}\right) \times w = 30.5 + \left(\frac{42 - 41}{7}\right) \times 4 = 30.5 + 0.57 = 31.07$$

3.6.2 Deciles

Deciles are values that divide the data into ten equal parts. These values are denoted by $D_1, D_2, \cdots, D_9$ such that 10% of the data fall below $D_1$, 20% below $D_2$, $\cdots$, 90% below $D_9$.

Let $D_i$ be the $i$th decile ($i = 1, 2, \cdots, 9$); then $D_i = \left(\frac{i(n+1)}{10}\right)^{\text{th}}$ value.

Example 3.22. Given the data: 420, 430, 435, 438, 441, 449, 490, 500, 510 and 515. Find the 1st and 7th deciles.

$$D_i = \left(\frac{i(n+1)}{10}\right)^{\text{th}} \text{ value}, \quad i = 1, 2, \cdots, 9$$

$D_1 = \left(\frac{10+1}{10}\right)^{\text{th}}$ value = 1.1th value = 1st value + 0.1 (2nd value − 1st value) = 420 + 0.1(430 − 420) = 421

$D_7 = \left(\frac{7(10+1)}{10}\right)^{\text{th}}$ value = 7.7th value = 7th value + 0.7 (8th value − 7th value) = 490 + 0.7(500 − 490) = 497

For a frequency distribution,

$$D_i = L_{D_i} + \left(\frac{\frac{in}{10} - F_{D_i-1}}{f_{D_i}}\right) \times w, \quad i = 1, 2, \cdots, 9$$

where $n = \sum_{i=1}^{k} f_i$ is the total number of observations, $f_{D_i}$ is the frequency of the $i$th decile class, $L_{D_i}$ is the lower class boundary of the $i$th decile class, $F_{D_i-1}$ is the less than cumulative frequency just before the $i$th decile class and $w$ is the class width of the $i$th decile class. The $i$th decile class is the class corresponding to the minimum less than cumulative frequency which contains the value $\frac{in}{10}$.

Example 3.23. Calculate the 5th and 8th deciles for the students score data and interpret the results.

$$D_i = L_{D_i} + \left(\frac{\frac{in}{10} - F_{D_i-1}}{f_{D_i}}\right) \times w, \quad i = 1, 2, \cdots, 9$$

$D_5$ class: 5n/10 = 5(56)/10 = 28, so the $D_5$ class is 22.5 − 26.5.

$$D_5 = L_{D_5} + \left(\frac{\frac{5n}{10} - F_{D_5-1}}{f_{D_5}}\right) \times w = 22.5 + \left(\frac{28 - 19}{10}\right) \times 4 = 22.5 + 3.6 = 26.1$$

$D_8$ class: 8n/10 = 8(56)/10 = 44.8, so the $D_8$ class is 30.5 − 34.5.

$$D_8 = L_{D_8} + \left(\frac{\frac{8n}{10} - F_{D_8-1}}{f_{D_8}}\right) \times w = 30.5 + \left(\frac{44.8 - 41}{7}\right) \times 4 = 30.5 + 2.17 = 32.67$$

3.6.3 Percentiles

Percentiles are values that divide a data set into 100 equal parts. These values are denoted by $P_1, P_2, \cdots, P_{99}$.

Let $P_i$ be the $i$th percentile ($i = 1, 2, \cdots, 99$); then $P_i = \left(\frac{i(n+1)}{100}\right)^{\text{th}}$ value.

Example 3.24. Given the data: 420, 430, 435, 438, 441, 449, 490, 500, 510 and 515. Find the 40th and 75th percentiles.

$$P_i = \left(\frac{i(n+1)}{100}\right)^{\text{th}} \text{ value}, \quad i = 1, 2, \cdots, 99$$

$P_{40} = \left(\frac{40(10+1)}{100}\right)^{\text{th}}$ value = 4.4th value = 4th value + 0.4 (5th value − 4th value) = 438 + 0.4(441 − 438) = 439.2

$P_{75} = \left(\frac{75(10+1)}{100}\right)^{\text{th}}$ value = 8.25th value = 8th value + 0.25 (9th value − 8th value) = 500 + 0.25(510 − 500) = 502.5

For a frequency distribution,

$$P_i = L_{P_i} + \left(\frac{\frac{in}{100} - F_{P_i-1}}{f_{P_i}}\right) \times w, \quad i = 1, 2, \cdots, 99$$

where $n = \sum_{i=1}^{k} f_i$ is the total number of observations, $f_{P_i}$ is the frequency of the $i$th percentile class, $L_{P_i}$ is the lower class boundary of the $i$th percentile class, $F_{P_i-1}$ is the less than cumulative frequency just before the $i$th percentile class and $w$ is the class width of the $i$th percentile class. The $i$th percentile class is the class corresponding to the minimum less than cumulative frequency which contains the value $\frac{in}{100}$.

Example 3.25. Calculate the 30th and 90th percentiles for the students score data and interpret the results.

$$P_i = L_{P_i} + \left(\frac{\frac{in}{100} - F_{P_i-1}}{f_{P_i}}\right) \times w, \quad i = 1, 2, \cdots, 99$$

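$P_{30}$ class: 30n/100 = 30(56)/100 = 16.80, so the $P_{30}$ class is 18.5 − 22.5, whose frequency is 8.

$$P_{30} = L_{P_{30}} + \left(\frac{\frac{30n}{100} - F_{P_{30}-1}}{f_{P_{30}}}\right) \times w = 18.5 + \left(\frac{16.80 - 11}{8}\right) \times 4 = 18.5 + 2.9 = 21.4$$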

$P_{90}$ class: 90n/100 = 90(56)/100 = 50.40, so the $P_{90}$ class is 34.5 − 38.5.

$$P_{90} = L_{P_{90}} + \left(\frac{\frac{90n}{100} - F_{P_{90}-1}}{f_{P_{90}}}\right) \times w = 34.5 + \left(\frac{50.40 - 48}{8}\right) \times 4 = 34.5 + 1.2 = 35.7$$

Example 3.26. The lifetimes (in hours) of eighty randomly selected light bulbs are summarized in the following table. Find all the quartiles, the 6th decile and the 65th percentile.

| Time | 52.5-63.5 | 63.5-74.5 | 74.5-85.5 | 85.5-96.5 | ≥96.5 |
|---|---|---|---|---|---|
| No. of bulbs | 6 | 12 | 25 | 18 | 19 |

Relationship between Median, Quartiles, Deciles and Percentiles

• $\tilde{X} = Q_2 = D_5 = P_{50}$

• $Q_i = P_{25i}, \quad i = 1, 2, 3$

• $D_i = P_{10i}, \quad i = 1, 2, \cdots, 9$
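These identities follow directly from the position rules, since the positions coincide. As a brief check for the middle value:

$$\frac{2(n+1)}{4} = \frac{5(n+1)}{10} = \frac{50(n+1)}{100} = \frac{n+1}{2} \quad \text{(raw data)}, \qquad \frac{2n}{4} = \frac{5n}{10} = \frac{50n}{100} = \frac{n}{2} \quad \text{(grouped data)}$$

For the students score data, for example, the median, $Q_2$ and $D_5$ were all found to be 26.1.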

3.7 Mode

Mode is another measure of central tendency. It is the value that occurs most frequently in a data set. For instance, if shoe size 7 has the maximum demand, size No. 7 is the modal value of shoe sizes. Mode is denoted by $\hat{X}$. A data set may have one mode (uni-modal), two modes (bi-modal), more than two modes (multi-modal) or no mode at all (i.e. when all observations are equally frequent).

In ungrouped (individual series) cases, one can find mode by inspection. After arranging the data in ascending or descending order, the value appearing most frequently (the most frequent value) is taken as the modal value.

Example 3.27. Find the mode of the following data sets.

a. 110, 113, 116, 116, 118, 118, 118, 121 and 123.


b. 2, 3, 5, 7 and 8.

c. 15, 18, 18, 18, 20, 22, 24, 24, 24, 26 and 26

d. 5, 6, 6, 7, 9, 9, 10, 12 and 12.

e. 1, 1, 0, 1, 0, 0, 0, 2, 4 and 3.

To find the modal value of each data set, just find the value having the highest frequency.

a. Since 118 occurs more than other values, the mode is 118.

b. Each value occurs once (equally frequent), the data has no mode.

c. 18 and 24 occur three times, hence the modal values are 18 and 24 (bi-modal).

d. Tri-modal (multi-modal): 6, 9 and 12.

e. The modal value here is 0, as it occurs more often than any other value.
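Finding modes by inspection amounts to counting frequencies, which can be sketched in Python as below. The function name `modes` is a hypothetical helper; returning an empty list when all values are equally frequent is a convention chosen here to match the "no mode" case above.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # All distinct values equally frequent: the data set has no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([110, 113, 116, 116, 118, 118, 118, 121, 123]))     # data set (a)
print(modes([2, 3, 5, 7, 8]))                                   # data set (b)
print(modes([15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26]))      # data set (c)
```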

In a grouped (continuous) frequency distribution, the modal value is located in the class with the highest frequency, and that class is the modal class.

$$\hat{X} = L_{\hat{X}} + \left(\frac{f_{\hat{X}} - f_{\hat{X}-1}}{(f_{\hat{X}} - f_{\hat{X}-1}) + (f_{\hat{X}} - f_{\hat{X}+1})}\right) \times w$$

where $L_{\hat{X}}$ is the lower class boundary of the modal class, $f_{\hat{X}}$ is the frequency of the modal class, $f_{\hat{X}-1}$ is the frequency just before the modal class, $f_{\hat{X}+1}$ is the frequency just after the modal class and $w$ is the class width of the modal class. The modal class is the class corresponding to the largest frequency.
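The grouped mode formula can be sketched in Python as follows. The function name `grouped_mode` is a hypothetical helper; treating a missing neighbouring class (when the modal class is first or last) as having frequency zero, and picking the first class in case of ties, are assumptions not stated in the notes.

```python
def grouped_mode(boundaries, freqs):
    """Mode of a grouped distribution by interpolation within the modal class."""
    m = max(range(len(freqs)), key=lambda i: freqs[i])   # index of the modal class
    f_mode = freqs[m]
    f_before = freqs[m - 1] if m > 0 else 0               # assumption: 0 if no class before
    f_after = freqs[m + 1] if m < len(freqs) - 1 else 0   # assumption: 0 if no class after
    lower, upper = boundaries[m]
    d1 = f_mode - f_before
    d2 = f_mode - f_after
    return lower + d1 / (d1 + d2) * (upper - lower)

# Students score data (class boundaries and frequencies).
bounds = [(10.5, 14.5), (14.5, 18.5), (18.5, 22.5), (22.5, 26.5),
          (26.5, 30.5), (30.5, 34.5), (34.5, 38.5)]
freqs = [4, 7, 8, 10, 12, 7, 8]
print(round(grouped_mode(bounds, freqs), 2))
```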

Example 3.28. Find the modal score of the students score data.

The class having the highest frequency is 26.5 − 30.5; hence it is the modal class.

$$\hat{X} = L_{\hat{X}} + \left(\frac{f_{\hat{X}} - f_{\hat{X}-1}}{(f_{\hat{X}} - f_{\hat{X}-1}) + (f_{\hat{X}} - f_{\hat{X}+1})}\right) \times w$$

$$\hat{X} = 26.5 + \left(\frac{12 - 10}{(12 - 10) + (12 - 7)}\right) \times 4 = 26.5 + 1.14 = 27.64$$

Example 3.29. What is the modal lifetime of the light bulbs given in the table below?

| Time | 52.5-63.5 | 63.5-74.5 | 74.5-85.5 | 85.5-96.5 | ≥96.5 |
|---|---|---|---|---|---|
| No. of bulbs | 6 | 12 | 25 | 18 | 19 |

Mode is not affected by extreme values and can be calculated for open-ended classes. But it often does not exist and its value may not be unique.


Relationship between AM, Median and Mode

• In a symmetrical and uni-modal distribution: Mean = Median = Mode

• For a moderately skewed distribution: Mean − Mode = 3(Mean − Median)

EXERCISES:

1. In a certain investigation, 460 persons were involved in the study and, based on an enquiry on their age, it was known that 75% of them were 22 or more years old. The following frequency distribution shows the age composition of the persons under study.

| Mid age in years | 13 | 18 | 23 | 28 | 33 | 38 | 43 | 48 |
|---|---|---|---|---|---|---|---|---|
| No. of persons | 24 | $f_1$ | 90 | 122 | $f_2$ | 56 | 20 | 33 |

(a) Find the median and modal age of the persons and interpret them.
(b) Find the values of all quartiles.
(c) Compute the 5th decile, 25th percentile, 50th percentile and the 75th percentile and interpret the results.

2. The mean annual salary of all employees in a company is 2500. The mean salary of male and female is 2700 and 1700 respectively. Find the percentage of males and females employed in the company.

3. Given the following FD.

| Mid price of a commodity (birr) | 15 | 25 | 35 | 45 | 55 |
|---|---|---|---|---|---|
| Number of items sold | 27 | A | 28 | B | 19 |

(a) If 75% of the items were sold at birr 45 or less and most items were sold at birr 34, find the missing frequencies.
(b) If 25% of the items were sold at birr 45 or more and most items were sold at birr 34, find the missing frequencies.

Summary

Different measures of central tendency and quantiles have been discussed in this chapter. Out of the mean, median and mode, the mean (average) is the most commonly used measure of central tendency. But the other two, namely the median and the mode, are not any less important. The median is a widely used central measure in psychology, education and other social sciences. The mode is a suitable average for qualitative information like attitude towards disabled people, or the beauty or intelligence of certain individuals. It is also a useful measure for manufacturers.

Chapter 4

Measures of Variation, Skewness and Kurtosis

In the previous chapter, we concentrated on central values (measures of central tendency), which give an idea of the whole mass, that is, a complete set of values. However, the information so obtained is neither exhaustive nor comprehensive, as the mean does not tell us whether the observations are close to each other or far apart. The median is a positional average and has nothing to do with the variability of the observations in a data set. This leads us to conclude that a measure of central tendency is not enough to give a clear idea about the data unless all observations are the same. Moreover, two or more data sets may have the same mean and/or median but may be quite different.

The following table displays the price of a certain commodity in four cities. Find the mean and median prices of the four cities and interpret them.

| City | Prices |
|---|---|
| A | 30, 30, 30, 30, 30 |
| B | 28, 29, 30, 31, 32 |
| C | 10, 15, 30, 45, 50 |
| D | 0, 5, 30, 55, 60 |

All four data sets have mean 30, and the median is also 30. But by inspection it is apparent that the four data sets differ remarkably from one another. So measures of central tendency alone do not provide enough information about the nature of the data. Thus, to have a clear picture of the data, one needs a measure of dispersion or variability among the observations in the data set.
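The point can be illustrated with a short Python sketch that computes the mean, the median and, as a first crude indicator of spread, the difference between the largest and smallest price for each city. The helper name `summarize` and the dictionary `prices` are hypothetical names used only for this illustration.

```python
from statistics import mean, median

def summarize(values):
    """Return (mean, median, largest minus smallest) for one list of prices."""
    return mean(values), median(values), max(values) - min(values)

prices = {
    "A": [30, 30, 30, 30, 30],
    "B": [28, 29, 30, 31, 32],
    "C": [10, 15, 30, 45, 50],
    "D": [0, 5, 30, 55, 60],
}

for city, values in prices.items():
    print(city, summarize(values))
```

All four cities share the same centre, while the spread differs from city to city.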

Variation or dispersion may be defined as the extent of scatteredness of value around the measures of central tendency. Thus, a measure of dispersion tells us the extent to which the values of a variable vary about the measure of central tendency.


4.1 Objectives of Measures of Variation

1. To have an idea about the reliability of the measures of central tendency. If the degree of scatteredness is large, an average is less reliable. If the value of the variation is small, it indicates that a central value is a good representative of all the values in the data set.

2. To compare two or more sets of data with regard to their variability. Two or more data sets can be compared by calculating the same measure of variation, provided they have the same units of measurement. A set with a smaller value possesses less variability, that is, it is more uniform (or more consistent).

3. To provide information about the structure of the data. A value of a measure of variation gives an idea about the spread of the observation. Further, one can surmise about the limits of the expansion of the values in the data set.

4. To pave the way to the use of other statistical measures. Measures of variation, especially variance and standard deviation, lead to many statistical techniques like correlation, regression, analysis of variance, and so on.

4.2 Types of Measures of Variation

• Absolute Measures of Variation: A measure of variation is said to be in absolute form when it shows the actual amount of variation of the items from a measure of central tendency and is expressed in the same concrete units in which the data have been expressed. In other words, all absolute measures of variation have units. As a result, if two or more distributions differ in their units of measurement, their variability cannot be compared by using any absolute measure of variation.

The size of the absolute measures of dispersion depends upon the size of the values in the data. That is, if the values are larger, the value of the absolute measure will also be larger. Therefore, an absolute measure of variation is not appropriate for comparing two or more groups if the size of the values is not the same among the groups.

• Relative Measures of Variation: A relative measure of variation is the quotient obtained by dividing the absolute measure by the quantity with respect to which the absolute deviation has been computed. It is a unitless pure number and is used for making comparisons between different distributions.


| Absolute Measures | Relative Measures |
|---|---|
| Range | Coefficient of Range |
| Quartile Deviation | Coefficient of Quartile Deviation |
| Mean Deviation | Coefficient of Mean Deviation |
| Variance and Standard Deviation | Coefficient of Variation |
| | Standard Scores |

Before giving the details of these measures of dispersion, it is worthwhile to point out that a measure of dispersion (variation) is to be adjudged on the basis of all those properties of good measures of central tendency. Hence, their repetition is superfluous.

4.2.1 Range and Relative Range

Range is the simplest and crudest measure of dispersion. It is defined as the difference between the largest and the smallest values in the data.

• For raw data: $R = L - S$

• For grouped data: $R = UCL_{last} - LCL_{first}$

Coefficient of Range:

• For raw data: $CR = \dfrac{L - S}{L + S}$

• For grouped data: $CR = \dfrac{UCL_{last} - LCL_{first}}{UCL_{last} + LCL_{first}}$

Range hardly satisfies any property of a good measure of dispersion, as it is based on the two extreme values only, ignoring the others. It is also not amenable to further algebraic treatment.

4.2.2 Quartile Deviation and Coefficient of Quartile Deviation

Quartile deviation is sometimes known as the Semi-Interquartile Range (SIR). The Interquartile Range is $Q_3 - Q_1$. Thus,

$$QD = \frac{Q_3 - Q_1}{2}.$$

The corresponding relative measure of variation, the coefficient of quartile deviation, is:

$$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}.$$

QD involves only the middle 50% of the observations, excluding the observations below the lower quartile and the observations above the upper quartile. Note also that it does not take into account all the individual values occurring between $Q_1$ and $Q_3$. This means that no idea about the variation of even the middle 50% of the values is available from this measure. It nevertheless provides some idea if the values are uniformly distributed between $Q_1$ and $Q_3$. It can be calculated for open-ended classes.


4.2.3 Mean Deviation and Coefficient of Mean Deviation

The measures of variation discussed so far are not satisfactory in the sense that they lack most of the requirements of a good measure. Mean deviation is a better measure than range and quartile deviation. Mean deviation is the arithmetic mean of the absolute values of the deviations from some measure of central tendency, usually the mean or the median of a distribution. Hence we have the mean deviation about the mean, $MD(\bar{X})$, and the mean deviation about the median, $MD(\tilde{X})$.

• For raw data: $MD(\bar{X}) = \dfrac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$ and $MD(\tilde{X}) = \dfrac{\sum_{i=1}^{n} |X_i - \tilde{X}|}{n}$

• For grouped data: $MD(\bar{X}) = \dfrac{\sum_{i=1}^{k} f_i|X_i - \bar{X}|}{\sum_{i=1}^{k} f_i}$ and $MD(\tilde{X}) = \dfrac{\sum_{i=1}^{k} f_i|X_i - \tilde{X}|}{\sum_{i=1}^{k} f_i}$

MD is not much affected by extreme values. Its main drawback is that the algebraic negative signs of the deviations are ignored. MD is minimum when the deviations are taken from the median. The coefficients of mean deviation are:

$$CMD(\bar{X}) = \frac{MD(\bar{X})}{\bar{X}} \quad \text{and} \quad CMD(\tilde{X}) = \frac{MD(\tilde{X})}{\tilde{X}}$$

Example 4.1. Calculate the R, CR, QD, CQD, $MD(\bar{X})$, $MD(\tilde{X})$, $CMD(\bar{X})$ and $CMD(\tilde{X})$ for the following data: 20, 28, 40, 12, 30, 15, 50.

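Data array: 12, 15, 20, 28, 30, 40, 50.

$R = L - S = 50 - 12 = 38$, $\quad CR = \dfrac{50 - 12}{50 + 12} = \dfrac{38}{62} = 0.613$

$Q_1 = \left(\frac{7+1}{4}\right)^{\text{th}}$ value = 2nd value = 15

$Q_3 = \left(\frac{3(7+1)}{4}\right)^{\text{th}}$ value = 6th value = 40

$\Rightarrow QD = \dfrac{Q_3 - Q_1}{2} = \dfrac{40 - 15}{2} = \dfrac{25}{2} = 12.5$

$\Rightarrow CQD = \dfrac{Q_3 - Q_1}{Q_3 + Q_1} = \dfrac{40 - 15}{40 + 15} = \dfrac{25}{55} = 0.45$

$\bar{X} = \dfrac{12 + 15 + \cdots + 50}{7} = 27.86$ and $\tilde{X} = 28$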


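$$MD(\bar{X}) = \frac{|12 - 27.86| + |15 - 27.86| + \cdots + |50 - 27.86|}{7} = \frac{73.14}{7} = 10.45$$

$$MD(\tilde{X}) = \frac{|12 - 28| + |15 - 28| + \cdots + |50 - 28|}{7} = \frac{73}{7} = 10.43$$

$$CMD(\bar{X}) = \frac{10.45}{27.86} = 0.38 \quad \text{and} \quad CMD(\tilde{X}) = \frac{10.43}{28} = 0.37$$

Example 4.2. Calculate the R, CR, QD, CQD, $MD(\bar{X})$, $MD(\tilde{X})$, $CMD(\bar{X})$ and $CMD(\tilde{X})$ for the students score data.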

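Previously, we have obtained the following quantities for the students score data: $\bar{X} = 25.64$, $\tilde{X} = 26.1$, $Q_1 = 20$, $Q_3 = 31.07$.

| CBs | $X_i$ | $f_i$ | $\lvert X_i - \bar{X}\rvert$ | $f_i\lvert X_i - \bar{X}\rvert$ | $\lvert X_i - \tilde{X}\rvert$ | $f_i\lvert X_i - \tilde{X}\rvert$ |
|---|---|---|---|---|---|---|
| 10.5-14.5 | 12.5 | 4 | 13.14 | 52.56 | 13.6 | 54.4 |
| 14.5-18.5 | 16.5 | 7 | 9.14 | 63.98 | 9.6 | 67.2 |
| 18.5-22.5 | 20.5 | 8 | 5.14 | 41.12 | 5.6 | 44.8 |
| 22.5-26.5 | 24.5 | 10 | 1.14 | 11.40 | 1.6 | 16.0 |
| 26.5-30.5 | 28.5 | 12 | 2.86 | 34.32 | 2.4 | 28.8 |
| 30.5-34.5 | 32.5 | 7 | 6.86 | 48.02 | 6.4 | 44.8 |
| 34.5-38.5 | 36.5 | 8 | 10.86 | 86.88 | 10.4 | 83.2 |
| Total | | 56 | | 338.28 | | 339.2 |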

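$R = UCL_{last} - LCL_{first} = 38 - 11 = 27$

$CR = \dfrac{UCL_{last} - LCL_{first}}{UCL_{last} + LCL_{first}} = \dfrac{38 - 11}{38 + 11} = \dfrac{27}{49} = 0.551$

$QD = \dfrac{Q_3 - Q_1}{2} = \dfrac{31.07 - 20}{2} = \dfrac{11.07}{2} = 5.54$

$CQD = \dfrac{Q_3 - Q_1}{Q_3 + Q_1} = \dfrac{31.07 - 20}{31.07 + 20} = \dfrac{11.07}{51.07} = 0.22$

$MD(\bar{X}) = \dfrac{\sum f_i|X_i - \bar{X}|}{\sum f_i} = \dfrac{338.28}{56} = 6.04$

$MD(\tilde{X}) = \dfrac{\sum f_i|X_i - \tilde{X}|}{\sum f_i} = \dfrac{339.2}{56} = 6.06$

$CMD(\bar{X}) = \dfrac{MD(\bar{X})}{\bar{X}} = \dfrac{6.04}{25.64} = 0.24$

$CMD(\tilde{X}) = \dfrac{MD(\tilde{X})}{\tilde{X}} = \dfrac{6.06}{26.1} = 0.23$

Example 4.3. Calculate the R, QD and CQD for the following frequency distribution.

| Class limits | 10-14 | 15-19 | 20-24 | 25-29 | 30-34 | 35-39 |
|---|---|---|---|---|---|---|
| Frequency | 8 | 10 | 22 | 35 | 15 | 10 |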
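The raw-data mean deviation formulas of this subsection can be sketched in Python as below, using the seven values 20, 28, 40, 12, 30, 15 and 50 from Example 4.1. The function name `mean_deviation` is a hypothetical helper; the centre (mean or median) is passed in explicitly.

```python
from statistics import mean, median

def mean_deviation(values, center):
    """Arithmetic mean of the absolute deviations from a chosen centre."""
    return sum(abs(x - center) for x in values) / len(values)

data = [20, 28, 40, 12, 30, 15, 50]
md_mean = mean_deviation(data, mean(data))       # MD about the mean
md_median = mean_deviation(data, median(data))   # MD about the median

print(round(md_mean, 2), round(md_median, 2))
# Coefficients of mean deviation: divide by the centre that was used.
print(round(md_mean / mean(data), 2), round(md_median / median(data), 2))
```

The deviation about the median comes out no larger than the deviation about the mean, in line with the remark that MD is minimum when taken from the median.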


4.2.4 Variance and Standard Deviation

Variance and standard deviation are the most widely used measures of dispersion, and both measure the average dispersion of the observations around the mean. The variance of a data set is the sum of the squares of the deviations of the observations from the mean, divided by the total number of observations in the data set. The positive square root of the variance is called the standard deviation. For a population containing N elements, the population standard deviation is denoted by the Greek letter σ (sigma), and hence the population variance is denoted by $\sigma^2$.

• For raw data, $\sigma^2 = \dfrac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}$ and $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}}$

• For grouped data, $\sigma^2 = \dfrac{\sum_{i=1}^{k} f_i(X_i - \mu)^2}{\sum_{i=1}^{k} f_i}$ and $\sigma = \sqrt{\dfrac{\sum_{i=1}^{k} f_i(X_i - \mu)^2}{\sum_{i=1}^{k} f_i}}$

For a sample of n elements, the sample variance and standard deviation, denoted by $S^2$ and $S$ respectively, are calculated using the formulae:

• For raw data, $S^2 = \dfrac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}$ and $S = \sqrt{\dfrac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}}$

• For grouped data, $S^2 = \dfrac{\sum_{i=1}^{k} f_i(X_i - \bar{X})^2}{\sum_{i=1}^{k} f_i - 1}$ and $S = \sqrt{\dfrac{\sum_{i=1}^{k} f_i(X_i - \bar{X})^2}{\sum_{i=1}^{k} f_i - 1}}$

Example 4.4. Find the variance and standard deviation of: 20, 28, 40, 12, 30, 15 and 50.
a. Take the data as a population.
b. Consider it as a sample.

a. $N = 7$, $\mu = 27.86$; $\quad \sigma^2 = \dfrac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}$

$$\sigma^2 = \frac{(20 - 27.86)^2 + \cdots + (50 - 27.86)^2}{7} = \frac{1120.86}{7} = 160.12$$

$$\Rightarrow \sigma = \sqrt{160.12} = 12.65$$

b. $n = 7$, $\bar{X} = 27.86$; $\quad S^2 = \dfrac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}$

$$S^2 = \frac{(20 - 27.86)^2 + \cdots + (50 - 27.86)^2}{6} = \frac{1120.86}{6} = 186.81$$

$$\Rightarrow S = \sqrt{186.81} = 13.67$$
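The two parts of Example 4.4 differ only in the divisor (N versus n − 1). As a sketch, Python's standard `statistics` module provides both versions: `pvariance` and `pstdev` divide by N, while `variance` and `stdev` divide by n − 1. The variable names below are chosen only for illustration.

```python
from statistics import pvariance, pstdev, variance, stdev

data = [20, 28, 40, 12, 30, 15, 50]

# a. Data treated as a population: divide by N.
population_result = (pvariance(data), pstdev(data))

# b. Data treated as a sample: divide by n - 1.
sample_result = (variance(data), stdev(data))

print([round(v, 2) for v in population_result])
print([round(v, 2) for v in sample_result])
```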

Example 4.5. Find the variance and standard deviation of the students score data.

The necessary calculations for the variance are as follows: $\bar{X} = 25.64$

| CBs | $X_i$ | $f_i$ | $X_i - \bar{X}$ | $(X_i - \bar{X})^2$ | $f_i(X_i - \bar{X})^2$ |
|---|---|---|---|---|---|
| 10.5-14.5 | 12.5 | 4 | -13.14 | 172.6596 | 690.6384 |
| 14.5-18.5 | 16.5 | 7 | -9.14 | 83.5396 | 584.7772 |
| 18.5-22.5 | 20.5 | 8 | -5.14 | 26.4196 | 211.3568 |
| 22.5-26.5 | 24.5 | 10 | -1.14 | 1.2996 | 12.9960 |
| 26.5-30.5 | 28.5 | 12 | 2.86 | 8.1796 | 98.1552 |
| 30.5-34.5 | 32.5 | 7 | 6.86 | 47.0596 | 329.4172 |
| 34.5-38.5 | 36.5 | 8 | 10.86 | 117.9396 | 943.5168 |
| Total | | 56 | | | 2870.8576 |

$$\sigma^2 = \frac{\sum f_i(X_i - \bar{X})^2}{\sum f_i} = \frac{2870.8576}{56} = 51.27$$

$$\sigma = \sqrt{51.27} = 7.16$$

Example 4.6. Find the variance and standard deviation for the following data.

| Class limits | 6-10 | 11-15 | 16-20 | 21-25 | 26-30 | 31-35 | 36-40 |
|---|---|---|---|---|---|---|---|
| Frequency | 1 | 2 | 3 | 5 | 4 | 3 | 2 |

The main objection to mean deviation, the ignoring of negative signs, is removed by taking the squares of the deviations from the mean. The first main demerit of variance is that its unit is the square of the unit of measurement of the variable values. For example, the sample variance of 2m, 6m and 4m is 4m². Interpreting this as "on average each value differs from the mean by 4m²" is wrong, because the unit of measurement of the variance is not the same as that of the data set. The other disadvantage of variance is that the variation of the data is exaggerated, because the deviation of each value from the mean is squared. For the given example, the variation of the data is exaggerated from two to four, since the deviations are squared. Variance also gives more weight to the extreme values as compared to those which are near the mean value.


Standard deviation is considered to be the best measure of dispersion because its unit of measurement is the same as that of the data set, and the exaggeration made by the variance is eliminated by taking its square root. In simple words, it describes the average amount of variation on either side of the mean. If the standard deviation of the data is small, the values are concentrated near the mean, and if it is large, the values are scattered away from the mean.

Properties of Variance and Standard Deviation

1. If a constant is added to (or subtracted from) each and every observation, the standard deviation as well as the variance remains the same.

2. If each and every value is multiplied by a nonzero constant $c$, the standard deviation is multiplied by $|c|$ and the variance is multiplied by $c^2$.

3. If there are $g$ different groups having the same units of measurement with sample means $\bar{X}_1, \bar{X}_2, \cdots, \bar{X}_g$, numbers of sample observations $n_1, n_2, \cdots, n_g$ and sample variances $S_1^2, S_2^2, \cdots, S_g^2$ respectively, then the variance of all the groups, called the pooled variance (denoted by $S_p^2$), is given by:

$$S_p^2 = \frac{\sum_{i=1}^{g}(n_i - 1)\left[S_i^2 + (\bar{X}_i - \bar{X}_c)^2\right]}{\sum_{i=1}^{g} n_i - g}$$

$$\Rightarrow S_p^2 = \frac{(n_1 - 1)\left[S_1^2 + (\bar{X}_1 - \bar{X}_c)^2\right] + \cdots + (n_g - 1)\left[S_g^2 + (\bar{X}_g - \bar{X}_c)^2\right]}{n_1 + n_2 + \cdots + n_g - g}$$

where $\bar{X}_c$ denotes the combined mean of all the groups. If $\bar{X}_1 = \bar{X}_2 = \cdots = \bar{X}_g$,

$$\Rightarrow S_p^2 = \frac{\sum_{i=1}^{g}(n_i - 1)S_i^2}{\sum_{i=1}^{g} n_i - g} = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2 + \cdots + (n_g - 1)S_g^2}{n_1 + n_2 + \cdots + n_g - g}$$

Similarly, the pooled population variance can be calculated using the formula:

$$\sigma_p^2 = \frac{\sum_{i=1}^{g} N_i\left[\sigma_i^2 + (\mu_i - \mu_c)^2\right]}{\sum_{i=1}^{g} N_i}$$

$$\Rightarrow \sigma_p^2 = \frac{N_1\left[\sigma_1^2 + (\mu_1 - \mu_c)^2\right] + \cdots + N_g\left[\sigma_g^2 + (\mu_g - \mu_c)^2\right]}{N_1 + N_2 + \cdots + N_g}$$
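Properties 1 and 2 can be checked numerically with a short Python sketch. The constants 100 and −3 and the variable names are arbitrary choices made for illustration; a negative multiplier is used to show that the standard deviation scales by the absolute value of the constant.

```python
from statistics import pvariance, pstdev

data = [20, 28, 40, 12, 30, 15, 50]
shifted = [x + 100 for x in data]   # property 1: add a constant
scaled = [-3 * x for x in data]     # property 2: multiply by c = -3

# Adding a constant leaves variance and standard deviation unchanged.
print(round(pvariance(data), 2), round(pvariance(shifted), 2))

# Multiplying by c multiplies the variance by c**2 and the SD by |c|.
print(round(pvariance(scaled) / pvariance(data), 2))
print(round(pstdev(scaled) / pstdev(data), 2))
```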


Example 4.7. The mean weight of 150 students is 60 kilograms. The mean weight of the boys is 70 kilograms with a standard deviation of 10 kilograms. For the girls, the mean weight is 55 kilograms and the standard deviation is 15 kilograms.

a. Find the number of boys and girls.

b. Find the combined standard deviation.

Example 4.8. A distribution consists of four parts characterized as follows. Find the mean and standard deviation of the distribution. Ans: $\bar{X}_c = 73.8$ and $\sigma = 11.93$

| Part | No. of items | Mean | S.D |
|---|---|---|---|
| 1 | 50 | 61 | 8 |
| 2 | 100 | 70 | 9 |
| 3 | 120 | 50 | 10 |
| 4 | 30 | 83 | 11 |

Example 4.9. The arithmetic mean and standard deviation of a series of 20 items were computed as 20 and 5 respectively. While calculating these, an item 13 was misread as 30. Find the correct mean and standard deviation.

Example 4.10. The following data are some of the particulars of the distribution of weights of boys and girls in a class.

| | Boys | Girls |
|---|---|---|
| Number | 100 | 50 |
| Mean | 60 | 45 |
| Variance | 9 | 4 |

a. Find the mean and variance of the combined series.

b. If one of the values is misread as 60 instead of 40 what is the correct standard deviation.

Empirical Relationship between QD, MD and SD

6QD = 5MD = 4SD

4.2.5 Coefficient of Variation

The coefficient of variation is a relative measure based on the standard deviation. It is the ratio of the standard deviation to the mean, expressed as a percentage. Hence, it is a unitless measure of variation and also takes into account the size of the means of the distributions.

• For population: $CV = \dfrac{\sigma}{\mu} \times 100\%$

• For sample: $CV = \dfrac{S}{\bar{X}} \times 100\%$

The distribution having the smaller CV is said to be less variable, or more consistent, or more uniform. For field experiments, the CV is generally reported. If it is small, it indicates more reliability of the experimental findings.

Example 4.11. Compare the variability of the following two sample data sets using standard deviation and coefficient of variation:

A. 2 Meters, 4 Meters, 6 Meters

B. 600 Liters, 400 Liters, 500 Liters

Example 4.12. The average IQ of statistics students is 110 with standard deviation 5 and the average IQ of mathematics students is 106 with standard deviation 4. Which class is less variable in terms of IQ?
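As a rough check of Examples 4.11 and 4.12, the sketch below computes the sample standard deviation and the CV with Python's standard `statistics` module. The helper name `cv_percent` is an assumption made for illustration.

```python
import statistics

def cv_percent(sd, mean):
    """Coefficient of variation as a percentage."""
    return sd / mean * 100

# Example 4.11: the two data sets are in different units, so only the CVs are comparable.
a = [2, 4, 6]          # meters
b = [600, 400, 500]    # liters
cv_a = cv_percent(statistics.stdev(a), statistics.mean(a))
cv_b = cv_percent(statistics.stdev(b), statistics.mean(b))

# Example 4.12: summary statistics are given directly.
cv_stat = cv_percent(5, 110)
cv_math = cv_percent(4, 106)

print(cv_a, cv_b)
print(round(cv_stat, 2), round(cv_math, 2))
```

Although data set B has the larger standard deviation, its CV is smaller, which is why the CV is preferred when units or magnitudes differ.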

4.2.6 Standard Score

The standard score (Z-score) tells us how many standard deviations a given value lies above or below the mean, according to whether the Z-score is positive or negative.

• For population: $Z = \dfrac{X - \mu}{\sigma}$

• For sample: $Z = \dfrac{X - \bar{X}}{S}$

Example 4.13. Suppose Yoseph scored 90 on a test in which the mean and standard deviation of the class were 70 and 10 respectively. In another test, Helen scored 600, where the mean and standard deviation of the class were 560 and 40 respectively. Who is better off relative to his/her class?
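The comparison in Example 4.13 reduces to two applications of the Z-score formula. A minimal sketch (the function name is illustrative only):

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_yoseph = z_score(90, 70, 10)
z_helen = z_score(600, 560, 40)
print(z_yoseph, z_helen)
```

The larger Z-score identifies the student who did better relative to his or her own class, even though the raw scores are on different scales.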

Summary

A measure of dispersion, especially the variance, is the backbone of statistics. As a matter of fact, statistics involves variance in almost every study in one way or another. Most surveys and experiments are studies of sample units. Hence, the formulae for samples are the ones mostly used. Except for the variance, the formulae are the same whether we consider a population or a sample. Of course, the interpretation of the values has to be made accordingly.


4.3 Moments

Let X be a variable that assumes the values $X_1, X_2, \cdots, X_N$.

1. The rth moment about a number A is defined as:

• $\mu''_r = \dfrac{\sum_{i=1}^{N} (X_i - A)^r}{N}$ for raw data

• $\mu''_r = \dfrac{\sum_{i=1}^{k} f_i (X_i - A)^r}{\sum_{i=1}^{k} f_i}$ for grouped data.

Note that $\bar{X} = \mu''_1 + A$.

2. The rth moment about the origin (i.e., in (1) above A = 0) is defined as:

• $\mu'_r = \dfrac{\sum_{i=1}^{N} X_i^r}{N}$ for raw data

• $\mu'_r = \dfrac{\sum_{i=1}^{k} f_i X_i^r}{\sum_{i=1}^{k} f_i}$ for grouped data.

Note that $\mu'_1 = \bar{X}$ and $\sigma^2 = \mu'_2 - (\mu'_1)^2$.

3. The rth central moment (i.e., in (1) above A = µ) is defined as:

• $\mu_r = \dfrac{\sum_{i=1}^{N} (X_i - \mu)^r}{N}$ for raw data

• $\mu_r = \dfrac{\sum_{i=1}^{k} f_i (X_i - \mu)^r}{\sum_{i=1}^{k} f_i}$ for grouped data.

For any distribution,

• $\mu_0 = 1$

• $\mu_1 = 0$

• $\mu_2 = \sigma^2$

Example 4.14. Find the first three central moments of the numbers 2, 3 and 7.

Example 4.15. Find the third moment about 3 of the numbers 2, 3 and 7.
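The raw-data moment definitions translate directly into code. The sketch below is one way to work Examples 4.14 and 4.15; the function name `moment` and its `about` parameter are assumptions made for illustration.

```python
def moment(data, r, about=0.0):
    """r-th moment of raw data about the value `about` (divides by N)."""
    return sum((x - about) ** r for x in data) / len(data)

data = [2, 3, 7]
mean = moment(data, 1)                                       # first moment about the origin
central = [moment(data, r, about=mean) for r in (1, 2, 3)]   # Example 4.14
third_about_3 = moment(data, 3, about=3)                     # Example 4.15
print(mean, central, third_about_3)
```

Setting `about=0` gives moments about the origin, and setting it to the mean gives central moments, mirroring cases (2) and (3) of the definition.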

4.4 Skewness

4.4.1 Frequency Curves

As discussed earlier, the frequency curve is one of the graphical methods of data presentation used for continuous data. It is a smooth curve drawn through the points whose coordinates are the class marks and the corresponding frequencies.

1. Symmetric/Normal curve: A frequency curve is symmetric when it looks the same to the left and to the right of the central point. The distribution is spread around a central value in a symmetrical, bell-shaped pattern. Specifically, in a symmetrical (normal) distribution:

• The lengths of both tails are the same.
• The mean, median and modal values are approximately equal.
• The corresponding pairs of quartiles, deciles and percentiles are equidistant from the median. For example, the first quartile and the third quartile are the same distance from the median.


2. Positively skewed curve: If some observations are extremely large, the mean of the distribution becomes greater than the median and the mode. In such a case, the distribution is said to be positively skewed. In a positively skewed distribution:

• The right tail of the frequency curve is more elongated; the longer tail lies to the right of the central point.
• More values are on the left of the mean.
• The extreme variation is towards large values (to the right).
• Smaller values are more frequent.
• Mean > Median > Mode

3. Negatively skewed curve: If some extremely small observations are present, the mean is smaller than the other two averages (the median and the mode), and the distribution is said to be negatively skewed. In a negatively skewed distribution:

• The left tail is more elongated.
• More observations are concentrated on the right of the mean.
• The extreme variation is towards lower values (to the left).
• Larger values are more frequent than small values.
• Mean < Median < Mode


4.4.2 Measures of Skewness

Skewness is the lack of symmetry, that is, the departure (asymmetry) from the normal curve. If the frequency curve is symmetrical then it has no skewness. The measure of the degree of asymmetry is called a measure of skewness. If the two tails of a frequency curve are not equally distributed, the curve is asymmetric and is called a skewed curve.

Measures of Skewness

1. Karl Pearson’s Coefficient of Skewness ($S_{kp}$):

$$S_{kp} = \frac{\bar{X} - \hat{X}}{S}$$

where $\hat{X}$ denotes the mode.

• If $S_{kp} = 0$, the distribution is symmetrical.

• If $S_{kp} > 0$, the distribution is positively skewed.

• If $S_{kp} < 0$, the distribution is negatively skewed.

2. Bowley’s Coefficient of Skewness ($S_{kb}$):

$$S_{kb} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$$

• If $S_{kb} = 0$, the distribution is symmetrical.

• If $S_{kb} > 0$, the distribution is positively skewed.

• If $S_{kb} < 0$, the distribution is negatively skewed.


3. The Moment Measure of Skewness (denoted by $\alpha_3$):

$$\alpha_3 = \frac{\mu_3}{\sqrt{\mu_2^3}}$$

• If $\alpha_3 = 0$, the distribution is symmetrical.

• If $\alpha_3 > 0$, the distribution is positively skewed.

• If $\alpha_3 < 0$, the distribution is negatively skewed.

Example 4.16. Calculate Pearson’s and Bowley’s coefficients of skewness for: 2, 3, 4, 4, 5, 5, 5, 7, 8, 9.

Example 4.17. The mean, median and coefficient of variation of 100 observations are found to be 90, 84 and 80 respectively. Find the coefficient of skewness.
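For raw data such as Example 4.16, both coefficients can be sketched with Python's `statistics` module. Two assumptions are worth flagging: the sample standard deviation (divisor n − 1) is used for S, and quartiles are located with the "exclusive" method, which interpolates at position (n + 1)p. If the quartile rule used earlier in these notes differs, Bowley's coefficient will differ accordingly.

```python
import statistics

data = [2, 3, 4, 4, 5, 5, 5, 7, 8, 9]

mean = statistics.mean(data)
mode = statistics.mode(data)          # most frequent value
s = statistics.stdev(data)            # sample standard deviation (assumption)

# Pearson's coefficient: (mean - mode) / S
sk_pearson = (mean - mode) / s

# Bowley's coefficient from the three quartiles
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
sk_bowley = (q3 + q1 - 2 * q2) / (q3 - q1)

print(round(sk_pearson, 4), round(sk_bowley, 4))
```

Both coefficients come out positive for this data set, which agrees with the longer right tail of the values.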

4.5 Kurtosis

Kurtosis refers to the peakedness or flatness of a certain distribution with respect to the normal distribution. It describes the degree of concentration of observations around the mode of the distribution, whether the values are concentrated more around the mode (a peaked curve) or away from the mode toward both the tails of the frequency curve. Two or more distributions may have identical average, variation and skewness but they may show different degrees of concentration of values of observations around the mode and hence may show different degrees of peakedness.

A distribution which is neither more peaked nor flat topped is called mesokurtic.


If a distribution is more flat-topped than the normal curve, it is called platykurtic.

A distribution which is more peaked than the normal curve is called a leptokurtic distribution.

Measures of Kurtosis

1. The Coefficient of Kurtosis:

$$K = \frac{Q_3 - Q_1}{D_9 - D_1}$$


2. The Moment Measure of Kurtosis:

$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$

• If $\beta_2 = 3$, the distribution is normal (mesokurtic).
• If $\beta_2 > 3$, the distribution is leptokurtic.
• If $\beta_2 < 3$, the distribution is platykurtic.

Example 4.18. The first four central moments of a distribution are 0, 2.5, 0.7 and 18.75. Comment on the skewness and kurtosis of the distribution.
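Example 4.18 only needs the two moment ratios defined above. A minimal sketch, with variable names chosen for illustration:

```python
# First four central moments from Example 4.18
mu1, mu2, mu3, mu4 = 0, 2.5, 0.7, 18.75

alpha3 = mu3 / mu2 ** 1.5    # moment measure of skewness: mu3 / sqrt(mu2^3)
beta2 = mu4 / mu2 ** 2       # moment measure of kurtosis: mu4 / mu2^2

print(round(alpha3, 4), beta2)
```

A positive `alpha3` indicates positive skewness, and `beta2` is compared with 3 to classify the peakedness.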

EXERCISES:

1. Find the range, quartile deviation, mean deviation about the mean, mean deviation about the median, mean deviation about the mode, variance, standard deviation and coefficient of variation for the following distribution.

Class       2-4   4-6   6-8   8-10
Frequency   2     5     4     7

Also calculate the Pearson’s coefficient of skewness, moment coefficient of skewness and Bowley’s coefficient of skewness.

2. Three independent distributions each of 100 members and standard deviation 4.5 units are located with their means at 12.1, 17.1 and 22.1 units respectively. Find the standard deviation of the three distributions taken as a whole.

3. The first of two groups has 100 items with mean 45 and variance 49. If the combined group has 250 items with mean 51 and variance 130, find the mean and standard deviation of the second group.

4. Karl Pearson’s coefficient of skewness is +0.32. Its standard deviation is 6.5 and mean is 29.6. Find the median and mode of the distribution.

5. For a distribution, Bowley’s coefficient of skewness is -0.56, the lower quartile is 16.4 and median is 24.2. What is the quartile deviation?

6. The first two moments of a distribution about the value 5 are 2 and 20. Find the mean and variance of the distribution.

Chapter 5

Elementary Probability

As a general concept, probability is the measure of a chance that something will occur. Or it may also be defined as a quantitative measure of uncertainty.

5.1 Concept of Set

In order to discuss the theory of probability, it is essential to be familiar with some ideas and concepts of the mathematical theory of sets. A set is a collection of well-defined objects and is denoted by a capital letter such as A, B, C, etc.

In describing which objects are contained in set A, two common methods are available. These methods are:

1. Listing all objects of A. For example, A = {1, 2, 3, 4} describes the set consisting of the positive integers 1, 2, 3 and 4.

2. Describing a set in words, for example, set A consists of all real numbers between 0 and 1, inclusive. It can be written as A = {x : 0 ≤ x ≤ 1}, that is, A is the set of all x’s where x is a real number between 0 and 1, inclusive.

If A = {a1, a2, ··· , an}, then each object ai; i = 1, 2, ··· , n belonging to set A is called a member or an element of set A, i.e., ai ∈ A. A set consisting of all possible elements under consideration is called a universal set (denoted by U). On the other hand, a set containing no element is called an empty set (denoted by ∅ or {}).

If every element of set A is also an element of set B, A is said to be a subset of B, written as A ⊂ B. Every set is a subset of itself, i.e., A ⊂ A. The empty set is a subset of every set. If A ⊂ B and B ⊂ C, then A ⊂ C. If A ⊂ B and B ⊂ A, then A and B are said to be equal.

Now let us see some methods of combining sets in order to form a new set and develop the main properties.


1. Union (Or): The set consisting of all elements in A or B or both is called the union of A and B, written as A ∪ B. That is, A ∪ B = {x : x ∈ A or x ∈ B}. The set A ∪ B is also called the sum of A and B.

2. Intersection (And): The set consisting of all elements in both A and B is called the intersection of A and B, written as A ∩ B. That is, A ∩ B = {x : x ∈ A and x ∈ B}. The intersection of A and B is also called the product of A and B.

3. Complement (Not): The complement of a set A, denoted by A′, is the set consisting of all elements of U that are not in A, i.e., A′ = {x ∈ U : x ∉ A}.

Equivalent Sets

• Commutative laws:

– A ∪ B = B ∪ A
– A ∩ B = B ∩ A

• Associative laws:

– A ∪ (B ∪ C) = (A ∪ B) ∪ C
– A ∩ (B ∩ C) = (A ∩ B) ∩ C

• Distributive laws:

– A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
– A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)

• Identity laws:

– A ∪ A = A, A ∩ A = A
– A ∪ U = U, A ∩ U = A
– A ∪ ∅ = A, A ∩ ∅ = ∅

• A ∪ A′ = U and A ∩ A′ = ∅

• ∅′ = U and U′ = ∅

• De Morgan’s laws:

– (A ∪ B)′ = A′ ∩ B′
– (A ∩ B)′ = A′ ∪ B′

• A ⊂ B ⇔ B′ ⊂ A′ ⇒ A ∪ B = B and A ∩ B = A


Sets having no element in common, i.e., A ∩ B = ∅, are called mutually exclusive or disjoint sets. Hence, note that U and ∅ are mutually exclusive sets, and so are A and A′. If A and B are finite sets, then n(A ∪ B) = n(A) + n(B) − n(A ∩ B).

Example 5.1. Let U = {a, b, c, d, e, f, g, h}. Let A = {a, d, e}, B = {d, e, g, h} and C = {a, d, c, e, h}. Find A ∪ B, A ∩ B, A ∩ B′, A′ ∩ B, (A ∪ B)′, (A ∩ B)′, A ∩ (B ∪ C), A ∪ (B ∩ C).
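The set operations of Example 5.1 map directly onto Python's built-in `set` type. In this sketch the complement is computed as a difference from the universal set U; the helper name `complement` is an assumption for illustration.

```python
U = set("abcdefgh")
A = {"a", "d", "e"}
B = {"d", "e", "g", "h"}
C = {"a", "d", "c", "e", "h"}

def complement(s):
    """Complement relative to the universal set U."""
    return U - s

results = {
    "A ∪ B": A | B,
    "A ∩ B": A & B,
    "A ∩ B′": A & complement(B),
    "A′ ∩ B": complement(A) & B,
    "(A ∪ B)′": complement(A | B),
    "(A ∩ B)′": complement(A & B),
    "A ∩ (B ∪ C)": A & (B | C),
    "A ∪ (B ∩ C)": A | (B & C),
}
for name, value in results.items():
    print(name, sorted(value))

# De Morgan's law and the counting formula hold for these sets.
de_morgan_ok = complement(A | B) == complement(A) & complement(B)
count_ok = len(A | B) == len(A) + len(B) - len(A & B)
```

The final two checks illustrate De Morgan's law and n(A ∪ B) = n(A) + n(B) − n(A ∩ B) on concrete sets; they are not a proof.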

Example 5.2. Consider U = {x : x ≥ 0} and three events A = {x : x ≤ 100}, B = {x : 50 ≤ x ≤ 200} and C = {x : x ≥ 150}. Find i) A ∪ B ii) A ∩ B iii) B ∩ C iv) (A ∩ B)′

Example 5.3. In a survey conducted among 200 statistics major students, the number of students who visited historical, religious and both sites are found to be 150, 130 and 80 respectively. Find the number of students who visited none of the sites.

5.2 Basic Probability Terms

1. Experiment: It is an activity or a trial that leads to well-defined results called outcomes, although it is uncertain which result will occur. Hence, a probability experiment is identified by two properties. First, each experiment has several (at least two) possible outcomes, all of which are known in advance. Second, none of these outcomes can be predicted with certainty. For example, in the experiment of tossing a fair coin, we cannot be certain whether the outcome will be a head.

2. Outcome: is a result of a single trial (experiment).

3. Sample Space (S): is the collection of all possible outcomes of an experiment. (In this context, S represents the universal set described previously.) Examples: Define the sample space for the following probability experiments.

(a) Tossing a coin: S = {Head (H), Tail (T)}
(b) Tossing two coins: S = {HH, HT, TH, TT}
(c) Rolling a die: S = {1, 2, 3, 4, 5, 6}
(d) Selecting an item from a production lot: S = {Defective, Non-defective}
(e) Introducing a new product: S = {Success, Failure}

4. Event: is an outcome or a set of outcomes (having some common characteristics) of an experiment. For example in the experiment of tossing two coins simultaneously if the event E is defined as getting one head, then E = {HT,TH}. In the experiment of rolling a die, let E be an even number, then E = {2, 4, 6}.


Since S ⊂ S and ∅ ⊂ S, it follows that S and ∅ are also events. S is called the certain (sure) event because every outcome is an element of S. The event ∅ is the impossible event because no outcome of the experiment can be an element of ∅.

Definition: Mutually Exclusive Events: Two events A and B are said to be mutually exclusive if they cannot occur together, i.e., A ∩ B = ∅. For example, in the experiment of rolling a die, odd numbers and even numbers are mutually exclusive events.

Let us now use the methods of combining sets (that is, events) introduced earlier to obtain new sets (that is, events). Consider an outcome s and events A and B:

• If s ∈ A, then A occurs. A′ is the event which occurs if A does not occur.
• If s ∈ (A ∪ B), then one of the events A or B occurs or both occur.
• If s ∈ (A ∩ B), then both events A and B occur.
• If s ∈ (A′ ∩ B′), then neither A nor B occurs.
• If s ∈ (A′ ∩ B), then B occurs but not A.
• If s ∈ (A ∩ B′) ∪ (A′ ∩ B), then exactly one of the events A or B occurs.
• If s ∈ (A ∩ B)′ = A′ ∪ B′, then A and B do not both occur (at least one of them does not occur).

5.3 Counting Techniques

Counting techniques are mathematical methods used to determine the number of possible ways of arranging or ordering objects. They are used to find the size of a sample space that is too large to list. Example: What is the size of the sample space if a coin is tossed a large number of times, say 20 or more?

1. Addition Rule: Suppose there are k procedures ($p_1, p_2, \cdots, p_k$), in which the ith procedure can be done in $n_i$; i = 1, 2, ··· , k ways. Then the total number of ways of performing $p_1$ or $p_2$ or ··· or $p_k$ is $n_1 + n_2 + \cdots + n_k$, provided that no two procedures can be performed at the same time or one after the other.

Example 5.4. There are 2 bus and 3 train routes from city X to city Z. In how many ways can a person go from city X to city Z? Ans: 2 + 3 = 5 ways

2. Multiplication Rule: Suppose there is a sequence of k events, in which the ith event has $n_i$; i = 1, 2, ··· , k possibilities. Then the total number of possibilities for the whole sequence is $n_1 \times n_2 \times \cdots \times n_k$.


Example 5.5. There are 2 bus routes from city X to city Y and 3 train routes from city Y to city Z. In how many ways can a person go from city X to city Z? Ans: 2 × 3 = 6 ways

Example 5.6. There are 6 questions. Each question has 4 choices. How many answer keys must be made? Ans: 4 × 4 × 4 × 4 × 4 × 4 = $4^6$ = 4096

Example 5.7. There are 5 hotels in a city. If 4 persons each check into a different hotel, in how many ways can this be done? Ans: 5 × 4 × 3 × 2 = 120

Example 5.8. In how many ways can 6 persons be seated in a row? Ans: 6 × 5 × 4 × 3 × 2 × 1 = 720

Example 5.9. Seven dice are rolled. How many different outcomes are there? Ans: 6 × 6 × 6 × 6 × 6 × 6 × 6 = $6^7$ = 279936
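The multiplication-rule answers in Examples 5.5 to 5.9 can be confirmed with a few lines of Python. This is only an arithmetic check; the variable names are illustrative, and `math.perm` requires Python 3.8 or later.

```python
import math

routes = 2 * 3                        # Example 5.5: bus leg then train leg
answer_keys = 4 ** 6                  # Example 5.6: 4 choices for each of 6 questions
hotel_assignments = math.perm(5, 4)   # Example 5.7: 5 x 4 x 3 x 2
seatings = math.factorial(6)          # Example 5.8: 6 persons in a row
dice_outcomes = 6 ** 7                # Example 5.9: 6 faces for each of 7 dice

print(routes, answer_keys, hotel_assignments, seatings, dice_outcomes)
```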

3. Permutation: is the arrangement or selection of objects in a specific order.

(a) Permutation Rule 1: The number of permutations of n distinct objects taken all together is n! = n × (n − 1) × (n − 2) × ··· × 1. By definition, 1! = 0! = 1.

Example 5.10. In how many ways can 6 persons be seated in a row? Ans: 6! = 720

Example 5.11. Suppose a photographer must arrange 4 persons in a row for a photograph. In how many different ways can the arrangement be done? Ans: 4! = 24

(b) Permutation Rule 2: The arrangement of n distinct objects in a specific order using r objects at a time is called a permutation of n objects taking r objects at a time, denoted ${}_{n}P_{r}$, where

$${}_{n}P_{r} = \frac{n!}{(n-r)!}, \quad 0 \le r \le n.$$

Example 5.12. In how many ways can 9 books be arranged on a shelf having 4 places? Ans: ${}_{9}P_{4} = \dfrac{9!}{(9-4)!} = 9 \times 8 \times 7 \times 6 = 3024$

Example 5.13. How many 5 letter permutations can be formed from the letters in the word ’DISCOVER’? Ans: ${}_{8}P_{5} = \dfrac{8!}{(8-5)!} = 8 \times 7 \times 6 \times 5 \times 4 = 6720$

(c) Permutation Rule 3: The number of permutations of n objects in which $n_1$ are alike, $n_2$ are alike, ··· , $n_r$ are alike is given by

$$\frac{n!}{n_1! \times n_2! \times \cdots \times n_r!}$$

where $n_1 + n_2 + \cdots + n_r = n$.


Example 5.14. How many different permutations can be made from the letters in the word

a. STATISTICS. Ans: $\dfrac{10!}{3! \times 3! \times 1! \times 2! \times 1!}$

b. MISSISSIPPI. Ans: $\dfrac{11!}{1! \times 4! \times 4! \times 2!}$

c. EXERCISES. Ans: $\dfrac{9!}{3! \times 1! \times 1! \times 1! \times 1! \times 2!}$

4. Combination: is the selection of objects without regard to order. Here, order does not matter. The number of combinations of r objects selected from n objects is denoted by ${}_{n}C_{r}$ or $\binom{n}{r}$, where

$${}_{n}C_{r} = \binom{n}{r} = \frac{n!}{(n-r)! \times r!}; \quad 0 \le r \le n$$

Note: The difference between permutation and combination is that in combination, the order of objects being selected (arranged) is not important, but order matters in permutation.

Example 5.15. In how many different ways can a secretary, a president and a manager be selected from 5 persons?

Example 5.16. A committee of 3 persons is to be selected from 5 persons. In how many different ways can this be done?

Example 5.17. A committee of 5 persons must be selected from 5 men and 8 women. How many ways can the selection be done if there are at least 3 women in the committee?
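Examples 5.14 to 5.17 can be worked with `math.factorial`, `math.perm` and `math.comb` from the standard library (Python 3.8+). The helper `multiset_permutations` is a hypothetical name implementing Permutation Rule 3; Example 5.15 is read as an ordered selection because the three posts are distinct.

```python
import math
from collections import Counter

def multiset_permutations(word):
    """n! divided by the factorial of each letter's multiplicity (Permutation Rule 3)."""
    count = math.factorial(len(word))
    for k in Counter(word).values():
        count //= math.factorial(k)
    return count

# Example 5.14
perms = {w: multiset_permutations(w) for w in ("STATISTICS", "MISSISSIPPI", "EXERCISES")}

# Example 5.15: three distinct posts, so order matters
officers = math.perm(5, 3)

# Example 5.16: an unordered committee
committee = math.comb(5, 3)

# Example 5.17: at least 3 women among 5 members chosen from 8 women and 5 men
at_least_3_women = sum(math.comb(8, w) * math.comb(5, 5 - w) for w in range(3, 6))

print(perms, officers, committee, at_least_3_women)
```

The contrast between `officers` and `committee` shows the note above in action: the same 5 persons and 3 seats give different counts depending on whether order matters.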

5.4 Definitions of Probability

1. Classical probability: It is also called mathematical probability. Suppose there are N equally likely possible outcomes in the sample space S of an experiment. If n of these N outcomes are favorable to the event E, then the probability that the event E will occur is:

$$P(E) = \frac{n(E)}{n(S)} = \frac{n}{N}.$$

Example 5.18. Try the following examples.

(a) What is the probability of getting number 6 in rolling a die?
(b) What is the probability of getting one head in tossing two coins?
(c) A die is rolled. What is the probability of getting i) an odd number, ii) a number greater than 4?
(d) An urn contains 6 white and 3 black balls. If one ball is selected, what is the probability that the selected ball is black?
(e) A family plans to have three children. Describe the sample space for all possible gender combinations. What is the probability that the family will have two boys?
(f) Two dice are rolled. Describe the sample space. What is the probability of getting i) a sum of 10 or more, ii) a pair in which at least one number is 3, iii) a sum of 8, 9 or 10, iv) one number less than 4?

Solutions:

(a) S = {1, 2, 3, 4, 5, 6}, E = getting number 6 = {6}. Thus n(S) = 6 and n(E) = 1.

$$P(E) = \frac{n(E)}{n(S)} = \frac{1}{6}$$

(b) S = {HH, HT, TH, TT}, E = getting one head = {HT, TH}. Thus n(S) = 4 and n(E) = 2.

$$P(E) = \frac{n(E)}{n(S)} = \frac{2}{4} = 0.5$$

(c) S = {1, 2, 3, 4, 5, 6}, n(S) = 6

i. E = getting an odd number = {1, 3, 5}. Thus n(E) = 3.

$$P(E) = \frac{n(E)}{n(S)} = \frac{3}{6} = 0.5$$

ii. E = getting a number > 4 = {5, 6}. Thus n(E) = 2.

$$P(E) = \frac{n(E)}{n(S)} = \frac{2}{6}$$

2. Empirical probability: It is based on relative frequency. Given a frequency distribution, the probability of an event being in a given class is

$$P(E) = \frac{f}{\sum f}$$

where f is the class frequency and $\sum f = n$ is the total number of observations.

The difference between classical and empirical probability is that the former uses the sample space to determine the numerical probability while the latter is based on a frequency distribution.
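The classical definition lends itself to brute-force enumeration when the sample space is small. The sketch below applies it to parts (e) and (f) of Example 5.18, assuming all outcomes are equally likely and reading "two boys" as exactly two boys; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def classical_probability(sample_space, event):
    """P(E) = n(E) / n(S) for a finite sample space of equally likely outcomes."""
    favourable = [s for s in sample_space if event(s)]
    return Fraction(len(favourable), len(sample_space))

# (f) Two dice: 36 ordered pairs
dice = list(product(range(1, 7), repeat=2))
p_sum_ge_10 = classical_probability(dice, lambda s: sum(s) >= 10)
p_has_3 = classical_probability(dice, lambda s: 3 in s)
p_sum_8_to_10 = classical_probability(dice, lambda s: 8 <= sum(s) <= 10)

# (e) Three children: 8 ordered gender sequences
children = list(product("BG", repeat=3))
p_two_boys = classical_probability(children, lambda s: s.count("B") == 2)

print(p_sum_ge_10, p_has_3, p_sum_8_to_10, p_two_boys)
```

Using `Fraction` keeps the answers exact, which makes them easy to compare with hand calculations.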


Example 5.19. Given the following frequency distribution.

Grade             A    B    C    D    F
No. of students   10   20   50   15   5

What is the probability of selecting a student who scored B?

3. Subjective probability: It assigns a probability based on an educated guess, experience or evaluation of a problem. For example, a physician might say that on the basis of his/her diagnosis, there is a 30% chance the patient will need an operation.

Examples:

1. A committee of 5 persons must be selected from 5 men and 8 women. What is the probability that the committee consists of at least 3 women?

2. An urn contains 6 white, 4 red and 9 black balls. If three balls are drawn at random, what is the probability that:

(a) 2 of the balls drawn are white.
(b) 1 is of each color.
(c) none is red.
(d) at least one is white.

Solutions:

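1. Total number of ways of selection: $n(S) = {}_{13}C_{5}$. At least three women: $n(E) = {}_{8}C_{3}\,{}_{5}C_{2} + {}_{8}C_{4}\,{}_{5}C_{1} + {}_{8}C_{5}\,{}_{5}C_{0}$.

$$P(E) = \frac{{}_{8}C_{3}\,{}_{5}C_{2} + {}_{8}C_{4}\,{}_{5}C_{1} + {}_{8}C_{5}\,{}_{5}C_{0}}{{}_{13}C_{5}}$$

2. Total number of ways of drawing three balls: $n(S) = {}_{19}C_{3}$.

(a) E = 2 of the balls drawn are white. $n(E) = {}_{6}C_{2}\,{}_{13}C_{1}$

$$P(E) = \frac{{}_{6}C_{2}\,{}_{13}C_{1}}{{}_{19}C_{3}}$$

(b) E = 1 is of each color. $n(E) = {}_{6}C_{1}\,{}_{4}C_{1}\,{}_{9}C_{1}$

$$P(E) = \frac{{}_{6}C_{1}\,{}_{4}C_{1}\,{}_{9}C_{1}}{{}_{19}C_{3}}$$

(c) E = None is red. $n(E) = {}_{4}C_{0}\,{}_{15}C_{3}$

$$P(E) = \frac{{}_{4}C_{0}\,{}_{15}C_{3}}{{}_{19}C_{3}}$$

(d) E = At least one is white. $n(E) = {}_{6}C_{1}\,{}_{13}C_{2} + {}_{6}C_{2}\,{}_{13}C_{1} + {}_{6}C_{3}\,{}_{13}C_{0}$

$$P(E) = \frac{{}_{6}C_{1}\,{}_{13}C_{2} + {}_{6}C_{2}\,{}_{13}C_{1} + {}_{6}C_{3}\,{}_{13}C_{0}}{{}_{19}C_{3}}$$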
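The combination expressions in these solutions can be evaluated exactly with `math.comb` and `fractions.Fraction`. This sketch only carries out the arithmetic; for part (d) it uses the complement (no white ball) as a cross-check on the sum given above.

```python
from fractions import Fraction
from math import comb

# 1. Committee of 5 from 5 men and 8 women, at least 3 women
p_committee = Fraction(
    sum(comb(8, w) * comb(5, 5 - w) for w in range(3, 6)),
    comb(13, 5),
)

# 2. Three balls drawn from 6 white, 4 red and 9 black
total = comb(19, 3)
p_two_white = Fraction(comb(6, 2) * comb(13, 1), total)
p_one_each = Fraction(comb(6, 1) * comb(4, 1) * comb(9, 1), total)
p_no_red = Fraction(comb(4, 0) * comb(15, 3), total)
p_at_least_one_white = Fraction(
    comb(6, 1) * comb(13, 2) + comb(6, 2) * comb(13, 1) + comb(6, 3) * comb(13, 0),
    total,
)
# Cross-check (d) through the complement: 1 - P(no white ball)
p_at_least_one_white_alt = 1 - Fraction(comb(13, 3), total)

print(p_committee, p_two_white, p_one_each, p_no_red, p_at_least_one_white)
```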


5.5 Some Rules of Probability

1. The probability of an event always lies in between 0 and 1, that is, 0 ≤ P (E) ≤ 1.

• P(E) = 0 means that the event E is impossible (E can never happen).
• P(E) = 1 means that the event E is certain to occur.

Example: What is the probability of getting

(a) number 9 in rolling a die. Ans: 0
(b) a number less than 7 in rolling a die. Ans: 1

2. The sum of the probabilities of all outcomes in the sample space S is 1, i.e., $\sum p_i = 1$.

Example: Consider the experiment of rolling a die.

Outcome       1     2     3     4     5     6
Probability   1/6   1/6   1/6   1/6   1/6   1/6

⇒ $\sum p_i$ = 1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 6/6 = 1.

Example: Suppose that only three outcomes are possible in an experiment: a1, a2 and a3. Suppose furthermore that a1 is twice as likely to occur as a2, which is four times as likely to occur as a3. Find p1, p2 and p3.
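One way to approach this example is to express the likelihoods as relative weights and then normalize them so that they sum to 1. In the sketch below the weights 8 : 4 : 1 follow from the statement (a3 has weight 1, a2 is four times that, a1 is twice a2); the dictionary names are illustrative.

```python
from fractions import Fraction

# Relative weights implied by the statement: a3 -> 1, a2 -> 4, a1 -> 2 * 4 = 8
weights = {"a1": 8, "a2": 4, "a3": 1}

total = sum(weights.values())
probs = {outcome: Fraction(w, total) for outcome, w in weights.items()}

print(probs)
```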

3. If there are two events A and B, the probability that at least one of these events will occur is the sum of the probabilities that each event will occur minus the probability that both events will occur at the same time. That is, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Example: A part-time student is taking two courses, namely economics and statistics. The probability that the student will pass the economics course is 0.60 and the probability of passing the statistics course is 0.70. The probability that the student will pass both courses is 0.50. Find the probability that the student

(a) will pass at least one course.
(b) will fail both courses.

Solution:

(a) P(E ∪ S) = P(E) + P(S) − P(E ∩ S) = 0.60 + 0.70 − 0.50 = 0.80
(b) P(E′ ∩ S′) = 1 − P(E ∪ S) = 1 − 0.80 = 0.20


4. The complement of an event E (denoted by E′) consists of all outcomes in S that are not in E; that is, E′ occurs exactly when E does not occur. If E′ is the complement of E, then P(E′) = 1 − P(E).

Example: Suppose that A and B are events for which P(A) = x, P(B) = y and P(A ∩ B) = z. Express each of the following probabilities in terms of x, y and z. a) P(A′ ∪ B′) b) P(A′ ∩ B) c) P(A′ ∩ B′) d) P(A ∩ B′).

Mutually exclusive events: If two events cannot occur simultaneously, that is, one ”excludes” the other, then the two events are said to be mutually exclusive. As a result, if events A and B are mutually exclusive, then P(A ∩ B) = 0. For such events, the occurrence of one prevents the occurrence of the other.

Example: What is the probability of getting head and tail in tossing a coin? Ans: P (H ∩ T ) = 0.

5.6 Conditional Probability and Independence

5.6.1 Conditional Events

When the outcome or occurrence of an event affects the outcome or occurrence of another event, the two events are said to be dependent (conditional).

If the events A and B are dependent on each other, the probability of event B occurring knowing that event A has already occurred is said to be the conditional probability of B given that event A has occurred,

$$P(B/A) = \frac{P(A \cap B)}{P(A)} \;\Rightarrow\; P(A \cap B) = P(A)P(B/A).$$

Similarly, the probability of event A occurring knowing that event B has already occurred is said to be the conditional probability of A given that event B has occurred,

$$P(A/B) = \frac{P(B \cap A)}{P(B)} \;\Rightarrow\; P(B \cap A) = P(B)P(A/B).$$

Remarks:

i. 0 ≤ P(A/B) ≤ 1 and 0 ≤ P(B/A) ≤ 1

ii. P (S/A) = 1 and P (A/S) = P (A)

iii. $P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)P(A_2/A_1)P(A_3/A_1 \cap A_2) \cdots P(A_n/A_1 \cap A_2 \cap \cdots \cap A_{n-1})$

Examples:


1. Recall the previous example of a part-time student who is taking two courses, economics and statistics. Find P(E/S) and P(S/E).

2. A package contains 12 resistors, 3 of which are defective. If 3 are selected, find the probability of getting

(a) no defective resistor.
(b) one defective resistor.
(c) all defective resistors.

3. Urn I contains 4 white balls and 5 red balls, and urn II contains 6 white balls and 8 red balls. A ball is chosen at random from urn I and put into urn II. Then a ball is chosen at random from urn II. What is the probability that the ball is white?

4. An urn contains 6 green and 4 black balls. Another urn contains 7 green and 9 black balls. Two balls are transferred from the first urn and placed in the second urn. Then one ball is taken from the latter. What is the probability that the ball drawn from the second urn is black?

Solutions:

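1. $P(E/S) = \dfrac{P(E \cap S)}{P(S)} = \dfrac{0.50}{0.70} \approx 0.714$, $P(S/E) = \dfrac{P(E \cap S)}{P(E)} = \dfrac{0.50}{0.60} \approx 0.833$

2. Total number of ways of selecting 3 resistors: $n(S) = {}_{12}C_{3}$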
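The first three examples above can be worked with exact fractions. The sketch assumes the resistors are selected without replacement, and for the urn problem it conditions on the color of the transferred ball (the multiplication rule applied to each case, then added). Variable names are illustrative.

```python
from fractions import Fraction
from math import comb

# 1. Conditional probabilities for the part-time student example
p_e, p_s, p_both = Fraction(60, 100), Fraction(70, 100), Fraction(50, 100)
p_e_given_s = p_both / p_s
p_s_given_e = p_both / p_e

# 2. Three resistors drawn from 12, of which 3 are defective (without replacement)
total = comb(12, 3)
p_no_defective = Fraction(comb(9, 3) * comb(3, 0), total)
p_one_defective = Fraction(comb(9, 2) * comb(3, 1), total)
p_all_defective = Fraction(comb(9, 0) * comb(3, 3), total)

# 3. Urn I: 4 white, 5 red. Urn II: 6 white, 8 red, plus the transferred ball (15 in all)
p_white = (Fraction(4, 9) * Fraction(7, 15)     # white ball transferred
           + Fraction(5, 9) * Fraction(6, 15))  # red ball transferred

print(p_e_given_s, p_s_given_e)
print(p_no_defective, p_one_defective, p_all_defective)
print(p_white)
```

The fourth example follows the same pattern as the third, with three cases for the two transferred balls instead of two.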

5.6.2 Independent Events

Two events are said to be independent if the occurrence of one does not affect the occurrence of the other. If events A and B are independent, the probability of A occurring is in no way affected by event B having occurred, or vice versa. Hence, P(A ∩ B) = P(A)P(B).

Example

1. A coin is tossed and a die is rolled. What is the probability of getting a head on the coin or number 4 on the die?

2. An urn contains 6 white and 3 black balls. Three balls are drawn. What is the probability that all the drawn balls will be black

(a) if the selection is without replacement?
(b) if the selection is done with replacement?

Solutions:


1. Let A = getting a head on the coin ⇒ P(A) = 1/2. Let B = getting number 4 on the die ⇒ P(B) = 1/6. Since A and B are independent, P(A ∩ B) = P(A)P(B) = 1/12. Thus P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 1/2 + 1/6 − 1/12 = 7/12.

2. N = 9, n = 3. Let E1 = the first ball selected is black. Let E2 = the second ball selected is black. Let E3 = the third ball selected is black.

(a) Without replacement: P(E1 ∩ E2 ∩ E3) = P(E1)P(E2/E1)P(E3/E1 ∩ E2) = 3/9 × 2/8 × 1/7 = 1/84

(b) With replacement: P(E1 ∩ E2 ∩ E3) = P(E1)P(E2)P(E3) = 3/9 × 3/9 × 3/9 = 1/27

If two events A and B are independent, then

• P(A/B) = P(A), P(B) > 0

• P(B/A) = P(B), P(A) > 0

Example: Let A and B be two events associated with an experiment. Suppose that P(A) = 0.4, P(A ∪ B) = 0.7 and P(B) = p. For what choice of p are A and B independent?
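A possible route to the answer: under independence P(A ∩ B) = P(A)·p, so the addition rule gives P(A ∪ B) = P(A) + p − P(A)·p, which can be solved for p. The sketch below does this with exact fractions and then verifies the result; it is an illustration rather than the only method.

```python
from fractions import Fraction

p_a = Fraction(4, 10)
p_union = Fraction(7, 10)

# P(A ∪ B) = P(A) + p - P(A) * p   =>   p = (P(A ∪ B) - P(A)) / (1 - P(A))
p = (p_union - p_a) / (1 - p_a)

# Verify that this p reproduces the given union probability under independence.
check = p_a + p - p_a * p
print(p, check)
```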

EXERCISE: A certain travel club has 1000 members. 60% of these members are males. 45% of these members pay by credit card when they travel including 175 females. If a member is selected from the travel club at random, what is the probability that:

1. the member is a female.

2. the member is a female and pays in cash.

3. the member is a male or a credit card user.

4. the member pays cash if we know that the member is a female.

Are the sex of the member and the mode of payment statistically independent events?

Chapter 6

Probability Distributions

6.1 Random Variable

A random variable is a variable whose values are determined by chance (with some probability). It is denoted by a capital letter, for example X, and its value is denoted by the corresponding small letter, $x_i$. The set consisting of all possible values of a random variable is called the range space ($R_X$).

If the number of possible values of a random variable X (that is, $R_X$) is finite or countably infinite, the random variable is called a discrete random variable. That is, the possible values of X may be listed as $x_1, x_2, \cdots, x_n, \cdots$. In the finite case the list terminates, and in the countably infinite case the list continues indefinitely.

On the other hand, if the random variable assumes an uncountably infinite number of possible values, the random variable is called a continuous random variable.

6.1.1 Probability Distribution

A probability distribution is a specification of the probabilities of the values of a random variable. Based on the type of the random variable, a probability distribution can be discrete or continuous.

Discrete Probability Distribution

With each possible value $x_i$ of a discrete random variable, a number $p(x_i) = P(X = x_i)$, called the probability of $x_i$, is associated. The numbers $p(x_i)$, i = 1, 2, ··· must satisfy the following conditions:

i. $0 \le p(x_i) \le 1$

ii. $\sum_i p(x_i) = 1$


The function p defined above is called the probability mass function (pmf) of the random variable X. The collection of pairs $(x_i, p(x_i))$, i = 1, 2, ··· is sometimes called the probability distribution of X.

Examples:

1. Construct a probability distribution for the number of heads observed in tossing a coin two times. Also plot the probability distribution using a bar diagram.

2. Construct a probability distribution for the number of heads observed in tossing a coin three times and plot it.

3. Construct a probability distribution for the number of girls if a family plans to have four children.

Solutions:

1. S = {HH, HT, TH, TT}. Let X be the number of heads observed in tossing a coin two times. $R_X$ = {0, 1, 2}

x      0     1     2     Total
P(x)   1/4   2/4   1/4   1

2. S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}. Let X be the number of heads observed in tossing a coin three times. $R_X$ = {0, 1, 2, 3}

x      0     1     2     3     Total
P(x)   1/8   3/8   3/8   1/8   1
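These tables can be reproduced by listing the sample space and counting heads in each outcome. A minimal sketch, assuming a fair coin so that all sequences are equally likely; the function name `heads_pmf` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def heads_pmf(n_tosses):
    """pmf of the number of heads in n_tosses tosses of a fair coin, by enumeration."""
    outcomes = list(product("HT", repeat=n_tosses))
    counts = Counter(outcome.count("H") for outcome in outcomes)
    return {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

pmf2 = heads_pmf(2)
pmf3 = heads_pmf(3)
print(pmf2)
print(pmf3)
```

Replacing "H" and "T" by "G" and "B" and calling the function with 4 gives the distribution asked for in the third example, under the same equal-likelihood assumption.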

Continuous Probability Distribution

The graph of the distribution (the equivalent of the bar graph for a discrete distribution) is usually a smooth curve. The curve is described by an equation or a function, f(x), often called the probability density function (pdf), satisfying the following conditions:

i. $f(x) \ge 0$, for all $x \in (a, b)$

ii. $\int_a^b f(x)\,dx = 1$

iii. $P(c \le X \le d) = \int_c^d f(x)\,dx$, if $a \le c \le d \le b$

Examples:

1. Show that each of the following functions is a pdf.

(a) $f(x) = \begin{cases} 1, & 0 \le x \le 1; \\ 0, & \text{otherwise.} \end{cases}$

(b) $f(x) = \begin{cases} 2x, & 0 \le x \le 1; \\ 0, & \text{otherwise.} \end{cases}$

(c) $f(x) = \begin{cases} e^{-x}, & x \ge 0; \\ 0, & \text{otherwise.} \end{cases}$

2. Find the value of b for the following function to be a pdf.

$f(x) = \begin{cases} bx^2, & 0 \le x \le 1; \\ 0, & \text{otherwise.} \end{cases}$
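Condition (ii) can be checked numerically for these examples. The sketch below uses a simple midpoint rule written from scratch (not a library routine) and truncates the infinite range of example (c) at 50, where the remaining tail is negligible; both choices are assumptions made for illustration, and the results are approximations rather than proofs.

```python
import math

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

area_a = integrate(lambda x: 1.0, 0, 1)              # example 1(a)
area_b = integrate(lambda x: 2 * x, 0, 1)            # example 1(b)
area_c = integrate(lambda x: math.exp(-x), 0, 50)    # example 1(c), truncated range

# Example 2: b must make the total area equal to 1, so b = 1 / (integral of x^2 on [0, 1]).
b_value = 1 / integrate(lambda x: x ** 2, 0, 1)

print(round(area_a, 6), round(area_b, 6), round(area_c, 6), round(b_value, 6))
```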

As continuous random variables differ from discrete random variables, continuous probability distributions consequently differ from discrete ones. Some of the most important differences are listed as follows:

1. The function f(x) does not give the probability that X = x as did p(x) in the discrete case. This is because X can take on an infinite number of values and, therefore, it is impossible to assign a probability to each value x. In fact, the value of f(x) is not a probability at all; hence f(x) can take any nonnegative value, including values greater than 1.

2. Since the area under the curve corresponding to a single point is zero, the probability of obtaining exactly a specific value is zero. Thus, for a continuous random variable, P(a ≤ X ≤ b) and P(a < X < b) are equal, which is certainly not true for discrete distributions.

3. Finding areas under curves representing continuous probability distributions involves the use of calculus and may become quite difficult. For some distributions, areas cannot even be directly computed and require special numerical techniques. Of course, statistical computer programs easily calculate such probabilities.

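6.1.2 Expectations of a Random Variable

The mean of a random variable X is known as the expected value of X, denoted by E(X). It is defined as:

$$\mu = E(X) = \begin{cases} \sum x\,p(x), & \text{if } X \text{ is a discrete r.v.;} \\ \int x\,f(x)\,dx, & \text{if } X \text{ is a continuous r.v.} \end{cases}$$

The variance of the random variable X is the expected value of the square of the deviation of X from its mean:

$$\sigma^2 = E[(X - \mu)^2] = \begin{cases} \sum (x - \mu)^2\,p(x), & \text{if } X \text{ is a discrete r.v.;} \\ \int (x - \mu)^2 f(x)\,dx, & \text{if } X \text{ is a continuous r.v.} \end{cases}$$

$$\Rightarrow \sigma^2 = E[(X - \mu)^2] = E[(X - E(X))^2] = E(X^2) - (E(X))^2$$

Examples: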


1. Find the mean number of heads observed in tossing a coin three times.

2. Find the mean of the following probability distributions.

(a) $f(x) = \begin{cases} 1, & 0 \le x \le 1; \\ 0, & \text{otherwise.} \end{cases}$

(b) $f(x) = \begin{cases} 3x^2, & 0 \le x \le 1; \\ 0, & \text{otherwise.} \end{cases}$

Solution:

1. $S = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}$. Let X be the number of heads observed in tossing a coin three times. Then $R_X = \{0, 1, 2, 3\}$.

x       0     1     2     3     Total
P(x)    1/8   3/8   3/8   1/8   1

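$$\mu = E(X) = \sum x\,p(x) = 0 \times \tfrac{1}{8} + 1 \times \tfrac{3}{8} + 2 \times \tfrac{3}{8} + 3 \times \tfrac{1}{8} = 1.5$$

$$\sigma^2 = E[(X - \mu)^2] = \sum (x - \mu)^2 p(x) = (0 - 1.5)^2 \times \tfrac{1}{8} + (1 - 1.5)^2 \times \tfrac{3}{8} + (2 - 1.5)^2 \times \tfrac{3}{8} + (3 - 1.5)^2 \times \tfrac{1}{8} = 0.75$$

$$\Rightarrow \sigma = \sqrt{0.75} \approx 0.87$$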

2. $E(X) = \mu = \int x f(x)\,dx = \int_0^1 x\,dx = 0.5$ and $\sigma^2 = \int (x - \mu)^2 f(x)\,dx = \int_0^1 (x - 0.5)^2\,dx = \,?$
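The discrete calculation for three coin tosses can be reproduced by enumerating the sample space. This is an illustrative sketch only; the variable names are not from the notes, and exact fractions are used so that no rounding enters the result.

```python
from fractions import Fraction
from itertools import product

# Enumerate the 8 equally likely outcomes of three tosses and count heads.
outcomes = list(product("HT", repeat=3))
pmf = {}
for outcome in outcomes:
    heads = outcome.count("H")
    pmf[heads] = pmf.get(heads, 0) + Fraction(1, len(outcomes))

# E(X) = sum of x * p(x); Var(X) = sum of (x - mu)^2 * p(x).
mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
# Shortcut formula: Var(X) = E(X^2) - (E(X))^2.
variance_shortcut = sum(x * x * p for x, p in pmf.items()) - mean ** 2

print(mean, variance, variance_shortcut)  # 3/2 3/4 3/4
```

Both variance formulas give the same value, as the identity $\sigma^2 = E(X^2) - (E(X))^2$ requires.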

6.2 Common Discrete Distributions

6.2.1 The Binomial Distribution

The binomial distribution is one of the simplest and most frequently used discrete distributions and is very useful in many practical situations involving either/or types of events. A binomial experiment satisfies the following conditions:

1. Each trial has only two mutually exclusive outcomes, or outcomes that can be reduced to two. One of the outcomes is labeled "success" and the other "failure".

2. The outcome of each trial is independent. That is, the outcome of one trial does not affect the outcome of another.

3. The probability of ’success’ remains the same from trial to trial.


4. The experiment (trial) is performed for fixed number of times, say n.

Let X be the number of successes out of n Bernoulli trials. Hence, $R_X = \{0, 1, 2, \cdots, n\}$. Then X follows a binomial distribution with parameters n, the number of trials performed, and p, the probability of success, written as X ∼ Bin(n, p). The probability of obtaining x successes in n trials is given by:

$$P(X = x) = \binom{n}{x} p^x q^{n-x}, \quad x = 0, 1, 2, \cdots, n$$

where p is the probability of success, q = 1 − p is the probability of failure, n is the number of trials and x is the number of successes.

This is called the binomial distribution. The mean is E(X) = np and variance is V (X) = npq.

Examples: 1. Suppose a coin is tossed 10 times. What is the probability of getting

(a) exactly 3 heads. (b) no head. (c) at most 3 heads. (d) at least 3 heads. (e) more than 3 heads

Find the average and variance of the number of heads.

2. The probability of a man kicking into the goal is 2/3. If a person kicks 5 times, what is the probability of scoring

(a) two goals. (b) at least one goal. (c) at most 3 goals.

Find the average, variance and standard deviation of the number of goals. Solution:

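1. Let X be the number of heads observed in tossing a coin 10 times, $R_X = \{0, 1, 2, \cdots, 10\}$.

$p = P(\text{Success}) = P(\text{Head}) = \frac{1}{2} \Rightarrow q = 1 - p = \frac{1}{2}$, so X ∼ Bin(n = 10, p = 0.5) and

$$P(X = x) = \binom{10}{x} \left(\frac{1}{2}\right)^{x} \left(\frac{1}{2}\right)^{10-x}, \quad x = 0, 1, 2, \cdots, 10$$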


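(a) $P(X = 3) = \binom{10}{3} \left(\frac{1}{2}\right)^{3} \left(\frac{1}{2}\right)^{10-3} = \frac{120}{1024} \approx 0.1172$

(b) $P(X = 0) = \binom{10}{0} \left(\frac{1}{2}\right)^{0} \left(\frac{1}{2}\right)^{10-0} = \frac{1}{1024} \approx 0.0010$

(c) $P(X \le 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = \frac{176}{1024} \approx 0.1719$

(d) $P(X \ge 3) = P(X = 3) + P(X = 4) + \cdots + P(X = 10) = 1 - P(X < 3) = 1 - \frac{56}{1024} \approx 0.9453$

(e) $P(X > 3) = P(X = 4) + P(X = 5) + \cdots + P(X = 10) = 1 - P(X \le 3) \approx 0.8281$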
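The binomial probabilities for the coin example can be evaluated directly from the formula. The sketch below uses only the standard library; the function names `binom_pmf` and `binom_cdf` are illustrative helpers, not something defined in the notes.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p), using C(n, x) * p^x * q^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(x, n, p):
    """P(X <= x), obtained by summing the pmf from 0 to x."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

n, p = 10, 0.5
exactly_3 = binom_pmf(3, n, p)        # (a)
no_head = binom_pmf(0, n, p)          # (b)
at_most_3 = binom_cdf(3, n, p)        # (c)
at_least_3 = 1 - binom_cdf(2, n, p)   # (d): complement of P(X < 3)
more_than_3 = 1 - binom_cdf(3, n, p)  # (e): complement of P(X <= 3)
mean, variance = n * p, n * p * (1 - p)

print(exactly_3, no_head, at_most_3, at_least_3, more_than_3)
print(mean, variance)
```

The same two helpers apply to the goal-kicking example by setting n = 5 and p = 2/3.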

6.2.2 The Poisson Distribution

The Poisson distribution is another theoretical discrete probability distribution, which is useful for modeling certain real situations. It differs from the binomial distribution in the sense that it is not possible to count the number of failures even though the number of successes is known. For example, in the case of patients coming to hospital for emergency treatment, only the number of patients arriving in a given hour is known; it is not possible to count the number of patients not coming for emergency treatment in that hour. Accordingly, it is not possible to determine the total number of outcomes (successes and failures), and hence the binomial distribution cannot be applied as a decision making tool. In such a situation the Poisson distribution should be used, provided the average number of patients arriving for emergency treatment per hour is given. It is assumed that the arrival of patients is a random phenomenon and hence the exact number of patients arriving in any hour is not predictable.

Other examples of Poisson-distributed quantities are the number of telephone calls going to a switchboard, the number of cars in a certain parking lot, the number of customers coming to a bank for service and so on. All of these arrivals can be described by a discrete random variable that takes on integer values (0, 1, 2, ···).

Let X be the number of successes in a specific period of time. Hence, $R_X = \{0, 1, 2, \cdots\}$. Then X follows a Poisson distribution with parameter λ, the average number of events per unit of time, written as X ∼ Poisson(λ). Hence, the probability of getting x successes in the same time period is:

$$P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \quad x = 0, 1, 2, \cdots$$

where λ is the average number of events per unit of time.

Properties:

1. The probability of success, p, is very small.

2. The experiment is performed indefinitely (n is very large).

3. The average number of events per unit of time (λ) is known.


The mean and the variance of the number of successes for a Poisson distribution are the same, i.e., E(X) = Var(X) = λ.

Examples:

1. On average a typist commits 3 errors per page. Find the probability that she will make

(a) no mistake. (b) two mistakes. (c) more than one mistake.

2. Customers arrive at a photocopying machine at an average rate of two every 10 minutes. What is the probability that there will be

(a) no arrivals during any period of ten minutes. (b) exactly one arrival during this time period. (c) more than two arrivals during this time period.

Solution:

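1. Let X be the number of errors committed per page, $R_X = \{0, 1, 2, \cdots\}$, E(X) = λ = 3, so X ∼ Poisson(λ = 3) and

$$P(X = x) = \frac{e^{-3}\, 3^{x}}{x!}, \quad x = 0, 1, 2, \cdots$$

(a) $P(X = 0) = \frac{e^{-3}\, 3^{0}}{0!} = e^{-3} \approx 0.0498$

(b) $P(X = 2) = \frac{e^{-3}\, 3^{2}}{2!} \approx 0.2240$

(c) $P(X > 1) = P(X = 2) + P(X = 3) + \cdots = 1 - P(X \le 1) = 1 - 4e^{-3} \approx 0.8009$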
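Both Poisson examples can be evaluated with a few lines of code. This is a minimal sketch; `poisson_pmf` is an illustrative helper name, and the second block simply applies the same formula to the photocopying example with λ = 2 per ten minutes.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam): exp(-lam) * lam^x / x!."""
    return exp(-lam) * lam ** x / factorial(x)

# Example 1: typing errors, lambda = 3 per page.
no_mistake = poisson_pmf(0, 3)
two_mistakes = poisson_pmf(2, 3)
more_than_one = 1 - poisson_pmf(0, 3) - poisson_pmf(1, 3)

# Example 2: arrivals, lambda = 2 per ten-minute period.
no_arrival = poisson_pmf(0, 2)
one_arrival = poisson_pmf(1, 2)
more_than_two = 1 - sum(poisson_pmf(k, 2) for k in range(3))

print(no_mistake, two_mistakes, more_than_one)
print(no_arrival, one_arrival, more_than_two)
```

Because the range of X is unbounded, "more than" probabilities are computed through the complement rather than by summing the infinite tail.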

6.3 Common Continuous Distributions

6.3.1 The Normal Distribution

The most often used continuous probability distribution is the normal or Gaussian distribution. This distribution plays a pivotal role in statistical theory and practice, particularly in the areas of statistical inference and statistical quality control. Its importance is due to the fact that, in practice, experimental results very often seem to follow the normal distribution or the bell-shaped curve.


If X is a continuous random variable having a normal distribution with mean µ and variance σ², written as X ∼ N(µ, σ²), its density function is:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty$$

Knowing the values of these two parameters, µ and σ², completely determines the distribution. Thus, the function describes a family of curves which may differ only with regard to µ and σ², but have the same characteristics.

Several interesting features can be determined from this function without really evaluating it. Some of these features are:

1. The random variable X can take on any value from −∞ to ∞.

2. The curve is symmetric about the mean. This means that the number of units in the data below the mean is the same as the number of units above the mean. This means the mean and median have the same value.

3. The height of the curve is maximum at the mean value. Thus, the mean and mode values coincide. This means the normal distribution has the same value for the mean, median and mode.

4. The curve declines as we go in either direction from the mean, but never touches the base (x-axis) so that the tails of the curve on both sides extend indefinitely.

5. The corresponding deciles, quartiles and percentiles are equidistant from the mean.

Calculating Probabilities Under the Normal Distribution

The primary use of probability distributions is to find probabilities of the occurrence of specified values of the random variable. For a normally distributed random variable X, the probability between two values $x_1$ and $x_2$ is defined as:

$$P(x_1 < X < x_2) = \int_{x_1}^{x_2} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}}\, dx.$$

But integration of this function is quite complicated and is rarely used directly to calculate such probabilities. Fortunately, a normal distribution can easily be standardized, which allows a single table to be used for any normal distribution.

Suppose X has a normal distribution with mean µ and variance σ², i.e., X ∼ N(µ, σ²). If we define $Z = \frac{X - \mu}{\sigma}$, then Z will have a normal distribution with mean 0 and variance 1, that is, Z ∼ N(0, 1).

Such a normal distribution with mean µ = 0 and variance σ² = 1 is called a standard normal distribution. Hence, the pdf of the standard normal variate Z is given by:

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} z^{2}}, \quad -\infty < z < \infty$$

Now for any standard normal variate Z, the probability (area) between two values $z_1$ and $z_2$ is defined as:

$$P(z_1 < Z < z_2) = \int_{z_1}^{z_2} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} z^{2}}\, dz.$$

Hence,

$$P(x_1 < X < x_2) = P\left(\frac{x_1 - \mu}{\sigma} < \frac{X - \mu}{\sigma} < \frac{x_2 - \mu}{\sigma}\right) = P(z_1 < Z < z_2)$$

The total area under the (standard) normal curve is 1. Hence, the areas to the right and to the left of the central value (µ = 0) of the standard normal distribution are each 0.5 (as it is symmetric about 0).

• P(Z > z) = P(Z < −z).

• P(Z < z) = 1 − P(Z > z).

Examples:

1. Find the area between 0 and 1.96; P(0 < Z < 1.96). Solution: P(0 < Z < 1.96) = 0.4750 = P(−1.96 < Z < 0)

2. Find the area to the right of -1.96; P (Z > −1.96). Solution: P (Z > −1.96) = P (Z > 0) + P (0 < Z < 1.96) = 0.5 + 0.4750 = 0.975

3. Find the area to the right of 2; P(Z > 2). Solution: P(Z > 2) = P(Z > 0) − P(0 < Z < 2) = 0.5 − 0.4772 = 0.0228

4. Find the area to the left of −0.5; P(Z < −0.5). Solution: P(Z < −0.5) = P(Z > 0.5) = P(Z > 0) − P(0 < Z < 0.5) = 0.5 − 0.1915 = 0.3085

5. Find the area between -1 and 1.5; P (−1 < Z < 1.5). Solution: P (−1 < Z < 1.5) = P (−1 < Z < 0) + P (0 < Z < 1.5) = P (0 < Z < 1) + P (0 < Z < 1.5) = 0.7745

6. Suppose that X is normally distributed with µ = 10 and σ² = 20 (or σ = 4.472). Find


(a) P (X > 15) (b) P (5 < X < 15) (c) P (5 < X < 10)

Solutions:

(a) $P(X > 15) = P\left(Z > \frac{15 - \mu}{\sigma}\right) = P\left(Z > \frac{15 - 10}{4.472}\right) = P(Z > 1.12) = P(Z > 0) - P(0 < Z < 1.12) = 0.5 - 0.3686 = 0.1314$

(b) $P(5 < X < 15) = P\left(\frac{5 - \mu}{\sigma} < Z < \frac{15 - \mu}{\sigma}\right) = P\left(\frac{5 - 10}{4.472} < Z < \frac{15 - 10}{4.472}\right) = P(-1.12 < Z < 1.12) = 0.7372$

(c) $P(5 < X < 10) = P\left(\frac{5 - \mu}{\sigma} < Z < \frac{10 - \mu}{\sigma}\right) = P\left(\frac{5 - 10}{4.472} < Z < \frac{10 - 10}{4.472}\right) = P(-1.12 < Z < 0) = 0.3686$
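Table look-ups of this kind can be cross-checked in software. The sketch below uses `statistics.NormalDist` from the Python standard library, which is a choice made here for illustration and not a tool the notes prescribe. Its answers differ from the table values in the third decimal place because the table method rounds z to 1.12 before looking it up.

```python
from statistics import NormalDist

# X ~ N(mu = 10, sigma^2 = 20); NormalDist takes the standard deviation.
X = NormalDist(mu=10, sigma=20 ** 0.5)

p_a = 1 - X.cdf(15)          # P(X > 15)
p_b = X.cdf(15) - X.cdf(5)   # P(5 < X < 15)
p_c = X.cdf(10) - X.cdf(5)   # P(5 < X < 10)

# The same probability after standardizing by hand: Z = (X - mu) / sigma.
z = (15 - X.mean) / X.stdev
p_a_standardized = 1 - NormalDist().cdf(z)

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```

Working with X directly and working with the standardized Z give the same probability, which is the point of the standardization step.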

If the concern is to find the values of z for given probability values, the form of notation often called the $z_\alpha$ notation can be adopted. According to this notation, $z_\alpha$ is the value of z such that $P(Z > z_\alpha) = \alpha$. This definition results in the equivalent statement $P(Z < -z_\alpha) = \alpha$ and, because of the symmetry of the normal distribution, $P(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha$.

Examples:

1. Find the value z for which P(|Z| > z) = 0.10.

2. The IQ score of students is normally distributed with a mean of 120 and variance 400. What is the probability that a student will have an IQ

(a) between 100 and 130. (b) above 140. (c) below 150. (d) between 140 and 150.

3. Let X be the variable representing the distribution of scores in a statistics course. It can be assumed that these scores are normally distributed with µ = 75 and σ = 10. If the instructor wants no more than 10% of the class to get an A, what should be the cutoff grade? That is, what is the value of x such that P(X > x) = 0.10?

Solutions:

1. P(|Z| > z) = 0.10 ⇒ P(|Z| > z) = P(Z < −z) + P(Z > z) = 0.5 − P(0 < Z < z) + 0.5 − P(0 < Z < z) = 1 − 2P(0 < Z < z) = 0.10 ⇒ P(0 < Z < z) = 0.45 ⇒ z ≈ 1.645 (1.65 when read from a two-decimal table)

2. Let X be the IQ score, X ∼ N(120, 400), so σ = 20.

$$P(a < X < b) = P\left(\frac{a - \mu}{\sigma} < Z < \frac{b - \mu}{\sigma}\right) = P\left(\frac{a - 120}{20} < Z < \frac{b - 120}{20}\right)$$


(a) $P(100 < X < 130) = P\left(\frac{100 - 120}{20} < Z < \frac{130 - 120}{20}\right) = P(-1 < Z < 0.5) = 0.5328$

(b) $P(X > 140) = P\left(Z > \frac{140 - 120}{20}\right) = P(Z > 1) = 0.1587$

(c) $P(X < 150) = P\left(Z < \frac{150 - 120}{20}\right) = P(Z < 1.5) = 0.9332$

(d) $P(140 < X < 150) = P\left(\frac{140 - 120}{20} < Z < \frac{150 - 120}{20}\right) = P(1 < Z < 1.5) = 0.0919$

3. $P(X > x) = P(Z > z) = P\left(Z > \frac{x - \mu}{\sigma}\right) = P\left(Z > \frac{x - 75}{10}\right) = 0.10 \Rightarrow z = \frac{x - 75}{10} = 1.28 \Rightarrow x = 87.8.$

Therefore, the instructor should assign an A grade to those students with scores of 87.8 or higher.
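Finding a z value or a cutoff from a given probability is the inverse of the table look-up. As a hedged illustration, the sketch below uses `NormalDist.inv_cdf` from the Python standard library; this is a convenience assumed here, since the notes themselves work from the printed table.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, N(0, 1)

# Example 1: P(|Z| > z) = 0.10 leaves 0.05 in each tail,
# so z is the point with cumulative probability 0.95.
z_two_sided = Z.inv_cdf(1 - 0.10 / 2)

# Example 3: cutoff x with P(X > x) = 0.10 for scores X ~ N(75, 10^2),
# i.e. the 90th percentile of the score distribution.
scores = NormalDist(mu=75, sigma=10)
cutoff = scores.inv_cdf(0.90)

# The same cutoff through the z_alpha notation: x = mu + z_0.10 * sigma.
cutoff_via_z = 75 + Z.inv_cdf(0.90) * 10

print(round(z_two_sided, 3), round(cutoff, 1))
```

The two routes to the cutoff agree, which mirrors the relation $x = \mu + z\sigma$ used in the worked solution.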

6.3.2 Other Continuous Distributions

The t Distribution

The t distribution is quite similar to the normal in that it is symmetric and bell shaped. However, it has only one parameter, the degrees of freedom; hence the t distribution with v degrees of freedom is denoted by t(v). The t distribution has "fatter" tails than the normal. That is, it has more probability in the extreme or tail areas than does the normal distribution, although the difference is barely noticeable if the degrees of freedom exceed 30 or so. In fact, as the degrees of freedom tend to ∞, the t distribution becomes identical to the standard normal distribution.

The χ² Distribution

The χ² distribution is usually denoted by χ²(v), where v is the degrees of freedom. χ² values are nonnegative. The shape of the distribution is different for each value of v. For large values of v (usually greater than 30), the χ² distribution can be approximated by a normal distribution.

The F Distribution

It is a continuous and right-skewed distribution. It is indexed by two degrees-of-freedom parameters, $v_1$ and $v_2$; these are usually integers, and the distribution is written as $F(v_1, v_2)$.

Chapter 7

Sampling Techniques

7.1 Basic Concepts

• Population: is defined statistically as the totality of all subjects having certain common characteristics that are under study in a specific area and time.

• Population Size: It is the total number of elements in the population.

• Sample: is a subset of the population that is studied with the aim of estimating the characteristics of the population.

• Sampling Frame: is a list of all elements of the population. The sampling frame forms the basic material from which a sample is drawn. Hence, it should be complete and up-to-date.

• Sampling Unit: The population may be regarded as consisting of units which are to be used for the purpose of sampling. Each unit is regarded as individual and indivisible when the selection is made. Such a unit is known as a sampling unit.

• Sample Size: It is the total number of elements in the sample. That is, the size of the sample is the number of sampling units which are selected from the population by a random method.

• Sampling: is the process of selecting a sample from the population.

• Sampling Technique: is a method of selecting a sample.

7.2 Reasons for Sampling

There are two broad classes of investigation: the census survey and the sample survey. In the census method, a 100% inspection of the population is made, and each and every unit of the population is enumerated. It enables us to obtain information about each and every element in the population. The latter method is a study in which some elements, assumed to be representative of the population, are investigated. It is a statistical process in which we select and examine a sample instead of considering the whole population.

In practice, it may not be possible to collect information on all units of the population. One reason is lack of resources in terms of money, personnel and equipment. Another reason is that a sample survey enables us to obtain results on time. Hence, for getting quick results sampling is preferred. Also, sampling helps to get data of good quality: as the number of enumerators decreases, we can train and supervise them well in the process of data collection. Moreover, complete investigation may be destructive in nature, and samples reduce the damage caused by some tests in quality control. For example, in cooking food mothers check whether the food has enough salt, spices, butter and so on by taking a small amount and tasting it. What would happen if the whole dish were used up for the test?

7.3 Types of Errors

1. Sampling Errors: Sampling errors are the errors which are introduced due to errors in the selection of a sample or the discrepancies between population parameters and estimates which are derived from random sample. These errors are due to sampling fluctuations which are the outcome of the random sampling process. These errors can be controlled by proper choice of sampling methods and increasing sample size.

2. Nonsampling Errors: Experience shows that even studies based on complete enumeration do not yield identical results in repeated enumerations. Such a discrepancy occurs due to many errors which are termed nonsampling errors. Some of the sources of such errors are observation or response errors and errors in editing and tabulating the data. These errors can be minimized through better management of the survey, by employing suitably qualified personnel and by using modern computational aids.

7.4 Types of Sampling Techniques

In the selection of a sample, the effort is always to make the sample truly representative of the population. There are several sampling methods, which can be broadly classified into two categories: probability and non-probability sampling methods.

In probability sampling, each unit in the population has a known (often equal) chance of being included in the sample. In non-probability sampling, the units are drawn using a certain amount of judgement.


7.4.1 Probability Sampling Techniques

1. Simple Random Sampling: In simple random sampling, each and every member of the population has an equal and independent chance of being selected in the sample. The items that get selected are purely a matter of chance. Before applying this method, a complete list of all members (the sampling frame) should be prepared so that each member can be identified by a distinct number. There are two methods that can be used in order to ensure the randomness of the selection. These are:

(a) Lottery Method: This method is useful for comparatively small populations. All members of the population are numbered or named on separate pieces of paper of identical size and shape. These slips of paper are then identically folded and mixed up in a container. The probability of a particular slip being selected first out of the total number of N slips of paper is 1/N; for the second draw this probability is 1/(N − 1), since N − 1 slips of paper are left in the container after the first slip has been drawn. Similarly, the probability for the third draw is 1/(N − 2), and so on. The items are selected from the container successively until the desired sample size is reached. This constitutes a random sample called a simple random sample.

(b) Random Number Table Method: A random number table is a table of numbers in random order, usually generated by computer. In the lottery method, the selection may be subject to human bias, as people may identify the slips (chits) in many ways. The inconvenience of preparing slips of paper, shuffling them and choosing the items one by one may be avoided by the use of a random number table. The principle involved in this method is the same as that in the lottery method.

Suppose N is a k digit number. Choose k digit numbers from the random number table and read out the numbers continuously, vertically or horizontally. If the number is greater than N but less than the biggest multiple of N which has k figures, divide that number by N and take the remainder r and include the rth unit in the sample. Discard random numbers which are greater than the biggest multiple of N with k figures. For example, if N = 43 take 2 digit random numbers. If the number is, say, 23 include the unit with number 23 in the sample. If the second number is 68, since it is less than 86, the biggest 2 digit multiple of 43, divide 68 by 43 and take the remainder, 25 and include the unit with number 25 in the sample. If the number obtained is greater than 86, discard the number and go to the next in the table. This process continues until n sampling units are selected.
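The remainder rule described above can be sketched as a short function. Three details are assumptions made here because the text does not settle them: a remainder of 0 is mapped to unit N, the number 00 is discarded so that every unit keeps the same chance, and a unit that comes up twice is skipped (sampling without replacement). The simulated table at the end stands in for a printed random number table.

```python
import random

def remainder_method(N, n, numbers):
    """Select up to n distinct units from 1..N using k-digit random numbers."""
    k = len(str(N))                             # N is a k-digit number
    largest_multiple = ((10 ** k - 1) // N) * N  # e.g. 86 when N = 43
    chosen = []
    for r in numbers:
        if r == 0 or r > largest_multiple:
            continue          # discard: outside the usable range (assumption for 0)
        unit = r % N or N     # remainder; 0 is mapped to unit N (assumption)
        if unit not in chosen:
            chosen.append(unit)
        if len(chosen) == n:
            break
    return chosen

# The worked example from the text: N = 43, numbers 23 and 68 are read first.
print(remainder_method(43, 2, [23, 68, 91]))  # [23, 25]

# A stand-in for a random number table: 1000 two-digit numbers (hypothetical seed).
rng = random.Random(2014)
table = [rng.randrange(100) for _ in range(1000)]
sample = remainder_method(43, 5, table)
```

Discarding numbers above the largest k-digit multiple of N is what keeps the selection fair: every unit is then reached by the same count of k-digit numbers.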

2. Systematic Sampling: A systematic sample is formed by selecting the first unit at random; the remaining units in the sample are then selected automatically according to a predetermined pattern. The process requires that the members of the population be presented in some kind of order (alphabetical, numerical or any other), and every kth unit is included in the sample after the first item has been selected randomly, where k = N/n is called the sampling interval. The sample may be considered representative because it is spread evenly over the whole population. There are two methods of systematic sample selection. These are:

(a) Linear Systematic Sampling: Suppose N is a multiple of n, that is, N = nk. The procedure is to select a random number j such that 1 ≤ j ≤ k and then select the units in positions j, j + k, j + 2k, ···, j + (n − 1)k. This sampling plan is known as linear systematic sampling. However, N is not always a multiple of n; in such a case the sample may contain n − 1 units instead of n.

(b) Circular Systematic Sampling: This is applied when N ≠ nk. Take k as N/n rounded to the nearest integer. Select a random number m from 1 to N. Then select the (m + jk)th unit when m + jk ≤ N and the (m + jk − N)th unit when m + jk > N, for j = 0, 1, 2, ···, until n units are selected. This method always yields a sample of size n.
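The two systematic selection schemes can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the circular version starts counting at the random start m itself, wraps around with a modulo operation (which coincides with subtracting N when a single wrap is enough), and uses Python's `round`, which sends exact halves to the nearest even integer.

```python
import random

def linear_systematic(N, n, rng):
    """Positions j, j + k, ..., j + (n - 1)k with 1 <= j <= k, assuming N = n * k."""
    if N % n != 0:
        raise ValueError("linear systematic sampling assumes N is a multiple of n")
    k = N // n
    j = rng.randint(1, k)  # random start within the first interval
    return [j + i * k for i in range(n)]

def circular_systematic(N, n, rng):
    """n positions starting at a random m in 1..N, stepping by k around a circle."""
    k = round(N / n)       # sampling interval, nearest integer
    m = rng.randint(1, N)  # random start anywhere in the population
    # (pos - 1) % N + 1 wraps positions beyond N back to the start of the list.
    return [(m + j * k - 1) % N + 1 for j in range(n)]

rng = random.Random(7)  # hypothetical seed, for reproducibility only
lin = linear_systematic(20, 5, rng)
circ = circular_systematic(23, 5, rng)
print(lin, circ)
```

With N = 20 and n = 5 the interval is k = 4; with N = 23 and n = 5 the interval is rounded to k = 5 and positions past 23 wrap to the beginning of the list.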

3. Stratified Sampling: When the population is heterogeneous with respect to the characteristic of interest, stratified sampling should be adopted so that the sample is representative. The heterogeneous population is divided into homogeneous sub-groups, called strata, ensuring maximum uniformity within each stratum and the largest degree of variability among strata. From each stratum a separate sample is selected using simple random sampling. This sampling method is known as stratified sampling.

As an example of stratified sampling, assume that an investigator is interested in securing a particular response that would be representative of undergraduate Haramaya University students. S/he may stratify the population into four strata of freshman, sophomore, junior and senior students, and take a simple random sample from each stratum.
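The student example can be sketched in code. The stratum sizes, the student identifiers and the rule of sampling the same fraction from every stratum (proportional allocation) are all assumptions introduced for illustration; the text only says that a simple random sample is taken from each stratum.

```python
import random

def stratified_sample(strata, fraction, rng):
    """Draw a simple random sample of the given fraction from every stratum."""
    sample = {}
    for name, units in strata.items():
        size = max(1, round(fraction * len(units)))  # at least one unit per stratum
        sample[name] = rng.sample(units, size)       # SRS without replacement
    return sample

# Hypothetical sampling frame: student IDs grouped by year of study.
strata = {
    "freshman": [f"F{i:03d}" for i in range(400)],
    "sophomore": [f"S{i:03d}" for i in range(300)],
    "junior": [f"J{i:03d}" for i in range(200)],
    "senior": [f"R{i:03d}" for i in range(100)],
}

rng = random.Random(1011)  # hypothetical seed, for reproducibility only
picked = stratified_sample(strata, 0.10, rng)
sizes = {name: len(units) for name, units in picked.items()}
print(sizes)
```

Every stratum contributes to the sample, so no year of study can be missed by chance, which is the representativeness argument made above.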

4. Cluster Sampling: In this method the population is divided into subpopulations known as clusters. Unlike strata, the units within a cluster are relatively heterogeneous, so that each cluster resembles the entire population. A random sample of the desired number of clusters is then selected. The value of cluster sampling depends on how representative each cluster is of the entire population. If all clusters are similar in this regard, then sampling a small number of clusters will provide good estimates of the population parameters.

5. Multistage Sampling: This method of sampling is useful when the population is very widely spread and simple random sampling is not possible. Although cluster sampling is advantageous under certain circumstances, it is generally less efficient than sampling individual units directly. In such a case the whole population is divided into a number of primary (first stage) units, each of which is composed of second stage units. A series of samples is then taken at successive stages. The sample size at each stage is determined by the relative population size at that stage.

7.4.2 Non-probability Sampling Techniques

In some situations, judgement or purposive sampling is preferred to probability sampling. For example, if one wants a sample of persons who are suffering from cancer, s/he has to select cancer patients who happen to come to hospital(s).

Nonprobability sampling gives rise to those methods where the subjects are selected deliberately. No probability is attached or can be computed for an item being selected.

1. Quota Sampling: In case of stratified sampling if the cost of selecting sampling units from each stratum is very high, then the investigator is assigned a quota (fixed number of subjects) in each stratum. Then the actual selection of persons is left at the discretion of the investigator.

2. Judgment Sampling: In this method, sampling units are selected on the judgement of the person doing the study. The underlying assumption is that the units selected truly represent the entire population. For example, to find out the potential of drip irrigation technology, a researcher may go to the teachers of an agricultural university.

3. Convenience Sampling: Here, an investigator selects the sample at his or her own convenience. This method is based on the assumption that the population is homogeneous and that the individuals selected and interviewed have similar information with regard to the characteristic under study. For example, persons selected at gas stations or petrol pumps to collect information about the quality of gas or petrol, the service, the correctness of the measurement, etc. are supposed to represent the population of gasoline buyers.

4. Snowball Sampling: The snowball sampling technique involves identifying a set of respondents who can, in turn, help the investigator to identify some other person who will be included in the study. After interviewing this person, s/he will contact the next person and interview him/her. In this way, a chain process continues until the required number of persons has been interviewed. This type of sampling is most suitable in qualitative research. It is frequently used in marketing research.

Chapter 8

Statistical Inference for a Single Population

As noted earlier, one of the primary objectives of a statistical analysis is to use data from a sample to make inferences about the population from which the sample was drawn. In this section, the basic procedures for making such inferences are presented. Statistical inference generally takes two forms, namely, estimation of the parameter and testing of a hypothesis.

8.1 Estimation

For the purpose of general discussion, let θ be the population parameter and $\hat{\theta}$ be the corresponding statistic. As already stated, the parameter θ is unknown. The value of the statistic $\hat{\theta}$ is computed from the random sample taken from the population.

The statistic $\hat{\theta}$ intended for estimating a parameter θ is called an estimator of θ. The specific numerical value of an estimator calculated from the sample is called the estimate. The process of obtaining an estimate of the unknown value of a parameter by a statistic is called estimation. There are two types of estimation: point estimation and interval estimation.

8.1.1 Point Estimation

Point estimation is the process of obtaining a single sample value (point estimate) that is used to estimate the desired population parameter. The estimator is known as a point estimator. For example, $\bar{X}$ is a point estimator of µ and S is a point estimator of σ.

The best estimator should be highly reliable and have desirable properties like unbiasedness, consistency, efficiency and sufficiency. These criteria are described as follows:

1. Unbiasedness: An estimator is a random variable since it is always a function of the sample values. A sample statistic is considered to be an unbiased estimator if its expected value equals the population parameter which is being estimated. This means $E(\hat{\theta}) = \theta$.

2. Consistency: It refers to the effect of sample size on the accuracy of the estimator. A statistic is said to be a consistent estimator of the population parameter if it approaches the parameter as the sample size increases, that is, $\hat{\theta} \to \theta$ as $n \to N$.

3. Efficiency: An estimator is considered to be efficient if its value remains stable from sample to sample. The best estimator would be the one which would have the least variance from sample to sample. From the three point estimators of central tendency, namely, the mean, median and mode, the mean is considered the least variant and hence is a better estimator for the population mean.

4. Sufficiency: An estimator is said to be sufficient if it uses all the information about the population parameter contained in the sample. For example, the sample mean uses all the sample values in its computation while median and mode do not. Hence, mean is the better estimator in this sense.

8.1.2 Interval Estimation

A point estimator has some drawbacks. First, a point estimator from the sample may not exactly locate the population parameter, that is, the value of the point estimator is not likely to be exactly equal to the value of the parameter, resulting in some margin of uncertainty. If the sample value is different from the population value, the point estimator does not indicate the extent of the possible error. Second, a point estimate does not specify how confident we can be that the estimate is close to the parameter it is estimating. That is, we cannot attach any degree of confidence to such an estimate.

Because of these limitations of point estimation, interval estimation is considered desirable. Interval estimation involves the determination of an interval (a range of values) within which the population parameter is expected to lie with a specified degree of confidence. It is the construction of an interval on both sides of the point estimate within which we can be reasonably confident that the true parameter will lie.

8.2 Hypothesis Testing

A statistical hypothesis is a conjecture (an assumption) about a population parameter which may or may not be correct. Such a hypothesis usually results from speculation concerning observed behavior, natural phenomena, or established theory.

Hypothesis testing is a statistical procedure which leads to a decision, based on sample data, about whether an assumption concerning the population parameter(s) is correct or not. It starts by making a set of two statements about the parameter(s) in question. These are usually expressed in the form of simple mathematical relationships involving the parameters. These two statements are exclusive and exhaustive, which means that one or the other statement must be true. The first statement is called the null hypothesis and is denoted by $H_0$; the second is called the alternative hypothesis and is denoted by $H_1$.

• Null Hypothesis: is a statement about the values of one or more parameters. It represents the status quo, that is, it states that there is no difference between a parameter and a hypothesized value. For any parameter θ and an assumed value $\theta_0$,

$$H_0: \theta = \theta_0 \Leftrightarrow \theta - \theta_0 = 0.$$

• Alternative Hypothesis: It is often called research hypothesis. An alternative hypothesis is a statement that contradicts the null hypothesis, that is, states that there is a difference between a parameter value and a hypothesized value. Hence, such hypothesis may have three different forms:

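– Two-sided test: $H_1: \theta \ne \theta_0 \Rightarrow \theta - \theta_0 \ne 0$

– One-sided test:

∗ Right tailed test: $H_1: \theta > \theta_0 \Rightarrow \theta - \theta_0 > 0$

∗ Left tailed test: $H_1: \theta < \theta_0 \Rightarrow \theta - \theta_0 < 0$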

8.2.1 Basic Concepts in Hypothesis Testing

Errors in Hypothesis Testing

There are two types of errors in hypothesis testing.

• Type I Error: It is an error occurred if one rejects the null hypothesis which is actually true. The probability of making such error is denoted by α and called significance level. This significance level (α) is the maximum acceptable probability of rejecting a true null hypothesis.

• Type II Error: It is an error occurred if one failed to reject the null hypothesis which is actually false. The probability of making this type II error is denoted by β. The is obtained as 1−β which is the probability of correctly rejecting the null hypothesis when it is false.

Steps in Testing a Hypothesis

A statistical hypothesis test can be formally summarized as a five-step process. These are:

• Step 1: State both hypotheses, H0 and H1.

• Step 2: Specify an acceptable level of significance, α.

• Step 3: Define a sample-based test statistic (Tcal) and rejection region (Ttab) for H0.

• Step 4: Make a decision, that is, either reject or do not reject H0.

• Step 5: Conclusion.

In step 1, H0 and H1 are the null and alternative hypotheses defined above, and in step 2, α is the level of significance. The most common choices of significance level are α = 0.1, α = 0.05 and α = 0.01. In step 3, the test statistic is a sample statistic whose sampling distribution can be specified under both the null and the alternative hypothesis (although the sampling distribution when the alternative hypothesis is true may often be quite complex). After specifying the appropriate significance level α, the sampling distribution of this statistic is used to define the rejection region. The rejection (critical) region is the range of values of the test statistic that leads to rejection of the null hypothesis. It comprises the values of the test statistic for which (1) the probability when the null hypothesis is true is less than or equal to the specified α and (2) the probabilities when H1 is true are greater than they are under H0. With regard to the decision, for a two-sided test reject H0 if |Tcal| ≥ Ttab, for a right-tailed test reject H0 if Tcal ≥ Ttab, and for a left-tailed test reject H0 if Tcal ≤ −Ttab.
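The decision rule in step 4 can be written as a small function. This is only a minimal sketch: the function name `reject_h0` and the tail labels are hypothetical names chosen for illustration, and the critical value is assumed to be read from a statistical table and passed in as a positive number.

```python
def reject_h0(t_cal, t_tab, tail="two-sided"):
    """Return True when the decision rule rejects H0.

    t_cal: calculated test statistic (Tcal).
    t_tab: positive tabulated critical value (Ttab).
    tail:  "two-sided", "right" or "left" (labels assumed for this sketch).
    """
    if tail == "two-sided":
        return abs(t_cal) >= t_tab
    if tail == "right":
        return t_cal >= t_tab
    if tail == "left":
        return t_cal <= -t_tab
    raise ValueError("tail must be 'two-sided', 'right' or 'left'")


# A two-sided test with Tcal = 2.5 and Ttab = 1.96 rejects H0.
print(reject_h0(2.5, 1.96))
```

For a two-sided test the critical value passed in should be the α/2 value; for a one-sided test it should be the α value.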

8.2.2 Hypothesis Testing for a Population Mean

1. H0 : µ = µ0
   H1 : µ ≠ µ0 / µ < µ0 / µ > µ0

2. Significance level = α.

3. Test Statistic and Rejection Region:

• If n is large, say n ≥ 30, the test statistic is
$$Z_{cal} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$

• If n is small, say n < 30, use the test statistic
$$t_{cal} = \frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t(n - 1).$$

In both cases µ takes the hypothesized value µ0, since the statistic is computed under H0.

4. Decision:

• For a two-tailed test, if |Tcal| ≥ Tα/2, H0 will be rejected.

• For a right tailed test, if Tcal ≥ Tα, H0 will be rejected.

• For a left tailed test, if Tcal ≤ −Tα, H0 will be rejected.

5. Conclude.

Examples:


1. Assume that the average annual income for government employees in Ethiopia is reported by the Ethiopian Statistical Agency Census Bureau to be birr 18750.00. There was some doubt whether the average yearly income of government employees in Ethiopia was representative of the national average. A random sample of 100 government employees in Ethiopia was taken and it was found that their average salary was birr 19240.00 with a standard deviation of birr 2610.00. Can we say that the average salary of government employees in Ethiopia is representative of the national average at the 5% level of significance?

2. Research done by a graduating student reports that the average score of Haramaya University students in a statistics course is less than 80. To test this claim, a random sample of 10 students was taken and their scores in the course were recorded as: 65, 70, 80, 85, 60, 90, 80, 75, 85, 90. At the 0.05 level of significance, test the validity of this claim.

Solutions

1. Given µ0 = 18750, n = 100, x̄ = 19240 and s = 2610.

(a) H0 : µ = 18750
    H1 : µ ≠ 18750

(b) α = 0.05 ⇒ Ztab = Zα/2 = Z0.025 = 1.96

(c) $$Z_{cal} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{19240 - 18750}{2610/\sqrt{100}} = 1.877$$

(d) Since |Zcal| < Ztab, H0 should not be rejected.

(e) Thus, the average salary of government employees in Ethiopia is not significantly different from the national average at the 5% level of significance.

2. Given µ0 = 80 and n = 10,

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{10}(65 + 70 + \cdots + 90) = 78$$

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = 106.67 \;\Rightarrow\; s = \sqrt{106.67} = 10.33$$

(a) H0 : µ = 80
    H1 : µ < 80

(b) α = 0.05 ⇒ ttab = tα(n − 1) = t0.05(9) = 1.833

(c) $$t_{cal} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{78 - 80}{10.33/\sqrt{10}} = -0.612$$

(d) Since tcal > −ttab, H0 should not be rejected.

(e) Thus, at the 5% level of significance there is not sufficient evidence that the average score of Haramaya University students in the statistics course is less than 80.
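The two solutions above can be checked numerically. The following is a minimal sketch using only the Python standard library; the critical values 1.96 and 1.833 are the table values quoted in the solutions, and the variable names are chosen here for illustration.

```python
import math
import statistics

# Example 1: large sample, summary statistics taken from the text.
mu0, n, xbar, s = 18750, 100, 19240, 2610
z_cal = (xbar - mu0) / (s / math.sqrt(n))
z_tab = 1.96                      # Z_{0.025}, as quoted in the solution
reject_1 = abs(z_cal) >= z_tab    # two-sided decision rule

# Example 2: small sample, raw scores taken from the text.
scores = [65, 70, 80, 85, 60, 90, 80, 75, 85, 90]
n2 = len(scores)
xbar2 = statistics.mean(scores)
s2 = statistics.stdev(scores)     # sample standard deviation (divisor n - 1)
t_cal = (xbar2 - 80) / (s2 / math.sqrt(n2))
t_tab = 1.833                     # t_{0.05}(9), as quoted in the solution
reject_2 = t_cal <= -t_tab        # left-tailed decision rule

print(f"Z_cal = {z_cal:.3f}, reject H0: {reject_1}")
print(f"t_cal = {t_cal:.3f}, reject H0: {reject_2}")
```

In both cases the calculated statistic falls outside the rejection region, so H0 is not rejected.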


8.2.3 Confidence Interval for a Population Mean

The (1 − α)100% confidence interval for the population mean µ is:

• $\left(\bar{x} - t_{\alpha/2}(n-1)\, s/\sqrt{n},\ \bar{x} + t_{\alpha/2}(n-1)\, s/\sqrt{n}\right)$ if n is small.

• $\left(\bar{x} - Z_{\alpha/2}\, s/\sqrt{n},\ \bar{x} + Z_{\alpha/2}\, s/\sqrt{n}\right)$ if n is large.

Example: Construct the 95% confidence interval for the population mean in each of the previous two examples.

Solutions:

1. $$\left(\bar{x} - Z_{\alpha/2}\, s/\sqrt{n},\ \bar{x} + Z_{\alpha/2}\, s/\sqrt{n}\right) = \left(19240 - 1.96 \times 2610/\sqrt{100},\ 19240 + 1.96 \times 2610/\sqrt{100}\right) = (18728.44,\ 19751.56)$$

2. $$\left(\bar{x} - t_{\alpha/2}(n-1)\, s/\sqrt{n},\ \bar{x} + t_{\alpha/2}(n-1)\, s/\sqrt{n}\right) = \left(78 - 2.262 \times 10.33/\sqrt{10},\ 78 + 2.262 \times 10.33/\sqrt{10}\right) = (70.61,\ 85.39)$$
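Both intervals have the same form, point estimate plus or minus a critical value times the standard error, so a single helper covers them. This is a minimal sketch; the function name `mean_ci` is hypothetical, and the critical values 1.96 and 2.262 are the table values used in the solutions.

```python
import math


def mean_ci(xbar, s, n, crit):
    """Confidence interval xbar ± crit * s / sqrt(n).

    crit is the tabulated critical value: Z_{alpha/2} for a large sample,
    t_{alpha/2}(n - 1) for a small one.
    """
    half_width = crit * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


ci_large = mean_ci(19240, 2610, 100, 1.96)   # example 1
ci_small = mean_ci(78, 10.33, 10, 2.262)     # example 2

print(tuple(round(v, 2) for v in ci_large))
print(tuple(round(v, 2) for v in ci_small))
```

Note that 18750 lies inside the first interval and 80 lies inside the second, which agrees with the decisions not to reject H0 in the two tests.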

Chapter 9

Inference for Two or More Populations

The inferences we have made so far have concerned a parameter from a single population. There may be situations that need comparison of parameters from different populations.

Comparative studies are designed to discover and evaluate the differences between effects rather than the effects themselves. In such studies, we must perform an experiment, collect informative data and then reach a decision based on the results. In general discussion, the statistical term 'treatment' is used to refer to the techniques that will be compared. In performing an experiment, the basic units exposed to one or another treatment are called experimental units (subjects). The characteristic recorded after the treatment is applied to the units is called the response. The manner in which subjects are chosen and assigned to treatments is called the experimental design.

9.1 Comparison of the Population Mean in Two groups

In studies involving the comparison of two groups, there are two ways of taking a sample: paired sample and independent sample.

9.1.1 Paired Sample

Here pairs of similar (identical) experimental units are selected and a treatment is applied to one member of each pair. Then the response of interest is recorded for each pair. A common application occurs when the response is measured on two different occasions. This is appropriate for pre-post treatment responses.

For two paired variables X1 and X2, the difference of the two variables, di = x1i − x2i, i = 1, 2, ···, n, is treated as if it were a single sample. The null hypothesis is that the true mean difference of the two variables is µd = µ1 − µ2 = D0. The difference D0 is typically assumed to be zero unless explicitly specified.

The steps to be followed are similar to those of the one-sample case.

1. The null and alternative hypotheses to be tested are:

H0 : µd = 0

H1 : µd ≠ 0 or µd < 0 or µd > 0

2. Choose a level of significance (α).

3. The test statistic is:

$$t = \frac{\bar{d} - \mu_d}{s_d/\sqrt{n}} \sim t(n - 1)$$

where $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$ is the sample mean of the differences, $s_d^2 = \frac{1}{n-1}\sum_{i=1}^{n}(d_i - \bar{d})^2$ is the sample variance of the differences and n is the sample size.

4. Decision:

• For a two sided test, H0 is rejected if |t| > tα/2(n − 1).

• For a one sided test, H0 is rejected if t > tα(n − 1) (right tailed) or t < −tα(n − 1) (left tailed).

5. Conclude.

Also, the (1 − α)100% confidence interval for µd is

$$\bar{d} \pm t_{\alpha/2}(n - 1)\,\frac{s_d}{\sqrt{n}}.$$

Example: A medical researcher wishes to determine if a pill has an effect on reducing the blood pressure of individuals. The study involves recording the initial blood pressure of 15 women. After they took the pill for six months, their blood pressure was recorded again. The data are:

Women    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Before  70  80  72  76  76  76  72  78  82  64  74  92  74  68  84
After   68  72  62  70  58  66  68  52  64  72  74  60  74  72  74

Do the data substantiate the claim that the pill reduced blood pressure? Also construct the 95% confidence interval for the mean difference of blood pressure.

Solution: Let µd be the population mean of the difference in the blood pressure of women. The differences of the before-after blood pressures are:

Women            1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Before (x1i)    70  80  72  76  76  76  72  78  82  64  74  92  74  68  84
After (x2i)     68  72  62  70  58  66  68  52  64  72  74  60  74  72  74
di = x1i − x2i   2   8  10   6  18  10   4  26  18  -8   0  32   0  -4  10


The sample mean of the differences is

$$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i = \frac{1}{15}(2 + 8 + \cdots + 10) = \frac{1}{15}(132) = 8.8.$$

The variance of the differences is

$$s_d^2 = \frac{1}{n-1}\sum_{i=1}^{n}(d_i - \bar{d})^2 = \frac{1}{15-1}\left\{(2 - 8.8)^2 + (8 - 8.8)^2 + \cdots + (10 - 8.8)^2\right\} = \frac{1}{14}(1686.4) = 120.457,$$

which implies the standard deviation sd = 10.98.

1. Thus, the null and alternative hypotheses to be tested are:

H0 : µd = 0

H1 : µd > 0

2. The level of significance is α = 0.05. The critical value is tα(n − 1) = t0.05(15 − 1) = t0.05(14) = 1.76.

3. The test statistic is:

$$t = \frac{\bar{d} - \mu_d}{s_d/\sqrt{n}} = \frac{8.8 - 0}{10.98/\sqrt{15}} = 3.10$$

4. Decision: Since t > t0.05(14), H0 should be rejected.

5. Conclusion: The pill reduced the blood pressure of women.

The 95% confidence interval is:

$$\bar{d} \pm t_{\alpha/2}(n - 1)\,\frac{s_d}{\sqrt{n}} = 8.8 \pm 2.145\left(\frac{10.98}{\sqrt{15}}\right) = (2.72,\ 14.88).$$
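The paired-sample calculation can be reproduced from the raw data. This is a minimal sketch using the standard library; the critical values 1.76 and 2.145 are the table values quoted in the solution, and the variable names are chosen for illustration.

```python
import math
import statistics

before = [70, 80, 72, 76, 76, 76, 72, 78, 82, 64, 74, 92, 74, 68, 84]
after = [68, 72, 62, 70, 58, 66, 68, 52, 64, 72, 74, 60, 74, 72, 74]

# Treat the before-after differences as a single sample.
d = [b - a for b, a in zip(before, after)]
n = len(d)
d_bar = statistics.mean(d)
s_d = statistics.stdev(d)              # divisor n - 1
std_err = s_d / math.sqrt(n)

t_cal = (d_bar - 0) / std_err          # H0: mu_d = 0
t_one_sided = 1.76                     # t_{0.05}(14), from the text
reject = t_cal > t_one_sided           # right-tailed test, H1: mu_d > 0

t_two_sided = 2.145                    # t_{0.025}(14), from the text
ci = (d_bar - t_two_sided * std_err, d_bar + t_two_sided * std_err)

print(f"d_bar = {d_bar:.1f}, s_d = {s_d:.2f}, t = {t_cal:.2f}, reject H0: {reject}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The interval excludes zero, which is consistent with rejecting H0.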

9.1.2 Independent Samples

In many sampling situations, we will select independent random samples from two groups in order to compare the population means. Let µ1 be the population mean of the first group and µ2 be the population mean of the second group. First let us consider the small sample size case.

1. The null and alternative hypotheses to be tested are:

H0 : µ1 = µ2

H1 : µ1 ≠ µ2 or µ1 < µ2 or µ1 > µ2

2. Choose a level of significance (α).

3. Under the assumption of equal variances in the two groups, the test statistic is:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim t(n_1 + n_2 - 2)$$

where $\bar{x}_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} x_{1i}$ is the sample mean of the first group and $\bar{x}_2 = \frac{1}{n_2}\sum_{i=1}^{n_2} x_{2i}$ is the sample mean of the second group,

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

is the pooled variance of both groups (note that $s_1^2 = \frac{1}{n_1 - 1}\sum_{i=1}^{n_1}(x_{1i} - \bar{x}_1)^2$ is the sample variance of the first group and $s_2^2 = \frac{1}{n_2 - 1}\sum_{i=1}^{n_2}(x_{2i} - \bar{x}_2)^2$ is the sample variance of the second group), n1 is the sample size of the first group and n2 is the sample size of the second group.

4. Decision:

• For a two sided test, H0 is rejected if |t| > tα/2(n1 + n2 − 2).

• For a one sided test, H0 is rejected if t > tα(n1 + n2 − 2) (right tailed) or t < −tα(n1 + n2 − 2) (left tailed).

5. Conclude.

The (1 − α)100% confidence interval for the difference of the population means is:

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}(n_1 + n_2 - 2)\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}.$$

The above test statistic is only used when the two distributions have the same variance. When the two population variances are assumed to be different and hence must be estimated separately, the test statistic is modified slightly to:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \sim t(v)$$

where now the degrees of freedom are

$$v = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1 - 1) + (s_2^2/n_2)^2/(n_2 - 1)}.$$

Similarly, the (1 − α)100% confidence interval for the difference of the population means when the population variances are different is:

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}(v)\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}.$$

Example: Company officials were concerned about the length of time a particular drug product retained its toxin’s potency. A random sample of 10 bottles of the product was drawn from the production line and measured for potency. A second sample of 10 bottles was obtained and stored in a regulated environment for a period of one year. The readings obtained from each sample are given below.


Sample 1  10.2  10.5  10.3  10.8   9.8  10.6  10.7  10.2  10.0  10.6
Sample 2   9.8   9.6  10.1  10.2  10.1   9.7   9.5   9.6   9.8   9.9

Test the null hypothesis that the drug product retains its potency. Also, construct the 95% confidence interval for the difference of the population means.

Solution: Let µ1 be the mean potency of the product taken from the production line and µ2 be the mean potency of the drug product that was retained for a year. The summary statistics of the data are:

$$\bar{x}_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} x_{1i} = \frac{1}{10}(103.7) = 10.37$$

$$\bar{x}_2 = \frac{1}{n_2}\sum_{i=1}^{n_2} x_{2i} = \frac{1}{10}(98.3) = 9.83$$

$$s_1^2 = \frac{1}{n_1 - 1}\left[\sum_{i=1}^{n_1} x_{1i}^2 - \frac{1}{n_1}\Big(\sum_{i=1}^{n_1} x_{1i}\Big)^2\right] = \frac{1}{9}\left[1076.3 - \frac{1}{10}(103.7)^2\right] = 0.105$$

$$s_2^2 = \frac{1}{n_2 - 1}\left[\sum_{i=1}^{n_2} x_{2i}^2 - \frac{1}{n_2}\Big(\sum_{i=1}^{n_2} x_{2i}\Big)^2\right] = \frac{1}{9}\left[966.81 - \frac{1}{10}(98.3)^2\right] = 0.058$$

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(10 - 1)(0.105) + (10 - 1)(0.058)}{10 + 10 - 2} = 0.0815 \;\Rightarrow\; s_p = 0.285$$

1. The hypotheses to be tested are:

H0 : µ1 = µ2

H1 : µ1 ≠ µ2

2. The level of significance is α = 0.05. Thus t0.05/2(10 + 10 − 2) = t0.025(18) = 2.101.

3. The test statistic is:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{(10.37 - 9.83) - 0}{0.285\sqrt{\dfrac{1}{10} + \dfrac{1}{10}}} = 4.237$$

4. Decision: Since |t| > t0.025(18), H0 is rejected.

5. Conclusion: There is a significant difference in the mean potency of the drug product from the production line and the drug that was retained for one year.

The 95% confidence interval for the difference of the population means, µ1 − µ2, is:

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}(n_1 + n_2 - 2)\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = (10.37 - 9.83) \pm 2.101(0.285)\sqrt{\frac{1}{10} + \frac{1}{10}} = (0.272,\ 0.808).$$
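The pooled-variance test for the potency data can be recomputed directly from the two samples. This is a minimal sketch with the standard library; the critical value 2.101 is the table value t0.025(18) quoted in the solution, and the variable names are chosen for illustration. Working from the unrounded sample variances may change the last digit of some intermediate results.

```python
import math
import statistics

sample1 = [10.2, 10.5, 10.3, 10.8, 9.8, 10.6, 10.7, 10.2, 10.0, 10.6]
sample2 = [9.8, 9.6, 10.1, 10.2, 10.1, 9.7, 9.5, 9.6, 9.8, 9.9]

n1, n2 = len(sample1), len(sample2)
x1, x2 = statistics.mean(sample1), statistics.mean(sample2)
v1, v2 = statistics.variance(sample1), statistics.variance(sample2)

# Pooled variance under the equal-variance assumption.
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
std_err = math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2)

t_cal = ((x1 - x2) - 0) / std_err      # H0: mu1 = mu2
t_tab = 2.101                          # t_{0.025}(18), from the text
reject = abs(t_cal) > t_tab

ci = ((x1 - x2) - t_tab * std_err, (x1 - x2) + t_tab * std_err)

print(f"x1 = {x1:.2f}, x2 = {x2:.2f}, sp^2 = {sp2:.4f}")
print(f"t = {t_cal:.3f}, reject H0: {reject}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```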


Example: A quick but impressive method of estimating the concentration of a chemical in a rat has been developed. The sample from this method has 8 observations and the sample from the standard method has 4 observations. Assuming different population variances, test whether the quick method gives an under-estimate. The data in the two samples are:

Standard Method  25  24  25  26
Quick Method     23  18  22  28  17  25  19  16

Solution: Let µ1 = the population mean of the standard method and µ2 = the population mean of the quick method. The sample means are x̄1 = 25 and x̄2 = 21, and the sample variances are s1² = 0.67 and s2² = 17.71.

1. The hypotheses to be tested are:

H0 : µ1 = µ2

H1 : µ1 > µ2

2. The level of significance is α = 0.05. The degrees of freedom under the unequal variances assumption are

$$v = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1 - 1) + (s_2^2/n_2)^2/(n_2 - 1)} = \frac{(0.67/4 + 17.71/8)^2}{(0.67/4)^2/(4 - 1) + (17.71/8)^2/(8 - 1)} \approx 8.$$

Thus t0.05(8) = 1.86.

3. The test statistic is:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{(25 - 21) - 0}{\sqrt{\dfrac{0.67}{4} + \dfrac{17.71}{8}}} = 2.59$$

4. Decision: H0 is rejected since t > t0.05(8).

5. Conclusion: The quick method gives an under-estimate.

For large samples, the test statistic follows a normal distribution. That is,

$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \sim N(0, 1).$$

Also the (1 − α)100% confidence interval is

$$(\bar{x}_1 - \bar{x}_2) \pm Z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}.$$


Similarly, if the groups have common variance, the pooled variance (standard deviation) can be calculated as shown before in the small sample case.

Example: For a random sample of 120 adult females born in country A, the mean height was 62.7 inches with standard deviation 2.50 inches. For another random sample of 150 adult females born in country B, the mean height was 61.8 inches with standard deviation 2.62 inches. Would you reject the null hypothesis that there is no difference in height between adult females born in the two countries at the 1% level of significance?

Solution: Let µ1 = the mean height of adult females born in country A and µ2 = the mean height of adult females born in country B.

1. The hypotheses to be tested are:

H0 : µ1 = µ2

H1 : µ1 ≠ µ2

2. The level of significance is α = 0.01. Thus, Z0.01/2 = Z0.005 = 2.58.

3. The test statistic is:

$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{(62.7 - 61.8) - 0}{\sqrt{\dfrac{(2.50)^2}{120} + \dfrac{(2.62)^2}{150}}} = 2.88$$

4. Decision: H0 is rejected since |Z| > Z0.005.

5. Conclusion: There is a difference in the population mean height of adult females in the two countries.
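The unequal-variance statistic and the large-sample Z statistic share the same standard error formula, so both examples above can be checked with one short script. This is a minimal sketch with the standard library; the helper name `unequal_variance_stat` is hypothetical, and the critical values 1.86 and 2.58 are the table values quoted in the two solutions.

```python
import math
import statistics


def unequal_variance_stat(mean1, var1, n1, mean2, var2, n2):
    """Return (statistic, degrees of freedom v) for H0: mu1 = mu2."""
    a, b = var1 / n1, var2 / n2
    stat = (mean1 - mean2) / math.sqrt(a + b)
    v = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return stat, v


# Small samples with unequal variances: standard vs quick method.
standard = [25, 24, 25, 26]
quick = [23, 18, 22, 28, 17, 25, 19, 16]
t_cal, v = unequal_variance_stat(
    statistics.mean(standard), statistics.variance(standard), len(standard),
    statistics.mean(quick), statistics.variance(quick), len(quick),
)
reject_t = t_cal > 1.86            # t_{0.05}(8), right-tailed test

# Large samples: heights in countries A and B (summary statistics from the text).
z_cal, _ = unequal_variance_stat(62.7, 2.50 ** 2, 120, 61.8, 2.62 ** 2, 150)
reject_z = abs(z_cal) > 2.58       # Z_{0.005}, two-sided test

print(f"t = {t_cal:.2f}, v = {v:.2f}, reject H0: {reject_t}")
print(f"Z = {z_cal:.2f}, reject H0: {reject_z}")
```

For the large-sample case the degrees of freedom returned by the helper are ignored, because the statistic is referred to the standard normal distribution.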

9.2 Analysis of Variance (ANOVA)

The t and Z tests have been used for testing the hypothesis of a single population mean equal to a specified value or equality of two populations means when the sample size is small and large respectively. However in testing the equality of more than two population means, techniques of analysis of variance (ANOVA) will be used. Thus, the analysis of variance (ANOVA) is used to compare the means of two or more groups based on the variance ratio test, i.e., an F -test, and relating it to the F -distribution.

There are many types of observational classifications. If the observations are classified on the basis of a single criterion, the classification is called one-way classification. If the observations are classified on the basis of two criteria, it is called two-way classification.


Here, one-way ANOVA will be discussed. The principle underlying the one-way ANOVA is that the total variability in a data set is partitioned into two components: the variation between groups and the variation within groups. Each component represents a different source of variation. The between-groups variation can be accounted for by the grouping, whereas the within-group variation is the unexplained (residual) variation that results from uncontrolled biological variation and technical error.

Suppose there is one basic variable or criterion of classification with k groups. The null hypothesis to be tested is that all k group means are equal, and the alternative hypothesis is that at least one of the group means is different from the others. That is,

H0 : µ1 = µ2 = ··· = µk

H1 : not H0

To construct the test statistic, the total sum of squares (TSS) is decomposed into the between sum of squares (BSS) and the within (error) sum of squares (ESS).

TSS = BSS + ESS

$$\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x})^2 = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2$$

The TSS has n − 1 degrees of freedom, the BSS has k − 1 degrees of freedom and the ESS has n − k degrees of freedom. The ratios of the BSS and ESS to their corresponding degrees of freedom are called the between mean square (BMS) and the error mean square (EMS), respectively. The test statistic is the ratio of BMS to EMS, F = BMS/EMS, and the test is therefore called an F test. In addition, the critical value is Fα(k − 1, n − k).

The ANOVA table is written as follows.

Source of Variation   SS                           df      MS                  F
Between               BSS = Σi ni(x̄i − x̄)²         k − 1   BMS = BSS/(k − 1)   F = BMS/EMS
Within (Error)        ESS = Σi Σj (xij − x̄i)²      n − k   EMS = ESS/(n − k)
Total                 TSS = Σi Σj (xij − x̄)²       n − 1

Assumptions of the one-way ANOVA

1. The samples are independently and randomly drawn from the source population(s).

2. The source population(s) are reasonably normally distributed.

3. The samples have approximately equal variances.

If the samples are of equal size, these assumptions are not a major worry because ANOVA is quite robust (relatively unperturbed by violations of its assumptions). But if the samples are of different sizes and the assumptions are in doubt, an appropriate non-parametric alternative to one-way ANOVA, the Kruskal-Wallis test, can be used instead.

Example: Suppose a university wishes to compare the effectiveness of four teaching methods (Slide, Self Study, Lecture, Discussion) for a particular course. Twenty four students are randomly assigned to the teaching methods, with 5, 6, 6 and 7 respectively. At the end of teaching the students with their assigned method, a test (out of 20%) was given and the performance of the students was recorded as follows:

Slide   Self Study   Lecture   Discussion
9       10           12        9
12      6            14        8
14      6            11        11
11      9            13        7
13      10           11        8
        5            16        6
                               7

Construct the ANOVA table. Also test the hypothesis that there is no difference among the four teaching methods.

Solution: The summary statistics are calculated as:

              Slide (x1j)    Self Study (x2j)   Lecture (x3j)   Discussion (x4j)
              9              10                 12              9
              12             6                  14              8
              14             6                  11              11
              11             9                  13              7
              13             10                 11              8
                             5                  16              6
                                                                7
Sample Size   n1 = 5         n2 = 6             n3 = 6          n4 = 7
Sum           Σj x1j = 59    Σj x2j = 46        Σj x3j = 77     Σj x4j = 56
Mean          x̄1 = 11.800    x̄2 = 7.667         x̄3 = 12.833     x̄4 = 8.000

Also, the overall summaries are: n = n1 + n2 + n3 + n4 = 24 and Σi Σj xij = 59 + 46 + 77 + 56 = 238. Thus, x̄ = (1/n) Σi Σj xij = (1/24)(238) = 9.917.

S.V.      SS             df            MS                           F
Between   BSS = 124.87   4 − 1 = 3     BMS = 124.87/3 = 41.6233     F = 41.6233/3.7485 = 11.10
Within    ESS = 74.97    24 − 4 = 20   EMS = 74.97/20 = 3.7485
Total     TSS = 199.83   24 − 1 = 23

The calculated F value is then compared with the critical value F0.05(3, 20) = 3.10. Since 11.10 > 3.10, H0 should be rejected. This means that there is a difference among the teaching methods.
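The sums of squares in the ANOVA table can be computed directly from the raw scores. This is a minimal sketch in plain Python; the group lists are read column by column from the data table (their sums agree with the group sums given in the solution), and the variable names are chosen for illustration. The critical value is not computed here and must still be read from an F table.

```python
groups = {
    "Slide": [9, 12, 14, 11, 13],
    "Self Study": [10, 6, 6, 9, 10, 5],
    "Lecture": [12, 14, 11, 13, 11, 16],
    "Discussion": [9, 8, 11, 7, 8, 6, 7],
}

k = len(groups)
n = sum(len(g) for g in groups.values())
grand_mean = sum(sum(g) for g in groups.values()) / n


def mean(values):
    return sum(values) / len(values)


# Between, within (error) and total sums of squares.
bss = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
ess = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
tss = sum((x - grand_mean) ** 2 for g in groups.values() for x in g)

bms = bss / (k - 1)
ems = ess / (n - k)
f_cal = bms / ems

print(f"n = {n}, grand mean = {grand_mean:.3f}")
print(f"BSS = {bss:.2f}, ESS = {ess:.2f}, TSS = {tss:.2f}")
print(f"BMS = {bms:.4f}, EMS = {ems:.4f}, F = {f_cal:.2f}")
```

A useful check on any ANOVA table is that BSS + ESS reproduces TSS up to rounding.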

Mean Separation

In the ANOVA, if the null hypothesis is rejected, then there is a need to identify which pairs of group means differ significantly and which do not. There are several methods of mean separation; of these, Fisher's Least Significant Difference (LSD) test is considered here. In this method, first sort the group means in order of magnitude so that two adjacent means can be compared at a time. For comparing µi and µj, compute

$$LSD_{ij} = t_{\alpha/2}(n - k)\sqrt{EMS\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}.$$

Then, if |x̄i − x̄j| > LSDij, there is a significant difference between µi and µj. Otherwise, no significant difference is observed.

Example: Recall the previous example and identify the significant pair of teaching method means using LSD.

Solution: The group means in descending order are Lecture: x̄3 = 12.833, Slide: x̄1 = 11.800, Discussion: x̄4 = 8.000, Self Study: x̄2 = 7.667, with n1 = 5, n2 = 6, n3 = 6, n4 = 7. At α = 0.05, tα/2(n − k) = t0.025(20) = 2.086, and EMS = 3.7485.

Methods               |x̄i − x̄j|                  LSD                                   Sign.
Lecture vs Slide      12.833 − 11.800 = 1.033   2.086√(3.7485(1/6 + 1/5)) = 2.446     not sig.
Slide vs Discussion   11.800 − 8.000 = 3.800    2.086√(3.7485(1/5 + 1/7)) = 2.365     sig.
Discussion vs Self    8.000 − 7.667 = 0.333     2.086√(3.7485(1/7 + 1/6)) = 2.247     not sig.

Therefore, lecture and slide teaching methods are better than the other two.
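The adjacent-pair LSD comparisons can be sketched as follows. The group means, sample sizes and EMS are taken from the example; the multiplier 2.086 is an assumption of this sketch (the table value t0.025(20) for α = 0.05), and a different significance level would need a different table value. The names `lsd` and `comparisons` are hypothetical.

```python
import math

ems = 3.7485          # error mean square from the ANOVA table
t_crit = 2.086        # assumed: t_{0.025}(20) from a t table, alpha = 0.05

# (group mean, sample size) for each teaching method, from the example.
methods = {
    "Lecture": (12.833, 6),
    "Slide": (11.800, 5),
    "Discussion": (8.000, 7),
    "Self Study": (7.667, 6),
}


def lsd(n_i, n_j):
    """Least significant difference for groups of size n_i and n_j."""
    return t_crit * math.sqrt(ems * (1 / n_i + 1 / n_j))


# Sort by mean and compare each mean with the next one in the ordering.
ordered = sorted(methods.items(), key=lambda item: item[1][0], reverse=True)
comparisons = []
for (name_i, (m_i, n_i)), (name_j, (m_j, n_j)) in zip(ordered, ordered[1:]):
    diff = abs(m_i - m_j)
    threshold = lsd(n_i, n_j)
    comparisons.append((name_i, name_j, diff, threshold, diff > threshold))

for name_i, name_j, diff, threshold, significant in comparisons:
    verdict = "sig." if significant else "not sig."
    print(f"{name_i} vs {name_j}: diff = {diff:.3f}, LSD = {threshold:.3f}, {verdict}")
```

Only the Slide versus Discussion gap exceeds its LSD, which separates the methods into a higher-scoring pair and a lower-scoring pair.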

Chapter 10

Simple Linear Regression and Correlation

In the previous chapters we have been dealing with a single variable. In this chapter we will deal with a bi-variate data, i.e., data involving two variables.

10.1 Correlation

Correlation is a statistical tool designed for measuring the degree of the relationship (degree of association) between variables. If a change in one variable is accompanied by a change in the other variable, then the variables are said to be correlated.

Correlation that involves only two variables is called simple correlation. The simplest way to present bivariate data is to plot them on the XY plane. For a bivariate distribution (X, Y), the values (Xi, Yi), i = 1, 2, ···, N are plotted in the XY plane. This is known as a scatter diagram. It gives an idea about the correlation of the two variables, but only a vague idea about the presence or absence of correlation and the nature (direct or indirect) of the correlation. It does not indicate the strength or degree of the relationship between the two variables.

10.1.1 Covariance

Covariance is a measure of the joint variation between two variables, i.e., it measures the way in which the values of the two variables vary together.

Recall that the sample and population variances of a certain variable X are calculated, respectively, as:

$$S_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X}) = S_{xx}$$

and

$$\sigma_x^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})(X_i - \bar{X}) = \sigma_{xx}$$

Similarly, the sample and population covariances between two variables are defined, respectively, as:

$$S_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i Y_i - \frac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}\right]$$

and

$$\sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y}) = \frac{1}{N}\left[\sum_{i=1}^{N} X_i Y_i - \frac{\sum_{i=1}^{N} X_i \sum_{i=1}^{N} Y_i}{N}\right]$$

If the covariance is zero, there is no linear relationship between the two variables. If it is negative, there is an indirect linear relationship between them. If the covariance is positive, there is a direct linear relationship between the variables.


10.1.2 Pearson's Correlation Coefficient

The coefficient of correlation, which was developed by Karl Pearson, is a measure of the degree or strength of the linear association between two variables. It is defined as the ratio of the covariance between the two variables to the product of the standard deviations of the two variables. The sample correlation coefficient is denoted by r and the population correlation coefficient is denoted by the Greek letter ρ, rho.

$$r = \frac{S_{xy}}{S_x S_y} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$$

This can be written as:

$$r = \frac{n\sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sqrt{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2}}$$

Interpretations of r: The value of the correlation coefficient can be positive or negative, depending on the sign of the covariance between the two variables. But, it lies between the limits -1 and +1; that is −1 ≤ r ≤ 1.

• If the value of r is -1 or +1, there is a perfect negative/inverse/indirect or positive/direct linear relationship between the variables, respectively.

• If the value of r is approximately -1 or +1, there is a strong negative/inverse/indirect or positive/direct linear relationship between the variables, respectively.

• If the value of r is approximately -0.5 or +0.5, there is a medium negative/inverse/indirect or positive/direct linear relationship between the variables, respectively.

• If the value of r is near zero, there is no linear association between the two variables.

Limitations of r:

1. If X and Y are statistically independent, the correlation coefficient between them is zero; but the converse is not always true. In other words, zero correlation does not necessarily imply independence. It is a measure of linear association or linear dependence only; it has no meaning for describing nonlinear relations. Thus, for example, even though Y = X² is an exact relationship, r is zero when the X values are symmetric about zero. (Why?)


2. Although it is a measure of the linear association between variables, it does not necessarily imply any cause and effect relationship.

Example: A researcher wants to find out if there is a relationship between the heights of sons and the heights of their fathers. In other words, do taller fathers have taller sons? The researcher took a random sample of 6 fathers and their 6 sons. Their heights in inches are given below in an ordered array.

Father (X)  63  65  66  67  67  68
Son (Y)     66  68  65  67  69  70

Find the covariance and correlation coefficient and interpret.

Solution:

No.     X           Y           X²             Y²             XY
1       63          66          3969           4356           4158
2       65          68          4225           4624           4420
3       66          65          4356           4225           4290
4       67          67          4489           4489           4489
5       67          69          4489           4761           4623
6       68          70          4624           4900           4760
Total   ΣXi = 396   ΣYi = 405   ΣXi² = 26152   ΣYi² = 27355   ΣXiYi = 26740

⇒ r = 0.597
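The example asks for both the covariance and the correlation coefficient; both follow from the column totals in the table. This is a minimal sketch using the computational formulas given earlier; the variable names are chosen for illustration.

```python
import math

x = [63, 65, 66, 67, 67, 68]   # heights of fathers
y = [66, 68, 65, 67, 69, 70]   # heights of sons
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# Sample covariance: (sum XY - (sum X)(sum Y)/n) / (n - 1)
s_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)

# Pearson correlation coefficient, computational form.
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(f"S_xy = {s_xy:.2f}, r = {r:.4f}")
```

The covariance is positive, so the relationship is direct, and the value of r indicates a medium positive linear relationship between the heights of fathers and sons.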

10.1.3 Spearman's Rank Correlation

It is not always possible to take measurements on units or objects. Many characteristics are expressed in comparative terms, such as beauty, smartness, temperament, .... In such cases the units are ranked with respect to that particular characteristic instead of taking measurements on them. Sometimes, the units are also ranked according to their quantitative measure. In these types of studies, two situations arise: (i) the same set of units is ranked according to two characteristics, (ii) two judges independently rank the same set of units with respect to one characteristic. In both situations we get paired ranks for a set of units. For example, students are ranked according to their marks in Mathematics and Statistics, or two judges rank the girls independently in a beauty competition. In all these situations, the usual Pearson's correlation coefficient cannot be obtained.

Suppose that a group of n individuals is given grades or ranks with respect to two characteristics. Let (Xi, Yi), i = 1, 2, ···, n be the ranks of the i-th individual on the two characteristics. Then Spearman's rank correlation coefficient is given by:

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \quad \text{where } d_i = R_{X_i} - R_{Y_i}.$$

This formula is used when no rank is repeated. For repeated ranks (ties within one variable), each tied item is given the average of the ranks it would otherwise occupy, and a correction factor is required. A tie between the two ranks of the same individual (di = 0) creates no problem.

If m is the number of times an item is repeated, then the factor m(m² − 1)/12 is added to Σ di². This correction factor is to be added for each repeated value.

$$r_s = 1 - \frac{6\left(\sum_{i=1}^{n} d_i^2 + \sum CF\right)}{n(n^2 - 1)}$$

Note that −1 ≤ rs ≤ 1.

Example: The ranks of 10 students in two courses, Statistics and Economics, are given below. Calculate the rank correlation and interpret.

Statistics   5   2   9   8   1  10   3   4   6   7
Economics   10   5   1   3   8   6   2   7   9   4

Ans: rs = −0.31

Example: Obtain the rank correlation for the following data.

X  85  74  85  50  65  78  74  60  74  90
Y  78  91  78  58  60  72  80  55  68  70

Ans: rs = 0.545
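Both rank-correlation exercises can be worked through in code. This is a minimal sketch under two stated assumptions: tied values receive the average of the ranks they occupy, and the correction factor for a value repeated m times is m(m² − 1)/12. The function names `average_ranks`, `tie_correction` and `spearman` are hypothetical, and rank 1 is given to the largest value (ranking from the smallest instead gives the same coefficient).

```python
from collections import Counter


def average_ranks(values):
    """Rank values (largest = rank 1); tied values share the average rank."""
    positions = {}
    for pos, v in enumerate(sorted(values, reverse=True), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every value repeated m > 1 times."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)


def spearman(x, y):
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    cf = tie_correction(x) + tie_correction(y)
    return 1 - 6 * (sum_d2 + cf) / (n * (n * n - 1))


# First example: the data are already ranks without ties.
statistics_rank = [5, 2, 9, 8, 1, 10, 3, 4, 6, 7]
economics_rank = [10, 5, 1, 3, 8, 6, 2, 7, 9, 4]
rs_1 = spearman(statistics_rank, economics_rank)

# Second example: raw scores with repeated values.
x = [85, 74, 85, 50, 65, 78, 74, 60, 74, 90]
y = [78, 91, 78, 58, 60, 72, 80, 55, 68, 70]
rs_2 = spearman(x, y)

print(f"rs (first example)  = {rs_1:.3f}")
print(f"rs (second example) = {rs_2:.3f}")
```

In the second example X contains one value repeated twice and one repeated three times, and Y contains one value repeated twice, so three correction terms are added to the sum of squared rank differences.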

10.2 Simple Linear Regression

Regression may be defined as the estimation of the unknown value of one variable from the known values of one or more other variables. The variable whose values are to be estimated is known as the dependent variable, while the variables which are used in determining the value of the dependent variable are called independent variables.

A regression study that involves only two variables is called simple regression, and one that involves more than two variables is called multiple regression. If the relationship between the two variables can be described by a straight line, then the regression is known as linear regression; otherwise it is called non-linear.

The regression analysis involving only two variables and having a linear relationship is called simple linear regression. This linear relationship between the two variables is represented by a straight line.


A regression line is a line that gives the best estimate of one variable for any given value of another variable. The regression line which is used to estimate the values of Y for any given value of X is called regression line of Y on X.

Model: Yi = β0 + β1Xi + εi; i = 1, 2, ···, n where

Yi is the i-th actual value of the dependent variable.

Xi is the i-th actual value of the independent variable.

β0 is the intercept.

β1 is the slope.

εi is the i-th value of the error term, with εi ∼ N(0, σ²).

This model is called the Population Regression Model. Its parameters are interpreted as follows:

• β0 is the value of the dependent variable when the value of the independent variable is zero.

• β1 is the change in the value of the dependent variable when the value of the independent variable increases by 1 unit. The sign of β1 is the same as that of the covariance and the correlation coefficient. That is, there is a direct linear relationship between the two variables if β1 is positive, there is an indirect linear relationship between the two variables if β1 is negative, and there is no linear relationship between the two variables if β1 is zero.

10.2.1 Method of Estimation

The objective in the above model is to estimate the regression parameters, β0 and β1, using sample data. The most common and widely used method of estimation is called Ordinary Least Squares (OLS), which minimizes the error sum of squares. The estimated regression model is, therefore,

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i; \quad i = 1, 2, \cdots, n$$

where

Ŷi is the i-th fitted/estimated value of the dependent variable.

Xi is the i-th actual value of the independent variable.

β̂0 is the estimated intercept.

β̂1 is the estimated slope.

The estimates of the parameters can be obtained as:

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{n\sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}$$

and

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

Example: Recall the previous example on heights of sons and heights of fathers.

a. Estimate the regression model of height of sons on height of fathers.

b. Interpret the estimated parameters.

c. What would be the predicted height of the son if the father's height is 70 inches?

Solutions: β̂1 = 0.625 and β̂0 = 26.25
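The OLS estimates for the father-son data, and the prediction asked for in part (c), can be computed from the same sums used for the correlation coefficient. This is a minimal sketch; the function name `predict` is hypothetical.

```python
x = [63, 65, 66, 67, 67, 68]   # heights of fathers (independent variable)
y = [66, 68, 65, 67, 69, 70]   # heights of sons (dependent variable)
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_xy = sum(a * b for a, b in zip(x, y))

# Slope and intercept from the computational OLS formulas.
beta1_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
beta0_hat = sum_y / n - beta1_hat * sum_x / n


def predict(father_height):
    """Fitted height of the son for a given height of the father."""
    return beta0_hat + beta1_hat * father_height


print(f"beta1_hat = {beta1_hat}, beta0_hat = {beta0_hat}")
print(f"Predicted son height for a 70 inch father: {predict(70)}")
```

The slope says that each additional inch in the father's height is associated with an estimated 0.625 inch increase in the son's height. The intercept corresponds to a father's height of zero, which lies far outside the observed range and so has no practical interpretation here.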

10.2.2 The Coefficient of Determination

So far, we have been concerned with estimating the parameters of the regression model and the correlation coefficient between two variables. We now consider the goodness of fit of the estimated model to a set of data; that is, we shall find out how "well" the estimated model fits the data.

The coefficient of determination tells how well the estimated model fits the data. For simple linear regression (the two-variable case), it is defined as the square of the sample correlation coefficient and is denoted by $r^2$. Hence $r^2$ measures the proportion or percentage of the variation in the dependent variable explained by the independent variable. Here, variation means the sum of the squared deviations of a variable from its mean value. In general, $r^2$ is a nonnegative quantity that lies between 0 and 1, i.e., $0 \le r^2 \le 1$. A value close to 1 indicates a good fit, and a value close to 0 indicates little or no linear relationship between the variables.
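The phrase "proportion of the variation explained" can be made concrete. The following is the standard decomposition for a least-squares line fitted with an intercept; the abbreviations SST, SSR and SSE are labels assumed here, not notation used elsewhere in these notes.

$$
\underbrace{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}_{\mathrm{SST}}
= \underbrace{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}_{\mathrm{SSR}}
+ \underbrace{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}_{\mathrm{SSE}},
\qquad
r^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} .
$$

SST is the total variation in the dependent variable, SSR is the part accounted for by the fitted line, and SSE is the part left in the residuals, so $r^2$ is the explained share of the total.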

If we consider the example on heights of sons and their fathers, we had $r = 0.597$, which implies $r^2 \approx 0.357$. This means that about 35.7% of the variation in the heights of sons is explained by the heights of the fathers.

Example: A study reported in a medical journal suggested that the peak heart rate an individual can reach during intensive exercise decreases with age. A cardiologist wants to do his own study. The next 9 patients were given a stress test on the treadmill at 6 miles per hour, and their ages and heart rates were recorded as follows:


Age          30   30   40   20   20   45   30   45   50
Heart Rate  190  180  180  200  195  170  185  175  165

a. Identify the dependent and independent variable.

b. Estimate the regression model.

c. Can we predict the peak heart rate of an 80-year-old man who is given a similar stress test? If so, what peak heart rate do you predict?

d. Calculate the coefficient of correlation and coefficient of determination, and interpret the results.

Solutions:

No.       X      Y     X^2      Y^2      XY
1        30    190     900    36100    5700
2        30    180     900    32400    5400
3        40    180    1600    32400    7200
4        20    200     400    40000    4000
5        20    195     400    38025    3900
6        45    170    2025    28900    7650
7        30    185     900    34225    5550
8        45    175    2025    30625    7875
9        50    165    2500    27225    8250
Total   310   1640   11650   299900   55525

That is, $\sum X = 310$, $\sum Y = 1640$, $\sum X^2 = 11650$, $\sum Y^2 = 299900$ and $\sum XY = 55525$.

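a. Dependent variable = heart rate; independent variable = age.

b. $\hat{Y}_i = 216.37 - 0.99X_i$; $i = 1, 2, \cdots, 9$

c. $\hat{Y} = 216.37 - 0.99 \times 80 = 137.17$. Note that age 80 lies outside the observed range of ages (20 to 50), so this prediction is an extrapolation and should be interpreted with caution.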

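d. $r = -0.95$ and $r^2 = 0.90$. There is a strong negative linear relationship between age and peak heart rate, and about 90% of the variation in peak heart rate is explained by age.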
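The calculations in this example can be reproduced with a short script. The sketch below applies the computational (sum-based) formulas for the slope, intercept and correlation coefficient to the nine observations in the table; the function name `simple_linear_regression` and the variable names are illustrative choices, not part of these notes.

```python
import math

# Data from the example above: age (X) and peak heart rate (Y) of 9 patients.
age = [30, 30, 40, 20, 20, 45, 30, 45, 50]
heart_rate = [190, 180, 180, 200, 195, 170, 185, 175, 165]


def simple_linear_regression(x, y):
    """Return (intercept, slope, r) using the computational sum formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(v * v for v in x)
    sum_yy = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    # Each term below is n times the corresponding sum of deviations;
    # the factor n cancels in the ratios.
    s_xy = n * sum_xy - sum_x * sum_y
    s_xx = n * sum_xx - sum_x ** 2
    s_yy = n * sum_yy - sum_y ** 2
    slope = s_xy / s_xx
    intercept = sum_y / n - slope * sum_x / n
    r = s_xy / math.sqrt(s_xx * s_yy)
    return intercept, slope, r


b0, b1, r = simple_linear_regression(age, heart_rate)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
print(f"r = {r:.2f}, r^2 = {r * r:.3f}")
print(f"prediction at age 80 = {b0 + b1 * 80:.2f}")
```

Because the script carries full precision through every step, its prediction at age 80 (about 137.06) and its $r^2$ (about 0.905) differ slightly from the values 137.17 and 0.90 stated above, which come from rounding the coefficients and $r$ to two decimals before using them.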

THE END!!!
