PRACTICAL MANUAL STATISTICAL METHODS

( UG COURSE)

Compiled by

DEPARTMENT OF MATHEMATICS AND STATISTICS, Jawaharlal Nehru Krishi Vishwa Vidyalaya, JABALPUR 482 004


Contents

1. Graphical Representation of Data (pages 1-8)
   1. Construction of discrete and continuous frequency distributions
   2. Construction of bar diagram, histogram, pie diagram, frequency curve and frequency polygon
2. Measures of Central Tendency (pages 9-21)
   1. Definition, formula and calculation of mean, median, mode, geometric mean and harmonic mean for grouped and ungrouped data
   2. Definition, formula and calculation of quartiles, deciles and percentiles for grouped and ungrouped data
3. Measures of Dispersion (pages 22-29)
   1. Definition, formula and calculation of absolute measures of dispersion: range, quartile deviation, mean deviation and standard deviation
   2. Definition, formula and calculation of relative measures of dispersion, CD and CV, for grouped and ungrouped data
4. Moments, Skewness and Kurtosis (pages 30-40)
   1. Definition and types of moments, skewness and kurtosis
   2. Formula and calculation of raw moments, moments about the origin, central moments and different types of coefficients of skewness and kurtosis
5. Correlation and Regression (pages 41-49)
   1. Definition and types of correlation and regression
   2. Calculation of correlation and regression coefficients along with their tests of significance
6. Test of Significance (pages 50-59)
   1. Definition of null and alternative hypothesis and different tests of significance
   2. Application of the t-test for a single mean, t-test for independent samples, paired t-test, F-test and Chi-square test
7. Analysis of Variance (One way and Two way classification) (pages 60-79)
   1. Definition and steps of analysis of one way and two way classification
   2. Analysis of CRD and RBD as examples of one way and two way ANOVA
8. Sampling Methods (pages 80-86)
   1. Definition of SRS, SRSWR and SRSWOR and difference between census and sampling
   2. Procedures of selecting a simple random sample


1. Graphical Representation of Data

Mujahida Sayyed, Asst. Professor (Maths & Stat.), College of Agriculture, JNKVV, Ganjbasoda 464221 (M.P.), India
Email id: [email protected]

Frequency Distribution: A tabular presentation of data in which the frequencies of the values of a variable are given along with their classes is called a frequency distribution. Two types of frequency distribution are available:
1. Discrete Frequency Distribution: a frequency distribution formed by the distinct values of a discrete variable, e.g. 1, 2, 5, etc.
2. Continuous Frequency Distribution: a frequency distribution formed by the class intervals of a continuous variable, e.g. 0-10, 10-20, 20-30, etc.
Process: For construction of a Discrete Frequency Distribution
Step I. Arrange the data in ascending order.
Step II. Make a blank table of three columns titled Variable, Tally Marks and Frequency.
Step III. Read off the observations one by one from the given data and record a tally mark against each observation. For each value, the fifth tally is recorded by striking through the first four marks from top left to bottom right; the sixth tally then starts again with a straight mark, and so on.
Step IV. In the end, count all the tally marks in each row and write their number in the Frequency column.
Step V. Write down the total frequency in the last row at the bottom.

Objective: Prepare a discrete frequency distribution from the following data.
Kinds of data: 5 5 2 6 1 5 2 9 5 4 3 4 11 7 2 5 12 6
Solution: First arrange the data in ascending order:
1 2 2 2 3 4 4 5 5 5 5 5 6 6 7 9 11 12
Prepare a table in the format described in the process above and count by the tally method to get the required discrete frequency distribution:

Variable (X)   Tally Marks   Frequency (f)
1              |             1
2              |||           3
3              |             1
4              ||            2
5              ||||          5
6              ||            2
7              |             1
9              |             1
11             |             1
12             |             1
Total                        18
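The tallying steps above can be sketched in a few lines of Python (an illustration outside the manual; `discrete_frequency_table` is my own helper name):

```python
from collections import Counter

def discrete_frequency_table(values):
    """Return (value, frequency) pairs sorted ascending -- the table built by tallying."""
    return sorted(Counter(values).items())

data = [5, 5, 2, 6, 1, 5, 2, 9, 5, 4, 3, 4, 11, 7, 2, 5, 12, 6]
table = discrete_frequency_table(data)
for value, freq in table:
    print(value, freq)
print("Total", sum(f for _, f in table))  # total frequency = 18 observations
```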


Continuous Frequency Distribution: A continuous frequency distribution is a frequency distribution obtained by dividing the entire range of the given observations on a continuous variable into groups and distributing the frequencies over these groups. It can be done by two methods:
1. Inclusive method of class intervals: both the lower and upper limits of a class interval are included in that class interval.
2. Exclusive method of class intervals: the upper limit of a class interval is equal to the lower limit of the next higher class interval, and an observation equal to that limit is counted in the higher class.
Process: For construction of a Continuous Frequency Distribution
Step I. Arrange the data in ascending order.
Step II. Find the range = maximum value − minimum value.
Step III. Decide the approximate number of classes K by Sturges' rule, K = 1 + 3.322 log10 N, where N is the total frequency; round up the answer to the next integer. Dividing the range by the number of classes gives the width of the class interval.
Step IV. Classify the data by the exclusive and/or inclusive method for the desired width of the class intervals.
Step V. Make a blank table of three columns titled Variable, Tally Marks and Frequency.
Step VI. Read off the observations one by one from the given data and record a tally mark against the appropriate class.
Step VII. In the end, count all the tally marks in each row and write their number in the Frequency column.
Step VIII. Write down the total frequency in the last row at the bottom. ********************************************************************************

Objective : Prepare a continuous grouped frequency distribution from the following data. Kinds of data: 20 students appear in an examination. The marks obtained out of 50 maximum marks are as follows: 5, 16, 17, 17, 20, 21, 22, 22, 22, 25, 25, 26, 26, 30, 31, 31, 34, 35, 42 and 48. Prepare a frequency distribution taking 10 as the width of the class-intervals .

Solution: Arrange the data in ascending order: 5 16 17 17 20 21 22 22 22 25 25 26 26 30 31 31 34 35 42 48. Here the minimum value is 5 and the maximum value is 48.

Since it is given that the desired class interval is 10, the frequency distribution by the Inclusive Method of Class intervals is:

Marks   Tally Marks   No. of students
1-10    |             1
11-20   ||||          4
21-30   |||| ||||     9
31-40   ||||          4
41-50   ||            2
Total                 20


Exclusive Method of Class intervals:

Marks   Tally Marks   No. of students
0-10    |             1
10-20   |||           3
20-30   |||| ||||     9
30-40   |||| |        5
40-50   ||            2
Total                 20

********************************************************************************
Conversion of an Inclusive series to an Exclusive series: To apply any statistical technique (mean, median, etc.), the inclusive classes should first be converted to exclusive classes. For this purpose we find the conversion factor

(lower limit of second class − upper limit of first class) / 2

and add this amount to the upper limit of each class and subtract it from the lower limit of the next higher class. In the present example the conversion factor = (11 − 10)/2 = 0.5, so we add 0.5 to 10 and subtract 0.5 from 11, finally getting the exclusive classes 0.5-10.5, 10.5-20.5, etc.
********************************************************************************
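The whole construction, Sturges' rule for the number of classes plus exclusive-method counting, can be sketched in Python (an illustration, not part of the manual; both function names are my own):

```python
import math

def sturges_classes(n):
    """Sturges' rule: K = 1 + 3.322 * log10(N), rounded up to the next integer."""
    return math.ceil(1 + 3.322 * math.log10(n))

def exclusive_frequency_table(values, lower, width, n_classes):
    """Count observations into exclusive classes [lower, lower+width), ...;
    a value equal to an upper limit falls in the next higher class."""
    freqs = [0] * n_classes
    for v in values:
        idx = min(int((v - lower) // width), n_classes - 1)
        freqs[idx] += 1
    return freqs

marks = [5, 16, 17, 17, 20, 21, 22, 22, 22, 25, 25,
         26, 26, 30, 31, 31, 34, 35, 42, 48]
print(sturges_classes(len(marks)))                 # suggested number of classes
print(exclusive_frequency_table(marks, 0, 10, 5))  # classes 0-10, 10-20, ..., 40-50
```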

Graphical Representation of data: Graphical representation is a way of analysing numerical data. It exhibits the relation between data, ideas, information and concepts in a diagram. It is easy to understand and is one of the most important learning strategies. It always depends on the type of information in a particular domain. There are different types of graphical representation; some of them are as follows:
• Bar Diagram – displays categories of data and compares them using solid bars to represent the quantities.
• Histogram – a graph that uses bars to represent the frequency of numerical data organised into intervals. Since all the intervals are equal and continuous, all the bars have the same width.
• Pie Diagram – shows the relationship of the parts to the whole. The circle represents 100%, and each category occupies its specific percentage, such as 15%, 56%, etc.
• Frequency Polygon – shows the frequencies of the class intervals by plotting them against the class mid-points and joining the points with straight line segments.
• Frequency Curve – a graph of a frequency distribution in which the line is smooth.

Merits of Using Graphs. Some of the merits of using graphs are as follows:
• A graph is easily understood by everyone without any prior knowledge.
• It saves time.
• It allows one to relate and compare the data for different time periods.
• It is used in statistics to determine the mean, median and mode for different data, as well as in interpolation and extrapolation of data.


1. Simple Bar Diagram: A bar graph is a diagram that uses bars to show comparisons between categories of data. The bars can be either horizontal or vertical; bar graphs with vertical bars are sometimes called vertical bar graphs. A bar graph has two axes: one axis describes the types of categories being compared, and the other carries the numerical values of the data. It does not matter which axis is which, but the choice determines the orientation of the bars: if the category descriptions are on the horizontal axis, the bars are oriented vertically, and if the values are along the horizontal axis, the bars are oriented horizontally.

Objective: Prepare a simple bar diagram for the given data.
Kinds of data: Aggregated figures for merchandise exports of India for eight years are as follows.

Years:                 1971  1972  1973  1974  1975  1976  1977  1978
Exports (million Rs.): 1962  2174  2419  3024  3852  4688  5555  5112

Solution: For a Simple Bar Diagram:
Step I: Draw the X and Y axes.
Step II: Take the years on the X axis.
Step III: Take a scale of 1000 on the Y axis to represent exports.
Step IV: Draw bars of equal width on the X axis.
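As a quick text-mode rendering of these steps (purely illustrative; the function name and the one-`#`-per-500-units scale are my own choices):

```python
exports = {1971: 1962, 1972: 2174, 1973: 2419, 1974: 3024,
           1975: 3852, 1976: 4688, 1977: 5555, 1978: 5112}

def text_bar_chart(data, unit=500):
    """Draw one '#' per `unit` of the value; categories on the left, bars to the right."""
    lines = []
    for key in sorted(data):
        bar = "#" * round(data[key] / unit)
        lines.append(f"{key} | {bar} {data[key]}")
    return "\n".join(lines)

print(text_bar_chart(exports))
```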

[Bar diagram: Exports (million Rs.) by year, 1971-1978; Y axis scaled 0-6000.]

Results: The above figure shows the bar diagram. ******************************************************************************** 2. Histogram: A histogram consists of contiguous (adjoining) boxes. It has both a horizontal axis and a vertical axis. The horizontal axis is more or less a number line, labelled with what the data represent. The vertical axis is labelled either frequency or relative frequency (or percent frequency or probability). The histogram (like the stemplot) can give the shape of the data, the center, and the spread of the data. The shape of the data refers to the shape of the distribution, whether normal,

approximately normal, or skewed in some direction, whereas the center is thought of as the middle of a data set, and the spread indicates how far the values are dispersed about the center. In a skewed distribution, the mean is pulled toward the tail of the distribution. In a histogram the area of each rectangle is proportional to the frequency of the corresponding range of the variable.

Objective: Construction of a histogram for the given data.
Kinds of data: The following data are the numbers of books bought by 50 part-time college students at a college:
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6
Eleven students buy one book, ten students buy two books, sixteen students buy three books, six students buy four books, five students buy five books and two students buy six books. Calculate the width of each bar (bin size / interval size).
Solution: Process:
Step I: The smallest data value is 1 and the largest is 6. To make sure each is included in an interval, we can use 0.5 as the smallest boundary and 6.5 as the largest by subtracting and adding 0.5 to these values. This gives a small range of 6 (6.5 − 0.5), so we take a small number of bins, say six. Six divided by six bins gives a bin size (interval size) of one.
Step II: Notice that one may choose different rational numbers to add to, or subtract from, the maximum and minimum values when calculating the bin size.
Step III: Draw a bar of the appropriate frequency over each interval.
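The bin-size arithmetic in Step I can be checked with a few lines of Python (a sketch; `bin_size` and its 0.5 padding default are my own):

```python
def bin_size(values, n_bins, pad=0.5):
    """Bin width = ((max + pad) - (min - pad)) / n_bins, as in the worked example."""
    lo = min(values) - pad
    hi = max(values) + pad
    return (hi - lo) / n_bins

books = [1] * 11 + [2] * 10 + [3] * 16 + [4] * 6 + [5] * 5 + [6] * 2
print(bin_size(books, 6))  # range 6.5 - 0.5 = 6, six bins -> width 1.0
```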

Result: The above histogram displays the number of books on the x-axis and the frequency on the y-axis: ********************************************************************************

3. Pie Diagram: Pie charts are simple diagrams for displaying categorical or discrete data. These charts are commonly used within industry to communicate simple ideas, for example market share. They are used to show the proportions of a whole, and are best used when there are only a handful of categories to display. A pie chart consists of a circle divided into segments, one segment for each category. The size of each segment is determined by the frequency of the category and measured by the angle of the segment. As the total number of degrees in a circle is 360, the angle given to a segment is 360° times the fraction of the data in the category, that is:

Angle = (Number in category / Total number in sample (n)) × 360°

We can also express the segments of a pie diagram in percentages.

Objective: Draw a pie chart to display the information.
Kinds of data: A family's weekly expenditure on its house mortgage, food and fuel is as follows:

Expense    Rupees
Mortgage   300
Food       225
Fuel       75

Solution: Process:
Step I: The total weekly expenditure = 300 + 225 + 75 = Rs. 600.
Step II: Percentage of weekly expenditure on:
Mortgage = (300/600) × 100% = 50%
Food = (225/600) × 100% = 37.5%
Fuel = (75/600) × 100% = 12.5%
Step III: To draw the pie chart, divide the circle into 100 percentage parts, then allocate the number of percentage parts required for each item.
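The angle and percentage computations generalise directly; a Python sketch (illustrative only; `pie_segments` is my own helper):

```python
def pie_segments(amounts):
    """For each category return (angle in degrees, percentage): angle = part/total * 360."""
    total = sum(amounts.values())
    return {k: (v / total * 360, v / total * 100) for k, v in amounts.items()}

expenses = {"Mortgage": 300, "Food": 225, "Fuel": 75}
for name, (angle, pct) in pie_segments(expenses).items():
    print(f"{name}: {angle:.0f} degrees, {pct:.1f}%")
```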

Result: Above figure shows the pie diagram of the given data. *******************************************************************************

4. Frequency Polygon: Frequency polygons are analogous to line graphs, and just as line graphs make continuous data visually easy to interpret, so too do frequency polygons. A frequency polygon can be obtained by plotting the mid-point of each class interval on the x-axis against its frequency on the y-axis and joining the points by straight lines.
Step I: Examine the data and decide on the number of intervals and the resulting interval size, for both the x-axis and y-axis.
Step II: The x-axis shows the lower and upper bounds of each interval containing the data values, whereas the y-axis represents the frequencies of the values.
Step III: Each data point represents the frequency of an interval; if an interval has three data values in it, the frequency polygon shows a 3 at the point plotted for that interval.
Step IV: After choosing the appropriate intervals, begin plotting the data points. After all the points are plotted, draw line segments to connect them.


Objective: Construction of a frequency polygon from the frequency table.
Kinds of data: Frequency distribution of Calculus final test scores.

Lower Bound   Upper Bound   Mid Value   Frequency
49.5          59.5          54.5        5
59.5          69.5          64.5        10
69.5          79.5          74.5        30
79.5          89.5          84.5        40
89.5          99.5          94.5        15

Solution:

Result: Above figure shows the frequency polygon diagram of the given data. *******************************************************************************

5. Frequency Curve: The frequency curve for a distribution can be obtained by drawing a smooth, free-hand curve through the mid-points of the upper sides of the rectangles forming the histogram.

[Frequency curve: frequencies (0-50) plotted against class mid-values 44.5 to 104.5.]

Result: Above figure shows the frequency curve diagram of the given data. ****************************************************************************


Exercise:

Q1. Define graphical representation. Also write the advantages of graphical representation of data.

Q2. The following data give the number of children involved in different activities. Draw a simple bar diagram.

Activities:      Dance  Music  Art  Cricket  Football
No. of Children: 30     40     25   20       53

Q3. The percentage of total income spent under various heads by a family is given below. Represent the data in the form of a bar graph.

Different Heads: Food  Health  Clothing  Education  House Rent  Miscellaneous
% of Total:      40%   10%     10%       15%        20%         5%

Q4. The following table shows the number of hours spent by a child on different activities on a working day. Represent the information on a pie chart.

Activity:     School  Sleep  Playing  Study  TV  Others
No. of Hours: 6       8      2        4      1   3

Q5. Make a frequency table and histogram of the following data: 3, 5, 8, 11, 13, 2, 19, 23, 22, 25, 3,10, 21,14, 9,12,17 ,22, 23, 14 *******************************************************************************


Measures of Central Tendency

Umesh Singh, Assistant Professor (Statistics), College of Agriculture, Tikamgarh, 472001, India
Email id: [email protected]

According to Professor Bowley, averages are "statistical constants which enable us to comprehend in a single effort the significance of the whole." They give us an idea about the concentration of the values in the central part of the distribution. Plainly speaking, an average of a statistical series is the value of the variable which is representative of the entire distribution. The following are the five measures of central tendency in common use:
(i) Arithmetic Mean (ii) Median (iii) Mode (iv) Geometric Mean (v) Harmonic Mean
Requisites for an ideal Measure of Central Tendency
The following are the characteristics to be satisfied by an ideal measure of central tendency:
(i) It should be rigidly defined.
(ii) It should be readily comprehensible and easy to calculate.
(iii) It should be based on all the observations.
(iv) It should be suitable for further mathematical treatment.
(v) It should be affected as little as possible by fluctuations of sampling.
(vi) It should not be affected much by extreme values.

1. Arithmetic Mean:

Arithmetic mean of a set of observations is their sum divided by the number of observations.

Arithmetic mean for ungrouped data: The arithmetic mean X̄ of n observations X1, X2, X3, ..., Xn is given by

X̄ = (X1 + X2 + X3 + ... + Xn)/n = (1/n) Σ Xi

Arithmetic mean for grouped data: In the case of a frequency distribution Xi | fi, i = 1, 2, ..., n, where fi is the frequency of the variable Xi,

X̄ = (f1X1 + f2X2 + f3X3 + ... + fnXn)/(f1 + f2 + f3 + ... + fn) = (1/N) Σ fiXi,  where N = Σ fi.

In the case of a grouped or continuous frequency distribution, Xi is taken as the mid-value of the corresponding class.

Remark. The Greek capital letter Σ (sigma) is used to indicate summation of elements in a set, a sample or a population. It is usually indexed to show how many elements are to be summed.
Properties of Arithmetic Mean
Property 1. The algebraic sum of the deviations of a set of values from their arithmetic mean is zero: if Xi | fi, i = 1, 2, ..., n is the frequency distribution, then Σ fi (Xi − X̄) = 0, X̄ being the mean of the distribution.

Property 2. The sum of the squares of the deviations of a set of values from their arithmetic mean is always minimum.
Property 3. Mean of a composite series: if X̄i (i = 1, 2, ..., k) are the means of k component series of sizes ni (i = 1, 2, ..., k) respectively, then the mean X̄ of the composite series obtained on combining the component series is given by the formula:

X̄ = (n1X̄1 + n2X̄2 + n3X̄3 + ... + nkX̄k)/(n1 + n2 + n3 + ... + nk)
********************************************************************************
Objective: Find the arithmetic mean of the following ungrouped data.
Kinds of data: Suppose the data are 10, 7, 11, 9, 9, 10, 7, 9, 12.
Solution: We know that

X̄ = (1/n) Σ Xi = (10 + 7 + 11 + 9 + 9 + 10 + 7 + 9 + 12)/9 = 84/9 = 9.33
*******************************************************************************
Objective: Find the arithmetic mean of the following discrete frequency distribution.
Kinds of data:

Xi: 2  9  16  35  32  89  95  65  55
fi: 8  2  5   7   6   8   9   6   2

Solution:

Xi:    2   9   16  35   32   89   95   65   55   Total
fi:    8   2   5   7    6    8    9    6    2    Σfi = 53
fiXi:  16  18  80  245  192  712  855  390  110  2618

X̄ = Σ fiXi / Σ fi = 2618/53 = 49.40
********************************************************************************
Objective: Find the arithmetic mean of the following continuous grouped frequency distribution.
Kinds of data:

Xi: 0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90
fi: 8     2      5      7      6      8      9      6      2

Solution:

Class interval   Xi (mid-point)   fi   fiXi
0-10             5                8    40
10-20            15               2    30
20-30            25               5    125
30-40            35               7    245
40-50            45               6    270
50-60            55               8    440
60-70            65               9    585
70-80            75               6    450
80-90            85               2    170
Total                             53   2355


X̄ = Σ fiXi / Σ fi = 2355/53 = 44.43

Objective: Find the arithmetic mean of the pooled data.
Kinds of data: The average of 5 numbers (first series) is 40 and the average of another 4 numbers (second series) is 50.
Solution: We know that the pooled mean = (n1X̄1 + n2X̄2)/(n1 + n2) = (5×40 + 4×50)/(5 + 4) = 400/9 = 44.44
********************************************************************************
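All three mean formulas used above (ungrouped, grouped with mid-values, pooled) can be sketched in Python (the helper names are my own; this is an illustration, not part of the manual):

```python
def mean_ungrouped(xs):
    """Sum of the observations divided by their number."""
    return sum(xs) / len(xs)

def mean_grouped(xs, fs):
    """Weighted mean sum(f*x)/sum(f); for class intervals, x is the mid-value."""
    return sum(f * x for x, f in zip(xs, fs)) / sum(fs)

def pooled_mean(means, sizes):
    """Mean of a composite series: sum(n_i * mean_i) / sum(n_i)."""
    return sum(n * m for m, n in zip(means, sizes)) / sum(sizes)

print(round(mean_ungrouped([10, 7, 11, 9, 9, 10, 7, 9, 12]), 2))                  # 9.33
print(round(mean_grouped([5, 15, 25, 35, 45, 55, 65, 75, 85],
                         [8, 2, 5, 7, 6, 8, 9, 6, 2]), 2))                        # 44.43
print(round(pooled_mean([40, 50], [5, 4]), 2))                                    # 44.44
```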

2. Median: The median of a distribution is the value of the variable which divides it into two equal parts. It is the value which exceeds and is exceeded by the same number of observations, i.e., the value such that the number of observations above it equals the number of observations below it. The median is thus a positional average.

Median for ungrouped data: In the case of ungrouped data, if the number of observations is odd then the median is the middle value after the values have been arranged in ascending or descending order of magnitude:

Median = ((n + 1)/2)th term

In the case of an even number of observations, in fact any value lying between the two middle values can be taken as the median, but conventionally we take it to be the mean of the two middle terms:

Median = [(n/2)th term + ((n + 2)/2)th term] / 2

In the case of a discrete frequency distribution, the median is obtained by considering the cumulative frequencies. The steps for calculating the median are given below:
(i) Find N/2, where N = Σfi.
(ii) Find the (less than) cumulative frequency (c.f.) just greater than N/2.
(iii) The corresponding value of X is the median.

********************************************************************************
Objective: Find the median of ungrouped data when the number of observations is odd.
Kinds of data: The values are 5, 20, 15, 35, 18, 25, 40.
Solution:
Step 1: Arrange the values in ascending order of magnitude: 5, 15, 18, 20, 25, 35, 40.
Step 2: The number of observations is odd, i.e. 7, so
Median = ((7 + 1)/2)th term = 4th term = 20
********************************************************************************
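Both positional rules for ungrouped data (odd and even n) fit in one small Python function (a sketch with my own naming):

```python
def median_ungrouped(xs):
    """Middle value after sorting; mean of the two middle values when n is even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median_ungrouped([5, 20, 15, 35, 18, 25, 40]))  # odd n -> 20
print(median_ungrouped([8, 20, 50, 25, 15, 30]))      # even n -> 22.5
```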


Objective: Find the median of ungrouped data when the number of observations is even.
Kinds of data: The values are 8, 20, 50, 25, 15, 30.
Solution:
Step 1: Arrange the values in ascending order of magnitude: 8, 15, 20, 25, 30, 50.
Step 2: The number of observations is even, i.e. 6, so
Median = [(6/2)th term + ((6 + 2)/2)th term]/2 = (3rd term + 4th term)/2 = (20 + 25)/2 = 22.5
********************************************************************************
Median for grouped data: In the case of a continuous frequency distribution, the class corresponding to the c.f. just greater than N/2 is called the median class, and the value of the median is obtained by the following formula:

Median = l + ((N/2 − C)/f) × h

where l is the lower limit of the median class, f is the frequency of the median class, h is the magnitude (width) of the median class, C is the cumulative frequency preceding the median class, and N = Σfi.
********************************************************************************

Objective: Find the median of the following discrete frequency distribution.
Kinds of data:

Xi: 1  2   3   4   5   6   7   8  9   (Total 120)
fi: 8  10  11  16  20  25  15  9  6

Solution: Here N = Σfi = 120, so N/2 = 120/2 = 60.

Xi:    1  2   3   4   5   6   7    8    9
fi:    8  10  11  16  20  25  15   9    6
c.f.:  8  18  29  45  65  90  105  114  120

The cumulative frequency just greater than N/2 = 60 is 65, and the value of X corresponding to c.f. 65 is 5. Therefore the median is 5.
********************************************************************************

Objective: Find the median wage of the following continuous grouped frequency distribution.
Kinds of data:

Wages (in Rs.):    20-30  30-40  40-50  50-60  60-70  70-80  80-90
No. of labourers:  3      5      20     10     5      7      2


Solution:

Wages (in Rs.):    20-30  30-40  40-50  50-60  60-70  70-80  80-90
No. of labourers:  3      5      20     10     5      7      2
c.f.:              3      8      28     38     43     50     52

Here N/2 = 52/2 = 26. The cumulative frequency just greater than 26 is 28, and the corresponding class is 40-50. Thus the median class is 40-50. Now

Median = l + ((N/2 − C)/f) × h = 40 + ((26 − 8)/20) × 10 = 40 + 9 = 49

So the median wage is Rs. 49.
********************************************************************************

3. Mode
Mode is the value which occurs most frequently in a set of observations and around which the other items of the set cluster densely; in other words, the mode is the value of the variable which is predominant in the series. Thus in the case of a discrete frequency distribution the mode is the value of X corresponding to the maximum frequency. For example, the mode of {4, 2, 4, 3, 2, 2, 1, 2} is 2 because it occurs four times, which is more than any other number. Now look at the following discrete series:

Variable:  10  20  30  40  50  55  60  89  94
Frequency: 2   3   12  30  25  11  9   7   3

Here, as you can see, the maximum frequency is 30, so the value of the mode is 40. In this case, as there is a unique value of the mode, the data are unimodal. But the mode is not necessarily unique, unlike the arithmetic mean and median. You can have data with two modes (bi-modal) or more than two modes (multi-modal). There may also be no mode, if no value appears more frequently than every other value in the distribution; for example, the series 1, 1, 2, 2, 3, 3, 4, 4 has no mode.

But in any one (or more) of the following cases:
(i) if the maximum frequency is repeated,
(ii) if the maximum frequency occurs at the very beginning or at the end of the distribution, or
(iii) if there are irregularities in the distribution,
the value of the mode is determined by the method of grouping. This is illustrated below by an example.

Objective: Find the mode of the following frequency distribution: Kinds of data:

Size (X):      1  2  3   4   5   6   7   8   9   10  11  12
Frequency (f): 3  8  15  23  35  40  32  28  20  45  14  6

Solution: Here we see that the distribution is not regular, since the frequencies increase steadily up to 40 and then decrease, but the frequency 45 after 20 does not seem consistent with the distribution. Hence we cannot say that, because the maximum frequency is 45, the mode is 10. We shall locate the mode by the method of grouping, as explained below:


The frequencies in column (i) are the original frequencies. Column (ii) is obtained by combining the frequencies two by two. If we leave the first frequency and combine the remaining frequencies two by two, we get column (iii). We proceed to combine the frequencies three by three to obtain column (iv). Combining the frequencies three by three after leaving the first frequency gives column (v), and after leaving the first two frequencies gives column (vi). To find the mode we form the following table:

Column Number   Maximum Frequency   Value(s) of X giving the maximum frequency
(i)             45                  10
(ii)            75                  5, 6
(iii)           72                  6, 7
(iv)            98                  4, 5, 6
(v)             107                 5, 6, 7
(vi)            100                 6, 7, 8

We find that the value 6 is repeated the maximum number of times; hence the value of the mode is 6, and not 10, which is an irregular item.

Mode for ungrouped data: In the case of discrete frequency distribution mode is the value of X corresponding to maximum frequency. ***************************************************************************************

Objective: Find the Mode of the following ungrouped data: Kinds of data: The values are 4, 2, 4, 3, 2, 2, 1, and 2. Solution: Here the mode is 2 because it occurs four times, which is more than any other number. **************************************************************************************

Mode for grouped data In case of continuous frequency distribution mode is given by the formula:

Mode = l + ((f1 − f0)/(2f1 − f0 − f2)) × h

where l is the lower limit of the modal class, h the magnitude (width) of the modal class, f1 the frequency of the modal class, and f0 and f2 the frequencies of the classes preceding and succeeding the modal class respectively. ***************************************************************************************


Objective: Find the mode of the following continuous grouped frequency distribution: Kinds of data:

Class interval: 0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80
Frequency (f):  5     8      7      12     28     20     10     10

Solution: Here maximum frequency is 28. Thus the class 40-50 is the modal class.

So, l=40, f1=28, f0=12, f2=20, h=10,

Mode = l + ((f1 − f0)/(2f1 − f0 − f2)) × h = 40 + ((28 − 12)/(56 − 12 − 20)) × 10 = 40 + 6.667 = 46.667
**************************************************************************************

4. Geometric Mean (GM)

The geometric mean of a set of n observations is the nth root of their product.

Geometric mean for ungrouped data: The geometric mean G of n observations xi, i = 1, 2, ..., n is

G = (x1 · x2 · x3 · ... · xn)^(1/n)

Taking logarithms on both sides,

log G = (1/n)(log x1 + log x2 + log x3 + ... + log xn) = (1/n) Σ log xi

So G = Antilog[(1/n) Σ log xi].

Geometric mean for grouped data: In the case of a frequency distribution Xi | fi (i = 1, 2, ..., n), the geometric mean G is given by

G = (x1^f1 · x2^f2 · x3^f3 · ... · xn^fn)^(1/N),  where N = Σfi.

Taking logarithms of both sides, we get

log G = (1/N)(f1 log x1 + f2 log x2 + f3 log x3 + ... + fn log xn) = (1/N) Σ fi log xi

Thus we see that the logarithm of G is the arithmetic mean of the logarithms of the given values, so

G = Antilog[(1/N) Σ fi log xi]

In the case of a grouped or continuous frequency distribution, xi is taken to be the value corresponding to the mid-point of the class interval.
***************************************************************************************

Objective: Calculate the geometric mean of the following data: 3, 13, 11, 15, 5, 4, 2.
Solution: In this example the number of observations is n = 7. By the definition of the geometric mean,
G = (3 × 13 × 11 × 15 × 5 × 4 × 2)^(1/7)
log G = (1/7)(log 3 + log 13 + log 11 + log 15 + log 5 + log 4 + log 2)
      = (1/7)(0.4771 + 1.1139 + 1.0413 + 1.1760 + 0.6989 + 0.6020 + 0.3010)
      = (1/7)(5.4106) = 0.7729
So G = Antilog(0.7729) = 5.928
***************************************************************************************

Objective: Calculate Geometric mean from the following continuous grouped frequency data: Kinds of data:

Class Interval: 0-10  10-20  20-30  30-40  Total
Frequency:      1     3      4      2      10

Solution: We know that in the case of grouped data, log G = (1/N) Σ fi log xi. The calculations are given below in the table.

Class Interval   Frequency (fi)   Mid Value Xi   log Xi   fi log Xi
0-10             1                5              0.699    0.699
10-20            3                15             1.176    3.528
20-30            4                25             1.398    5.592
30-40            2                35             1.544    3.088
Total            10                                       12.907

After substituting the values in the formula, we get log G = 12.907/10 = 1.2907.

Hence GM = Antilog(1.2907) = 19.53
**************************************************************************************
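Both geometric-mean cases (ungrouped and frequency-weighted) follow the same log-antilog route used above; a Python sketch with my own helper name:

```python
import math

def geometric_mean(xs, fs=None):
    """Antilog of the (frequency-weighted) arithmetic mean of log10(x)."""
    if fs is None:
        fs = [1] * len(xs)                       # ungrouped case: all weights 1
    log_sum = sum(f * math.log10(x) for x, f in zip(xs, fs))
    return 10 ** (log_sum / sum(fs))

print(round(geometric_mean([3, 13, 11, 15, 5, 4, 2]), 2))       # matches Antilog(0.7729)
print(round(geometric_mean([5, 15, 25, 35], [1, 3, 4, 2]), 2))  # mid-values with frequencies
```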

5. Harmonic Mean. Harmonic mean of a number of observations is the reciprocal of the arithmetic mean of the reciprocals of the given values.

Harmonic mean for ungrouped data: The harmonic mean H of n observations Xi, i = 1, 2, ..., n is given by

H = 1 / [(1/n) Σ (1/xi)]

Harmonic mean for grouped data: In the case of a frequency distribution Xi | fi, i = 1, 2, ..., n,


H = 1 / [(1/N) Σ (fi/xi)],  where N = Σfi.
***************************************************************************************

Objective: Find the harmonic mean for the following ungrouped data Kinds of data: Suppose the data are 10, 7, 11, 9, 9, 10, 7, 9, 12

Solution: Harmonic Mean H = n / Σ(1/xi)
H = 9 / (1/10 + 1/7 + 1/11 + 1/9 + 1/9 + 1/10 + 1/7 + 1/9 + 1/12) = 9 / 0.9933 = 9.06

******************************************************************************** Objective: Find the Harmonic Mean of the given class. The table given below represent the frequency-distribution of ages for Standard college students.

Ages (years)            19    20    21    22    23    24    25    26
Number of students       5     8     7    12    28    20    10    10

Solution:

Ages (Xi)               19    20    21    22    23    24    25    26
No. of students (fi)     5     8     7    12    28    20    10    10
1/Xi                 0.053 0.050 0.048 0.045 0.043 0.042 0.040 0.038
fi(1/Xi)             0.263 0.400 0.333 0.545 1.217 0.833 0.400 0.385

H = N / Σ(fi/Xi)
H = 100 / (0.263 + 0.400 + 0.333 + 0.545 + 1.217 + 0.833 + 0.400 + 0.385)
H = 100 / 4.377 = 22.85

***************************************************************************************
Objective: Computation of average speed using the harmonic mean.
Kinds of data: A cyclist pedals from his house to his college at a speed of 10 m.p.h. and back from the college to his house at 15 m.p.h. Find the average speed.
Solution: Let the distance from the house to the college be x miles. Going from house to college, the distance (x miles) is covered in x/10 hours, while coming back it is covered in x/15 hours. Thus a total distance of 2x miles is covered in (x/10 + x/15) hours. Hence

Average speed = Total distance travelled / Total time taken = 2x / (x/10 + x/15) = 2 / (1/10 + 1/15) = 12 m.p.h.

This is exactly the harmonic mean of the two speeds.
***************************************************************************************
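Both harmonic-mean formulas, and the average-speed example, can be verified with a small Python sketch (helper names are my own):

```python
def harmonic_mean(values):
    """Ungrouped form: H = n / sum(1/x)."""
    return len(values) / sum(1 / x for x in values)

def harmonic_mean_grouped(xs, freqs):
    """Grouped form: H = N / sum(f_i / x_i), N = sum(f_i)."""
    N = sum(freqs)
    return N / sum(f / x for x, f in zip(xs, freqs))

print(round(harmonic_mean([10, 7, 11, 9, 9, 10, 7, 9, 12]), 2))  # 9.06
# Ages example: about 22.85 when no intermediate rounding is done
print(round(harmonic_mean_grouped([19, 20, 21, 22, 23, 24, 25, 26],
                                  [5, 8, 7, 12, 28, 20, 10, 10]), 2))
# The cyclist's average speed is the harmonic mean of the two speeds
print(round(harmonic_mean([10, 15]), 2))                         # 12.0
```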

Partition Values: These are the values which divide the series into a number of equal parts.
1. Quartiles: The three points which divide the series into four equal parts are called quartiles. The first, second and third points are known as the first, second and third quartiles respectively. The first quartile, Q1, is the value which exceeds 25% of the observations and is exceeded by 75% of them. The second quartile, Q2, coincides with the median. The third quartile, Q3, is the point which has 75% of the observations before it and 25% after it.
2. Deciles: The nine points which divide the series into ten equal parts are called deciles.
3. Percentiles: The ninety-nine points which divide the series into a hundred equal parts are called percentiles. For example, D5, the fifth decile, has 50% of the observations before it, and P35, the thirty-fifth percentile, is the point which exceeds 35% of the observations.
The methods of computing the partition values are the same as those of locating the median, in the case of both grouped and ungrouped data.

Formula & Examples for ungrouped data set Arrange the data in ascending order, then

th  i.(n +1)  1. Quartiles : Qi =   value of the observatio n, where i =1,2,3  4  th  i.(n +1)  2. Deciles: Di =   value of the observatio n, where i =1,2,3....9  10  th  i.(n +1)  3. Percentiles : Pi =   value of the observatio n, where i =1,2,3....99  100 

*********************************************************************************************************************

Objective: Calculation of the first quartile, 3rd decile and 20th percentile from the given data.
Kinds of data: 3, 13, 11, 11, 5, 4, 2
Solution: Arranging the observations in ascending order, we get: 2, 3, 4, 5, 11, 11, 13

Here, n=7

For the first quartile, put i = 1:
Q1 = [1(7+1)/4]th value = (8/4)th value = 2nd value of the observation, which is 3.

For the 3rd decile, put i = 3:
D3 = [3(7+1)/10]th value = (24/10)th = (2.4)th value of the observation
D3 = 2nd observation + 0.4 × (3rd − 2nd)
D3 = 3 + 0.4 × (4 − 3)
D3 = 3.4

For the 20th percentile, put i = 20:
P20 = [20(7+1)/100]th value = (160/100)th = (1.6)th value of the observation
P20 = 1st observation + 0.6 × (2nd − 1st)
P20 = 2 + 0.6 × (3 − 2)
P20 = 2.6
********************************************************************************
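The (n+1) positioning rule used above can be written as one small Python helper (a sketch; the name `partition_value` is not from the manual):

```python
def partition_value(data, i, parts):
    """i-th partition value by the (n+1) rule: position = i*(n+1)/parts.
    Use parts = 4 for quartiles, 10 for deciles, 100 for percentiles."""
    xs = sorted(data)
    pos = i * (len(xs) + 1) / parts   # 1-based position in the ordered data
    k = int(pos)
    frac = pos - k
    if k < 1:                         # position before the first observation
        return xs[0]
    if k >= len(xs):                  # position beyond the last observation
        return xs[-1]
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

data = [3, 13, 11, 11, 5, 4, 2]
print(partition_value(data, 1, 4))     # Q1  -> 3.0
print(partition_value(data, 3, 10))    # D3  -> 3.4
print(partition_value(data, 20, 100))  # P20 -> 2.6
```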

Objective: Calculation of median, quartiles, 4th decile and 27th percentile. Kinds of data: Eight coins were tossed together and the number of heads resulting was noted. The operation was repeated 256 times and the frequencies (f) that were obtained for different values of x, the number of heads, are shown in the following table. x 0 1 2 3 4 5 6 7 8 f 1 9 26 59 72 52 29 7 1 Solution: x 0 1 2 3 4 5 6 7 8 f 1 9 26 59 72 52 29 7 1 cf 1 10 36 95 167 219 248 255 256

Median: Here N/2 = 256/2 = 128. The cumulative frequency (c.f.) just greater than 128 is 167. Thus, median = 4.
Q1: Here N/4 = 64. The c.f. just greater than 64 is 95. Hence Q1 = 3.
Q3: Here 3N/4 = 192. The c.f. just greater than 192 is 219. Thus Q3 = 5.
D4: Here 4N/10 = (4 × 256)/10 = 102.4. The c.f. just greater than 102.4 is 167. Hence D4 = 4.
P27: Here 27N/100 = (27 × 256)/100 = 69.12. The c.f. just greater than 69.12 is 95. Hence P27 = 3.
*******************************************************************************

Formula & Examples for grouped data set
The partition values may be determined from grouped data in the same way as the median. For calculating partition values from grouped data we first form the cumulative frequency column. The partition values for grouped data are calculated from the following formulae:
1. Quartiles

 i N   − C  4 , where i=1,2,3 Qi = l +  *h  f      2. Deciles  i N   − C  D = l +  10 *h , where i=1,2,3….9 i  f      3. Percentiles  i N   − C  P = l +  100 *h , where i=1,2,3….99 i  f      Where l is the lower limit of the class containing quartile, decile and percentile, f is the frequency of the class containing quartile, decile and percentile, N = ∑fi, h is the magnitude of the class containing quartile, decile and percentile, 'C' is cumulative frequency proceeding to the class containing quartile, decile and percentile. ********************************************************************************

Objective: Calculation of the 3rd quartile, 4th decile and 37th percentile from grouped data.

class 0-15 15-30 30-45 45-60 60-75 75-90 90-105 105-120 120-135 135-150 frequency 1 4 17 28 25 18 13 6 5 3

Solution: Class 0-15 15-30 30-45 45-60 60-75 75-90 90-105 105-120 120-135 135-150 frequency 1 4 17 28 25 18 13 6 5 3 Cf 1 5 22 50 75 93 106 112 117 120

3 N 3120 For third quartile = = 90. Cumulative frequency just greater than 90 is 93 and corresponding 4 4 class is 75-90. Thus Q3 class is 75-90. From table we see that l=75, h=15, c=75, f=18

 3 N   3120   − C   − 75  4 , so 4 Q3 = l +  *h Q3 = 75 +  *15 = 87.5  f   18          4 N 4120 For 4th decile = = 48. Cumulative frequency just greater than 48 is 50 and corresponding 10 10 class is 45-60. Thus Q3 class is 45-60. From table we see that l=45, h=15, c=22, f=28  4 N   4120   − C   − 22  D = l +  10 *h , so D = 45 +  10 *15 = 58.85 4  f  4  28          37 N 37 120 For 37th percentile = = 44.4. Cumulative frequency just greater than 44.4 is 50 and 100 100 corresponding class is 45-60. Thus P37 class is 45-60. From table we see that l=45, h=15, c=22, f=28

20

 37 N   37120   − C   − 22  P = l +  100 *h , so D = 45 +  100 *15 = 57 37  f  37  28          *************************************************************************************** Exercise:

Q1. Find the Arithmetic Mean, Median and Mode from the following distribution. classes 10-14 15-19 20-24 25-29 30-34 35-39 frequency 22 35 52 40 32 19 (Ans: A.M.=24.05, Median=23.63, Mode=22.43)

Q2. Find the Arithmetic, Geometric and Harmonic mean of the following frequency distribution. Marks 0-10 10-20 20-30 30-40 No. of students 5 8 3 4 (Ans: A.M.=18.00, GM=14.58, HM=11.31)

Q3. The average salary of male employees in a firm was Rs.5200 and that of females was Rs.4200. The mean salary of all the employees was Rs.5000. Find the percentage of male and female employees. (Ans: Male 80%, Female 20%)

Q4. The Median and Mode of the following wage distribution are known to be Rs. 33.50 and Rs. 34.00 respectively. Find the value of f3, f4 and f5. Wages 0-10 10-20 20-30 30-40 40-50 50-60 60-70 Frequency 4 16 f3 f4 f5 6 4 (Ans: f3= 60, f4=100, f5=40)

Q5. Find the arithmetic mean of the following frequency distribution: (Ans:21.66)

Xi 1 4 7 13 19 25 28 22 81 16 fi 7 46 19 51 89 89 28 19 33 93

Q6. The strength of 7 colleges in a city are 385; 1748; 1343; 1935; 786; 2874 and 2108. Find its median. (Ans:1748)

Q7. The mean mark of 100 students was given to be 40. It was found later that a mark 53 was read as 83. What is the corrected mean mark? (Ans: 39.70)

Q8. Calculate 3rd Quartile, 6th Deciles and 45th Percentiles from the following data:-

81,96,76,108,85,80,100,83,70,95,32,33 (Ans: Q3= 102.5, D6= 84.6, P45=83.4)

Q9. Calculate D7 and P85 for the following data: 79, 82, 36, 38, 51, 72, 68, 70, 64, 63 (Ans: D7 = 71.4, P85 = 81.45)

Q10. The following is the frequency distribution of overtime (per week) performed by various officers of a

certain software company. Determine the value of D5, Q1 and P45.

Overtime (in hours) 4-8 8-12 12-16 16-20 20-24 24-28 No. of officers 4 8 16 18 20 18

(Ans- D5= 19.11, Q1=15.3, P45=18.17)


3. Measures of Dispersion Surabhi Jain Assistant professor (Statistics), College of Agriculture , JNKVV, Jabalpur (M.P.) 482004,India Email id : [email protected]

Dispersion: The measures of central tendency give us a single value that represents the central part of the whole distribution, whereas dispersion gives us an idea of the scatter of the data. In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed. Dispersion helps us to study the variability of the items: it indicates the extent to which the values are dispersed about the central value of a particular distribution.
Measures of Dispersion: There are two types of measure. The first is the absolute measure, which expresses the dispersion in the same units as the data. The second is the relative measure of dispersion, which is expressed as a ratio or percentage and is unit-free. Dispersion also helps a researcher in comparing two or more series.

Characteristic of an ideal measure of Dispersion: To be an ideal measure, the measure of dispersion should satisfy the following characteristics. (1) It should be easy to calculate and easy to understand. (2) It should be rigidly defined. (3) It should be based upon all the observations. (4) It should be suitable for further mathematical treatment. (5) It should be affected as little as possible by fluctuations of sampling. In statistics, there are many techniques that are applied to measure dispersion.

The absolute measures of dispersion are (1) Range (2) Quartile Deviation (3) Mean Deviation (4)Standard Deviation

(1) Range: It is defined as the difference between the maximum and minimum values of the dataset.
For ungrouped data: Range = Maximum value − Minimum value
For grouped data: Range = upper limit of the last class interval − lower limit of the first class interval
Characteristics: (1) It is the simplest but crudest measure of dispersion. (2) It takes less time to compute. (3) It is based on only the two extreme observations, so it is subject to chance fluctuations and cannot tell us anything about the character of the distribution. (4) The range cannot be computed in the case of an open-end distribution, i.e., a distribution where the lower limit of the first group or the upper limit of the last group is not given. (5) It is not suitable for further mathematical treatment.

(2) Quartile Deviation or Semi-Interquartile Range: It is half the difference between the third and first quartiles. It is a better method when we are interested in knowing the range within which a certain proportion of the items falls.

Formula: Quartile Deviation (QD) = (Q3 − Q1)/2

Characteristics: (1) It is easy to calculate. (2) Since the quartile deviation makes use of only the central 50% of the data, it is not a fully reliable measure of dispersion, but it is better than the range. (3) The quartile deviation is not affected by the extreme items; it depends entirely on the central items. If these central values are irregular and abnormal, the result is bound to be affected. (4) This method of calculating dispersion can be applied in the case of open-end series, where the importance of extreme values is not considered.

(3) Mean Deviation: It is defined as the average of the absolute deviations of all the observations from their average A (A = mean, median or mode).
For ungrouped data: MD = Σ|Xi − A| / n, where A = mean, median or mode
For grouped data:   MD = Σ fi|Xi − A| / Σfi, where A = mean, median or mode
Characteristics: (1) It is based on all the observations, but the step of ignoring the signs of the deviations creates artificiality and makes it unsuitable for further mathematical treatment. (2) The mean deviation may be calculated by taking deviations from the mean, median or mode. (3) It is not much affected by extreme items. (4) It is easy to calculate and understand. (5) It is illogical and mathematically unsound to treat all negative signs as positive; because the method is not mathematically sound, the results obtained by it are not fully reliable. (6) This method is unsuitable for making comparisons either of the series or of the structure of the series.

(4) Standard Deviation (Best Measure): It is defined as the square root of the average of the squared deviations of all the observations from their mean. The concept of standard deviation, introduced by Karl Pearson, has practical significance because it is free from the defects present in the range, quartile deviation and average deviation.

For ungrouped data: SD = √[ Σ(Xi − X̄)² / n ] = √[ Σxi²/n − (Σxi/n)² ]

For grouped data: SD = √[ Σfi(Xi − X̄)² / Σfi ] = √[ Σfixi²/Σfi − (Σfixi/Σfi)² ]

Characteristics: (1) It is the best measure of dispersion among all. (2) It is difficult to compute. (3) The step of squaring the deviations overcomes the drawback of Mean Deviation. (4) Standard deviation is the best measure of dispersion because it takes into account all the items and is capable of future algebraic treatment and statistical analysis. It is possible to calculate standard deviation for two or more series.(5) This measure is most suitable for making comparisons among two or more series about variability.(6) It assigns more weights to extreme items and less weight to items that are nearer to mean. It is because of this fact that the squares of the deviations which are large in size would be proportionately greater than the squares of those deviations which are comparatively small.
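A minimal Python sketch of the four absolute measures (illustrative helper names; the quartiles use the (n+1) positioning rule from the chapter on partition values), applied here to the ungrouped data of the worked example later in this chapter:

```python
import math

def dispersion_summary(data):
    """Return (range, quartile deviation, mean deviation about the mean, SD)."""
    xs = sorted(data)
    n = len(xs)

    def quartile(i):                      # (n+1) positioning rule
        pos = i * (n + 1) / 4
        k = int(pos)
        frac = pos - k
        return xs[k - 1] + frac * (xs[k] - xs[k - 1])

    rng = xs[-1] - xs[0]
    qd = (quartile(3) - quartile(1)) / 2
    mean = sum(xs) / n
    md = sum(abs(x - mean) for x in xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return rng, qd, md, sd

rng, qd, md, sd = dispersion_summary([10, 7, 5, 9, 9, 10, 7, 3, 12])
print(rng, qd, round(md, 2), round(sd, 2))   # 9 2.0 2.22 2.62
```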

Mathematical properties of standard deviation (σ):
(i) If every value is increased or decreased by a constant, the standard deviation remains the same. If every value is multiplied or divided by a constant, the standard deviation is multiplied or divided by that constant.
(ii) A combined standard deviation can be obtained for two or more series with the formula given below:

If n1 and n2 are the sizes, x̄1 and x̄2 the means, and σ1 and σ2 the standard deviations of the two series, then the standard deviation σ of the combined series of size n1 + n2 is given by


σ² = [ n1(σ1² + d1²) + n2(σ2² + d2²) ] / (n1 + n2), where d1 = x̄1 − x̄, d2 = x̄2 − x̄ and x̄ = (n1x̄1 + n2x̄2)/(n1 + n2) is the mean of the combined series.

(iii) Variance is independent of a change of origin: if we use di = xi − A, then σ² = σd². It is not independent of a change of scale: if we use di = (xi − A)/h, then σ² = h²σd².
********************************************************************************
Relative Measures for comparison of two series:
(1) Coefficient of Dispersion (CD): To compare the variability of two series the coefficient of dispersion is used. These are pure numbers, independent of the unit of measurement. The coefficients of dispersion based upon different measures of dispersion are as follows:
(1) Based on Range:              CD = (Maximum value − Minimum value) / (Maximum value + Minimum value)
(2) Based on Quartile Deviation: CD = (Q3 − Q1) / (Q3 + Q1)
(3) Based on Mean Deviation:     CD = Mean Deviation / Average from which it is calculated
(4) Based on Standard Deviation: CD = Standard Deviation / Mean
Characteristics: Used to compare the dispersion of two or more distributions. The selection of the appropriate measure depends upon the measures of central tendency and dispersion used.

(2) Coefficient of Variation (CV): 100 times the coefficient of dispersion based upon the standard deviation is called the coefficient of variation. (It is a unit-free measure.)

CV = (Standard Deviation / Mean) × 100

Characteristics: It is expressed as a percentage. A lower value of the coefficient of variation indicates greater consistency.
********************************************************************************
Objective: Computation of measures of dispersion by all methods for ungrouped data.
Kinds of Data: Suppose the data are 10, 7, 5, 9, 9, 10, 7, 3, 12
Solution:
(1) Range = max. value − min. value = 12 − 3 = 9

(2) Quartile Deviation: QD = (Q3 − Q1)/2
First arrange the observations in ascending order: 3, 5, 7, 7, 9, 9, 10, 10, 12
The formula for the quartiles is Qi = [i(n+1)/4]th observation, where i = 1, 2, 3 and n is the number of observations.
Q1 = [1(9+1)/4]th = (2.5)th observation
So Q1 = 2nd term + 0.5 × (3rd term − 2nd term) = 5 + 0.5 × (7 − 5) = 6
Similarly Q3 = [3(9+1)/4]th = (7.5)th observation
So Q3 = 10 + 0.5 × (10 − 10) = 10
Now QD = (10 − 6)/2 = 2

(3) Mean Deviation: MD = Σ|Xi − A| / n, where A = mean, median or mode.
Here we calculate the mean deviation about the mean.
Mean = (10 + 7 + 5 + 9 + 9 + 10 + 7 + 3 + 12)/9 = 8
Hence MD = (1/9)( |10−8| + |7−8| + |5−8| + |9−8| + |9−8| + |10−8| + |7−8| + |3−8| + |12−8| )
= (1/9)(2 + 1 + 3 + 1 + 1 + 2 + 1 + 5 + 4) = 20/9 = 2.22

(4) Standard Deviation:
Mean = 8
SD = √[ ((10−8)² + (7−8)² + (5−8)² + (9−8)² + (9−8)² + (10−8)² + (7−8)² + (3−8)² + (12−8)²) / 9 ]
= √[ (4 + 1 + 9 + 1 + 1 + 4 + 1 + 25 + 16) / 9 ] = √(62/9) = 2.62
********************************************************************************
Objective: Computation of measures of dispersion by all methods for grouped data.
Kinds of data: The age distribution of 542 members is given below

Age(in years) 20-30 30-40 40-50 50-60 60-70 70-80 80-90 Total

No. of members 3 61 132 153 140 51 2 542

Solution:
(1) Range = 90 − 20 = 70
(2) Quartile Deviation: First we find the first and third quartiles; the calculations are set out in the table below.

Age (in years)  No. of members (fi)  Cum. freq.  Mid value Xi   fiXi   (Xi − X̄)  fi|Xi − X̄|  (Xi − X̄)²  fi(Xi − X̄)²
20-30                  3                  3           25          75     −29.7        89.2       883.3       2649.8
30-40                 61                 64           35        2135     −19.7      1202.9       388.9      23721.6
40-50                132                196           45        5940      −9.7      1283.0        94.5      12471.1
50-60                153                349           55        8415       0.3        42.8         0.1         12.0
60-70                140                489           65        9100      10.3      1439.2       105.7      14795.0
70-80                 51                540           75        3825      20.3      1034.3       411.3      20975.2
80-90                  2                542           85         170      30.3        60.6       916.9       1833.8
Total                542                                       29660                5152.0                 76458.5

First we determine the first quartile class: N/4 = 542/4 = 135.5. This falls in the 40-50 cumulative-frequency class, so the first quartile is
Q1 = 40 + [(135.5 − 64)/132] × 10 = 40 + 715/132 = 40 + 5.42 = 45.42 years
Similarly 3N/4 = (3 × 542)/4 = 406.5, which falls in the 60-70 cumulative-frequency class, so the third quartile is

Q3 = 60 + [(406.5 − 349)/140] × 10 = 60 + 575/140 = 60 + 4.11 = 64.11 years
So the quartile deviation is (64.11 − 45.42)/2 = 18.69/2 = 9.345 years
(3) Mean Deviation: First calculate the mean: Mean = 29660/542 = 54.72 years
From the above table, Mean Deviation = 5152/542 = 9.51 years
(4) Standard Deviation = √(76458.5/542) = √141.07 = 11.88 years
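The mean, mean deviation and standard deviation of this grouped example can be recomputed from the mid-values with a short Python sketch (the helper name is my own). Because the code keeps the exact mean (54.7232...), the mean deviation comes out near 9.50 rather than the 9.51 obtained from the rounded deviations in the table:

```python
import math

def grouped_mean_md_sd(mids, freqs):
    """Mean, mean deviation (about the mean) and SD for grouped data,
    using class mid-values."""
    N = sum(freqs)
    mean = sum(f * x for x, f in zip(mids, freqs)) / N
    md = sum(f * abs(x - mean) for x, f in zip(mids, freqs)) / N
    sd = math.sqrt(sum(f * (x - mean) ** 2 for x, f in zip(mids, freqs)) / N)
    return mean, md, sd

mids = [25, 35, 45, 55, 65, 75, 85]
freqs = [3, 61, 132, 153, 140, 51, 2]
mean, md, sd = grouped_mean_md_sd(mids, freqs)
print(round(mean, 2), round(md, 2), round(sd, 2))   # 54.72 9.5 11.88
```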

********************************************************************************

Objective: Computation of the variability of two series by the coefficient of variation.
Kinds of data: Goals scored by two teams A and B in a football season were as follows:

No. of goals scored in a match   0   1   2   3   4
No. of matches, team A          27   9   8   5   4
No. of matches, team B          17   9   6   5   3

Solution: Here we calculate the CV of each team separately.

Goals (xi)     fA   fA·xi   (xi − x̄)   (xi − x̄)²   fA(xi − x̄)²    fB   fB·xi   (xi − ȳ)   (xi − ȳ)²   fB(xi − ȳ)²
0              27      0      −1.05       1.10         29.77       17      0      −1.2       1.44         24.48
1               9      9      −0.05       0.00          0.02        9      9      −0.2       0.04          0.36
2               8     16       0.95       0.90          7.22        6     12       0.8       0.64          3.84
3               5     15       1.95       3.80         19.01        5     15       1.8       3.24         16.20
4               4     16       2.95       8.70         34.81        3     12       2.8       7.84         23.52
Total          53     56                               90.83       40     48                              68.40

First we calculate the mean and standard deviation of the first (A) series:
X̄A = 56/53 = 1.05,  σA = √(90.83/53) = √1.714 = 1.31,  so CV_A = (σA/X̄A) × 100 = (1.31/1.05) × 100 = 124.76
Now we calculate the mean and standard deviation of the second (B) series:
X̄B = 48/40 = 1.2,  σB = √(68.4/40) = √1.71 = 1.30,  so CV_B = (σB/X̄B) × 100 = (1.30/1.2) × 100 = 108.33
Comparing the coefficients of variation of series A and B, series B, having the lower CV, is the more consistent.
********************************************************************************
Objective: Comparison of wage earners of two firms
Kinds of data: An analysis of monthly wages paid to workers in two firms A and B, belonging to the same industry, gives the following results:
                                          Firm A              Firm B

Number of wage earners                    586 (nA)            648 (nB)
Average monthly wage                      Rs. 52.50 (X̄A)      Rs. 47.50 (X̄B)
Variance of the distribution of wages     100 (σA²)           121 (σB²)


(a) Which firm, A or B, pays out the larger amount as monthly wages? (Ans: Firm B)
(b) In which firm, A or B, is there greater variability in individual wages? (Ans: Firm B)
(c) What are (i) the average monthly wage and (ii) the variance of the distribution of wages of all the workers in firms A and B taken together? (Ans: 49.87, 117.26)

Solution: (a) Here we have to find the total amount of monthly wages paid by firm A and by firm B. Since the number of workers (nA) and the average monthly wage (X̄A) are given, we obtain ΣXA from the formula X̄A = ΣXA/nA:
ΣXA = nA × X̄A = 586 × 52.50 = 30765
Similarly for firm B, ΣXB = nB × X̄B = 648 × 47.50 = 30780
Hence firm B pays out the larger amount as monthly wages.
(b) We know that variability is compared through the coefficient of variation. Here we calculate the CV for both firms, using CV_A = (σA/X̄A) × 100 and CV_B = (σB/X̄B) × 100.
Putting in the values, CV_A = (10/52.50) × 100 = 19.05
and CV_B = (11/47.50) × 100 = 23.16
Since CV_B > CV_A, there is greater variability in individual wages in firm B.

(c) (i) x̄ = (nA X̄A + nB X̄B)/(nA + nB) = (586 × 52.50 + 648 × 47.50)/(586 + 648) = (30765 + 30780)/1234 = 49.87

(ii) We know that the formula for the combined variance is
σ² = [ nA(σA² + dA²) + nB(σB² + dB²) ] / (nA + nB), where dA = X̄A − x̄ and dB = X̄B − x̄, and x̄ is the combined mean found in (i).
Here dA = 52.50 − 49.87 = 2.63 and dB = 47.50 − 49.87 = −2.37
Putting in the values,
σ² = [ 586(100 + (2.63)²) + 648(121 + (−2.37)²) ] / (586 + 648)
σ² = (62653.30 + 82047.75)/1234 = 117.26
The variance of the distribution of wages of all the workers in firms A and B taken together is 117.26.
********************************************************************************
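Both coefficient-of-variation comparisons in this chapter (the goals data and the two firms) can be checked in Python; the function names are illustrative. Computed without intermediate rounding, the teams' CVs come out near 123.9 and 109.0 (the manual, rounding the means and SDs first, reports 124.76 and 108.33), but the conclusion is the same:

```python
import math

def cv(mean, sd):
    """Coefficient of variation in percent: CV = (SD / mean) * 100."""
    return sd / mean * 100

# Firms A and B: the SDs are the square roots of the given variances
print(round(cv(52.50, math.sqrt(100)), 2))   # 19.05 -> firm A
print(round(cv(47.50, math.sqrt(121)), 2))   # 23.16 -> firm B is more variable

def cv_from_table(xs, freqs):
    """CV computed from a discrete frequency table."""
    N = sum(freqs)
    mean = sum(f * x for x, f in zip(xs, freqs)) / N
    var = sum(f * (x - mean) ** 2 for x, f in zip(xs, freqs)) / N
    return cv(mean, math.sqrt(var))

cv_a = cv_from_table([0, 1, 2, 3, 4], [27, 9, 8, 5, 4])  # team A
cv_b = cv_from_table([0, 1, 2, 3, 4], [17, 9, 6, 5, 3])  # team B
print(cv_b < cv_a)   # True: team B has the lower CV, so it is more consistent
```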

Objective: Standard deviation of a combined sample
Kinds of data: The first of two samples has 100 items with mean 15 and S.D. 3. The whole group has 250 items with mean 15.6 and S.D. √13.44. Find the S.D. of the second sample.
Solution: Here n1 = 100, x̄1 = 15, σ1 = 3, and for the whole group n = 250, x̄ = 15.6, σ = √13.44.
We know the formula for the combined standard deviation:
σ² = [ n1(σ1² + d1²) + n2(σ2² + d2²) ] / (n1 + n2), where d1 = x̄1 − x̄, d2 = x̄2 − x̄ and x̄ = (n1x̄1 + n2x̄2)/(n1 + n2) is the mean of the combined series.
First we find the size of the second sample: n2 = n − n1 = 250 − 100 = 150.
Since the mean of the first sample and the combined mean are given, the mean of the second sample follows from x̄ = (n1x̄1 + n2x̄2)/(n1 + n2):
15.6 = (100 × 15 + 150 × x̄2)/(100 + 150), which gives x̄2 = 16.

Now d1 = 15 − 15.6 = −0.6 and d2 = 16 − 15.6 = 0.4.
Putting all these values in the formula for the combined variance:
13.44 = [ 100(3² + (−0.6)²) + 150(σ2² + (0.4)²) ] / (100 + 150)
Solving, σ2² = 16, and hence σ2 = 4.
********************************************************************************
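The combined-variance formula, and this "solve for the missing SD" manipulation, can be sketched in Python (illustrative names):

```python
import math

def combined_variance(n1, m1, v1, n2, m2, v2):
    """sigma^2 = [n1(v1 + d1^2) + n2(v2 + d2^2)] / (n1 + n2), d_i = m_i - combined mean."""
    m = (n1 * m1 + n2 * m2) / (n1 + n2)
    d1, d2 = m1 - m, m2 - m
    return (n1 * (v1 + d1 ** 2) + n2 * (v2 + d2 ** 2)) / (n1 + n2)

# Forward use: the two firms of the previous example pool to variance ~117.26
print(round(combined_variance(586, 52.50, 100, 648, 47.50, 121), 2))

# Inverse use: recover the second sample's SD from this example's data
n1, m1, s1 = 100, 15, 3
n, m, v = 250, 15.6, 13.44
n2 = n - n1                                   # 150
m2 = (n * m - n1 * m1) / n2                   # 16.0
d1, d2 = m1 - m, m2 - m                       # -0.6 and 0.4
v2 = (v * n - n1 * (s1 ** 2 + d1 ** 2)) / n2 - d2 ** 2
print(n2, round(m2, 2), round(math.sqrt(v2), 2))   # 150 16.0 4.0
```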

Objective: Corrected mean and corrected standard deviation corresponding to the corrected figures: Kinds of data: for a group of 200 candidates, the mean and standard deviation of scores were found to be 40 and 15 respectively. Later on it was discovered that the scores 43 and 35 were misread as 34 and 53 respectively. Find the corrected mean and corrected standard deviation corresponding to the corrected figures.

Solution: Here it is given that n = 200, mean = 40 and SD = 15. The wrong scores are 34 and 53; the correct scores are 43 and 35.
(i) Corrected mean: To calculate the corrected mean we first find the total score from the formula X̄ = ΣX/n.
Putting in the values, ΣX = 200 × 40 = 8000
Corrected total score = total score − wrong scores + correct scores = 8000 − (34 + 53) + (43 + 35) = 7991
Hence corrected mean = corrected total score / no. of candidates = 7991/200 = 39.95
(ii) Corrected SD: We know that SD = √[ Σxi²/n − (Σxi/n)² ].
Since SD = 15 and mean = 40, we first obtain the sum of squares Σxi² from this formula:
Σxi² = n(σ² + x̄²) = 200 × (225 + 1600) = 365000
Corrected Σxi² = 365000 − (sum of squares of the wrong figures) + (sum of squares of the correct figures)
Corrected Σxi² = 365000 − (34² + 53²) + (43² + 35²) = 365000 − 3965 + 3074 = 364109

Now corrected SD = √[ corrected sum of squares / no. of candidates − (corrected mean)² ] = √(364109/200 − 39.95²) = √224.54 = 14.98

Hence the corrected mean = 39.95 and corrected standard deviation=14.98 ********************************************************************************
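The "misread scores" correction generalises to any set of wrong/right values; a Python sketch (the function name is my own). Keeping the exact corrected mean (39.955) gives a corrected SD of about 14.97; the manual's 14.98 comes from rounding the mean to 39.95 first:

```python
import math

def corrected_mean_sd(n, mean, sd, wrong, right):
    """Replace misread values: adjust sum(x) and sum(x^2), then recompute."""
    total = n * mean - sum(wrong) + sum(right)
    sum_sq = n * (sd ** 2 + mean ** 2) - sum(w ** 2 for w in wrong) + sum(r ** 2 for r in right)
    new_mean = total / n
    new_sd = math.sqrt(sum_sq / n - new_mean ** 2)
    return new_mean, new_sd

m, s = corrected_mean_sd(200, 40, 15, wrong=[34, 53], right=[43, 35])
print(round(m, 3), round(s, 2))   # 39.955 14.97
```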

Important Points on Dispersion:
1. Range, QD, MD and SD are absolute measures of dispersion.
2. CD and CV are relative measures of dispersion.
3. Range is the crudest measure of dispersion.
4. Standard deviation is the best measure of dispersion.
5. The coefficient of variation is a unit-free measure of dispersion, suggested by Karl Pearson.
6. A low standard deviation indicates that the data points tend to be close to the mean.


Exercise: Q1. Calculate the variance of the following series. (i) 5,5,5,5,5 (ii) 4,5,6. (Ans. (i) 0, (ii)0.67) Q2. Mean and Standard deviation of 10 figures are 50 and 10 respectively. What will be the mean and SD if (i) every figure is increased by 4 (ii) every figure is multiplied by 2 (iii) if the figures are multiplied by 2 and then diminished by 4? (Ans. (i)54,10 (ii) 100,20 (iii) 96,20).

Q3. Calculate mean deviation and standard deviation from following table: Classes 0-5 5-10 10-15 15-20 20-25 25-30 30-35 35-40 Frequency 2 5 7 13 21 16 8 3 (Ans: Mean Deviation = 6.23, Standard Deviation=8.05)

Q4. If the mean of 100 observations is 50 and the CV is 40%, calculate the standard deviation. (Ans: SD = 20)

Q5. The arithmetic mean and variance of a set of 10 figures are known to be 17 and 33 respectively. One of the 10 figures (i.e. 26) was found to be inaccurate and was weeded out. What are the resulting (a) arithmetic mean and (b) variance of the 9 remaining figures? (Ans: AM = 16, variance = 26.67)

Q6. The means of two samples of size 50 and 100 respectively are 54.1 and 50.3 and the standard deviations are 8 and 7. Obtain the mean and standard deviation of the sample of size 150 obtained by combining the two samples. (Ans: Combined mean=51.57, Combined S.D.= 7.5)

Q7. An analysis of monthly wages paid to workers in two firms A and B, belonging to the same industry, gives the following results: Firm A Firm B Number of wage earners 500 600 Average monthly wage Rs. 186.00 Rs. 175.00 Variance of the distribution of wages 81 100

(a) Which firm A or B pays out the larger amount as monthly wages? (Ans: Firm B) (b) In which firm A or B, there is greater variability in individual wages? (Ans: Firm B) (c) What are the measures of (i) average monthly wage and (ii) the variance of the distribution of wages of all the workers in the firms A and B taken together? (Ans: Combined monthly wage: Rs. 180, Combined variance = 121.36)


4. Moments, Skewness and Kurtosis R. S. Solanki

Assistant professor (Maths & Stat.) , College of Agriculture , Waraseoni, Balaghat (M.P.),India Email id : [email protected]

1. Moments: The word "moment" is very popular in the mechanical sciences, where the moment of a force measures its turning effect about a point. In statistics, moments are the arithmetic means of the first, second, third and so on, i.e. rth, powers of the deviations taken from either the mean or an arbitrary point of a distribution. In other words, moments are statistical measures that give certain characteristics of the distribution. In statistics, some moments are very important. Generally, in any frequency distribution, four moments are obtained, which are known as the first, second, third and fourth moments. These four moments describe the information about the mean, variance, skewness and kurtosis of a frequency distribution. Calculation of moments gives some features of a distribution which are of statistical importance.

Moments can be classified into raw and central moments. Raw moments are measured about any arbitrary point A (say). If A is taken to be zero, the raw moments are called moments about the origin. When A is taken to be the arithmetic mean, we get the central moments. The first raw moment about the origin is the mean, whereas the first central moment is zero. The second raw and central moments are the mean square deviation and the variance, respectively. The third and fourth moments are useful in measuring skewness and kurtosis.

Methods of Calculation 1. Moments about Arbitrary Point i.e. raw moments For Ungrouped Data

If x1, x2, ..., xN are N observations of a variable x, then their moments about an arbitrary point A are

Zero order moment:   μ′0 = (1/N) Σ (xi − A)⁰ = 1
First order moment:  μ′1 = (1/N) Σ (xi − A)
Second order moment: μ′2 = (1/N) Σ (xi − A)²
Third order moment:  μ′3 = (1/N) Σ (xi − A)³
Fourth order moment: μ′4 = (1/N) Σ (xi − A)⁴

In general the rth order moment about an arbitrary point A is given by

μ′r = (1/N) Σ (xi − A)^r ;  r = 0, 1, 2, ...

For Grouped Data

If x1, x2, ..., xk are k values (or mid-values in the case of class intervals) of a variable x with corresponding frequencies f1, f2, ..., fk, then the moments about an arbitrary point A are

Zero order moment:   μ′0 = (1/N) Σ fi(xi − A)⁰ = 1 ;  N = Σfi
First order moment:  μ′1 = (1/N) Σ fi(xi − A)
Second order moment: μ′2 = (1/N) Σ fi(xi − A)²
Third order moment:  μ′3 = (1/N) Σ fi(xi − A)³
Fourth order moment: μ′4 = (1/N) Σ fi(xi − A)⁴

In general the rth order moment about an arbitrary point A is given by

μ′r = (1/N) Σ fi(xi − A)^r ;  N = Σfi , r = 0, 1, 2, ...

2. Moments about origin: If in the raw moments A is taken to be zero, the raw moments are called moments about the origin and are denoted by mr. In general,
For ungrouped data: mr = Σ(Xi)^r / N, where N is the number of observations and r = 0, 1, 2, ...
For grouped data:   mr = Σ fi(Xi)^r / Σfi

3. Moments about the arithmetic mean, i.e. central moments: When we take deviations from the arithmetic mean and calculate the moments, these are known as moments about the arithmetic mean, or central moments.
For Ungrouped Data

If x1, x2, ..., xN are N observations of a variable x with arithmetic mean x̄ = (1/N) Σ xi, then their moments about the arithmetic mean are

Zero order moment:   μ0 = (1/N) Σ (xi − x̄)⁰ = 1
First order moment:  μ1 = (1/N) Σ (xi − x̄) = 0
Second order moment: μ2 = (1/N) Σ (xi − x̄)² = σ² (variance)
Third order moment:  μ3 = (1/N) Σ (xi − x̄)³
Fourth order moment: μ4 = (1/N) Σ (xi − x̄)⁴

In general the rth order moment about the arithmetic mean x̄ is given by

μr = (1/N) Σ (xi − x̄)^r ;  r = 0, 1, 2, ...

31

For Grouped Data

If x1, x2 ,..., xk are k values (or mid values in case of class intervals) of a variable x with their corresponding frequencies f1, f2 ,..., fk then moments about arithmetic mean 1 x =  fi xi ; N =  f are N i i

1 0 Zero order moment 0 =  fi (xi − x) =1 N i 1 1 First order moment 1 =  fi (xi − x) = 0 N i 2 1 2 Second order moment 2 =  fi (xi − x) =  (Variance) N i 1 3 Third order moment 3 =  fi (xi − x) N i 1 4 Fourth order moment 4 =  fi (xi − x) N i In general the r th order moment about arithmetic mean x is given by

1 r r =  fi (xi − x) ; N =  fi , r = 0, 1, 2,... N i i

Relationship between central moments and raw moments:

μr = μr' − rC1 μ1' μ(r−1)' + rC2 (μ1')² μ(r−2)' − ... + (−1)^r (μ1')^r

In particular,
μ2 = μ2' − (μ1')²
μ3 = μ3' − 3 μ1' μ2' + 2 (μ1')³
μ4 = μ4' − 4 μ1' μ3' + 6 (μ1')² μ2' − 3 (μ1')⁴
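These conversion formulas are easy to check numerically. Here is a small Python sketch (helper names are ours) that builds the central moments from raw moments taken about an arbitrary point A, using the general relation μr = Σj rCj (−μ1')^j μ'(r−j):

```python
from math import comb

def raw_moments(x, A, upto=4):
    # raw moments mu'_0 .. mu'_upto about the point A
    N = len(x)
    return [sum((xi - A) ** r for xi in x) / N for r in range(upto + 1)]

def central_from_raw(m):
    # mu_r = sum_j C(r, j) * (-mu'_1)^j * mu'_(r-j)
    m1 = m[1]
    return [sum(comb(r, j) * (-m1) ** j * m[r - j] for j in range(r + 1))
            for r in range(len(m))]

x = [126, 121, 124, 122, 125, 124, 123]
mu = central_from_raw(raw_moments(x, A=123))
print(mu[1], mu[2])   # mu_1 is 0; mu_2 is the variance, whichever A is chosen
```

Changing A changes the raw moments but not the central moments that come out of the conversion, which is the point of the identity.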

Important: (i) μ0 = μ0' = 1; (ii) the first central moment is always zero; (iii) μ2 = SD² = variance.
********************************************************************************
2. Skewness:

The skewness of a distribution is defined as the lack of symmetry. In a symmetrical distribution the mean, median and mode are equal to each other, and the ordinate at the mean divides the distribution into two equal parts such that one part is the mirror image of the other. If some observations of very high (low) magnitude are added to such a distribution, its right (left) tail gets elongated. These observations are also known as extreme observations. The presence of extreme observations on the right hand side of a distribution makes it positively skewed, and the three averages, viz. mean, median and mode, will no longer be equal; in fact we shall have Mean > Median > Mode when a distribution is positively skewed. On the other hand, the presence of extreme observations on the left hand side of a distribution makes it negatively skewed, and the relationship between mean, median and mode is: Mean < Median < Mode (see following figure).


Measures of Skewness

1. Karl Pearson's coefficient of skewness Sk, based on the mode, is given by

Sk = (Mean − Mode) / S.D.

The sign of Sk gives the direction and its magnitude gives the extent of skewness. If Sk > 0 the distribution is positively skewed, and if Sk < 0 it is negatively skewed.

Karl Pearson's coefficient of skewness is defined in terms of the median as

Sk = 3(Mean − Median) / S.D.

The range of Karl Pearson's coefficient of skewness is −3 ≤ Sk ≤ +3.

2. Bowley's coefficient of skewness (quartile coefficient of skewness)

Sb = [(Q3 − Q2) − (Q2 − Q1)] / [(Q3 − Q2) + (Q2 − Q1)] = (Q3 + Q1 − 2Q2) / (Q3 − Q1),

where Q1, Q2 and Q3 are the first, second and third quartiles respectively. The range of Bowley's coefficient of skewness is −1 ≤ Sb ≤ +1.

3. Coefficient of skewness based on moments: the coefficient of skewness based on moments is given by

Sk = √β1 (β2 + 3) / [2(5β2 − 6β1 − 9)] , where β1 = μ3²/μ2³ and β2 = μ4/μ2².
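For raw (ungrouped) data, the Pearson and Bowley measures can be computed with Python's standard `statistics` module. A sketch using the manual's daily-earnings data (the function names are ours):

```python
import statistics as st

def pearson_skew_mode(x):
    # Sk = (Mean - Mode) / S.D.   (population S.D., as used in this manual)
    return (st.mean(x) - st.mode(x)) / st.pstdev(x)

def pearson_skew_median(x):
    # Sk = 3(Mean - Median) / S.D.
    return 3 * (st.mean(x) - st.median(x)) / st.pstdev(x)

def bowley_skew(x):
    # Sb = (Q3 + Q1 - 2 Q2) / (Q3 - Q1);
    # st.quantiles defaults to the exclusive, (N+1)-style quartiles
    q1, q2, q3 = st.quantiles(x, n=4)
    return (q3 + q1 - 2 * q2) / (q3 - q1)

earnings = [126, 121, 124, 122, 125, 124, 123]
print(round(pearson_skew_median(earnings), 2))   # -0.81
print(round(pearson_skew_mode(earnings), 2))     # -0.27
print(round(bowley_skew(earnings), 2))           # -0.33
```

Other quartile conventions (e.g. `method="inclusive"`) give slightly different Bowley values for small samples.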

********************************************************************************

3. Kurtosis:

Kurtosis is another measure of the shape of a distribution. Whereas skewness measures the lack of symmetry of the frequency curve of a distribution, kurtosis is a measure of the relative peakedness of its frequency curve. Various frequency curves can be divided into three categories depending upon the shape of their peak. The three shapes are termed as Leptokurtic, Mesokurtic and Platykurtic as shown in following figure.


Measures of Kurtosis

Karl Pearson developed beta and gamma coefficients (or beta and gamma measures) of kurtosis based on the central moments, which are given below respectively:

β2 = μ4/μ2²  and  γ2 = β2 − 3

The value of β2 = 3 (γ2 = 0) for a mesokurtic (normal) curve. When β2 > 3 (γ2 > 0), the curve is more peaked than the mesokurtic curve and is termed leptokurtic. Similarly, when β2 < 3 (γ2 < 0), the curve is less peaked than the mesokurtic curve and is called platykurtic.

Objective: Moments, Measures of Skewness and Kurtosis (Ungrouped data). Kinds of data: The daily earnings (in rupees) of a sample of 7 agricultural workers are: 126, 121, 124, 122, 125, 124, 123. Compute the first four raw (about the point 123) and central moments, the coefficients of skewness and the coefficient of kurtosis. Solution: Moments about the arbitrary value A = 123, i.e. raw moments

Table: Calculation for raw moments.

Sr. No.   x     (x − 123)   (x − 123)²   (x − 123)³   (x − 123)⁴
1        126        3            9            27           81
2        121       −2            4            −8           16
3        124        1            1             1            1
4        122       −1            1            −1            1
5        125        2            4             8           16
6        124        1            1             1            1
7        123        0            0             0            0
Total    865        4           20            28          116

The first raw moment
μ1' = (1/N) Σi (xi − A) = (1/7) × 4 = 0.57
The second raw moment
μ2' = (1/N) Σi (xi − A)² = (1/7) × 20 = 2.86
The third raw moment
μ3' = (1/N) Σi (xi − A)³ = (1/7) × 28 = 4
The fourth raw moment
μ4' = (1/N) Σi (xi − A)⁴ = (1/7) × 116 = 16.57.

Moments about the Arithmetic Mean, i.e. central moments
The arithmetic mean of the daily earnings of the agricultural workers is
x̄ = (1/N) Σi xi = (1/7) × 865 = 123.57

Table: Calculation for central moments.

Sr. No.   x     (x − 123.57)   (x − 123.57)²   (x − 123.57)³   (x − 123.57)⁴
1        126        2.43            5.90           14.35           34.87
2        121       −2.57            6.60          −16.97           43.62
3        124        0.43            0.18            0.08            0.03
4        122       −1.57            2.46           −3.87            6.08
5        125        1.43            2.04            2.92            4.18
6        124        0.43            0.18            0.08            0.03
7        123       −0.57            0.32           −0.19            0.11
Total    865        0.00           17.71           −3.60           88.92

The first central moment
μ1 = (1/N) Σi (xi − x̄) = (1/7) × 0.00 = 0.00
The second central moment
μ2 = (1/N) Σi (xi − x̄)² = (1/7) × 17.71 = 2.53
The third central moment
μ3 = (1/N) Σi (xi − x̄)³ = (1/7) × (−3.60) = −0.51
The fourth central moment
μ4 = (1/N) Σi (xi − x̄)⁴ = (1/7) × 88.92 = 12.70.

Karl Pearson's coefficient of skewness

The median ( M d ) of daily earnings of agriculture workers:

Arrange the data in ascending order

121,122,123,124,124,125,126

Total number of observations N = 7 (odd)

Hence the median

Md = ((N + 1)/2)-th term = ((7 + 1)/2)-th term = 4th term = 124.

The mode (Mo) of the daily earnings: since the frequency of 124 is maximum (i.e. 2), Mo = 124.

Standard deviation (σ):
σ = √( Σi (xi − x̄)² / N ) = √(17.71/7) = 1.59.

Karl Pearson's coefficient of skewness based on the median:
Sk = 3(x̄ − Md)/σ = 3(123.57 − 124)/1.59 = −0.81

Karl Pearson's coefficient of skewness based on the mode:
Sk = (x̄ − Mo)/σ = (123.57 − 124)/1.59 = −0.27.

Bowley's coefficient of skewness
Arrange the data in ascending order: 121, 122, 123, 124, 124, 125, 126. Total number of observations N = 7.

Hence the first quartile
Q1 = ((N + 1)/4)-th term = ((7 + 1)/4)-th term = 2nd term = 122
Second quartile Q2 = Md = 124
Third quartile
Q3 = (3(N + 1)/4)-th term = (3(7 + 1)/4)-th term = 6th term = 125

Hence Bowley's coefficient of skewness
Sb = (Q3 + Q1 − 2Q2)/(Q3 − Q1) = (125 + 122 − 2 × 124)/(125 − 122) = −0.33.

Coefficients of kurtosis:
β2 = μ4/μ2² = 12.70/(2.53)² = 1.98
and
γ2 = β2 − 3 = 1.98 − 3 = −1.02.

Hence the curve is negatively skewed and platykurtic.
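The whole worked example can be reproduced in a few lines. A Python sketch computing the central moments and the moment-based kurtosis directly (tiny differences from the hand calculation, e.g. −0.52 vs −0.51 for μ3, come from the manual rounding the deviations to two decimals):

```python
def central_moments(x, upto=4):
    # central moments mu_0 .. mu_upto about the arithmetic mean
    n = len(x)
    m = sum(x) / n
    return [sum((xi - m) ** r for xi in x) / n for r in range(upto + 1)]

earnings = [126, 121, 124, 122, 125, 124, 123]
mu = central_moments(earnings)
beta2 = mu[4] / mu[2] ** 2        # kurtosis coefficient
gamma2 = beta2 - 3
print(round(mu[2], 2), round(mu[3], 2))   # 2.53 -0.52
print(round(beta2, 2), round(gamma2, 2))  # 1.98 -1.02
```

Since γ2 < 0 and μ3 < 0, the code agrees with the conclusion above: negatively skewed and platykurtic.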


Objective: Moments, Measures of Skewness and Kurtosis (Grouped data). Kinds of data: Compute first four raw (at A=11) and central moments and coefficients of skewness and kurtosis for the following data on milk yield:

Milk yield (kg) 4-6 6-8 8-10 10-12 12-14 14-16 16-18 No. of Cows 8 10 27 38 25 20 7

Solution: Moments about any arbitrary value (A=11) i.e. raw moments

Table: Calculation for raw moments.

Sr.  Milk yield (kg)  No. of Cows (f)  Mid Value (x)   f(x − A)   f(x − A)²   f(x − A)³   f(x − A)⁴
1        4-6                8                5             −48         288        −1728        10368
2        6-8               10                7             −40         160         −640         2560
3        8-10              27                9             −54         108         −216          432
4        10-12             38               11               0           0            0            0
5        12-14             25               13              50         100          200          400
6        14-16             20               15              80         320         1280         5120
7        16-18              7               17              42         252         1512         9072
Total               N = 135                                 30        1228          408        27952

The first raw moment
μ1' = (1/N) Σi fi (xi − A) = (1/135) × 30 = 0.22
The second raw moment
μ2' = (1/N) Σi fi (xi − A)² = (1/135) × 1228 = 9.10
The third raw moment
μ3' = (1/N) Σi fi (xi − A)³ = (1/135) × 408 = 3.02
The fourth raw moment
μ4' = (1/N) Σi fi (xi − A)⁴ = (1/135) × 27952 = 207.05.

Moments about the Arithmetic Mean, i.e. central moments: The arithmetic mean of the milk yield is
x̄ = (1/N) Σi fi xi = (1/135) × 1515 = 11.22

Table: Calculation for central moments.

Sr.  Milk yield (kg)  No. of Cows (f)  Mid Value (x)    fx     f(x − x̄)   f(x − x̄)²   f(x − x̄)³   f(x − x̄)⁴
1        4-6                8                5            40     −49.78      309.73     −1927.20     11991.46
2        6-8               10                7            70     −42.22      178.27      −752.70      3178.08
3        8-10              27                9           243     −60.00      133.33      −296.30       658.44
4        10-12             38               11           418      −8.44        1.88        −0.42         0.09
5        12-14             25               13           325      44.44       79.01       140.47       249.72
6        14-16             20               15           300      75.56      285.43      1078.30      4073.57
7        16-18              7               17           119      40.44      233.68      1350.15      7800.84
Total               N = 135                             1515       0.00     1221.33      −407.70     27952.20

The first central moment
μ1 = (1/N) Σi fi (xi − x̄) = (1/135) × 0.00 = 0.00
The second central moment
μ2 = (1/N) Σi fi (xi − x̄)² = (1/135) × 1221.33 = 9.05
The third central moment
μ3 = (1/N) Σi fi (xi − x̄)³ = (1/135) × (−407.70) = −3.02
The fourth central moment
μ4 = (1/N) Σi fi (xi − x̄)⁴ = (1/135) × 27952.20 = 207.05.

Karl Pearson's coefficient of skewness

Table: Calculation for median and mode.

Sr.  Milk yield (kg)  No. of Cows (f)  Mid Value (x)   cf
1        4-6                8                5            8
2        6-8               10                7           18
3        8-10              27                9           45
4        10-12             38               11           83
5        12-14             25               13          108
6        14-16             20               15          128
7        16-18              7               17          135
Total               N = 135

Median number = (N + 1)/2 = (135 + 1)/2 = 68, hence the median class is (10-12).
Maximum frequency = 38, hence the modal class is (10-12).

The median (Md) of the milk yield:
L1 = 10, i = 2, f = 38, N = 135, C = 45
Median = Md = L1 + (i/f)(N/2 − C) = 10 + (2/38)(135/2 − 45) = 11.18.

The mode (Mo) of the milk yield:
L1 = 10, f1 = 38, f0 = 27, f2 = 25, i = 2
Mode = Mo = L1 + [(f1 − f0)/(2f1 − f0 − f2)] × i = 10 + [(38 − 27)/(2 × 38 − 27 − 25)] × 2 = 10.92.

Standard deviation (σ):
σ = √( Σi fi (xi − x̄)² / N ) = √(1221.33/135) = 3.01.

Karl Pearson's coefficient of skewness based on the median:
Sk = 3(x̄ − Md)/σ = 3(11.22 − 11.18)/3.01 = 0.04.

Karl Pearson's coefficient of skewness based on the mode:
Sk = (x̄ − Mo)/σ = (11.22 − 10.92)/3.01 = 0.10.

Bowley's coefficient of skewness:

The first quartile Q1:
Q1 number = (N/4)-th term = (135/4)-th term = 33.75 ≈ 34th term; the 34th term lies in the class interval 8-10. Hence
L1 = 8, i = 2, f = 27, N = 135, C = 18
Q1 = L1 + (i/f)(N/4 − C) = 8 + (2/27)(33.75 − 18) = 9.17

Second quartile Q2 = Md = 11.18

Third quartile Q3:
Q3 number = (3N/4)-th term = (405/4)-th term = 101.25 ≈ 101st term; the 101st term lies in the class interval 12-14. Hence
L1 = 12, i = 2, f = 25, N = 135, C = 83
Q3 = L1 + (i/f)(3N/4 − C) = 12 + (2/25)(101.25 − 83) = 13.46

Hence Bowley's coefficient of skewness
Sb = (Q3 + Q1 − 2Q2)/(Q3 − Q1) = (13.46 + 9.17 − 2 × 11.18)/(13.46 − 9.17) = 0.06

Coefficients of kurtosis:
β2 = μ4/μ2² = 207.05/(9.05)² = 2.53
and
γ2 = β2 − 3 = 2.53 − 3 = −0.47.
Hence the curve is positively skewed and platykurtic.
********************************************************************************
Objective: Computation of the mean and variance when the moments about an arbitrary value are given.
Kinds of data: The first three moments of a distribution about the value 2 of a variable are 1, 16 and −40.

Solution: Here the arbitrary value is A = 2 and the given moments are μ1' = 1, μ2' = 16 and μ3' = −40.
We know that μ1' = Σ fi(Xi − 2)/Σ fi = 1, i.e. Σ fiXi/Σ fi − 2 = 1, which gives x̄ = Σ fiXi/Σ fi = 1 + 2 = 3.
Hence the mean is 3.
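In code the same recovery is one line per quantity, using the relations x̄ = A + μ1' and μ2 = μ2' − (μ1')² given earlier in this chapter:

```python
# Raw moments about the arbitrary value A = 2, as given in the problem
A, mu1_raw, mu2_raw = 2, 1, 16

mean = A + mu1_raw                 # x_bar = A + mu'_1
variance = mu2_raw - mu1_raw ** 2  # mu_2 = mu'_2 - (mu'_1)^2
print(mean, variance)              # 3 15
```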


We know that μ2 = μ2' − (μ1')²; putting in the values, we get
μ2 = 16 − 1 × 1 = 15.
Hence the variance is 15.
********************************************************************************
Exercise:

Q1. The marks obtained by 46 students in an examination are as follows:

Marks     0-5   5-10   10-15   15-20   20-25   25-30
Students   5     7      10      16      4       4
Calculate Karl Pearson's and Bowley's coefficients of skewness.
(Ans.: Karl Pearson's coefficient of skewness = −0.31 and Bowley's coefficient of skewness = −0.22)

Q2. Calculate Karl Pearson’s and Bowley’s coefficients of skewness for the following distribution:

Measurement 3.5 4.5 5.5 6.5 7.5 8.5 9.5

Frequency 3 7 22 60 85 32 8

(Ans.: Karl Pearson’s coefficient of skewness = -0.36 and Bowley’s coefficient of skewness = -1)

Q3. Compute the first four raw and central moments with coefficient of kurtosis for the following data:

Plant Height (cm) 30-35 35-40 40-45 45-50 50-55 55-60 60-65 65-70 No. of plants 5 14 16 25 14 12 8 6

(Ans.: raw moments about A = 52.5: μ1' = −3.65, μ2' = 98.25, μ3' = −766.25, μ4' = 20756.25; central moments: μ1 = 0, μ2 = 84.93, μ3 = 212.34, μ4 = 16890.14; γ2 = −0.66.)

********************************************************************************


5. Correlation and Regression Surabhi Jain Assistant Professor (Statistics) , College of Agriculture , JNKVV, Jabalpur (M.P.) 482004,India Email id : [email protected]

Correlation: Correlation is a measure of the linear relationship between two variables. It is a statistical technique that can show whether, and how strongly, pairs of variables are related. For example, height and weight are related; taller people tend to be heavier than shorter people. Correlation works for quantifiable data; it cannot be used for purely categorical data, such as gender, brands purchased, or favourite colour. It can be defined as a bivariate analysis that measures the strength of association between two variables and the direction of the relationship.

Karl Pearson correlation coefficient (or product moment correlation coefficient): The Pearson correlation r is the most widely used correlation to measure the degree of the relationship between linearly related variables. The correlation coefficient between X and Y is the same as that between Y and X, and is calculated by Karl Pearson's formula

r_xy = cov(x, y)/(σx σy) = Σ(xi − x̄)(yi − ȳ) / √[ Σ(xi − x̄)² Σ(yi − ȳ)² ] = [ n Σ xi yi − (Σ xi)(Σ yi) ] / [ √(n Σ xi² − (Σ xi)²) √(n Σ yi² − (Σ yi)²) ],

where n = number of observations, xi = value of the i-th observation of variable x, and yi = value of the i-th observation of variable y.

Assumptions: Normality: both variables should be normally distributed (normally distributed variables have a bell-shaped curve). Linearity: a straight-line relationship is assumed between each of the two variables. Homoscedasticity: the data are assumed to be equally distributed about the regression line; it basically means that the variability of the data along the line of best fit remains similar as you move along the line. Types of correlation:

Positive Correlation: If two variables deviate in the same direction, the correlation is said to be positive. The line corresponding to the scatter plot is an increasing line, sloping up from left to right. Examples: height and weight of a group of persons, income and expenditure, etc.

Negative Correlation: If two variables deviate in opposite directions, so that an increase in one results in a decrease in the other, the correlation is said to be negative. The line corresponding to the scatter plot is a decreasing line, sloping down from left to right. Example: price and demand of a commodity.

No correlation: occurs when there is no linear dependency between the variables.


Range of the correlation coefficient (r): The correlation coefficient lies between −1 and +1. It is a pure number, independent of the units of measurement.
Effect of change of origin and scale: The correlation coefficient is independent of change of origin (ui = xi − A) and of scale (ui = xi/h).
Correlation between independent variables: Two independent variables are uncorrelated, but two uncorrelated variables (r = 0) need not necessarily be independent.
Test of significance of the correlation coefficient (null hypothesis r = 0): To test the significance of the correlation coefficient, the t test statistic is used as follows:

t_cal = r_cal √(n − 2) / √(1 − r_cal²)  at (n − 2) d.f., where r_cal is the calculated value of the correlation coefficient.

To test the null hypothesis we compare the calculated value of t with the tabulated value of t at (n − 2) degrees of freedom. If t_cal > t_tab, the null hypothesis is rejected and we conclude that the correlation is significant. If t_cal ≤ t_tab, the null hypothesis is not rejected and we conclude that the correlation is not significant.

**************************************************************************

Objective: Computation of correlation coefficient and test of significance of correlation coefficient of the given data. Kinds of data: The marks obtained by 8 students in Mathematics and Statistics are given below: Student A B C D E F G H Mathematics 25 30 32 35 37 40 42 45 Statistics 8 10 15 17 20 22 24 25

Solution: Let the marks in Mathematics be X and the marks in Statistics be Y. We know that the formula for the correlation coefficient is

r = Σ(xi − x̄)(yi − ȳ) / √[ Σ(xi − x̄)² Σ(yi − ȳ)² ].

First we calculate the means of X and Y:
X̄ = ΣXi/n = 286/8 = 35.75 ≈ 36 and Ȳ = ΣYi/n = 141/8 = 17.63 ≈ 18.
The other calculations (with deviations taken from the rounded means 36 and 18) are presented in the table below.

Student  Mathematics (X)  Statistics (Y)  (Xi − X̄)  (Yi − Ȳ)  (Xi − X̄)(Yi − Ȳ)  (Xi − X̄)²  (Yi − Ȳ)²
A             25               8            −11        −10            110             121         100
B             30              10             −6         −8             48              36          64
C             32              15             −4         −3             12              16           9
D             35              17             −1         −1              1               1           1
E             37              20              1          2              2               1           4
F             40              22              4          4             16              16          16
G             42              24              6          6             36              36          36
H             45              25              9          7             63              81          49
Total        286             141             −2         −3            288             308         279

Putting these values in the formula, we get
r = 288/√(308 × 279) = 0.983

Test of significance of r = 0.983:
To test the significance of the correlation coefficient, the t test statistic is used:

t_cal = r_cal √(n − 2) / √(1 − r_cal²) at (n − 2) d.f.

Putting in the values,
t_cal = 0.983 × √(8 − 2) / √(1 − 0.983²) = 13.11
The table value of t at 6 degrees of freedom at the 5% level of significance is 2.447.
Conclusion: Since the calculated value of t (13.11) is greater than the tabulated value of t (2.447) at 6 degrees of freedom, the null hypothesis is rejected and the correlation coefficient is highly significant. This indicates that marks in mathematics are associated with marks in statistics.
********************************************************************************
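The worked correlation example above can be checked in a few lines. A Python sketch using the product-moment formula on the raw sums, so r is computed without first rounding the means (the t value therefore comes out 12.99 rather than 13.11, which results from rounding r to 0.983 before computing t):

```python
from math import sqrt

def pearson_r(x, y):
    # r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
    n = len(x)
    sx, sy = sum(x), sum(y)
    num = n * sum(a * b for a, b in zip(x, y)) - sx * sy
    den = sqrt((n * sum(a * a for a in x) - sx ** 2) *
               (n * sum(b * b for b in y) - sy ** 2))
    return num / den

maths = [25, 30, 32, 35, 37, 40, 42, 45]
stats = [8, 10, 15, 17, 20, 22, 24, 25]
r = pearson_r(maths, stats)
t = r * sqrt(len(maths) - 2) / sqrt(1 - r * r)   # t statistic with n-2 d.f.
print(round(r, 3), round(t, 2))                  # 0.983 12.99
```

Either way t far exceeds the tabulated 2.447 at 6 d.f., so the conclusion is unchanged.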

Objective: Corrected correlation coefficient corresponding to the corrected figures.
Kinds of data: For two variables X and Y with 50 observations each, the following data were observed:
X̄ = 10, σx = 3, Ȳ = 6, σy = 2 and r(x, y) = 0.3.
On subsequent verification it was found that one value of X (= 10) and one value of Y (= 6) were inaccurate and hence weeded out. With the remaining 49 pairs of values, how is the original value of r affected?
Solution: First we find the corrected mean of X.
We know that X̄ = ΣXi/n; here X̄ = 10 and n = 50, so ΣXi = n × X̄ = 50 × 10 = 500.
Since one value X = 10 was inaccurate, we remove it from X and get ΣXi = 500 − 10 = 490; the number of observations is now 49.
Hence the corrected mean is X̄ = 490/49 = 10.
We know that σx² = ΣXi²/n − X̄², i.e. ΣXi² = n(σx² + X̄²).
Here n = 50, σx = 3 and X̄ = 10, so ΣXi² = 50 × (9 + 100) = 5450.
Now we find the corrected ΣXi² by removing 10²: corrected ΣXi² = 5450 − 100 = 5350.
The corrected σx², using the corrected ΣXi², the corrected mean and n = 49, is
σx² = 5350/49 − 10² = 109.18 − 100 = 9.18.
Similarly we repeat the same procedure for the variable Y:
ΣYi = n × Ȳ = 50 × 6 = 300. Since one value Y = 6 was inaccurate, we remove it and get ΣYi = 300 − 6 = 294, with 49 observations.
Hence the corrected mean is Ȳ = 294/49 = 6.
σy² = ΣYi²/n − Ȳ², so ΣYi² = n(σy² + Ȳ²) = 50 × (2² + 6²) = 2000.
Corrected ΣYi² = 2000 − 36 = 1964, and the corrected σy² is
σy² = 1964/49 − 6² = 40.08 − 36 = 4.08.
Since r = cov(x, y)/(σx σy) = (ΣXY/n − X̄Ȳ)/(σx σy) = 0.3, we have ΣXY/n − X̄Ȳ = r σx σy.
Putting in the values, ΣXY/50 − 10 × 6 = 0.3 × 3 × 2, so ΣXY = 50 × (1.8 + 60) = 3090.
Next, the corrected ΣXY = 3090 − (wrong values) = 3090 − 10 × 6 = 3030.
Hence the corrected r = (3030/49 − 10 × 6)/√(9.18 × 4.08) = 1.84/6.12 = 0.3.
Hence we find that there is no change in the correlation coefficient.
*******************************************************************************
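The correction procedure above is mechanical enough to script. A Python sketch that rebuilds the sums from the reported summary figures, drops the weeded-out pair (10, 6), and recomputes r (variable names are ours):

```python
from math import sqrt

n, xbar, sx, ybar, sy, r_old = 50, 10, 3, 6, 2, 0.3

Sx  = n * xbar                             # sum X   = 500
Sy  = n * ybar                             # sum Y   = 300
Sxx = n * (sx ** 2 + xbar ** 2)            # sum X^2 = 5450
Syy = n * (sy ** 2 + ybar ** 2)            # sum Y^2 = 2000
Sxy = n * (r_old * sx * sy + xbar * ybar)  # sum XY  = 3090

# drop the inaccurate pair (10, 6); 49 pairs remain
n, Sx, Sy = n - 1, Sx - 10, Sy - 6
Sxx, Syy, Sxy = Sxx - 10 ** 2, Syy - 6 ** 2, Sxy - 10 * 6

vx  = Sxx / n - (Sx / n) ** 2              # corrected variance of X (= 9.18)
vy  = Syy / n - (Sy / n) ** 2              # corrected variance of Y (= 4.08)
cov = Sxy / n - (Sx / n) * (Sy / n)        # corrected covariance
r_new = cov / sqrt(vx * vy)
print(round(r_new, 2))                     # 0.3
```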

Exercise:
Q1. Define correlation coefficient. Also write the properties of the correlation coefficient.
Q2. Calculate the correlation coefficient between the variables X and Y from the following bivariate data.
X  71 68 70 67 70 71 70 73
Y  69 67 65 63 65 62 65 64
(Ans: r = 0)
Q3. Calculate the correlation coefficient between the variables X and Y from the following bivariate data.
X  1 3 4 5 7 8 10
Y  2 6 8 10 14 16 20
(Ans: r = 1)

Q4. Calculate the correlation coefficient between the heights of father and son from the following data: Height of father (inches) 65 66 67 68 69 70 71 Height of Son (inches) 67 68 66 69 72 72 69 Apply t test to test the significance and interpret the result. (Ans: r=+0.67, tcal=2.02)

Q5. A computer while calculating correlation coefficient between two variables X and Y from 25 pairs of observation obtained the following results: n=25, ∑ 푋=125, ∑ 푋2 = 650, ∑ 푌=100, ∑ 푌2 = 460, ∑ 푋푌=508. But on subsequent verification it was found that he had copied down two pairs as (6,14) and (8,6) while the correct values were (8,12) and (6,8). Obtain the correct value of correlation coefficient? (Ans: 0.67) *************************************************************************


Regression: The term regression was given by the British biometrician Sir Francis Galton. It is a mathematical measure of the average relationship between two or more variables. Regression is a technique used to model and analyse the relationships between variables, and often how they contribute together to produce a particular outcome. A linear regression refers to a regression model that is made up entirely of linear variables.
Lines of Regression: The line of regression is the line which gives the best estimate of the value of one variable for any specific value of the other variable. The line of regression is the line of best fit and is obtained by the principle of least squares. Both lines of regression pass through (intersect at) the point (X̄, Ȳ). In linear regression there are two lines of regression. One is Y on X (Y = a + bX), where X is the independent variable and Y is the dependent variable. By applying the principle of least squares, the regression line of Y on X is given by

(y − ȳ) = b_yx (x − x̄), where b_yx = cov(x, y)/σx² = r σy/σx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)².

The other is X on Y (X = a + bY), where Y is the independent variable and X is the dependent variable. By applying the principle of least squares, the regression line of X on Y is given by

(x − x̄) = b_xy (y − ȳ), where b_xy = cov(x, y)/σy² = r σx/σy = Σ(xi − x̄)(yi − ȳ) / Σ(yi − ȳ)².

Here b_yx and b_xy are the regression coefficients; each shows the change in the dependent variable for a unit change in the independent variable.

Angle between the two lines of regression:
The slopes of the two lines of regression are r σy/σx and σy/(r σx). If θ is the angle between the two lines of regression, then

tan θ = [(1 − r²)/r] × [σx σy/(σx² + σy²)].

Properties of the regression coefficients and relationship between the correlation and regression coefficients:
• Regression coefficients lie between −∞ and +∞.
• A regression coefficient is independent of change of origin but not of change of scale.
• The correlation coefficient is the geometric mean of the regression coefficients: r = ±√(b_yx × b_xy).
• If one of the regression coefficients is greater than unity, the other must be less than unity.
• The arithmetic mean of the regression coefficients is greater than (or equal to) the correlation coefficient r if r > 0: (1/2)(b_yx + b_xy) ≥ r.
• If the two variables are uncorrelated, the lines of regression are perpendicular to each other (if r = 0, θ = π/2).
• The two lines of regression coincide with each other if r = ±1 (then θ = 0 or π).
• The signs of the correlation coefficient and the regression coefficients are the same, because each of them depends on the sign of cov(x, y).
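The geometric-mean relationship r = ±√(b_yx × b_xy) is easy to verify numerically. A Python sketch (using the brother and sister stature data from the worked example later in this chapter; the function name is ours):

```python
from math import sqrt

def regression_coeffs(x, y):
    # b_yx = S_xy / S_xx  and  b_xy = S_xy / S_yy (sums of deviation products)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / syy

x = [71, 68, 66, 67, 70, 71, 70, 73, 72, 65, 66]
y = [69, 64, 65, 63, 65, 62, 65, 64, 66, 59, 62]
byx, bxy = regression_coeffs(x, y)
print(round(byx, 3), round(bxy, 3))        # 0.527 0.591
print(round(sqrt(byx * bxy), 3))           # 0.558 = |r|
```

Note both coefficients are positive and their product is below 1, as the properties above require.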


Test of significance of the regression coefficients (null hypothesis b_yx = 0, b_xy = 0): To test the significance of a regression coefficient, the t test statistic is used as follows:

t_cal = b_yx / S.E.(b_yx), where S.E.(b_yx) = √{ [Σ(y − ȳ)² − (Σ(x − x̄)(y − ȳ))²/Σ(x − x̄)²] / [(n − 2) Σ(x − x̄)²] }, based on (n − 2) d.f. (for y on x);

t_cal = b_xy / S.E.(b_xy), where S.E.(b_xy) = √{ [Σ(x − x̄)² − (Σ(x − x̄)(y − ȳ))²/Σ(y − ȳ)²] / [(n − 2) Σ(y − ȳ)²] }, based on (n − 2) d.f. (for x on y).

To test the null hypothesis we compare the calculated value of t with the tabulated value of t at (n − 2) degrees of freedom. If t_cal > t_tab, the null hypothesis is rejected and we conclude that the regression coefficient is significant; if t_cal ≤ t_tab, it is not rejected and the regression coefficient is not significant.
*******************************************************************************
Objective: To examine whether the given equations can be the two lines of regression.
Kinds of data: The equations Y = 5 + 2.8X and X = 3 − 0.5Y are stated to be the estimated regression lines of Y on X and of X on Y respectively.
Solution: From Y on X, Y = 5 + 2.8X, the regression coefficient is b_yx = 2.8.
Similarly, from X on Y, X = 3 − 0.5Y, the regression coefficient is b_xy = −0.5.
Since we know that the signs of the two regression coefficients must be the same, and here the signs of the two coefficients differ, this is not possible. Hence the equations are not the estimated regression equations of Y on X and X on Y respectively.

Objective: Determination of (i) the lines of regression of Y on X and X on Y, (ii) the means of X and Y, and (iii) the variance of Y when the variance of X is given.
Kinds of data: The two lines of regression are X + 2Y − 5 = 0 and 2X + 3Y − 8 = 0, and the variance of X is 12.
Solution: (i) If we take the line X + 2Y − 5 = 0 as the regression line of Y on X, it can be written as 2Y = −X + 5, i.e. Y = −0.5X + 2.5, so b_yx = −0.5.
Similarly, taking the line 2X + 3Y − 8 = 0 as the regression line of X on Y, it can be written as 2X = −3Y + 8, i.e. X = −1.5Y + 4, so b_xy = −1.5.
Here the signs of both regression coefficients are the same, and one coefficient is greater than unity while the other is smaller than unity. We can also verify r = −√(b_yx × b_xy) = −√(−0.5 × −1.5) = −√0.75 = −0.87 (negative, because both regression coefficients are negative), which lies between −1 and +1. So our identification of the lines of regression of Y on X and X on Y is correct.
(ii) Since both lines of regression pass through the point (X̄, Ȳ), we can write
X̄ + 2Ȳ − 5 = 0 .........(1)
2X̄ + 3Ȳ − 8 = 0 .........(2)
Multiplying equation (1) by 2 gives 2X̄ + 4Ȳ − 10 = 0 .........(3). Subtracting equation (2) from equation (3) gives Ȳ = 2, and substituting in equation (1) gives X̄ = 1.
Hence the means of X and Y are X̄ = 1 and Ȳ = 2.
(iii) Here σx² = 12 is given, and we have to find σy².
Since b_yx = r σy/σx, and r = −0.87, b_yx = −0.5 and σx = √12 = 3.46 are known, putting these values in gives
−0.5 = −0.87 × σy/3.46, i.e. σy = 1.99 ≈ 2.
Hence σy² ≈ 4.
*******************************************************************************

Objective: Construction of the lines of regression and estimation of the dependent variable when the means, standard deviations and correlation coefficient are given.
Kinds of data: The following results were obtained in the analysis of data on the yield of dry bark in ounces (Y) and age in years (X) of 200 cinchona plants:

                          X (age in years)   Y (yield)
Average                         9.2            16.5
Standard deviation              2.1             4.2
Correlation coefficient              +0.84

Estimate the yield of dry bark of a plant of age 8 years.

Solution: Here X̄ = 9.2, σX = 2.1, Ȳ = 16.5, σY = 4.2 and r = 0.84.

(i) Construction of the lines of regression: We know that the line of regression of Y on X is given by
(Y − Ȳ) = bYX (X − X̄), where bYX = r σY/σX = 0.84 × 4.2/2.1 = 1.68.
Putting in the values, (Y − 16.5) = 1.68 (X − 9.2), i.e. Y = 1.68X + 1.04.
Similarly the line of regression of X on Y is given by
(X − X̄) = bXY (Y − Ȳ), where bXY = r σX/σY = 0.84 × 2.1/4.2 = 0.42.
Putting in the values, (X − 9.2) = 0.42 (Y − 16.5), i.e. X = 0.42Y + 2.27.
(ii) Estimation of the yield (Y) of dry bark of a plant of age 8 years (X = 8): The line of regression of Y on X is Y = 1.68X + 1.04; putting X = 8 gives Y = 1.68 × 8 + 1.04 = 14.48.
Hence the yield of dry bark of a plant of age 8 years is 14.48 ounces.
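From summary statistics alone, the fit and the prediction are two lines of arithmetic. A Python sketch of the Y-on-X line for the cinchona data:

```python
xbar, sx = 9.2, 2.1     # age: mean and standard deviation
ybar, sy = 16.5, 4.2    # yield: mean and standard deviation
r = 0.84

byx = r * sy / sx                # 0.84 * 4.2 / 2.1 = 1.68
a = ybar - byx * xbar            # intercept of Y = a + byx * X
yhat = a + byx * 8               # predicted yield at age 8 years
print(round(byx, 2), round(yhat, 2))   # 1.68 14.48
```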

Objective: Computation of correlation coefficient and the equations of the line of regression of Y on X and X on Y and the estimation of the value of Y when the value of X is known and the value of X when the value of Y is known.

Kinds of data: The following table relate to the data of stature (inches) of brother and sister from Pearson and Lee’s sample of 1,401 families.

Family 1 2 3 4 5 6 7 8 9 10 11 number Brother,X 71 68 66 67 70 71 70 73 72 65 66 Sister,Y 69 64 65 63 65 62 65 64 66 59 62


Solution: First we calculate the means: X̄ = 759/11 = 69 and Ȳ = 704/11 = 64.

Family   Brother X   Sister Y   (Xi − X̄)   (Yi − Ȳ)   (Xi − X̄)²   (Yi − Ȳ)²   (Xi − X̄)(Yi − Ȳ)
1           71          69          2          5           4           25             10
2           68          64         −1          0           1            0              0
3           66          65         −3          1           9            1             −3
4           67          63         −2         −1           4            1              2
5           70          65          1          1           1            1              1
6           71          62          2         −2           4            4             −4
7           70          65          1          1           1            1              1
8           73          64          4          0          16            0              0
9           72          66          3          2           9            4              6
10          65          59         −4         −5          16           25             20
11          66          62         −3         −2           9            4              6
Total      759         704                                74           66             39

Then, using the formula for the correlation coefficient, we have
r_xy = Σ(xi − x̄)(yi − ȳ)/√[ Σ(xi − x̄)² Σ(yi − ȳ)² ] = 39/√(74 × 66) = 0.558.

Test of significance of the correlation coefficient:
t = r_cal √(n − 2)/√(1 − r_cal²) = 0.558 × √(11 − 2)/√(1 − 0.558²) = 2.018
The table value of t at 9 d.f. at the 5% level of significance is 2.26. Since the calculated t is less than the tabulated t, the null hypothesis is accepted: the correlation coefficient is not significant.

Calculation of the regression coefficients:
Using the formulas for the regression coefficients of Y on X and X on Y, we have
b_yx = Σ(xi − x̄)(yi − ȳ)/Σ(xi − x̄)² = 39/74 = 0.527,  b_xy = Σ(xi − x̄)(yi − ȳ)/Σ(yi − ȳ)² = 39/66 = 0.591.

Hence the equation of the regression line of Y on X is Y − 64 = 0.527(X − 69), and the equation of the regression line of X on Y is X − 69 = 0.591(Y − 64).

Estimation of Y when X is given: for X = 70, putting X = 70 in the line of regression of Y on X gives Y − 64 = 0.527 × (70 − 69), hence Y = 64 + 0.527 = 64.527.
Estimation of X when Y is given: for Y = 62, putting Y = 62 in the line of regression of X on Y gives X − 69 = 0.591 × (62 − 64), hence X = 69 − 1.182 = 67.82.

Test of significance of the regression coefficient of Y on X:
t_yx = b_yx / √{ [Σ(y − ȳ)² − (Σ(x − x̄)(y − ȳ))²/Σ(x − x̄)²] / [(n − 2) Σ(x − x̄)²] }
     = 0.527 / √[ (66 − 39²/74) / ((11 − 2) × 74) ] = 0.527/0.261 = 2.017

Test of significance of the regression coefficient of X on Y:
t_xy = b_xy / √{ [Σ(x − x̄)² − (Σ(x − x̄)(y − ȳ))²/Σ(y − ȳ)²] / [(n − 2) Σ(y − ȳ)²] }
     = 0.591 / √[ (74 − 39²/66) / ((11 − 2) × 66) ] = 0.591/0.293 = 2.02

Since the calculated values of t are less than the tabulated value, the regression coefficients are not significant.
*******************************************************************************
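The two t tests above follow the same standard-error formula. A Python sketch reproducing the Y-on-X test for the stature data (in simple linear regression this t also equals the t obtained earlier for r):

```python
from math import sqrt

x = [71, 68, 66, 67, 70, 71, 70, 73, 72, 65, 66]   # brother
y = [69, 64, 65, 63, 65, 62, 65, 64, 66, 59, 62]   # sister
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)                   # 74
syy = sum((b - my) ** 2 for b in y)                   # 66
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # 39

byx = sxy / sxx
se_byx = sqrt((syy - sxy ** 2 / sxx) / ((n - 2) * sxx))
t = byx / se_byx
print(round(byx, 3), round(t, 2))   # 0.527 2.02  (below the 5% table value 2.26)
```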

Exercise:

Q1. Define Regression Coefficient. Also write the properties of Regression coefficient.

Q2. The observations on X (marks in Economics) and Y (marks in Maths) for 10 students are given below:
X  59 65 45 52 60 62 70 55 45 49
Y  75 70 55 65 60 69 80 65 59 61
Compute the least squares regression equations of Y on X and X on Y. Also estimate the value of Y for X = 61.
(Ans: Y − 65.9 = 0.76(X − 56.2), X − 56.2 = 0.92(Y − 65.9), Y = 69.54 for X = 61)

Q3. The following data pertain to the marks in subjects A and B in a certain examination Subject A Subject B Mean marks 39.5 47.5 Standard Deviation of marks 10.8 16.8 Correlation coefficient +0.42 Find the two lines of regression and estimate the marks in B for candidates who secured 50 marks in A. (Ans. Y=0.65X+21.82, X=0.27Y+26.67, Y=54.34 for X=50)

Q4. From the observations of the age (X) and the mean blood pressure (Y), the following quantities were calculated: X̄ = 60, Ȳ = 141, Σx² = 1000, Σy² = 1936, Σxy = 1380, where x = X − X̄ and y = Y − Ȳ. Find the regression equation of Y on X and estimate the mean blood pressure for women of age 35 years.
(Ans: Y = 1.38X + 58.2, Y = 106.5 for X = 35)


6. Test of Significance
Mujahida Sayyed
Asst. Professor (Maths & Stat.), College of Agriculture, JNKVV, Ganjbasoda, 464221 (M.P.), India
Email id: [email protected]
Once sample data has been gathered through an experiment or survey, statistical inference allows analysts to assess some claim about the population from which the sample has been drawn. The methods of inference used to support or reject claims based on sample data are known as tests of significance.

Null Hypothesis: Every test of significance begins with a null hypothesis H0. H0 represents a theory that has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug.

Null Hypothesis H0: there is no difference between the two drugs on average.

Alternative Hypothesis: The alternative hypothesis, Ha, is a statement of what a statistical hypothesis test is set up to establish. For example, in a clinical trial of a new drug, the alternative hypothesis might be that the new drug has a different effect, on average, compared to that of the current drug.

Alternative Hypothesis Ha: the two drugs have different effects, on average.

The alternative hypothesis might also be that the new drug is better, on average, than the current drug. In this case Ha: the new drug is better than the current drug, on average.

The final conclusion once the test has been carried out is always given in terms of the null hypothesis. "reject H0 in favor of Ha" or "do not reject H0"; we never conclude "reject Ha", or even "accept Ha".

If we conclude "do not reject H0", this does not necessarily mean that the null hypothesis is true, it only suggests that there is not sufficient evidence against H0 in favor of Ha; rejecting the null hypothesis then, suggests that the alternative hypothesis may be true.

Hypotheses are always stated in terms of a population parameter, such as the mean µ. An alternative hypothesis may be one-sided or two-sided. A one-sided hypothesis claims that a parameter is either larger or smaller than the value given by the null hypothesis. A two-sided hypothesis claims that a parameter is simply not equal to the value given by the null hypothesis; the direction does not matter.

Hypotheses for a one-sided test for a population mean take the following form:
H0: µ = k, Ha: µ > k   or   H0: µ = k, Ha: µ < k.
Hypotheses for a two-sided test for a population mean take the following form:
H0: µ = k, Ha: µ ≠ k.


1. t TEST FOR SINGLE MEAN: A t-test is any statistical hypothesis test in which the test statistic follows a Student's t distribution if the null hypothesis is supported. It can be used to determine if two sets of data are significantly different from each other, and is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t distribution.
t = (x̄ − μ) / (s/√n)
where μ = population mean, x̄ = sample mean, s = sample standard deviation = √[Σ(xi − x̄)²/(n − 1)], and n = number of sample observations.

If tcal > ttab, the difference is significant and the null hypothesis is rejected at the 5% or 1% level of significance.

If tcal < ttab, the difference is non-significant and the null hypothesis is accepted at the 5% or 1% level of significance.
*******************************************************************************
Objective: Test the significance of the difference between a sample mean and the population mean.
Kinds of data: Based on field trials, a new variety of greengram is expected to give a yield of 12 quintals per hectare. The variety was tested on 10 randomly selected farmers' fields. The yields (quintal/hectare) recorded were 14.3, 12.6, 13.7, 10.9, 13.7, 12.0, 11.4, 12.0, 12.6 and 13.1. Do the results conform to the expectation?

Solution: Here the null and alternative hypotheses are
H0: The average yield of the new variety of greengram is 12 q/ha. vs
H1: The average yield of the new variety of greengram is not 12 q/ha.

We know that the t-test for a single mean is given by t = (x̄ − μ)/(s/√n).
It is given that the population mean μ = 12 and n = 10; the sample mean is x̄ = 126.3/10 = 12.63.
Next we calculate s = √[Σ(xi − x̄)²/(n − 1)]:

Yields (xi):  14.3   12.6   13.7   10.9   13.7   12.0   11.4   12.0   12.6   13.1  | Total 126.3
(xi − x̄):     1.67  −0.03   1.07  −1.73   1.07  −0.63  −1.23  −0.63  −0.03   0.47 |
(xi − x̄)²:    2.79   0.00   1.14   2.99   1.14   0.40   1.51   0.40   0.00   0.22 | Total 10.60

Standard deviation of the sample: s = √(10.60/9) = 1.085.
By putting the values in the t statistic we get
t = (x̄ − μ)/(s/√n) = (12.63 − 12)/(1.0853/√10) = 1.836, with d.f. = 10 − 1 = 9.
The table value of t at 9 d.f. and 5% level of significance is ttab = 2.262. Since tcal < ttab, the difference is not significant and we accept the null hypothesis.
Result: Since we accept the null hypothesis, the new variety of greengram gives an average yield of 12 quintals per hectare.
*******************************************************************************
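The single-mean calculation above can be sketched in a few lines of Python (standard library only; the function name `t_single_mean` is ours, not from any package), using the greengram yields from the example:

```python
import math

def t_single_mean(sample, mu):
    """One-sample t statistic: t = (x_bar - mu) / (s / sqrt(n)),
    with s the sample standard deviation (n - 1 divisor)."""
    n = len(sample)
    x_bar = sum(sample) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
    return (x_bar - mu) / (s / math.sqrt(n)), n - 1

yields = [14.3, 12.6, 13.7, 10.9, 13.7, 12.0, 11.4, 12.0, 12.6, 13.1]
t, df = t_single_mean(yields, mu=12)
# t is about 1.84 with 9 d.f.; below t(9, 5%) = 2.262, so H0 is retained
```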

2. t TEST FOR TWO SAMPLE MEANS: Comparison of two sample means x̄ and ȳ, assumed to have been obtained on the basis of random samples of sizes n1 and n2 from the same population, which is assumed to be normal.

The approximate test is given by (under H0: μ1 = μ2 against H1: μ1 ≠ μ2)
t = (x̄ − ȳ) / [s √(1/n1 + 1/n2)], where x̄ = ΣXi/n1 and ȳ = ΣYi/n2,
s² = [Σ(xi − x̄)² + Σ(yi − ȳ)²] / (n1 + n2 − 2) = (n1 s1² + n2 s2²)/(n1 + n2 − 2),
and t follows Student's t statistic with n1 + n2 − 2 d.f.

*******************************************************************************

Objective: To test the significance of the difference between two treatment means.
Kinds of data: Two kinds of manure were applied to 15 plots of one acre each, other conditions remaining the same. The yields (in quintals) are given below:
Manure I: 14 20 34 48 32 42 30 44
Manure II: 31 18 22 28 40 26 45
Examine the significance of the difference between the mean yields due to the application of the different kinds of manure.

Solution: Here the null and alternative hypotheses are
H0: There is no significant difference between the two mean yields due to the application of different kinds of manure. vs
H1: There is a significant difference between the two mean yields due to the application of different kinds of manure.

We use the t test for the difference of means: t = (x̄ − ȳ)/[s √(1/n1 + 1/n2)], where x̄ and ȳ are the sample means of samples I and II.
x̄ = ΣXi/n1 = 264/8 = 33 and ȳ = ΣYi/n2 = 210/7 = 30.
Next we calculate s² = [Σ(xi − x̄)² + Σ(yi − ȳ)²]/(n1 + n2 − 2):

Manure I   (x − x̄)   (x − x̄)²   |  Manure II   (y − ȳ)   (y − ȳ)²
   14        −19        361      |     31          +1          1
   20        −13        169      |     18         −12        144
   34         +1          1      |     22          −8         64
   48        +15        225      |     28          −2          4
   32         −1          1      |     40         +10        100
   42         +9         81      |     26          −4         16
   30         −3          9      |     45         +15        225
   44        +11        121      |
  264                   968      |    210                    554

By putting the values we get s² = (968 + 554)/(8 + 7 − 2) = 117.08, so s = 10.82.
By putting all the values in t = (x̄ − ȳ)/[s √(1/n1 + 1/n2)] we get
t = (33 − 30)/(10.82 × √(1/8 + 1/7)) = 0.54, with d.f. = n1 + n2 − 2 = 13.
The tabulated value of t for 13 d.f. at the 5% level of significance is 2.16. Since tcal < ttab, the difference is not significant and we accept the null hypothesis.
Result: Since tcal < ttab, we conclude that there is no significant difference between the two mean yields due to the application of the different kinds of manure.
*******************************************************************************
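The pooled two-sample computation can be sketched as follows (standard library only; the function name `t_two_sample` is ours), using the manure yields from the example:

```python
import math

def t_two_sample(x, y):
    """Pooled two-sample t statistic with n1 + n2 - 2 d.f."""
    n1, n2 = len(x), len(y)
    xb, yb = sum(x) / n1, sum(y) / n2
    ss = sum((v - xb) ** 2 for v in x) + sum((v - yb) ** 2 for v in y)
    s = math.sqrt(ss / (n1 + n2 - 2))          # pooled standard deviation
    return (xb - yb) / (s * math.sqrt(1 / n1 + 1 / n2))

manure1 = [14, 20, 34, 48, 32, 42, 30, 44]
manure2 = [31, 18, 22, 28, 40, 26, 45]
t = t_two_sample(manure1, manure2)
# t is about 0.54 with 13 d.f.; below t(13, 5%) = 2.16, so H0 is retained
```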

3. t TEST FOR PAIRED OBSERVATION: This test is used for testing whether two series of paired observations are generated from the same population on the basis of the difference in their sample means. The approximate test is given by

t = d̄/(s/√n), which follows Student's t-distribution with n − 1 d.f.
Here d̄ = Σdi/n and s² = Σ(di − d̄)²/(n − 1),
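The paired statistic can be sketched in code (standard library only; the function name `t_paired` is ours), here applied to the body-weight data of the worked example that follows:

```python
import math

def t_paired(x, y):
    """Paired t statistic on the differences d_i = x_i - y_i, n - 1 d.f."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    d_bar = sum(d) / n
    s = math.sqrt(sum((di - d_bar) ** 2 for di in d) / (n - 1))
    return d_bar / (s / math.sqrt(n))

treatment_a = [28, 32, 29, 36, 29, 34]
treatment_b = [25, 24, 27, 30, 30, 29]
t = t_paired(treatment_a, treatment_b)
# t is about 2.94 with 5 d.f.; above t(5, 5%) = 2.571, so H0 is rejected
```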


with di = xi − yi being the difference of the i-th observations in the two samples.
*******************************************************************************
Objective: To test the significance of the difference between two treatment means when observations are paired.
Kinds of data: Two treatments A and B were assigned randomly to two animals from each of six litters. The following increases in body weight (oz.) of the animals were observed at the end of the experiment:
Litter number:  1   2   3   4   5   6
Treatment A:   28  32  29  36  29  34
Treatment B:   25  24  27  30  30  29
Test the significance of the difference between treatments A and B.
Solution: Hypotheses:
H0: There is no significant difference between treatments A and B. vs
H1: There is a significant difference between treatments A and B.
Since the observations are paired, we use t = d̄/(s/√n) with n − 1 d.f., where d̄ = Σdi/n, s² = Σ(di − d̄)²/(n − 1) and di = xi − yi.

Litter number   A(xi)   B(yi)   di = xi − yi   di − d̄   (di − d̄)²
     1           28      25          3          −0.83       0.69
     2           32      24          8           4.17      17.36
     3           29      27          2          −1.83       3.36
     4           36      30          6           2.17       4.70
     5           29      30         −1          −4.83      23.36
     6           34      29          5           1.17       1.36
   Total                        Σdi = 23                   50.83

d̄ = Σdi/n = 23/6 = 3.83 and s² = Σ(di − d̄)²/(n − 1) = 50.83/(6 − 1) = 10.17, so s = 3.19.
By putting the values we get t = 3.833/(3.1885/√6) = 2.94, with degrees of freedom = 6 − 1 = 5.
The table value of t at the 5% level of significance and 5 degrees of freedom is 2.571. Since tcal > ttab, the difference is significant and the null hypothesis is rejected.
Result: Since the null hypothesis is rejected, there is a significant difference between treatments A and B.
*******************************************************************************
4. F TEST (VARIANCE RATIO TEST): The F distribution is applied in several tests of significance relating to the equality of two sampling variances drawn on the basis of independent samples from a normal population. The approximate test is
Variance Ratio (F) = Larger estimate of variance / Smaller estimate of variance

54

= s1²/s2², where s1² = Σ(xi − x̄)²/(n1 − 1) and s2² = Σ(yi − ȳ)²/(n2 − 1),
which follows the F distribution with (n1 − 1, n2 − 1) d.f.
******************************************************************************
Objective: To test the significance of the equality of two sample variances.
Kinds of data: Two random samples are chosen from two normal populations:
Sample I: 20 16 26 27 23 22 18 24 25 19
Sample II: 17 23 32 25 22 24 28 18 31 33 20 27
Obtain estimates of the variances of the populations and test whether the two populations have the same variance.
Solution: Here the null and alternative hypotheses are
H0: The two populations have the same variance. vs
H1: The two populations do not have the same variance.
We know that Variance Ratio (F) = Larger estimate of variance / Smaller estimate of variance, which follows the F distribution with (n1 − 1, n2 − 1) d.f.
Here s1² = Σ(xi − x̄)²/(n1 − 1) and s2² = Σ(yi − ȳ)²/(n2 − 1),
x̄ = ΣXi/n1 = 220/10 = 22 and ȳ = ΣYi/n2 = 300/12 = 25.

       Sample I                     Sample II
  xi   (xi − x̄)  (xi − x̄)²    yi   (yi − ȳ)  (yi − ȳ)²
  20     −2          4         17     −8         64
  16     −6         36         23     −2          4
  26      4         16         32      7         49
  27      5         25         25      0          0
  23      1          1         22     −3          9
  22      0          0         24     −1          1
  18     −4         16         28      3          9
  24      2          4         18     −7         49
  25      3          9         31      6         36
  19     −3          9         33      8         64
                               20     −5         25
                               27      2          4
 220                120       300                314

By putting the values we get s1² = 120/(10 − 1) = 13.33 and s2² = 314/(12 − 1) = 28.55.
Hence F = s2²/s1² = 28.55/13.33 = 2.14.
The tabulated value of F at the 5% level of significance and 9 and 11 d.f. is 2.89. Since Fcal < Ftab, it is not significant and the null hypothesis is accepted.
Result: Since Fcal < Ftab the null hypothesis is accepted and we conclude that the two populations have the same variance.

******************************************************************************
5. CHI-SQUARE TEST (χ² TEST):
χ² test for goodness of fit: Chi-square is a measure to evaluate the difference between observed frequencies and expected frequencies, and to examine whether the difference so obtained is due to a chance factor or due to sampling error. To test the goodness of fit, the chi-square test statistic is given by

χ² = Σ(Oi − Ei)²/Ei, at (n − 1) d.f., where Oi = observed frequency and Ei = expected frequency.

χ² test for a 2×2 contingency table: In a contingency table, if each attribute is divided into two classes, it is known as a 2×2 contingency table.

a b (a+b)

c d (c+d)

(a+c) (b+d) N=a+b+c+d

For such data, the statistical hypothesis under test is that the two attributes are independent of one another. For the 2×2 contingency table, the χ² test is given by
χ² = N(ad − bc)² / [(a+c)(b+d)(a+b)(c+d)], where N = a+b+c+d, with 1 d.f.
Alternatively, we can calculate the expected frequency of each cell and then apply the chi-square test of goodness of fit, e.g. E(a) = (a+b)(a+c)/N, E(b) = (a+b)(b+d)/N, and so on. To test the goodness of fit the chi-square test statistic is given by

χ² = Σ(Oi − Ei)²/Ei, at 1 d.f.,

where Oi = observed frequency and Ei = expected frequency.
If χ²cal > χ²tab at 1 d.f., we reject the null hypothesis.
*******************************************************************************
Yates' correction for continuity: F. Yates suggested a correction for continuity in the χ² value calculated for a (2 × 2) table, particularly when the cell frequencies are small (no cell frequency should be less than 5 in any case, though 10 is better, as stated earlier) and χ² is just on the significance level. The correction suggested by Yates is popularly known as Yates' correction. It involves a reduction of the deviation of observed from expected frequencies, which of course reduces the value of χ². The rule for correction is to adjust the observed frequency in each cell of a (2 × 2) table in such a way as to reduce the deviation of the observed from the expected frequency for that cell by 0.5, with this adjustment made in all the cells without disturbing the marginal totals. The formula for the value of χ² after applying Yates' correction can be stated thus:

χ² = N(|ad − bc| − N/2)² / [(a+b)(c+d)(a+c)(b+d)]


It may again be emphasised that Yates’ correction is made only in case of (2 × 2) table and that too when cell frequencies are small. ******************************************************************************* Objective: Testing whether the frequencies are equally distributed in a given dataset. Kinds of data: 200 digits were chosen at random from a set of tables. The frequencies of the digits were as follows. Digits 0 1 2 3 4 5 6 7 8 9 Frequency 22 21 16 20 23 15 18 21 19 25
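The Yates-corrected statistic can be sketched in code (the function name `chi2_yates` is ours; the illustrative counts are the shop-ownership data analysed later in this chapter):

```python
def chi2_yates(a, b, c, d):
    """Yates-corrected chi-square for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    num = n * (abs(a * d - b * c) - n / 2) ** 2      # continuity correction
    den = (a + b) * (c + d) * (a + c) * (b + d)      # product of marginal totals
    return num / den

chi2 = chi2_yates(17, 18, 3, 12)
# chi2 is about 2.48, below 3.841 (1 d.f., 5%), so H0 is retained
```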

Solution: We set up the null hypothesis H0: The digits are equally distributed in the given dataset.
Under the null hypothesis the expected frequency of each digit would be (sum of frequencies)/(number of digits) = 200/10 = 20.
Then the value of χ² = (22−20)²/20 + (21−20)²/20 + (16−20)²/20 + (20−20)²/20 + (23−20)²/20 + (15−20)²/20 + (18−20)²/20 + (21−20)²/20 + (19−20)²/20 + (25−20)²/20
= (4 + 1 + 16 + 0 + 9 + 25 + 4 + 1 + 1 + 25)/20 = 86/20 = 4.3.
The tabulated value of χ² at 9 d.f. and the 5% level of significance is 16.91. Since the calculated value of χ² is less than the tabulated value, the null hypothesis is accepted. Hence we conclude that the digits are equally distributed in the given dataset.
*****************************************************************************
Objective: Chi-square test for a 2×2 contingency table.
Kinds of data: The table given below shows the data obtained during an epidemic of cholera:
                 Attacked   Not attacked
Inoculated          31          469
Not inoculated     185         1315
Test the effectiveness of inoculation in preventing the attack of cholera.
Solution: Here the null and alternative hypotheses are
H0: Inoculation is not effective in preventing the attack of cholera, i.e. Oi = Ei, vs
H1: Inoculation is effective in preventing the attack of cholera, i.e. Oi ≠ Ei.
Here we use the χ² test: χ² = Σ(Oi − Ei)²/Ei, where Oi = observed frequency and Ei = expected frequency.
Observed frequencies:
                 Attacked   Not attacked   Total
Inoculated          31          469          500
Not inoculated     185         1315         1500

Total              216         1784         2000
Calculation of expected frequencies:
For attacked: E(31) = 500×216/2000 = 54, E(185) = 1500×216/2000 = 162.
For not attacked: E(469) = 500×1784/2000 = 446, E(1315) = 1500×1784/2000 = 1338.
Expected frequencies:
                 Attacked   Not attacked   Total
Inoculated          54          446          500
Not inoculated     162         1338         1500
Total              216         1784         2000
Next we calculate χ² = Σ(Oi − Ei)²/Ei:
Observed (Oi)   Expected (Ei)   (Oi − Ei)   (Oi − Ei)²   (Oi − Ei)²/Ei
     31              54            −23         529           9.796
    469             446             23         529           1.186
    185             162             23         529           3.265
   1315            1338            −23         529           0.395
  Total                                                     14.642
Here χ²cal = 14.642, degrees of freedom = (2 − 1)(2 − 1) = 1, and the table value for 1 degree of freedom at the 5% level of significance is 3.841.
Since χ²cal = 14.642 > χ²tab = 3.841, we reject the null hypothesis.
Result: Since χ²cal > χ²tab, we reject the null hypothesis; that is, inoculation is effective in preventing the attack of cholera.
*******************************************************************************
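The expected-frequency route can be sketched as below (the function name `chi2_2x2` is ours), using the inoculation counts from the example above:

```python
def chi2_2x2(a, b, c, d):
    """Chi-square for a 2x2 contingency table via expected frequencies."""
    n = a + b + c + d
    observed = [a, b, c, d]
    # E(cell) = row total * column total / grand total
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

chi2 = chi2_2x2(31, 469, 185, 1315)   # the inoculation data above
# chi2 is about 14.64, above 3.841 (1 d.f., 5%), so H0 is rejected
```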

Objective: Chi-square test for a 2×2 contingency table when a cell frequency is less than 5.
Kinds of data: The following information was obtained in a sample of 50 small general shops. Can it be said that there are relatively more women owners in villages than in towns?

Shops          In Towns   In Villages   Total
Run by men        17           18         35
Run by women       3           12         15
Total             20           30         50
Test your result at the 5% level of significance. (χ²tab for 1 d.f. is 3.841)
Solution: The null and alternative hypotheses are
H0: There are not relatively more women owners in villages than in towns. vs
H1: There are relatively more women owners in villages than in towns.
Since one cell frequency is less than 5, we apply the chi-square formula with Yates' correction as given below:
χ² = N(|ad − bc| − N/2)² / (C1 C2 R1 R2)

where the 2×2 contingency table is |a b; c d|, C1 = sum of the first column, R1 = sum of the first row,

C2 = sum of the second column, R2 = sum of the second row, and N = grand total.
By putting the values in the formula we get
χ² = 50 × (|17×12 − 18×3| − 50/2)² / (20×30×35×15) = 2.48.
The critical value of χ² for 1 d.f. and α = 0.05 is 3.841, i.e. χ²cal = 2.48 < χ²tab = 3.841, so we accept the null hypothesis.
Result: Since χ²cal < χ²tab we accept the null hypothesis. It may be concluded that there are not relatively more women owners in villages than in towns.
*******************************************************************************
Exercise:
Q1. Six boys are selected at random from a school and their marks in Mathematics are found to be 63, 63, 64, 66, 60, 68 out of 100. In the light of these marks, discuss the general observation that the mean marks in Mathematics in the school were 66. (Ans. tcal = −1.78)
Q2. The summary of the results of a yield trial on onion with two methods of propagation is given below. Determine whether the methods differ with regard to onion yield. The onion yield is given in kg/plot.

Method I:  n1 = 12, x̄1 = 25.25, sum of squares = 186.25
Method II: n2 = 12, x̄2 = 28.83, sum of squares = 737.67
(Ans. tcal = −1.35)
Q3. A certain stimulus administered to each of 12 patients resulted in the following changes in blood pressure:
di: 5 2 8 -1 3 0 -2 1 5 0 4
Can it be concluded that the stimulus will in general be accompanied by an increase in blood pressure? (Ans. Paired t test, tcal = 2.89)
Q4. The following table gives the number of units produced per day by two workers A and B for a number of days:
A: 40 30 38 41 38 35
B: 39 38 41 23 32 39 40 34
Should these results be accepted as evidence that B is the more stable worker? (Ans. s1² = 16, s2² = 31.44)
Q5. A certain type of surgical operation can be performed either with a local anesthetic or with a general anesthetic. Results are given below:
          Alive   Dead
Local      511     24
General    147     18
Use the χ² test for testing the difference in the mortality rates associated with the different types of anesthetic. (Ans. χ²cal = 9.22)
Q6. Twenty-two animals suffered from the same disease with the same severity. A serum was administered to 10 of the animals and the remaining were left uninoculated to serve as controls. The results were as follows:
              Recovered   Died   Total
Inoculated        7         3      10
Uninoculated      3         9      12
Total            10        12      22
Apply the χ² test to test the association between inoculation and control of the disease. Interpret the result. (Ans. χ²cal = 2.82)


7. Analysis of Variance (One way and Two way classification)
P. Mishra
Assistant Professor (Statistics), College of Agriculture, JNKVV, Powarkheda (M.P.) 461110, India
Email id: [email protected]

Analysis of Variance (ANOVA): The ANOVA is a powerful statistical tool for tests of significance. The test of significance based on the t-distribution is an adequate procedure only for testing the significance of the difference between two sample means. In a situation where we have two or more samples to consider at a time, an alternative procedure is needed for testing the hypothesis that all the samples have been drawn from the same population. For example, if three fertilizers are to be compared to find their efficacy, this could be done by a field experiment in which each fertilizer is applied to 10 plots and the 30 plots are later harvested, with the crop yield being calculated for each plot. We then have 3 groups of ten figures and wish to know whether there are any differences between these groups. The answer to this problem is provided by the technique of ANOVA.
Assumptions of ANOVA
The ANOVA test is carried out based on the following assumptions:
• The observations are normally distributed
• The observations are independent of each other
• The variances of the populations are equal

Treatments: The objects of comparison in an experiment are defined as treatments. (1) Suppose an agronomist wishes to know the effect of different spacings on the yield of a crop; each spacing will be called a treatment. (2) A teacher practices different teaching methods on different groups in his class to see which yields the best results. (3) A doctor treats a patient with a skin condition with different creams to see which is most effective.
Experimental unit: An experimental unit is the object to which a treatment is applied to record the observations. For example, if the treatments are different varieties, the objects to which the treatments are applied to make observations will be different plots of land; the plots will be called experimental units.
Blocks: In agricultural experiments, most of the time we divide the whole experimental unit (field) into relatively homogeneous sub-groups or strata. These strata, which are more uniform amongst themselves than the field as a whole, are known as blocks.
Degrees of freedom: It is defined as the difference between the total number of items and the total number of constraints. If "n" is the total number of items and "k" the total number of constraints, then the degrees of freedom (d.f.) is given by d.f. = n − k. In other words, the number of degrees of freedom generally refers to the number of independent observations in a sample minus the number of population parameters that must be estimated from sample data.
Level of significance (LOS): The maximum probability at which we would be willing to risk a Type-I error is known as the level of significance; that is, the size of the Type-I error is the level of significance. The levels of significance usually employed in testing of hypotheses are 5% and 1%. The level of significance is always fixed in advance, before collecting the sample information. An LOS of 5% means the results obtained will be true in 95 out of 100 cases and may be wrong in 5 out of 100 cases.
Experimental error:

The variations in response among the different experimental units may be partitioned in to two components:

i) the systematic part / the assignable part and ii) the non-systematic / non assignable part.

Variations in experimental units due to different treatments, etc., which are known to the experimenter, constitute the assignable part. On the other hand, the part of the variation which cannot be assigned to specific reasons or causes is termed the experimental error. Thus it is often found that experimental units receiving the same treatments and experimental conditions provide differential responses. This type of variation in response may be due to inherent differences among the experimental units, error associated with measurement, etc.; these factors are known as extraneous factors. So the variation in responses due to these extraneous factors is termed experimental error. The purpose of designing an experiment is to increase the precision of the experiment. For reducing the experimental error, we adopt some techniques. These techniques form the 3 basic principles of experimental designs.

1. Replication: The repetition of the treatments under investigation is known as replication. Replication is used (i) to secure a more accurate estimate of the experimental error, a term which represents the differences that would be observed if the same treatments were applied several times to the same experimental units; and (ii) to reduce the experimental error and thereby increase precision, which is a measure of the variability of the experimental error.
2. Randomization: The random allocation of treatments to different experimental units is known as randomization.
3. Local control: It has been observed that not all extraneous sources of variation are removed by randomization and replication. This necessitates a refinement in the experimental technique. For this purpose, we make use of local control, a term referring to the grouping of homogeneous experimental units. The main purpose of the principle of local control is to increase the efficiency of an experimental design by decreasing the experimental error.
One-Way ANOVA
One-way ANOVA is an inferential statistical technique to analyze three or more variances at a time to test the equality between them. A test of hypothesis for several sample means investigating only one factor at k levels, corresponding to k populations, is called one-way ANOVA. The ANOVA classification table for the test of hypothesis is obtained by comparing the F-statistic estimated from the samples (F0) with the critical value of F (Fe) at a stated level of significance (such as 1% or 5%) from the F-distribution table. Only one factor can be analyzed at multiple levels by this method. The technique allows each group of samples to have a different number of observations. The design of the statistical experiment should satisfy replication and randomization.

ANOVA Table for One-Way Classification
The ANOVA table for one-way classification shows the formulas and input parameters used in the analysis of variance for one factor, which involves two or more treatment means together, to check whether the null hypothesis is accepted or rejected at a stated level of significance in statistical experiments.

Sources of Variation    df       SS      MSS                 F-ratio
Between Treatments      k − 1    SST     SST/(k−1) = MST     MST/MSE = F
Error                   N − k    SSE     SSE/(N−k) = MSE
Total                   N − 1

Notable Points for One-Way ANOVA Test

Below are the important notes on one-way ANOVA for the test of hypothesis for a single factor involving three or more treatment means.

• The null hypothesis H0: μ1 = μ2 = . . . = μk; the alternative hypothesis H1: at least two of the means differ.
• State the level of significance α (e.g. 1% or 5%).
• The sum of all N elements in all the sample data sets is known as the Grand Total and is represented by "G".
• The correction factor CF = G²/N.
• The Total Sum of Squares of all individual elements, often abbreviated as TSS, is obtained by TSS = ΣΣxij² − CF.
• The Sum of Squares of all the class totals, often abbreviated as SST, is obtained by SST = Σ(Ti²/ni) − CF.
• The Sum of Squares due to Error, often abbreviated as SSE, is obtained by SSE = TSS − SST.
• The degrees of freedom for the Total Sum of Squares TSS = N − 1.
• The degrees of freedom for the Sum of Squares of all the class totals SST = k − 1.
• The degrees of freedom for the Sum of Squares due to Error SSE = N − k.
• The Mean Sum of Squares of Treatment, often abbreviated as MST, is obtained by MST = SST/(k − 1).
• The Mean Sum of Squares due to Error, often abbreviated as MSE, is obtained by MSE = SSE/(N − k).
• The variance ratio F between the treatments is the higher variance to the lower variance: F = MST/MSE or MSE/MST (the numerator should always be the higher).
• The critical value of F can be obtained by referring to the F distribution table for (k − 1, N − k) d.f. at the stated level of significance.
• The difference between the treatments is not significant if the calculated F value is less than the value from the F table; the null hypothesis H0 is then accepted.
• The difference between the treatments is significant if the calculated F value is greater than the value from the F table; the null hypothesis H0 is then rejected.
*******************************************************************************
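The CF/TSS/SST/SSE scheme above can be sketched in a few lines of Python (standard library only; the function name `one_way_anova` and the small yield dataset are ours, for illustration):

```python
def one_way_anova(groups):
    """One-way ANOVA following the CF/TSS/SST/SSE scheme above."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups)
    cf = grand ** 2 / n_total                        # correction factor G^2/N
    tss = sum(x ** 2 for g in groups for x in g) - cf
    sst = sum(sum(g) ** 2 / len(g) for g in groups) - cf
    sse = tss - sst
    mst, mse = sst / (k - 1), sse / (n_total - k)
    return mst / mse, k - 1, n_total - k             # F, df1, df2

# hypothetical yields for three treatments, three observations each
f, df1, df2 = one_way_anova([[10, 12, 14], [20, 22, 24], [30, 32, 34]])
# F = 75.0 with (2, 6) d.f.
```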

62

COMPLETELY RANDOMIZED DESIGN (CRD)
The completely randomized design (CRD) is the simplest of all designs; only two principles of experimental design, i.e. replication and randomization, are used. The principle of local control is not used in this design. The basic characteristic of this design is that the whole experimental area (i) should be homogeneous in nature and (ii) should be divided into as many experimental units as the sum of the numbers of replications of all the treatments. Suppose there are five treatments A, B, C, D, E replicated 5, 4, 3, 3 and 5 times respectively; then according to this design we require the whole experimental area to be divided into 20 experimental units of equal size. Thus the completely randomized design is applicable only when the experimental area is homogeneous in nature. Under laboratory conditions, where other conditions including the environmental conditions are controlled, the completely randomized design is the most accepted and widely used design. Let there be t treatments replicated r1, r2, …, rt times respectively. So in total we require an experimental area of Σᵗᵢ₌₁ ri homogeneous experimental units of equal size.

Randomization and Layout :

To facilitate easy understanding we shall demonstrate the layout and randomization procedure in a field experiment conducted in CRD with 5 treatments A, B, C, D, E being replicated 5, 4, 3, 2, 6 times respectively. The steps are given as follows :

(i) The total number of experimental units required is 5+4+3+2+6 = 20. Divide the whole experimental area into 20 experimental units of equal size. For laboratory experiments the experimental units may be test tubes, petri dishes, beakers, pots, etc., depending upon the nature of the experiment. (ii) Number the experimental units 1 to 20.

Figure – 1: Experimental area

1 2 3 4 5

6 7 8 9 10

11 12 13 14 15

16 17 18 19 20

Figure – 2: Experimental area divided and numbered in to 20 experimental units

(iii) Assign the five treatments to the 20 experimental units randomly in such a way that the treatments A, B, C, D, E are allotted 5, 4, 3, 2, 6 times respectively. For this we require a random number table and follow the steps given below:
A) Method 1: Start at any page, at any point of row-column intersection of the random number table. Let the starting point be the intersection of the 5th row and 4th column, and read vertically downward to get 20 distinct two-digit random numbers. Since 80 is the highest two-digit number which is a multiple of 20, we reject the numbers 81 to 99 and 00. If a random number is more than 20 it is divided by 20 and the remainder taken. The process continues till we have 20 distinct random numbers; if the remainder is zero then we take it as the last number, i.e. 20.
a) In the process the random numbers selected are 08, 12, 01, 18, 14, 18, 02, 12, 12, 20, 12, 10, 14, 00, 15, 07, 05, 16, 7, 18, 19, 03, 10, 08, 16, 09, 13, 14, 17, 18, 06, 17, 19, 08, 15 and 11.
b) Repeated random numbers appear in the above list, so we discard the random numbers which have appeared previously. Thus the selected random numbers are 08, 12, 01, 18, 14, 02, 20, 10, 15, 07, 05, 16, 19, 03, 09, 13, 04, 17, 06, 11. These random numbers correspond to the 20 experimental units.
c) The first 5 experimental units, corresponding to the first 5 selected random numbers, are allotted the first treatment A; the next 4 experimental units, corresponding to the next four random numbers, are allotted treatment B; and so on.
d) We demonstrate the whole process (a) to (c) in the following table:

Random number taken    Remainder    Selected random    Treatment
from the table                      number             allotted
08                     08           08                 A
32                     12           12                 A
01                     01           01                 A
58                     18           18                 A
14                     14           14                 A
18                     18           Not selected       -
02                     02           02                 B
12                     12           Not selected       -
52                     12           Not selected       -
20                     20           20                 B
12                     12           Not selected       -
10                     10           10                 B
14                     14           Not selected       -
00                     00           Not selected       -
55                     15           15                 B
07                     07           07                 C
05                     05           05                 C
16                     16           16                 C
27                     07           Not selected       -
18                     18           Not selected       -
79                     19           19                 D
03                     03           03                 D
10                     10           Not selected       -
08                     08           Not selected       -
56                     16           Not selected       -
29                     09           09                 E
13                     13           13                 E
14                     14           14                 E
17                     17           17                 E
18                     18           Not selected       -
46                     06           06                 E
37                     17           Not selected       -
59                     19           Not selected       -
08                     08           Not selected       -
15                     15           Not selected       -
11                     11           11                 E


1 A 2 B 3 D 4 E 5 C 6 E 7 C 8 A 9 E 10 B 11 E 12 A 13 E 14 A 15 B 16 C 17 E 18 A 19 D 20 B Figure – 3: Layout along with allocation of treatments
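The remainder method above can be sketched in a short program. Here Python's pseudo-random generator stands in for the printed random number table (an assumption for illustration only), so the resulting layout will differ from Figure 3, but the rejection and remainder rules are the same:

```python
import random

random.seed(1)  # reproducible stream standing in for the printed table

n_units = 20
selected = []
while len(selected) < n_units:
    rn = random.randint(0, 99)          # a two-digit random number
    if rn > 80 or rn == 0:              # 80 = largest two-digit multiple of 20
        continue                        # reject 81-99 and 00
    unit = rn % n_units or n_units      # take the remainder; 0 means unit 20
    if unit not in selected:            # discard repeated numbers
        selected.append(unit)

# Allot treatments A-E with 5, 4, 3, 2, 6 replications, in order of selection
layout = {}
for treat, reps in zip("ABCDE", [5, 4, 3, 2, 6]):
    for _ in range(reps):
        layout[selected.pop(0)] = treat

print(sorted(layout.items()))
```

Every unit 1 to 20 receives exactly one treatment, and the replication numbers 5, 4, 3, 2, 6 are preserved.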

B) Method 2:

Step 1: In the first method we take 2-digit random numbers, and in the process we have to reject a lot of random numbers because of repetition. To avoid this, instead of taking 2-digit random numbers one may take 3-digit random numbers, starting from any page, at any row-column intersection of the random number table. Let us use the same random number table and start at the intersection of the 5th row and 2nd column, i.e. 208. We take 20 distinct random numbers of 3 digits; the numbers are 208, 412, 480, 318, 094, 158, 082, 232, 252, 020, 392, 950, 394, 800, 435, 187, 851, 164, 273, 384. Interestingly, we do not discard any number because of repetition, i.e. the chance of ties is smaller here.

Step 2: Rank the random numbers with smallest number getting the lowest rank 1. Thus the random number along with their respective ranks are :

R No 208 412 480 318 94 158 82 232 252 20 392 950 394 800 435 187 851 164 273 384 Rank 7 15 17 11 3 4 2 8 9 1 13 20 14 18 16 6 19 5 10 12

These ranks correspond to the 20 numbered experimental units

Step 3: Allot the first treatment A to the first five plots appearing in order, i.e. allot treatment A to the 7th, 15th, 17th, 11th and 3rd experimental units. Allot treatment B to the next four experimental units, i.e. the 4th, 2nd, 8th and 9th experimental units, and so on.

R No 208 412 480 318 94 158 82 232 252 20 392 950 394 800 435 187 851 164 273 384 Rank 7 15 17 11 3 4 2 8 9 1 13 20 14 18 16 6 19 5 10 12 Treat. A A A A A B B B B C C C D D E E E E E E

Layout :

1 C 2 B 3 A 4 B 5 E 6 E 7 A 8 B 9 B 10 E 11 A 12 E 13 C 14 D 15 A 16 E 17 A 18 D 19 E 20 C Figure – 4: Layout along with allocation of treatments.
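Steps 2 and 3 can be carried out in a few lines, using the 3-digit numbers from the text; this small sketch reproduces the ranks and the layout of Figure 4:

```python
# Rank-based randomization (Method 2), using the 3-digit numbers from the text
rns = [208, 412, 480, 318, 94, 158, 82, 232, 252, 20,
       392, 950, 394, 800, 435, 187, 851, 164, 273, 384]

# Rank the numbers: the smallest gets rank 1
order = sorted(range(len(rns)), key=lambda i: rns[i])
ranks = [0] * len(rns)
for r, i in enumerate(order, start=1):
    ranks[i] = r

# Treatments A-E replicated 5, 4, 3, 2, 6 times, in the order the ranks appear
treatments = [t for t, reps in zip("ABCDE", [5, 4, 3, 2, 6]) for _ in range(reps)]
layout = dict(zip(ranks, treatments))   # experimental unit (rank) -> treatment

print(ranks)
print(sorted(layout.items()))
```

The printed ranks match the table above (7, 15, 17, 11, 3, …), and the layout matches Figure 4 (unit 1 gets C, unit 2 gets B, and so on).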

C) Method 3: The above two methods are applicable only when a random number table is available. But while conducting experiments at a farmer's field a random number table may not be available. To overcome this difficulty, we may opt for the 'drawing lots' technique for randomization. The procedure is as follows:

a) According to this problem we are to allocate five treatments to twenty experimental units. Initially we take a piece of paper and make 20 small pieces of equal size and shape.
b) The twenty pieces of paper, thus made, are then labelled and numbered according to the treatments and their corresponding numbers of replications, such that five papers are marked with 'A', four with 'B', three with 'C', two with 'D' and six with 'E'.
c) Fold the papers uniformly and place them in a bucket/basket/jar etc.
d) Draw one piece of paper at a time, without replacing it, with continuous stirring of the container after every draw.
e) Note the sequence of appearance of the treatments.
f) Allot the treatments to the experimental units based on the treatment letter label and the sequence. Thus here the sequence corresponds to the experimental units from one to twenty. Let the appearance of the treatments for this case be as follows:
Sequence 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Treatment D B A A C D B E B C A E E C B E A E E A

Thus the treatment A is allotted to the experimental units 3, 4, 11, 17 and 20, treatment B to 2, 7, 9, 15 and so on. Ultimately the final layout will be as follows : 1 D 2 B 3 A 4 A 5 C 6 D 7 B 8 E 9 B 10 C 11 A 12 E 13 E 14 C 15 B 16 E 17 A 18 E 19 E 20 A Figure – 5: Layout along with allocation of treatments.
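The drawing-lots procedure amounts to shuffling the labelled slips; a minimal sketch (a software stand-in for the physical draw, seeded only for reproducibility):

```python
import random

random.seed(7)  # a physical draw has no seed; fixed here only for reproducibility

# Label the slips of paper: one slip per replication of each treatment
slips = [t for t, reps in zip("ABCDE", [5, 4, 3, 2, 6]) for _ in range(reps)]
random.shuffle(slips)               # stand-in for drawing folded slips from a jar

# The i-th draw is allotted to experimental unit i
layout = dict(enumerate(slips, start=1))
print(layout)
```

As with the physical draw, every sequence of the 20 slips is equally likely, and the replication numbers are preserved automatically.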

Analysis :

Statistical Model: Let there be t treatments with r1, r2, r3, …, rt replications respectively in a completely randomized design. The model for the experiment will be:
yij = μ + αi + eij ;  i = 1, 2, 3, …, t;  j = 1, 2, …, ri

where yij = response corresponding to the j-th observation of the i-th treatment,
μ = general effect,
αi = additional effect due to the i-th treatment, with Σ ri αi = 0,
eij = error associated with the j-th observation of the i-th treatment, the eij being i.i.d. N(0, σ²).
Assumption of the model: The above model is based on the assumptions that the effects are additive in nature and the error components are identically and independently distributed as normal variates with mean zero and constant variance.

Hypothesis to be tested : H :...... =  =  == = 0 against = = the altern ative hypothesis 0 1 2 3 it H1 : All 's are not equal

Let the level of significance be α. Let the observations of the total n = Σ ri (summed over i = 1 to t) experimental units be as follows:


Replication   Treatment
              1       2       ….    i       ….    t
1             y11     y21     ….    yi1     ….    yt1
2             y12     y22     ….    yi2     ….    yt2
:             :       :             :             :
ri            y1r1    y2r2    ….    yiri    ….    ytrt
Total         y1o     y2o     ….    yio     ….    yto
Mean          ȳ1o     ȳ2o     ….    ȳio     ….    ȳto

The analysis for this type of data is the same as that of one way classified data discussed in chapter 1 section(1.2). From the above table we calculate the following quantities :

Grand total = Σi Σj yij = y11 + y21 + y31 + …… + ytrt = G

Correction factor CF = G²/n

Total Sum of Squares (TSS) = Σi Σj yij² − CF = y11² + y21² + y31² + …… + ytrt² − CF

Treatment Sum of Squares (TrSS) = Σi (yio²/ri) − CF, where yio = Σj yij = sum of the observations for the i-th treatment
= y1o²/r1 + y2o²/r2 + y3o²/r3 + …… + yto²/rt − CF

Error Sum of Squares (by subtraction): ErSS = TSS − TrSS.

ANOVA table for Completely Randomized Design:
SOV         d.f.   SS     MS                  F-ratio     Tabulated F(0.05)   Tabulated F(0.01)
Treatment   t−1    TrSS   TrMS = TrSS/(t−1)   TrMS/ErMS
Error       n−t    ErSS   ErMS = ErSS/(n−t)
Total       n−1    TSS


The null hypothesis is rejected at the α level of significance if the calculated value of the F ratio corresponding to treatment is greater than the table value at the same level of significance with (t−1, n−t) degrees of freedom; that is, we reject H0 if Fcal > Ftab, α; (t−1, n−t), otherwise one cannot reject the null hypothesis. When the test is non-significant we conclude that there exist no significant differences among the treatments with respect to the particular characters under consideration; all treatments are statistically at par. When the test is significant, i.e. when the null hypothesis is rejected, one should find out which pairs of treatments are significantly different and which treatment is either the best or the worst with respect to the particular characters under consideration.

One of the ways to answer these queries is to use the t-test to compare all possible pairs of treatment means. This procedure is simplified with the help of the least significant difference (critical difference) value, as per the formula given below:

LSD(α) = t α/2, (n−t) × √[ErMS (1/ri + 1/ri′)]
where i and i′ refer to the treatments involved in the comparison, t α/2, (n−t) is the table value of the t distribution at the α level of significance with (n−t) d.f., and √[ErMS (1/ri + 1/ri′)] is the standard error of the difference (SEd) between the means of treatments i and i′. Thus, if the absolute value of the difference between two treatment means exceeds the corresponding CD value, the two treatments are significantly different, and the better treatment is adjudged on the basis of the mean values, commensurate with the nature of the character under study.
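The computational scheme above can be bundled into a small helper; this is a sketch (function and variable names are my own), with the t and F table values still to be looked up manually:

```python
def crd_anova(groups):
    """One-way (CRD) analysis of variance for a list of treatment groups,
    allowing unequal numbers of replications."""
    n = sum(len(g) for g in groups)                  # total observations
    t = len(groups)                                  # number of treatments
    G = sum(sum(g) for g in groups)                  # grand total
    CF = G ** 2 / n                                  # correction factor
    TSS = sum(y ** 2 for g in groups for y in g) - CF
    TrSS = sum(sum(g) ** 2 / len(g) for g in groups) - CF
    ErSS = TSS - TrSS                                # by subtraction
    TrMS, ErMS = TrSS / (t - 1), ErSS / (n - t)
    return {"CF": CF, "TSS": TSS, "TrSS": TrSS, "ErSS": ErSS,
            "ErMS": ErMS, "F": TrMS / ErMS, "df": (t - 1, n - t)}

def lsd(ErMS, ri, rj, t_table):
    """Least significant difference for treatments with ri and rj replications,
    given the table value of t at the chosen level with (n - t) d.f."""
    return t_table * (ErMS * (1 / ri + 1 / rj)) ** 0.5
```

The returned F value is then compared against the tabulated F with (t−1, n−t) d.f., exactly as described above.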

Advantages and disadvantages of CRD:

A) Advantages: i) It is the simplest of all experimental designs. ii) It offers flexibility in adopting different numbers of replications for different treatments; this is the only design in which different numbers of replications can be used for different treatments. In practical situations this is very useful, because the experimenter sometimes comes across the problem of varied availability of experimental material. Sometimes the response from particular experimental unit(s) may not be available; even then the data can be analyzed if the CRD design was adopted.

B) Disadvantages: i) The basic assumption of homogeneity of experimental units is rarely met, particularly under field conditions. That is why this design is suitable mostly under laboratory or greenhouse conditions. ii) The principle of "local control", which is very efficient in reducing the experimental error, is not used in this design. With an increase in the number of treatments, especially under field conditions, it becomes very difficult to use this design because of the difficulty in getting a larger number of homogeneous experimental units.

******************************************************************************


Objective: CRD analysis with unequal replication
Kinds of data: Mycelial growth in terms of diameter of the colony (mm) of R. solani isolates on PDA medium after 14 hours of incubation

R. solani isolates   Repl. 1   Repl. 2   Repl. 3   Treatment total (Ti)   Treatment mean
RS 1                 29.0      28.0      29.0      86.0                   28.67
RS 2                 33.5      31.5      29.0      94.0                   31.33
RS 3                 26.5      30.0      -         56.5                   28.25
RS 4                 48.5      46.5      49.0      144.0                  48.00
RS 5                 34.5      31.0      -         65.5                   32.72
Grand total          172.0     167.0     107.0     446.0
Grand mean                                                                34.31

Solution: Here we test whether the treatments differ significantly or not.
Grand total = 446
Correction factor = (446)²/13 = 15301.23
Total sum of squares = (29² + 28² + ⋯ + 34.5² + 31²) − CF = 16090.5 − CF = 789.27
Treatment sum of squares = (86)²/3 + (94)²/3 + (56.5)²/2 + (144)²/3 + (65.5)²/2 − CF = 16063.92 − CF = 762.69
Error sum of squares = Total sum of squares − treatment sum of squares = 789.27 − 762.69 = 26.58

Source of variation   Degrees of freedom   Sum of squares   Mean square   Computed F   Tabular F (5%)
Treatment             4                    762.69           190.67        57.38*       3.84
Error                 8                    26.58            3.32
Total                 12                   789.27

Since Fcal is greater than Ftab, the treatments differ significantly. Next we calculate the CD (LSD) as per the formula described above. For example, to compare treatment 1 and treatment 2 we calculate:

standard error of difference = √[3.32 × (1/3 + 1/3)] = 1.49, and the t value at 5% and 8 degrees of freedom = 2.30

Now CD or LSD=1.49*2.30=3.44


The difference between the treatment means of 1 and 2 is 2.66. Hence we find that treatments 1 and 2 do not differ significantly. The comparisons between all the treatments are given in the table below, along with their significance; the corresponding CD values are shown in parentheses.

Treatment   RS 1   RS 2          RS 3          RS 4            RS 5
RS 1        0.00   2.66 (3.44)   0.42 (3.84)   19.33* (3.44)   4.05* (3.84)
RS 2               0.00          3.08 (3.84)   16.67* (3.44)   1.39 (3.84)
RS 3                             0.00          19.75* (3.84)   4.47* (4.21)
RS 4                                           0.00            15.28* (3.84)
RS 5                                                           0.00
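Because the replication numbers differ, the CD in the table above is not a single value; it can be computed per pair of treatments. A sketch using the rounded ErMS = 3.32 and t = 2.306, so the results differ from the tabled 3.44, 3.84 and 4.21 only by small rounding amounts:

```python
# CD for comparing two treatments depends on their replication numbers
# when replication is unequal; ErMS = 3.32, t(5%, 8 d.f.) = 2.306
ErMS, t_tab = 3.32, 2.306

def cd(ri, rj):
    """Critical difference for treatments with ri and rj replications."""
    return t_tab * (ErMS * (1 / ri + 1 / rj)) ** 0.5

# Three distinct CDs occur in the table: (3,3), (3,2) and (2,2) replications
print(round(cd(3, 3), 2), round(cd(3, 2), 2), round(cd(2, 2), 2))
```

The three printed values correspond to the (3.44), (3.84) and (4.21) entries shown in parentheses in the table.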

*******************************************************************************
Objective: Analysis of CRD with equal replication
Kinds of data: Grain yield of rice resulting from the use of different foliar and granular insecticides for the control of brown plant hoppers and stem borers, from a CRD experiment with 4 replications (r) and 7 treatments (t).
Grain yield (kg/ha)
Treatment        R1     R2     R3     R4     Treatment total (T)   Treatment mean
Dol-mix (1 kg)   2537   2069   2104   1797   8507                  2127
Ferterra         3366   2591   2211   2544   10712                 2678
DDT + Y-BHC      2536   2459   2827   2385   10207                 2552
Standard         2387   2453   1556   2116   8512                  2128
Dimecron-Boom    1997   1679   1649   1859   7184                  1796
Dimecron-Knap    1796   1704   1904   1320   6724                  1681
Control          1401   1516   1270   1077   5264                  1316
Grand total (G)                              57110
Grand mean                                                         2040

Solution: Here we test whether the treatments differ significantly or not. The grand total = 57110.
Correction factor = (57110)²/28 = 116484004
Total sum of squares = (2537² + 2069² + …. + 1077²) − CF = 124061416 − CF = 7577412.4
Treatment sum of squares = (8507² + 10712² + …… + 5264²)/4 − CF = 122071179 − CF = 5587174.9
Error sum of squares = 7577412.4 − 5587174.9 = 1990237.5

ANOVA (CRD with equal replication) of rice yield data
Source of variation   DF   SS        Mean square   Fcal   Tabular F 5%   Tabular F 1%
Treatment             6    5587174   931196        9.83   2.57           3.81
Error                 21   1990238   94773
Total                 27   7577412

Hence we find that the treatments differ significantly. After that we calculate the critical difference.
The standard error of the difference between treatment means = √(2 × 94773/4) = 217.68, and
the t value at the 5% level of significance and 21 error d.f. = 2.08
Now the CD or LSD at the 5% level of significance = 217.68 × 2.08 = 452.77 kg/ha
The t value at the 1% level of significance and 21 error d.f. = 2.831
Now the CD or LSD at the 1% level of significance = 217.68 × 2.831 = 616.25 kg/ha.
Comparisons between the mean yield of the control and each of the six insecticide treatments using the LSD test are given in the table below.

Treatment   Mean yield (kg/ha)   Difference from control
T7          2127                 811**
T6          2678                 1362**
T5          2552                 1236**
T4          2128                 812**
T3          1796                 480*
T2          1681                 365 ns
T1          1316
* indicates a significant difference at 5%, ** indicates a significant difference at 1% and ns indicates a non-significant difference
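The whole equal-replication analysis above can be verified with a short script (a check of the arithmetic, not part of the original solution):

```python
# Checking the rice-yield CRD analysis (equal replication, r = 4, t = 7)
yields = [
    [2537, 2069, 2104, 1797],   # Dol-mix (1 kg)
    [3366, 2591, 2211, 2544],   # Ferterra
    [2536, 2459, 2827, 2385],   # DDT + Y-BHC
    [2387, 2453, 1556, 2116],   # Standard
    [1997, 1679, 1649, 1859],   # Dimecron-Boom
    [1796, 1704, 1904, 1320],   # Dimecron-Knap
    [1401, 1516, 1270, 1077],   # Control
]
r, t = 4, len(yields)
G = sum(map(sum, yields))                        # grand total
CF = G ** 2 / (r * t)                            # correction factor
TSS = sum(y ** 2 for row in yields for y in row) - CF
TrSS = sum(sum(row) ** 2 for row in yields) / r - CF
ErSS = TSS - TrSS
F = (TrSS / (t - 1)) / (ErSS / (r * t - t))
SEd = (2 * (ErSS / (r * t - t)) / r) ** 0.5      # SE of a treatment-mean difference
print(G, round(F, 2), round(SEd, 2))
```

The script reproduces the grand total 57110, F ≈ 9.83 and SEd ≈ 217.68 reported above.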

Two-Way ANOVA
Two-way ANOVA is an inferential statistical technique for comparing three or more sample means classified by two factors at a time, testing their equality and the inter-relationship between the factors and the influencing variable at k levels corresponding to k populations. It tests hypotheses (H0) for treatment means and for subject or class (e.g. variety) means at a stated level of significance with the help of the F distribution. In this analysis of variance, the samples drawn from the populations should be of the same length, i.e. one observation per cell. The design should satisfy replication, randomization and local control.


ANOVA Table for Two-Way Classification

The ANOVA table for two-way classification shows the formulas and input parameters used in the analysis of variance for two factors, involving two or more treatment means, together with the null hypothesis at a stated level of significance.

Sources of variation   Df               SS    MSS                        F-ratio
Between treatments     k − 1            SSR   SSR/(k − 1) = MST          MST/MSE = FR
Between blocks         h − 1            SSC   SSC/(h − 1) = MSV          MSV/MSE = FC
Error                  (h − 1)(k − 1)   SSE   SSE/[(k − 1)(h − 1)] = MSE
Total                  N − 1

Notable Points for Two-Way ANOVA Test

The following are important notes on the two-way ANOVA test of hypothesis, which involves two factors and three or more treatment or subject means together.

• The null hypotheses are H0 : μ1 = μ2 = . . . = μk (treatment means) and H0 : μ.1 = μ.2 = . . . = μ.h (class means), i.e. no significant difference between the means. The alternative hypotheses H1 : μ1 ≠ μ2 ≠ . . . ≠ μk and H1 : μ.1 ≠ μ.2 ≠ . . . ≠ μ.h state that there are significant differences among the means.
• State the level of significance α (1%, 5%, 10% etc.).
• The sum of all N elements in the sample data set is known as the Grand Total and is represented by "G".
• The correction factor CF = G²/N = G²/(kh).
• The Total Sum of Squares, abbreviated TSS, is obtained by TSS = ∑∑xij² − CF.
• The sum of squares of the treatment (row) totals in the two-way table (h × k), abbreviated SST, is SST = SSR = ∑(Ti.²/h) − CF.
• The sum of squares between classes (columns) is SSV = SSC = ∑(T.j²/k) − CF, where k is the number of observations in each column.
• The sum of squares due to error, abbreviated SSE, is SSE = TSS − SSR − SSC.
• The degrees of freedom for the Total Sum of Squares: N − 1 = hk − 1.
• The degrees of freedom for the sum of squares between treatments: k − 1.
• The degrees of freedom for the sum of squares between varieties: h − 1.
• The degrees of freedom for the error sum of squares: (k − 1)(h − 1).
• The Mean Sum of Squares for Treatments: MST = SST/(k − 1).
• The Mean Sum of Squares for varieties: MSV = SSV/(h − 1).
• The Mean Sum of Squares due to Error: MSE = SSE/[(h − 1)(k − 1)].
• The variance ratio for treatments is FR = MST/MSE or MSE/MST (the numerator should always be the larger).
• The variance ratio for subjects or classes is FC = MSV/MSE or MSE/MSV (the numerator should always be the larger).
• The critical value of F for between treatments (rows) is obtained by referring to the F distribution table for (k − 1, (k − 1)(h − 1)) degrees of freedom at the stated level of significance (1%, 5%, 10% etc.).
• The critical value of F for between varieties (columns) or subjects is obtained by referring to the F distribution table for (h − 1, (k − 1)(h − 1)) degrees of freedom at the stated level of significance.
• The difference between the treatments (rows) is not significant if the calculated F value is less than the table value; the null hypothesis H0 is then accepted. It is significant if the calculated F value is greater than the table value; H0 is then rejected.
• Likewise, the difference between the subjects or varieties (columns) is not significant if the calculated F value is less than the table value (H0 accepted), and significant if it is greater than the table value (H0 rejected).
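The formulas in the notes above can be collected into one function; a minimal sketch for the one-observation-per-cell case (the function and key names are illustrative, not from the manual):

```python
def two_way_anova(table):
    """Two-way ANOVA with one observation per cell.
    `table` is k rows (treatments) x h columns (blocks/varieties)."""
    k, h = len(table), len(table[0])
    N = k * h
    G = sum(sum(row) for row in table)                           # grand total
    CF = G ** 2 / N                                              # correction factor
    TSS = sum(x ** 2 for row in table for x in row) - CF
    SSR = sum(sum(row) ** 2 for row in table) / h - CF           # between treatments
    SSC = sum(sum(col) ** 2 for col in zip(*table)) / k - CF     # between blocks
    SSE = TSS - SSR - SSC                                        # error, by subtraction
    MST, MSV = SSR / (k - 1), SSC / (h - 1)
    MSE = SSE / ((k - 1) * (h - 1))
    return {"SSR": SSR, "SSC": SSC, "SSE": SSE,
            "FR": MST / MSE, "FC": MSV / MSE}
```

FR is compared against the tabulated F with (k − 1, (k − 1)(h − 1)) d.f. and FC against the tabulated F with (h − 1, (k − 1)(h − 1)) d.f., as listed in the notes.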
*******************************************************************************
RANDOMIZED BLOCK DESIGN (RBD)
When the experimental units are not homogeneous, the principle of local control is adopted and the experimental material is grouped into homogeneous sub-groups. Each sub-group is commonly termed a block. The blocks are formed of units having common characteristics which may influence the response under study.
Advantages and disadvantages of RBD:
A) Advantages: 1. The principal advantage of RBD is that it increases the precision of the experiment, due to the reduction of experimental error by the adoption of local control. 2. The amount of information obtained in RBD is more than in CRD; hence, RBD is more efficient than CRD. Since the layout of RBD involves equal replication of treatments, the statistical analysis is simple.

B) Disadvantage: 1. When the number of treatments is increased, the block size will increase. 2. If the block size is large maintaining homogeneity is difficult and hence when more number of treatments is present this design may not be suitable.


Analysis:

Let us suppose that we have t treatments, each replicated r times. The appropriate statistical model for RBD will be:

yij = μ + αi + βj + eij ;  i = 1, 2, 3, …, t;  j = 1, 2, …, r
where yij = response corresponding to the j-th replication/block of the i-th treatment,
μ = general effect,
αi = additional effect due to the i-th treatment, with Σ αi = 0,
βj = additional effect due to the j-th replication/block, with Σ βj = 0,
eij = error associated with the j-th replication/block of the i-th treatment, the eij being i.i.d. N(0, σ²).

The above model is based on the assumptions that the effects are additive in nature and that the error components are identically and independently distributed as normal variates with mean zero and constant variance. Let the level of significance be α. Hypothesis to be tested: The null hypotheses to be tested are

H0 : (1) α1 = α2 = …… = αt = 0
     (2) β1 = β2 = …… = βr = 0
against the alternative hypotheses
H1 : (1) all α's are not equal
     (2) all β's are not equal
Let the observations of these n = rt units be as follows:

Treatments   Replications/Blocks                     Total   Mean
             1     2     ….   j     ….   r
1            y11   y12   ….   y1j   ….   y1r         y1o     ȳ1o
2            y21   y22   ….   y2j   ….   y2r         y2o     ȳ2o
:            :     :          :          :           :       :
i            yi1   yi2   ….   yij   ….   yir         yio     ȳio
:            :     :          :          :           :       :
t            yt1   yt2   ….   ytj   ….   ytr         yto     ȳto
Total        yo1   yo2   ….   yoj   ….   yor         yoo
Mean         ȳo1   ȳo2   ….   ȳoj   ….   ȳor

The analysis of this design is the same as that of two-way classified data with one observation per cell discussed in chapter 1 section (1.3). From the above table we calculate the following quantities :

Grand total = Σ yij = y11 + y21 + y31 + …… + ytr = G
Correction factor CF = G²/(rt)


Total Sum of Squares (TSS) = Σ yij² − CF = y11² + y21² + y31² + …… + ytr² − CF
Treatment Sum of Squares (TrSS) = Σi (yio²)/r − CF = (y1o² + y2o² + y3o² + …… + yto²)/r − CF
Replication Sum of Squares (RSS) = Σj (yoj²)/t − CF = (yo1² + yo2² + yo3² + …… + yor²)/t − CF
Error Sum of Squares (by subtraction): ErSS = TSS − TrSS − RSS

ANOVA table for RBD
Source of variation   d.f.         SS     MS                        F-ratio     Tabulated F(0.05)   Tabulated F(0.01)
Treatment             t−1          TrSS   TrMS = TrSS/(t−1)         TrMS/ErMS
Replication (Block)   r−1          RSS    RMS = RSS/(r−1)           RMS/ErMS
Error                 (t−1)(r−1)   ErSS   ErMS = ErSS/[(t−1)(r−1)]
Total                 rt−1         TSS

The null hypotheses are rejected at the α level of significance if the calculated values of the F ratios corresponding to treatment and replication are greater than the corresponding table values at the same level of significance with (t−1), (t−1)(r−1) and (r−1), (t−1)(r−1) degrees of freedom respectively. That means we reject H0 if Fcal > Ftab; otherwise one cannot reject the null hypothesis. When a test is non-significant we conclude that there exist no significant differences among the treatments/replications with respect to the particular character under consideration; all treatments/replications are statistically at par.

When the test(s) is (are) significant, we reject the null hypothesis and try to find out the replications or treatments which are significantly different from each other. As in CRD, here also in RBD we use the least significant difference (critical difference) value for comparing the difference between a pair of means. The CD is calculated as follows:

LSD (CD) = t α/2, (t−1)(r−1) × √(2 ErMS / r)

where r is the number of replications and t α/2, (t−1)(r−1) is the table value of t at the α level of significance with (t−1)(r−1) degrees of freedom.
*******************************************************************************

Objective: Analysis of Randomized Block Design
Kinds of data: An experiment was conducted in RBD to study the comparative performance of fodder sorghum under rainfed conditions. The rearranged data are given in the table below. Green matter yield of sorghum (kg/plot)

Variety I II III IV Total Mean African Tall 22.9 25.9 39.1 33.9 121.8 30.45 Co-11 29.5 30.4 35.3 29.6 124.8 31.2 FS -1 28.8 24.4 32.1 28.6 113.9 28.475 K -7 47 40.9 42.8 32.1 162.8 40.7 Co-24 28.9 20.4 21.1 31.8 102.2 25.55 Total 157.1 142.0 170.4 156.0 625.5

Solution: Here we test whether the varieties differ significantly or not.
Correction factor = (625.5)²/20 = 19562.51
Total sum of squares = (22.9² + 25.9² + …. + 31.8²) − CF = 20514.95 − CF = 952.44
Block sum of squares = (157.1² + 142² + 170.4² + 156²)/5 − CF = 19643.31 − CF = 80.80
Variety sum of squares = (121.8² + 124.8² + 113.9² + 162.8² + 102.2²)/4 − CF = 20083.04 − CF = 520.53
Error sum of squares = 952.44 − 80.80 − 520.53 = 351.11
Putting the values in the ANOVA table we get:
Source of variation   DF   SS      MSS    F cal    F tab
Replication           3    80.80   26.9   < 1      3.490
Variety               4    521     130    4.448*   3.259
Error                 12   351     29.3

Total 19

Here we found that the varieties differ significantly.

Variety        Mean
K-7            40.7
Co-11          31.2
African Tall   30.45
FS-1           28.48
Co-24          25.55

SE(d) = √(2 EMS / r) = √(2 × 29.2588 / 4) = √14.6294 = 3.8248
CD = t × SE(d) = (2.179)(3.8248) = 8.33

[Bar diagram: mean green matter yield of the sorghum varieties — K-7 (40.7), Co-11 (31.2), African Tall (30.45), FS-1 (28.475), Co-24 (25.55) kg/plot]

From this it can be concluded that the sorghum variety K-7 produces significantly higher green matter yield than all other varieties. The remaining varieties are all at par.
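The whole sorghum analysis, including the CD, can be recomputed from the raw yields with a short script (a check of the arithmetic above, not part of the original solution):

```python
# Recomputing the sorghum RBD analysis (4 blocks, 5 varieties)
data = {
    "African Tall": [22.9, 25.9, 39.1, 33.9],
    "Co-11":        [29.5, 30.4, 35.3, 29.6],
    "FS-1":         [28.8, 24.4, 32.1, 28.6],
    "K-7":          [47.0, 40.9, 42.8, 32.1],
    "Co-24":        [28.9, 20.4, 21.1, 31.8],
}
r, v = 4, len(data)                      # blocks, varieties
vals = [y for row in data.values() for y in row]
CF = sum(vals) ** 2 / (r * v)            # correction factor
TSS = sum(y ** 2 for y in vals) - CF
VSS = sum(sum(row) ** 2 for row in data.values()) / r - CF   # varieties
BSS = sum(sum(b) ** 2 for b in zip(*data.values())) / v - CF # blocks
ESS = TSS - VSS - BSS
EMS = ESS / ((r - 1) * (v - 1))          # error mean square, 12 d.f.
SEd = (2 * EMS / r) ** 0.5               # SE of difference between variety means
CD = 2.179 * SEd                         # t(5%, 12 d.f.) = 2.179
print(round(SEd, 4), round(CD, 2))
```

The script reproduces SE(d) ≈ 3.8248 and CD ≈ 8.33 as obtained above.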

*******************************************************************************

Objective: Analysis of Randomized Block Design.
Kinds of data: The yields of 6 varieties of a crop in lbs., along with the plan of the experiment, are given below. The number of blocks is 5, the plot size is 1/20 acre, and the varieties are represented by A, B, C, D, E and F. Analyze the data and state your conclusions.
B-I    B 12   E 26   D 10   C 15   A 26   F 62
B-II   E 23   C 16   F 56   A 30   D 20   B 10
B-III  A 28   B 9    E 35   F 64   D 23   C 14
B-IV   F 75   D 20   E 30   C 14   B 7    A 23
B-V    D 17   F 70   A 20   C 12   B 9    E 28
Solution: Null hypothesis H01: There is no significant difference between variety means

1 = 2 = 3 = 4 = 5 = 6

H02: There is no significant difference between block means

1 = 2 = 3 = 4 = 5

Correction factor = (GT)²/(rk)
Variety sum of squares (VSS) = Σ vi²/r − CF
Block sum of squares (BSS) = Σ bj²/k − CF
Total sum of squares (TSS) = ΣΣ y² − CF
Error sum of squares (ESS) = TSS − VSS − BSS

First rearrange the given data:
Blocks   Varieties                       Block totals   Means
         A     B     C     D     E     F

B1 26 12 15 10 26 62 ΣB1 = 151 25.17

B2 30 10 16 20 23 56 ΣB2 =155 25.83

B3 28 9 14 23 35 64 ΣB3 = 173 28.83

B4 23 7 14 20 30 75 ΣB4 = 169 28.17

B5 20 9 12 17 28 70 ΣB5 = 156 26.00 Variety ΣA = ΣB = ΣC = ΣD = ΣE = ΣF = GT = 804 - totals 127 47 71 90 142 327 Means 25.4 9.4 14.2 18 28.4 65.4 - -

CF = (804)²/30 = 21547.2
VSS = (127² + 47² + 71² + 90² + 142² + 327²)/5 − 21547.2 = 31714.4 − 21547.2 = 10167.2
BSS = (151² + 155² + 173² + 169² + 156²)/6 − 21547.2 = 21608.67 − 21547.2 = 61.47
TSS = (26² + 12² + 15² + …… + 28² + 70²) − 21547.2 = 32194 − 21547.2 = 10646.8
ESS = TSS − BSS − VSS = 10646.8 − 61.47 − 10167.2 = 418.13

ANOVA TABLE
Sources of variation   d.f.             S.S.      M.S.      F-cal. Value   F-table Value
Blocks                 5 − 1 = 4        61.47     15.37     0.74           F0.05 (4, 20) = 2.87
Varieties              6 − 1 = 5        10167.2   2033.44   97.25          F0.05 (5, 20) = 2.71
Error                  29 − 4 − 5 = 20  418.13    20.91
Total                  30 − 1 = 29      10646.8

Since the calculated value of F (varieties) > the table value of F, H0 is rejected, and hence we conclude that there is a highly significant difference between the variety means.

where SEm = √(EMS/r) = √(20.91/5) = 2.04

SED = √2 × SEm = 1.414 × 2.04 = 2.88

Critical difference = SED x t-table value for error d.f. at 5% LOS CD = 2.88 * 2.09 = 6.04

Coefficient of variation = (√EMS / X̄) × 100 = (√20.91 / 26.8) × 100 = 17%

F      E      A      D      C      B
65.4   28.4   25.4   18.0   14.2   9.40

(i) Those pairs not underscored by a common line are significantly different; (ii) those pairs underscored by a common line are non-significant. Variety F gives a significantly higher yield than all the other varieties; varieties E and A are at par and give significantly higher yields than varieties D, C and B; varieties D and C are at par, as are varieties C and B.
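As a check, the sums of squares for this example can be recomputed directly from the block layout (a verification script, not part of the original solution):

```python
# Checking the six-variety RBD example (5 blocks)
data = {   # variety -> yields in blocks B1..B5
    "A": [26, 30, 28, 23, 20],
    "B": [12, 10, 9, 7, 9],
    "C": [15, 16, 14, 14, 12],
    "D": [10, 20, 23, 20, 17],
    "E": [26, 23, 35, 30, 28],
    "F": [62, 56, 64, 75, 70],
}
k, r = len(data), 5                      # varieties, blocks
vals = [y for row in data.values() for y in row]
CF = sum(vals) ** 2 / (k * r)            # correction factor
TSS = sum(y ** 2 for y in vals) - CF
VSS = sum(sum(row) ** 2 for row in data.values()) / r - CF   # varieties
BSS = sum(sum(b) ** 2 for b in zip(*data.values())) / k - CF # blocks
ESS = TSS - VSS - BSS
Fv = (VSS / (k - 1)) / (ESS / ((k - 1) * (r - 1)))           # F for varieties
print(round(VSS, 1), round(BSS, 2), round(ESS, 2), round(Fv, 2))
```

The script reproduces VSS = 10167.2, BSS = 61.47, ESS = 418.13 and F ≈ 97.25, confirming the ANOVA table.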

Exercise:
Q1. Explain the analysis of one-way classification.
Q2. What do you understand by analysis of variance?
Q3. What are the assumptions of analysis of variance?
Q4. The yields of four varieties of wheat per plot (in lbs.) obtained from an experiment in randomized block design are given below:
Variety   Replication
          I    II   III   IV   V
V1        7    9    8     10   10
V2        12   13   15    11   13
V3        15   20   15    18   16
V4        8    10   12    10   8

Analyze the data and state your conclusion.(Ans. Variety Variance=66.13, Error variance=2.59)

Q5. The following table gives the yields in pounds per plot of five varieties of wheat, applied to 4, 3, 2, 4 and 3 plots respectively:
Varieties   Yield in lbs.
A           8   8   6   10
B           10  9   8
C           8   10
D           7   10  9   8
E           12  8   10
Analyze the data and state your conclusion. (Ans. Variety Variance = 1.86, Error variance = 2.28)
Q6. Write short notes on: (a) Local control (b) Replication (c) Advantages of C.R.D.


8. Sampling Methods R. S. Solanki Assistant Professor (Maths & Stat.), College of Agriculture, Waraseoni, Balaghat (M.P.), India Email id : [email protected]

1. Introduction

The terminology "sampling" indicates the selection of a part of a group or an aggregate with a view to obtaining information about the whole. This aggregate, or the totality of all members, is known as the Population, although its members need not be human beings. The selected part, which is used to ascertain the characteristics of the population, is called the Sample. While choosing a sample, the population is assumed to be composed of individual units or members, some of which are included in the sample. The total number of members of the population and the number included in the sample are called the Population Size and Sample Size respectively. The process of generalising on the basis of information collected on a part is really a traditional practice. The annual production of a certain crop in a region is computed on the basis of a sample. The quality of a product coming out of a production process is ascertained on the basis of a sample. The government and its various agencies conduct surveys from time to time to examine various economic and related issues through samples. Sampling methodology can be used by an auditor or an accountant to estimate the value of total inventory in the stores without actually inspecting all the items physically. Opinion polls based on samples are used to forecast the result of a forthcoming election.

2. Advantage of sampling over census

The census or complete enumeration consists of collecting data from each and every unit of the population. Sampling, on the other hand, chooses only a part of the units from the population for the same study. Sampling has a number of advantages compared to complete enumeration, for a variety of reasons.

Less Expensive: The first obvious advantage of sampling is that it is less expensive. If we want to study the consumer reaction before launching a new product it will be much less expensive to carry out a consumer survey based on a sample rather than studying the entire population which is the potential group of customers.

Less Time Consuming: The smaller size of the sample enables us to collect the data more quickly than to survey all the units of the population even if we are willing to spend money. This is particularly the case if the decision is time bound. An accountant may be interested to know the total inventory value quickly to prepare a periodical report like a monthly balance sheet and a profit and loss account. A detailed study on the inventory is likely to take too long to enable him to prepare the report in time.

Greater Accuracy: It is possible to achieve greater accuracy by using appropriate sampling techniques than by a complete enumeration of all the units of the population. Complete enumeration may result in inaccuracies in the data. Consider an inspector who is visually inspecting the quality of finishing of certain machinery. After observing a large number of such items he can no longer distinguish items with a defective finish from good ones. Once such inspection fatigue develops, the accuracy of examining the population completely decreases considerably. On the other hand, if a small number of items is observed, the basic data will be much more accurate.


Destructive Enumeration: Sampling is indispensable if the enumeration is destructive. If you are interested in computing the average life of fluorescent lamps supplied in a batch the life of the entire batch cannot be examined to compute the average since this means that the entire supply will be wasted. Thus, in this case there is no other alternative than to examine the life of a sample of lamps and draw an inference about the entire batch.

3. Simple Random Sampling

The representative character of a sample is ensured by allocating some probability to each unit of the population for being included in the sample. The simple random sample assigns equal probability to each unit of the population. The simple random sample can be chosen both with and without replacement.

Simple Random Sampling with Replacement (SRSWR): Suppose the population consists of N units and we want to select a sample of size n. In simple random sampling with replacement we choose an observation from the population in such a manner that every unit of the population has an equal chance of 1/N of being included in the sample. After the first unit is selected its value is recorded and it is placed back in the population. The second unit is drawn in exactly the same manner as the first unit. This procedure is continued until the nth unit of the sample is selected. Clearly, in this case each unit of the population has an equal chance of 1/N of being included in each of the n units of the sample.

In this case the number of possible samples of size n that can be selected from a population of size N is N^n. The samples selected through this method need not be distinct.

Simple Random Sampling without Replacement (SRSWOR): In this case, when the first unit is chosen every unit of the population has a chance of 1/N of being included in the sample. After the first unit is chosen it is not replaced in the population. The second unit is selected from the remaining (N−1) members of the population, so that each unit has a chance of 1/(N−1) of being included in the sample. The procedure is continued till the nth unit of the sample is chosen with probability 1/(N−n+1).

In this case the number of possible samples of size n that can be selected from a population of size N is NCn = N!/[n!(N−n)!]. The samples selected through this method are distinct.
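The two selection schemes can be sketched with Python's standard library. This is an illustrative sketch, not part of the manual: the population of labels 1 to 24 and the sample size 6 are arbitrary choices, matching the diseased-plants exercise later in this chapter.

```python
import math
import random

# Illustrative population: unit labels 1..N for N = 24
population = list(range(1, 25))
n = 6  # desired sample size

# SRSWR: each draw is independent, so a unit may appear more than once
srswr_sample = random.choices(population, k=n)

# SRSWOR: a drawn unit is not replaced, so all selected units are distinct
srswor_sample = random.sample(population, k=n)

# Number of possible samples under each scheme
n_srswr = len(population) ** n            # N^n ordered samples with replacement
n_srswor = math.comb(len(population), n)  # NCn distinct samples without replacement

print(srswr_sample, srswor_sample)
print(n_srswr, n_srswor)
```

Note that `random.choices` may return repeated labels while `random.sample` never does, which is exactly the distinction between the two schemes.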

Advantages and Disadvantages of Simple Random Sampling:

Advantages: It is a fair method of sampling and, if applied appropriately, it helps to reduce bias compared with other sampling methods. It is a very basic method of collecting data: no technical knowledge is required, only basic listing and recording skills. Simple random sampling gives researchers a way to collect information with a lower margin of error, and it offers an equal chance of selection to everyone within the population group.

Disadvantages: It is a costlier method of sampling as it requires a complete list of all potential respondents to be available beforehand. It relies on the quality of the researchers performing the work. It can require a sample size that is too large. It does not work well with widely diverse or dispersed population groups.


4. Selection of Simple Random Sample

The concept of "randomness" implies that every item being considered has an equal chance of being selected as part of the sample. To ensure randomness of selection one may adopt either the Lottery Method or use a table of random numbers.

Lottery Method: This is a very popular method of taking a random sample. Under this method, all items of the universe are numbered or named on separate slips of paper of identical size and shape. These slips are then folded and mixed up in a container or drum. A blindfold selection is then made of the number of slips required to constitute the desired sample size. The selection of items thus depends entirely on chance. The method would be quite clear with the help of an example. If we want to take a sample of 10 persons out of a population of 100, the procedure is to write the names of all the 100 persons on separate slips of paper, fold these slips, mix them thoroughly and then make a blindfold selection of 10 slips. The lottery method is very popular in lottery draws where a decision about prizes is to be made. However, while adopting lottery method it is absolutely essential to see that the slips are of identical size, shape and colour, otherwise there is a lot of possibility of personal prejudice and bias affecting the results. The process of writing N number of slips is cumbersome and shuffling a large number of slips, where population size is very large, is difficult. Also human bias may enter while choosing the slips. Hence the other alternative i.e. random numbers can be used.
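The lottery draw described above amounts to shuffling the slips and taking the first few. A minimal Python sketch of the 10-from-100 example follows; the `Person-i` labels are hypothetical stand-ins for the names written on the slips.

```python
import random

# Hypothetical "slips": one per member of a population of 100 persons
slips = [f"Person-{i}" for i in range(1, 101)]

random.shuffle(slips)  # mixing the slips thoroughly in the drum
sample = slips[:10]    # blindfold draw of 10 slips

print(sample)
```

Because every permutation of the slips is equally likely after the shuffle, every person has the same chance of ending up among the first 10.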

Random Number Table Method: A random number table is a table of digits. The digit given in each position in the table was originally chosen randomly from the digits 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 by a random process in which each digit is equally likely to be chosen, as demonstrated in the small sample shown below.

Table of Random Numbers

36518 36777 89116 05542 29705
46132 81380 75635 19428 88048
31841 77367 40791 97402 27569
84180 93793 64953 51472 65358
78435 37586 07015 98729 76703
83775 21564 81639 27973 62413
08747 20092 12615 35046 67753
90184 02338 39318 54936 34641
23701 75230 47200 78176 85248
16224 97661 79907 06611 26501
85652 62817 57881 90589 74567
69630 10883 13683 93389 92725
95525 86316 87384 22633 68158

The table usually contains 5-digit numbers, arranged in rows and columns, for ease of reading. Typically, a full table may extend over as many as four or more pages. The occurrence of any two digits in any two places is independent of each other. Several standard tables of random numbers are available, among which the following may be specially mentioned, as they have been tested extensively for randomness:

• Tippett’s (1927) random number tables consisting of 41,600 random digits grouped into 10,400 sets of four-digit random numbers.
• Fisher and Yates (1938) table of random numbers with 15,000 random digits arranged into 1,500 sets of ten-digit random numbers.
• Kendall and Babington Smith (1939) table of random numbers consisting of 1,00,000 random digits grouped into 25,000 sets of four-digit random numbers.
• Rand Corporation (1955) table of random numbers consisting of 1,00,000 random digits grouped into 20,000 sets of five-digit random numbers.
• C.R. Rao, Mitra and Mathai (1966) table of random numbers.

How to use a random number table: This is one of several methods of reading numbers from random number tables.
i. Assume you have the test scores for a population of 200 students. Each student has been assigned a number from 1 to 200. We want to randomly sample only 5 of the students.
ii. Since the population size is a three-digit number, we will use the first three digits of the numbers listed in the table.
iii. Without looking, point to a starting spot in the above random number table. Assume we land on 93793 (2nd column, 4th entry).
iv. This location gives the first three digits as 937. This choice is too large (> 200), so we choose the next number in that column. Keep in mind that we are looking for numbers whose first three digits are from 001 to 200 (representing students).
v. The second choice gives the first three digits as 375, also too large. Continue down the column until you find 5 numbers whose first three digits are less than or equal to 200.
vi. From this table, we arrive at 200 (20092), 023 (02338), 108 (10883), 070 (07015), and 126 (12615).
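The table-reading walk-through above can be sketched in a few lines of Python. This is only an illustration of the procedure: the list below hard-codes the entries read down the 2nd column of the sample table (starting at 93793) and then down the 3rd column, and the cut-off of 200 matches the worked example.

```python
# Entries of the sample table, read down the 2nd column starting at 93793,
# then continuing down the 3rd column
entries_read = [
    "93793", "37586", "21564", "20092", "02338", "75230", "97661",
    "62817", "10883", "86316",                                # rest of column 2
    "89116", "75635", "40791", "64953", "07015", "81639", "12615",  # column 3
]

selected = []
for entry in entries_read:
    label = int(entry[:3])        # first three digits of the 5-digit number
    if 1 <= label <= 200:         # accept only labels that name a student
        selected.append(label)
    if len(selected) == 5:        # stop once 5 students are chosen
        break

print(sorted(selected))  # → [23, 70, 108, 126, 200]
```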

Students 23, 70, 108, 126, and 200 will be used for our random sample. Our sample set of students has been randomly selected where each student had an equal chance of being selected and the selection of one student did not influence the selection of other students. ******************************************************************************

Objective: Selection of a simple random sample using a random number table.
Kinds of data: The numbers of diseased plants (out of 9) in 24 areas are given in the following table:

S.No. of areas   1  2  3  4  5  6  7  8  9 10 11 12
Diseased Plants  1  4  1  2  5  1  1  1  7  2  3  3
S.No. of areas  13 14 15 16 17 18 19 20 21 22 23 24
Diseased Plants  2  2  3  1  2  7  2  6  3  5  3  4

Select a simple random sample with and without replacement of size 6. Compute the average number of diseased plants based on each sample. Compare this with the average number of diseased plants in the population.
Solution: Simple random sample with replacement: We have the diseased-plant counts for a population of 24 areas. Each area has been assigned a number from 1 to 24. We want to randomly sample, with replacement, only 6 of the 24 areas.
Step 1: Since the population size is a two-digit number, we will use the first two digits of the numbers listed in the random number table (see appendix).
Step 2: Without looking, point to a starting spot in the random number table. Assume we land on 72918 (4th column, 21st entry). This location gives the first two digits as 72. This choice is too large (> 24), so we choose the next number in that column. Keep in mind that we are looking for numbers whose first two digits are from 01 to 24 (representing areas).
Step 3: The second choice (12468) gives the first two digits as 12 (≤ 24), so we accept it.
Step 4: Continue down the column until we find 6 numbers whose first two digits are less than or equal to 24. From the table, we arrive at 12 (12468), 17 (17262), 02 (02401), 11 (11333), 10 (10631) and 17 (17220). Areas 02, 10, 11, 12, 17 and 17 will be used for our random sample (area no. 17 appears twice because our sample is drawn with replacement).
Average diseased plants based on simple random sample with replacement:

S.No. of areas   02 10 11 12 17 17
Diseased Plants   4  2  3  3  2  2

Average diseased plants = (4 + 2 + 3 + 3 + 2 + 2)/6 = 2.67 ≈ 3
Simple random sample without replacement: We have the diseased-plant counts for the population of 24 areas. Each area has been assigned a number from 1 to 24. We want to randomly sample, without replacement, only 6 of the 24 areas.
Step 1: Since the population size is a two-digit number, we will use the first two digits of the numbers listed in the random number table (see appendix).
Step 2: Without looking, point to a starting spot in the random number table. Assume we land on 13211 (7th column, 17th entry). This location gives the first two digits as 13. This choice is ≤ 24, so we accept this number. Keep in mind that we are looking for numbers whose first two digits are from 01 to 24 (representing areas).
Step 3: Continue down the column until we find 6 numbers (repeated numbers are not allowed in SRSWOR) whose first two digits are less than or equal to 24. From the table, we arrive at 22 (22250), 12 (12944), 04 (04014), 19 (19386), 01 (01573) and 20 (20963). Areas 01, 04, 12, 19, 20 and 22 will be used for our random sample.
Average diseased plants based on simple random sample without replacement:

S.No. of areas   01 04 12 19 20 22
Diseased Plants   1  2  3  2  6  5
Average diseased plants = (1 + 2 + 3 + 2 + 6 + 5)/6 = 3.17 ≈ 3
Average diseased plants based on population:
Average diseased plants = (1 + 4 + 1 + 2 + 5 + 1 + 1 + 1 + 7 + 2 + 3 + 3 + 2 + 2 + 3 + 1 + 2 + 7 + 2 + 6 + 3 + 5 + 3 + 4)/24 = 2.96 ≈ 3
Conclusion: From the above calculations it can be concluded that the average numbers of diseased plants based on the simple random samples with and without replacement, and based on the population, are all approximately equal to 3.
******************************************************************************
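As a cross-check of the arithmetic in the worked example above, the three averages can be recomputed with a short Python sketch (not part of the original manual). The area numbers are the ones selected in the example.

```python
# Diseased-plant counts for the 24 areas (population), indexed 1..24
diseased = [1, 4, 1, 2, 5, 1, 1, 1, 7, 2, 3, 3,
            2, 2, 3, 1, 2, 7, 2, 6, 3, 5, 3, 4]

# Area numbers picked in the worked example (SRSWR allows a repeat of area 17)
srswr_areas = [2, 10, 11, 12, 17, 17]
srswor_areas = [1, 4, 12, 19, 20, 22]

srswr_mean = sum(diseased[a - 1] for a in srswr_areas) / len(srswr_areas)
srswor_mean = sum(diseased[a - 1] for a in srswor_areas) / len(srswor_areas)
pop_mean = sum(diseased) / len(diseased)

print(round(srswr_mean, 2), round(srswor_mean, 2), round(pop_mean, 2))
# → 2.67 3.17 2.96
```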

Objective: Selection of a simple random sample under SRSWOR.
Kinds of data: The data relate to a hypothetical population whose units are 1, 2, 3, 4 and 5. Draw a sample of size n = 3 using SRSWOR and show that the sample mean is an estimate of the population mean.

Solution: The number of all possible samples of size n = 3 under SRSWOR is given by NCn = 5C3 = 10.
Population mean ȳ_N = Σyᵢ/N = 15/5 = 3, and the mean of each sample is ȳ_n = Σyᵢ/n.
The 10 possible samples are given below in the table.

S.No.  Possible samples  Sample mean ȳ_n
1.     1,2,3             2.0
2.     2,3,4             3.0
3.     3,4,5             4.0
4.     4,5,1             3.33
5.     5,1,2             2.67
6.     1,3,4             2.67
7.     2,4,5             3.67
8.     3,5,1             3.0
9.     4,1,2             2.33
10.    5,2,3             3.33
       Total             30.0

Now we have to check whether E(ȳ_n) = ȳ_N:

E(ȳ_n) = Σȳ_n / NCn = 30/10 = 3 = ȳ_N

Hence we can say that the sample mean ȳ_n is an unbiased estimate of the population mean ȳ_N.
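The enumeration above can be verified with `itertools.combinations`, which generates exactly the 5C3 = 10 distinct SRSWOR samples; this is an illustrative sketch, not part of the manual.

```python
from itertools import combinations

population = [1, 2, 3, 4, 5]
n = 3

# All 5C3 = 10 distinct samples of size 3 without replacement
samples = list(combinations(population, n))
sample_means = [sum(s) / n for s in samples]

expectation = sum(sample_means) / len(samples)  # E(sample mean)
pop_mean = sum(population) / len(population)    # population mean

print(len(samples), round(expectation, 2), pop_mean)  # → 10 3.0 3.0
```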

*******************************************************************************

Objective: Selection of a simple random sample under SRSWR.
Kind of data: Consider a finite population of size N = 5 whose sampling units take the values (1, 2, 3, 4, 5). Enumerate all possible samples of size n = 2 using SRSWR and check whether the sample mean is an estimate of the population mean.
Solution: The number of all possible samples of size n = 2 under SRSWR is given by N^n = 5² = 25.
Population mean ȳ_N = Σyᵢ/N = 15/5 = 3, and the mean of each sample is ȳ_n = Σyᵢ/n.

S.No.  Possible samples  Sample mean ȳ_n    S.No.  Possible samples  Sample mean ȳ_n
1      1,2               1.5                13     4,1               2.5
2      1,3               2.0                14     5,1               3.0
3      1,4               2.5                15     3,2               2.5
4      1,5               3.0                16     4,2               3.0
5      2,3               2.5                17     5,2               3.5
6      2,4               3.0                18     4,3               3.5
7      2,5               3.5                19     5,3               4.0
8      3,4               3.5                20     5,4               4.5
9      3,5               4.0                21     1,1               1.0
10     4,5               4.5                22     2,2               2.0
11     2,1               1.5                23     3,3               3.0
12     3,1               2.0                24     4,4               4.0
                                            25     5,5               5.0
                                                   Total             75.0

Now we have to check whether E(ȳ_n) = ȳ_N:

E(ȳ_n) = Σȳ_n / N^n = 75/25 = 3 = ȳ_N
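The 25 ordered SRSWR samples enumerated above can be generated with `itertools.product`; again this is an illustrative sketch, not part of the manual.

```python
from itertools import product

population = [1, 2, 3, 4, 5]
n = 2

# All 5^2 = 25 ordered samples of size 2 with replacement
samples = list(product(population, repeat=n))
sample_means = [sum(s) / n for s in samples]

expectation = sum(sample_means) / len(samples)  # E(sample mean)
pop_mean = sum(population) / len(population)    # population mean

print(len(samples), round(expectation, 2), pop_mean)  # → 25 3.0 3.0
```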

Hence we can say that the sample mean ȳ_n is an unbiased estimate of the population mean ȳ_N.
*******************************************************************************
Exercise:
Q1. The data below indicate the number of workers in each of twelve factories:

Factory         1     2     3    4    5    6     7    8    9    10    11    12
No. of Workers  2145  1547  745  215  784  3125  126  471  841  3215  2496  589

Select a simple random sample without replacement of size four with the help of the random number table (see Appendix). Compute the average number of workers per factory based on the sample. Compare this number with the average number of workers per factory in the population.

Q2. A class has 115 students. Select a simple random sample with replacement of size 15.

Q3. The following data are the yields (q/ha) of 30 varieties of paddy maintained at a research station for breeding trials:
49 78 57 55 45 26 70 21 75 94 56 62 64 79 85 47 67 43 31 38 33 50 37 75 32 42 52 22 63 40

Select a simple random sample without replacement of size 8. Compute the average yield of paddy based on the sample. Compare this yield with the average yield of paddy in the population.

Q4. A population has 7 units: 1, 2, 3, 4, 5, 6, 7. Write down all possible samples of size 2 (without replacement) which can be drawn from the given population and verify that the sample mean is an estimate of the population mean.

Q5. How many random samples of size 5 can be drawn from a population of size 10 if sampling is done with replacement?
********************************************************************************

