Answer Hints to Bioinformatics Tutorial

Answer Hints to Bioinformatics Tutorial: Differential Gene Expression Analysis

1. Book question; examples are provided in the lecture notes.

2. Fold change: FC = (log(D) - log(N)) / log(2)

          N      D     FC
   A    128    256      1
   B    256    128     -1
   C   1024    512     -1
   D   1000   1000      0
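The fold-change column can be checked with a few lines of Python. This is only a minimal sketch: the function name `log2_fold_change` and the dictionary layout are illustrative choices, while the N and D values are the ones in the table above.

```python
import math

def log2_fold_change(normal, diseased):
    """Return (log(D) - log(N)) / log(2), i.e. log2(D / N)."""
    return (math.log(diseased) - math.log(normal)) / math.log(2)

# (N, D) pairs taken from the table in answer 2.
rows = {"A": (128, 256), "B": (256, 128), "C": (1024, 512), "D": (1000, 1000)}

fold_changes = {name: log2_fold_change(n, d) for name, (n, d) in rows.items()}

for name, fc in fold_changes.items():
    # Round for display; floating-point logs can be off by a tiny amount.
    print(name, round(fc, 6))
```

A doubling gives +1, a halving gives -1, and no change gives 0, which is why the log2 scale is convenient for comparing up- and down-regulation symmetrically.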

3. T-test

a. When would you use a t-test as opposed to a z-test?
Use the t-test for small data sets and the z-test for larger data sets. The z-test assumes that the data set (sample) follows a normal distribution. The t-test assumes that the data set (sample) was drawn from a normal distribution, but because we only have a small sample, the test statistic follows a t-distribution.

b. What is meant by paired and unpaired experiments? How do they affect the calculation of a t-test?
Paired experiments involve measurements from the same individuals (or very similar individuals, e.g. twins) under different conditions. In such a case you can compare the two measurements of each individual directly. In unpaired experiments this assumption does not hold (the individuals are different and you cannot relate individual measurements), so you have to compare the averages across the two data sets. The formulae for paired/unpaired tests are given in the lecture notes.

c. What is meant by a two tail t-test? Right tail / Left tail t-tests?
In a two tail test you are only interested in whether the two populations are different (you don't care if the change is positive or negative, so long as the measurements in one group are far enough in either direction from the mean value of the other group). Right tail and left tail tests make a stronger claim: they require that the measurements in one group are higher (right tail) or lower (left tail) than those in the other group.

d. What is meant when we say that the t-value is a Signal-to-Noise ratio?
Signal and Noise are the two components of the t-value (Signal is the numerator, Noise is the denominator). Signal is the average difference between the two groups (high signal means the difference is large), and Noise is the fluctuation in that difference (low noise means small fluctuations). A large SNR means the difference is large and the fluctuation (noise) is small.

e. What is the number of degrees of freedom for a paired t-test when each of the samples has 10 data elements?
10 - 1 = 9

f. What is the number of degrees of freedom for two unpaired data sets, the first having 4 elements and the second having 6 elements?
4 + 6 - 2 = 8
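The signal-to-noise view and the degrees-of-freedom rules above can be sketched in Python. The formulae in the lecture notes are not reproduced here, so this assumes the standard textbook versions: the paired test works on per-individual differences, and the unpaired test is the equal-variance (pooled) Student's t-test. The function names and all measurements below are hypothetical, made up purely for illustration.

```python
import math
from statistics import mean, stdev

def paired_t(first, second):
    """Paired t-test: signal and noise come from per-individual differences."""
    diffs = [b - a for a, b in zip(first, second)]
    signal = mean(diffs)                             # average difference
    noise = stdev(diffs) / math.sqrt(len(diffs))     # standard error of that difference
    return signal / noise, len(diffs) - 1            # (t-value, degrees of freedom)

def unpaired_t(group1, group2):
    """Unpaired t-test with pooled variance (assumes equal variances)."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * stdev(group1) ** 2
                  + (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2)
    signal = mean(group2) - mean(group1)             # difference between group means
    noise = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return signal / noise, n1 + n2 - 2               # (t-value, degrees of freedom)

# Hypothetical paired data: 10 individuals measured under two conditions.
before = [10.0, 12.0, 9.0, 11.0, 10.5, 9.5, 12.5, 11.5, 10.0, 9.0]
after = [11.2, 13.1, 9.8, 12.5, 11.0, 10.9, 13.0, 12.9, 11.1, 9.7]
t_paired, df_paired = paired_t(before, after)

# Hypothetical unpaired data: 4 elements in one set, 6 in the other.
set_a = [4.1, 5.0, 6.2, 5.5]
set_b = [6.0, 7.1, 8.3, 7.0, 6.6, 7.9]
t_unpaired, df_unpaired = unpaired_t(set_a, set_b)

print("paired:   t =", round(t_paired, 3), "df =", df_paired)
print("unpaired: t =", round(t_unpaired, 3), "df =", df_unpaired)
```

The degrees of freedom returned match answers e and f: one less than the number of pairs for the paired test, and the two sample sizes minus two for the unpaired test.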

4. P-value

a. What does a p-value of 1 mean?
Hard luck! Your data give no evidence at all against the null hypothesis. (Strictly speaking, you have failed to reject the null hypothesis rather than proved it.) If you were trying to check whether a particular value does not belong to a given population, you have just discovered that this value coincides exactly with the mean of the population. If you were testing for a difference between the means of two samples, you have just found that the two sample means are identical. The area under the curve between this value and +/- infinity is 1.

b. What does a p-value of 0.05 mean? Explain your answer graphically using a normal distribution.
Congratulations, you can reject the null hypothesis at the 95% confidence level. A p-value of 0.05 means that, if the null hypothesis were true, the probability of observing a result at least this extreme would be only 5%. If you were trying to check whether a particular value does not belong to a given population, you have just discovered that this value is very far from the mean of the population: only 5% of the area under the curve lies at least that far from the mean. If you were testing for a difference between the means of two samples, a difference this large would be unlikely to arise by chance if the true means were equal.

c. What is the difference between a normal distribution and a t-distribution?
The t-distribution is lower at the mean and has heavier tails, i.e. it takes longer to approach zero on both sides. For any value on the x-axis, the tail area beyond that value (to the right of a positive value, or to the left of a negative one) is bigger for the t-distribution than it is for the normal distribution. Note that the t-distribution approximates a normal distribution when the number of degrees of freedom is high (>30).
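The tail-area comparison can be checked numerically with the Python standard library alone. This is a rough sketch rather than a statistics package: the t-distribution tail is obtained by integrating its density with Simpson's rule, and the helper names (`t_pdf`, `t_upper_tail`, `normal_upper_tail`) and the step count are arbitrary choices made for this illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(x, df, steps=2000):
    """Area to the right of x (x >= 0), via Simpson's rule on [0, x]."""
    h = x / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3         # half the curve lies right of 0

def normal_upper_tail(x):
    """Area to the right of x under the standard normal curve."""
    return 0.5 * math.erfc(x / math.sqrt(2))

x = 2.0
for df in (3, 30, 300):
    print("df =", df, "t tail =", round(t_upper_tail(x, df), 4))
print("normal tail =", round(normal_upper_tail(x), 4))
```

As the degrees of freedom grow, the t-distribution tail area shrinks towards the normal value, which is the convergence described in the answer above.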

d. What is meant by a critical t-value for p = 0.05? How does this value depend on the number of samples in an experiment?
This is the value on the x-axis for which the area under the curve to its right is 0.05. For your experiment to have a p-value below 0.05, the t-value you calculate must be greater than the critical t-value. Both the t-value you calculate and the critical t-value change as the number of degrees of freedom changes; the critical t-value gets smaller as the number of samples (and hence degrees of freedom) increases.

e. Using the p-value table at the bottom of the next page, find the critical t-value for a paired t-test (2 samples each having 4 elements) that provides 95% confidence that the two samples are different.
v = 4 - 1 = 3, and it is a two-tailed test. The critical t-value is the value for which the area under the curve in each tail is 0.025 (since the curve is symmetric and it is a two-tailed test). In the new table below this is 3.182. (Please note this value was missing in the original tutorial sheet.)

   t        v   p
   10.95    6   0.000034
   10.95    3   0.001631
   12.05    6   0.000020
   12.05    3   0.001230
   8.4      6   0.000155
   8.4      3   0.003539
   2.353    6   0.056825
   3.182    3   0.025
   2.353    3   0.100033

5. Volcano Plot

a. Explain the volcano plot method for assessing the effect and significance over a large number of genes. Why is it useful?
You are trying to compare a very large number of fold changes to quickly assess which genes have an effect that is both large and significant. You use a scatter plot in which each point represents a gene. The co-ordinates of a point represent the magnitude of the effect for that gene and its significance.

b. How are effect and significance calculated in the volcano plot?
Effect is calculated as the difference between the two population means. Significance is calculated as the p-value from a t-test.

c. What is the numerical interpretation of the x-axis in a volcano plot?
This represents the average fold change on a log2 scale. A value of 0 means no change, a value of +1 means the expression of the gene is doubled (-1 means it is halved), a value of +2 means it is 4 times higher, etc.

d. What is the numerical interpretation of the y-axis in the volcano plot?
This represents -log10 of the calculated p-value, which is roughly the number of decimal places in the p-value. The higher you are on the y-axis, the lower the p-value (and hence the higher your confidence that the effect is real and not just due to chance).
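The two volcano plot co-ordinates can be computed as follows. This sketch assumes the x-axis is the log2 ratio of the group means and the y-axis is -log10 of the unpaired p-value; the group means and p-values are the ones reported in the answer to question 6b, with gene V taken from answer 6a (all measurements 100, p-value 1). The variable names are illustrative only.

```python
import math

# gene: (mean normal, mean diseased, unpaired p-value), from the answers to question 6.
genes = {
    "A": (25, 125, 0.000034),
    "E": (25, 25, 1.0),
    "N": (25, 135, 0.00002),
    "Q": (135, 25, 0.000020),
    "S": (23.75, 122.5, 0.000151),
    "V": (100, 100, 1.0),
}

points = {}
for gene, (mean_n, mean_d, p_value) in genes.items():
    x = math.log2(mean_d / mean_n)   # effect: log2 fold change
    y = -math.log10(p_value)         # significance: larger means smaller p-value
    points[gene] = (x, y)

for gene, (x, y) in points.items():
    print(gene, round(x, 3), round(y, 3))
```

Genes with no change (E and V) sit at the origin, up-regulated genes fall to the right, down-regulated genes to the left, and the most significant genes are highest on the plot.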

6. The table below shows gene expression values for a number of genes. Each gene is measured for the same type of tissue cell in the normal state in four samples (N1..N4) and in the diseased state in another set of four samples (D1..D4).

   Gene    N1    N2    N3    N4    D1    D2    D3    D4
   A       10    20    30    40   110   120   130   140
   B       11    18    27    44    50    60    70    80
   C        9    17    32    43    15    25    35    45
   D       10    20    30    40     1    11    21    31
   E       10    20    30    40    20    10    40    30
   F       10    20    48    40   100   120   130    70
   G      100   120   130   140    10    20    30    40
   H       50    60    70    80    10    20    30    40
   I       14    26    33    37    10    20    30    40
   J        1    11    21    31    10    20    30    40
   K       19     8    42    46    10    20    30    40
   L      110   120   130    70    10    20    30    40
   M       10    20    30    40    10    20    30    40
   N       10    20    30    40   120   130   140   150
   O       11    19    26    36   110   120   130    70
   P      100   120   130    70    10    20    30    40
   Q      120   130   140   150    10    20    30    40
   R      120   130   140   150    10    20    30    40
   S       10    10    35    40   100   120   130   140
   T       11    19    32    39   110   120   130   140
   U       20    30    20    30    25    40    35    37
   V      100   100   100   100   100   100   100   100

a. Consider Gene V. Without going through lengthy calculations, is there any change between both states? What do you expect the p-value to be?
For Gene V, all measurements are 100 in both groups, so there is no change between the expression values for the normal and diseased states. The p-value is going to be 1.

b. Calculate the t-value for genes A, E, N, Q, S, V.
I used Excel:

   Gene   WT1   WT2   WT3   WT4   KO1   KO2   KO3   KO4   Mean WT   SD WT      Mean KO   SD KO
   A       10    20    30    40   110   120   130   140   25        12.90994   125       12.90994
   E       10    20    30    40    20    10    40    30   25        12.90994   25        12.90994
   N       10    20    30    40   120   130   140   150   25        12.90994   135       12.90994
   Q      120   130   140   150    10    20    30    40   135       12.90994   25        12.90994
   S       10    10    35    40   100   120   130   140   23.75     16.00781   122.5     17.07825
   V      101   101   101   101   101   101   101   101   101       1E-14      101       1E-14

                     Significance
   Gene   Effect     Unpaired T       P_Unpaired
   A      100        10.95445115      0.000034
   E      0          0                1
   N      110        12.04989627      0.000020
   Q      -110       -12.04989627     0.000020
   S      98.75      8.437423426      0.000151
   V      0          0                1
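The Effect and Unpaired T columns can be reproduced outside Excel. The sketch below assumes the equal-variance (pooled) unpaired t-test, which matches the t-values in the table; the expression values are copied from question 6, and the handling of the zero-variance case for gene V (reporting t = 0 when both the effect and the spread are zero) is a convention chosen here to agree with the table, not something the tutorial specifies.

```python
import math
from statistics import mean, stdev

# (normal samples, diseased samples) copied from the table in question 6.
genes = {
    "A": ([10, 20, 30, 40], [110, 120, 130, 140]),
    "E": ([10, 20, 30, 40], [20, 10, 40, 30]),
    "N": ([10, 20, 30, 40], [120, 130, 140, 150]),
    "Q": ([120, 130, 140, 150], [10, 20, 30, 40]),
    "S": ([10, 10, 35, 40], [100, 120, 130, 140]),
    "V": ([100, 100, 100, 100], [100, 100, 100, 100]),
}

def effect_and_t(normal, diseased):
    """Return (effect, unpaired t-value) using the pooled-variance formula."""
    n1, n2 = len(normal), len(diseased)
    effect = mean(diseased) - mean(normal)
    pooled_var = ((n1 - 1) * stdev(normal) ** 2
                  + (n2 - 1) * stdev(diseased) ** 2) / (n1 + n2 - 2)
    noise = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    if noise == 0:
        # No spread in either group: the ratio is undefined, so report 0
        # when there is also no effect (assumed convention, see lead-in).
        return effect, 0.0 if effect == 0 else math.copysign(math.inf, effect)
    return effect, effect / noise

results = {gene: effect_and_t(n, d) for gene, (n, d) in genes.items()}

for gene, (effect, t_value) in results.items():
    print(gene, effect, round(t_value, 6))
```

With four samples per group, each test has 4 + 4 - 2 = 6 degrees of freedom, which is the row to use when looking up the p-value for these t-values.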

7. Using the table of p-values below:

a. Calculate the effect and significance for genes A, E, N, Q, S, V and plot them on a scatter plot (volcano plot).
The calculations from the above table are plotted in the figure below; each gene is represented by a labelled square on the plot.

b. Compare the effect and significance between genes A and S.
Directly from the plot, S has a higher effect but lower significance than A.

[Figure: volcano plot with labelled points for genes N, Q, A, S, E and V; E and V share the same point]
