
Probability and Statistics for Bioinformatics and Statistical Genetics

Course Notes

Paul Maiste
The Johns Hopkins University

These notes are meant solely to be used as study notes for students in

550.435 - Bioinformatics and Statistical Genetics, Spring 2006.

© Copyright 2006 Paul Maiste, except for various figures and definitions.

May 2, 2006

Contents

1 Introduction to Cell Biology and Genetics
  1.1 The Cell, DNA, and Chromosomes
  1.2 Function of the Cell and DNA
  1.3 Genetics
  1.4 Genetic Inheritance

2 Probability and Statistics Review
  2.1 Introduction to Random Variables
  2.2 Discrete Random Variables
    2.2.1 Probability Distribution
    2.2.2 Cumulative Distribution Function
  2.3 Continuous Random Variables
    2.3.1 General Details About a Continuous Random Variable
    2.3.2 Properties of the density function f(x)
    2.3.3 Cumulative Distribution Function
  2.4 Expected Value and ...
  2.5 ... of a Random Variable
  2.6 Common Discrete Distributions
    2.6.1 Bernoulli Random Variable
    2.6.2 Binomial Random Variable
    2.6.3 Geometric Random Variable
    2.6.4 Negative Binomial Random Variable
    2.6.5 Poisson Random Variable
    2.6.6 Multinomial Distribution
    2.6.7 Summary
  2.7 Common Continuous Distributions
    2.7.1 Normal Distribution
    2.7.2 Gamma Distribution
    2.7.3 Exponential Distribution
    2.7.4 Beta Distribution
    2.7.5 Standard Normal Distribution
    2.7.6 t-distribution
    2.7.7 Chi-squared distribution
    2.7.8 F-distribution
    2.7.9 Using the ... for Distribution Calculations
  2.8 Multiple Random Variables
    2.8.1 Random Samples and Independence
    2.8.2 Joint Probability Distributions
    2.8.3 ...
  2.9 Analysis of Probability Models and Parameter Estimation
    2.9.1 Overview of Problems to be Solved
    2.9.2 Method of Moments
    2.9.3 Maximum Likelihood Estimation
    2.9.4 Confidence Intervals
  2.10 Hypothesis Testing
    2.10.1 General Setup of a Hypothesis Test
    2.10.2 The Decision
    2.10.3 Errors in Hypothesis Testing
  2.11 Chi-squared Goodness-of-Fit Test
    2.11.1 Model is Completely Specified
    2.11.2 Model is not Completely Specified
    2.11.3 Goodness-of-Fit Test for a Discrete Distribution
    2.11.4 More on Expected Counts
  2.12 Likelihood Ratio Tests
  2.13 Tests for Individual Parameters

3 Genetic Frequencies and Hardy-Weinberg Equilibrium
  3.1 Notation
  3.2 Hardy-Weinberg Equilibrium
  3.3 Derivation of Hardy-Weinberg Proportions
  3.4 Maximum Likelihood Estimation of Frequencies
    3.4.1 Multinomial Genetic ...
    3.4.2 Non-HWE Locus MLEs
    3.4.3 The E-M Algorithm for Estimating Allele Frequencies
  3.5 Test for Hardy-Weinberg Equilibrium
  3.6 Fisher's Exact Test
    3.6.1 Fisher's Exact Test - Theory
    3.6.2 Fisher's Exact Test - Methodology
    3.6.3 Fisher's Exact Test - Example
    3.6.4 More than Two Alleles

4 Linkage and other Two-Locus and Multi-Locus Analyses
  4.1 Review of Linkage
    4.1.1 Measuring Distance
  4.2 Estimating rAB
    4.2.1 Backcross
    4.2.2 F2 Experiment
    4.2.3 Human Pedigree Analysis
    4.2.4 Genetic Markers
    4.2.5 Genetic Markers in Disease Analysis
  4.3 Constructing Genetic Maps
    4.3.1 Relationship between Map Distance and Recombination
    4.3.2 Ordering Loci and Map Construction
  4.4 DNA Fingerprinting
    4.4.1 Basics of a DNA Profile
    4.4.2 DNA Profile ...

5 Sequence Analysis and Alignment
  5.1 Overview
  5.2 Single DNA Sequence Analysis
    5.2.1 Composition of Bases
    5.2.2 Independence of Consecutive Bases
  5.3 Comparing and Aligning Two Sequences
    5.3.1 Evolutionary Background
    5.3.2 Sample Sequences
    5.3.3 ... of Sequences
    5.3.4 Aligning Sequences
    5.3.5 Database Searching: BLAST and FASTA
    5.3.6 Probabilistic Background for these Scoring Systems
  5.4 Markov Chains
    5.4.1 Basics
    5.4.2 Notation and Setup
    5.4.3 Modeling DNA Sequences as Markov Chains
    5.4.4 CpG Islands
    5.4.5 Testing for Independence of Bases
  5.5 Other Sequence Problems in Bioinformatics
    5.5.1 Finding Genes
    5.5.2 Identification of Promoter Regions

6 Microarray Analysis
  6.1 Overview
  6.2 Background on Gene Regulation and Expression
  6.3 Overview of Microarray ...
    6.3.1 Primary Purpose and Key Questions
    6.3.2 Microarray Technology
    6.3.3 The Microarray Experiment
    6.3.4 ... of the Microarray
    6.3.5 Within Array Data Manipulation
    6.3.6 Between Array Normalization
    6.3.7 Microarray Data on the Web
  6.4 Microarray Data Analysis
    6.4.1 Identifying Differently Expressed Genes
    6.4.2 Identifying Genes with Patterns of Co-expression
    6.4.3 Classification of New Samples

Chapter 1

Introduction to Cell Biology and Genetics

This chapter is an introduction to some of the necessary background in cell biology and genetics. Further topics will be covered throughout the course. Those who have a sound background in these areas already may find it only necessary to skim through this chapter before moving on.

Additionally, we will cover only the very basic concepts needed to get started. For a more comprehensive and detailed background in cell biology and genetics, please refer to the many good texts that have been written on these topics.

1.1 The Cell, DNA, and Chromosomes

The cell is the basic structural and functional unit of all living organisms. A single cell is the lowest form of life thought to be possible. Most organisms consist of more than one cell, each of which becomes specialized for particular functions that serve the entire organism (such as blood cells, skin cells, etc.). Cells possess many structures inside them that contain and maintain the building blocks of life, DNA. Animal cells and plant cells differ in some fundamental ways, but in both, DNA exists in the nucleus of the cell. Figure 1.1 shows an overview of the various structures inside a plant cell and the nucleus of a cell.

Deoxyribonucleic acid, or DNA, is the molecule that encodes genetic information in the nucleus of cells. It determines the structure, function and behaviour of the cell. DNA is a double-stranded molecule held together by weak bonds between base pairs of nucleotides. The four nucleotides in DNA contain the bases: adenine (A), guanine (G), cytosine (C), and thymine (T). In nature, base pairs form only between A and T and between G and C; thus the base sequence of each single strand can be deduced from that of its partner. In other words, if the nucleotide A is at a certain position on one of the strands of the double-stranded DNA molecule, then the nucleotide T must be on the other strand at that position, with a chemical bond between them. For this reason, even though DNA is actually double stranded, we will only ever need to consider it


Figure 1.1: Overview of structures inside a plant cell and the nucleus of a cell.

as single stranded in our discussions and analyses.
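Since each strand determines the other, deducing the partner strand is purely mechanical. As a small illustration (a Python sketch added here for concreteness; it is not part of the notes themselves), we can compute the partner of any strand using the A/T and G/C pairing rules:

    # Partner-strand rules: A pairs with T, and G pairs with C.
    PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

    def partner_strand(strand):
        """Deduce the complementary strand, position by position."""
        return "".join(PAIR[base] for base in strand)

    print(partner_strand("AATGC"))   # prints TTACG

Strictly speaking, the two strands of the double helix run in opposite directions, so the biological partner strand is the reverse of this string; for the single-stranded view we adopt here, the position-by-position complement is all we need.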

Let’s say a little more about some of the terms used above. Nucleotides are the basic structural unit of nucleic acids (DNA or RNA, which we’ll discuss in a little while). There are four nucleotides in nature, represented by the letters A, T, G, and C as mentioned above. The term base pair (bp) is often used as a synonym for nucleotide, although it specifically refers to the fact that bases (i.e., nucleotides) are paired together across the two strands of DNA. The strands are held together in the shape of a double helix by the bonds between base pairs. Figure 1.2 shows an overview of the double stranded DNA helix.

Other than in special cells, which we will discuss later, the number of base pairs in a cell is constant for a given organism, but can be very different from one organism to another. Table 1.1 shows the differences for some selected organisms. Notice that a typical human cell has approximately three billion base pairs.

The nucleotides that make up DNA in a cell are organized into chromosomes. These are self-replicating genetic structures of cells. We can think of them as a linear sequence of nucleotides. Each chromosome has a characteristic length and banding pattern. The number of chromosomes in a cell is different from organism to organism, as seen in Table 1.2. CHAPTER 1. INTRODUCTION TO CELL BIOLOGY AND GENETICS 7

Figure 1.2: Double Stranded DNA helix

Some cells in an organism are referred to as haploid cells and some are referred to as diploid cells. Haploid cells have a single copy of each chromosome. Typically, the cells involved in sexual reproduction, called gametes, are haploid. The number of chromosomes in a haploid cell of an organism is referred to as the haploid number for that organism. For humans, that number is 23.

Diploid cells, on the other hand, contain two copies, or a pair, of each chromosome. One of the pair of a chromosome comes from the maternal parent of the individual, while the other comes from the paternal parent. Most cells in an animal are diploid. The number of chromosomes in a diploid cell of an organism is called the diploid number for that organism (and this will always be twice the haploid number). For humans, the diploid number is 46.

Table 1.1: Number of base pairs for selected organisms

Organism     Estimated total bp
Bacterium    4,638,858
Yeast        12,069,303
Fruit fly    165,000,000
Plant        100,000,000
Onion        6,000,000,000
Mouse        3,000,000,000
Dog          5,000,000,000
Human        3,000,000,000

Let’s focus on diploid cells for a moment. The fact that there is a pair of each chromosome in such a cell is important to understand. When referencing the two chromosomes of a pair, we call them homologous chromosomes. The word homologous refers to their very close similarity. They will have the same number of nucleotides, and if they were lined up side by side and the sequence of nucleotides compared, they would be very nearly identical.

Recall that for humans, a diploid cell has 23 pairs of homologous chromosomes, for a total of 46 chromosomes. Of these 23 pairs, 22 are called autosomes. Autosomes are not involved in sex determination. For other organisms, the same idea holds: n − 1 out of n total pairs are autosomes.

The one other pair of chromosomes is called the sex chromosomes. These determine the sex of an animal. In most animals, including humans, females have two copies of the X chromosome (referred to as XX), while males have one copy of the X chromosome and one copy of the Y chromosome (the combination is referred to as XY).

In the case of females, who are XX, the pair is truly a homologous pair, as with the autosomes. They are of the same length and very nearly the same sequence of nucleotides. In the case of males, the pair is not truly considered homologous, since the X and Y chromosomes are very different in nature. However, we will still refer to them as the “pair” of sex chromosomes.

Birds are an example of animals where the reverse situation determines sex: the male is XX and the female XY. In some organisms, there is only one sex chromosome. One sex is XX and the other is referred to as XO (meaning there is only one chromosome in the pair; the “O” refers to the fact that there is no second chromosome).

Figure 1.3 gives an idea of the natural look of a diploid cell (a human male), showing the 23 pairs of chromosomes and their noticeable banding patterns. The chromosomes in humans (and all species) are numbered from longest to shortest, with the sex chromosomes being unnumbered and referred to using the X and/or

Table 1.2: Haploid and diploid number of various species

Organism                              Haploid Number   Diploid Number
Homo sapiens (human)                  23               46
Pan troglodytes (chimpanzee)          24               48
Canis familiaris (dog)                39               78
Felis cattus (cat)                    19               38
Equus caballus (horse)                32               64
Equus asinis (ass/donkey)             31               62
Gallus domesticus (chicken)           39               78
Rana pipiens (frog)                   13               26
Cyprinus carpio (carp)                52               104
Drosophila melanogaster (fruit fly)   4                8
Pisum sativum (pea)                   7                14
Potato                                12               24
Ferns                                 100+             200+

Y notation as above.

Most of the statements made above are not absolute due to the fact that genetic mutations can sometimes occur. For example, some individuals may have a different number of chromosomes than the norm for the species, or the chromosomes may be of different length and makeup than the norm. Figure 1.4 gives an overview of some common ways that mutations can change the structure of a chromosome to make it different from the norm. Table 1.3 shows some common human syndromes that are the result of various such structural changes.

1.2 Function of the Cell and DNA

There are a great many important functions that cells and their embedded DNA serve in an individual. Here, we will focus on those functions that we will need as part of the course. Again, please refer to other biology or genetics textbooks for much more detail in this area.

First, recall that for our purposes we will think of a chromosome as a linear sequence of nucleotides. So when we describe a chromosome or a section of a chromosome, it will be with a string of letters, with every letter in the string being one of A, C, G, or T. As mentioned earlier, taken together, there are about 3 billion total letters (i.e., nucleotides) across the 23 human chromosomes that constitute the DNA of an individual. The entire set of DNA in a typical cell in an individual is referred to as its genome.

Figure 1.3: 23 pairs of chromosomes in a human male

Figure 1.4: Some common chromosomal abnormalities

Most of this DNA in a cell is what we call junk DNA. This is DNA that has no apparent bodily function at all, and in a sense can be considered obsolete. In humans, it is estimated that 90% of the 3 billion base pairs fall into this category of junk DNA. Although such DNA plays no biological role in an individual, we will see that it can still play a very important role in genetic analysis.

The DNA that does have biological function is organized into what we refer to as genes. A gene is a specific location on a chromosome where the DNA contains information required for a particular function, and can be switched on and off on demand. Genes are responsible for the inherited characteristics that distinguish one individual from another. The region of DNA that makes up a gene is often called a coding region. This terminology emphasizes the fact that the underlying DNA nucleotide sequence of a gene is actually a code for creating a certain protein, as we’ll see below.

There are many such locations, or genes, in a genome. Table 1.4 shows the estimated number of genes for selected organisms. Our first instinct would be that humans would contain many times more genes than other organisms since our bodies are vastly more complex than, say, a fly or a plant. However, this table shows that this is not true; in fact, humans have only 25% more genes than a typical plant, and only twice as many as fruit flies. We do have to remember that the numbers shown in this table are just estimates, and it was not too long ago that scientists believed that a human genome carried on the order of 100,000 genes.

Now let’s spend some time looking more closely at the function of a cell. The ultimate function of the cell is to take genes, as coded in the DNA in the nucleus, and turn them into proteins that are used throughout the cell and the body for various activities. The main functions of the cell can be broken into two areas: (1) Transcription, which happens in the nucleus, and (2) Translation, which happens outside of the nucleus. The two are very intertwined, and we will discuss them below. Figure 1.5 gives an overview of these functions.

Figure 1.5: Overview of cellular function in the nucleus

In short, transcription can be described as the formation of RNA (pre-messenger RNA) from DNA. What is this RNA? For our purposes, we can consider it as very similar to DNA. More technically, RNA is a molecule found in all living cells, just like DNA. It plays a role in transferring information from DNA to the protein-forming system of the cell outside of the nucleus. Chemically, it is similar to DNA, though the deoxyribose sugar of DNA is replaced with ribose sugar in RNA, and all thymine bases (T) in DNA are replaced with uracil (U) in RNA.

Now let’s get back to transcription. To put it simply, a gene is transcribed from DNA to RNA in such a way that the RNA is an exact copy of the DNA, except with T’s in DNA replaced by U’s in the resulting RNA copy. How does the cell know where a gene is among all this DNA? There are many complex mechanisms for this. The simplest and most direct thing to point out is that every gene starts with the DNA sequence ATG (resulting in an RNA sequence of AUG).

That doesn’t mean at all that every ATG sequence in a genome represents the start of a gene sequence. There are other important factors involved, such as promoter regions which guide the necessary enzymes to the right location to begin transcription.

Also, a gene always ends with one of three stop sequences in the DNA: TAA, TAG, or TGA. Once one of these three sequences is reached, transcription of DNA to RNA stops.
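To make the start and stop signals concrete, here is a small Python sketch (an illustration under the simplified view above, ignoring promoters and the rest of the machinery; the sequence is made up) that locates a gene-like stretch from the first ATG to the first in-frame stop sequence, then transcribes it to RNA:

    STOPS = {"TAA", "TAG", "TGA"}

    def transcribe(dna):
        """Transcription, for our purposes: copy the DNA, replacing T with U."""
        return dna.replace("T", "U")

    def find_gene(dna):
        """Toy scan: return the stretch from the first ATG through the first
        in-frame stop sequence, or None if no such stretch exists."""
        start = dna.find("ATG")
        if start == -1:
            return None
        for i in range(start, len(dna) - 2, 3):
            if dna[i:i + 3] in STOPS:
                return dna[start:i + 3]
        return None

    gene = find_gene("CCATGGCTTTGTAACC")   # a made-up sequence
    print(gene)                            # ATGGCTTTGTAA
    print(transcribe(gene))                # AUGGCUUUGUAA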

The actual biochemical process by which this happens is very involved. Below are links to two short animated movie clips on the internet which do a good job of showing the process.

http://www.lakesideschool.org/upperschool/departments/science/bio/17-06-Transcription.mov
http://www.memorial.ecasd.k12.wi.us/Departments/Science/MAllen/Private Docs/Biotech/Transcription.mov

Ultimately, this RNA copy of a gene makes its way out of the nucleus of the cell to eventually get turned into a certain protein. But before that happens, a couple of other important things occur. First, the RNA molecule that immediately results from a transcription is more correctly referred to as pre-messenger RNA. This pre-messenger RNA typically contains introns, which are segments of the RNA that get removed, or spliced out, after transcription. An intron of a gene is sometimes called a non-coding region of the gene, because although it is physically a part of the gene, it is a region that is removed before its code can be translated into a protein.

The result of intron removal is mRNA or messenger RNA. We can think of mRNA as the final result of the transcription process. This molecule is moved out of the nucleus and carries the information that is translated during protein synthesis, which will be discussed below.

A class of organic molecules that plays an important role in cell biology and bioinformatics is amino acids. There are twenty amino acids, as shown in Figure 1.6 (along with their common three- and one-letter abbreviations). Amino acids combine in linear arrays to form proteins in living organisms.

Let’s say more about proteins. A protein is a complex molecule consisting of a particular sequence of amino acids (sometimes called peptides or residues in this context). All proteins consist of carbon, hydrogen, oxygen and nitrogen. Proteins vary in structure according to their function, with the three types of protein being fibrous, globular and conjugated proteins. Each protein has a unique, genetically defined sequence which determines its specific shape and function. Proteins serve as enzymes, structural elements, hormones, immunoglobulins, and are involved in oxygen transport, muscle contraction, electron transport and other activities throughout the body and in cells. It suffices to say that proteins are extremely important molecules that play a role in the enormous complexities of how an organism functions.

How are proteins created in the cell? After mRNA is carried outside of the nucleus, the process of protein synthesis, or translation, occurs. Protein synthesis is the process in which individual amino acids are connected to each other in a specific order as dictated by the sequence of nucleotides in DNA/RNA. As mentioned above, this sequence of amino acids is the protein.

The biochemical process by which protein synthesis works is quite complex. We will try and get across the important points here:

1. mRNA reaches the cell cytoplasm and is processed in a linear sequence from the start of the mRNA molecule to the end by a complex molecule called a ribosome.

2. The mRNA is processed three bases at a time. A set of three bases is called a codon (also referred to as a triplet of bases, or just a triplet). The specific sequence of bases in the codon determines which amino acid is next attached to the growing protein. Which amino acid is attached is determined by the genetic code, which is shown in Figure 1.7. Since there are 4³ = 64 possible codons and only 20 amino acids, it should be noted that different codons can map to the same amino acid (e.g., UUU and UUC both map to the Phe amino acid). A small translation sketch follows this list.

3. Recall from our discussion of transcription above that the first codon (the start codon) is always (with rare exceptions) AUG, and the last codon (the stop codon) is always either UAA, UAG, or UGA. The stop codon does not code for an amino acid; it just tells the ribosomal unit where to stop processing the mRNA.

4. After the last amino acid is added to the protein chain and synthesis stops, the protein folds up into a characteristic shape, as determined by the biochemical interactions of the amino acids in the chain. It then goes on its way to serve its function in the cell or the rest of the body.
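To tie steps 2 and 3 together, here is a small Python sketch of codon-by-codon translation. It is an illustration only: just a handful of the 64 entries of the genetic code in Figure 1.7 are hard-coded, enough to run the example; a real table would list all 64 codons.

    # A fragment of the genetic code (see Figure 1.7); a full table has 64 codons.
    CODE = {
        "AUG": "Met", "GCU": "Ala", "UUG": "Leu",
        "UUU": "Phe", "UUC": "Phe",
        "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
    }

    def translate(mrna):
        """Attach amino acids codon by codon until a stop codon is reached."""
        protein = []
        for i in range(0, len(mrna) - 2, 3):
            amino = CODE[mrna[i:i + 3]]
            if amino == "STOP":
                break
            protein.append(amino)
        return "-".join(protein)

    print(translate("AUGGCUUUGUAA"))   # Met-Ala-Leu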

As with transcription, the process of protein synthesis is much more complex than we need to delve into here. Below are links to two animated movie clips that can be found on the web which give a simple yet good overview of the process.

http://www.biology.arizona.edu/molecular bio/problem sets/nucleic acids/graphics/protsyn95.mov
http://www.whfreeman.com/lodish4e/content/video/Protein Synthesis.mov

1.3 Genetics

With most of our necessary cell biology background complete, we can begin discussing how the field of genetics fits into all of this. We can think of genetics as the study of inheritance. Let’s review some important

Figure 1.6: The 20 amino acids and their chemical forms

genetic terms.

One of the most general terms that we will use throughout the course is locus (plural: loci). It is used to refer to a location (any location) in the genome in which we have an interest. For example, a locus that we might be studying is a sequence of 150 bases at a particular spot on the short arm of chromosome 7 in humans. It is a very general term because that location doesn’t have to have any specific properties; it could be the location of a gene, or just some other non-coding region of DNA along the chromosome.

Some loci are genes, and some are not. As we’ve seen earlier, genes are of particular interest to us because they are translated into proteins and so are the key to how a cell and entire organism works. Although a typical genome has many thousands of genes, they are still a rarity within the genome as a whole (we mentioned earlier that in humans, 90% of DNA is not genes).

Even though gene loci are certainly of interest, it is not true that no other regions of DNA are interesting. In fact, regions of DNA outside of genes are extremely useful for genetic analysis, as we’ll see, and also in some cases do play a biological role.

A term often used to more specifically refer to a locus that is not a gene is genetic marker, or just marker.

Figure 1.7: The genetic code that translates RNA triplets into amino acids

Think of a marker as a region of DNA that doesn’t necessarily have any biological function, but whose inheritance from one generation to another can be monitored. We’ll discuss markers more fully in later chapters.

Now, consider a particular locus that we’ll call locus A (we’ll use boldface capital letters to refer to a generic locus). For the sake of simple discussion, let’s say that this locus represents a certain region of DNA consisting of just five bases on chromosome number 3 in humans.

Now consider a particular individual. If we performed an analysis of one copy of that person’s chromosome 3, we would be able to view the exact sequence of bases at locus A. It might be, say, AATGC. If we then analyzed this locus on this individual’s other copy of chromosome 3, the base sequence might be AGTGG.

Further, say we then select another individual and analyze locus A on both copies of chromosome 3. We may notice that one of the copies has the sequence AATGC (just like one of the chromosomes of the first individual), but maybe the second copy has the sequence TTTAC (different from any other sequence we’ve come across so far).

In other words, the DNA sequence that is found at a locus may have alternative forms from one chromosome and one individual to another. These different forms are called alleles. In the example above, we found three different alleles for locus A among the four chromosomes we analyzed (two for each of the two individuals). If we investigated many more individuals, it is likely that there would be more allelic forms found in the population. Depending on the locus in question, there might be only one allelic form to be found in the population (every chromosome on every individual has exactly the same DNA sequence at that location), or there may be two allelic forms, or three, or possibly many different alleles.

Let’s continue the example above and discuss a little notation. We might refer to the DNA sequence AATGC as Allele 1, or A1, the sequence AGTGG as Allele 2, or A2, and the sequence TTTAC as Allele 3, or A3. In other words, it is common to number the different allelic forms of a locus and give them some sort of shorthand notation like A1, A2, and A3 in this example.

Example 1.1: ABO Blood group: Let’s look at an example of a real locus, in fact a gene, in humans. The ABO locus can be found on Chromosome 9. It is the well-known gene that determines a person’s blood type. There are three alleles (three different forms of this gene) present in the human population. The three allelic forms are referred to as A, B, and O. These are just shorthand descriptions of the three different sequences of nucleotides that have been found to occur at that location. These alleles produce different protein sequences through transcription of the gene, and consequently serve slightly different roles with regard to antigens on the surface of red blood cells:

• The A allele produces a protein that serves as a particular antigen (the “A” antigen) that can be found on the surface of red blood cells.

• The B allele produces a different protein that serves as a different antigen (the “B” antigen) that can be found on the surface of red blood cells.

• The O allele does not produce any such antigens when transcribed and translated into a protein.



In genetic analysis, we typically want to describe the alleles carried by an individual at a locus. This is called the individual’s genotype. For most loci, the genotype of an individual consists of a description of which two alleles that individual carries for that locus (since there are two copies of each chromosome).

Going back to the simple five base long locus we have discussed above, we would say that Individual 1’s genotype at locus A is A1A2, and Individual 2’s genotype is A1A3. When we write the genotype of an individual, we just list the shorthand notation for the two alleles the individual has. Also, it is conventional to always list the alleles in a standard order, such as numerical order in this example. There is no sense of ordering to the two chromosomes we investigated, so there is no particular meaning to the order we list the alleles in (which is why we might as well list them numerically or alphabetically as the case may be).

As a side note, notice that the one situation (for humans and most species) where a genotype would not consist of a listing of two alleles is for analysis of the sex chromosome in males. For example, if the locus we were studying was on the X chromosome, then a male only has one copy of that chromosome in each cell, and so would only have one allele for that locus. Notice that a female studied for that same locus would still have two alleles representing their genotype since they do have two copies of chromosome X.

Example 1.2: ABO Blood group: The possible genotypes that an individual can have for this gene are AA, AB, AO, BB, BO, and OO. These are the only six possibilities. Again, notice that order doesn’t matter when listing the alleles that make up a genotype. So BO and OB are the same genotype; such an individual has one chromosome with the B allele and one with the O allele.



With regard to genotypes, an individual whose two alleles at a locus are the same is called a homozygote at that locus. An individual with two different alleles at a locus is called a heterozygote at the locus. For example, an individual whose genotype is AA at the ABO locus would be called a homozygote (or said to be “homozygous for the A allele”). An individual whose genotype is BO at that locus would be called a heterozygote.

In summary, the genotype of an individual at a locus describes the allelic forms of DNA that an individual possesses. One major difficulty with determining someone’s genotype is that it is in many cases extremely difficult or impossible to perform the molecular analysis necessary to determine the DNA sequence that exists on that individual’s chromosomes at that locus. Thus, directly observing genotypes is often not possible or at the least, not practical.

But what is often much simpler to observe is what is called a phenotype. An individual’s phenotype is an outwardly visible or otherwise apparent characteristic of the individual. An observed phenotype may be the result of one or possibly many underlying genes, as well as be affected by the environment.

For example, a human’s eye color can be observed quite easily. So we would describe someone’s eye color phenotype as either blue, green, brown, hazel, etc. What is more difficult is to describe that individual’s eye color genotype. For one, the exact biological and genetic (and environmental) basis for eye color is not completely understood. Second, the genetic understanding we do have for eye color determination is a complex interaction among at least three genes. So it would take a potentially difficult molecular analysis to describe the genotype.

Example 1.3: ABO Blood group: Although there are six genotypes possible at the ABO locus, there are only four observed phenotypes in individuals. These four are the typical ways that a person’s blood type is denoted:

Phenotype    Corresponding Genotype(s)
A            AA, AO
B            BB, BO
AB           AB
O            OO

Furthermore, it is a simple matter of analyzing a person’s blood workup to determine their blood type (the phenotype). We can see from the table that knowing your blood type may also let you directly infer what alleles you must carry in your DNA at the ABO locus. For example, if you are blood type AB, it must be that your genotype is AB, meaning you have one A allele and one B allele. But if your blood type is A, you could either be genotype AA or AO. The reason for this has to do with the relationship between these alleles, which we will discuss further below.
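Anticipating that discussion, the collapse from six genotypes to four phenotypes in the table above is easy to express in code. A minimal Python sketch (illustrative only; the function is ours, not part of the notes):

    def abo_phenotype(genotype):
        """Blood type from an ABO genotype: A and B are codominant,
        and both are dominant to O."""
        alleles = set(genotype)      # the order of the two alleles is irrelevant
        if alleles == {"A", "B"}:
            return "AB"
        if "A" in alleles:
            return "A"               # genotype AA or AO
        if "B" in alleles:
            return "B"               # genotype BB or BO
        return "O"                   # genotype OO

    for g in ["AA", "AO", "BB", "BO", "AB", "OO"]:
        print(g, "->", abo_phenotype(g))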



The different possible allelic forms of a locus can interact in various ways in an individual. An allele is called a recessive allele with respect to another if its phenotypic effect is suppressed in individuals heterozygous for those two alleles. Therefore, a recessive allele can only be expressed phenotypically in individuals who are homozygous for it.

In the above paragraph, the allele that “suppressed” the recessive allele is called a dominant allele with respect to the other. Two alleles that interact in this way are said to have a dominant-recessive relationship.

The other relationship two alleles can have is a codominant relationship. In this case, both are expressed phenotypically in an individual who is heterozygous for those alleles.

Example 1.4: ABO Blood group: This locus continues to present a nice example for us because among its three alleles are all the various relationships mentioned above:

• The O allele is recessive to A and B. An individual has to have two copies of O (genotype: OO) for its effect to be evident (and therefore have the O blood type phenotype).

• The A allele is dominant to O. An individual with an AO genotype shows no phenotypic effect from having the O allele.

• The same holds for the B allele with respect to the O allele.

• The A and B alleles are codominant. An individual with the AB genotype has the effect of both alleles apparent in their phenotype (which is called the AB blood type).


Example 1.5: Albinism: Lack of pigment in humans results in an individual having albinism (i.e., an albino). A particular gene is associated with having albinism. There are two possible alleles, which we’ll refer to as A and a. Allele A is dominant to allele a (this, by the way, is a very standard notation for alleles with a dominant-recessive relationship). A person only has albinism if they have genotype aa. AA and Aa genotypes are normal with respect to pigment. Traits such as albinism are called recessive traits.



Example 1.6: Other Traits: Most traits in individuals are not simply determined by the alleles at one gene. Many traits are expressed as a result of an individual’s genotype for two, three, or possibly many genes. As we briefly mentioned earlier, eye color is known to be (mostly) a result of the interaction of three genes, but possibly more. In this more complex system, brown is dominant to green, which is dominant to blue, which is recessive to both.



Let’s finish this section by reviewing and presenting some further notation. In many analyses, we are interested in two or more loci at the same time. A general notation for the loci in an analysis will be to use consecutive capital letters such as A, B, C, etc. The allelic forms of a locus will generally be notated using the letter representing the locus subscripted by appropriate integers (such as A1, A2, A3, or B1, B2). This type of subscripted notation will particularly be used for alleles which have a codominant relationship.

For genes which have two alleles with a dominant-recessive relationship, our standard notation will be to name the dominant allele using the same capital letter as used to notate the locus. The recessive allele will be named using the lower-case version of that letter. Examples are alleles A and a for locus A, or alleles B and b for a locus B.

1.4 Genetic Inheritance

We will now review the basics of how DNA is inherited from one generation to the next. First recall that each individual has two copies of each chromosome (the two homologous chromosomes) in each diploid cell. We’ll ignore for now situations such as the sex chromosomes in males. One copy of each chromosome came from the individual’s mother and one from the individual’s father. So, at any particular locus, one allele possessed by an individual is one of the two possessed by the mother, and the second allele for the individual is one of the two possessed by the father.

Example 1.7: Various examples:

ABO Blood group mating example:

Mother’s Genotype: AO
Father’s Genotype: BO
Offspring Potential Genotypes: AB, AO, BO, OO
Offspring Cannot Be: AA, BB

Albinism: For this locus, two “normal” individuals who mate can only have an albino offspring if both were heterozygous (Aa). They could then both pass on the a allele to their offspring, making the offspring an aa genotype and therefore albino.

An AA mating with an Aa cannot produce albino offspring because offspring will always get at least one A allele from the first parent. Aa individuals are called carriers for recessive traits because although they don’t have the trait, they carry the gene for it and could potentially pass it on to their offspring.



The process of inheritance begins with a cellular process called meiosis. Meiosis creates four haploid cells (called gametes) from one diploid cell. Understanding meiosis will be key to many genetic analyses we’ll discuss. A simplified version of what happens in meiosis is:

1. Start with a diploid cell containing 2n chromosomes, where n is the species’ haploid chromosome number.

2. Each chromosome replicates itself to form a sister chromatid which is an exact copy of the original. At this point, we can think of the cell as having 4n chromosomes present.

3. The original homologous chromosomes attach to each other, which forms chiasma (pl. chiasmata) between them and produces crossover events, thereby exchanging genetic material (more on this later).

4. A much simplified version of what happens next is that four new haploid cells are formed, with one of the four copies of each chromosome going to each new cell in a random fashion. The fact that this segregation of chromosomes happens in a random fashion is very important for our forthcoming analyses (a small simulation sketch follows this list).
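As a quick illustration of the random segregation in step 4, here is a small simulation sketch in Python (ours, for intuition only): we mimic many meioses at a single locus for an Aa parent and check that about half of the resulting gametes carry each allele.

    import random

    def simulate_gametes(genotype, n=10_000):
        """Each simulated gamete receives one of the parent's two
        allele copies with equal probability."""
        counts = {}
        for _ in range(n):
            allele = random.choice(genotype)   # genotype is a 2-character string
            counts[allele] = counts.get(allele, 0) + 1
        return counts

    print(simulate_gametes("Aa"))   # roughly {'A': 5000, 'a': 5000}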

Below is a link to an animated movie that shows meiosis in detail.

http://fig.cox.miami.edu/Faculty/Dana/meiosis.mov

The products of meiosis are called gametes. They are haploid cells that have half the genetic information that their parent cell possessed and are involved in sexual reproduction. When two gametes join up in a mating process, for instance human male sperm and female egg gametes, the genetic information is linked together to make a diploid zygote and eventually a new individual with the combined genetic information of its maternal and paternal parents. Figure 1.8 shows an overview of the meiosis process.

Figure 1.8: Meiosis

Let’s walk through a few examples of matings and offspring to better understand genetic inheritance and meiosis. We’ll also discuss some simple probability concepts by introducing the idea of expected genotypic ratios.

Let’s first look at a simple example where we are studying inheritance at one locus. We can use the albinism gene as a basis for our discussion. Let’s refer to the locus as A, and refer to the two possible alleles as A and a just as before (a is the recessive allele causing albinism).

Now, let’s say that an individual with genotype Aa (female) mates with an individual with genotype AA (male). In notation, this mating is often written as Aa × AA, where the × is read as “cross”. What are the possible offspring genotypes, and in what proportions would we expect them to occur?

Let’s first walk through this problem by thinking about how meiosis works. Meiosis in the Aa individual will produce gametes of which 1/2 have the A allele, and 1/2 have the a allele. In other words, 1/2 of this woman’s eggs carry A and 1/2 carry a.

Meiosis in the AA individual will produce gametes all of which contain the A allele. All sperm of this man carry the A allele.

When mating occurs between the two, a union is formed between a random egg from the first individual and a random sperm from the second individual. This union forms a cell (a zygote) which is diploid and carries 46 total chromosomes. With regard to locus A, it should be easy to see that there is a 1/2 chance that this zygote will have genotype AA, since the male will always contribute an A allele and the female has a 1/2 chance of contributing an A allele. Similarly, there is also a 1/2 chance that the offspring will have genotype Aa and be a carrier just like his/her mother.

Such an analysis can be done much more quickly using a table like that in Table 1.5. The table is constructed by listing the gametes carried by one parent as the columns and listing the gametes carried by the other parent as the rows. The percent of each type of gamete for each parent is also listed. Then each cell in the table is filled in by writing down the resulting offspring genotype for that particular combination of gametes, and the product of the two associated percentages.

The cells of the table then give all possible offspring genotypes along with the percentage of offspring we expect to have that genotype.
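The table logic is mechanical enough to automate. Here is an illustrative Python sketch (ours, not part of the original notes) that computes the expected genotype proportions for any one-locus cross, combining duplicate genotypes just as we combine cells of the table:

    from fractions import Fraction
    from itertools import product

    def offspring_ratios(parent1, parent2):
        """Expected offspring genotype proportions for a one-locus cross.
        Parents are 2-character strings of single-letter alleles, e.g. 'Aa'."""
        ratios = {}
        for g1, g2 in product(parent1, parent2):   # one gamete from each parent
            genotype = "".join(sorted(g1 + g2))    # allele order is irrelevant
            ratios[genotype] = ratios.get(genotype, 0) + Fraction(1, 4)
        return ratios

    print(offspring_ratios("Aa", "AA"))   # AA and Aa, each with proportion 1/2

The crosses worked out in Tables 1.6 and 1.7 below can be checked the same way, with offspring_ratios("Aa", "Aa") and offspring_ratios("AO", "BO").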

Let’s look at a second example of a mating using the albinism gene. This time, two carriers Aa mate. Table 1.6 shows the results. First notice that since this was slightly more complicated than the last example, the results in the table need to be summarized. Specifically, the offspring genotype Aa occurs twice in the table, so we need to combine those cells and add up the percentages to get the total percentage for Aa. Notice that there is a 1/4 chance that an offspring of this mating will have albinism (i.e., have genotype aa), and a 1/2 chance that an offspring will be a carrier.

For one last one-locus example, let’s return to the ABO blood group locus and consider the mating of an AO genotype with a BO genotype. Table 1.7 shows the results. An offspring of this mating has an equal chance (1/4) to be any of the four blood type phenotypes. We will also look at an example where we are interested in two loci that are on different chromosomes. This will be covered in class.

There is another very important event that occurs during meiosis that we have only glossed over so far. It is called crossing over and occurs between pairs of homologous chromosomes. The overall result is that the two homologous chromosomes will exchange genetic material, causing alleles from one chromosome to be integrated into the other, and vice versa.

Further, it is possible that two crossing-overs can happen during meiosis on the same chromosome. This can complicate matters because the effect of one crossing over (in terms of the recombinant gametes produced) can then be negated by the second crossing over. This is the reason that most linkage analyses are conducted in very small regions of a chromosome, so that the probability of a second crossing over is very unlikely. Three or more can occur as well, with smaller likelihoods.

More details and examples on crossing over and recombination will be given during lecture. Also refer to Figures 1.9 and 1.10.

Loci that are on the same chromosome are often called linked loci. If the loci represent genes, they are, of course, more specifically called linked genes. You can think of this phrase as referring to the fact that the

Figure 1.9: Crossing Over

Figure 1.10: Crossing Over and Linked Loci

loci are physically “linked” to each other on the same strand of DNA.

An important quantity that we will introduce briefly now, but spend much more time discussing later inthe course is called the recombination fraction. A recombination fraction is always measured with regard to two loci. In a sense, it can be thought of as a measure of how close the two loci physically are to each other on a chromosome.

More specifically, the notation we will use is rAB where A and B are the two loci in question. The formal definition for this quantity is that rAB is the proportion of gametes that are recombinant with regard to the two loci in some number of observed meiosis events. Theoretically, it is the case that rAB ∈ [0, 0.5] since (as we saw in class) one-half of the gametes resulting from a meiosis are guaranteed to be non-recombinant.

In practice, however, it is possible to measure rAB to be larger than 0.5, but we typically rewrite it as 0.5 if this happens.

A value at or near 0 suggests that the two loci are very near each other physically on the chromosome, since there are very few crossing overs occurring between them. A value near 0.5 can be interpreted as meaning that the loci are either far apart on the same chromosome, or on totally different chromosomes.

If you followed our in-class demonstration of crossing-over closely, you should also notice that rAB will equal one-half of the crossover rate cAB between the loci, where cAB is defined as the percentage of meioses where an odd number of crossovers occur between A and B.
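To make the definition concrete, here is a minimal sketch (the function and the counts are hypothetical, for illustration only) of estimating rAB from scored gametes, truncating at the theoretical maximum of 0.5, and recovering the implied crossover rate cAB = 2 rAB:

    def estimate_r(n_recombinant, n_total):
        """Estimate r_AB as the observed proportion of recombinant gametes,
        truncated at the theoretical maximum of 0.5."""
        return min(n_recombinant / n_total, 0.5)

    r_hat = estimate_r(18, 200)   # hypothetical: 18 recombinants among 200 gametes
    print(r_hat)                  # 0.09
    print(2 * r_hat)              # implied crossover rate c_AB = 0.18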

More mating examples where we consider two loci linked on the same chromosome will be covered in class.

One use of estimating values of rAB for many pairs of loci is to develop a linkage map of a chromosome. This is a map of the relative positions of genetic loci on a chromosome, determined by estimating recombination percentages. Distance between loci is measured in units called centimorgans (cM). One centimorgan is defined to equal a 1% chance that a marker at one genetic locus will be separated from a marker at a second locus due to crossing over in a single generation. In human beings, 1 centimorgan is equivalent, on average, to 1 million base pairs (although it differs in different regions of the genome).

Linkage maps will be discussed in more detail later in the course. For now, you should know that they have been used extensively to help locate (i.e., pinpoint the physical location) of genes that cause diseases in humans, as well as genes that play other important roles.

Linkage Exercise: Corn (courtesy of http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/L/Linkage.html)

Start with two different strains of corn (maize).

• one that is homozygous for two traits: yellow kernels (C,C) which are filled with endosperm, causing the kernels to be smooth (S,S);

• a second that is homozygous for colorless kernels (c,c) that are wrinkled because their endosperm is shrunken (s,s);

• C is dominant to c, and S is dominant to s.

When these two strains are crossed, the kernels produced (the F1 generation) are all yellow and smooth (all will be CcSs) because of the dominance relationship.

Now, mate these F1 corn with the second strain above (the homozygous recessive strain ccss). Such a mating is called a test cross because it exposes the genotype of all the gametes of the strain being evaluated.

Analyze what would happen in this cross if

(1) the two genes were on different chromosomes, and

(2) the two genes were so closely linked that there was no chance of crossing over between them.

Then compare to the actual results (to be discussed in class).

Table 1.3: Common human syndromes caused by structural abnormalities in chromosomes

Monosomy: individual has only one of a chromosome pair

• Turner’s syndrome - XO - Commonest of all abnormalities of chromosome number - only 2% of zygotes with XO survive to term. This individual is female with underdeveloped sex organs, webbed neck, shorter than average height, with heart and circulatory defects common.

Trisomy - individual has three (or more) of a chromosome pair

• Superfemale - XXX, XXXX, XXXXX - Mental retardation is common, as are body structure derangements, as well as reduced fertility, especially in XXXX and XXXXX.

• Klinefelter’s syndrome - XXY, XXXY, XXXXY - Body structure derangements include feminization and elongated extremities. These males are typically sterile.

• Supermale - XYY - Little change in outward appearance is evident.

• Down’s syndrome - Trisomy 21 (3 copies of Chromosome #21) - Mental retardation, shorter than normal stature and increased risk for leukemia are some of the manifestations of this condition. Down’s affects 1 in 500 births, and is detected in 20% of the spontaneously aborted fetuses.

• Patau’s syndrome - Trisomy 13 - Cleft lip and palate, cerebral, ocular and cardiovascular defects are evident. Life expectancy is less than 3 months.

• Edward’s syndrome - Trisomy 18 - Life expectancy is less than 6 months.

Deletions - large areas of a chromosome missing

• cri-du-chat - deletion on short arm of chromosome 5 - Phenotype evidenced by small head, mental retardation and catlike cry.

• Duchenne muscular dystrophy - deletion on short arm of X - Manifested as a muscular degenerative disease causing death generally by age 18.

• Lesch-Nyhan syndrome - deletion of HGPRT locus on long arm of X - Self-mutilation is common in Lesch-Nyhan sufferers.

Deletion and translocation - missing and moved to another chromosome

• Philadelphia chromosome - deletion of half of the long arm of chromosome 22 - deleted material may be translocated to long arm of chromosome 9 - found only in leukocytes of patients with myeloid leukemia.

• Retinoblastoma - deletion of material of long arm of chromosome 13 - Manifested as a childhood tumor of the retina.

Table 1.4: Estimated number of genes for various species

Organism     Estimated No. Genes
Bacterium    3,025
Yeast        6,225
Fruit fly    12,000
Plant        20,000
Human        25,000

Table 1.5: Computing genotype ratios for Aa × AA

                     Parent 1 Gametes
                     A (1/2)        a (1/2)
Parent 2 Gametes
A (1)                AA (1/2)       Aa (1/2)

Table 1.6: Computing genotype ratios for Aa × Aa

                     Parent 1 Gametes
                     A (1/2)        a (1/2)
Parent 2 Gametes
A (1/2)              AA (1/4)       Aa (1/4)
a (1/2)              Aa (1/4)       aa (1/4)

Table 1.7: Computing genotype ratios for AO × BO

                     Parent 1 Gametes
                     A (1/2)        O (1/2)
Parent 2 Gametes
B (1/2)              AB (1/4)       BO (1/4)
O (1/2)              AO (1/4)       OO (1/4)

Chapter 2

Probability and Statistics Review

This chapter should serve as a good review of basic concepts in probability and statistics. The material in most sections should not be new to you, but there are a few topics which you most likely have not seen in a course before. We will cover these in more depth in the course.

2.1 Introduction to Random Variables

When an experiment is conducted, many characteristics are often measured during and following the experiment. The exact numerical value obtained for any particular measurement will depend on many factors such as environmental factors, experimental conditions, or just unexplained factors that always result from taking measurements in real situations. We will call such a measurement a random variable.

A random variable is a variable whose value, when measured during or following an experiment, may be one of two or more possible numeric values. The particular value measured depends on various random factors. Capital letters like X and Y will be used to represent a random variable.

There are two key parts to this definition. First, the value measured for a random variable must be numeric. Often, the measured value is restricted to some domain, such as the positive integers or the real numbers between zero and one, as we’ll see in examples below.

Second is the concept of random. The value of the random variable that results from an experiment is not known in advance. All we know is that it will be some value from a domain of possible values. Some values are more likely to result than others. The relative likelihoods of the various possible values in the domain are what we define when we set up our probability model for the experiment, as we will discuss in this chapter. The domain of possible values of a random variable (also to be called the set of possible values or just the possible values) will be denoted with the script version of the letter that symbolizes the random variable (such as 𝒳 for X or 𝒴 for Y).


There are two basic types of random variables: discrete and continuous. A discrete random variable is a random variable whose domain of possible values is finite or countably infinite. Typically, the possible values for a discrete random variable will be all non-negative integers, all positive integers, or some finite set of integers. Below are some simple examples of discrete random variables along with their possible values:

X = Result from one roll of a die. 𝒳 = {1, 2, 3, 4, 5, 6}.

X = Number of coin tosses it takes to get the first heads. 𝒳 = {1, 2, 3, ..., ∞}.

X = Number of crossovers that occur between two loci during a meiosis. 𝒳 = {0, 1, ..., N}, where N might be some known maximum number of possible crossovers that could occur depending on the two loci and other biological factors.

X = Number of matching bases when comparing two strands of DNA, each of length N. 𝒳 = {0, 1, ..., N}.

X = Number of LEU amino acids in a randomly selected human protein product. 𝒳 = {0, 1, ..., N}. Again, the maximum possible value N might be defined as the size of the longest known protein.

From these examples, we can see that discrete random variables usually result from counting the number of times something occurs during an experimental period.

A continuous random variable is a random variable whose set of possible values is uncountably infinite, such as an interval of the real line. Below are some simple examples of continuous random variables along with their possible values:

X = Recombination percentage between two loci measured from an animal mating experiment. 𝒳 = [0, 0.5].

X = Time until a certain cellular molecule degrades. 𝒳 = (0, ∞).

X = Molecular weight of a randomly selected RNA molecule. 𝒳 = (0, M), where M is some constrained upper bound for such a quantity.

As we can see, continuous random variables are often variables that measure quantities such as times, lengths, weights, and percentages.

Let us relate this discussion of random variables back to our discussion of probability in Chapter 3. We still are conducting experiments, and we are still interested in the probability of various outcomes. However, now all interesting events for which we want to calculate probabilities will be referred to using the notation of random variables.

For example, if we let X = Number of matching DNA bases across two aligned DNA strands of length N, we might be interested in events such as finding that more than half of the bases match, or the event that all of the bases match. Written in terms of the random variable X, we would write these probability statements as P(X > N/2) and P(X = N), respectively. Notice that the events inside of the probability statements are each denoted by mathematical equalities or inequalities involving X, as opposed to a list of outcomes.

To provide an example for a continuous random variable, let X = recombination percentage between two loci. We may want to calculate probabilities such as P(X < 0.2) or P(X > 0.45). Again, the events in question are written in terms of the random variable. In this case, since X is continuous, the set of values of X covered by these events is a continuous range of real numbers.

To conclude this section, we note that the two types of random variables are very different in nature when it comes to defining and working with probability models, so they must be dealt with separately. In the sections that follow, we will introduce discrete random variables first, then continuous random variables. We will cover a number of special cases of each of these types that are important in probability modeling.

2.2 Discrete Random Variables

In this section, we will begin our discussion of random variables by focusing on the discrete type.

2.2.1 Probability Distribution

Any discrete random variable has an associated probability distribution. This is also called its probability mass function or pmf. The probability distribution is a list of the possible values of the random variable along with each of their probabilities. In more complex cases, this information is more easily displayed by use of a formula instead of a “list”.

For a probability distribution to be valid, it must satisfy two properties. These properties are a direct result of the axioms of probability we saw in Chapter 3. They are

∑_{x ∈ 𝒳} P(X = x) = 1

P(X = x) ≥ 0 for all x ∈ 𝒳.

The first states that the probabilities assigned to each value in the domain of X must sum to one. This is equivalent to our probability axiom which states that the probability of the sample space must be exactly one.

The second property above simply restates that probabilities must be non-negative numbers. As we saw in Chapter 3, this combined with the first property above leads to the further restriction that all probabilities are in fact between zero and one inclusive.

Example 2.1: Let X = number of crossover events during a particular meiosis along a certain length of DNA. We will specify a probability distribution for this random variable X. Say $\mathcal{X} = \{0, 1, 2, 3, 4, 5\}$. Then we might write the probability distribution as:

.5 x = 0 .3 x = 1   .15 x = 2  P(X = x) =  . x  03 = 3 .01 x = 4   .01 x = 5   Note the probabilities add to one, so this is a valid probability distribution.



It is important to keep in mind that a probability distribution, as in Example 2.1, is also, in fact, a function. In the example, the left hand side $P(X = x)$ is often written $p(x)$ or $p_X(x)$ to make it clear that we are defining a function of x (recall that this is also called a probability mass function). The right hand side defines the function; for various real numbers that we might plug into the function, it tells us what the function returns. In other words, $p(0) = 0.5$ and $p(3) = .03$, for example.

Also, the probability distribution as given in the example is not complete as written. For example, it doesn't specify what the function returns if we plug, say, $x = 2.6$ or $x = -32$ into the function p. In fact, to be complete, the probability distribution should also have one more case that states $P(X = x) = 0$ for all other real numbers x. This information is commonly left out of the definition of a pmf, and it is assumed that any values for which the function has not been defined have probability zero.

Let us continue with one further note about this example. Why did we use those particular probabilities in our definition of the probability distribution? At this early stage of our discussion of random variables, it will suffice to consider that they were determined based on our experience with collecting this type of data and with this type of experiment. The given probabilities may or may not be validated by data we ultimately collect.

One word on notation is necessary before we move on. Notice that we have used two versions of the letter "X" in the example - a lowercase and uppercase version of the letter. There is a fine but important distinction between the two. The uppercase version of the letter (i.e., X) denotes the random variable itself. In other words, X is the random variable that represents the number of crossovers. The lowercase version (i.e., x) represents a particular possible value of the random variable. So the statement $P(X = x)$ can be read as "the probability that the random variable X will equal the particular value x." Different values plugged in for x lead to different probability statements, such as $P(X = 1)$ or $P(X = 3)$.

We can see that the pmf of a discrete random variable provides us with the most basic information about the random variable. Namely, it gives us the likelihood of any particular value being observed when the experiment is conducted. When we write down the probability distribution, we are making a statement about what values we believe are more or less likely to result. In the crossovers example, we believe the value 0 is the most likely to be observed (i.e., no crossovers), while 1, 2, or more crossovers are less and less likely to occur. In fact, the model states that in 50% of such experiments, there will be no crossovers.

The pmf allows us to calculate any probability question we may ask. For example, the probability that there will be 2 or more crossovers during the meiosis is

$$P(X \ge 2) = P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5) = 0.2$$
and the probability that there will be an even number of crossovers is

$$P(X \text{ is even}) = P(X = 0) + P(X = 2) + P(X = 4) = 0.66.$$

An interpretation we can attach to this last statement is “if we were to observe such a meiotic event over and over again many times, we would observe an even number of crossovers between the loci 66% of the time.”
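For readers who like to verify such sums with software, here is a minimal Python sketch (any language with basic arithmetic would do) that stores the pmf from Example 2.1 and recomputes both probabilities; it is purely illustrative:

```python
# pmf from Example 2.1, stored as {value: probability}
pmf = {0: .5, 1: .3, 2: .15, 3: .03, 4: .01, 5: .01}

# Validity check: the probabilities must sum to one.
assert abs(sum(pmf.values()) - 1.0) < 1e-12

# P(X >= 2): add the probabilities of every value in the event.
print(sum(p for x, p in pmf.items() if x >= 2))      # about 0.2

# P(X is even): same recipe with a different event.
print(sum(p for x, p in pmf.items() if x % 2 == 0))  # about 0.66
```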

2.2.2 Cumulative Distribution Function

Another special function associated with any discrete random variable is its cumulative distribution function or cdf. This function is defined for any real number x as the probability that the random variable will be less than or equal to x. The function notation F is often used to represent the cdf of a random variable. By definition, we can write

$$F_X(x) = P(X \le x).$$
Note that we have subscripted the F function notation with the letter of the random variable. This makes it clear which random variable's cdf we are talking about.

Based on the crossover model in Example 2.1, we would have

$$F(x) = \begin{cases} 0 & x < 0 \\ .5 & 0 \le x < 1 \\ .8 & 1 \le x < 2 \\ .95 & 2 \le x < 3 \\ .98 & 3 \le x < 4 \\ .99 & 4 \le x < 5 \\ 1 & x \ge 5 \end{cases}$$
Be sure to notice that this function satisfies our definition for a cdf. Plug in any real number x, and it returns $P(X \le x)$. This includes values not necessarily in the random variable's domain, such as $-1$, $3.4$, or $7.2$ in this example.

Notice that the pmf and cdf both provide us with the same information. One can always be derived from the other. Which we use is often a matter of convenience.
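To illustrate the derivation in one direction, here is a short Python sketch: the cdf of Example 2.1 is just the running sum of the pmf, and evaluating it at any real number means summing the probabilities of all values at or below that number.

```python
values = [0, 1, 2, 3, 4, 5]
probs = [.5, .3, .15, .03, .01, .01]

def F(x):
    """cdf: P(X <= x) for any real x; 0 below 0 and 1 at or above 5."""
    return sum(p for v, p in zip(values, probs) if v <= x)

# Reproduces the step heights above: about .5, .8, .95, .98, .99, 1.
print([F(v) for v in values])

# Works for values outside the domain too, as a cdf must.
print(F(-1), F(3.4), F(7.2))  # 0, about .98, 1.0
```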

2.3 Continuous Random Variables

We have first discussed discrete random variables and some introductory applications of such random variables. In order to progress into more topics, we need to understand continuous random variables as well.

As we saw earlier, the key difference for a continuous random variable is that its set of possible values is uncountably infinite, or in other words, is a range of real numbers. That range could begin at $-\infty$, or end at $+\infty$, or both. Clearly, random variables that measure distance or time will tend to fall into this category, as well as other quantities that can be measured with fine precision. Actually, sometimes even a truly discrete random variable may be modeled as a continuous random variable if it has a very large number of possible values. For example, if $X \sim \text{Bin}(200, p)$, then the range of values for X is $\{0, \ldots, 200\}$, which is often modeled continuously (as we'll see later).

Some examples of continuous random variables we might come across are (some were first mentioned earlier):

$X =$ Time until a protein degrades in the cell: $\mathcal{X} = (0, \infty)$

$X =$ Distance between two loci on a chromosome: $\mathcal{X} = (0, \text{Chromosome Length})$

$X =$ Recombination fraction between two loci: $\mathcal{X} = [0, .5]$

$X =$ Log of relative expression level for a certain gene: $\mathcal{X} = (-\infty, \infty)$

$X =$ Amount of a certain antibody produced by a random individual: $\mathcal{X} = (0, \infty)$

2.3.1 General Details About a Continuous Random Variable

For a discrete random variable, its pmf or probability mass function defines all that we need to know about the random variable. We use it to understand the distribution of possible values, directly calculate probabilities, and compute the mean and variance of the random variable.

A continuous random variable has an analogous function associated with it called its probability density function or pdf. There is a very important difference between the two, which the difference in terminology tries to get across. A pmf for a discrete random variable is defined (with positive probabilities) only for a finite or countably infinite set of possible values - typically integers. A pdf for a continuous random variable is defined for all real numbers in the range of the random variable. It is a continuous function, more typical of functions encountered in a calculus course. For this reason, we will use the notation f(x) to denote a pdf.

To be clear that it is the pdf of the random variable X, we will often subscript f with X, as in $f_X(x)$.

Now, the choice of which function to use for a certain random variable has everything to do with how we want to model probabilities of various sets of values occurring. The overriding concept here for a continuous random variable is that AREA = PROBABILITY. More specifically, the area under the pdf curve between points a and b is the same as the probability that the random variable will have a value between a and b. Since the computation of an area is an integration problem in calculus, we can write this mathematically as

$$P(a \le X \le b) = \int_a^b f_X(x)\,dx.$$
To view this from a graphical standpoint, let's consider the random variable X = Height of a randomly selected adult (in inches). Clearly, we can consider X as a continuous random variable, and we need to decide how to model such probabilities. In other words, we need to select a function f(x) that does a reasonably good job of representing the true distribution of heights in adults. Suppose we choose a certain function f(x). Without worrying for now about the actual functional form of this f we selected, let's plot it against x and take a look at the resulting picture (by the way, let's also not worry for now about how we go about choosing the f we choose; we'll discuss that later):

[Figure: a bell-shaped density curve f(x) plotted against x, with axis ticks at 57, 60, 63, 66, 69, 72, 75 inches]

Remembering that area and probability are the same, this plot suggests that a high percentage of adults are between, for example, 63 and 69 inches tall (since the curve has a lot of area under it between those points). Precisely, that percentage is $P(63 \le X \le 69) = \int_{63}^{69} f(x)\,dx$, or viewed graphically as the area shaded in below:

[Figure: the same density curve with the area under f(x) between 63 and 69 shaded]

It also suggests that a very small percentage of adults are taller than 75 inches, or shorter than 57 inches (since the curve has very little area under it to the right and left of those points, respectively). We could get more detailed and describe the probability, or percentage of adults, that have a height between any two values by observing the area under the curve between those values. All in all, we can probably agree that this particular f(x) does a reasonable job of describing the heights of adults. Its shape seems like an accurate description of real life.

2.3.2 Properties of the density function f(x)

It is probably clear that the f(x) we decide to use as the pdf for X cannot be just any function at all. There are a couple of key restrictions that it must adhere to if our concept of Area=Probability is going to make sense. These restrictions are:

1. $f(x) \ge 0$ for every x

2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$

Both are necessary if we are to be sure that areas we calculate will adhere to rules of probability. The first is needed because if we allow f(x) to drop below the x-axis, then certain area calculations, and hence probability calculations, would come out negative. So, the curve f(x) must always be at or above the x-axis.

The second is equivalent to the fact that the probability of something occurring is 1, or 100%. If we allow a pdf to be defined such that the entire area under the curve is more than 1, for example, that would lead to certain probability calculations being more than 1, which is invalid. If we allow it to be such that the entire area under the curve is less than 1, then an outcome may occur that we haven't accounted for in our model.

These restrictions are important, but don't limit our choices for f(x) very much. There are still an infinite number of possible functions that meet these requirements. In particular, consider a function g(x) which is always non-negative and whose area under its curve is some number $A \ne 1$. Then the function $f(x) = g(x)/A$ will be a valid pdf, and will have the same basic shape as g(x), just scaled by A. So for any function g(x) that doesn't have an area of 1, we can turn it into a valid pdf just by dividing by that area.
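A quick numerical illustration of this rescaling trick, sketched in Python with scipy; the particular g chosen here is an arbitrary example, not anything from the notes:

```python
import numpy as np
from scipy.integrate import quad

# An arbitrary non-negative function on (0, infinity); not a pdf as is.
def g(x):
    return np.exp(-x) * (1 + np.sin(x) ** 2)

# Its total area A is some number other than 1 ...
A, _ = quad(g, 0, np.inf)

# ... but dividing by A yields a valid pdf with the same shape.
def f(x):
    return g(x) / A

area, _ = quad(f, 0, np.inf)
print(A, area)  # area is 1 up to numerical error
```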

By the way, it is the second fact that makes it often unnecessary for us to label the units on the vertical axis when we plot f(x). We know that the units must be such that the area under the plotted curve equals one. And, usually the exact height of the curve at any particular point is not helpful to know. We just need to know the area under the curve between two points.

One other thing to note is that, since the area under a single point is precisely zero, $P(X = a) = 0$ for any real number a. That is, the calculated probability that a continuous random variable will equal a particular real number is zero. We can only get positive probabilities if we make the calculation over a contiguous range of real numbers such as 63 to 69 or 72 to $\infty$ in the above examples.

This fact also makes a "less than" sign the same as a "less than or equal to" sign (and the same for "greater than") in probability calculations. In other words, $P(X < 64) = P(X \le 64)$. We will still strive to use the inequality sign appropriate to the situation.

2.3.3 Cumulative Distribution Function

The cumulative distribution function (cdf) for a continuous random variable is the continuous function defined by

$$F_X(x) = P(X \le x)$$
for any real number x. Since the function F is defined as a probability, it only returns values in the range [0, 1]. Plugging any real number x into $F_X(x)$ returns the probability that the random variable X will have a value less than or equal to the number x.

Note that $F_X(x)$ and $f_X(x)$ are closely related by this definition:

$$F_X(x) = \int_{t=-\infty}^{t=x} f_X(t)\,dt$$
where t has been used as the dummy variable of integration to be perfectly correct from a calculus standpoint. From the other direction, we have

$$f_X(x) = \frac{dF_X(x)}{dx}$$
and so it is clear that given one of these functions, we can determine the other (assuming the integral or derivative is not too difficult!).

2.4 Expected Value and Variance

A pmf gives all the information we need about a discrete random variable, while a pdf does the same for a continuous random variable. We can calculate the probability of any event we want from them. However, we often like to summarize the distribution more concisely, such as for the purpose of comparing one random variable to another. Two common calculations that are made for a random variable are called the mean or expected value of the random variable, and the variance of the random variable.

The mean of a random variable is defined as the average value that would be observed for the random variable if it could be observed over and over again an infinite number of times (or more practically you might think of this as just many, many times). It is defined as

$$\mu_X = \sum_{x \in \mathcal{X}} x\,P(X = x) \qquad (2.1)$$
for a discrete random variable, and

$$\mu_X = \int_{\mathcal{X}} x\,f(x)\,dx \qquad (2.2)$$
for a continuous random variable. The sum or integral is taken over all possible values of X, that is, over $\mathcal{X}$. In either case, it is reasonable to interpret this calculation as taking the weighted average of the possible values of X, where the weights are the probabilities.

The variance of a random variable is a measure of how spread out its possible values are. A random variable will have a small variance if its possible values all fall in a small, tight range with high probability. It will have a large variance if its possible values are very spread out over a wide range and there is a reasonable chance that any of those values could be observed. The variance calculation is

$$\sigma_X^2 = \sum_{x \in \mathcal{X}} x^2\,P(X = x) - \mu_X^2. \qquad (2.3)$$
For a continuous random variable, replace summation with integration and "P" with "f". Another equivalent formula for the variance which is sometimes easier to deal with is

$$\sigma_X^2 = \mu_{X^2} - (\mu_X)^2.$$
A calculation related to the variance is called the standard deviation of a random variable. It is simply defined as the positive square root of the variance. This is actually used more often in practice than the variance because the units of the standard deviation calculation are the same as the units of the original variable, while the units of the variance are the square of the units of the original variable.

$$\sigma_X = \sqrt{\sigma_X^2}$$
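Applying Equations (2.1) and (2.3) to the crossover pmf of Example 2.1, for instance, takes only a few lines; a Python sketch:

```python
pmf = {0: .5, 1: .3, 2: .15, 3: .03, 4: .01, 5: .01}

# Equation (2.1): the probability-weighted average of the values.
mu = sum(x * p for x, p in pmf.items())                   # 0.78

# Equation (2.3): E[X^2] minus the square of the mean.
var = sum(x ** 2 * p for x, p in pmf.items()) - mu ** 2   # ~0.97

# Standard deviation: positive square root of the variance.
sd = var ** 0.5                                           # ~0.99
print(mu, var, sd)
```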

2.5 Percentiles of a Random Variable

In many situations, we want an answer to questions like, "what value of X is such that 90% of all values are below it?" The answer to this is called the 90th percentile of X. Of course, we can choose to ask for any percentile, not just the 90th.

Answering such a question is accomplished by looking at the inverse of the cdf. The cdf is used to answer questions like: "given a possible value of X, return the probability of being less than or equal to that value." The percentile question is the inverse: "given a probability (i.e., percentile), return the value such that the given percentage of all possible values are less than or equal to it."

We will notate this inverse function as $F_X^{-1}(\alpha)$, where $\alpha$ is the percentile we want ($0 < \alpha < 1$). So, for example, $F_X^{-1}(.63)$ is the 63rd percentile of the distribution of X, or the value that is such that 63% of the area under the curve is to its left. Or, since this notation can be cumbersome, instead of writing $F_X^{-1}(\alpha)$, we might just write $X_\alpha$ in general, or $X_{.63}$ as a specific example.

As an example, it turns out in our height example (the N(66, 3) model spelled out in Example 2.7) that 70.93 inches is the 95th percentile of height according to that probability model. 95% of all adults have a height less than or equal to 70.93 inches, and so 5% are taller than 70.93 inches. In notation, this says that $X_{.95} = F_X^{-1}(.95) = 70.93$, or $P(X \le 70.93) = .95$, or in a picture

[Figure: the height density f(x) with area .95 shaded to the left of $X_{.95} = 70.93$]

It turns out that it is fairly easy to find percentiles if we have access to software packages, even simple spreadsheets like Excel. So, we will rely on software to make such calculations.
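In Python, for example, scipy exposes the inverse cdf as ppf (the "percent point function"); a sketch using the N(66, 3) height model:

```python
from scipy.stats import norm

# 95th percentile of the N(66, 3) height model: the inverse cdf at .95.
x95 = norm.ppf(0.95, loc=66, scale=3)
print(x95)  # about 70.93

# Round trip: plugging the percentile back into the cdf recovers .95.
print(norm.cdf(x95, loc=66, scale=3))
```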

2.6 Common Discrete Distributions

The previous discussion was a very general discussion of discrete random variables, and the definitions and ideas apply for any such random variable. However, in real applications, the probability model (i.e., distribution) we use for a random variable often falls into one of certain special cases depending on how that random variable is measured and the way the experiment is conducted. There are very many such special cases, but a few tend to occur often in real genetic applications.

2.6.1 Bernoulli Random Variable

A Bernoulli random variable is perhaps the simplest of all possible discrete random variables. A random variable is called a Bernoulli random variable if it has just two possible values. These two values are usually labeled numerically as 1 and 0, or in words as "success" and "failure", respectively.

The situation where this usually comes up is the following: We conduct one trial of an experiment and record the result as a success or failure. The related random variable X is given the value 1 if it was a success, and 0 if it was a failure. The words “success” and “failure” should not be taken too literally. They are just generic words used to describe the two possible results. For example, if we are studying a certain genetic disease and are observing one individual to determine if they have it, we might consider it a “success” if they do have it, simply because that was what we were looking for, not because it is a good thing to have the disease.

The distribution of a Bernoulli random variable is called a Bernoulli distribution. In the example of the genetic disease, let's say it is known that 2.6% of the population has the disease. Suppose we randomly select one person and observe their disease status, defining X to be 1 if they have the disease and 0 if they don't. Then the pmf of X is:

$$P(X = x) = \begin{cases} 1 - .026 & x = 0 \\ .026 & x = 1 \end{cases}$$
If we didn't happen to know that 2.6% number, things get slightly more interesting. In that case, we might use the notation p to represent the currently unknown proportion of people who have the disease. Then we would write the pmf of X as:

$$P(X = x) = \begin{cases} 1 - p & x = 0 \\ p & x = 1 \end{cases}$$
This is our first example of a situation where there is a parameter (namely p) in our probability distribution. A parameter is a value that helps us define the distribution correctly, but that we don't know. In real situations, our goal is often to estimate this parameter. Here, an estimate of the parameter p would give us a guess at what proportion of the population had the disease, something which was previously unknown to us.

2.6.2 Binomial Random Variable

With regard to the previous section, it would be mostly uninteresting if all we did was sample a single individual, or more generally, conduct any experiment just one time. So, in real situations, a Bernoulli random variable is really just a means to a more interesting end. The "end" in this case is called a Binomial random variable.

In a real situation, we often conduct experiments a number of times. We call each repetition a trial and let n represent the total number of trials. A trial is a general word to represent many possible situations: a trial may be a randomly selected individual, or a single nucleotide, a single meiosis, a single mating, etc.

Suppose we do conduct an experiment for n trials (where we determine n in advance), and each trial is recorded simply as a success or failure as with a Bernoulli random variable. For this discussion, it also needs to be true that:

1. Each trial is independent (i.e., the result of one trial will not affect the result of any other).

2. The probability of getting a success from one trial to the next is exactly the same. We will call this common success probability p.

A random variable that is often of interest to us in such a situation is the count of how many successes we had in those n trials. Let X be this random variable. Then X is said to be a Binomial random variable, and its probability distribution is called the Binomial distribution. You should notice that a Binomial random variable really just comes about from taking n Bernoulli random variables and adding them up.

Notice that the set of possible values for X is $\{0, \ldots, n\}$. If we wanted to write down the pmf of X, it could take up a lot of paper if n were even moderately large. Fortunately, the pmf of a Binomial random variable (and all the others we will discuss) can be written very nicely using a formula. Skipping over the details of where the formula comes from, the pmf of a Binomial r.v. is:

$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, \ldots, n \qquad (2.4)$$
Now, instead of a listing of values and probabilities, we can use this formula to calculate the probability of any particular value of X occurring (i.e., observing any particular number of successes) when we conduct these n trials.

Example 2.2: In our disease example, let's say again that we know the proportion in the population is 2.6%. Then let's randomly select 10 people who are independent of one another. For each of the 10, we will observe whether or not they have the disease. X = the number of the 10 that have the disease is a Binomial random variable. In this example, the p in the formula is .026 and the n is 10. So for this specific X, its pmf is:
$$P(X = x) = \binom{10}{x} (.026)^x (1 - .026)^{10-x}, \quad x = 0, \ldots, 10$$

The chance that we will find that two of the ten have the disease is $P(X = 2) = \binom{10}{2}(.026)^2(1-.026)^8 = .0246$. That's a low probability, but that's what we might expect given the problem setup.
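scipy implements Equation (2.4) directly as binom.pmf, which makes such hand calculations easy to check; a sketch:

```python
from scipy.stats import binom

n, p = 10, 0.026

# P(X = 2) under Bin(10, .026): about .0246, as computed by hand.
print(binom.pmf(2, n, p))

# Sanity check: the pmf over {0, ..., 10} sums to one.
print(sum(binom.pmf(x, n, p) for x in range(n + 1)))
```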



As a reminder, once again it is more typical of real problems that we don’t know a value like p ahead of time. Our objective would be to model the situation using Equation (2.4), run the experiment and collect some data, then estimate p based on evidence we find in our data. More on this later.

We often use a shortcut notation to describe the special distribution that a random variable has. For example, in the Binomial case, we would have to say “the random variable X has a Binomial distribution with n trials and p probability of success on each trial” in order to convey all the information we need about X. However, that is a lot to say, and so a simpler way to write the same thing is:

$$X \sim \text{Bin}(n, p).$$
We'll see similar shortcuts for other random variables. The general setup of this notation is to write the random variable (here X), the "$\sim$" sign which means "has the following distribution", shorthand for the name of the distribution (here "Bin" for Binomial), and then the parameters of that distribution (here n and p). The parameters are a very necessary and important part of this notation because without them, we wouldn't know what to use in the Binomial pmf formula.

For our disease example, we would write: $X \sim \text{Bin}(10, .026)$. It is generally assumed that one will understand the order in which the parameters are written. In other words, no one will confuse the situation and think you meant that $n = .026$ and $p = 10$ in this case. For other random variable types, we might need to be a little more careful about the ordering.

Since a variable which is Binomial is just a special kind of discrete random variable, we can use Formula (2.1) to calculate its mean and Formula (2.3) to calculate its variance. The math to do this is a little tricky and we don't need to focus on that. The result is: if we have $X \sim \text{Bin}(n, p)$, then

$$\mu_X = np \quad \text{and} \quad \sigma_X = \sqrt{np(1-p)}.$$
For the disease example, we would calculate $\mu_X = 10 \times .026 = 0.26$. Our interpretation would be that if we continually sampled 10 independent people, and for each sample of 10 recorded the number who had the disease, the average of all these recordings would be 0.26.

2.6.3 Geometric Random Variable

An interesting variation on the experiment described above is the following. Instead of deciding ahead of time how many trials we will conduct, we will keep conducting the experiment

(keep conducting trials), until we finally get one success. Other than that, the rest of our assumptions about the experiment are the same (i.e., the trials are independent and the chance of success is the same for each trial).

In this situation, you should notice that X = Number of Successes over all the trials is not a random variable any more. We know its value is exactly one, because that’s how we defined the experiment. But, what we don’t know is how many trials we will wind up needing to conduct. So, in this case, we consider the number of trials as the random variable. We may get our one success on the first trial, in which case n would be 1. Or we may have 5 failures before finally getting the first success. Here n would be measured as 6. Using our usual notation for random variables, we will use N to represent this random variable, and n to represent its possible values.

With this setup, the random variable N is called a Geometric random variable. It is of course discrete since its possible values are integers. To be more specific, the set of possible values for N is $\{1, 2, \ldots\}$. Notice that N can't possibly be less than one, because we have to attempt at least one trial to get a success. Also, we can't really put an upper limit on how large N can get, because we don't know when that first success might occur. In most situations, it might be extremely unlikely that N will be in the hundreds or thousands, but we can't rule it out probabilistically. Practically, there might also be physical limitations on how many trials could be conducted as you are waiting for the first success, as will be true in the example below. But, still, we'll consider that there is no upper limit on the possible values for N.

The notation we'll use to represent a Geometric r.v. is $N \sim \text{Geometric}(p)$, where p still represents the common success likelihood on each trial.

To calculate probabilities of the different values that could occur, we need to know the probability distribution of N. It is called the Geometric distribution, and the formula is:

$$P(N = n) = p\,(1-p)^{n-1}, \quad n = 1, 2, \ldots \qquad (2.5)$$

Example 2.3: Suppose we are analyzing the DNA sequences of a selection of genes in some species. We might be interested in the underlying chemistry of these genes and want to study the particular sequences of nucleotides that tend to occur. Specifically, let's say that whenever there is a T nucleotide in the gene, we would like to develop a model regarding the number of nucleotides until the next T. Consider the following part of a DNA sequence:

...AAGTGGGAGGCCATCCA...

Once we encounter the first T (the one in the fourth position), how long will it be until we encounter the next T? Notice we are defining a success as there being a T in a particular position. This is a random process, and that count of nucleotides is a random variable. It will result in a different value depending on which

T is the starting point, which gene you have sequenced, and other factors. For this particular sequence, we would observe N = 10 since our first success (another T) occurred at the 10th position from the starting point. The number of nucleotides until the next T after that is encountered could then be measured and considered another observation of N.

So, let N be the random variable we are measuring. Does N as described in the example have a geometric distribution? We can consider each trial to be the process of observing what the next nucleotide is. We might wonder, then, if each trial is then independent of the next, and if the chance of getting a success (i.e., noticing a T) is the same from trial to trial (nucleotide to nucleotide). In order to claim that this random variable N has a geometric distribution, we would have to be willing, at least for now, to believe those assumptions. Let’s do that for this discussion.

What value of p should we use? For now, we'll say p = .25. This makes some sense, since at first glance we might think that since T is just one of four possible nucleotides, it should have a 25% chance of occurring in any particular position. In fact, maybe our objective in this study is to start out claiming p = .25, and then collect data to see if that is a good reflection of reality. So we now have $N \sim \text{Geometric}(.25)$. Probabilities for different values of N can be calculated using Formula (2.5):

$$P(N = n) = .25\,(.75)^{n-1}, \quad n = 1, 2, \ldots$$

For example, the probability that the next T will occur at the next position, namely two T's in a row, is $P(N = 1) = .25 \times .75^0 = .25$. The probability that the next T will occur in the second position after the start is $P(N = 2) = .25 \times .75^1 = .1875$, and so on. These suggest, for example, that if our model of this situation is correct, about 25% of the time the next T should occur immediately after the last, about 18.75% of the time the next T should occur two positions later, and so on.

Now, if we set out and collect some data, that is, observe various values of N, we may be able to validate, or invalidate, our probability model. For example, if in 20 random observations of N, we notice N=1 18 times, then we might have some reason to believe that our model was not correct. We would have expected N = 1 to occur about 5 times. Not necessarily exactly 5, but certainly not 18.
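scipy's geom uses the same "number of trials until the first success" convention as Formula (2.5), so it can reproduce these probabilities and simulate the 20 observations just described; a sketch (the random seed is an arbitrary choice):

```python
import numpy as np
from scipy.stats import geom

p = 0.25

# P(N = 1) and P(N = 2) from Formula (2.5): .25 and .1875.
print(geom.pmf(1, p), geom.pmf(2, p))

# Simulate 20 observations of N. Under the model we expect N = 1
# about 20 * .25 = 5 times -- nothing like 18.
rng = np.random.default_rng(2006)
sample = geom.rvs(p, size=20, random_state=rng)
print(np.sum(sample == 1))
```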



To conclude the discussion of Geometric random variables, we can calculate the mean and standard deviation using the following formulas:

$$\mu_N = 1/p \quad \text{and} \quad \sigma_N = \sqrt{\frac{1-p}{p^2}}.$$

2.6.4 Negative Binomial Random Variable

Another common special kind of discrete random variable is called the Negative Binomial. It is actually just a generalization of the Geometric random variable (or you can think of the Geometric as a special case of the Negative Binomial). Compared to the Geometric, the only difference is that instead of counting the number of trials it takes to finally get our first success, we will count the number of trials it takes to get our rth success. That is, we'll specify ahead of time a number r that represents how many successes we want to wait for, and then continue conducting trials until we get that many.

The random variable we care about is still N, the number of trials it takes to get those r successes. We will write $N \sim \text{NegBin}(r, p)$. Its probability distribution is given by:

$$P(N = n) = \binom{n-1}{r-1}\, p^r (1-p)^{n-r}, \quad n = r, r+1, \ldots$$
Notice that the possible values of N in this case begin at the integer r. It can't possibly take fewer than r trials to get the rth success.

Example 2.4: An example of a Negative Binomial distribution would be to consider the process of comparing two DNA sequences. Given a starting point, we might be interested in comparing the sequences base by base until there have been, say, 4 mismatches. The number of nucleotides we encounter along the way would be a random variable with a Negative Binomial distribution where r = 4. Notice that here we are using the word success to mean encountering a mismatch. The value of p in the distribution would be a reflection of how dissimilar the two sequences are. A larger value of p would reflect a model where we believed the sequences are less similar, because we would encounter successes (i.e., mismatches) more often. In this case, we would be more likely to observe smaller values of N, because we are more likely to come across that 4th mismatch sooner. Reverse the argument for smaller values of p. In practice, we might leave p as an unknown parameter in this distribution and try to estimate it based on data we collect, or we might specify a value for p and try to determine if this reflects reality correctly or not.
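One caution if scipy is used here: its nbinom counts the number of failures before the rth success rather than the total number of trials N, so a shift by r converts between the two conventions. A sketch, with p = .1 as a purely illustrative value:

```python
from scipy.stats import nbinom

r, p = 4, 0.1  # wait for 4 mismatches; p = .1 is an arbitrary choice

def pmf_total_trials(n, r, p):
    """P(N = n) where N is the total number of trials until the
    r-th success; scipy's nbinom counts the n - r failures instead."""
    return nbinom.pmf(n - r, r, p)

# The smallest possible value is n = r, with probability p^r.
print(pmf_total_trials(4, r, p), p ** 4)  # both 1e-4
```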



For a Negative Binomial random variable, we calculate the mean and standard deviation as:

$$\mu_N = r/p \quad \text{and} \quad \sigma_N = \sqrt{\frac{r(1-p)}{p^2}}.$$

2.6.5 Poisson Random Variable

Yet another common and important type of discrete random variable is the Poisson random variable. Such a random variable is actually quite related to a Binomial random variable, but we won't go into those details. For our sake, let's consider situations where we count up the number of times an event happens over some continuous interval, such as an interval of time, or an interval of space. Also, it must be the case that the following two conditions are true:

1. Whether or not the event occurs in a particular subinterval is independent of whether or not the event occurs in a different subinterval.

2. The chance of the event occurring in a certain subinterval only depends on the length of that subinterval, not on where that subinterval occurs within the overall interval.

In this situation, this count is a discrete random variable, and is said to have a Poisson distribution. There is also a parameter λ that goes along with a Poisson distribution. This specifies the average number of times that we would expect that event to occur in that interval.

Our notation, then, for a Poisson random variable is $X \sim \text{Poisson}(\lambda)$, and its probability distribution is given by:

$$P(X = x) = \frac{e^{-\lambda}\,\lambda^x}{x!}, \quad x = 0, 1, \ldots \qquad (2.6)$$
The possible values of a Poisson random variable X are all the non-negative integers. In other words, it could be that we don't observe any such events in the interval (in which case X would be 0), or maybe we observe one event, or any other positive number of events.

Example 2.5: In a previous example, we considered a random variable X which was the count of the number of crossovers that occurred during a particular meiosis (between two loci A and B on the same chromosome). At the time, we modeled the probabilities of particular values of X using a generic discrete distribution. Now, with our knowledge of Poisson distributions, we might consider X as being a Poisson random variable since it is a count of the number of events (i.e., crossovers) that happen in a certain interval of space (i.e., the length of the chromosome between A and B). We would need to be sure that the two conditions listed above are true, or at least close to true. In this case, we might consider condition two as certainly true, and condition one as close enough to true to continue using the Poisson model.

So, now we would model probabilities for X using Equation (2.6). We would likely leave λ as an unknown parameter, signifying that we don’t yet know the average number of crossover events that occur between A and B. That is something we will be trying to estimate as we attempt to gather a better understanding of the genetics underlying these two linked loci.
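If we provisionally picked a value for λ (say λ = 0.8, purely an illustrative guess, not an estimate from data), Equation (2.6) yields concrete numbers; a scipy sketch:

```python
from scipy.stats import poisson

lam = 0.8  # hypothetical mean number of crossovers between A and B

# P(no crossovers), P(exactly one), and P(two or more).
print(poisson.pmf(0, lam))       # e^{-0.8}, about .449
print(poisson.pmf(1, lam))       # about .359
print(1 - poisson.cdf(1, lam))   # about .191
```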



For a Poisson random variable, the mean and standard deviation are:

$$\mu_X = \lambda \quad \text{and} \quad \sigma_X = \sqrt{\lambda}.$$

Based on our definition of λ above, it should be no surprise that this is the mean of the random variable. Interestingly, it turns out that the standard deviation is the square root of the mean for a Poisson random variable. CHAPTER 2. PROBABILITY AND STATISTICS REVIEW 46

2.6.6 Multinomial Distribution

In one way, the multinomial distribution is very similar to the Binomial distribution, but very different in another. The similarity is with respect to the situations where it will apply. We will use the multinomial distribution when we are conducting multiple trials of an experiment, where each of the trials is independent of the others. The key difference is that, for the multinomial situation, each trial can have any number of possible outcomes, instead of just two. This is where the prefix "multi" comes in, as compared to the prefix "bi".

We can now set up the notation for our multinomial situation. Since the n trials can each have any number of possible outcomes, let's generally say that there are k such possible outcomes and label them $O_1, \ldots, O_k$.

With the binomial distribution, k = 2, so we would have just had an $O_1$ and $O_2$. The probabilities associated with these outcomes are denoted $p_1, \ldots, p_k$. In other words, for example, the probability that outcome $O_3$ will be the one that occurs on a particular trial is $p_3$. By necessity, it must be true that $\sum_i p_i = 1$ since one and only one outcome will occur on any trial.

There are, in fact, k random variables that we consider in this case, not just one. We will refer to them as

$X_1, \ldots, X_k$, and they represent the observed counts from n trials of each category $O_1, \ldots, O_k$, respectively.

Similar to our discussion above about the associated probabilities, it must be that $X_1 + \cdots + X_k = n$. We say that, together, the random variables $X_1, \ldots, X_k$ have a Multinomial distribution on n trials with parameters $p_1, \ldots, p_k$. Or we may write

$$(X_1, \ldots, X_k) \sim \text{Multinomial}(n, p_1, \ldots, p_k).$$
As another analogy to the Binomial distribution, recall that we used p to represent the probability of a

"success" (i.e., the first possible outcome, or $O_1$) and $1-p$ to represent the probability of a "failure" (i.e., the second possible outcome, or $O_2$). We could have labeled them $p_1$ and $p_2$ at the time, but since they must add up to 1, the p and $1-p$ notation makes this more apparent. For the multinomial, the probability associated with the last of the outcomes is often written as $p_k = 1 - p_1 - \cdots - p_{k-1}$ for the same reason.

The probability distribution for multinomial random variables is often called more specifically a joint probability distribution since it involves more than one random variable. The formula is written as:

$$P(X_1 = x_1, \ldots, X_k = x_k) = \frac{n!}{x_1! \cdots x_k!}\; p_1^{x_1} \cdots p_k^{x_k} \qquad (2.7)$$

Example 2.6: Suppose we are collecting genotypic data for a gene which has two alleles, A and a. Consider the offspring of a mating between two heterozygotes at this gene. As we know, there are three possible genotypes for the offspring: AA, Aa, and aa. We have also seen that we would expect these to occur with probabilities 1/4, 1/2, and 1/4, respectively. Of course, for any fixed number of offspring (say, n) from this mating, the observed counts won’t follow those probabilities exactly.

In fact, the observed counts in this case are random variables. We might refer to the number of AA's

we observe as $X_1$, the number of Aa's as $X_2$, and the number of aa's as $X_3$. Since the offspring of this mating are independent of one another with respect to their genotype (assuming no identical twins), we can say that $(X_1, X_2, X_3) \sim \text{Multinomial}(n, \tfrac{1}{4}, \tfrac{1}{2}, \tfrac{1}{4})$.

Since there are actually k random variables in this setup, there are k means and standard deviations that can be computed - one for each. The formulas are very similar to the binomial mean and standard deviation formulas:

$$\mu_{X_i} = np_i \quad \text{and} \quad \sigma_{X_i} = \sqrt{np_i(1-p_i)}$$
for $i = 1, \ldots, k$.

In our example, let’s say that n = 5. We would calculate

$$\mu_{X_1} = 5 \times \tfrac{1}{4} = 1.25$$
$$\mu_{X_2} = 5 \times \tfrac{1}{2} = 2.5$$
$$\mu_{X_3} = 5 \times \tfrac{1}{4} = 1.25$$
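scipy bundles Equation (2.7) and these means into a single object; a sketch for the n = 5 cross in Example 2.6:

```python
from scipy.stats import multinomial

# Five offspring from an Aa x Aa cross: genotype probs 1/4, 1/2, 1/4.
m = multinomial(5, [0.25, 0.5, 0.25])

# Equation (2.7): P(X1 = 1, X2 = 3, X3 = 1), i.e., 1 AA, 3 Aa, 1 aa.
print(m.pmf([1, 3, 1]))  # 5!/(1!3!1!) * (1/4)(1/2)^3(1/4), about .156

# The means n * p_i: 1.25, 2.5, 1.25 as computed above.
print(m.mean())
```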

2.6.7 Summary

Table 2.1 lists the distributions discussed in this section, along with a summary of the situations in which they are used.

2.7 Common Continuous Distributions

The distribution, or pdf, we decide to use for a random variable is an important decision, as we've seen. It defines everything about the random variable. As with discrete random variables, there are some common distributions that are used very often in real situations to model probabilities. We will discuss a few of them here. There are also some special continuous distributions that come up quite often in a different way. They aren't necessarily used to model probabilities of real-world phenomena, but are used as derived distributions that we need in the process of doing statistical inference (such as hypothesis tests). We will differentiate between these two cases below. But, for either type, they are distributions for continuous random variables and so follow all the same notation and requirements that we have discussed.

First, we will discuss a few continuous distributions that often serve as probability models for real-world random variables. As with discrete distributions, these probability functions are defined by parameters which are typically unknown in a real application.

Table 2.1: Summary of common discrete probability distributions

Bernoulli
• One trial conducted - either a success or failure.
• X = 0 if failure; X = 1 if success.

Binomial
• n trials.
• Each trial is either a success or failure.
• Trials are independent.
• Probability of success p is the same for each trial.
• X = Total number of successes.

Geometric
• Conduct trials until the first success.
• Each trial is either a success or failure.
• Trials are independent.
• Probability of success p is the same for each trial.
• N = Total number of trials.

Negative Binomial
• Conduct trials until the rth success.
• Each trial is either a success or failure.
• Trials are independent.
• Probability of success p is the same for each trial.
• N = Total number of trials.

Poisson
• Count the number of events that occur over time or space.
• Subintervals are independent.
• Probability of an event occurring in a subinterval depends only on the length of the subinterval.
• X = Total number of occurrences.

Multinomial
• n trials.
• Each trial can have one of k outcomes.
• Trials are independent.
• Probability of each outcome is the same from trial to trial.
• $X_i$ = Number of times outcome i occurs ($i = 1, \ldots, k$).

So, as before, typical problems we'll come across include the need to estimate these parameters from data that we collect, or to validate the probability model that we are using.

2.7.1 Normal Distribution

The normal distribution is typically used to model probabilities for a continuous random variable when we believe the pdf curve should be symmetric and bell-shaped. In other words, if the curve should look approximately like: CHAPTER 2. PROBABILITY AND STATISTICS REVIEW 49

[Figure: a generic symmetric, bell-shaped density curve f(x) plotted against x]

The height example from above was actually an example of a random variable with a normal distribution, as you can now see by looking back at the way we had drawn that curve. The x-axis is not scaled in the picture above because the exact look and placement of the curve depends on two parameters of the normal distribution which we will discuss below.

Mathematically, the pdf of a random variable with a normal distribution is

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty.$$
If a random variable X has this pdf, we will say that it has a normal distribution with parameters $\mu$ and $\sigma$, or write $X \sim N(\mu, \sigma)$.

The parameters that completely define this distribution are $\mu$ and $\sigma$ as mentioned above. If we analyzed the mathematical form of the pdf for X given above, we would notice that the curve will be centered on, and symmetric around, $\mu$. So, $\mu$ defines where the curve is centered when we go and plot the pdf. A little more difficult to see is that the $\sigma$ in the pdf defines how spread out the curve is. The larger $\sigma$ is, the more spread; the smaller $\sigma$ is, the less spread. More specifically, the following exact rules help define this amount of spread:

1. 68.26% of the distribution is within 1σ of μ.

2. 95.44% of the distribution is within 2σ of μ.

3. 99.7% of the distribution is within 3σ of μ.

With these rules in mind, we can better scale the horizontal axis on our plot of the pdf: CHAPTER 2. PROBABILITY AND STATISTICS REVIEW 50

[Figure: the normal pdf $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ with the x-axis marked at $\mu - 3\sigma$, $\mu - 2\sigma$, $\mu - \sigma$, $\mu$, $\mu + \sigma$, $\mu + 2\sigma$, $\mu + 3\sigma$, and shaded areas .6826 (Rule 1), .9544 (Rule 2), and .997 (Rule 3)]

We typically use the third rule to help us accurately draw this curve. Just draw it so that it has a bell-shaped symmetric look, is centered on $\mu$, and is spread out in such a way that almost all of the area under the curve is between $\mu - 3\sigma$ and $\mu + 3\sigma$. Then mark off points on the x-axis appropriately.

Notice that the range of possible values for X is the entire real line. So, even though it might look like the curve f(x) is coming down and hitting the x-axis on each side, it actually continues on forever in each direction, getting closer and closer to the axis without ever touching it. Of course, on our small finite plot, we can't see this fact too well, but it is important to realize it.

Example 2.7: Now, being more specific with our height example, we might define $X \sim N(66, 3)$. So, the pdf of X is

$$f_X(x) = \frac{1}{3\sqrt{2\pi}}\, e^{-\frac{(x-66)^2}{2(9)}}, \quad -\infty < x < \infty,$$
and a plot of $f_X(x)$ looks like it did in our earlier discussion. Now, we have just put a functional form to the f we are using:

[Figure: the N(66, 3) pdf $f_X(x) = \frac{1}{3\sqrt{2\pi}} e^{-\frac{(x-66)^2}{2(9)}}$ with axis ticks at 57, 60, 63, 66, 69, 72, 75]

Some specific probability calculations we can make using our rules about the normal distribution are

$$P(63 \le X \le 69) = .6826$$
$$P(60 \le X \le 72) = .9544$$
$$P(57 \le X \le 75) = .997$$
Of course, there are many other probabilities we might want to calculate that we can't answer using these rules, such as $P(X \ge 67.4)$ and $P(58.6 \le X \le 62.93)$. These are still valid questions, but need to be answered by integration. In other words,

$$P(X \ge 67.4) = \int_{67.4}^{\infty} \frac{1}{3\sqrt{2\pi}}\, e^{-\frac{(x-66)^2}{2(9)}}\,dx$$
This is not a fun integration problem to do, but thankfully we'll see later that we don't actually have to integrate in order to compute such integrals for normal distributions. Basically, all integrals of this type that we might be interested in have already been computed for us and tabled, as we'll see in Section 2.7.5.
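In practice these normal integrals are delegated to tables or software; a scipy sketch of the same two calculations:

```python
from scipy.stats import norm

mu, sigma = 66, 3

# P(X >= 67.4) = 1 - F(67.4); no integration by hand required.
print(1 - norm.cdf(67.4, loc=mu, scale=sigma))  # about .32

# P(58.6 <= X <= 62.93) as a difference of two cdf values.
print(norm.cdf(62.93, mu, sigma) - norm.cdf(58.6, mu, sigma))
```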



Now, let’s say a little more about the parameters μ and σ. It turns out that if we calculate the mean of a random variable with a normal distribution, we get by definition

$$\mu_X = \int_{-\infty}^{\infty} x\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx$$
and it turns out that this is equal to the parameter $\mu$. So, this $\mu$ we have been referring to actually represents more specifically the mean of the random variable (which should be no surprise given that we were using the notation $\mu$). Similarly, it turns out that the parameter $\sigma$ represents what we might think; it is the standard deviation of the random variable X.

Example 2.8: In the height example, we now know that μX = 66 and σX = 3. Or, in words, the average height we would observe if we kept randomly selecting adults and measuring height forever and ever would be 66 inches. A measure of the spread of these heights would be 3.

Also, as a technical matter, notice that the normal distribution is actually not sensible from a pure mathematical standpoint to use as a probability model for height. This is because a normal probability model allows for the possibility that the random variable can have any real number value, including any negative number. Height, obviously, cannot have a negative value in the real world. However, the probability that would be assigned to negative values of height under this model is so vanishingly small that from a practical standpoint, we often use the model anyway.



The cdf of a normal random variable cannot be written down easily. We’ll also deal with this more fully in Section 2.7.5.

2.7.2 Gamma Distribution

A random variable X is said to have a Gamma distribution with parameters $\alpha$ and $\beta$, or $X \sim \text{Gamma}(\alpha, \beta)$, if its pdf has the form

$$f_X(x) = \frac{1}{\Gamma(\alpha)\,\beta^\alpha}\, x^{\alpha-1} e^{-x/\beta}, \quad x > 0;\ \alpha > 0;\ \beta > 0,$$
and $f_X(x) = 0$ otherwise (i.e., for $x \le 0$).

A couple of notes about this pdf. First, it only has positive probability for positive values of X. In other words, a random variable with a gamma distribution cannot have negative values. This is in contrast to the normal distribution, for example, where the random variable is allowed to have any real number value.

Second, we notice that the parameters α and β must also be positive real numbers. Otherwise, we would wind up with an ill-defined pdf (such as allowing for it to have negative values).

Third, the notation Γ(α) appears in this pdf. This is a mathematical function called the Gamma Function. It is defined as follows

$$\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\,dx$$
The simplest way to think about this term in the Gamma pdf is that it is there to ensure that the function as a whole has area one under the curve. We do come across the need to calculate this function in some cases, and the following rules will often help us out:

1. If $\alpha > 1$, then $\Gamma(\alpha) = (\alpha - 1)\Gamma(\alpha - 1)$.

2. If $\alpha$ is an integer greater than or equal to 1, then $\Gamma(\alpha) = (\alpha - 1)!$.

3. $\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$.

The first rule above establishes a recursive relationship for this function which is sometimes useful. The second resolves that recursive relationship in a very simple manner (i.e., a factorial) if $\alpha$ is an integer.

Now, if we plot this pdf, what shape does it have? That depends quite a bit on the particular values for $\alpha$ and $\beta$ we use. Below are some examples:

[Figure: four gamma pdfs plotted for x from 0 to 6: $\alpha = 1, \beta = 1$ gives $f(x) = e^{-x}$; $\alpha = 1, \beta = 3$ gives $f(x) = \frac{1}{3}e^{-x/3}$; $\alpha = 2, \beta = 1$ gives $f(x) = xe^{-x}$; $\alpha = 3, \beta = 1.5$ gives $f(x) = \frac{1}{2(1.5^3)}x^2 e^{-x/1.5}$]

The area under each of these curves is exactly 1, and each extends indefinitely to the right toward $+\infty$, getting closer and closer to the x-axis without ever touching it. From just this selection of four combinations of $\alpha$ and $\beta$, we can see that the gamma distribution can be used to model a wide variety of distribution shapes, and therefore a wide variety of real situations.

In some cases, calculating probabilities for gamma random variables is not too difficult with regard to the necessary calculus. For example, whenever $\alpha = 1$, the $x^{\alpha-1}$ term in the pdf drops out, and the pdf has the simple form $\frac{1}{\beta}\, e^{-x/\beta}$, which is very quick to integrate over any range of x's. Other situations are not that immediate. For example, taking the case of $\alpha = 3$ and $\beta = 1.5$ above, we might want to know $P(2.41 \le X \le 4.12)$:

[Figure: the Gamma(3, 1.5) pdf $f(x) = \frac{1}{2(1.5^3)}\, x^2 e^{-x/1.5}$ plotted for x from 0 to 9]

This is of course defined as the integral

$$P(2.41 \le X \le 4.12) = \int_{2.41}^{4.12} \frac{1}{2(1.5^3)}\, x^2 e^{-x/1.5}\,dx,$$
which would require integration by parts. If $\alpha$ were larger than 3, or it weren't an integer at all, this would turn into an ugly integral. Once again, we will not typically calculate these integrals by hand; we'll often rely on tables or computers to do this nasty work for us.
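With software the gamma cdf does this for us; a scipy sketch (scipy's shape argument a is our $\alpha$ and its scale is our $\beta$, matching the pdf above):

```python
from scipy.stats import gamma

a, beta = 3, 1.5  # alpha = 3, beta = 1.5, as in the example

# P(2.41 <= X <= 4.12) as a difference of cdf values -- no
# integration by parts needed.
p = gamma.cdf(4.12, a, scale=beta) - gamma.cdf(2.41, a, scale=beta)
print(p)
```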

Let's say a little more about the effect of the parameters $\alpha$ and $\beta$ on the shape of the distribution. It's not as simple a relationship as it was with $\mu$ and $\sigma$ in the normal distribution. Both play a role in defining the mean and spread of the distribution, although $\beta$ has a little more say in the spread than does $\alpha$. Specifically, we can calculate the mean and variance of $X \sim \text{Gamma}(\alpha, \beta)$ to be

$$\mu_X = \alpha\beta \quad \text{and} \quad \sigma_X^2 = \alpha\beta^2. \qquad (2.8)$$

NOTE: These values are actually not difficult to calculate through integration. As an example, recall that $\mu_X = \int_0^\infty x\, \frac{1}{\Gamma(\alpha)\beta^\alpha}\, x^{\alpha-1} e^{-x/\beta}\,dx$ by definition. Combining the x terms in the integrand and then setting up the integrand to look like a gamma pdf with different parameters will leave the constant $\alpha\beta$ outside of the integral, and the integral itself will integrate to 1. Try this on your own.

Finally, it is the case that a gamma random variable will have a distribution very similar to a normal distribution for larger and larger values of $\alpha$ and $\beta$. For example, if we have $X \sim \text{Gamma}(7, 2.6)$, then the pdf of X looks like

[Figure: the Gamma(7, 2.6) pdf $f(x) = \frac{1}{6!\,(2.6^7)}\, x^6 e^{-x/2.6}$, centered near $\mu_X = \alpha\beta = 18.2$]

It is still somewhat skewed right, but is closer to symmetric than we saw in the earlier examples. Even larger values for α and β would be even more symmetric.

2.7.3 Exponential Distribution

The exponential distribution is actually a special case of the gamma distribution, and we have already seen some exponential pdfs. A random variable is said to have the exponential distribution with parameter $\lambda$, or $X \sim \text{Exponential}(\lambda)$, if its pdf has the form

$$f_X(x) = \lambda e^{-\lambda x}, \quad x > 0;\ \lambda > 0$$

We can notice that this is in the form of a gamma pdf, where we have let $\alpha = 1$ and $\beta = 1/\lambda$ (since $\Gamma(1) = 1$ and the $x^{\alpha-1}$ term drops out). If you look back at the original pictures of various gamma pdfs, the ones with $\alpha = 1$ were actually, more specifically, exponential distributions. The pdfs of these distributions start off high at the left and slope down in an exponential fashion. Some other examples are

[Figure: four exponential pdfs plotted for x from 0 to 3: $\lambda = 0.2$ gives $f(x) = 0.2e^{-0.2x}$; $\lambda = 0.8$ gives $f(x) = 0.8e^{-0.8x}$; $\lambda = 2.6$ gives $f(x) = 2.6e^{-2.6x}$; $\lambda = 4.5$ gives $f(x) = 4.5e^{-4.5x}$]

The form of the exponential pdf is much simpler to deal with in general than the gamma pdf, so calculating probabilities is just a matter of simple calculus integrations. For example, say we have $X \sim \text{Exponential}(2.0)$. Then $f_X(x) = 2e^{-2x}$, $x > 0$. If we want to know the probability that X will have a value greater than 1, for instance, we would calculate

$$P(X > 1) = \int_1^\infty 2e^{-2x}\,dx = \left. -e^{-2x} \right|_{x=1}^{x=\infty} = -(0 - e^{-2}) = e^{-2} = .1353.$$

[Figure: the Exponential(2) pdf $f(x) = 2e^{-2x}$ with the area .1353 to the right of x = 1 shaded]

Because the exponential is just a special case of the gamma, formulas for the mean and variance follow simply from Equation (2.8):

$$\mu_X = 1/\lambda \quad \text{and} \quad \sigma_X^2 = 1/\lambda^2.$$
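A scipy sketch confirming the calculation above; note that scipy parameterizes the exponential by scale = 1/λ rather than by λ itself:

```python
from scipy.stats import expon

lam = 2.0
X = expon(scale=1 / lam)  # scipy uses scale = 1/lambda

# P(X > 1) via the survival function: e^{-2}, about .1353.
print(X.sf(1))

# Mean 1/lambda and variance 1/lambda^2: 0.5 and 0.25.
print(X.mean(), X.var())
```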

2.7.4 Beta Distribution

The beta distribution is a little different in nature from the distributions discussed so far. It is one in which the possible values allowed for the random variable are in the [0, 1] range. A typical example of such a random variable would be a percentage or proportion, and an application in genetics would be the crossover percentage between a pair of loci. If we consider a randomly selected pair of loci on a chromosome, the crossover rate would be a random variable with range in [0, 1].

A random variable X is said to have the beta distribution with parameters $\alpha$ and $\beta$, or $X \sim \text{Beta}(\alpha, \beta)$, if its pdf has the form

$$f_X(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}, \quad 0 \le x \le 1;\ \alpha > 0;\ \beta > 0,$$
where the Gamma function $\Gamma(\cdot)$ is as discussed earlier. The mean and variance of a beta random variable are

$$\mu_X = \frac{\alpha}{\alpha + \beta} \quad \text{and} \quad \sigma_X^2 = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$$
Some examples of this pdf are below. Depending on the values used for $\alpha$ and $\beta$, this function gives a wide range of shapes in the [0, 1] interval.

[Figure: four beta pdfs plotted on [0, 1]: $\alpha = 1, \beta = 1$ gives $f(x) = 1$; $\alpha = 2, \beta = 2$ gives $f(x) = 6x(1-x)$; $\alpha = 2, \beta = 4$ gives $f(x) = 20x(1-x)^3$; $\alpha = 7, \beta = 3$ gives $f(x) = 252x^6(1-x)^2$]

Probability calculations for the beta distribution can involve messy integrations, so we will rely on computers to do the work for us, as we'll see later in this chapter.
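A scipy sketch for one of the shapes plotted above, Beta(2, 4):

```python
from scipy.stats import beta

a, b = 2, 4  # one of the parameter pairs plotted above

# P(X <= .5): the area under 20x(1-x)^3 from 0 to .5.
print(beta.cdf(0.5, a, b))

# Mean alpha/(alpha + beta) = 1/3, matching the formula above.
print(beta.mean(a, b))
```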

2.7.5 Standard Normal Distribution

As mentioned earlier, there are a number of special continuous random variables that only come up in real applications as intermediary variables derived from other random variables; they are not typically used as probability models for real data. They are used in many estimation and hypothesis testing situations, some of which we will come across later.

The most common of these distributions are the standard normal, t, chi-squared, and F distributions. We will discuss these below. Since they are not used as probability models, we will simply write their pdfs, plot the pdf for a couple of typical parameters, and make a couple of brief comments. We will mainly be concerned with finding percentiles of these distributions, which will be discussed briefly at the end of this chapter. More details on their use in applications will come up throughout the course.

Let’s begin with the standard normal distribution. A random variable is said to have the standard normal distribution if it has a normal distribution with μ = 0 and σ = 1. This distribution, then, is a (very) special case of the normal distribution. This type of random variable comes up so often that it is usually referred to with the letter Z, and the distribution is often called the Z-distribution. Its pdf is just the resulting special case of the normal pdf seen before:

f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \quad -\infty < z < \infty.

There are no parameters in this distribution, so there is one and only one standard normal distribution. The plot of this pdf is below. It is often called the standard normal curve, or the Z-curve.

[Figure: the standard normal curve f(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}, plotted from z = −3 to 3]

By the rules seen earlier for a generic normal distribution, 99.7% of the area under this standard normal curve lies between the values -3 and 3. That is, P(−3 ≤ Z ≤ 3) = .997.

One final note is a comment about where the standard normal distribution derives from. If we start with a generic normal random variable, say X ∼ N(μ, σ), and create a new random variable according to the transformation

\frac{X - \mu}{\sigma},

then this new random variable, which is often referred to as Z, will have a standard normal distribution. The process of creating this transformed random variable is often referred to as standardizing X.
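As a quick numerical illustration, a sketch assuming scipy, with X ∼ N(4, 3) used as a stand-in example:

from scipy import stats

mu, sigma = 4, 3
z = (7 - mu) / sigma                             # standardizing x = 7 gives z = 1
print(stats.norm.cdf(z))                         # P(X <= 7) = P(Z <= 1), approximately .8413
print(stats.norm.cdf(3) - stats.norm.cdf(-3))    # the 99.7% rule: P(-3 <= Z <= 3) = .997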

2.7.6 t-distribution

A random variable T is said to have the t-distribution with parameter ν, or T ∼ t_ν, if its pdf is of the form

f_T(t) = \frac{\Gamma\left[\frac{\nu+1}{2}\right]}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}, \quad -\infty < t < \infty;\ \nu \text{ a positive integer}

The parameter ν is often called a degrees of freedom parameter, or df, and is a kind of parameter typical of this and the other derived distributions we will discuss. It is a positive integer, and its value in a particular application depends on certain information about the process; usually it is very closely related to the sample size. The t-distribution is also called Student’s t-distribution, or the Student distribution.

Different values of the df lead to slightly different looking t-curves. As the degrees of freedom increases, the variance of the distribution, and therefore the spread of the curve, gets smaller. It is always symmetric about 0.

[Figure: t-curves for ν = 3, 10, and 30 plotted from t = −3 to 3, with the standard normal curve overlaid]

The plot above also includes a plot of the standard normal curve on the same scale (the darker solid line). We can see that the t-curves are actually quite similar to the standard normal curve. For smaller values of degrees of freedom, the curve has more spread than the standard normal, but as the df increases, the resulting t-curve begins to look more and more like the standard normal curve. Once the df gets to about 50, the two curves are the same for all practical purposes.

The mean and variance of a t-distribution are

\mu_T = 0 \quad \text{and} \quad \sigma_T^2 = \frac{\nu}{\nu - 2}

again showing the similarity to the standard normal. Notice the variance goes to 1 as ν → ∞.

The t-distribution comes about in the following way: let X ∼ N(0, 1) and Y ∼ χ²_ν (discussed in the next subsection), and let them be independent. Then the random variable created by the following transformation

\frac{X}{\sqrt{Y/\nu}}

has a t_ν distribution. It also arises in other related ways that we’ll see.
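A simulation sketch of this construction, assuming numpy (the simulation size here is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
nu, n = 10, 200_000
X = rng.standard_normal(n)        # X ~ N(0, 1)
Y = rng.chisquare(nu, n)          # Y ~ chi-squared with nu df
T = X / np.sqrt(Y / nu)           # should behave like a t with nu df
print(T.var())                    # approximately nu/(nu - 2) = 1.25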

2.7.7 Chi-squared distribution

A random variable X is said to have the chi-squared distribution with parameter ν, or X ∼ χ²_ν, if its pdf is of the form

f_X(x) = \frac{1}{\Gamma\left(\frac{\nu}{2}\right) 2^{\nu/2}}\, x^{(\nu/2 - 1)} e^{-x/2}, \quad x > 0;\ \nu \text{ is a positive integer}

If you look closely, you’ll notice that this is actually just a special case of the Gamma distribution, where we have let the α parameter be ν/2 and the β parameter equal 2. As with the t-distribution, the ν parameter in a chi-square distribution is called its degrees of freedom and is a positive integer. Different values for ν lead to slightly different chi-square curves, with the main part of the curve shifting right as ν increases.

[Figure: chi-squared pdfs for ν = 2, 5, and 10]

The mean and variance of a chi-squared random variable follow the formulas for the gamma distribution:

\mu_X = \frac{\nu}{2} \times 2 = \nu \quad \text{and} \quad \sigma_X^2 = \frac{\nu}{2} \times 2^2 = 2\nu

The chi-squared distribution arises in the following way. Let Z_1, ..., Z_r be independent N(0, 1) random variables. Then the random variable created by summing the squares of these random variables, that is \sum_i Z_i^2, has a χ²_r distribution.
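The same kind of simulation check works here (again assuming numpy):

import numpy as np

rng = np.random.default_rng(1)
r, n = 4, 200_000
Z = rng.standard_normal((n, r))   # n draws of (Z_1, ..., Z_r), each N(0, 1)
X = (Z ** 2).sum(axis=1)          # sum of squares: chi-squared with r df
print(X.mean(), X.var())          # approximately r = 4 and 2r = 8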

2.7.8 F-distribution

A random variable X is said to have the F-distribution with parameters ν₁ and ν₂, or X ∼ F_{ν₁,ν₂}, if its pdf is of the form

f_X(x) = \frac{\Gamma\left(\frac{\nu_1 + \nu_2}{2}\right)}{\Gamma(\nu_1/2)\,\Gamma(\nu_2/2)} \cdot \frac{\nu_1^{\nu_1/2}\, \nu_2^{\nu_2/2}\, x^{(\nu_1/2 - 1)}}{(\nu_2 + \nu_1 x)^{[(\nu_1 + \nu_2)/2]}}, \quad x > 0;\ \nu_1, \nu_2 \text{ are positive integers}

The parameters ν₁ and ν₂ are both degrees of freedom-type parameters. The order matters, so which is the first and which is the second is important. Often, the first is called the numerator degrees of freedom and the second is called the denominator degrees of freedom. Different values of these parameters lead to different looking F-curves:

[Figure: F pdfs for (ν₁ = 2, ν₂ = 4), (ν₁ = 4, ν₂ = 9), and (ν₁ = 8, ν₂ = 3)]

The mean and variance are

\mu_X = \frac{\nu_2}{\nu_2 - 2} \quad \text{and} \quad \sigma_X^2 = \frac{2\nu_2^2(\nu_1 + \nu_2 - 2)}{\nu_1(\nu_2 - 2)^2(\nu_2 - 4)}

We can see that the mean tends toward 1 as ν₂ → ∞, and the variance tends toward 0 as both parameters increase.

The F-distribution comes about in the following way: let X₁ ∼ χ²_{ν₁} and X₂ ∼ χ²_{ν₂} and let them be independent. Then the random variable created by the transformation

\frac{X_1/\nu_1}{X_2/\nu_2}

has an F_{ν₁,ν₂} distribution.

2.7.9 Using the Computer for Distribution Calculations

Microsoft Excel is an easily accessible package that provides functions for computing probabilities and percentiles for many common distribution functions. Tables 2.2 and 2.3 give a summary of these functions for various distributions, along with some examples and notation.

Also, the website http://calculators.stat.ucla.edu/cdf/ is a great resource for making such calculations if you are connected to the internet. From here, you can calculate cdfs, pdfs/pmfs, random numbers, and also plot pdfs/pmfs.

Table 2.2: Excel Formulas for Discrete Distributions

Distribution     Probability (pmf) P(X = x)     Cumulative Probability (cdf) F_X(x) = P(X ≤ x)     Percentiles X_pct
Binomial(n, p)   BINOMDIST(x, n, p, FALSE)      BINOMDIST(x, n, p, TRUE)                           CRITBINOM(n, p, pct)
NegBin(r, p)     NEGBINOMDIST(x − r, r, p)      –                                                  –
Poisson(λ)       POISSON(x, λ, FALSE)           POISSON(x, λ, TRUE)                                –

Table 2.3: Excel Formulas for Continuous Distributions

Distribution       Cumulative Probability (cdf) F_X(x) = P(X ≤ x)     Percentiles X_pct
Beta(α, β)         BETADIST(x, α, β)                                  BETAINV(pct, α, β)
χ²(ν)              CHIDIST(x, ν)                                      CHIINV(pct, ν)
Exponential(λ)     EXPONDIST(x, λ, TRUE)                              – (Use GAMMAINV)
F(ν₁, ν₂)          FDIST(x, ν₁, ν₂)                                   FINV(pct, ν₁, ν₂)
Gamma(α, β)        GAMMADIST(x, α, β, TRUE)                           GAMMAINV(pct, α, β)
N(μ, σ)            NORMDIST(x, μ, σ, TRUE)                            NORMINV(pct, μ, σ)
N(0,1)             NORMSDIST(x)                                       NORMSINV(pct)
t(ν)               1 − TDIST(x, ν, 2)                                 TINV(1 − pct, ν)
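If Excel is not handy, equivalent calculations can be sketched in Python with scipy; these are optional stand-ins, not part of the course's Excel toolkit:

from scipy import stats

print(stats.binom.pmf(3, 10, 0.25))         # like BINOMDIST(3, 10, 0.25, FALSE)
print(stats.poisson.cdf(4, 2.5))            # like POISSON(4, 2.5, TRUE)
print(stats.norm.cdf(7, 4, 3))              # like NORMDIST(7, 4, 3, TRUE)
print(stats.chi2.ppf(0.95, 4))              # 95th percentile; like CHIINV(.05, 4)
print(stats.gamma.cdf(5, 1.5, scale=2.6))   # like GAMMADIST(5, 1.5, 2.6, TRUE)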

2.8 Multiple Random Variables

In most real applications, there is not just one random variable. In fact, there are usually many random variables in a problem. We have really already seen that through examples discussed so far, without usually explicitly saying so. For example:

1. A Binomial random variable is really just n Bernoulli random variables added up. In other words, if X_1, ..., X_n are each Bernoulli(p) random variables, then X_1 + ··· + X_n is a Binomial random variable as long as the independent trials assumption is true. The n Bernoullis can be thought of as storing the result of each trial, either a 1 for success or 0 for failure. Adding them up gives the total number of successes, which is how we define a Binomial.

2. Recall the random variable X ∼ Poisson(λ) that was a count of the number of crossovers during a meiosis. It is unlikely that we’ll observe just one meiosis in our research to better understand this genetic phenomenon. We’re likely to observe many such meioses. The resulting number of crossovers for each observation is a random variable with the same distribution. If we make n such observations, we have n random variables which we might refer to as X_1, ..., X_n, each having a Poisson(λ) distribution.

3. The multinomial distribution was a direct example of having multiple random variables. It is defined that way right from the start.

2.8.1 Random Samples and Independence

Situations 1 and 2 above are extremely common in statistical analysis and have special terminology associ- ated with them. First, we say that the n random variables are independent of each other, or we might call them mutually independent. By saying two or more random variables are independent, we mean that the measurement we make for one will have no effect at all on measurements we make for any other.

The other commonality in Situations 1 and 2 above is that each of the n random variables is just another observation (i.e., measurement) of the same basic phenomenon. They just represent different trials, or time periods, or meioses, etc. So, not only are they independent, but they each have the same distribution. For example, when observing n meioses and counting the number of crossovers each time between loci A and B, we are just observing different trials of the same basic phenomenon. Therefore, if one of those random variables has a Poisson(λ) distribution, then they all must have that distribution.

Putting these two concepts together leads to the idea of a random sample. We say that n random variables are a random sample if they are mutually independent and each has the same distribution. “Having the same distribution” is often said as “identically distributed”. The notation we will use is (replace “Poisson” with whatever distribution is being used in the problem)

X1,...,Xn are a random sample, each with a Poisson(λ) distribution or, in shorthand:

X_1, ..., X_n ∼ Poisson(λ)

or:

X_1, ..., X_n iid Poisson(λ)

where iid stands for independent and identically distributed.

For the time being, we should note that the random variables involved in a multinomial distribution are a good example of not being independent. Notice that knowing the value of one or more can very much affect the value of others, since there is the restriction that they must all add up to n. However, other than the multinomial, most situations we come across will fall into the category of independence.

2.8.2 Joint Probability Distributions

Before, with one random variable at a time, we just needed a probability distribution that described the likelihood of that one random variable’s possible values. Now that we realize that we are often talking about many random variables at once, we need to be able to describe the probabilities of various combinations of their values occurring. Such a function is called a joint probability distribution for n random variables. More specifically, for discrete random variables, it is called the joint probability mass function (joint pmf), and for continuous random variables, it is called the joint probability density function (joint pdf).

For situations where we have a random sample, determining the joint pmf or pdf is, thankfully, quite simple. First, a simplifying notation that we’ll use is the following. Instead of writing P(X = x) to represent the pmf of a discrete random variable, we write p_X(x). And we’ll write the joint pmf of n random variables X_1, ..., X_n as p_{X_1,...,X_n}(x_1, ..., x_n). For continuous random variables, we will write an individual pdf as f_X(x) and a joint pdf as f_{X_1,...,X_n}(x_1, ..., x_n).

Now, the result that we can make use of is the following: if X1,...,Xn are a random sample, each with probability distribution pX (x) or fX (x), then their joint pmf or pdf is just the product of the n individual distributions. In other words

p_{X_1,...,X_n}(x_1, ..., x_n) = p_{X_1}(x_1)\, p_{X_2}(x_2) \cdots p_{X_n}(x_n)

for discrete variables, and

f_{X_1,...,X_n}(x_1, ..., x_n) = f_{X_1}(x_1)\, f_{X_2}(x_2) \cdots f_{X_n}(x_n)

for continuous variables.

Example 2.9: Consider our meiosis example. We have X_1, ..., X_n ∼ Poisson(λ). We know the pmf of each of the individual random variables X_1, ..., X_n:

p_{X_1}(x_1) = \frac{e^{-\lambda}\lambda^{x_1}}{x_1!}
p_{X_2}(x_2) = \frac{e^{-\lambda}\lambda^{x_2}}{x_2!}
\vdots
p_{X_n}(x_n) = \frac{e^{-\lambda}\lambda^{x_n}}{x_n!}

So, the joint pmf of the random variables is:

p_{X_1,...,X_n}(x_1, ..., x_n) = \frac{e^{-\lambda}\lambda^{x_1}}{x_1!} \cdots \frac{e^{-\lambda}\lambda^{x_n}}{x_n!} = \frac{e^{-n\lambda}\,\lambda^{\sum_i x_i}}{x_1! \cdots x_n!} □

Let’s look ahead for a minute and ask, with respect to the above example, why all this will be important. In this situation, we are not claiming that we know λ. We have left it as an unknown parameter. Recall that the λ in a Poisson distribution represents the average number of times the event (here, crossovers) occurs in the length of space or time (here, a length of DNA). We don’t know λ, but it is a quantity of extreme importance if we are to better understand the genetics involving these two loci.

So, one ultimate goal of this study is likely to be that we want to come up with an estimate of λ. If we observe just one such meiosis, will this provide us the necessary data to make such an estimate? Would it be a reliable estimate? Definitely not.

On the other end of the spectrum, if we observed 100,000 such meioses, would this provide us with what we needed to make an accurate estimate of λ? It seems that the answer is certainly yes. A reasonable value for n needs to be chosen that provides a tradeoff between accuracy of the estimate we make, and the amount of work we do.

In summary, we need to make multiple observations of a phenomenon we are interested in for us to make any valid conclusions regarding that phenomenon or to understand better how it really works.

Example 2.10: Let’s also reconsider the geometric random variable example. In that case, we were counting the number of bases along a DNA strand, from some starting point, until and including the next T. This random variable X had a Geometric distribution with parameter p = 1/4.

Now, more realistically, we are likely to perform this counting process multiple times from multiple selected starting points. Each of these independent counting processes will result in the measurement of a random variable. Let’s do this n times, and call the resulting random variables X1,...,Xn, as usual. Each has a geometric distribution with parameter 1/4. Since they are a random sample, their joint pmf will be:

p_{X_1,...,X_n}(x_1, ..., x_n) = \left(\tfrac{1}{4}\right)\left(\tfrac{3}{4}\right)^{x_1 - 1} \cdots \left(\tfrac{1}{4}\right)\left(\tfrac{3}{4}\right)^{x_n - 1} = \left(\tfrac{1}{4}\right)^n \left(\tfrac{3}{4}\right)^{\sum_i (x_i - 1)} □

We can use joint pmf’s to calculate probabilities if we want. For example, in the geometric example above with n = 4, we may ask: What is the probability that our first trial will result in a count of 3 bases, then it will take 6 bases until the next T, then 2 bases, then 8 bases? This probability can be calculated from the joint pmf. It is just:

p_{X_1,X_2,X_3,X_4}(3, 6, 2, 8) = \left(\tfrac{1}{4}\right)^4 \left(\tfrac{3}{4}\right)^{[(3-1)+(6-1)+(2-1)+(8-1)]} = .0000522

But, this is not our typical use of these pmfs, as we’ll see.
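A quick check of this number, as a sketch assuming scipy (whose geom pmf counts trials up to and including the first success, matching our setup):

from scipy import stats

p = 1.0
for x in (3, 6, 2, 8):
    p *= stats.geom.pmf(x, 0.25)    # (1/4)(3/4)^(x - 1) for each independent count
print(p)                            # approximately .0000522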

2.8.3 Statistic

A statistic is any function of random variables that does not involve an unknown parameter. A statistic (or statistics) is typically a way to summarize the result of a random sample with one (or a few) values. The most common statistic that we’ll come across is one where we take the average of the random variables from the random sample. This statistic is written X̄ and read as “X-bar”. Mathematically, it is:

\bar{X} = \frac{\sum_i X_i}{n}

Notice that such a statistic gives us a quick summary of what happened in the overall experiment by providing us the average value that occurred. Other statistics that are often considered are

• S = \sqrt{\dfrac{\sum_i X_i^2 - \frac{(\sum_i X_i)^2}{n}}{n - 1}}, the standard deviation of the values of the random variables. This is a measure of how spread out the values were.

• The median of the random variables (the middle value).
• The smallest of the random variables.
• The largest of the random variables.

This is just a small list of statistics that may be used in any particular situation. They each serve a particular purpose depending on the situation. We will most often concentrate on Xˉ, but we’ll come across others as well.
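These summaries are one-liners on a computer. A sketch assuming numpy, using the first four observations from the dataset in the next section just for illustration:

import numpy as np

x = np.array([1.0488, 8.7128, 5.1088, 2.4621])   # a tiny illustrative subsample
print(x.mean())                                  # X-bar
print(x.std(ddof=1))                             # S, using the n - 1 divisor as above
print(np.median(x), x.min(), x.max())            # median, smallest, largest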

The most important thing to recognize about a statistic is that it too is a random variable in its own right. In other words, the value we will get for X̄ is not known to us ahead of time. Its value will depend on the number of observations, which observations we collected data on, the exact place and time observations were collected, etc. Since it is a random variable, it will have a probability distribution and we can discuss it just like we’ve discussed other random variables so far.

2.9 Analysis of Probability Models and Parameter Estimation

2.9.1 Overview of Problems to be Solved

For both discrete and continuous random variables, the general problem types we encounter are the same. The two classes of problems can be thought of as:

• Given a probability model, validate its correctness based on how well it describes real data.
• Given a probability model that we assume to be correct, estimate the unknown parameters of the model based on data that has been collected (or conduct tests regarding those parameters).

Here we will give a general overview of both of these for a continuous random variable example, and then discuss the second in more detail.

Let’s set up the example. Let X be a continuous random variable that measures the amount of time it takes for a certain cellular protein to degrade. Protein degradation can occur spontaneously due to the protein’s aging process; eventually, the molecular structure may degrade and no longer function. Or, a protein may be actively degraded by other molecules in the cell, such as other proteins (see the Ewens and Grant reference).

X is certainly a random variable since any particular observation of this process will yield different times due to random and unknown factors in the cellular process. As researchers in this area, we are interested in constructing a probability model that does a good job of describing degradation for this particular protein.

Let’s start out considering the possible models. Since X is a time measurement, we are likely to consider probability models that take this into account (i.e., allow for the random variable to take on non-negative real numbers). Although there are many such probability models, a couple of common ones that we have discussed are the gamma distribution and the exponential distribution (which is just a special case of the gamma). Both are actually very common distributions to use to model waiting times (in this case, waiting time until the protein degrades).

For this example, let’s say that we will use the exponential distribution. So, we formally define our model like this:

X_1, ..., X_n ∼ Exponential(λ).

This states that we will take n independent observations of this random variable, and that measured values will follow an exponential distribution with unknown parameter λ. Instead of saying “an exponential distribution,” we might more correctly say “some exponential distribution”. In other words, we are claiming that the model follows the basic shape of an exponential distribution (some examples of which are in the relevant section earlier in this chapter), but since we don’t know λ, we don’t know for sure which such exponential distribution is correct. So, our first goal is to estimate λ based on data we collect.

Say we collect data on n = 100 randomly selected such proteins. Those 100 values are below:

1.0488 8.7128 5.1088 2.4621 8.0735 4.4189 4.2664 5.0919 5.5437 1.6770 1.7151 1.7572 0.5448 2.6872 4.2220 1.0151 5.3031 3.8807 2.0359 0.0426 1.4084 6.7384 0.1361 2.0712 3.3359 0.6761 2.7327 4.2882 4.4953 1.8219 1.2427 8.9060 12.2685 2.8841 0.5807 0.5927 3.4842 0.9928 9.4022 3.8196 6.3884 2.1940 0.2338 2.1911 0.9576 3.9605 2.7559 6.6619 2.2530 4.2422 0.7012 1.1716 5.0964 1.8256 0.4665 0.9726 2.0984 7.5901 1.1320 7.7666 7.2670 9.8356 1.1392 0.5069 11.2457 5.7494 10.2586 2.5182 1.9329 9.8432 2.0279 0.6773 8.9189 2.3682 8.1927 7.2889 7.9581 12.2379 2.9023 2.3436 2.4534 0.7263 1.4295 0.8745 2.9707 0.4734 3.8200 3.3817 5.2861 5.1474 0.4276 5.7718 4.3280 2.0507 2.2834 14.4425 4.1508 9.9144 2.1074 4.4184

What is the best estimate for the parameter λ? We could eyeball it, as in the following graph. Here, we display a histogram of our data and compare it to a few different exponential curves (of the many that we could draw).

[Figure: histogram of the 100 observations with exponential pdf curves for λ = 0.15, 0.25, and 0.35 overlaid]

It’s a close call by eyeball, but it would appear that a model with λ = .35 is certainly not the best of these, as it seems to overestimate the actual number of small values of X, and underestimate the number of larger values. The other two values of λ plotted both seem pretty good in different ways. λ = .25 seems to do better on the left side of the plot, while λ = .15 seems to do a better job on the right.

Of course, we don’t just eyeball it when truly solving this problem. But this is the essence of the problem of parameter estimation - attempting to determine the value of unknown parameters that seems to fit best with the data we collected. In reality, we will rely on techniques like the method of moments or maximum likelihood to solve the problem.

Returning to the first general problem posed in this section - that of model validation - let us say that we had it in our mind (for some reason) that a normal probability model would do a good job with this data.

Say we thought X_1, ..., X_n ∼ N(4, 3) would describe the data well. We could then use the collected data to see how well our hypothesized model describes it. Below is a similar plot as above, but with the N(4,3) curve drawn over the histogram instead.

[Figure: the same histogram with the N(4, 3) density curve overlaid]

We would quickly see that our model was no good; it does a bad job of describing the real world. Of course, in many real situations, it isn’t so easy to eyeball it, and we will certainly need more rigorous statistical techniques to answer the question.

2.9.2 Method of Moments

The method of moments, or MOM, is one method for deriving estimators for unknown parameters. It is a very logical technique, and typically leads to fairly simple computations in real problems.

First, we will look at the simple example of the exponential distribution. This is “simple” because there is only one unknown parameter in the distribution, namely λ. We want to find an estimator that will do a good job of estimating the true value of λ.

The method of moments says to equate the mean for the random variable with the sample mean, then solve for the parameter. For the exponential distribution, we get

\frac{1}{\lambda} = \bar{X} \quad \Rightarrow \quad \tilde{\lambda} = \frac{1}{\bar{X}}

as our estimator for λ.

Example 2.11: In our protein example, the average of our 100 observations is X̄ = 3.938. So, the MOM estimator leads to 1/3.938 = .2539 as our numerical estimate of λ. Notice this is very close to the middle of the three lines drawn on that plot earlier.

□

For other distributions, such as the normal, gamma, or beta, there are often two parameters (or sometimes more) that need to be estimated. We need to extend the method of moments a bit to handle these. The more generic method involves the following idea. If there are k parameters to be estimated, we will use the first k sample moments, defined as

1st sample moment = sample mean = \frac{1}{n}\sum X_i = \bar{X} \equiv m_1
2nd sample moment = \frac{1}{n}\sum X_i^2 \equiv m_2
3rd sample moment = \frac{1}{n}\sum X_i^3 \equiv m_3
\vdots
kth sample moment = \frac{1}{n}\sum X_i^k \equiv m_k

and the first k population moments, which are defined as

1st population moment = \mu_X
2nd population moment = \mu_{X^2}
3rd population moment = \mu_{X^3}
\vdots
kth population moment = \mu_{X^k}

These population moments after the second are not necessarily always easy to find, but fortunately we don’t often have more than two parameters, so it is only the first two we need to deal with.

The next step is to equate sample moments with population moments of the same order. This creates k equations with k unknown parameters. Solving these equations gives us the simultaneous method of moments estimators for those parameters. This is best shown through an example.

Let’s look at the gamma distribution. Let X_1, ..., X_n ∼ Gamma(α, β). Both α and β are unknown parameters and we want to use the method of moments to find estimators for them. In fact, as we’ll see, other methods such as the maximum likelihood method, when applied to the gamma distribution, will not find closed-form solutions for estimators of α and β. So, the MOM is very useful here.

First, we’ll refer to the first two sample moments as m_1 and m_2 to start with so there isn’t so much notation floating around. Now, the first population moment is just μ_X = αβ as mentioned in Section 2.7.2. The second population moment takes just a little more work; we need to twist around the variance formula to calculate μ_{X²} and plug in the known variance of the gamma distribution:

\sigma_X^2 = \mu_{X^2} - \mu_X^2
\alpha\beta^2 = \mu_{X^2} - \alpha^2\beta^2

leaving us with

\mu_{X^2} = \alpha\beta^2 + \alpha^2\beta^2 = \alpha(\alpha + 1)\beta^2.

So our system of two equations is now

\alpha\beta = m_1
\alpha(\alpha + 1)\beta^2 = m_2

and we need to solve this for α and β. We can simply rearrange the first equation to get β = m_1/α and make this substitution in the second equation and solve for α. After some algebra, we get

\alpha = \frac{m_1^2}{m_2 - m_1^2}

Plugging this back into the equation for β gives

\beta = \frac{m_2 - m_1^2}{m_1}

Now we can substitute in for m_1 and m_2 to see our final MOM estimators for the parameters are

\tilde{\alpha} = \frac{\bar{X}^2}{\frac{1}{n}\sum X_i^2 - \bar{X}^2}
\tilde{\beta} = \frac{\frac{1}{n}\sum X_i^2 - \bar{X}^2}{\bar{X}}

Example 2.12: Let’s go back to the protein degradation example. Suppose we started out modeling our data using a generic gamma distribution instead of exponential (which is just a special case of the gamma where we set α = 1).

So we have X_1, ..., X_n ∼ Gamma(α, β) and will use the equations above as estimators for these parameters. All we need to do is calculate \bar{X} and \frac{1}{n}\sum X_i^2 from the 100 observations. We have seen already that \bar{X} = 3.938, and a computer package or calculator will show us that \frac{1}{n}\sum X_i^2 = \frac{1}{100}(2586.945) = 25.869. So our estimates using the MOM estimators are

\tilde{\alpha} = \frac{(3.938)^2}{25.869 - (3.938)^2} = 1.4967

\tilde{\beta} = \frac{25.869 - (3.938)^2}{3.938} = 2.6311
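These two estimates are easy to reproduce from the two sample moments; here is a short Python sketch using the moment values reported above:

# MOM estimates for Gamma(alpha, beta) from the first two sample moments
m1, m2 = 3.938, 25.869
alpha_mom = m1**2 / (m2 - m1**2)    # approximately 1.4967
beta_mom = (m2 - m1**2) / m1        # approximately 2.6311
print(alpha_mom, beta_mom)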

We notice that our estimate of α is 1.4967, a bit more than the α = 1 we assumed by using the exponential distribution previously. Let’s see how these estimates do visually with our data:

[Figure: the histogram with the Gamma(1.4967, 2.6311) density curve overlaid]

Not too bad a fit. The gamma distribution helps recognize the slight bump in the number of observed values between 2 and 3. We would probably conclude that this is a better probability model for this data, at least by sight.

□

You should double check the result above for the gamma distribution, and try the MOM yourself for other distributions, such as the normal.

2.9.3 Maximum Likelihood Estimation

Maximum likelihood estimation, or MLE, is a more rigorous procedure for finding estimators for parameters than is the method of moments. Although many times we will get the same answer whether we use the MLE or MOM methods, sometimes it can be different, particularly for more complex probability models. In such cases, the MLE-based estimator is often considered the better one. The tradeoff is that finding the maximum likelihood estimator can be more difficult, and sometimes even require computational effort.

The method of maximum likelihood proceeds as follows. (Actually, in some real situations this procedure will not work correctly, but we won’t discuss those here.)

1. Write down the joint pdf of the random sample. We will also refer to this as the likelihood function of the parameter(s), and view it as a function of the parameter(s) instead of a function of the random variables. Call this function L(θ) where θ is our generic symbol for the parameter of the model.

2. Take the natural log of L, giving us ln L(θ).

3. Differentiate with respect to each parameter individually (or just once if there is only one parameter).

4. Finally, set the resulting equation(s) equal to zero and solve for the parameters. The result gives you the MLE for these parameters (other than one other technical step that we won’t discuss). This last step can be impossible in some situations.

The same procedure applies whether we have discrete or continuous variables.

Let’s take a look at an example using the exponential distribution, which is easy to deal with. We have

X1,...,Xn Exponential(λ) and so the likelihood function for λ, and therefore the joint pdf of X1,...,Xn ∼ is (which we, as before, can easily write down because of the independence rule)

L(\lambda) = f_{X_1,...,X_n}(x_1, ..., x_n) = f_{X_1}(x_1) \times \cdots \times f_{X_n}(x_n)
= \lambda e^{-\lambda x_1} \cdots \lambda e^{-\lambda x_n}
= \lambda^n e^{-\lambda \sum x_i}

Continuing, we have

\ln L(\lambda) = n \ln \lambda - \lambda \sum x_i

and differentiating and setting equal to 0 we get

\frac{d \ln L(\lambda)}{d\lambda} = \frac{n}{\lambda} - \sum x_i = 0

Now, solving this for λ gives us \hat{\lambda} = 1/\bar{X} as the maximum likelihood estimator of λ. Notice this is the same as the MOM estimator, and also is one that makes sense since λ is 1 over the mean of the distribution.

Try using the maximum likelihood method for the gamma distribution on your own. Since there are two unknown parameters, you will take two derivatives, and have two equations set equal to zero. You will notice that for the gamma distribution, these equations cannot be solved analytically for the parameters. We would have to rely on computational techniques to solve them numerically.
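For the gamma case, here is a numerical maximization sketch assuming numpy and scipy; the simulated data is a hypothetical stand-in for the 100 protein times:

import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(2)
x = rng.gamma(shape=1.5, scale=2.6, size=100)    # stand-in data

def negloglik(theta):
    a, b = np.exp(theta)                         # log scale keeps alpha and beta positive
    return -np.sum(stats.gamma.logpdf(x, a, scale=b))

res = optimize.minimize(negloglik, x0=[0.0, 0.0], method="Nelder-Mead")
print(np.exp(res.x))                             # numerical MLEs for (alpha, beta)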

2.9.4 Confidence Intervals

A confidence interval, or CI, is a way of estimating a parameter by giving a range of likely values for the parameter. In examples in the previous section, we provided an estimate for an unknown parameter simply by reporting a single number. This is convenient and easy to interpret, but lacks some important information. In fact, we know that the single number provided as an estimate is not correct. So, in addition to that single number, we would like some information regarding how precise that number is as an estimate of the parameter. This is what a confidence interval does.

Say we have an unknown parameter θ in a problem. A typical way we might report a confidence interval for θ (after having collected and analyzed some data) is to say “a 95% confidence interval for θ is (4.56, 5.78).” The 95% is our confidence level. It shows how sure we are of the range we are reporting. The interval itself, 4.56 up to 5.78 in this example, is the range of values in which the data suggests θ really falls.

Computing a confidence interval is quite easy in some cases, and can be much more difficult in others. For this course, we will take a simple view of confidence interval calculations (unless otherwise noted). We will use the following formula to compute a CI for a parameter θ:

95\% \text{ CI} = \hat{\theta} \pm 2\,\sigma_{\hat{\theta}} \qquad (2.9)

where \hat{\theta} is the estimate of θ based on some method like MOM or MLE, and \sigma_{\hat{\theta}} is the standard deviation of that estimator.

The use of this formula is fine for many situations. Essentially, it is based on the fact that maximum likelihood estimators approximately have a normal distribution when the sample size is relatively large. So, if we base estimators on the MLE method and have a somewhat large sample size, this CI formula will work just fine for our purposes.

Note, if you want a 99% CI instead, replace the value “2” with “2.576”.
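As a sketch of Formula (2.9) for the exponential example, using λ̂ = 1/X̄ and taking the large-sample standard deviation of λ̂ to be λ̂/√n (this comes from the Fisher information of the exponential; the notes do not derive it, so treat it as an assumption here):

import math

n, xbar = 100, 3.938
lam_hat = 1 / xbar                         # MLE of lambda, approximately .2539
se = lam_hat / math.sqrt(n)                # assumed large-sample standard deviation
print(lam_hat - 2 * se, lam_hat + 2 * se)  # approximate 95% CI for lambda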

2.10 Hypothesis Testing

Hypothesis testing, like parameter estimation, is an inferential procedure in statistics. The situation is that we have a probability model with unknown parameters. Instead of simply estimating values for the param- eter(s), we might want to test hypotheses about it/them. This is the topic of hypothesis testing. It turns out that parameter estimation and hypothesis testing are very closely related.

2.10.1 General Setup of a Hypothesis Test

A hypothesis test always consists of two hypotheses, called the null hypothesis and the .

The null is labeled H0 and the alternative is labeled Ha. Typically, the null is the hypothesis that we will believe by default, unless our data provides strong evidence that we should instead believe the alternative. From a probability model standpoint, however, the null hypothesis is typically the one with a simpler probability model, and the alternative is the one with a more complex model (such as having more unknown parameters).

To help understand the terminology as we discuss it, let’s refer to a common hypothesis testing example in statistical genetics. Consider a locus with two alleles A and a, whose population proportions are P_A and P_a = 1 − P_A. The three genotypes at this locus are AA, Aa, and aa, with probabilities P_{AA}, P_{Aa}, and P_{aa}, which sum to 1.

A common question is whether this locus is in Hardy-Weinberg Equilibrium (HWE), a situation we will discuss in much more length in Chapter 3. HWE says that the relationship between genotypic and allele frequencies has the following form

P_{AA} = P_A^2
P_{Aa} = 2P_A(1 - P_A)
P_{aa} = (1 - P_A)^2.

The probability model that we would be employing if HWE was not the case would be

(X_{AA}, X_{Aa}, X_{aa}) ∼ Multinomial(n, P_{AA}, P_{Aa}, P_{aa})    (2.10)

and the model would reduce down to the following if HWE was true:

(X_{AA}, X_{Aa}, X_{aa}) ∼ Multinomial(n, P_A^2, 2P_A(1 - P_A), (1 - P_A)^2)    (2.11)

So, our hypothesis comes down to a question of which probability model best describes the data. A way to write the two hypotheses would be

H_0: P_{AA} = P_A^2, \quad P_{Aa} = 2P_A(1 - P_A), \quad P_{aa} = (1 - P_A)^2
H_a: the above relationships do not hold

Or we might say that our null hypothesis is that Equation (2.11) is the correct probability model and the alternative hypothesis is that Equation (2.10) is the correct probability model.

Now, we ultimately want to make some statement about which hypothesis we believe. This, of course, will be based on data that we collect. Generally speaking, we will let the data decide for us: which hypothesis does it most support? The only caveat to this is that, as stated above, we will only claim that it supports the alternative hypothesis if there is very strong evidence in that direction; otherwise, by default, we will believe the null.

2.10.2 The Decision Process

Making that decision is often done through a hypothesis testing process referred to as the Neyman-Pearson (NP) method of hypothesis testing. Although a number of different general methodologies for conducting tests exist, the NP method is by far the most common and well-known. We will focus on that methodology.

The NP method of performing a hypothesis test consists of the following steps. We will assume that the data we have collected is represented by the random variables X1,...,Xn.

First, we must determine an appropriate test statistic to use. A test statistic is a function of the data, and it summarizes the data in a way that will be useful for making our decision. It is very similar in nature to an estimator when we discussed parameter estimation. An estimator was a statistic (a function of the data) that summarized the data in such a way that it provided a reasonable estimate for a parameter.

The choice of an appropriate test statistic is a problem that involves some theory that we will not be going into (as opposed to finding parameter estimators - where we did discuss the theory of MLEs, for example, to choose appropriate estimators). For particular situations, we will just mention what the appropriate statistic is.

The important thing we need to know about the test statistic we choose is what probability distribution it has when the null hypothesis is true. Remember that a test statistic, just like an estimator, is itself a random variable. Its calculated value depends on the exact sample selection and other randomness involved in the data collection process. So the value we would get for one sample of size n would not be the same as the value we would get for a different sample of size n.

Therefore, the probability distribution of the test statistic describes the likelihood of different values being calculated after collecting data. Keep in mind that this distribution is the distribution we would get when the null hypothesis is true. This is a key fact that we’ll use to help make a decision about our hypotheses.

Now that we have a test statistic and know its distribution under the null, we can calculate the value of the test statistic based on our data. Our decision process comes down to the following general concepts:

• If the calculated value of the test statistic seems very unusual based on its distribution (remember this is the distribution as if the null were true), that will be a sign to us that the null hypothesis is actually not true. We will say that we “reject the null hypothesis.”
• On the other hand, if the calculated value of the test statistic is right in line with typical and common values of its distribution under the null, that will be a sign to us that the null hypothesis is a reasonable model, or at least that there isn’t evidence against it. We will say that we “do not reject the null hypothesis.”

Let’s set up a hypothetical example to help see this clearly. Say we have decided on some test statistic - call it T. T might be X̄ or some other function of the data we have collected. Let’s also say that we know that T ∼ χ²_4 if the null hypothesis were true. (Note: We haven’t called this statistic T because it has a t-distribution; the T stands for Test Statistic). So, we can draw a picture of this distribution

[Figure: the χ²_4 density f_T(t), with the values T_A = 2.32 and T_B = 11.12 marked on the horizontal axis]

Also in this picture are the values calculated for T for hypothetical datasets we might have (dataset A and dataset B). Notice the value of T calculated for dataset A (2.32) would be considered a very “typical” or “common” value to observe for T if the null were true, because it falls right in the “high probability” area of its distribution. So, we are not surprised that this value was observed, and in fact it seems to support the notion that the null hypothesis really is true.

However, now notice the value of T calculated for dataset B (11.12). This would certainly not be considered typical or common to observe as a value for this distribution. It is well out into the right tail of the distribution - a very low probability area of the curve. So, since this is the distribution of test statistic values we would expect if the null were true, but we actually calculated a value very atypical of this distribution, logic would suggest that our assumption that the null was true is probably wrong. Therefore, our data tends to support the notion that the null is not true, and the alternative is.

Practically speaking, we would like to have a little more mathematical rigor behind our assessment of whether the test statistic seems very unusual or somewhat typical to have calculated if the null were true. This is provided by the calculation of the p-value of the test. The definition of the p-value is

p-value = P(the test statistic could be even more extreme than its calculated value, according to its distribution under the null)    (2.12)

First of all, we notice the p-value is a probability. Its value will be between 0 and 1. Second, it measures how unusual, or atypical, it was to have calculated that value of the test statistic if the null hypothesis actually were true. Smaller values (close to 0) indicate a test statistic that was somewhat unusual. Larger values (further from 0) indicate a test statistic that was more in line with the null and fell right in the “meat” of the distribution.

Typically, we might use a cutoff of, say, .05 or .01, to make a final determination of whether the test statistic was unusual or not. This cutoff is called the significance level of the test and is referred to as α in notation. Our rule, then, might be:

• If the p-value is less than or equal to α, we will reject the null hypothesis (the data provided strong evidence against it), or
• if the p-value is greater than α, we will not reject the null hypothesis (the data did not provide strong enough evidence against it).

So, we can use the p-value to make a clear reject or non-reject decision, or we can simply use it as a measure of how strong our evidence was against the null, and let others decide for themselves how they want to interpret it.

Calculating the p-value also needs to be discussed. The calculation is a probability calculation, and so it will be based on the distribution of the test statistic. Typically, these calculations will be done on the computer using Excel functions mentioned earlier, or other software or websites.

For example, let’s return to our hypothetical situation above with a test statistic T ∼ χ²_4. Consider dataset B, where we calculated T = 11.12. The p-value of this test is, by the definition in Equation (2.12)

p-value = P(T ≥ 11.12)

We can compute this using Excel’s cdf function for the chi-square distribution.

Excel Formula

CHIDIST(11.12,4) = .0252

So,

p-value = P(T ≥ 11.12) = .0252.

According to our above rule, if our α was set to be .05, we would reject the null hypothesis in favor of the alternative. In other words, the evidence against the null was strong enough to not believe it is true.

Graphically, this p-value is the area under the pdf of T to the right of 11.12.

[Figure: the χ²_4 density with the area .0252 to the right of T_B = 11.12 shaded]

This also helps to show that the more evidence there is against the null hypothesis, the smaller and smaller the p-value will be (closer to 0). This is because [More evidence against H0] → [Test statistic more atypical for its distribution] → [Area to the right of the test statistic small] → [p-value small].

Notice the role α plays in this. Let’s take the previous picture, but also plot the area α = .05 in the right tail of the distribution.

[Figure: the χ²_4 density with the α = .05 tail area starting at the critical value 9.49, and the p-value area .0252 starting at 11.12]

Notice that essentially, the value chosen for α determines where the line is drawn that separates “typical values” of the distribution of T (values that would lead to not rejecting H0 ) from “unusual values” of the distribution of T (values that lead to rejecting H0 ). These unusual values, those to the right of the line, are called the rejection region.

The value of the test statistic that falls where the cutoff is made is called the critical value of the test. In the above picture, that value is 9.49. We’d say that we would reject H0 for values of T greater than 9.49. Notice that 9.49 = F_T^{-1}(.95); it is the 95th percentile of the distribution of T. It can be found in Excel by

Excel Formula

CHIINV(.05,4) = 9.4877
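The same two numbers fall out of a Python sketch (assuming scipy; CHIDIST and CHIINV work with upper-tail areas, which sf and ppf reproduce below):

from scipy import stats

print(stats.chi2.sf(11.12, 4))    # upper-tail area: the p-value, approximately .0252
print(stats.chi2.ppf(0.95, 4))    # 95th percentile: the critical value, approximately 9.4877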

Finally, let’s also calculate the p-value associated with dataset A in the above example. We would use Excel to calculate

Excel Formula

CHIDIST(2.32,4) = .6771

Therefore, .6771 is the p-value associated with that dataset and that value of the test statistic. It would be considered quite a large p-value, and therefore a sign that the data provides no evidence against the null. In fact, as we’ve discussed, that data is perfectly in line with what we would expect if the null is true.

In the above example, we defined “more extreme than its calculated value” as being values further to the right. In some testing situations (and in our above example), that is the correct way to look at it. But in others, the p-value might be calculated by considering “more extreme” to mean “further to the left”, or in some cases to mean “further out in either tail” of the distribution. Which meaning to use will be discussed at the appropriate time, and often will be clear from the setup of the particular test we are conducting.

2.10.3 Errors in Hypothesis Testing

It is important to recognize that whenever we conduct a hypothesis test, there is a chance that our conclusion will be wrong. Table 2.4 shows the situations we may be in when we conduct a hypothesis test. The columns represent the reality of the null hypothesis: it is either really true or really false. Remember, the null and alternative hypotheses make statements about populations (about the underlying truth of real-world phenomena), and we will never really know which is really the truthful statement. We can only base decisions on a representation of that population, that is, on the data we collect. Despite that, one, and only one, of these hypotheses is actually true.

The two rows represent the two possible decisions we can make: to either believe the null hypothesis (i.e., not reject it), or to not believe it (i.e., reject it). Depending on which row and column our situation falls in, we have either made a correct decision or an incorrect decision. This table summarizes the possibilities:

We can see the two situations in which errors will be made: if H0 really is true, but our data incorrectly leads us to reject it (called a Type I Error), and if H0 really is false, but our data incorrectly leads us to not reject it (called a Type II Error). We, of course, would like to not make either type of error, but the possibility is a fact we have to live with since we only have partial information (i.e., a sample of data). So, the best we can do is to make the probabilities of these errors small. We do so in the following way.

Table 2.4: Errors in hypothesis tests

                    Truth of H0
Decision            H0 True          H0 False
Not Reject H0       Correct          Type II Error
Reject H0           Type I Error     Correct

First let’s discuss Type I Error. The probability of making a Type I Error is α, or the significance level of the test - a quantity that we have already come across. In other words, the α value we chose in the last section actually represents an error probability. Clearly we want this to be low, so we will typically want to choose α to be values such as .01, .05, or .10. We would decide to make it smaller (say, .01 or even lower) if it is very important that we do not make a Type I Error. In many testing situations, a Type I Error represents an error we want to be careful to avoid. For example, in a medical scenario, a Type I Error might be believing that a certain surgical procedure is useful (rejecting H0 ) when actually it isn’t (H0 was actually true). We’d like a small chance of having this be the outcome, so would probably set α quite low.

Notice that for a Type I Error, the Neyman-Pearson approach to hypothesis testing allows us to set its probability ahead of time, giving us the opportunity to make it as low as we need it to be. It is controllable.

However, the tradeoff is that we cannot directly control the probability of a Type II Error. This probability is often referred to as β. We cannot specify as part of the testing procedure how low we want β to be. This is unfortunate, but we can indirectly control β through a couple of means:

• β will be smaller if we use an appropriate testing procedure. This refers to the fact that there are often different ways to handle any particular test we want to conduct. In other words, there are different test statistics we could use (just as in parameter estimation, there are different estimators we could use). Choosing the “best” one will serve to make β smaller than if we don’t.
• β can generally be made smaller by taking a larger sample. The tradeoff here is, of course, that a larger sample requires more work, so we won’t have unlimited resources to continue decreasing β this way.

Regarding the second point above, it is often possible to determine prior to collecting data how large a sample we need in order to make β have a certain value. In practice, these calculations are almost always made because it allows the researcher to plan the size of the study and assure him or herself that the probability of a Type II Error is as low as they want.

It should be noted that the tradeoff between sample size and β can be such that it may not be possible to make β be a small number; the researcher may have to live with β being as large as .10, .20, or often even much higher. This fact is the reason that the null and alternative hypotheses are set up as we discussed. The null is the one we will believe by default, and the alternative is the one that we won’t believe unless there is strong evidence in favor of it. In this setup, a Type I Error (whose probability we can control) is the more dangerous error to make, and a Type II Error (whose probability is not directly controllable) is unwanted but not as dangerous.

2.11 Chi-squared Goodness-of-Fit Test

The first specific type of test we will discuss is one that can be used to formally test whether the probability model chosen is appropriate or not. It essentially does, in a more mathematically formal fashion, what we did by eyeball when comparing the data to pdf curves in Section 2.9.1. The test procedure that is often used for this type of problem is called a chi-squared goodness-of-fit test.

2.11.1 Model is Completely Specified

The first situation we will consider is when we have completely specified the probability model we are suggesting in the null hypothesis. By “completely specified”, we mean that there are no unknown parameters in the model. There is one, and only one, probability distribution that has been specified for the data.

As an example, let us return to one of the protein degradation models we looked at in Section 2.9.1. We wrote our probability model as X_1, ..., X_n ∼ N(4, 3). Now, we want to formally test whether this is the correct model. Our hypotheses are

H0 :N(4, 3) is an appropriate model for this data

Ha :N(4, 3) is not an appropriate model for this data

Notice the null hypothesis represents only one possible distribution - the Normal distribution with mean 4 and standard deviation 3. There are no unknown parameters in this distribution.

Let’s review the data from that problem. Here was the histogram:

[Figure: histogram of the 100 observations over x = 0 to 14]

Since the sample size was n = 100, the percentages in each bar (e.g., 18% in the 1st bar) are also the number of observations that fell in that bar (e.g., 18 observations were between 0 and 1).

How might we conduct this test? The general approach for any hypothesis test, as we have discussed, is that we first need to decide on a test statistic to use (i.e., an appropriate way to summarize the dataset) and know what distribution that statistic has. Let’s consider the following approach to summarizing the data in preparation for conducting this test.

First, group the continuous data into k logical discrete categories. There are some rules of thumb we need to follow when doing this:

(a) Each category should represent a contiguous range of values. In other words, one category should not represent the ranges [0,1) AND [2,3) since these are not contiguous.

(b) Every possible value of the random variable must fall into one, and only one, category. In other words, categories should not overlap, and together must comprise the entire range of the random variable. Note: this last part should not be interpreted as “the entire range of the data” or as “all the observed values in the dataset.” The random variable in the probability model is likely to have a range greater than the observed data, as noticed in our example (a normal random variable can have any real number value, but the data ranged only from 0.0426 to 14.4425).

(c) Categories should be created so that the number of observations in any category is “not too small.” We will discuss the reason for this, and a more formal definition of “too small” later.

Using these rules, let’s consider the following categorization for our example.

Category   Range of Values   Observed Probability   Observed Count
1          (−∞, 1)           18%                    18
2          [1, 2)            14%                    14
3          [2, 3)            21%                    21
4          [3, 4)            7%                     7
5          [4, 5)            9%                     9
6          [5, 6)            9%                     9
7          [6, 7)            3%                     3
8          [7, 8)            5%                     5
9          [8, 9)            5%                     5
10         [9, ∞)            9%                     9

This meets all of our rules, although for rule (c), we’ll just assume it does for now. Notice the third and fourth columns in this table are labeled “Observed Probability” and “Observed Count”. These are the percentage of observations and the total number of observations, respectively, that fell into each category in the data we collected. The totals in these columns should be exactly one and exactly the sample size, respectively.

Now, we need to calculate expected percentages and counts for each category. The expected percentage for a category is the percentage of observations that we would expect to fall in the category if the null hypothesis were true. For example, the expected percentage for category 7 in the table is the probability that a N(4,3) random variable will fall between the values 6 and 7. This is P(6 ≤ X < 7) = F_X(7) − F_X(6). The Excel calculation is

Excel Formula

NORMDIST(7,4,3,TRUE)-NORMDIST(6,4,3,TRUE)=.0938

The expected count for a category is the sample size multiplied by the expected probability. In the case of category 7, this is 100 × .0938 = 9.38. After making these calculations for each category, we can put the results in two new columns in our table:

Category   Range      Observed Prob   Observed Count   Expected Prob   Expected Count
1          (−∞, 1)    .18             18               .1587           15.87
2          [1, 2)     .14             14               .0938           9.38
3          [2, 3)     .21             21               .1169           11.69
4          [3, 4)     .07             7                .1306           13.06
5          [4, 5)     .09             9                .1306           13.06
6          [5, 6)     .09             9                .1169           11.69
7          [6, 7)     .03             3                .0938           9.38
8          [7, 8)     .05             5                .0674           6.74
9          [8, 9)     .05             5                .0434           4.34
10         [9, ∞)     .09             9                .0478           4.78
Total                 1.00            100              1.00            100

The concept behind the goodness-of-fit test now says: “Compare the counts observed for each category to the counts we would expect in each category if the null hypothesis were true. If these counts are relatively close to each other, category by category, then that is evidence that the null is true. If they are generally far apart, then that is evidence that the null is false.”

This gives some intuition behind the test statistic that we use, which we’ll now define. It is

T = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j} \qquad (2.13)

where the sum is over all k categories defined, O_j is the observed count for category j, and E_j is the expected count for category j. Notice that the numerator will be larger the further apart the observed and expected counts are, and will be smaller the closer together they are. Our logic then suggests that larger values of T

will lead to smaller p-values - a sign of more evidence against H0 .

It turns out that T ∼ χ²_{k−1} if the null hypothesis is true. That is, it has a chi-squared distribution with degrees of freedom equal to one less than the number of categories used. We’ll use this distribution to

calculate p-values as a gauge as to how strong our evidence is against H0 .

Let’s follow the above example through. We’ve set up all the necessary information in the above table. We now just need to calculate the value of the test statistic and use that to calculate the resulting p-value. The test statistic calculation is (by applying Formula (2.13) to each row of the table)

T = \frac{(18 - 15.87)^2}{15.87} + \frac{(14 - 9.38)^2}{9.38} + \cdots + \frac{(9 - 4.78)^2}{4.78} = 23.28

For our problem, T ∼ χ²_9 if H0 is true. A picture of the result is

[Figure: the χ²_9 density with the area .0056 to the right of T = 23.28 shaded]

As seen in the picture, the p-value calculation is p-value = P(T > 23.28) = 1 − F_T(23.28) = .0056. The calculation in Excel is

Excel Formula

CHIDIST(23.28,9)=0.0056

The p-value is quite small, certainly smaller than any significance level typically chosen. It appears that with strong evidence, we can reject the null hypothesis and conclude that a N(4,3) probability model does not fit this data well. This is not a surprising conclusion considering the picture we analyzed in the previous chapter.
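The whole computation can also be scripted; here is a sketch assuming numpy and scipy, with the observed counts and N(4,3) expected probabilities taken from the table above:

import numpy as np
from scipy import stats

obs = np.array([18, 14, 21, 7, 9, 9, 3, 5, 5, 9])
probs = np.array([.1587, .0938, .1169, .1306, .1306, .1169, .0938, .0674, .0434, .0478])
exp = 100 * probs                       # expected counts under N(4, 3)
T = ((obs - exp) ** 2 / exp).sum()
print(T)                                # approximately 23.28
print(stats.chi2.sf(T, len(obs) - 1))   # p-value on k - 1 = 9 df, approximately .0056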

A final note on the terminology used for this test procedure: the “goodness-of-fit” test. It should be a little more clear now why that terminology is used. We are trying to determine how well (or “good”) a specified distribution “fits” with the data we observed.

2.11.2 Model is not Completely Specified

A slightly different take on the test discussed above is as follows. In some (in fact, many) situations, we might know or be willing to specify the general form of the distribution (such as normal, gamma, exponential, etc.), but not know the values of the parameters for whichever distribution we specify. In the above example, not only did we hypothesize that the data followed a normal distribution, but we were very specific about which normal distribution: the normal distribution with mean 4 and standard deviation 3. Based on our discussions in previous chapters, it should be no surprise that in most situations we don't know, or are unwilling to specify, the exact parameters of our model. The null hypothesis of the preceding section was somewhat limiting in this sense.

Let’s take that example and put a different spin on it. Now let’s say that we hypothesize that the data follows a gamma distribution, but are unable or unwilling to specify the exact parameters of the distribution. In other words, we believe that some form of the gamma distribution may fit the data well, but are not sure exactly which form.

Notice that in many respects, this is similar to the estimation problem we had discussed. We modeled the data with a gamma distribution, but did not specify the parameters. We then estimated the parameters based on the data we observed. The slight difference is that before, we estimated the parameters assuming the data had a gamma distribution, but made no statement about whether the gamma distribution itself made sense. Now, we will take that extra step and decide whether the entire model is sensible.

Now, the hypotheses we will test are stated as follows:

H0: Gamma(α, β) is an appropriate model for this data

Ha: Gamma(α, β) is not an appropriate model for this data

The procedure we use is still referred to as the chi-square goodness-of-fit test. And we follow the same general procedure as discussed in the previous section, with just two amendments:

(1) Since we don't know the values of the parameters, we must first estimate them. Typically maximum likelihood estimates would be used, but method of moments-based estimates are acceptable substitutes if necessary. Then, we base expected probability and expected count calculations on these estimated values.

(2) We still use the same test statistic formula, and it still has a chi-squared distribution. However, the degrees of freedom is now k − 1 − (number of estimated parameters).

Let's make the calculations for our example. First, we have already calculated estimates, based on the method of moments, for these parameters in Section 2.9.2. They were α̂ = 1.4967 and β̂ = 2.6311. Using these and Excel to compute probabilities for a gamma distribution, we get the following table of observed and expected counts (and probabilities).

Category   Range      Observed Prob   Observed Count   Expected Prob   Expected Count
1          (−∞, 1)    .18             18               .1418           14.18
2          [1, 2)     .14             14               .1817           18.17
3          [2, 3)     .21             21               .1614           16.14
4          [3, 4)     .07             7                .1308           13.08
5          [4, 5)     .09             9                .1014           10.14
6          [5, 6)     .09             9                .0767           7.67
7          [6, 7)     .03             3                .0570           5.70
8          [7, 8)     .05             5                .0419           4.19
9          [8, 9)     .05             5                .0305           3.05
10         [9, ∞)     .09             9                .0768           7.68
Total                 1.00            100              1.00            100

This leads to the following calculations for T and the p-value

T = (18 − 14.18)²/14.18 + (14 − 18.17)²/18.17 + ··· + (9 − 7.68)²/7.68 = 9.55

Now, since we had to estimate two parameters before making the expected count calculations, we have T ∼ χ²_{10−1−2} = χ²_7 if H0 is true. The p-value is

p-value = P(T > 9.55) = 1 − F_T(9.55) = 0.2155.

In Excel this is done by

Excel Formula

CHIDIST(9.55,7)=0.2155

This leads to the following picture of our test statistic

[Figure: the χ²_7 density curve f_T(t), with the area to the right of T = 9.55 shaded; this tail area is the p-value = .2155.]

We can see that this calculated value of T from our data is not unusual at all under the null hypothesis, leading to the fairly high p-value. This would lead us to not reject H0, and conclude that the gamma model does a reasonably good job of describing this phenomenon. (More formally, we would say that we DO NOT have enough evidence to claim that the gamma model is NOT correct.)
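The whole calculation (expected counts from the estimated parameters, the statistic, and the p-value) is also quick to script. This is a minimal Python sketch, assuming scipy, whose gamma.cdf with a scale argument matches the shape/scale parameterization used above:

Python Sketch

# Chi-squared goodness-of-fit test for Gamma(alpha, beta) with estimated parameters.
from scipy.stats import gamma, chi2

observed = [18, 14, 21, 7, 9, 9, 3, 5, 5, 9]
n = sum(observed)                          # 100
alpha_hat, beta_hat = 1.4967, 2.6311       # method-of-moments estimates
edges = [1, 2, 3, 4, 5, 6, 7, 8, 9]

cdf = [0.0] + [gamma.cdf(e, alpha_hat, scale=beta_hat) for e in edges] + [1.0]
expected = [n * (cdf[i + 1] - cdf[i]) for i in range(len(cdf) - 1)]

T = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 2                 # two parameters were estimated
print(T, chi2.sf(T, df))                   # approximately 9.55 and 0.2155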

2.11.3 Goodness-of-Fit Test for a Discrete Distribution

The discussion of goodness-of-fit tests so far has focused on situations where our probability model is a continuous distribution. But there is no reason we can't easily extend this procedure to discrete distributions. The only difference is that the categories we construct are no longer continuous ranges of real numbers, but groupings of discrete values. In fact, many categories will simply be represented by a single integer, as we'll see.

As an example, let's consider the situation where we observe the number of crossover events that occur between two loci on a chromosome. We make 85 such observations. We will model this data using a Poisson distribution, so we have X_1, ..., X_85 ∼ Poisson(λ). The resulting data are

010200200100000010
120211230112011110
221120100210200010
031100000000010211
1100000110120

Is the Poisson model appropriate? Our hypotheses are

H0: Poisson(λ) is an appropriate model for this data

Ha: Poisson(λ) is not an appropriate model for this data

We do not know λ, and so do not specify it in the model, so this falls into our second case.

Let's create the necessary table of categories, observed counts, and expected counts all at once. We'll have the values 0, 1, and 2 represent their own categories, but group together 3 and above into a category (since together these will have somewhat low probability). Also, we need to calculate the MLE for λ before proceeding. We have seen that the MLE is X̄, and so the estimate from this data is λ̂ = 0.6941. Putting all this together and making the appropriate expected probability calculations gives us

Values     Observed Prob   Observed Count   Expected Prob   Expected Count
0          .5059           43               .4995           42.46
1          .3176           27               .3467           29.47
2          .1529           13               .1203           10.23
3 and up   .0235           2                .0335           2.84
Total      1.00            85               1.00            85

Our test statistic calculation is

T = (43 − 42.46)²/42.46 + (27 − 29.47)²/29.47 + (13 − 10.23)²/10.23 + (2 − 2.84)²/2.84 = 1.21.

The statistic has a chi-squared distribution with 4 − 1 − 1 = 2 degrees of freedom. So, the p-value is P(T > 1.21) = 1 − F_T(1.21) = 0.5461, and our picture looks like

[Figure: the χ²_2 density curve f_T(t), with the area to the right of T = 1.21 shaded; this tail area is the p-value = .5461.]

The p-value is quite large, suggesting that our data is very much in line with a Poisson distribution. The closeness of the observed and expected counts makes this pretty clear. Our conclusion is that we do not have enough evidence to reject the null hypothesis, and so we do believe that a Poisson probability model does a good job describing this data.
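A sketch of this discrete goodness-of-fit test in Python (assuming scipy; note that the last category probability is taken as the complement P(X ≥ 3)):

Python Sketch

# Poisson goodness-of-fit test for the crossover count data.
from scipy.stats import poisson, chi2

observed = [43, 27, 13, 2]                 # counts for 0, 1, 2, and "3 and up"
n = 85
lam_hat = 0.6941                           # MLE: the sample mean of the data

p0, p1, p2 = (poisson.pmf(k, lam_hat) for k in (0, 1, 2))
probs = [p0, p1, p2, 1 - p0 - p1 - p2]     # last category is P(X >= 3)
expected = [n * p for p in probs]

T = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(T, chi2.sf(T, 4 - 1 - 1))            # approximately 1.21 and 0.55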

2.11.4 More on Expected Counts

One of the rules of thumb we mentioned when discussing the general setup of the goodness-of-fit test was that the expected counts for any particular category should not be too low. Let’s say more about this now.

This discussion will require a little more background on the test statistic and distribution used for this test. We have stated that the test statistic in Formula (2.13) has a chi-squared distribution if the null hypothesis is true, with degrees of freedom calculated depending on which of the two situations we were in.

But, this is only half of the story. Actually, this is the correct distribution ONLY IF the expected counts are not too low (and even in this case is just a close approximation of the correct distribution). This seems a little evasive, but there actually is no one perfect definition for what is meant by "not too low" in this case. A common rule of thumb that has been given by statisticians is that all expected counts should be greater than one, and most expected counts should be greater than five, with no more than 20% being less than five. We will go by this rule of thumb.

Looking back to our two example calculations, the rule of thumb was true in each case. Each time, we had all expected counts greater than one, and two of our ten categories (or 20%) had expected counts less than five, which just meets our definition. So it was in fact appropriate to use the chi-squared distribution to make our p-value calculations in those examples. However, in the Poisson example, one of our four categories (or 25%) had an expected count less than five, so we should have been aware of this fact at the time. See below for how we could have dealt with it.

If you do come across a situation where you have defined categories, but wind up not meeting this rule of thumb, there is a way to fix it. Simply, just redefine the categories by combining two or more together in a logical way. The categories we defined in the protein degradation example were really somewhat arbitrary. There are many ways to define them, some with more categories, some with fewer. We could have, for example, had just nine categories, with the ninth covering the range [8, ∞). This would in essence combine the last two of our ten categories into one, and result in a category with higher expected probability and therefore a higher expected count. In the gamma distribution version of that example, the expected count for this ninth and final category would have been 10.73, combining two categories whose individual expected counts were 3.05 and 7.68. If we had trouble with having too low expected counts, this would have served to remove one of our "too low" categories.

For the Poisson model, we could have had just three categories by combining the last two. Then, the third category would have been all values 2 and greater, and this category would have had an expected count of 13.07. Our rule would have been satisfied. On your own, try re-doing that test after combining these categories.
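A minimal Python sketch of that exercise (same estimated λ̂, three categories):

Python Sketch

# Re-doing the Poisson test with categories 0, 1, and "2 and up".
from scipy.stats import poisson, chi2

n, lam_hat = 85, 0.6941
observed = [43, 27, 13 + 2]                # last two categories combined
p0, p1 = poisson.pmf(0, lam_hat), poisson.pmf(1, lam_hat)
expected = [n * p0, n * p1, n * (1 - p0 - p1)]

T = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(T, chi2.sf(T, 3 - 1 - 1))            # all expected counts now exceed five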

To summarize, in most cases you should be able to create a categorization that meets our rule of having expected counts “not too low” by combining categories if necessary.

2.12 Likelihood Ratio Tests

Another set of very common testing methodologies are called likelihood ratio tests, or LRTs. We will hold off more thorough discussion of these types of tests until later in the course.

2.13 Tests for Individual Parameters

A very general class of tests that are often conducted are tests about a single parameter of a distribution. Although there are many ways to conduct such tests, and these ways can be very different depending on the distribution in question, we will only discuss a general method that works well across many different situations. In particular, it is an approximation to more exact methods, and the approximation is quite good if the sample size is large.

Again, the advantage of this approximate technique is that one basic concept can be applied to almost any situation we come across. Otherwise, we would have to develop a special methodology for every distribution and parameter for which we wanted to conduct tests. The analogy is our discussion of confidence intervals for parameters; we only mentioned an approximate method that works well across many situations if the sample size is large.

To set up the notation, we have a random sample X_1, ..., X_n with distribution f_X(x) with unknown parameter θ. The distribution f can be either discrete or continuous. We'll use the general notation f to represent the probability distribution even for discrete random variables (we had used p_X previously) to keep the notation simple.

Our hypotheses have to do with the unknown θ, and will take one of the following three forms (along with related terminology):

H0: θ = θ_0          H0: θ = θ_0           H0: θ = θ_0
Ha: θ < θ_0          Ha: θ > θ_0           Ha: θ ≠ θ_0
Left-tailed test     Right-tailed test     Two-tailed test

where θ0 is a constant that represents some hypothesized value for θ.

The key thing is the alternative hypothesis. Our research question should be phrased along the lines of one of the three alternative hypotheses above. Are we trying to prove that the parameter is less than the specified constant, greater than it, or more generally, not equal to it? The choice of which alternative hypothesis type to use leads to us conducting what is called either a left-tailed, right-tailed, or two-tailed test as the table indicates.

The test statistic used to conduct such a test relies on the properties of maximum likelihood estimators that we have discussed. Namely, if θ̂ is the MLE of θ and we have a large sample size, we can say that θ̂ approximately has a normal distribution with mean θ (i.e., it is an approximately unbiased estimator), and standard deviation based on the typical calculation for that estimator.

We then conduct the test by using the following test statistic:

Z = (θ̂ − θ) / σ_θ̂

where any θ's in the formula are replaced by the hypothesized value θ_0 in the null hypothesis, and any other unknown parameters are replaced by their maximum likelihood estimates.

We refer to this statistic as Z for a good reason. The distribution it has if the null hypothesis is true is a standard normal distribution. We’ll use this fact to look up the p-value for such a test.

With regard to the p-value calculation, these tests require us to put a little more thought into the definition of p-value. In particular, the part of the definition that refers to "more extreme" needs to be considered further. We mentioned earlier that these words will sometimes have a different interpretation for different tests.

If the test is a left-tailed test, then “more extreme” means “even further to the left” or “even lower.” This is because if the alternative hypothesis is that the parameter is less than a certain value, our evidence is only strongly against the null if the test statistic is quite low, or in the left tail of the test statistic’s distribution.

In this case, large values of the test statistic wouldn’t be considered evidence against H0 , and in fact could only be considered as evidence favoring H0 . So large values of the test statistic shouldn’t be included when measuring the p-value in this case.

The inverse argument can be made if the alternative hypothesis is that the parameter is greater than the specified value, and so won’t be elaborated on further.

However, let's discuss the two-tailed test situation in which the alternative hypothesis is that the parameter is not equal to the specified value. Here, both extremely small and large values of the test statistic would be considered as evidence against H0. So, our p-value calculation should take extreme values in both tails into account. We'll see the result of this more clearly in the example below.

Let's look at an example of testing for the value of a binomial proportion. Suppose X ∼ Binomial(n, p). We can relate this to a study where we observe n random individuals and for each record whether or not they have a certain genetic disease. Here p is the percentage of the population with the disease, and we have a research hypothesis regarding the parameter p. We would like to see if our data provides significant evidence to make the claim that p > .04. This may have some significance that will have an important impact on later research, or maybe is important in and of itself. We won't believe this hypothesis unless we have strong evidence to believe it, and so it should be our alternative hypothesis. So our test is

H0 : p = .04

Ha : p > .04

This falls into the class of tests we are discussing; it is a test regarding a single parameter of the probability model, and in fact is a right-tailed test.

Assuming that we have a large sample, the test procedure we discussed says that the test statistic we should use is

Z = (p̂ − p) / σ_p̂

where unknown p's in the formula are replaced by the value p_0, which is .04 in this example (the hypothesized value). If we recall our initial discussion of estimation for the binomial distribution, we had calculated

σ_p̂ = √( p(1 − p)/n )

So our test statistic becomes

Z = (p̂ − p) / √( p(1 − p)/n ) = (p̂ − .04) / √( .04(1 − .04)/n )

Now, suppose we collect this data, and find that 7 in a sample of 136 have the disease. The value for p̂ would be 7/136 = .0515, and we can continue the above calculation

Z = (p̂ − .04) / √( .04(1 − .04)/136 ) = (.0515 − .04) / √( .04(1 − .04)/136 ) = 0.6844

This is the calculated value of our test statistic. To find the p-value, we must calculate the area pictured below:

[Figure: the standard normal density f(z), with the area to the right of Z = .6844 shaded; this tail area is the p-value.]

We consider only the values further to the right of the calculated test statistic as more extreme since this was a right-tailed test. The calculation is p-value = P(Z > .6844) = 1 − F_Z(.6844) = 0.2469. In Excel, we would compute

Excel Formula

1-NORMSDIST(.6844)=.2469

So, the p-value is somewhat large, leading us to not reject the null hypothesis. Even though the sample proportion was larger than the hypothesized value of .04, we do not have enough evidence to make the strong claim that the true proportion with the disease is greater than .04. The result is telling us that there is too large a chance that the true p is .04 or lower, and our data suggesting p is larger than .04 was an accidental result of random sampling. Note that it may not have been an accidental result, but we just don't have enough evidence to say otherwise at this point.

If the test above happened to have been two-tailed, with Ha: p ≠ .04, everything would be the same except the p-value calculation. As mentioned above, we would need to consider values in both directions as extreme, and so the area we would need to calculate would be

[Figure: the standard normal density f(z), with the areas to the right of Z = .6844 and to the left of Z = −.6844 both shaded; together these tail areas make up the two-sided p-value.]

The area includes values both to the right of +.6844 and to the left of −.6844. To make this a little easier, we should notice that the Z-curve is symmetric around 0, and so these two separate areas are exactly the same. So, we really just need to calculate one of the areas (which we have already done) and double it to get the total p-value. The calculation is p-value = 2P(Z > .6844) = 0.4937. So, if this had been a two-tailed test, the evidence against H0 would have been even weaker.
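The entire procedure is only a few lines in code. Here is a minimal Python sketch of this approximate Z-test for a proportion (assuming scipy for the normal tail probability), computing both the right-tailed and the two-tailed p-values:

Python Sketch

# Approximate (large-sample) Z-test for a binomial proportion.
from math import sqrt
from scipy.stats import norm

n, x, p0 = 136, 7, 0.04                    # sample size, disease count, hypothesized p
p_hat = x / n                              # 0.0515
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n) # about 0.6844

p_right = norm.sf(z)                       # right-tailed: P(Z > z), about 0.2469
p_two = 2 * norm.sf(abs(z))                # two-tailed: about 0.4937
print(z, p_right, p_two)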

Chapter 3

Genetic Frequencies and Hardy-Weinberg Equilibrium

A very common and important problem in statistical genetics is the estimation of genotype and/or allele frequencies, as well as phenotype frequencies. The applications of these estimation problems are wide-ranging, from locating disease genes to forensic DNA testing.

3.1 Notation

Let us begin by setting up the notation we will use when discussing genetic frequencies. First, consider a single gene which we'll refer to as A. We will refer to the number of different allelic forms that exist in the population by the symbol k. We refer to these different allelic forms by the notation A_1, ..., A_k (as seen in Chapter 2). For example, k = 3 for the ABO blood group gene. The general notation A_1, A_2, and A_3 refer to the more specific allele references A, B, and O for this gene.

Now, for our generic gene A, consider a population of N individuals. Since each individual has two alleles for the gene (unless it is a sex-linked gene), there are 2N total alleles to account for in this population. In theory, we could determine how many A_1 alleles existed, how many A_2 alleles existed, and so on for all k allelic forms. By taking these allele counts and dividing by 2N for each, we would arrive at the population allele frequencies. We will refer to them as P_{A_1}, P_{A_2}, ..., P_{A_k}. Of course, it must be that

∑_{i=1}^{k} P_{A_i} = 1.

It is important to realize that these population allele frequencies are population parameters. Because it is not realistic to think that we count alleles across an entire population, they are unknown, but real, quantities. One problem we would like to solve, for any particular gene, is to determine good estimates of these population allele frequencies.


If we now concentrate on genotypes instead of alleles, we can introduce the concept of population genotype frequencies. For our gene A with k possible allelic forms, there are k(k + 1)/2 possible different genotypes in the population. By the same argument as above, in theory, for the N individuals in the population, we could determine how many A_1A_1 genotypes there were, how many A_1A_2 there were, and so on for all the genotypes. Dividing these counts by N would give us the population genotype frequencies. We will use the generic notation P_{A_iA_j} to refer to the population genotype frequency for genotype A_iA_j, where 1 ≤ i ≤ j ≤ k. Of course, these frequencies must follow the constraint of summing to one:

∑_{i=1}^{k} ∑_{j=i}^{k} P_{A_iA_j} = 1.

As with population allele frequencies, we consider population genotype frequencies as unknown parameters because of the enormous effort it would take to determine them exactly.

Also, notice that population allele frequencies can be calculated if given the population genotype frequencies. For any allele i, its population frequency can be written as

P_{A_i} = P_{A_iA_i} + (1/2) ∑_{j≠i} P_{A_iA_j}

Finally, we can follow the same logic to introduce population phenotype frequencies, or any other population-based frequencies for that matter. The standard notation we will use throughout for such population frequencies is the symbol P subscripted by a description of the category of individual the frequency represents.

An example of notation for population phenotype frequencies would be P_A, P_B, P_AB, and P_O to represent the population frequencies for the ABO blood group locus.

Keep in mind that the term population need not be equated with “all people” or similar statements for other species. We may be studying, for example, the population of all African-Americans living in the southeastern United States, or all men with heart disease, or all tomato plants not resistant to a certain insect.

In the discussion above, we have alluded to the fact that one problem we will need to solve is to find reasonable methods to estimate these unknown population frequencies, based only on a (usually small) sample of the population. Generally, we will use techniques of maximum likelihood estimation, as discussed in Chapter 2. There are other issues we will want to investigate regarding such frequencies. In particular, for allele frequencies at a gene, we would like to investigate how they relate to each other and to the genotype frequencies for the gene. The most important such relationship is Hardy-Weinberg equilibrium.

3.2 Hardy-Weinberg Equilibrium

There is often an important relationship between genotype and allele frequencies that is present in populations. This relationship is called Hardy-Weinberg Equilibrium or HWE. First, we will discuss the genetic assumptions that lead a population to be in HWE, then discuss what it means from a mathematical perspective.

Consider any population of individuals. It is said that the population is in Hardy-Weinberg equilibrium if it satisfies the following seven assumptions (Lange, 2003):

1. The population size is infinite;

2. The generations are discrete (all individuals in a generation are born at the same time);

3. Matings between individuals occur completely at random;

4. There are no selection pressures on the population;

5. There is no migration into or out of the population by any individuals;

6. No genetic mutations occur;

7. The two sexes have equal genotype frequencies to start.

Taken together, we will often refer to this collection of assumptions as the Hardy-Weinberg Assumptions, or sometimes as just random mating. We will not discuss the details underlying each of these assumptions as they are each chapter-worthy topics in their own right. There are many helpful texts that can be used to study these further. Even without an in-depth understanding of these assumptions, it should be clear that they can never be completely true for any population (the first assumption of infinite population size is enough of a problem itself). However, in the discussion that follows, we will consider the assumptions to be close enough to true that the implications to be discussed will still be valid.

If the assumptions are true (or nearly true), a special relationship between genotype frequencies and allele frequencies at the locus will be true. Suppose we have a locus with k alleles A_1, ..., A_k having population frequencies P_{A_1}, ..., P_{A_k}. The k(k+1)/2 possible genotypes have population frequencies given by P_{A_iA_j}, i ≤ j.

The HWE law that relates the genotype frequencies to allele frequencies is as follows:

P_{A_iA_i} = P_{A_i}²,  i = 1, ..., k  (for homozygotes)
P_{A_iA_j} = 2 P_{A_i} P_{A_j},  j > i  (for heterozygotes)        (3.1)

That is, any homozygote genotype frequency is the square of the relevant allele frequency, and a heterozygote frequency is twice the product of the relevant allele frequencies. If the genotype and allele frequencies at a locus have this relationship, we will say that the locus is in Hardy-Weinberg Equilibrium, or that the genotype and allele frequencies are in HWE. The frequencies themselves will be called Hardy-Weinberg proportions.

The Hardy-Weinberg proportions, as seen in Equation (3.1), are actually a set of k(k + 1)/2 equations (do not be fooled by the fact that there are just two lines in the equation). If there are only two alleles at the locus (k = 2), then HWE reduces to the following three explicit equations:

P_{A_1A_1} = P_{A_1}²
P_{A_1A_2} = 2 P_{A_1} P_{A_2}        (3.2)
P_{A_2A_2} = P_{A_2}²

3.3 Derivation of Hardy-Weinberg Proportions

It is instructive, albeit lengthy, to see how this law comes about if we truly have random mating occurring in a population. Let's do it for a locus with two alleles which we'll call A and a for ease of notation. Let's start with a population that has genotype frequencies in the first generation as P_AA^(0), P_Aa^(0), and P_aa^(0). The (0) superscript makes it clear that these are the starting frequencies. Notice that, by definition, the starting allele frequencies are P_A^(0) = P_AA^(0) + (1/2)P_Aa^(0) and P_a^(0) = P_aa^(0) + (1/2)P_Aa^(0).

In constructing the next generation (i.e., generation (1)), we consider that random mating occurs in this population. That is, individuals mate with each other regardless of their genotypes, and all the other as- sumptions of HWE are also true. We can envision the result using a couple of tables. The first gives the possible matings and their probabilities.

        AA               Aa                       aa
AA      (P_AA^(0))²      2 P_AA^(0) P_Aa^(0)      2 P_AA^(0) P_aa^(0)
Aa                       (P_Aa^(0))²              2 P_Aa^(0) P_aa^(0)
aa                                                (P_aa^(0))²

Since the matings are symmetric, the lower left side is left blank, and combined with the relevant upper right side cells using the 2 multiplier. Now, the relevant offspring proportions are given by:

        AA          Aa                              aa
AA      all AA      (1/2) AA, (1/2) Aa              all Aa
Aa                  (1/4) AA, (1/2) Aa, (1/4) aa    (1/2) Aa, (1/2) aa
aa                                                  all aa

Let's use these to calculate the expected proportion of AA genotypes in the next generation. We'll call this value P_AA^(1). We will calculate this value using a combination of the two tables above. For example, since all matings of the type AA × AA will result in all offspring having genotype AA (from the second table), and this mating happens with proportion (P_AA^(0))² of all matings (from the first table), the first term below will be (P_AA^(0))² × 1. Other terms will come by looking for the other situations where AA offspring will occur, in what proportions, and from what mating. Putting all of it together, we will get

P_AA^(1) = (P_AA^(0))² × 1 + 2 P_AA^(0) P_Aa^(0) × (1/2) + (P_Aa^(0))² × (1/4)
         = (P_AA^(0))² + P_AA^(0) P_Aa^(0) + (1/4)(P_Aa^(0))²

A similar argument to calculate P_Aa^(1) gives us

P_Aa^(1) = P_AA^(0) P_Aa^(0) + 2 P_AA^(0) P_aa^(0) + (1/2)(P_Aa^(0))² + P_Aa^(0) P_aa^(0).

We could also calculate P_aa^(1), but for now let's focus on the above two quantities. From these, we can directly calculate the frequency of allele A in this new generation as

P_A^(1) = P_AA^(1) + (1/2) P_Aa^(1)
        = [(P_AA^(0))² + P_AA^(0) P_Aa^(0) + (1/4)(P_Aa^(0))²]
          + (1/2)[P_AA^(0) P_Aa^(0) + 2 P_AA^(0) P_aa^(0) + (1/2)(P_Aa^(0))² + P_Aa^(0) P_aa^(0)]
        = (P_A^(0))² + (1/2) · 2 P_A^(0) P_a^(0)
        = (P_A^(0))² + P_A^(0) P_a^(0)
        = P_A^(0) (P_A^(0) + P_a^(0))
        = P_A^(0)

So, after a generation of random mating, the expected allele frequency for allele A has not changed. Of course, since they add to 1, it must also be true that the frequency of allele a has not changed either, and so P_a^(1) = P_a^(0).

This is interesting, but does not in itself show that HWE will hold in generation (1). We now need to revisit the formulas for P_AA^(1) and P_Aa^(1). First, from above we saw that

P_AA^(1) = (P_AA^(0))² + P_AA^(0) P_Aa^(0) + (1/4)(P_Aa^(0))².

We notice that the right side can be written as (P_A^(0))², which in turn we just found to be equal to (P_A^(1))². So, combining these, we now have

P_AA^(1) = (P_A^(1))²

which shows that HWE holds at least for the AA homozygote frequency. But using the same method shows that HWE holds for all three genotypes. For example, for the heterozygote genotype

P_Aa^(1) = P_AA^(0) P_Aa^(0) + 2 P_AA^(0) P_aa^(0) + (1/2)(P_Aa^(0))² + P_Aa^(0) P_aa^(0)
         = 2 P_A^(0) P_a^(0)
         = 2 P_A^(1) P_a^(1).

So the heterozygote genotype probability is twice the product of the two allele frequencies. Finally, the correct HWE relationship for the other homozygote holds as well: P_aa^(1) = (P_a^(1))².

Together, all of this shows that regardless of the relationship between genotypic and allele frequencies for a locus at a particular point in time (remember, we started our derivation with generic frequencies), once a generation of random mating occurs and all the other HWE assumptions are true, the genotype and allele frequencies will be in HWE in that next generation.
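A quick numerical check of this derivation is easy to run. The minimal Python sketch below starts from arbitrary non-HWE genotype frequencies (the particular values 0.5, 0.2, 0.3 are just an illustration) and applies the one-generation formulas derived above:

Python Sketch

# Verify that one generation of random mating produces HW proportions.
p_AA, p_Aa, p_aa = 0.5, 0.2, 0.3           # arbitrary starting genotype frequencies

p_A = p_AA + 0.5 * p_Aa                    # starting allele frequencies
p_a = p_aa + 0.5 * p_Aa

# Next-generation genotype frequencies from the mating-table derivation
p_AA1 = p_AA**2 + p_AA * p_Aa + 0.25 * p_Aa**2
p_Aa1 = p_AA * p_Aa + 2 * p_AA * p_aa + 0.5 * p_Aa**2 + p_Aa * p_aa

print(p_AA1, p_A**2)                       # equal: homozygote frequency is P_A^2
print(p_Aa1, 2 * p_A * p_a)                # equal: heterozygote frequency is 2 P_A P_a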

Of course, exact patterns of random mating never really occur in real populations, and additionally all of the other HWE assumptions will never be exactly true. But, it has been found that together they are close enough to true (or possibly display small departures that tend to cancel each other) that many loci are found to be at least nearly in Hardy-Weinberg equilibrium. Our main concern will be to determine if the genotype and allele frequencies are in HWE. We will not worry much about whether all of the genetic assumptions leading to HWE are exactly true or not.

Some problems we will investigate further later in the chapter are to estimate allele frequencies if we can assume that the locus is in HWE, and to determine whether or not a locus can be said to be in HWE.

3.4 Maximum Likelihood Estimation of Frequencies

We will now discuss genetic frequencies from a statistical point of view. We want to estimate such population frequencies from collected data using maximum likelihood methods. Statistically, the setup of the problem is fairly straightforward. For example, if we have a random sample of genotypes for a population, we can regard this as sampling from a multinomial distribution since each individual has one of a certain number of possible genotypes, and each individual is independent in the random sample. The same argument applies for a sample of alleles. So, we will first begin with a more detailed look at the multinomial distribution and resulting maximum likelihood estimates.

Recall the setup for a multinomial distribution. We have (X_1, ..., X_k) ∼ Multinomial(n, p_1, ..., p_k), where ∑_i X_i = n and ∑_i p_i = 1, and

p_{X_1,...,X_k}(x_1, ..., x_k) = (n!/∏_i x_i!) p_1^{x_1} ··· p_k^{x_k}.

The parameters p_1, ..., p_k are unknown, and are to be estimated. Actually, we need to be a little careful and realize that there are actually only k − 1 parameters, since we can write p_k as a function of the others: namely, p_k = 1 − p_1 − ··· − p_{k−1}.

Let's find the maximum likelihood estimators for p_1, ..., p_{k−1}. The situation is a little more involved from a calculus standpoint than we had come across before. Since there are k − 1 parameters to estimate, the likelihood function is a function of k − 1 variables, not just 1, and so there are k − 1 variables that we have to take derivatives with respect to. We'll start by writing the joint pmf in the notation of a likelihood function:

L(p_1, ..., p_{k−1}) = (n!/∏_i x_i!) p_1^{x_1} ··· p_{k−1}^{x_{k−1}} (1 − p_1 − ··· − p_{k−1})^{x_k}.

Notice that we have written the pk term in terms of the other p’s. Now the log-likelihood is:

ln L(p_1, ..., p_{k−1}) = ln n! − ln(∏_i x_i!) + x_1 ln p_1 + ··· + x_{k−1} ln p_{k−1} + x_k ln(1 − p_1 − ··· − p_{k−1}).

Now we can start taking derivatives:

∂ln L/∂p_1 = x_1/p_1 − x_k/(1 − p_1 − ··· − p_{k−1}) = 0
   ⋮
∂ln L/∂p_{k−1} = x_{k−1}/p_{k−1} − x_k/(1 − p_1 − ··· − p_{k−1}) = 0

Solving these k − 1 equations requires a little algebra, but it is much easier to use the Lagrange multiplier technique to maximize this likelihood instead of the standard method above. This technique can be used when there is a constraint on the parameters. In this case, the constraint is that 1 − ∑_i p_i = 0. This constraint is multiplied by λ and added to the log-likelihood before taking derivatives. This also allows us to keep the last parameter p_k in the equation as is instead of rewriting it. With this in place, the log-likelihood now becomes:

ln L(p_1, ..., p_k, λ) = C + ∑_{i=1}^{k} x_i ln p_i + λ(1 − ∑_i p_i).

Differentiating with respect to each p_i and setting equal to 0 gives:

∂ln L/∂p_1 = x_1/p_1 − λ = 0
   ⋮
∂ln L/∂p_k = x_k/p_k − λ = 0

To solve these k equations, we can add them up to get

∑_{i=1}^{k} x_i = λ ∑_{i=1}^{k} p_i.

The left side above is n and the right side is λ. So we have λ = n, which we can plug back into each of the k equations. Doing this for each equation gives the estimates:

p̂_1 = x_1/n
   ⋮
p̂_k = x_k/n

So, the maximum likelihood estimates for the probability parameters are just the sample proportions of individuals that fell into that category, which makes logical sense.

Let's take a closer look at any particular such estimate p̂_i. Its mean is

μ_{p̂_i} = (1/n) μ_{x_i} = (1/n)(n p_i) = p_i,

using our knowledge of the multinomial distribution means. This shows that it is an unbiased estimator for p_i. Its variance can be calculated as:

σ²_{p̂_i} = (1/n²) σ²_{x_i} = (1/n²)[n p_i(1 − p_i)] = p_i(1 − p_i)/n.

So we can calculate individual approximate 95% confidence intervals for p_i by the formula:

p̂_i ± 2 × √( p̂_i(1 − p̂_i)/n )

3.4.1 Multinomial Genetic Data

Now let’s look at the problem specifically from a genetic point of view. Suppose a locus with two alleles A and a is in HWE. We will collect genotypic data and estimate the two allele frequencies PA and Pa. Since the locus is in HWE, we can write the genotype frequencies in terms of the allele frequencies as

P_AA = P_A²
P_Aa = 2 P_A P_a = 2 P_A (1 − P_A)
P_aa = P_a² = (1 − P_A)²

where we have written the frequency P_a as 1 − P_A since they must add to 1.

Now, we have a multinomial model with three categories (i.e., the three genotypes), but only one unknown parameter, namely PA. Let’s use the MLE method to find the maximum likelihood estimator for PA.

Our model is (X_AA, X_Aa, X_aa) ∼ Multinomial(n, P_A², 2P_A(1 − P_A), (1 − P_A)²). This is a good example of a situation where the model parameters are related to each other. Here, all three multinomial probabilities can be written in terms of just the one parameter P_A. This has no bearing on our ultimate goal in maximum likelihood estimation, which is to write down the log-likelihood of the parameter, take derivatives, and solve.

Now, since this is a multinomial model, the likelihood function for P_A (or in other words, the joint pmf for X_AA, X_Aa, and X_aa) is

L(P_A) = (n!/(x_AA! x_Aa! x_aa!)) (P_A²)^{x_AA} [2P_A(1 − P_A)]^{x_Aa} [(1 − P_A)²]^{x_aa}

and the log-likelihood is

ln L(P_A) = C + 2x_AA ln P_A + x_Aa ln[2P_A(1 − P_A)] + 2x_aa ln(1 − P_A)
          = C + (2x_AA + x_Aa) ln P_A + (x_Aa + 2x_aa) ln(1 − P_A).

Taking the derivative of this with respect to PA gives:

∂ln L/∂P_A = (2x_AA + x_Aa)/P_A − (x_Aa + 2x_aa)/(1 − P_A)

Setting this equal to 0 and solving for P_A gives us the MLE

P̂_A = (1/(2n))(2x_AA + x_Aa)

which just counts up the number of A alleles in the sample and divides by the total number of alleles 2n.

Example 3.1: Genotypic data: Suppose we have genotypic data available at a locus with two alleles A and a. The three possible genotypes are, of course, AA, Aa, and aa, to which we will assign unknown population probabilities pAA, pAa, and paa. With a random sample of size n, we will let XAA,XAa, and Xaa represent the observed counts for the three genotypes. Because we are randomly sampling individuals, our probability model is a multinomial distribution:

(X_AA, X_Aa, X_aa) ∼ Multinomial(n, p_AA, p_Aa, p_aa)

Suppose we collect data from a random sample of 104 individuals, resulting in 32 AA genotypes, 60 Aa genotypes, and 12 aa genotypes. Our genotype frequency estimates will be:

p̂_AA = 32/104 = .3077
p̂_Aa = 60/104 = .5769
p̂_aa = 12/104 = .1154


A 95% confidence interval estimate for pAA is:

p̂_AA ± 2 × √( p̂_AA(1 − p̂_AA)/n )
= .3077 ± 2 × √( .3077(1 − .3077)/104 )
= .3077 ± .0905
= (.2172, .3982)

Similar CI estimates can be calculated for the other genotypic frequencies. Notice that this is a somewhat large range in our CI estimate. This can be decreased to come up with a more precise estimate by increasing the sample size (which makes the denominator on the right side larger, and so the entire right side smaller).



Notice that if we do have genotypic data, we can use the same data and same procedure to estimate allele frequencies as well as genotype frequencies. Our random sample of n individuals can also be viewed as a random sample of 2n alleles. So the allele counts will have a multinomial distribution with sample size 2n. We can write our probability model as:

(X_A, X_a) ∼ Multinomial(2n, p_A, p_a).

Of course, with only two alleles at the locus, this really reduces to being a Binomial distribution, but we can think of it either way since the Binomial is just a special case of the Multinomial. So our maximum likelihood estimates of the unknown allele frequencies p_A and p_a are:

p̂_A = X_A/(2n)  and  p̂_a = X_a/(2n)

Example 3.2: Genotypic data (cont.): If we continue the example above, we notice that there were a total of 2X_AA + X_Aa = 2(32) + 60 = 124 A alleles in the sample. This is X_A. So our estimate of p_A is p̂_A = 124/(2 × 104) = .5962. Similarly, X_a = 2(12) + 60 = 84 and so p̂_a = 84/(2 × 104) = .4038. An approximate 95% confidence interval for p_A would be

p̂_A ± 2 × √( p̂_A(1 − p̂_A)/(2n) )
= .5962 ± 2 × √( .5962(1 − .5962)/208 )
= .5962 ± .0680
= (.5282, .6642)
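The gene counting estimates and confidence intervals of Examples 3.1 and 3.2 are easy to reproduce in code; a minimal Python sketch:

Python Sketch

# Gene-counting estimates and approximate 95% CIs from genotype counts.
from math import sqrt

x_AA, x_Aa, x_aa = 32, 60, 12
n = x_AA + x_Aa + x_aa                     # 104 individuals, 2n = 208 alleles

p_AA = x_AA / n                            # genotype frequency estimate, .3077
hw_geno = 2 * sqrt(p_AA * (1 - p_AA) / n)
print(p_AA, (p_AA - hw_geno, p_AA + hw_geno))    # about (.2172, .3982)

p_A = (2 * x_AA + x_Aa) / (2 * n)          # allele frequency estimate, .5962
hw_allele = 2 * sqrt(p_A * (1 - p_A) / (2 * n))
print(p_A, (p_A - hw_allele, p_A + hw_allele))   # about (.5282, .6642)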



The method for computing allele frequency estimates in the above example is often called gene counting since all we have done is count up the number of each allele in the sample of genotype data to come up with our estimate. However, if it is not possible to have genotypic data, then it might be impossible to use this method to estimate allele frequencies. Often only phenotypic data is available for a gene. For example, if the allele A is dominant to a and only phenotypic data was collected (for the two possible phenotypes), then we wouldn't know how many of the dominant phenotype actually had genotype AA and how many had Aa, and so couldn't do this gene counting, and so couldn't directly use the multinomial model to make frequency estimates. We will soon discuss another method for doing this.

3.4.2 Non-HWE Locus MLEs

What gets a little more interesting is when we consider a two-allele locus which is not in HWE. A common way to model the genotype frequencies that allows us to measure the amount of departure from HWE is

P_AA = P_A² + P_A(1 − P_A)f
P_Aa = 2P_A(1 − P_A)(1 − f)
P_aa = (1 − P_A)² + P_A(1 − P_A)f.

The f parameter is a measure of how far from HWE the locus is. Notice that if f = 0, the proportions above are just the HW proportions. Any other value for f gives genotype frequencies that are different from the frequencies that would be expected under HWE.

Now in our multinomial model, we again have two unknown parameters, PA and f. Our model is

(X_AA, X_Aa, X_aa) ∼ Multinomial( n, P_A² + P_A(1 − P_A)f, 2P_A(1 − P_A)(1 − f), (1 − P_A)² + P_A(1 − P_A)f )

and now our likelihood and log-likelihood functions are

L(P_A, f) = (n!/(x_AA! x_Aa! x_aa!)) [P_A² + P_A(1 − P_A)f]^{x_AA} [2P_A(1 − P_A)(1 − f)]^{x_Aa} [(1 − P_A)² + P_A(1 − P_A)f]^{x_aa}

and the log-likelihood is, after some simplification,

ln L(P_A, f) = C + (x_AA + x_Aa) ln P_A + x_AA ln[P_A + (1 − P_A)f]
             + (x_Aa + x_aa) ln(1 − P_A) + x_Aa ln(1 − f)
             + x_aa ln[(1 − P_A) + P_A f].

∂ln L/∂P_A = (x_AA + x_Aa)/P_A − (x_Aa + x_aa)/(1 − P_A) + x_AA(1 − f)/[P_A + (1 − P_A)f] − x_aa(1 − f)/[(1 − P_A) + P_A f]

∂ln L/∂f = x_AA(1 − P_A)/[P_A + (1 − P_A)f] − x_Aa/(1 − f) + x_aa P_A/[(1 − P_A) + P_A f]

Now we would need to set these equal to zero and solve this system of two equations for the two unknowns

P_A and f. Some algebraic manipulation, preferably using a computer package, finds the solution

f̂ = (4 x_AA x_aa − x_Aa²) / [(2x_AA + x_Aa)(x_Aa + 2x_aa)]

P̂_A = (1/(2n))(2x_AA + x_Aa)

We see that the estimator for P_A (and therefore also the related estimator for P_a) is just the gene counting estimate from before. The estimator for f, however, is a little more difficult to understand naturally. A little bit of understanding can be gleaned from the numerator: if there are too many heterozygotes compared to what is expected under HWE (i.e., if the genotypic counts are such that x_Aa² > 4 x_AA x_aa), then the estimate of f will be negative; otherwise it will be positive. The denominator is just the product of allele counts.

Computing formulas for the of these estimators is a little more involved than previous variances we have calculated. This is because the formulas for the estimators themselves are much more complicated (particularly the formula for fˆ), involving products and ratios of random variables which lead to calculation difficulties.

However, there are methods for deriving an approximate formula for the variance of such estimators. One in particular that works well is called Fisher’s Method or sometimes the Delta Method. It is approximate in that it leaves out terms divided by n2 or higher powers of n. See Weir (1996) for more details. Here, for now at least, we will just write out these complicated variances without worrying about their derivation. The result is (again, see the Weir reference for details):

σ²_f̂ = (1/n)(1 − f)²(1 − 2f) + f(1 − f)(2 − f)/[2n P_A(1 − P_A)]

and

σ²_P̂_A = (1/(2n))(1 + f) P_A(1 − P_A)

Of course, these formulas involve the unknown parameters f and PA, so as before, we would substitute those with their estimates to make the final calculation.

Note that the variance of P̂_A is slightly more involved than we might have expected (and more involved than we have seen before) because the multinomial probability model for the genotypes itself is more involved. In the case where f = 0 (i.e., HWE holds), then it reduces to the same variance formula we have seen before.

Example 3.3: Another blood group gene in humans is the MN blood group. See http://www.mun.ca/biology/scarr/MN_bloodgroup.htm for more details and differences among various human populations. This blood group is helpful to study because there are two alleles (labeled M and N) and they are codominant, meaning that observing only phenotypes still allows us to know the genotypes exactly. In other words, a person's MN-blood type is either M, N, or MN, which correspond to genotypes MM, NN, and MN.

Let’s look at data presented by Crow (1986). In a sample of 208 from the human population of Bedouins, the following phenotype counts were found:

Phenotype   Genotype   Count
M           MM         119
MN          MN         76
N           NN         13

Let's use the multinomial model presented above that includes the parameter f, and estimate P_M, P_N, and f for this population. Plugging into the maximum likelihood estimators above, we get:

P̂_M = 0.755
P̂_N = 1 − P̂_M = 1 − 0.755 = 0.245

and

f̂ = [4(119)(13) − (76)²] / ([2(119) + 76][2(13) + 76]) = .0129.

The estimates of PM and PN are just the gene counting estimates. The estimate of f is quite close to 0, suggesting that the genotype frequencies are not too far from being in HW proportions. Being slightly positive suggests that there is a small excess of homozygotes compared to what would be expected.

Let’s also attach variances to these estimates and compute approximate confidence intervals. Using the variance formulas above, we have

σ²_P̂_M = (1/416)(1 + .0129)(.755)(.245) = .00045

giving an approximate 95% CI for P_M as

.755 ± 2√.00045 = .755 ± .042 = (.713, .797).

For f we have

σ²_f̂ = (1/208)(1 − .0129)²(1 − 2 × .0129) + .0129(1 − .0129)(2 − .0129)/[(416)(.755)(.245)] = .0049

with the approximate 95% CI

.0129 ± 2√.0049 = .0129 ± .1399 = (−0.127, 0.1528).
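The full set of calculations for this example can be scripted directly from the formulas above; a minimal Python sketch:

Python Sketch

# Non-HWE MLEs and approximate variances for the MN blood group (Bedouin) data.
from math import sqrt

x_MM, x_MN, x_NN = 119, 76, 13
n = x_MM + x_MN + x_NN                     # 208

P_M = (2 * x_MM + x_MN) / (2 * n)          # gene-counting estimate, about 0.755
f = (4 * x_MM * x_NN - x_MN**2) / ((2 * x_MM + x_MN) * (x_MN + 2 * x_NN))

var_P = (1 + f) * P_M * (1 - P_M) / (2 * n)
var_f = ((1 - f)**2 * (1 - 2 * f) / n
         + f * (1 - f) * (2 - f) / (2 * n * P_M * (1 - P_M)))

print(P_M, f)                              # about 0.755 and 0.0129
print((P_M - 2 * sqrt(var_P), P_M + 2 * sqrt(var_P)))   # CI for P_M
print((f - 2 * sqrt(var_f), f + 2 * sqrt(var_f)))       # CI for f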

3.4.3 The E-M Algorithm for Estimating Allele Frequencies

We can now return to our problem of estimating allele frequencies when we have phenotypic data and not genotypic data. We will consider the problem where there are two alleles A and a at a locus, and that A is dominant to a. We only have access to phenotypic data (such as disease status for a disease gene). This means that we will be able to count the total number of individuals xAA + xAa, which will be the number with the dominant phenotype, but not break it down further into those two genotypes. We will also know the count xaa which is the number with the recessive phenotype.

As seen before, simple gene counting techniques will not work because of the hidden genotypic data for the AA and Aa genotypes. We can think of this situation as being one where there is missing data. We don't know those two counts individually, but we do know their total. An algorithm called the E-M algorithm (for Expectation-Maximization) is suited to handle such situations. It is a very generic iterative technique, not at all specific to genetics, that can be used to find maximum likelihood estimates in this missing-data type of situation. We will only discuss the algorithm as it applies to our allele frequency estimation problem.

First, one assumption we must make to proceed is that the locus is in HWE. So Equation (3.2) will be assumed true (or Equation (3.1) for any number of alleles). We will see why we need this to be true below.

The basic concept of the E-M algorithm is as follows. Recall that our main goal here is to come up with an estimate for the unknown allele frequencies. Our ensuing example will be for two alleles, so we have just one unknown frequency which we have called P_A (the other being 1 − P_A).

1. Make a reasonable initial guess at the unknown frequency or frequencies.

2. Determine a reasonable way to improve this guess, based on the probability model and the HWE assumption.

3. Return to Step 2 over and over until the change in estimates from one step to the next is very small.

The mathematical concept described in Step 3 in the E-M algorithm is called convergence. Basically, this means that an iterative algorithm has finished doing its work when not much changes from one iteration to the next. The extreme case would be that there is precisely no change from one iteration to the next, in which case the algorithm will stay in that state forever (and so you definitely should stop). The definition of "very small change" depends on the situation. In the case of allele frequencies, a change on the order of 10⁻⁴ would certainly be considered quite small. This value is often called the convergence criterion.

Now let’s return to the algorithm itself and how it will work for our allele frequency estimation problem.

First, let's focus on estimating P_a by this algorithm, and our estimate of P_A will just be 1 − P_a. Our first step is to come up with an initial guess for an estimate of P_a. There are many ways to do this, and since the iteration will converge from any reasonable starting value, the particular choice is not critical. One quick choice is

P̂_a^(0) = (1/2) · [2x_aa + (x_AA + x_Aa)] / (2n)

Now, we need a way to improve this estimate. Notice that if we did have full genotypic data, we could write our improved estimate (which we’ll superscript with (1)), or any estimate for that matter, as

P̂_a^(1) = (1/(2n))(x_Aa + 2x_aa)        (3.3)

but we are stuck not knowing x_Aa. However, using our HWE assumption, we know that the expected proportion of the total count x_AA + x_Aa that can be attributed to the Aa genotype is

P̂_Aa = 2P̂_a^(0)(1 − P̂_a^(0)) / [ (1 − P̂_a^(0))² + 2P̂_a^(0)(1 − P̂_a^(0)) ]

which can be re-written as

P̂_Aa = 2P̂_a^(0)(1 − P̂_a^(0)) / [ 1 − (P̂_a^(0))² ]

since the denominator was the sum of the HWE proportions for the first two genotypes, which is just 1 minus the HWE proportion for the third genotype. The formula on the right side is based on the initial estimate we made for P_a. Therefore, our estimate of the total number of Aa genotypes is

x̂_Aa = (x_AA + x_Aa) · 2P̂_a^(0)(1 − P̂_a^(0)) / [ 1 − (P̂_a^(0))² ].

We have just proportioned out the count of those with the dominant phenotype into the two underlying genotypes according to HW proportions.

Now, let’s plug this back into Equation (3.3) as an estimate of xAa.

P̂_a^(1) = (1/(2n)) [ (x_AA + x_Aa) · 2P̂_a^(0)(1 − P̂_a^(0)) / (1 − (P̂_a^(0))²) + 2x_aa ].

This equation can now serve as the basis of our iterative method. After we use this to calculate P̂_a^(1), we can plug that new number into the right side to get P̂_a^(2), and so on. We can more clearly show the iterative nature of this method by re-writing the above equation as

P̂_a^(j+1) = (1/(2n)) [ (x_AA + x_Aa) · 2P̂_a^(j)(1 − P̂_a^(j)) / (1 − (P̂_a^(j))²) + 2x_aa ]

where P̂_a^(j) is the estimate of P_a at our j-th iteration, starting with the initial value where j = 0.

Now, as mentioned above, we will continue iterating until the change in our estimate from one iteration to the next is very small.

Example 3.4: Consider a two-allele locus where allele A is dominant to a. In a sample of 107 individuals, there are 85 of the dominant phenotype (this is x_AA + x_Aa) and 22 of the recessive phenotype (this is x_aa).

Our initial value calculation is

P̂_a^(0) = (1/2) · [2(22) + 85] / 214 = .3014.

The table below shows the results of a number of iterations. Convergence happens on the 12th iteration at

P̂_a = .45344. Check these calculations for yourself. The best way is by setting up an Excel spreadsheet (or a short script like the sketch after the table).

Iteration (j)   P̂_a^(j)   x̂_Aa
0               0.30140    39.37
1               0.38959    47.66
2               0.42832    50.98
3               0.44383    52.26
4               0.44980    52.74
5               0.45207    52.93
6               0.45292    52.99
7               0.45325    53.02
8               0.45337    53.03
9               0.45341    53.03
10              0.45343    53.04
11              0.45344    53.04
12              0.45344    53.04
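Instead of a spreadsheet, the iteration can also be written in a few lines of Python; this sketch implements the update formula above with a convergence criterion of 10⁻⁶:

Python Sketch

# E-M iteration for P_a when allele A is dominant (only phenotypes observed).
n_dom, x_aa = 85, 22                       # dominant-phenotype and aa counts
n = n_dom + x_aa                           # 107 individuals

p_a = 0.5 * (2 * x_aa + n_dom) / (2 * n)   # initial guess, .3014
while True:
    # E-step: expected number of Aa genotypes among the dominant phenotypes
    x_Aa = n_dom * 2 * p_a * (1 - p_a) / (1 - p_a**2)
    # M-step: re-estimate P_a by gene counting using the imputed count
    p_new = (x_Aa + 2 * x_aa) / (2 * n)
    converged = abs(p_new - p_a) < 1e-6
    p_a = p_new
    if converged:
        break

print(p_a)                                 # about 0.45344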



Example 3.5: ABO Blood Group: The ABO blood group provides a good example of using the E-M algorithm for a three-allele locus. In a homework problem, you will set the formulas up, but we’ll set up the notation here.

First, we must recall the details of the alleles and genotypes at this locus. There are three alleles, labeled A, B, and O. Allele A is dominant to O, as is B. But A and B are codominant. The following tables summarize this and set up a simpler notation to help work through this problem.

Allele   Population Frequency
A        P_A
B        P_B
O        P_O

Genotype   Phenotype   Count
AA, AO     A           x_A
BB, BO     B           x_B
AB         AB          x_AB
OO         O           x_O



Notice how important the HWE assumption is in this algorithm. The E-M algorithm could not be used here without it. If we are not sure about its validity, then we wouldn’t be able to estimate such allele frequencies with any degree of accuracy.

3.5 Test for Hardy-Weinberg Equilibrium

In this section, we will formally discuss the procedure used to test for HWE. The background and setup of this test in the case of two alleles at a locus was discussed in the Overview section. The hypotheses are

H0: P_AA = P_A², P_Aa = 2P_A(1 − P_A), P_aa = (1 − P_A)²
Ha: The above relationships do not hold

If we think carefully about this testing situation, we might realize that conducting this test is not really any different from the idea of the goodness-of-fit test of the preceding section. In other words, this test comes down to asking the question: "how well does our data fit the probability relationship described in the null hypothesis?" In fact, the section on conducting a goodness-of-fit test with a discrete distribution perfectly fits this situation.

More specifically, we can view the null hypothesis as

H0: (X_AA, X_Aa, X_aa) ∼ Multinomial(n, P_A², 2P_A(1 − P_A), (1 − P_A)²)

We can view our data as coming in three categories, namely the three genotypic counts. We also notice that this null hypothesis is not completely specified since P_A is not known. So we'll have to estimate it using the gene counting MLE if possible, or otherwise the E-M algorithm. We can then compare observed and expected counts as before.

Let’s look at the following MN blood group data from Hedrick (2000) based on a sample of 1000 individuals in England.

Genotype   Observed Prob.   Observed Count
MM         0.298            298
MN         0.489            489
NN         0.213            213

Our question is, is this locus in Hardy-Weinberg Equilibrium for the sampled population? The goodness-of-fit concept applied to this situation says that we should calculate the counts expected for each category as if the null hypothesis were true. To do that, we'll first need to estimate P_M (and we'll just take one minus this as our estimate of P_N). We can do that in this case using the simple gene counting estimate since the alleles are co-dominant and we have full genotypic information. The estimate is

P̂_M = (2 × 298 + 489) / (2 × 1000) = 0.5425.

So our three expected probabilities are

(P̂_M)² = (.5425)² = 0.2943
2P̂_M(1 − P̂_M) = 0.4964
(1 − P̂_M)² = 0.2093

which allow us to expand our table to include expected counts as follows:

Genotype   Observed Prob.   Observed Count   Expected Prob.   Expected Count
MM         0.298            298              0.2943           294.3
MN         0.489            489              0.4964           496.4
NN         0.213            213              0.2093           209.3

Our test statistic follows the same concepts as previously with regard to its calculation and distribution. Since we have k = 3 categories in this data, the calculation here is

T = ∑_{j=1}^{3} (O_j − E_j)²/E_j = (298 − 294.3)²/294.3 + (489 − 496.4)²/496.4 + (213 − 209.3)²/209.3 = 0.2215.

Under the null hypothesis, T will have a chi-squared distribution with 3 − 1 − 1 = 1 degree of freedom (since there are three categories, we always subtract one, and we estimated one parameter). So the p-value is (check on your own with Excel)

p-value = P(T > 0.2215) = 0.6379.

This is a large p-value, suggesting that the data appears to be very much in line with the null hypothesis. There is no evidence to reject the notion that this locus is in Hardy-Weinberg Equilibrium.
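This HWE test is compact in code as well; a minimal Python sketch (assuming scipy):

Python Sketch

# Chi-squared test for HWE with two codominant alleles (MN data, n = 1000).
from scipy.stats import chi2

x_MM, x_MN, x_NN = 298, 489, 213
n = x_MM + x_MN + x_NN

p_M = (2 * x_MM + x_MN) / (2 * n)          # gene-counting MLE, 0.5425
expected = [n * p_M**2, n * 2 * p_M * (1 - p_M), n * (1 - p_M)**2]

T = sum((o - e) ** 2 / e for o, e in zip([x_MM, x_MN, x_NN], expected))
print(T, chi2.sf(T, 3 - 1 - 1))            # about 0.22 and 0.64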

3.6 Fisher’s Exact Test

We will now discuss a different and more complex testing methodology that is often used for tests of HWE. This generic procedure is referred to as an exact test, and the specific version of it that we will use for HWE tests is called Fisher's exact test after the famous statistician and geneticist Sir Ronald Fisher.

The reason for considering a second method for conducting HWE tests goes back to our discussion regarding the validity of the chi-squared distribution. As we have mentioned, the test statistic we have been using has only approximately a chi-squared distribution, although that approximation is quite good if the sample is large (which is checked by looking at the expected counts). However, if the sample is not large, it is an inappropriate distribution to use for calculating the p-value.

We have also discussed a way to “fix things up” if the expected counts do not meet our requirements - namely, combining categories. However, this strategy is not a good one for HWE tests because it is not clear how to logically combine categories (i.e., genotypes), and the categories themselves are such an important part of the structure of Hardy-Weinberg proportions. Also, once we get into situations with more than two alleles, there can be a very large number of categories to deal with, which also makes it more likely that we will have a number of small expected counts.

Fisher's exact test is a method for calculating an accurate p-value for a test that doesn't rely on an approximate distribution. This is the meaning of the word "exact" in the name of the test; it calculates an exact p-value, not just an approximate p-value. It is often also referred to as a probability test, for reasons we will see.

3.6.1 Fisher’s Exact Test - Theory

A one sentence summary of the concept behind this procedure is: Given the allele counts that we observed, we will directly calculate the probability of our dataset having occurred, as well as any other datasets that could have occurred with the same allele counts.

Let’s see how this works in the case of two alleles at the locus. We have seen that the appropriate probability models for genotype counts and allele counts if the null hypothesis is true (i.e., HWE is valid) are

(X_{AA}, X_{Aa}, X_{aa}) \sim \text{Multinomial}\left(n, \; P_A^2, \; 2P_A(1-P_A), \; (1-P_A)^2\right)

and

(X_A, X_a) \sim \text{Multinomial}(2n, \; P_A, \; 1 - P_A)

respectively. And so the resulting joint pmfs are

P(X_{AA} = x_{AA}, X_{Aa} = x_{Aa}, X_{aa} = x_{aa}) = \frac{n!}{x_{AA}! \, x_{Aa}! \, x_{aa}!} (P_A^2)^{x_{AA}} [2P_A(1 - P_A)]^{x_{Aa}} [(1 - P_A)^2]^{x_{aa}}

and

P(X_A = x_A, X_a = x_a) = \frac{(2n)!}{x_A! \, x_a!} (P_A)^{x_A} (1 - P_A)^{x_a}

respectively.

Now, the conditional probability of observing genotype counts (XAA,XAa,Xaa) given that the allele counts were (XA,Xa) is

P(x_{AA}, x_{Aa}, x_{aa} \mid x_A, x_a) = \frac{P(x_{AA}, x_{Aa}, x_{aa})}{P(x_A, x_a)}

= \frac{\frac{n!}{x_{AA}! \, x_{Aa}! \, x_{aa}!} (P_A^2)^{x_{AA}} [2P_A(1 - P_A)]^{x_{Aa}} [(1 - P_A)^2]^{x_{aa}}}{\frac{(2n)!}{x_A! \, x_a!} (P_A)^{x_A} (1 - P_A)^{x_a}}     (3.4)

= \frac{n! \, x_A! \, x_a! \, 2^{x_{Aa}}}{(2n)! \, x_{AA}! \, x_{Aa}! \, x_{aa}!}

Algebraically, you can notice that a lot of cancellations happen from the 2nd to 3rd line in Equation 3.4 by recalling that x_A = 2x_{AA} + x_{Aa} and x_a = 2x_{aa} + x_{Aa}. Check this out for yourself.

So, Equation (3.4) is a conditional probability. It gives us the probability that we would observe genotype counts xAA, xAa, and xaa if it were known that the allele counts were xA and xa. Also note that the unknown allele frequencies have completely cancelled out of this formula. The probability calculation is only a function of observed counts. So, it is also, in fact, a potential test statistic. We will think of it and use it in both ways: as both a (conditional) probability calculation and as a test statistic.
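As a quick computational check, here is a small sketch of Equation (3.4) in plain Python (the function name cond_prob is just a label chosen here); Python's exact integer arithmetic handles the large factorials without trouble:

from math import factorial

def cond_prob(x_AA, x_Aa, x_aa):
    # Equation (3.4): P(genotype counts | allele counts) under HWE
    n = x_AA + x_Aa + x_aa
    x_A = 2 * x_AA + x_Aa
    x_a = 2 * x_aa + x_Aa
    return (factorial(n) * factorial(x_A) * factorial(x_a) * 2**x_Aa /
            (factorial(2 * n) * factorial(x_AA) * factorial(x_Aa) * factorial(x_aa)))

print(cond_prob(9, 8, 3))   # about 0.3057, the observed row in the example below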

3.6.2 Fisher’s Exact Test - Methodology

Now we need to discuss how we make use of Equation (3.4) in our testing methodology. Below is a summary of the steps. Later, we will make it clearer through an example.

1. List out all the possible sets of genotype counts (x*_AA, x*_Aa, x*_aa) that have the same allele counts (xA, xa) as observed in our dataset. Notice that one of these sets of genotype counts will be the same as the observed values (xAA, xAa, xaa).

2. For each of these possible genotype counts, apply Formula (3.4) to get the conditional probability that each would have occurred. If you have accounted for all possible sets of counts and made the calculations correctly, these probabilities will add to 1 (other than rounding error).

3. Sort this list from smallest probability to largest.

4. Find the row in this sorted table that pertains to the genotypic counts that were observed. The calculated p-value for the test, then, is the sum of that probability and all smaller probabilities.

If you follow the logic behind these steps, you will notice that the last step essentially performs the typical p-value calculation: the probability of being as extreme or more extreme than our observed data. "More extreme" here means our observed dataset together with all other possible datasets that are even less likely.

Notice that this p-value will be small if our dataset was somewhat unusual (since it will have a small probability and there will be few datasets less likely), and will be larger if our dataset was somewhat typical (since it will have a larger probability and there will be many datasets less likely). So it follows our usual interpretation for a p-value.
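The four steps above are mechanical enough to automate. Below is a minimal sketch in plain Python for the two-allele case (the function names are labels invented for this illustration, not from any standard package):

from math import factorial

def cond_prob(x_AA, x_Aa, x_aa):
    # Equation (3.4), as in the sketch of the previous subsection
    n = x_AA + x_Aa + x_aa
    x_A, x_a = 2 * x_AA + x_Aa, 2 * x_aa + x_Aa
    return (factorial(n) * factorial(x_A) * factorial(x_a) * 2**x_Aa /
            (factorial(2 * n) * factorial(x_AA) * factorial(x_Aa) * factorial(x_aa)))

def exact_hwe_pvalue(x_AA, x_Aa, x_aa):
    n = x_AA + x_Aa + x_aa
    x_A = 2 * x_AA + x_Aa
    p_obs = cond_prob(x_AA, x_Aa, x_aa)
    p_value = 0.0
    # Step 1: the heterozygote count must share the parity of x_A;
    # stepping it by 2 enumerates every table with the same allele counts
    for het in range(x_A % 2, min(x_A, 2 * n - x_A) + 1, 2):
        AA = (x_A - het) // 2
        aa = (2 * n - x_A - het) // 2
        p = cond_prob(AA, het, aa)          # Step 2
        if p <= p_obs + 1e-12:              # Steps 3-4: sum probabilities <= observed
            p_value += p
    return p_value

print(exact_hwe_pvalue(9, 8, 3))   # about 0.6331, matching the example below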

3.6.3 Fisher’s Exact Test - Example

Let’s look at an example of applying Fisher’s exact test. Suppose we have collected the following data on 20 individuals at one locus.

Genotype   Observed Count
AA         9
Aa         8
aa         3
           n = 20

For our observed data, the allele counts are xA = 2 × 9 + 8 = 26 and xa = 2 × 3 + 8 = 14 for a total of 2n = 40 alleles.

Now we need to think of all the possible sets of genotype counts that would end up with these same allele counts. A systematic strategy for doing this would be to think of the case with the smallest possible number of heterozygotes as a starting point. Then another possible set of counts can be obtained by adding two heterozygotes to this and subtracting one of each homozygote. Continue this until we get to the point where it is impossible to increase the heterozygote count and keep the allele counts as they need to be. This strategy is best understood by looking at this table, constructed from the example.

x*_AA   x*_Aa   x*_aa
13      0       7
12      2       6
11      4       5
10      6       4
 9      8       3    (observed)
 8      10      2
 7      12      1
 6      14      0

Notice that the row marked "(observed)" represents the actual observed data.

The first row represents the situation with the smallest number of possible heterozygotes xAa, here 0. For this case, in order to have 26 A alleles and 14 a alleles, there must be 13 AA genotypes and 7 aa genotypes.

Continuing to the second row, we get this next possible set of counts by adding two to the xAa total above. Adding only one would leave us in a situation where there would be an odd total number of A and a alleles, and so we couldn’t possibly have genotype counts with the required allele counts.

So, when we add two to the heterozygote count, we’ll need to subtract one from each homozygote count in order to keep the allele counts the same. Keep doing this, creating new rows in the table, until we can’t go any further. When we get down to the last row, notice that adding another row would require us to have a negative number for xaa in order to keep allele counts the same, and of course this is not appropriate.

Now that we have a list of all possible genotype counts that have the observed allele counts, let’s apply Equation (3.4) to each. Doing so gives us the following additional column.

x*_AA   x*_Aa   x*_aa   Conditional Probability
13      0       7       .0000
12      2       6       .0006
11      4       5       .0145
10      6       4       .1070
 9      8       3       .3057
 8      10      2       .3669
 7      12      1       .1779
 6      14      0       .0274
                        1.0000

Now let’s sort this table from smallest to largest probability.

x*_AA   x*_Aa   x*_aa   Conditional Probability
13      0       7       .0000
12      2       6       .0006
11      4       5       .0145
 6      14      0       .0274
10      6       4       .1070
 7      12      1       .1779
 9      8       3       .3057
 8      10      2       .3669
                        1.0000

The resulting p-value is the sum of all probabilities less than or equal to the probability of our observed data. This is

p-value = .3057 + .1779 + .1070 + .0274 + .0145 + .0006 + .0000 = .6331.

The p-value for the test is .6331. This is large, and so there is no evidence that the null hypothesis is not correct. We believe that this locus seems to be in Hardy-Weinberg Equilibrium.

As a comparison, conduct this test on your own using the chi-squared goodness-of-fit methodology. You should wind up with a p-value of .5888. In this case, there is a difference in the two calculated p-values, but there is no difference in the final conclusion (luckily). If the p-values were of a smaller magnitude, such as in the .05 range, then such differences might be significant and lead to different conclusions. By the way, in conducting this as a chi-squared test, we would calculate the expected count for the aa genotype to be 2.45. Since that is one of just three categories and is an expected count less than 5, it leads us to believe that Fisher's exact test is the more appropriate methodology here and leads to a more accurate p-value calculation.

As a second comparison, we could also use Fisher's exact test to compute a p-value for the MN blood group data we looked at earlier. The sample size was quite large in this case, so we would expect that the chi-squared approximate method would provide an accurate p-value. With such a large sample, it is impractical to create by hand a table like the one we just created, so we would need to rely on a computer. The resulting p-value based on Fisher's exact test is .6557, which is quite close to the p-value of .6379 computed using the chi-squared method. I used the DOS program at http://www2.biology.ualberta.ca/jbrzusto/hwenj.html to make this calculation. It can be used to apply Fisher's exact test to loci with any number of alleles, a situation which we'll introduce in the next section.

3.6.4 More than Two Alleles

Tests for HWE are commonly applied to situations where there are more than two alleles at the locus. The concept discussed above is the same, but the notation becomes a little more unwieldy since there are now many genotypes and alleles. We won't go through the formulas here, although you should be able to set up the models that underlie them. As an example, if we have three alleles, A1, A2, and A3, the null hypothesis would be

H_0: (X_{A_1A_1}, X_{A_1A_2}, X_{A_1A_3}, X_{A_2A_2}, X_{A_2A_3}, X_{A_3A_3}) \sim \text{Multinomial}\left(n, \; P_{A_1}^2, \; 2P_{A_1}P_{A_2}, \; 2P_{A_1}P_{A_3}, \; P_{A_2}^2, \; 2P_{A_2}P_{A_3}, \; P_{A_3}^2\right)

The chi-squared version of the test for HWE would proceed exactly as before: compute estimates of the unknown allele frequencies, then compare observed and expected counts using the standard goodness-of-fit statistic. We would now be estimating two allele frequency parameters (remember the third is just one minus the sum of the first two), so the degrees of freedom for the chi-squared test statistic would be 6 - 1 - 2 = 3.

The more alleles there are, the more likely we will be in a situation where we will need to use Fisher's exact test instead of the chi-squared-based test. This is because there will be a larger number of genotypes, and so for a given sample size, the counts will be spread out thinner among them.

Computing the p-value using Fisher's exact test by hand becomes extremely difficult for three or more allele situations, so we can rely on a program like the one mentioned above.

As an example, use the goodness-of-fit test to practice the calculations for the following dataset on your own.

Genotype   Observed Count
A1A1       14
A1A2       22
A1A3       9
A2A2       10
A2A3       11
A3A3       3
           n = 69
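Once you have worked it by hand, you can verify the answer with this sketch (again assuming Python with scipy; the gene counting estimates follow the same logic as in the two-allele case):

from scipy.stats import chi2

obs = {"A1A1": 14, "A1A2": 22, "A1A3": 9, "A2A2": 10, "A2A3": 11, "A3A3": 3}
n = sum(obs.values())                                      # 69

# Gene-counting MLEs of the allele frequencies
p1 = (2 * obs["A1A1"] + obs["A1A2"] + obs["A1A3"]) / (2 * n)
p2 = (2 * obs["A2A2"] + obs["A1A2"] + obs["A2A3"]) / (2 * n)
p3 = 1 - p1 - p2

# Expected counts under three-allele Hardy-Weinberg proportions
exp = {"A1A1": n * p1**2, "A1A2": n * 2 * p1 * p2, "A1A3": n * 2 * p1 * p3,
       "A2A2": n * p2**2, "A2A3": n * 2 * p2 * p3, "A3A3": n * p3**2}

T = sum((obs[g] - exp[g])**2 / exp[g] for g in obs)
print(T, chi2.sf(T, 6 - 1 - 2))   # p-value about 0.8484, matching the text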

You should get a p-value of .8484 using the chi-squared test, so it seems that this three-allele locus follows Hardy-Weinberg Equilibrium fairly closely. Was the chi-square test appropriate to use for this example?

Chapter 4

Linkage and other Two-Locus and Multi-Locus Analyses

4.1 Review of Linkage

The study of genetic linkage is the study of how close in physical proximity loci are to each other on a chromosome. The ultimate goal is to create physical or genetic maps of a chromosome (maps of where various marker loci and/or genes are on a chromosome and their physical relationship to each other), or to do disease gene mapping (narrow down the region of the genome in which a gene that contributes to a disease must be).

We will first be general in our discussion and consider any two loci, whether they represent a coding region (genes) or simply genetic markers (non-coding regions). Let the first locus, locus A, have two alleles in the population labeled A and a, and let the second be locus B with the two alleles B and b. We are interested in how far apart loci A and B are (if they are in fact on the same chromosome). In most situations, we can only determine this indirectly (and only approximately at that). In other words, we won't be able to pinpoint the precise locations of both of these loci in the genome, so we'll have to infer the distance between them.

One of the key concepts that we will use to help make this inference is crossovers that occur during meiosis, as we have discussed earlier in the course. Recall that the basic idea is that the closer two loci are to each other, the less likely a crossover is to happen between them during meiosis, and therefore a smaller percentage of resulting gametes will be recombinants. The further apart they are (or if they are actually on different chromosomes), the more likely it is that one or more crossovers will happen, leading to a higher percentage of gametes being recombinants (up to a theoretical maximum of 1/2).

4.1.1 Measuring Distance

There are two common distance measurements that are used to describe the proximity of two loci A and B. One is called the physical distance. As implied by the name, this is the actual number of base pairs physically


between the two locations. It is typically measured in kilobases (kb) and notated by dAB. Of the two, this is the more difficult to determine, and is rarely known precisely until a more complete physical map of that area of the chromosome is made.

The second is called the genetic distance between the loci. We'll notate this by xAB (as in Weir, 1996). The unit of this measurement is the centiMorgan (cM). The centiMorgan is a unit of map distance defined as the distance along which there is a 2% chance of at least one crossover occurring during a meiosis between A and B. So, the closer together two loci are, the fewer cM's between them. In the human genome, it turns out that 1cM is approximately equal to about 1 million bases (1000 kb). However, this is an average across the entire genome of 3.3 × 10^9 bases. It is known that this relationship can be very different in different regions. It has also been found that the relationship between 1cM and base pairs is very different across species, and even different for males and females in many species.

Most of our focus will be on this second measure, so let's say a little more about it. A related measure is the recombination fraction between loci A and B. This is defined as the percentage of gametes that will be recombinants as the result of many meioses in the population. The notation we will use is rAB. Recall that this value must be in the range [0, .5]. The hard upper bound of 0.5 is due to the fact that even if there always is a cross-over between A and B, 1/2 of the gametes produced will still be of the non-recombinant type. More precisely, if an individual has genotype AB/ab, the gamete probabilities after many meioses are shown in Table 4.1 (as we saw earlier in the course).

Table 4.1: Gamete probabilities from recombination

Gamete        AB             Ab (recombinant)   aB (recombinant)   ab
Probability   (1 - rAB)/2    rAB/2              rAB/2              (1 - rAB)/2

Notice that the sum of the probabilities of the two recombinant gamete types is rAB, as required. The above assumes that there are not double (or more) crossovers in the region, which is valid for short distances, but not for longer ones. Another result of this is that for short distances, 1% recombination is equal to 1cM.

4.2 Estimating rAB

Our primary goal now will be to estimate the parameter rAB for a pair of loci. A very direct way to estimate it precisely would be to observe the results of many meioses that occur for a particular individual. However, one should realize that this is a very invasive process (if we are talking about humans), and is also very time consuming and difficult (for any species). We would have to monitor the microscopic process of meiosis for an individual, collect resulting gametes, determine the alleles on each gamete through a molecular process, then repeat this a large number of times for this and other selected individuals. This is just not a feasible process. In fact, the ability to determine the alleles on the gametes presupposes that we know where those loci are, which we don't - that is the whole reason for this problem in the first place.

So, the general experimental process that is used is to observe the offspring of matings between individuals. This is a reasonable substitute for directly observing a sample of meioses since the alleles (and gametes) passed on to offspring through mating are the result of those meioses we wanted to observe.

It should be noted, however, we still need to be aware of what we are capable (and incapable) of observing even in such an experiment. First, we still will not directly observe genotypes (for the most part) of these offspring since that would still require a lot of molecular work and presupposes that we know the locations of the loci. Instead, we will make use of an individual's phenotype, which may or may not allow us to infer their genotype directly.

Second, the above situation still poses a potential issue in many scenarios that we'll have to deal with. The problem is the following: If we don't know the precise genotype of the parental individuals (including the phase of the loci), phenotypic data will not help us infer whether or not the offspring was the result of a recombinant gamete. For example, if we notice an offspring with a doubly recessive phenotype, we know the genotype must be ab/ab. However, the ab gametes could be a recombinant type (if the parent was Ab/aB for example) or a non-recombinant type (if the parent was AB/ab for example). So without knowing the phase of doubly dominant parents, we won't be able to infer whether an offspring's gametes were recombinant or not through their phenotype. Fortunately, we can set things up experimentally to help overcome this problem, as we'll see.

4.2.1 Backcross Experiment

If our study is on plants or certain animals, we can set up mating experiments in a way that we will be able to infer recombinants precisely. First, let's make it clear that each locus we are studying must have a dominant-recessive relationship (with the capital letter representing the dominant allele).

The first experiment type we can perform is called a backcross experiment. In this experiment, we would start off with genetic lines that are completely homozygous at both loci - some for the dominant alleles (AB/AB) and some for the recessive alleles (ab/ab). Crossing these two lines (i.e., allowing matings to occur between individuals from one line to individuals from the other line) produces a population of individuals called the

F1 population. All individuals in F1 will be doubly heterozygous (AB/ab) with phase known.

Now this F1 population of individuals is backcrossed with the doubly recessive line (ab/ab) of individuals.

The offspring in the resulting F2 population will always receive an ab gamete from the doubly recessive parent, but the doubly heterozygous parent will provide one of four possible gametes to offspring: AB, Ab, aB, or ab. The second and third are recombinant types.

So, the offspring of the backcross will fall into one of four possible genotypic categories. And, due to the setup of the experiment, each of these four genotypes will outwardly display a different phenotype, and so the underlying genotype can be directly inferred. In addition, the percentage of offspring we would expect to fall into each of these categories can be written based on the recombination fraction rAB. Table 4.2 summarizes this.

Table 4.2: Backcross genotypic probabilities

Genotype   Phenotype                        Probability
AB/ab      Dominant at A, Dominant at B     (1 - rAB)/2
Ab/ab      Dominant at A, Recessive at B    rAB/2
aB/ab      Recessive at A, Dominant at B    rAB/2
ab/ab      Recessive at A, Recessive at B   (1 - rAB)/2

Our goal now is to estimate the unknown rAB based on resulting data from such an experiment. Luckily, this is not difficult because we notice that this setup falls nicely into a multinomial distribution situation. We have a random sample of individuals that each fall into one of four categories, whose probabilities can be written in terms of our unknown parameter. In fact, we can reduce this to a binomial distribution because all we are really interested in is whether or not a recombination event occurred between A and B. So, the first and fourth categories can be combined and called the non-recombinant (or parental) category, and the second and third categories can be combined and called the recombinant category in our experiment, as summarized in Table 4.3.

Table 4.3: Backcross as a binomial experiment

Binomial Category   Probability   Observed Count
Recombinant         rAB           nR
Parental            1 - rAB       nP

We will observe n total individuals and count the number nR of recombinant types, and the remaining will be nP . We can now model the random variable nR as a binomial random variable.

n_R \sim \text{Bin}(n, r_{AB})

This is exactly the same setup as our standard binomial distribution, with rAB playing the role of p in the generic notation. So, all of our estimation procedures work just as before. In other words, the MLE for rAB is the sample proportion that fell into the recombinant category:

\hat{r}_{AB} = n_R / n.

There is one slight adjustment we will make to this estimator to account for the fact that we know the largest possible value for rAB is 0.5 based on the known constraint of meiosis. It is possible due to sampling variation that we will observe the proportion of recombinants to be greater than 0.5 in our experiment. If this is the case, we will use 0.5 as our estimate of rAB instead of nR/n. So, we can more precisely write the estimator as

\hat{r}_{AB} = \min(n_R / n, \; 0.5).

As we have seen before, the estimator \hat{r}_{AB} is an unbiased estimator for r_{AB} and has standard deviation

\sigma_{\hat{r}_{AB}} = \sqrt{\frac{r_{AB}(1 - r_{AB})}{n}}

An approximate 95% confidence interval for rAB can then be calculated as

\hat{r}_{AB} \pm 2 \times \sqrt{\frac{\hat{r}_{AB}(1 - \hat{r}_{AB})}{n}}.

Example 4.1: The following data was reported for maize by Goodman (1980). The two loci A and B represent different -producing genes in the plant. Following the experimental backcross method, the following genotype (and phenotype) counts were observed:

Genotype   Observed Count
AB/ab      53
Ab/ab      13
aB/ab      9
ab/ab      55

The total sample size was n = 130 and nR = 22 was the observed number of recombinants. The maximum likelihood estimate of the recombination frequency between these loci is

\hat{r}_{AB} = 22/130 = 0.1692

with estimated standard deviation

\hat{\sigma}_{\hat{r}_{AB}} = \sqrt{\frac{0.1692(1 - 0.1692)}{130}} = 0.0329

Our approximate 95% confidence interval for rAB is

0.1692 \pm 2 \times 0.0329 = (0.1034, 0.2350).

This is a fairly wide range, due to the modest sample size for the experiment.
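As a quick check of this arithmetic, here is a short sketch of Example 4.1 in plain Python (variable names are just labels for this illustration):

from math import sqrt

n_R, n = 13 + 9, 130                 # recombinant count and total offspring
r_hat = min(n_R / n, 0.5)            # MLE, about 0.1692
se = sqrt(r_hat * (1 - r_hat) / n)   # about 0.0329
ci = (r_hat - 2 * se, r_hat + 2 * se)
print(r_hat, se, ci)                 # CI roughly (0.1034, 0.2350)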



We may also want to conduct a hypothesis test to ask the question of whether the two loci are linked or not (in addition to, or instead of, estimating the value of the recombination fraction). This can be accomplished simply with a chi-square goodness-of-fit test with the following hypotheses

H0: A and B are not linked (i.e., rAB = 0.5)

Ha: A and B are linked (i.e., rAB < 0.5)

The frequencies we would expect in the two categories under the null hypothesis are 1/2 and 1/2, so we can construct our table as in Table 4.4.

Table 4.4: Observed and expected counts for linkage test

Category      Observed Count   Expected Count
Recombinant   nR               n/2
Parental      nP = n - nR      n/2

The test will have one degree of freedom since there are two categories and the null hypothesis has no unknown parameters.

Example 4.2: To continue the maize example, we can conduct the chi-square test by constructing the following table of observed and expected counts

Category      Observed Count   Expected Count
Recombinant   22               130/2 = 65
Parental      108              130/2 = 65

The test statistic is

T = \frac{(22 - 65)^2}{65} + \frac{(108 - 65)^2}{65} = 56.89.

This has a very small p-value (zero to more than four decimal places) for a \chi^2_1 distribution, and so we clearly reject the null hypothesis. The loci seem to be linked.



4.2.2 F2 Experiment

Another experiment that can be used in plants is called an F2 experiment. In this experiment, an F1 population is created just as with the backcross experiment. This population has all doubly heterozygous individuals with phase known. Now, however, this population is allowed to mate amongst themselves to create an F2 population. When double heterozygotes mate, there will be nine possible resulting genotypes with four noticeable phenotypes. We want to summarize these along with their expected probabilities.

To do this, first notice that each individual in such a mating produces four possible gametes in proportions shown in Table 4.5.

Table 4.5: Gamete proportions for F2 experiment

Gamete   Probability
AB       (1 - rAB)/2
Ab       rAB/2
aB       rAB/2
ab       (1 - rAB)/2

Now, the potential resulting offspring genotypes can be viewed by considering which gamete was received from each parent. This is displayed in Table 4.6. Notice that the phase of the offspring will not necessarily be known, but we can write down the possible phase-unknown genotypes. This is enough to allow us to connect it to the observable phenotypes. Table 4.7 shows the probabilities corresponding to each genotype in Table 4.6.

Table 4.6: Resulting offspring genotypes from F2 experiment

                   Parent 2
Parent 1   AB      Ab      aB      ab
AB         AABB    AABb    AaBB    AaBb
Ab         AABb    AAbb    AaBb    Aabb
aB         AaBB    AaBb    aaBB    aaBb
ab         AaBb    Aabb    aaBb    aabb

Table 4.7: Resulting offspring probabilities, corresponding to genotypes in Table 4.6

                        Parent 2
Parent 1   AB           Ab           aB           ab
AB         (1-r)^2/4    r(1-r)/4     r(1-r)/4     (1-r)^2/4
Ab         r(1-r)/4     r^2/4        r^2/4        r(1-r)/4
aB         r(1-r)/4     r^2/4        r^2/4        r(1-r)/4
ab         (1-r)^2/4    r(1-r)/4     r(1-r)/4     (1-r)^2/4

Because of the symmetry of this genotype matrix, there are 9 (not 16) different genotypes. They will correspond to 4 observable phenotypes as shown in Table 4.8.

Table 4.8: Correspondence of genotypes to phenotypes in F2 experiment

Genotype   Phenotype                  Phenotype Notation
AABB       Dominant A, Dominant B
AaBB       Dominant A, Dominant B     AB
AABb       Dominant A, Dominant B
AaBb       Dominant A, Dominant B
AAbb       Dominant A, Recessive B    Ab
Aabb       Dominant A, Recessive B
aaBB       Recessive A, Dominant B    aB
aaBb       Recessive A, Dominant B
aabb       Recessive A, Recessive B   ab

With this mapping of phenotypes to genotypes in mind, we can now write down the probability model for the observed phenotype counts (recall, all we will be able to observe is phenotypes, not genotypes). Clearly, the multinomial model with k = 4 is appropriate. We just need to figure out the expected probability for each category by adding up the probabilities from the appropriate cells of the table above. We will do this below for just one of the four categories, but you should go through the process on your own for each of them.

Let’s focus on the phenotype Ab. Tables 4.9 and 4.10 reproduce Tables 4.6 and 4.7, but with the related cells highlighted.

Table 4.9: Resulting offspring genotypes from F2 experiment (cells giving the Ab phenotype marked with *)

                   Parent 2
Parent 1   AB      Ab       aB      ab
AB         AABB    AABb     AaBB    AaBb
Ab         AABb    AAbb*    AaBb    Aabb*
aB         AaBB    AaBb     aaBB    aaBb
ab         AaBb    Aabb*    aaBb    aabb

We can see that the probability of this phenotype is the sum of the highlighted probabilities in the table. This is

\frac{r^2}{4} + \frac{2r(1-r)}{4} = \frac{2r - r^2}{4} = \frac{r(2-r)}{4} = \frac{1 - (1-r)^2}{4}

The algebraic manipulation leading to the last equality above has been made for a reason we will see below. Going through this process for the other three categories gives us the probabilities in Table 4.11.

The reason for writing each of these probabilities in terms of the factor (1-r)^2 is so that we can make a substitution that will make the problem easier. Letting θ = (1-r)^2 and substituting into the probabilities

Table 4.10: Resulting offspring probabilities, corresponding to genotypes in Table 4.9 (cells giving the Ab phenotype marked with *)

                        Parent 2
Parent 1   AB           Ab           aB           ab
AB         (1-r)^2/4    r(1-r)/4     r(1-r)/4     (1-r)^2/4
Ab         r(1-r)/4     r^2/4 *      r^2/4        r(1-r)/4 *
aB         r(1-r)/4     r^2/4        r^2/4        r(1-r)/4
ab         (1-r)^2/4    r(1-r)/4 *   r(1-r)/4     (1-r)^2/4

Table 4.11: Phenotype probabilities from F2 experiment

Phenotype   Probability
AB          (2 + (1-r)^2)/4
Ab          (1 - (1-r)^2)/4
aB          (1 - (1-r)^2)/4
ab          (1-r)^2/4

from Table 4.11 gives us the simplified probability forms in Table 4.12.

Table 4.12: Phenotype probabilities from F2 experiment, written in terms of θ = (1-r)^2

Phenotype   Probability
AB          (2 + θ)/4
Ab          (1 - θ)/4
aB          (1 - θ)/4
ab          θ/4

These probabilities are a little simpler looking and will be easier to deal with. Also, even though our ultimate goal is to find an estimator for r, we can do that by first finding an estimator for θ (call it θ̂), and then our estimator for r will be r̂ = 1 - √θ̂, by just writing r in terms of θ.

Now, with all this behind us, we can fully state our probability model. We have four observed counts, call them XAB, XAb, XaB, and Xab with the following multinomial model.

(X_{AB}, X_{Ab}, X_{aB}, X_{ab}) \sim \text{Multinomial}\left(n, \; \frac{2+\theta}{4}, \; \frac{1-\theta}{4}, \; \frac{1-\theta}{4}, \; \frac{\theta}{4}\right)

The joint pdf of these random variables, and therefore the likelihood function for θ, is

L(\theta) = \frac{n!}{X_{AB}! \, X_{Ab}! \, X_{aB}! \, X_{ab}!} \left(\frac{2+\theta}{4}\right)^{X_{AB}} \left(\frac{1-\theta}{4}\right)^{X_{Ab}} \left(\frac{1-\theta}{4}\right)^{X_{aB}} \left(\frac{\theta}{4}\right)^{X_{ab}}

Notice that we just have one parameter in our model (θ). Also notice that this multinomial model is much more complex than any we have come across before (if you don't notice it now, it will become very apparent shortly).

To find the MLE for θ, we need to go through the usual steps. First, the log-likelihood is

\ln L(\theta) = C + X_{AB} \ln\left(\frac{2+\theta}{4}\right) + X_{Ab} \ln\left(\frac{1-\theta}{4}\right) + X_{aB} \ln\left(\frac{1-\theta}{4}\right) + X_{ab} \ln\left(\frac{\theta}{4}\right)

and the derivative with respect to θ is

\frac{\partial \ln L(\theta)}{\partial \theta} = \frac{X_{AB}}{2+\theta} - \frac{X_{Ab} + X_{aB}}{1-\theta} + \frac{X_{ab}}{\theta}

Now we need to set this equal to zero and solve for θ. This process is not nearly as easy as it has been in other examples to this point, but it can be done. First, we'll multiply through by the least common denominator and work towards solving for θ.

0 = \theta(1-\theta)X_{AB} - \theta(2+\theta)(X_{Ab} + X_{aB}) + (1-\theta)(2+\theta)X_{ab}

0 = \theta X_{AB} - \theta^2 X_{AB} - 2\theta(X_{Ab} + X_{aB}) - \theta^2(X_{Ab} + X_{aB}) + 2X_{ab} - \theta X_{ab} - \theta^2 X_{ab}

0 = -\theta^2(X_{AB} + X_{Ab} + X_{aB} + X_{ab}) + \theta(X_{AB} - 2[X_{Ab} + X_{aB}] - X_{ab}) + 2X_{ab}

0 = n\theta^2 - (X_{AB} - 2X_{Ab} - 2X_{aB} - X_{ab})\theta - 2X_{ab}

This is now a quadratic equation in θ which can be solved using the quadratic formula. We get

\theta = \frac{(X_{AB} - 2X_{Ab} - 2X_{aB} - X_{ab}) \pm \sqrt{(X_{AB} - 2X_{Ab} - 2X_{aB} - X_{ab})^2 + 8nX_{ab}}}{2n}

It is easy to see that using the negative sign version of the above solution leads to a negative number, which is an inappropriate value for θ. Therefore, our MLE for θ is the positive sign version of the above, giving us

\hat{\theta} = \frac{(X_{AB} - 2X_{Ab} - 2X_{aB} - X_{ab}) + \sqrt{(X_{AB} - 2X_{Ab} - 2X_{aB} - X_{ab})^2 + 8nX_{ab}}}{2n}

Recall that our real goal was to estimate r, which is a function of θ. This estimator is

\hat{r} = 1 - \sqrt{\hat{\theta}}

The calculation of the variance of the estimator θ̂ is not easy and the technique is beyond the scope of our discussion. However, the formula is below

\sigma^2_{\hat{\theta}} = \frac{2\theta(2+\theta)(1-\theta)}{n(1+2\theta)}

where we would replace θ by θ̂ in the formula when calculating it.

Example 4.3: Suppose we have data from an F2 experiment where 1929 plants were categorized into the four phenotypic categories as follows

Phenotype   Observed Count
AB          1221
Ab          219
aB          246
ab          243

We wish to estimate the recombination fraction between the loci. Our estimate of θ using the above formula is

\hat{\theta} = \frac{48 + \sqrt{48^2 + 3749976}}{3858} = 0.5145.

This leads to an estimate for r of

\hat{r} = 1 - \sqrt{\hat{\theta}} = 1 - \sqrt{0.5145} = 0.2827.

We would say, then, that we estimate that 28% of the gametes produced by meiosis in this species are of the recombinant type with regard to these two loci. For now, we can also say that the distance between the loci is 28cM.
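The quadratic-formula MLE is easy to mistype by hand, so here is a short sketch of Example 4.3 in plain Python for checking the arithmetic (variable names are labels chosen for this illustration):

from math import sqrt

X_AB, X_Ab, X_aB, X_ab = 1221, 219, 246, 243
n = X_AB + X_Ab + X_aB + X_ab                            # 1929

b = X_AB - 2 * X_Ab - 2 * X_aB - X_ab                    # 48
theta_hat = (b + sqrt(b**2 + 8 * n * X_ab)) / (2 * n)    # about 0.5145
r_hat = 1 - sqrt(theta_hat)                              # about 0.2827

# Approximate variance of theta-hat, with theta replaced by its estimate
var_theta = 2 * theta_hat * (2 + theta_hat) * (1 - theta_hat) / (n * (1 + 2 * theta_hat))
print(theta_hat, r_hat, sqrt(var_theta))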



4.2.3 Human Pedigree Analysis

For analysis of linkage in humans, clearly we cannot conduct experiments such as those discussed above. The general method used is to analyze family pedigrees, which is simply a "family tree" type of picture that shows relationships among individuals in a family as well as known genetic information about each. Very often, such analyses are conducted with the goal of finding the location (or at least narrowing down to a small area) of genes that cause diseases.

The basic idea is fairly simple. It is really a matter of getting familiar with the notation in a pedigree and careful analysis of the passing of genes from one generation to the next.

The sample pedigree in Figure 4.1 (a fairly simple one) is from Weir (1996) and represents a family with individuals who have hereditary motor and sensory neuropathy type 1 (see

http://www.emedicine.com/neuro/byname/charcot-marie-tooth-and-other-hereditary-motor-and-sensory-neuropathies.htm for more info on this disease of the muscular system, usually in children). This disease gene is referred to as CMT1, and is dominant, meaning that affected individuals have at least one dominant allele. Normal individuals are homozygous for the recessive allele. We'll use Weir's notation and refer to the alleles for this

Figure 4.1: Sample family pedigree from Weir (1996)

gene as T and t. Affected individuals have genotype Tt and normal individuals are tt.

In pedigrees, squares represent males and circles represent females. Typically, individuals who are affected with the disease in question have their symbol darkened in, but in this example, they have the notation Aff. within their symbol.

The other locus for which information is represented here is the ABO blood group. The notation A or O under each individual represents their blood type (a phenotype). Apparently, no individuals in this pedigree have blood type B or AB.

So, our goal is to estimate the recombination fraction r between the CMT1 disease locus and the ABO blood group locus. Since the location of ABO is known, if we find that CMT1 is close to it (i.e., small r), it can help narrow down our search for the disease gene. We will make an estimate from this pedigree by trying to detect recombination events that must have occurred as genes were passed down to the four children in the most recent generation.

In our analysis of the pedigree, we must carefully walk through individual by individual and attempt to infer what their phase known genotypes must be based just on their known phenotypes. We will first focus on the individual labeled 3.1. This person is Tt for the disease gene and is at first glance either AA or AO for the ABO gene. However, since his spouse is blood type O, she must be OO at that gene. Therefore, 3.1 cannot be AA because some of the children of 3.1 are O blood types, so they must have received an O allele from each parent. We now know that 3.1 must be AO. So the phase-unknown genotype is Tt AO.

We can deduce the correct phase, however, by looking at his parents. Individual 2.1, being unaffected and being of blood type O, must have phase-known genotype tO/tO. This means that 3.1 will have received a gamete containing the alleles tO from this parent. We can now say that 3.1's phase known genotype is TA/tO.

For individual 3.2, like individual 2.1, the genotype must be tO/tO. So children of this 3.1-3.2 mating will always get a tO gamete from their mother. We will not be able to make any recombination estimates based on 3.2’s passing on of genetic information. But we can do so for 3.1.

Now, looking at the children of this mating, they are all affected and so always got the T allele from their father (since their mother doesn’t have that allele to pass on). Notice that the children 4.2, 4.3, and 4.4 are all O blood types. So they must also have received an O allele from their father. Therefore, 3.1 passed on the gamete TO to these three children, which is a recombinant gamete since his genotype is TA/tO.

Finally, child 4.1 has A blood type, so must have received the non-recombinant gamete TA from 3.1.

So, we count three recombination events out of four that we can analyze. 75% of the observable meioses in this pedigree resulted in recombinant gametes. Our estimate of r then is 0.5 (it would be .75, but we have seen that .5 is our hard upper bound for values of r). This was a very small pedigree and a very small amount of data to come up with accurate estimates. But at first glance, it may seem that these two loci are not tightly linked.

Below is the pedigree again, with our analysis of genotypes included.

4.2.4 Genetic Markers

One tool that has helped geneticists locate positions of genes along a chromosome is genetic markers. In general, we can think of a marker as a special locus (i.e., a location along a chromosome) which is not in a coding region; in other words, the associated DNA sequence is not transcribed into a protein. There is a lot to still be learned about these non-coding regions of DNA (which make up a very high percentage of an individual's DNA), but the presence of markers in these areas is very helpful.

More specifically, markers are regions of DNA that can be detected based on the particular sequence of bases they contain. One very common example is restriction enzyme fragments. These are sequences of DNA that are detected by a certain enzyme. For example, an enzyme named HindIII is found in bacteria and when allowed to react with a DNA molecule, it will "find" the sequence of bases AAGCTT (in the 5' to 3' direction) and clip the DNA after the first A, creating fragments in the original sequence. Other enzymes recognize other base sequences. For example, the enzyme EcoRI recognizes the sequence GAATTC and clips the DNA after the G. The enzyme HpaI recognizes GTTAAC and clips the DNA after the second T. There are hundreds of known such enzymes, each recognizing different DNA sequences. The resulting DNA fragments are called restriction fragments, and their size and number within a particular DNA sequence can be used to differentiate between individuals (i.e., can be considered different alleles for this marker locus).

As an example, consider an individual who has the following two homologous DNA sequences (one on each copy of some chromosome):

AATTCCAGGTTAACCCCAGGTTTTCAGGTTGTTAACCAAGGTTTGGA
AATTCCGAGTTAACCCCAGTTTTGCAGGGGTTACTAGCCCAGTTCAT

If we allow the enzyme HpaI to "digest" the first sequence above, notice that it will find two GTTAAC sequences (its recognition sequence) and clip the DNA after the second T. The result is three DNA fragments:

AATTCCAGGTT
AACCCCAGGTTTTCAGGTTGTT
AACCAAGGTTTGGA

When it digests the second sequence, it finds only one GTTAAC sequence, and so the result is only two fragments:

AATTCCGAGTT
AACCCCAGTTTTGCAGGGGTTACTAGCCCAGTTCAT

So, for this particular location in the genome, we might consider these two situations as two different possible alleles, described by the size and number of fragments left after being digested by HpaI. We might consider the first digestion result allele 1 and the second allele 2. A different individual might have a DNA sequence at this location that leads to a different pattern of fragments which we might label allele 3, and so on.
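Conceptually, the digestion is just a string operation. Here is a toy sketch in plain Python (the helper name digest is made up for this illustration; as noted below, the real laboratory process is far more involved):

def digest(seq, site="GTTAAC", cut_offset=3):
    # Cut the sequence after the third base of each recognition site
    # (GTT / AAC), mimicking HpaI in the example above.
    fragments, start = [], 0
    pos = seq.find(site)
    while pos != -1:
        fragments.append(seq[start:pos + cut_offset])
        start = pos + cut_offset
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments

print(digest("AATTCCAGGTTAACCCCAGGTTTTCAGGTTGTTAACCAAGGTTTGGA"))  # 3 fragments
print(digest("AATTCCGAGTTAACCCCAGTTTTGCAGGGGTTACTAGCCCAGTTCAT"))  # 2 fragments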

It should be noted that the laboratory process to actually do everything stated above is not a simple pro- cess. A lot of work goes into recognizing the number and size of fragments created (remember, these are very small, microscopic molecules, so it is not as simple as just looking down and seeing the three fragments).

These restriction fragment sites are just one example of the kind of marker site that can be detected through laboratory analysis. Other examples are sites that have repeats of the same sequence many times, called

VNTR (Variable Number of Tandem Repeats) or STR (Short Tandem Repeats) sites, RAPD Loci (Random Amplified Polymorphic DNA), and SNPs (Single Nucleotide Polymorphisms).

4.2.5 Genetic Markers in Disease Analysis

In the human pedigree example we saw above, we attempted to determine the distance between a disease gene and the ABO blood group locus. We can use loci such as ABO or a few other well known and understood genes to complement these pedigree analyses. The downside is that such loci are few and far between. So, if we really want to narrow our search area for a disease gene down to a relatively small region of DNA, it won't help us much.

That is where these genetic markers come in. Markers such as restriction fragment sites, VNTR's etc., are found in great quantities all over the human genome, and so they can be very useful in narrowing down the search for the location of a gene. As a general method, we can keep measuring recombination fractions between the disease gene and many such marker loci until we begin finding markers that tend to have low recombination rates. Then we know we are nearing the true location of the gene. This is the basic method that was used to find genes such as the one for Cystic Fibrosis. A lot more is known about this disease now that its location has been found and its DNA sequence and related protein sequence have been analyzed.

As an example, Figure 4.2 shows a pedigree of a family where Huntington’s disease is prevalent. Huntington’s is a dominant disease, so that, as with the previous example, the homozygous recessive genotype hh is the normal condition.

Figure 4.2: Huntington's disease pedigree

This pedigree is quite complete, going back seven generations. For individuals in the recent generations who were still alive (a slash through an individual's symbol means they have died), the notations AA, AB, BC, etc. under their symbol refer to the genotype of a restriction fragment site. This particular restriction site has four different alleles, which are labeled A, B, C, and D.

This pedigree is just included here as an example of the use of genetic markers to construct linkage analyses involving disease genes. Although this particular pedigree is large, its usefulness for detecting recombination events is not great because many of the parents of the large families in the recent generations died before being able to be genotyped for the restriction fragment site.

4.3 Constructing Genetic Maps

We will now make use of estimated recombination fractions between loci to construct genetic maps of chromosomes.

4.3.1 Relationship between Map Distance and Recombination Frequency

We noted earlier that genetic distance (also called map distance) is the typical measure used to identify distances between loci along a chromosome, and so is used often in constructing genetic maps. Recall that genetic distance is measured in centiMorgans (cM), where one cM is defined as the length of chromosome along which we expect a 2% chance of at least one crossing-over occurring during a meiosis. A centiMorgan is also referred to as a map unit (1cM = 1 map unit).

It may at first glance seem that the centiMorgan and recombination frequency are the same thing, since we have stated earlier that 2% crossing-over equals 1% recombination. However, notice that the definition of a centiMorgan involves the probability of "crossing-over occurring", while the definition of recombination frequency involves the percentage of recombinant gametes produced. These two are not the same if an even number of crossovers occurred between the loci during the meiosis (i.e., r < c/2 when there can be two or more crossovers). If that were the case, crossing-over would have occurred, but recombinants would not have been produced. For very short distances, we don't have to worry about this since two (or more) crossovers within a small region is rare (this fact is called interference). But for greater distances, it is more common for two or more crossovers to occur, and so the map unit and recombination frequency are no longer equivalent.

Since we construct genetic maps in map units, but we make measurements of recombination frequency from experimental data and pedigrees, we first need to be sure that we are getting accurate measurements of map units based on recombination data. One way to do this is with a mapping function. There are many such mapping functions that have been suggested. One that we'll discuss is a function derived by Kosambi in 1944 (we'll call it the Kosambi mapping function). This function relates recombination frequency to map distance in the following way (recall that xAB is our notation for the map distance, in cM, between loci A and B), and is plotted in Figure 4.3.

x_{AB} = \frac{1}{4} \ln\left(\frac{1 + 2r_{AB}}{1 - 2r_{AB}}\right)     (4.1)

Figure 4.3: Kosambi's mapping function (xAB plotted against rAB for rAB between 0 and 0.5)

Notice that this function is nearly linear for small values of rAB. This takes into account that double crossovers are rare over short distances, meaning map distance and recombination frequency are nearly the same. But for larger values of rAB, the two measures begin diverging, and map distance is greater than recombination frequency. In this function, xAB ranges from 0 to ∞ as rAB ranges from 0 to 0.5.

As an example, in our F2 plant experiment earlier in this section, we estimated rAB = .283. At the time, we said that the map distance could be estimated as 28.3cM. However, using Kosambi’s formula, we can get a better estimate of map distance. The calculation for that example would be

x_{AB} = \frac{1}{4} \ln\left(\frac{1 + 2(.283)}{1 - 2(.283)}\right) = .321

So we would estimate the map distance between A and B more accurately as 32.1cM.
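Equation (4.1) is easy to evaluate programmatically. A sketch in plain Python (the function name kosambi is just a label for this illustration; the output is in Morgans, so we multiply by 100 for cM):

from math import log

def kosambi(r):
    # Kosambi mapping function, Equation (4.1)
    return 0.25 * log((1 + 2 * r) / (1 - 2 * r))

print(100 * kosambi(0.283))   # about 32.1 cM, as in the example above
print(100 * kosambi(0.05))    # about 5.0 cM: nearly linear for small r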

It should be noted again that this mapping function is just one of many that have been suggested. It should not be considered perfect, and works better in some situations than others.

Another way to use recombination frequency to accurately determine map distance between two loci A and B is through a more exacting experimental method called a three-point testcross (also called a three-point backcross). The idea is to expand our experiment so that we can directly observe double crossovers between A and B. To see how it will work, let's first suppose we did a standard backcross as we discussed earlier

(we might more accurately call that a two-point backcross now). Let's say that out of 100 individuals, the offspring had the genotypes (inferred from phenotypes) given in Table 4.13.

Table 4.13: Example genotypic counts for a backcross

Genotype   Count
AB/ab      37
Ab/ab      10
aB/ab      13
ab/ab      40

We would use this data and our simple estimation procedure for backcrosses to estimate rAB = (10 + 13)/100 = .23. We might be tempted to claim that the distance in map units is 23cM. However, it is certainly possible that double crossovers occurred in some of these meioses that would have resulted in parental types being transmitted even though crossover events were happening.

To help detect such double crossovers, we can use a third locus (probably a marker locus) that we know to be between the other two. Let’s call this third locus C. Now a three-point backcross experiment can be run. In this experiment, we will cross triply heterozygous ACB/acb individuals with homozygous recessive acb/acb individuals and observe the results. In this case, we will be able to distinguish eight phenotypic and associated genotypic classes as shown in Table 4.14.

Table 4.14: Three-point backcross relationship between genotypes and parental/recombinant types

ACB/acb, acb/acb   Parental Types (no crossover occurred)
Acb/acb, aCB/acb   Recombinant Types (one crossover between A and C)
ACb/acb, acB/acb   Recombinant Types (one crossover between C and B)
AcB/acb, aCb/acb   Recombinant Types (two crossovers: between A,C and C,B)

The single crossover types were already detected by the two-point backcross. It is the double crossovers that are newly detected. Some of the original gametes that we labeled as parental types in the two-point backcross can now be seen to be recombinant types where a double crossover occurred. Our new data might be as in Table 4.15.

We can now use this to estimate the recombination frequencies rAC and rCB. These estimates from this data are

\hat{r}_{AC} = \frac{5 + 8 + 3 + 4}{100} = .20 \quad \text{and} \quad \hat{r}_{CB} = \frac{5 + 5 + 3 + 4}{100} = .17

So the map distance between A and C is 20cM and between C and B is 17cM. Map distances are additive

Table 4.15: Example genotypic counts for three-point backcross (extension of Table 4.13)

Genotype   Count
ACB/acb    34
acb/acb    36
Acb/acb    5
aCB/acb    8
ACb/acb    5
acB/acb    5
AcB/acb    3
aCb/acb    4

over these shorter distances, so we can now better estimate the map distance between A and B (which was our original goal) to be 37cM (instead of our original 23cM).

Note that this new experiment hasn’t changed our estimate of the recombination frequency rAB between A and B. It is still .23, since 23% of the gametes are recombinant with regard to these loci. We have just done a better job of estimating the true map distance between them. Notice that this process could continue if we wanted to be more accurate. In other words, we used the recombination frequency rAC directly as an estimate of the map distance between A and C. But, maybe they are far enough apart that double crossovers may occur in that region. If we are worried about this, we may use another marker, say D between A and C to further analyze distances.
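The bookkeeping of the three-point analysis is easy to mirror in code. A short sketch in plain Python of the Table 4.15 calculation (the category groupings are exactly those of Table 4.14):

counts = {"ACB/acb": 34, "acb/acb": 36,   # parental: no crossover
          "Acb/acb": 5,  "aCB/acb": 8,    # one crossover between A and C
          "ACb/acb": 5,  "acB/acb": 5,    # one crossover between C and B
          "AcB/acb": 3,  "aCb/acb": 4}    # double crossovers
n = sum(counts.values())                  # 100

# Double crossovers count toward both intervals
r_AC = (5 + 8 + 3 + 4) / n                # 0.20
r_CB = (5 + 5 + 3 + 4) / n                # 0.17
x_AB = 100 * (r_AC + r_CB)                # 37 cM, using additivity of map distances
print(r_AC, r_CB, x_AB)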

4.3.2 Ordering Loci and Map Construction

We will just discuss a simple case of constructing genetic maps considering only as many as three loci at a time. In this case, the process is simple. The problem we face is ordering the loci appropriately. Note that if we had already known the order, such as in the three-point testcross experiment above, this process would not be necessary. But in many cases we may have genetic map information from various individual experiments and so might not have prior knowledge of the ordering of loci.

First, we assume that our estimates of map distances between loci are accurate and therefore can be added. Let’s say that we have three loci, A, B, and C, and have been given map distances from various experiments as xAB = 10, xBC = 16, and xAC = 6. There are three possible orderings, depending on which locus should be in the middle (we won’t be concerned with whether the overall direction of the ordering is correct). One possibility is ABC, another is ACB, and another is BAC. It is clear that the only one that makes sense, given the additivity of map units, is BAC. We would see that the correct genetic map can be pictured as

B ----10cM---- A --6cM-- C
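When map distances come from many separate experiments, this additivity check is trivial to automate. A tiny sketch in plain Python (the helper d and the orderings tried are just for this illustration):

x = {("A", "B"): 10, ("B", "C"): 16, ("A", "C"): 6}

def d(p, q):
    # Distances are symmetric, so look up the pair in either order
    return x[(p, q)] if (p, q) in x else x[(q, p)]

for left, mid, right in [("A", "B", "C"), ("A", "C", "B"), ("B", "A", "C")]:
    if d(left, mid) + d(mid, right) == d(left, right):
        print("consistent order:", left, mid, right)   # prints B A C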

4.4 DNA Fingerprinting

The use of DNA Fingerprinting for identification of individuals has become very prevalent in recent years. Another term that can be used interchangeably is DNA Profiling. It is used in identifying and convicting criminals, proving suspected criminals to be innocent, in paternity cases to determine the father (or even mother, or both) of a child (e.g., the "Tsunami Baby 81" story), to identify plane crash victims, to identify long-awaited Vietnam war remains (as was recently in the news), to confirm the identity of terrorists, along with many other uses. It has served a very legitimate and useful purpose in society and will continue to do so.

The concept of DNA fingerprinting follows the same concept as for “real” fingerprints. Real fingerprints are used as an identification tool because it has been found that no two people have the same fingerprints (including identical twins), or at least as far as we’ve been able to detect after many years of scientific discovery on the topic.

DNA fingerprints are used for identification for the same reason, with one caveat. No one will claim that no two people have the same DNA fingerprint. There are two reasons for this. One is the fact that it really depends on the detail of the DNA profile that is worked up on a person. If it is not very detailed, then there would in fact be a high likelihood, even a guarantee, that possibly many people have that DNA profile. Relate this to an attempt to identify someone uniquely based on eye color. It clearly can't be done because that trait does not provide enough information by itself. So, DNA profiles can range from very informative to less informative. Clearly, we will want to focus on the very informative types of profiles.

A second reason that it is not claimed that no two people have the same DNA profile is the following. Even if the profile is very informative, we will see that there is some probability (albeit infinitesimally small) that can be attached to the event of two people having that profile.

Finally, it is in fact the case that identical twins have identical DNA, and so would have identical profiles.

4.4.1 Basics of a DNA Profile

We will focus our discussion on DNA fingerprinting for humans since this clearly has the most important sociological implications. We have seen earlier that if you compared the sequence of bases in the DNA of one person to another, you will find about 99% of it to be a perfect match. In other words, most of the 3 billion base pairs that make up the DNA of an individual are exactly the same from one person to the next. Clearly, basing a DNA profile on these areas of DNA would have no use in unique identification.

The interesting areas of DNA, then, are the other 1% of base pairs. Such regions of DNA where differences do exist between individuals are called polymorphisms. This is not news to us. All of our discussions and examples regarding genes and markers in DNA to this point have been predicated on the fact that different people have different DNA sequences in those regions. They would have had no interest to us if this weren't the case.

The types of polymorphisms we will be interested in are not related to genes, but to what we have called markers (non-coding regions of DNA). One example of a type of marker that has been useful in DNA analysis is restriction fragment sites (also called restriction fragment length polymorphisms or RFLPs), which we have already discussed. As we have seen, these are regions of DNA where the differences between the DNA of individuals are detected by observing the different lengths of fragments left over after digesting that region with a restriction enzyme. Other common examples of such polymorphic regions that have been detected in humans are (there are others as well)

• SNPs, or Single Nucleotide Polymorphisms. These are regions of DNA which are the same from one person to the next, except for a potential difference in the base at one particular location. So, for example, for a given region of DNA, it may have been discovered that the following two DNA sequences are the only differences that exist from one person to the next:

GCCATTG
GCCGTTG

The only difference is whether the base in the 4th position of this sequence is an A or a G. This is the most detailed type of polymorphism discovered to date and is a fairly recent finding due to the extreme difficulty of analyzing DNA at such a fine level.

Notice that for a particular SNP marker, there would in theory be four possible different alleles in the population (one for each base that could be in that polymorphic position), although in practice SNP markers have been found to only have two variants as in the example above.

• STRs, or Short Tandem Repeats (also called microsatellite markers). These are regions of DNA that have been found where a certain base sequence repeats itself over and over, but for a different number of times in different individuals. For example, say for a particular STR marker region, the repeat sequence is GCC. It may be that some individuals have 3 repeats of GCC at that location, others have 5 repeats, while others have 33 repeats and other numbers in between. Such loci have been found to be highly polymorphic, meaning that there exist many possible versions (i.e., many different alleles) in the population.

Another variation on STRs is VNTRs (Variable Number of Tandem Repeats). STR is usually used to refer to regions where the repeat sequence is fairly short, say 3 or 4 bases, where a VNTR locus is one where the repeat sequence is longer, maybe 10 to 15 bases. VNTRs were, naturally, discovered before STRs.

Many thousands of such markers have been found to exist across all chromosomes of the human genome. Very detailed genetic and physical maps of their locations have been constructed. Please see this link for a good review of the history of constructing human genetic maps using various markers, and this 1994 article in Science regarding the first very detailed (1cM resolution) genetic map, primarily constructed using STR

One fact about such DNA markers that makes them useful for DNA profiling is that they tend to each have many different alleles present in the population. That is not true of many genes. For example, the ABO blood group gene has three alleles, and disease genes typically have two alleles, as we have discussed. As we have seen above, markers have 4, 5, 10, even 20 or 30 alleles present in the population, and some have been found with hundreds.

How is this useful for DNA profiling? Let's start with a simple case where we are using one STR marker that has 15 alleles (let's call the marker A and its alleles A1 through A15). With 15 alleles present in the population, there are (15 × 16)/2 = 120 possible genotypes in the population, namely A1A1, A1A2, . . . , A15A15. This is a fairly large number of different categories into which a person can be classified. As long as each of those categories is relatively infrequent in the population, this one marker is a good start to creating a useful DNA profile.

The downfall, clearly, is that while 120 is a large number of classifications, it by no means provides a method for uniquely classifying individuals. All things being equal, an average of 50 million or more human beings would fall into each of those categories.

To improve the detail of our DNA profile, we must consider a number of such loci at once. Let’s say that we have six such markers that we will use to develop a DNA profile. Let Table 4.16 describe the notation and information about these markers.

Table 4.16: Details about a sample 6-locus DNA profile

Marker   Number of Alleles   Allele Notation    Number of Genotypes
A        15                  A1, . . . , A15    120
B        20                  B1, . . . , B20    210
C        10                  C1, . . . , C10    55
D        20                  D1, . . . , D20    210
E        15                  E1, . . . , E15    120
F        12                  F1, . . . , F12    78

Now, an example of a profile that a person may have across these six markers is

A10A12B3B8C2C14D5D5E1E10F11F12.

The number of different profiles that exist in the population can be seen to be the product of the values in the last column of the above table. This is the number 2,724,321,600,000 - over 2.7 trillion! With about 6.5 billion people in the world, just six markers provide us a method that can serve to, for all intents and purposes, uniquely identify a person. We will see some other issues that must be raised in the use of such profiles.
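To verify these counts, here is a minimal Python sketch. The allele numbers are taken from Table 4.16; the formula k(k + 1)/2 counts the k homozygotes plus the k(k - 1)/2 heterozygotes.

# Genotype counts per locus and the total number of distinct 6-locus
# profiles, using the allele counts from Table 4.16.
allele_counts = {"A": 15, "B": 20, "C": 10, "D": 20, "E": 15, "F": 12}

total_profiles = 1
for marker, k in allele_counts.items():
    genotypes = k * (k + 1) // 2        # k homozygotes + k(k-1)/2 heterozygotes
    total_profiles *= genotypes
    print(f"Marker {marker}: {k} alleles -> {genotypes} genotypes")

print(f"Distinct 6-locus profiles: {total_profiles:,}")  # 2,724,321,600,000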

4.4.2 DNA Profile Probabilities

We will now discuss the next important step in using a DNA profile, namely, attaching a probability to a particular profile. We will do this with respect to the use of DNA fingerprints in criminal investigations and trials.

Forensic DNA Fingerprints

When a crime has been committed, there are often a number of pieces of evidence left behind or otherwise known about the crime. One type of evidence that interests us in this discussion is biological evidence from which DNA can be extracted. For example, this can be in the form of hair, blood, saliva, skin cells, or semen in the case of a sexual crime. When investigators find such evidence, they can work up a standard DNA profile on it. As an example, let's say that a blood stain has been found at the scene of a crime, and the DNA profile found from that stain is (using the same example of having six markers that we discussed above):

A10A12B3B8C2C14 D5D5E1E10F11F12

Clearly, this could be the victim's blood, so that must first be ruled out. This can easily be done by taking a DNA sample from the known victim (whether dead or alive), profiling it, and comparing it to the above. Let's say that for this particular blood stain, there is no match to the victim. We then know it is the blood of someone else who we presume must have been at the scene of the crime.

Now, let’s say that through other avenues of the criminal investigation, there are two suspects being held with respect to this crime. With their consent, investigators can get a DNA sample from each and create a profile. The two profiles are

SUSPECT 1 → A10A12B3B8C2C14D5D5E1E10F11F12
SUSPECT 2 → A2A6B4B9C3C3D1D8E6E14F7F10

A quick analysis shows us that Suspect 2 is not the one who left the blood stain in question at the scene. His profile does not match, so clearly it was not his blood. Note that this does not necessarily mean that he did not commit the crime or wasn't otherwise involved, but that particular piece of evidence cannot be used against him.

More interesting for our purposes is the profile of Suspect 1. It does match the crime scene stain profile. Does this mean that it was his blood? Our discussion above seems to give some strong indication that it is, but we need to be more quantitative in our analysis. The question that is raised is: "What is the probability that a randomly selected person from the population has this profile?" If that probability is somewhat high, it takes some of the pressure off of this suspect because there are potentially a few or possibly many others who could have left it. If that probability is quite low, it begins to cast strong evidence that it must be that suspect's blood.

Such numerical calculations of strength of evidence are often presented in a courtroom setting by the use of a likelihood ratio (LR) (also called an odds ratio). This ratio is

LR = P(This evidence was left at crime scene | Suspect is guilty) / P(This evidence was left at crime scene | Suspect is innocent)

This ratio can be interpreted as how much more probable the evidence is if the suspect left the crime scene stain than if some random unknown person did (Evett and Weir, 1998). A higher value suggests more evidence of guilt, while a lower value suggests innocence.

The numerator calculates the probability that this DNA evidence would have been found conditional on the suspect being the guilty party. It should be clear that the numerator is, then, exactly one. If the suspect is guilty, then he must have left the stain and so that’s why we got the match.

The denominator calculates the probability that this DNA evidence would have been found if it did not come from the suspect. In other words, in the population at large, what is the probability of this profile?

With these interpretations of the numerator and denominator in mind, we can write this likelihood ratio as

LR = 1 / P(Random person has this DNA profile)

So the calculation comes down to the probability in the denominator, and we will discuss some models for helping us make this calculation.

Datasets

Before discussing calculation methods, we should first understand the available datasets that have been collected from individuals. In particular, we will continue to discuss DNA profiles with relation to crime scene investigations. The FBI has collected blood samples from many individuals of different ethnic backgrounds, profiled them, and placed the information in databases. Early such profiles were based on VNTR loci, but more recently they have been based on STR loci. An example of data collected from early VNTR profiles at six loci is shown in Table 4.17, and sample data for the D1S7 locus in the FBI's Black database is shown in Table 4.18. See Budowle (1991) for details on the loci themselves.

Notice that Table 4.18 is large, and somewhat sparse. In other words, the sample size is somewhat small compared to the number of genotypes. There are many zeroes in the table, and most counts are on the small side.

Simple Multinomial Model

Now consider what the data table would look like if we combined the information from all six loci into one DNA profile. It would be impossible to actually show because it would have trillions of categories (210,673,710,000,000 = 210 trillion+ to be exact in the case of the FBI Black database).

Table 4.17: Summary data for the FBI databases for six VNTR loci.

Locus     Race        Sample Size   No. of Alleles   No. Possible Genotypes   No. Observed Genotypes
D1S7      Caucasian   593           26               351                      217
          Black       359           26               351                      190
          Hispanic    520           24               300                      197
D2S44     Caucasian   790           21               231                      166
          Black       474           24               300                      180
          Hispanic    514           19               190                      144
D4S139    Caucasian   593           17               153                      112
          Black       446           18               171                      131
          Hispanic    521           16               136                      105
D10S28    Caucasian   428           23               276                      180
          Black       287           24               300                      172
          Hispanic    439           21               231                      165
D14S13    Caucasian   750           24               300                      169
          Black       523           25               325                      201
          Hispanic    493           23               276                      198
D17S79    Caucasian   775           13               91                       47
          Black       549           15               120                      78
          Hispanic    521           9                45                       39

And since the sample size is only 359, the count for almost all of the categories would be 0, with 359 1's spread around thinly. There is some extremely small chance of a 2 showing up in such a table.

Despite the size of this table, the model that describes our data is multinomial, as we have come across many times before. In other words, we have a random sample of individuals and are classifying each into one of k categories. It just so happens that in this case k ≈ 210 trillion.

That fact itself doesn’t have to deter us. We know how to estimate probabilities from multinomial models using maximum likelihood. Our model is

(X1, . . . , X210 trillion) ∼ Multinomial(n, P1, . . . , P210 trillion)

and our maximum likelihood estimates of the unknown P's are

Pˆi = Xi / n

Table 4.18: Genotypic counts for the FBI Black database at locus D1S7. The bottom row shows the observed counts for the 26 alleles. The sample size is n = 359.

 1 | 0
 2 | 0 0
 3 | 1 0 1
 4 | 0 0 0 0
 5 | 0 0 1 0 0
 6 | 0 0 0 0 0 0
 7 | 0 0 0 0 0 0 0
 8 | 1 0 0 0 1 0 0 0
 9 | 0 0 0 1 0 0 0 1 0
10 | 0 0 0 0 2 0 0 1 0 0
11 | 0 0 0 1 0 0 1 1 0 1 2
12 | 0 0 0 0 1 1 0 0 1 0 0 1
13 | 2 0 0 0 1 1 1 0 1 1 1 1 2
14 | 1 1 0 0 1 0 0 4 1 1 1 2 0 3
15 | 0 0 0 1 0 3 0 1 2 4 3 2 3 3 2
16 | 0 0 1 1 1 2 2 2 2 1 2 0 1 5 2 4
17 | 0 0 1 0 0 0 0 2 1 0 0 0 1 2 1 1 3
18 | 0 0 0 0 1 1 0 0 1 0 1 0 3 2 1 0 3 0
19 | 0 2 1 0 0 0 0 2 2 2 0 1 1 1 1 2 2 5 2
20 | 0 1 0 0 1 1 1 0 2 3 0 0 2 1 2 3 4 4 1 4
21 | 0 1 1 0 0 3 0 2 0 0 1 1 2 1 1 3 3 2 1 2 6
22 | 0 0 0 0 1 0 1 2 0 2 0 1 2 2 8 1 2 2 0 2 3 2
23 | 0 1 0 1 0 1 0 2 0 1 3 2 2 4 4 6 3 4 2 4 2 3 1
24 | 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 2 1 0 0 0 0 2
25 | 0 0 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 2 3 2 1 1 2 0 0
26 | 0 1 0 0 1 2 1 2 1 1 0 2 1 4 0 6 2 2 1 3 4 6 4 2 0 3
---+--------------------------------------------------------------------
     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

Allele counts: 5 7 8 5 12 15 8 25 16 21 21 18 32 45 47 53 36 36 35 47 46 43 53 12 20 52

However, there is a problem we will have with the usefulness of such estimates. As we saw above, with such datasets Xi will almost always be either one or zero for each category. So some of our genotype estimates will be 1/n (in the cases where there had been an individual in the sample for that profile) and some (most) will be zero.

For the situations where 1/n is the estimate, 1/n itself is not such an unusually small number. For the FBI database above, 1/n = 1/359. This is small, but with regard to a DNA profile and evidence to be used in court, it is not unusually small. It still leaves the chance that millions of people potentially share that profile, which is not very convincing courtroom evidence.

For the situations where zero is the estimate, this is also not useful. Zero is certainly a small number as far as probabilities go (the smallest, in fact), but it suggests that no one has that profile. In fact we know that someone does have that profile, namely our suspect. So the estimate zero is flawed also.

Generally, what is going on is that when we have a multinomial model, but the data is very sparse compared to the number of categories, the maximum likelihood estimators are not necessarily helpful. We just have not collected enough data to make convincing estimates of so many category probabilities.

Independence Within and Between Loci

In order to make better estimates of the probability of a DNA profile, we'll have to make further assumptions about our model and our data. The main assumption that we'll make use of is independence. As we'll see, this will allow us to think of the profile as many individual and independent pieces of evidence instead of just one piece of evidence.

First, let's do a quick review of independence from probability theory. We say that two events (call them A and B) are independent if P(A and B) = P(A) × P(B). In other words, if the probability of both of them happening at the same time equals the product of the probabilities of them happening individually, they are independent events (of course, since this is a definition, the implication works in both directions).

As a simple example related to courtroom evidence, consider the following situation. Let’s say that a crime occurred and a witness made two statements about the perpetrator: The perpetrator was a male and he had blue eyes. These are two pieces of evidence and we can consider them as an evidentiary profile of the criminal. If we have a suspect in hand who is male and has blue eyes, we might now be interested in calculating the probability of such evidence. As with our DNA profile example, this would be a calculation of the probability of this evidence matching any random person in the population.

This evidence is actually a combination of two pieces of evidence: information about the gender of the perpetrator and the eye color of the perpetrator. For this example, we could easily argue that these are actually two independent pieces of evidence. In other words, in the population, the traits of gender and eye color don’t tend to go hand-in-hand with each other. So, our calculation of the probability of such evidence would be

P(Male and Blue Eyes) = P(Male) × P(Blue Eyes)

according to the independence rule.

Notice that if the evidence had been that the individual was male and drives a Harley motorcycle, we wouldn’t have been able to make such a calculation. This is because we wouldn’t consider these to be two independent pieces of evidence. It is well-understood that most people who drive Harleys are male, and so if you know one of these facts, the other is not anything new. So

P(Male and Drives Harley) ≠ P(Male) × P(Drives Harley)

Now, back to our DNA profile evidence. We had wanted to calculate

P(A10A12B3B8C2C14 D5D5E1E10F11F12). and found that the simple estimates from a multinomial model won’t help us that much. However, if it can be stated that the evidence from each of the markers are independent of each other, we can instead calculate

P(A10A12)P(B3B8)P(C2C14)P(D5D5)P(E1E10)P(F11F12).

Further, if we can also say that each marker itself is in Hardy-Weinberg Equilibrium (which is actually just a form of independence), we can further write the above as

2P(A10)P(A12) × 2P(B3)P(B8) × 2P(C2)P(C14) × P(D5)P(D5) × 2P(E1)P(E10) × 2P(F11)P(F12)

Now, since each of these twelve allele probabilities will be somewhat small (typically on the order of 1/10 to 1/20), we can wind up with a very small number. If we substitute 1/15, for example, for each of these probabilities, we would calculate:

P(A10A12B3B8C2C14D5D5E1E10F11F12) = 2^5 × (1/15)^12 ≈ 2.5 × 10^-13

Now, this is evidence that could be useful in court.
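As a check on this arithmetic, a small Python sketch of the same calculation (the 1/15 allele frequency is, as above, purely illustrative):

# Profile probability under HWE within loci and independence between loci.
# The five heterozygous loci each contribute 2*p*q; the one homozygous
# locus (D5D5) contributes p^2. All allele frequencies set to 1/15.
p = 1 / 15

het = 2 * p * p     # e.g., A10A12, B3B8, C2C14, E1E10, F11F12
hom = p * p         # D5D5

profile_prob = het**5 * hom
print(f"P(profile) = {profile_prob:.3e}")  # about 2.5e-13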

We have already discussed tests for HWE. Of course, in these situations, the markers have very many alleles, so the models become much more complex and the calculations can only be done with the help of special computer programs. But otherwise, the theory of such tests is just as discussed before.

Once we have established that each marker individually is in HWE, we can proceed to test for independence between markers. For our purposes, it will suffice to just test for independence between pairs of markers instead of for all subsets of markers.

Let's set up the hypotheses we want to test. First, without loss of generality, we will refer to the two loci under consideration as A and B. Locus A has nA alleles labelled A1, . . . , AnA, and locus B has nB alleles labelled B1, . . . , BnB. The total number of two-locus genotypes possible is nA(nA + 1)nB(nB + 1)/4, and so this is the number of multinomial categories in our full model (represented in the alternative hypothesis). The probability of a genotype will be notated by P with the genotype in the subscript (e.g., PA6A9B3B5).

The null hypothesis states that the probability of any genotype can be written using the concept of HWE within a locus and independence between loci. In other words, we have the hypothesis

H0: PAiAjBkBl =
    4 PAi PAj PBk PBl       if i ≠ j and k ≠ l
    2 PAi PAj (PBk)^2       if i ≠ j and k = l
    2 (PAi)^2 PBk PBl       if i = j and k ≠ l
    (PAi)^2 (PBk)^2         if i = j and k = l

for all combinations of i, j, k, l where j ≥ i and l ≥ k.

Of course, this is a very complex null hypothesis. But, although messy, this just gives us a set of well-defined rules for writing genotype probabilities in terms of allele probabilities. Therefore, just like the test for HWE we saw previously, the null hypothesis can be used as the basis for computing expected counts for a goodness-of-fit test. We do have nA + nB parameters to estimate in the null hypothesis (the allele frequencies for each locus), and this can be done using simple allele-counting MLE's as before. Then, our goodness-of-fit test would be set up as in Table 4.19.

Table 4.19: Setup of Goodness-of-fit Test for Independence Between Loci

Genotype            Observed Count     Expected Count
A1A1B1B1            X1111              n (PˆA1)^2 (PˆB1)^2
A1A1B1B2            X1112              2n (PˆA1)^2 (PˆB1)(PˆB2)
A1A1B1B3            X1113              2n (PˆA1)^2 (PˆB1)(PˆB3)
  .
  .
A6A9B3B5            X6935              4n (PˆA6)(PˆA9)(PˆB3)(PˆB5)
  .
  .
AnA AnA BnB BnB     XnAnAnBnB          n (PˆAnA)^2 (PˆBnB)^2

The table has nA(nA + 1)nB(nB + 1)/4 rows, and so the chi-square test statistic has that many terms added together (of the same form as before: (O - E)^2/E). Clearly, this is meant for computers to do the work.
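To illustrate how a computer would carry this out, here is a rough Python sketch; the allele-frequency estimates pA and pB and the observed-count dictionary obs are hypothetical inputs, not part of any real dataset:

from itertools import combinations_with_replacement

# Expected counts and chi-square statistic for the between-locus
# independence test of Table 4.19. pA and pB are lists of estimated
# allele frequencies; obs maps a genotype (i, j, k, l) with i <= j and
# k <= l to its observed count.
def expected_count(n, pA, pB, i, j, k, l):
    mult = (1 if i == j else 2) * (1 if k == l else 2)  # the 1/2/4 factor
    return n * mult * pA[i] * pA[j] * pB[k] * pB[l]

def chi_square(n, pA, pB, obs):
    stat = 0.0
    for i, j in combinations_with_replacement(range(len(pA)), 2):
        for k, l in combinations_with_replacement(range(len(pB)), 2):
            e = expected_count(n, pA, pB, i, j, k, l)
            o = obs.get((i, j, k, l), 0)
            stat += (o - e) ** 2 / e
    # compare to chi-square with
    # nA(nA+1)nB(nB+1)/4 - 1 - (nA - 1) - (nB - 1) degrees of freedom
    return stat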

The degrees of freedom for the test statistic will be nA(nA + 1)nB(nB + 1)/4 - 1 - (nA - 1) - (nB - 1). The problem that occurs is that with relatively small sample sizes (as we saw in the FBI dataset), the expected counts are not likely to meet our requirements for using the chi-square distribution. Fisher's exact test is the more reliable methodology, although it is computationally difficult.

Chapter 5

Sequence Analysis and Alignment

5.1 Overview

In this chapter, we will discuss methods for analysis of sequences of DNA or amino acids. As we have alluded to with a few examples earlier in the course, sometimes data comes to us as a DNA sequence (i.e., a sequence of A’s, C’s, G’s, and T’s) from a certain organism, or a sequence of amino acids that represents a protein synthesized by an organism.

There are many questions we might have about this kind of data. In fact, it is the enormous amount of sequence data becoming available that could be said to have pushed bioinformatics to the forefront in genomic analysis. In addition to the human genome (which is now complete), there are many DNA sequences that have been typed and recorded across many species and research projects. Much (if not most) of this data is held in a publicly available and searchable database called GenBank (http://www.ncbi.nlm.nih.gov/Genbank/index.html). The front page of the site states the following

“There are approximately 59,750,386,305 bases in 54,584,635 sequence records in the traditional GenBank divisions and 63,183,065,091 bases in 12,465,546 sequence records in the WGS division as of February 2006.”

The amount of available data and the potential for analysis is enormous.

5.2 Single DNA Sequence Analysis

Our first class of analyses will have to do with analyzing a single DNA sequence. Such data can represent a gene or a non-coding region in an organism. For example, here is the DNA sequence from an Immunoglobulin gene in mice downloaded from GenBank, consisting of 372 bases.

1 atcttctttc tggtagcaac agctacaggt gtgcactccc aggtccagct gcagcagtct 61 gggcctgagg tggtgaggcc tggggtctca gtgaagattt cctgcaaggg ttccggctac


121 acattcactg attatgctat gcactgggtg aagcagagtc atgcaaagag tctagagtgg 181 attggagtta ttagtactta caatggtaat acaaactaca accagaagtt taagggcaag 241 gccacaatga ctgtagacaa atcctccagc acagcctata tggaacttgc cagattgaca 301 tctgaggatt ctgccatcta ttactgtgca agatactatg gtaactactt tgactactgg 361 ggccaaggca cc

5.2.1 Composition of Bases

A first, simple question might be whether or not the four bases are equally represented in the DNA sequence. For the DNA sequence above, the relative proportions of the bases are

Base   Count   Percentage
A      101     .271
C       83     .223
G       94     .253
T       94     .253

This suggests that there are more A nucleotides and fewer C nucleotides than the 25% that would have been expected if there were an equal distribution. There are no particularly interesting statistical inferences to be made, since these results and conclusions are specific to the DNA sequence at hand. We are not trying to use these results to make conclusions about other sequences.

Of course, if we consider the above sequence as just a random sequence from the mouse genome, we can use it as a basis for making inferences about all mouse DNA. Maybe, for example, we have a hypothesis that the C nucleotide is not as abundant in mice. We could set up a simple binomial probability model that counts the number of C's we come across in this random sequence. The model would be XC ∼ Binomial(n, PC). PC represents the unknown proportion of nucleotides that are C's in the entire genome. From our previous binomial distribution theory, we know that PˆC = XC/n and an approximate 95% CI for PC is given by

PˆC ± 2√(PˆC(1 - PˆC)/n). For our example, we would compute the following CI:

.223 ± 2√(.223(1 - .223)/372) = .223 ± .043 = (.180, .266)

With this short sequence, we are unable to say that the overall proportion of C's in the mouse genome is not 0.25, since that value falls in our 95% CI.
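The composition counts and the confidence interval can be reproduced with a short Python sketch (the string seq is just the 372-base mouse sequence transcribed from above):

from math import sqrt

# Base composition of the mouse sequence and the approximate 95% CI for
# the genome-wide proportion of C's.
seq = ("atcttctttctggtagcaacagctacaggtgtgcactcccaggtccagctgcagcagtct"
       "gggcctgaggtggtgaggcctggggtctcagtgaagatttcctgcaagggttccggctac"
       "acattcactgattatgctatgcactgggtgaagcagagtcatgcaaagagtctagagtgg"
       "attggagttattagtacttacaatggtaatacaaactacaaccagaagtttaagggcaag"
       "gccacaatgactgtagacaaatcctccagcacagcctatatggaacttgccagattgaca"
       "tctgaggattctgccatctattactgtgcaagatactatggtaactactttgactactgg"
       "ggccaaggcacc")
n = len(seq)  # 372
for base in "acgt":
    print(base, seq.count(base), round(seq.count(base) / n, 3))

p_hat = seq.count("c") / n                 # 83/372 = .223
half = 2 * sqrt(p_hat * (1 - p_hat) / n)   # about .043
print(f"95% CI for P_C: ({p_hat - half:.3f}, {p_hat + half:.3f})")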

A little more interesting from a biological standpoint is data like that reported in Weir (1996) from an external database. The DNA sequences from different regions of two closely linked human fetal globin genes were analyzed with regard to base composition. The different regions were the regions flanking the genes (just to the left is called the 5' flanking region, and just to the right is called the 3' flanking region), introns, exons, and the region between the genes. The data is reproduced here

Region          Length   A      C      G      T
5' Flanking     1000     0.33   0.23   0.22   0.22
3' Flanking     1000     0.29   0.15   0.26   0.30
Introns         1996     0.27   0.17   0.27   0.29
Exons            882     0.24   0.25   0.28   0.22
Between Genes   2487     0.32   0.19   0.18   0.31

One can begin to make statements about the richness of A nucleotides in flanking regions as compared to the synthesized part of the genes themselves (the exons). Since these regions are functionally different (the 5' flanking region, for example, contains promoter regions that assist in the transcription process), we might deduce some molecular hypotheses regarding the usefulness of the A nucleotide in such regions and then study it further biologically.

A model we might use for doing this analysis could be the following. We can focus on the number of A bases in the 5' flanking region and the number of A bases in the exons. We'll assume both to be binomial sampling situations, occurring in what we'll assume to be independent regions of DNA. So our model for these counts is

Xa,5′ ∼ Bin(n1, p1)

and

Xa,exon ∼ Bin(n2, p2)

Our hypothesis is that p1 > p2, or in other words, that there is an overabundance of A bases in 5' flanking regions (promoter regions) as compared to the exons of genes. Specifically, our null and alternative hypotheses are

H0 : p1 = p2 versus Ha : p1 > p2 or written differently

H0: p1 - p2 = 0 versus Ha: p1 - p2 > 0

We can test these hypotheses using the technique from our discussion of generic hypothesis tests where we hypothesize a parameter equals a constant versus greater than that constant (or less than, or not equal to it). In our general discussion of hypothesis tests in Chapter 2, we noted that we can find the MLE for the parameter and use the fact that MLE's have normal distributions for large sample sizes. This is what we'll make use of here.

All we have to do is notice that the parameter that we are testing for is p1 - p2. It just happens to be the difference of two other parameters, but we can still consider it as our parameter of interest. We just need to find the MLE for p1 - p2 and the variance of that MLE. This is not difficult, because we already know how to find MLEs for binomial distributions. We know that

pˆ1 = Xa,5′/n1 and pˆ2 = Xa,exon/n2

are the individual MLE's for p1 and p2, and therefore the MLE for p1 - p2 is just pˆ1 - pˆ2.

We also need to calculate the variance of this estimator. By our rules of variances, and the fact that we already know the variance of binomial MLEs, we can calculate

σ²(pˆ1 - pˆ2) = p1(1 - p1)/n1 + p2(1 - p2)/n2

So, our test statistic is the following

T = (pˆ1 - pˆ2) / √( pˆ1(1 - pˆ1)/n1 + pˆ2(1 - pˆ2)/n2 )

and this statistic has a standard normal distribution if the null hypothesis is true.

We can use this to carry through our test based on the data from the human fetal globin genes above. Our estimates are pˆ1 = .33 and pˆ2 = .24 with n1 = 1000 and n2 = 882. Plugging these into the test statistic formula we derived above, we get T = 4.35. This results in a p-value of 0.0000 from the standard normal distribution, so we reject the null hypothesis and conclude that there does seem to be an overabundance of A nucleotides in the promoter regions as compared to exons.
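A minimal Python version of this two-proportion test, using the same estimates as above:

from math import sqrt, erf

# Two-proportion z-test for the fetal globin data: p1-hat = .33
# (5' flanking, n1 = 1000), p2-hat = .24 (exons, n2 = 882).
p1, n1 = 0.33, 1000
p2, n2 = 0.24, 882

se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
T = (p1 - p2) / se                            # about 4.35
p_value = 1 - 0.5 * (1 + erf(T / sqrt(2)))    # upper tail of standard normal
print(f"T = {T:.2f}, p-value = {p_value:.6f}")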

Another interesting analysis involves the composition of bases in a sequence with regard to restriction fragment sites (RFLPs). For example, the enzyme AfaI recognizes the bases GTAC and breaks the DNA after the T. Consider a randomly selected strand of DNA that is 1000 bp long. Assume that the base present in each position of this randomly selected strand is completely random (i.e., 1/4 chance for each base), and is independent of the base in other positions (i.e., an A at one site does not make the chance of an A at the next site any more or less, for example).

Now suppose we let this enzyme digest this random strand. Let the discrete random variable X equal the number of sites recognized by the enzyme along this strand. Let’s come up with a reasonable approximate probability model for X.

A way to think about the problem is to consider looking at the strand base by base, and for each position notice if it is the start of the GTAC sequence. We can do this for 997 positions, since starting with position 998 we will not have four following bases to look at. Since we are counting up the number of events that happen over a fixed amount of space (in this case 997 DNA bases), we can think about using the Poisson distribution as a reasonable model. So let's say X ∼ Poisson(λ). Now we need to determine a reasonable value for λ. A simple way to think about it is that at each position, we can consider there to be a 1/256 chance of seeing the bases GTAC in order starting at that position (since each has a 1/4 probability in a random DNA sequence). Since there are 997 positions we will look at, the expected number of GTAC sequences over the entire length is 997/256. We can use this as our Poisson distribution parameter, giving us X ∼ Poisson(997/256 = 3.8945) as an approximate model. Under this model, we would calculate the probability of having no AfaI recognition sites as

P(X = 0) = e^(-3.8945) (3.8945)^0 / 0! = .0204

In other words, in a random DNA sequence, it is very likely to have at least one GTAC subsequence.

Also, by pretty much the same argument, we can consider a Binomial model such as X ∼ Binomial(997, 1/256), which is approximately the same as the above Poisson model with such a large sample size. This would give

P(X = 0) = (997 choose 0) (1/256)^0 (255/256)^997 = .0202

By the way, one reason for these being approximate models is that we really aren't quite analyzing 997 independent positions. For example, if we find the sequence GTAC starting at position 36, say, then we know it cannot also start at positions 37, 38, or 39 since it would be certain that a G would not be in those spots. So, the more GTAC's we find, the fewer actual positions we are analyzing. But this difference has only a minor effect on this model.
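Both approximations are easy to check numerically; a quick Python sketch:

from math import exp, comb

# P(no AfaI sites) in a 1000-bp random sequence under both models.
positions = 997
p_site = 1 / 256

lam = positions * p_site                               # 3.8945
print(exp(-lam))                                       # Poisson:  about .0204
print(comb(positions, 0) * (255 / 256) ** positions)   # Binomial: about .0202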

5.2.2 Independence of Consecutive Bases

Another interesting biomolecular question has to do with neighboring bases in a DNA sequence. We have come across problems before where we make the assumption that the base present in one position is independent of that in the previous position. But is this actually true in real DNA sequences? We can answer this question by setting up an appropriate probability model.

Consider a DNA sequence of n bases. Now, let’s look at the sequence two bases at a time. For example, in the above mouse DNA sequence of 372 bases, we will consider our data as the 371 pairs we come across. The start of that sequence (first 10 bases) was

atcttctttc

We will consider the data from this part of the sequence as consisting of the nine pairs

at tc ct tt tc ct tt tt tc

Considering these pairs as data, we can again use a multinomial probability model. We have 16 categories (each of the possible ordered pairs of bases), each with a certain probability of occurring under our model. That is:

(Xaa, Xac, Xag, . . . , Xtt) ∼ Multinomial(n - 1, Paa, Pac, Pag, . . . , Ptt)

The null hypothesis we might set up would be that these probabilities of ordered pairs are actually just the products of the relevant individual base probabilities. In other words, we are hypothesizing

H0: Pij = Pi Pj for i, j ∈ {a, c, g, t}

This states a hypothesis of independence. That is, if consecutive bases really are independent of one another, then the probability of a certain pair should just be the product of the individual probabilities (which is the basic definition of independence).

So, our probability model under the null hypothesis is

(Xaa, Xac, Xag, . . . , Xtt) ∼ Multinomial(n - 1, PaPa, PaPc, PaPg, . . . , PtPt)

We can conduct this as a chi-square goodness-of-fit test with 16 categories. There are three unknown parameters (the individual base frequencies Pa, Pc, Pg, Pt, which must sum to one, so only three are free), which need to be estimated using the standard sample proportion MLE estimates. For example, there are 101 A bases in the sequence, leading to the estimate Pˆa = 101/372 = .2715, and so the expected count under H0 for the aa category is 371 × (.2715)^2 = 27.3. The full goodness-of-fit table looks like

Pair   Observed Count   Expected Count
aa     21               27.3
ac     26               22.5
ag     32               25.5
at     22               25.5
ca     34               22.5
cc     16               18.5
cg      1               20.9
ct     31               20.9
ga     21               25.5
gc     23               20.9
gg     29               23.7
gt     21               23.7
ta     24               25.5
tc     18               20.9
tg     32               23.7
tt     20               23.7

The chi-square test statistic is T = 40.65 with 12 (16 - 1 - 3) degrees of freedom. The resulting p-value is 0.0000, leading us to reject the null hypothesis. This suggests we have strong evidence that consecutive bases are not independent in mouse genes. In particular, notice the major lack of CG pairs (only 1 observed) in this data. This has been noticed in other genes as well (see Weir, 1996) and a biochemical reason has been suggested which we'll come across again later in the course.
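A sketch of how this test could be computed in Python, written as a function (the name dinucleotide_chi_square is ours) so it can be applied to the mouse sequence from the composition sketch above, or to any lowercase DNA string:

from collections import Counter

# Chi-square test of independence of consecutive bases.
def dinucleotide_chi_square(seq):
    n = len(seq)
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))  # n-1 overlapping pairs
    freq = {b: seq.count(b) / n for b in "acgt"}         # MLE base frequencies
    stat = 0.0
    for b1 in "acgt":
        for b2 in "acgt":
            expected = (n - 1) * freq[b1] * freq[b2]
            stat += (pairs[b1 + b2] - expected) ** 2 / expected
    return stat  # compare to chi-square with 16 - 1 - 3 = 12 df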

This analysis can be extended to considering three or more bases at a time, with a substantial increase in the number of categories that need to be considered (e.g., for testing independence across three bases, we would have 4^3 = 64 multinomial categories in the model).

5.3 Comparing and Aligning Two Sequences

In this section, we will discuss methods for comparing two sequences to each other in order to make statements about how similar or dissimilar they are. In addition, for sequences that have some similarity, we will attempt to align them in order to analyze their subregions that are most similar. The sequences we will analyze will be either DNA sequences of bases, or protein sequences of amino acids.

As we have discussed, one of the main reasons a researcher would want to make such comparisons is the following. If he or she is attempting to better understand the function of a particular protein (produced from a gene) in a certain species, it would be helpful to know if a similar protein has already been studied, whether in that species or another. If the function and molecular structure of the protein is already well understood, it will help the researcher better understand what he or she is working with.

As a concrete example, suppose a human disease researcher has found a gene (and related protein) that is believed to be related to a certain disorder. Helpful research would focus on better understanding the function (or dysfunction) of that protein in the body. It may turn out that a protein with a very similar molecular makeup has been found in the mouse genome and studied extensively. Maybe it had been found that this mouse gene/protein tends to act as an inhibitor, perhaps preventing certain antigens from attaching to blood cells. Now, although the protein that the researcher found in humans does not have exactly the same makeup, this is a starting point for better understanding of its function in the body, and ultimately how it is related to this human disorder. Although humans and mice may seem very different, there are in fact many highly conserved biological functions that are the same in both, and are carried out by very similar proteins.

5.3.1 Evolutionary Background

Some knowledge of basic genetic evolutionary forces is necessary to better understand these procedures and their importance. We will discuss these more fully in lecture. The genetic code we saw in Chapter 1 is also repeated here since we will refer to it often.

5.3.2 Sample Sequences

Let’s give two quick examples of real sequences that we might want to compare and/or align. First, we might have the following two DNA sequences

accctgacgaacttcgaaagct

accgtgacgaaattcgaaaaagct

Upon a quick inspection, we would notice that there seems to be some close similarity between these sequences. They are not exactly the same, and in fact one is longer than the other, but nonetheless, they are quite similar. The differences we notice are that there seem to be a couple of base substitutions (changes) from one to the other, and what looks like an addition of bases in the bottom sequence. More specifically, there seem to be two extra A's toward the end of the bottom sequence that keep the sequences from aligning even more perfectly.

We will also compare protein sequences. This is probably the more typical sequence comparison situation. Recall that proteins are sequences of amino acids. When we write down a protein sequence, we use a standard one-letter (or three-letter) code for each amino acid. The one and three letter codes are shown in the following table. CHAPTER 5. SEQUENCE ANALYSIS AND ALIGNMENT 158

Amino Acid       Three-letter   One-letter
Alanine          Ala            A
Arginine         Arg            R
Asparagine       Asn            N
Aspartic acid    Asp            D
Cysteine         Cys            C
Glutamine        Gln            Q
Glutamic acid    Glu            E
Glycine          Gly            G
Histidine        His            H
Isoleucine       Ile            I
Leucine          Leu            L
Lysine           Lys            K
Methionine       Met            M
Phenylalanine    Phe            F
Proline          Pro            P
Serine           Ser            S
Threonine        Thr            T
Tryptophan       Trp            W
Tyrosine         Tyr            Y
Valine           Val            V

A sample amino acid sequence from a real human gene is below (GenBank reference). This is only the first 60 of the nearly 600 amino acids in this protein.

mapgqlalfs vsdktglvef arnltalgln lvasggtaka lrdaglavrd vseltgfpem

5.3.3 Statistical Significance of Matching Sequences

First, we might be interested in an understanding of how significant it is to see matches in a sequence. In other words, just because we notice some matching bases in a DNA sequence, or even a lot of matching bases, doesn't necessarily mean that it is a significant event. We expect some matching, and possibly some long runs of matches, even for two completely random and independent sequences.

Let's consider two DNA sequences of lengths n1 and n2. Notice that we allow for them to possibly be of different lengths. Also, let the frequencies of the four bases A, C, G, T in the first sequence be PA1, PC1, PG1, and PT1, and the frequencies in the second sequence be PA2, PC2, PG2, and PT2.

Say that we are analyzing the two sequences for a run of exact base matches of length r. Notice that there are (n1 - r + 1)(n2 - r + 1) combinations of positions at which we can begin this search. For any particular such starting position, we might want to calculate the probability that a run of length r begins at that point.

The probability that the next r bases match, but the (r + 1)st doesn't, follows a geometric probability model. The probability of a match at each position is

P = PA1PA2 + PC1PC2 + PG1PG2 + PT1PT2

In the geometric model, we count the number of trials (i.e., bases) until we get our first success (i.e., non-match). So, if we let

X = base at which the first non-match occurs from a given starting position,

we have that X ∼ Geometric(1 - P), and the probability that we will see r consecutive matches is P(X = r + 1) = P^r (1 - P) from the standard geometric pmf. Again, keep in mind that a "success" in this model is a non-match, which is why our success probability is 1 - P and not P.

For sequences where the bases are equally likely, the formula for P above reduces to 0.25, and we can calculate the probability of, say, 3 consecutive matches from a particular starting position as P(X = 4) = (.25)^3 (.75) = .012. So, it is not likely, but if we analyze many starting positions, we are likely to find three consecutive matches somewhere along the way even for random sequences.
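A quick Python check of this geometric calculation:

# Probability of exactly r consecutive matches from a fixed starting
# position; a "success" in the geometric model is a non-match.
def prob_run(r, P):
    return P**r * (1 - P)

# Equal base frequencies in both sequences give P = 4 * (1/4)**2 = 0.25:
print(prob_run(3, 0.25))  # (.25)^3 * (.75) = .0117, about .012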

A little more interesting would be analysis of the random variable L = length of the longest match found over all such possible combinations of starting positions. We can use this random variable to determine how significant it is that we notice a certain long run of matches if we align the sequences in a certain way. This random variable is not easy to define or write down a probability distribution for based on what we have learned in this course. But others have done this, and it turns out that we can write down the mean and variance of L as (Arratia and Waterman, 1985)

μL = 2 ln(n) / ln(1/P)

σ²L = 1.645/(ln P)^2 + 1/12

We can take the n in the formula for μL to be the average of the two lengths, but should only use this if the two sequences are about the same length. Note that these are the mean and variance of L under the null hypothesis that the sequences are not related and are just completely random.

From the above, we can calculate, for example (see Weir, 1996), that if the sequences are both of length 100 and the four bases are all considered equally likely in both, then μL = 6.64. This says that if we evaluate all possible ways to align the sequences and over all of these take note of the longest run of matches we find, on average we would expect to find that longest run to be of length 6.64 bases.

The standard deviation of L for this example would be

σL = √( 1.645/(ln .25)^2 + 1/12 ) = 0.9692

We can use this to conduct a hypothesis test based on the longest match we have noticed in a pair of sequences we are analyzing. Our hypotheses would be

H0: The sequences are random, unrelated sequences

Ha: The sequences are related in some way

And we can use the general hypothesis testing theory from the end of Chapter 2 to say that the test statistic:

T = (L - μL) / σL

has a standard normal distribution if the null hypothesis is true.

To follow through with the above calculations, say we align the hypothesized sequences and notice a longest match of L = 11. We could calculate the test statistic as

T = (L - μL) / σL = (11 - 6.64) / 0.9692 = 4.50

which has a standard normal distribution under the null hypothesis. This gives a p-value of 0.0000, meaning that this was a significant finding, and these two sequences are likely not just random sequences and so are probably related.
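The whole calculation, from the Arratia-Waterman formulas to the p-value, in a short Python sketch (n = 100 and P = .25 as in the example; L = 11 is the observed longest match):

from math import log, sqrt, erf

# Longest-match significance test using the Arratia-Waterman moments.
n, P = 100, 0.25

mu_L = 2 * log(n) / log(1 / P)                  # 6.64
sigma_L = sqrt(1.645 / log(P) ** 2 + 1 / 12)    # 0.9692

L = 11                                          # observed longest match
T = (L - mu_L) / sigma_L                        # about 4.50
p_value = 1 - 0.5 * (1 + erf(T / sqrt(2)))      # upper tail of standard normal
print(f"mu = {mu_L:.2f}, sigma = {sigma_L:.4f}, T = {T:.2f}, p = {p_value:.6f}")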

5.3.4 Aligning Sequences

Our next major topic of discussion will be algorithms for aligning two sequences. The general idea is that if we are given two sequences, whether DNA or protein, we would like to consider all possible ways of aligning the residues of one to another. The alignment that provides the best overall amount of matching between the two will be the alignment we consider the most likely description of the evolutionary relationship between the two sequences.

To take into account the possibility of insertions and deletions of residues that may have occurred during the evolution of one sequence to the other, we must also allow for gaps in our alignments. In other words, sometimes a residue from one sequence is best lined up with a gap in the other in order to keep the alignment between other residues on track.

In rare cases, the best alignment will be obvious. For example, if two protein sequences are

HGKKVAAL

and

HGKKVLAF

then the best alignment is pretty clearly

HGKKVAAL
||||| |
HGKKVLAF

There are two amino acids that don’t match up in this alignment, but we can see that any other shifting in one direction or the other of the alignment will totally throw off the main matches.

Based on this alignment, our conclusion about what may have happened through evolution is that the first A amino acid in the top sequence was mutated to an L and the L in the top mutated to an F somewhere along the way (or vice versa...evolution may have worked from the second sequence toward the first). Perhaps these substitutions occurred through the mutation of a base in the underlying DNA, or maybe a series of mutations of bases.

Some other examples are not necessarily as obvious, and may require us to include gaps in our alignment to best match up the sequences. For example, consider the DNA sequences (which are of different lengths)

AATCTGGCT

and

AAGTCTAGGGT

If we look closely, we might notice the following alignment seems reasonable.

AA-TCT-GGCT
|| ||| || |
AAGTCTAGGGT

Gaps have been inserted into the top sequence to help better align the two. The gaps are denoted by dashes. With the gaps, we can now see that the two sequences are actually quite similar and match up nicely except in three spots - two of which are explained by gaps, and one mismatch (the C with the G towards the end). From an evolution standpoint, if we consider the top sequence the older sequence, we might surmise that two insertion mutations occurred in the past, one that inserted a G base after the first two A’s, and the other that inserted an A base between the T and G. Also, apparently a substitution mutation from C to G occurred toward the end of the sequence.

The above examples show that in some simple cases where we have short sequences that have some obvious similarities, the alignment process is not difficult and can be done by inspection. However, for more realistic situations, we need an algorithm to help us do the alignment.

There are various algorithms for aligning sequences, a couple of which we will talk about. However, they are all based on the same basic concept. The idea is that we can consider all the possible ways of aligning the two sequences. For each of these possible alignments, we can assign a score to it that represents how likely it is that the alignment accurately reflects the process of evolution from one sequence to the next. An aligned pair of residues will receive a positive score if they match, and a lower score, or a negative score, if they don't match or if a gap is used at that point.

For example, take the two DNA sequences from above. The alignment we proposed above would certainly be a high-scoring alignment since there are mostly matches, along with one mismatch and two inserted gaps. Another possible alignment that we could have considered is the following

AATCT---GGCT
 |  |   || |
-AAGTCTAGGGT

While this alignment is not totally off-the-wall (it still has five matches in it), it is clearly not as reasonable as the one above. The problem is that this alignment requires too many events to have happened through evolution to explain it. We would have had four insertion/deletion mutations and three substitution mutations. These events are relatively rare, and so alignments that in general minimize the number of such events are preferred, and better reflect the reality of evolution. So, of these two alignments, the first would score high and the second would score low, leading us to choose the first.

Now, of course, the difficulty is that considering and scoring all possible ways of aligning two sequences, including the possibility of gaps, is an enormous task once the sequences get just mildly long. If we have two sequences both of about length n, then the total number of possible alignments that we would need to consider (including possible gaps) is on the order of (Durbin, et al. 1998)

2^(2n) / √n

which quickly becomes very large and computationally infeasible even with computers. The algorithms we will use are dynamic programming algorithms that essentially search through this entire space without explicitly scoring every alignment, and they are guaranteed to find the highest scoring alignment. They are still computationally expensive for very long sequences, however, and we will later mention other methods, based on heuristics, which are more computationally feasible even for long sequences.

We will discuss the algorithms themselves in a bit, but the process of scoring can be discussed first since they are all based on the same concepts. It is a little simpler to discuss DNA alignment scoring systems, so let’s focus on that first.

Typically, a simple scoring system is used. For example, we might score matching bases as +2, a mismatch as -1, and give a gap penalty of 2 (meaning a score of -2 for a base aligned with a gap). Or we may give a score of 0 for matches, -1 for mismatches, with a gap penalty of 3. Different scoring systems can be used, but all are generally on the same scale. In other words, scoring 15 for a match and -15 for a mismatch would not be a typical scoring system for DNA alignment.

Notice that for the first scoring system mentioned above, our first alignment of the two DNA sequences would receive a score of 11, and the second alignment would receive a score of -1.
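These scores are easy to verify with a small Python helper (the '-' character marks a gap; the match/mismatch/gap values are the first scoring system mentioned above):

# Score a given alignment: aligned sequences are equal-length strings
# with '-' marking gaps.
def score_alignment(top, bottom, match=2, mismatch=-1, gap=-2):
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

print(score_alignment("AA-TCT-GGCT", "AAGTCTAGGGT"))    # 11
print(score_alignment("AATCT---GGCT", "-AAGTCTAGGGT"))  # -1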

The most popular algorithm for doing the scoring and performing the alignment is called the Needleman-Wunsch algorithm from their 1970 paper (Needleman and Wunsch, 1970), which was extended and made more efficient by Gotoh (1982). We will discuss this algorithm more fully in lecture.

Example: Align the DNA sequences aagtc and aatc using the Needleman-Wunsch algorithm. Score a match as +2, a mismatch as -1, and a gap as -2. Details will be discussed in lecture. Below is the resulting matrix and final alignment.

        A    A    G    T    C
   0   -2   -4   -6   -8  -10
A -2    2    0   -2   -4   -6
A -4    0    4    2    0   -2
T -6   -2    2    3    4    2
C -8   -4    0    1    2    6

The traceback gives us the alignment

aagtc
aa-tc

with a score of 6.
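For concreteness, here is a compact Python sketch of the Needleman-Wunsch recursion and traceback under a simple linear gap penalty; it reproduces the example above. This is a minimal illustration, not the full algorithm as presented in lecture (ties in the traceback are broken arbitrarily).

# Needleman-Wunsch global alignment with linear gap penalty.
def needleman_wunsch(x, y, match=2, mismatch=-1, gap=-2):
    nx, ny = len(x), len(y)
    # F[i][j] = best score aligning x[:j] with y[:i]
    F = [[0] * (nx + 1) for _ in range(ny + 1)]
    for j in range(1, nx + 1):
        F[0][j] = j * gap
    for i in range(1, ny + 1):
        F[i][0] = i * gap
    for i in range(1, ny + 1):
        for j in range(1, nx + 1):
            s = match if x[j - 1] == y[i - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Traceback from the bottom-right corner.
    top, bottom, i, j = [], [], ny, nx
    while i > 0 or j > 0:
        s = match if i and j and x[j - 1] == y[i - 1] else mismatch
        if i and j and F[i][j] == F[i - 1][j - 1] + s:
            top.append(x[j - 1]); bottom.append(y[i - 1]); i -= 1; j -= 1
        elif j and F[i][j] == F[i][j - 1] + gap:
            top.append(x[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(y[i - 1]); i -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), F[ny][nx]

print(needleman_wunsch("aagtc", "aatc"))  # ('aagtc', 'aa-tc', 6)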

A dynamic programming algorithm for local alignment of two sequences is called the Smith-Waterman algorithm. It is a simple extension of the Needleman-Wunsch algorithm and will be discussed more fully in lecture.

Example: Align the DNA sequences ggagtg and acagcta using the Smith-Waterman algorithm to perform a local alignment. Score a match as +2, a mismatch as -1, and a gap as -2. Details will be discussed in lecture. Below is the resulting matrix and final alignment.

       G   G   A   G   T   G
   0   0   0   0   0   0   0
A  0   0   0   2   0   0   0
C  0   0   0   0   1   0   0
A  0   0   0   2   0   0   0
G  0   2   2   0   4   2   2
C  0   0   1   1   2   3   1
T  0   0   0   0   0   4   2
A  0   0   0   2   0   0   3

The traceback gives us the local alignment

agct
ag-t

with a score of 4.

Example: Perform a global alignment of the protein sequences GCCDW and RLGCDF using the Needleman-Wunsch algorithm. Use the BLOSUM50 scoring matrix with a gap penalty of 8. Details will be discussed in lecture. Below is the resulting matrix and final alignment.

Figure 5.1: BLOSUM50 Scoring Matrix for Protein Alignment

         G     C     C     D     W
    0    -8   -16   -24   -32   -40
R   -8   -3   -11   -19   -26   -34
L  -16  -11    -5   -13   -21   -28
G  -24   -8   -13    -8   -14   -22
C  -32  -16     5     0    -8   -16
D  -40  -24    -3     1     8     0
F  -48  -32   -11    -5     0     9

The traceback gives us the alignment

RLGCDF
GC-CDW

with a score of 9. Notice that another alignment that comes to mind that seems to make sense is

RLGC-DF
--GCCDW

This actually aligns more exact matches (3 instead of 2). But the score is only -8 - 8 + 8 + 13 - 8 + 8 + 1 = 6, which is why our dynamic programming algorithm didn't find it. The three gaps that need to be introduced are too much for the three matches to overcome. This is actually a good example of where a local alignment algorithm might work better because the initial gaps introduced at the beginning of the second sequence would not get penalized (the algorithm would start over from a score of zero when it aligns the G's). Try this using local alignment on your own.

Sometimes, we might want to use a more complex gap penalty. So far, we have used the same penalty for each gap introduced, even if there are many consecutive gaps. But it is often assumed that consecutive gaps are not independent of each other, and may represent one insertion or deletion evolutionary event. So, we might assign a smaller penalty for gaps in a series after the first gap. For example, we might use -8 as the penalty for the first in a gap series, but only -2 for the rest in that series. The -8 is often called the gap-open penalty (the penalty applied for opening, or starting, a gap), and the -2 is called the gap-extension penalty. Applying this to the example just above would lead us to calculating the following score:

RLGC-DF
--GCCDW

SCORE: (-8) + (-2) + (8) + (13) + (-8) + (8) + (1)

This would give us a final score of 12, which would beat out the previous winning alignment. Less total penalty was assigned to the unmatched RL residues hanging off the left end of the top sequence.

Another common standard used in practice is to set the gap-open penalty to -12 and the gap-extension penalty to -2. Or -8 and -1 is sometimes used.

Actually performing the alignment with gap-extension penalties using a dynamic programming algorithm is similar to the Needleman-Wunsch-type algorithm. But, now we have a little more to keep track of throughout the process, which will be shown through example.

Example: Perform a global alignment of the protein sequences ADHK and ADCGH using the Needleman-Wunsch-type algorithm. Use the BLOSUM50 scoring matrix with a gap-open penalty of -8 and a gap-extension penalty of -2. Details will be discussed in lecture. Below is the resulting matrix and final alignment (sorry for the hand drawing, but there are too many arrows to deal with in LaTeX).

NOTE: The arrow from the circled 2 in the lower right corner should go back to the circled 3 in the cell diagonally left and above it (not to the circled 2 in that cell).

5.3.5 Heuristic Algorithms: BLAST and FASTA

We have mentioned that dynamic programming algorithms like Needleman-Wunsch and Smith-Waterman are optimal, in the sense that they are guaranteed to find the highest scoring alignment. However, this property has a drawback, namely, speed. In a typical problem, one has access to a single sequence or a set of sequences, and wants to find alignments of these sequences to all other known sequences (for example, all sequences stored in GenBank) in order to find the one with which it aligns best. This is a big problem, as there are over 2.5 million protein sequences, and many more DNA sequences, currently stored.

Faster algorithms, generally referred to as heuristic algorithms, have been developed that address the problem of computational time. The most widely used of these are BLAST (Basic Local Alignment Search Tool) and FASTA. They can be up to 50 times faster than dynamic programming algorithms (Mount, 2004). As always there is a tradeoff: these algorithms are not guaranteed to find the highest scoring alignment. However, studies have shown that they are indeed quite accurate, and typically will find the best, or near-best, alignment.

We will not discuss the details of how these algorithms work, but just give a quick overview. They are built on computer science string matching theory. The idea behind the algorithms is that the best alignment between sequences is likely to have at least some small region of perfectly matching residues in it. They make use of this idea by first searching for these small regions of perfect matching, and then focusing on just those regions (call them seed regions). After finding such a seed region, the search extends outward from there, essentially looking for a high-scoring local alignment in the neighborhood of that seed. After analyzing all such seed regions, the best alignments can be ranked and reported.

You can probably see why this does not guarantee finding the overall best local alignment. For example, it may be that the best alignment is one that contains a number of close substitutions and some gaps, and the regions of direct matches may never reach significant lengths. In any case, the algorithms work quite well and are capable of searching large databases very quickly.

The main server for BLAST can be found at http://www.ncbi.nlm.nih.gov/blast/ (click “Getting started” on the left for a better overview of BLAST and the many options available). This server receives tens of thousands of search requests every day (Mount, 2004), and is a major hub of bioinformatics activity. The main server for FASTA is at http://fasta.bioch.virginia.edu/. You should become familiar with the options available for running BLAST searches, and try a few of your own.

5.3.6 Probabilistic Background for these Scoring Systems

So far, we have discussed common algorithms and calculations for aligning two sequences, but not the underlying probabilistic reasoning behind the computations. It is helpful to understand this to better grasp how and why these algorithms work.

The probability calculation that is at the heart of these algorithms is the likelihood ratio, or odds ratio. We had come across this earlier when discussing DNA profiles and their use in courtroom cases as evidence against a suspect.

In general, a likelihood ratio takes the following form:

LR = P(Observed Data | Alternative hypothesis) / P(Observed Data | Null hypothesis)

If this ratio is large, it is a sign that the alternative hypothesis is more likely, and if it is small, it is a sign that the null hypothesis is more likely.

In our sequence alignment problem, the null hypothesis is that the two sequences are two random, unrelated sequences. The alternative is that the two sequences are related in some way. What we realize is that even if two sequences are unrelated, they could potentially match up well in certain regions just by pure chance. We want to be sure that the amount of matching and alignment we found is due to more than just pure chance.

Also, in the sequence alignment problem, the “observed data” referred to in the likelihood ratio calculation are the two sequences and their alignment. In the discussion that follows, we will simplify things and assume that the two sequences are of the same length and we will not introduce any gaps in the alignment.

Now, let's be more specific with the numerator and denominator terms in the LR with regard to the sequence alignment problem. First, we need some notation. Let our two sequences be represented by S and T, where each is a sequence of letters (either {A, C, G, T} if they are DNA sequences, or the twenty amino acid codes if they are proteins). We'll refer to the individual residues in S as (s1, . . . , sn) and the residues in T as (t1, . . . , tn), where n is the length of the sequences. So for example, if S is the DNA sequence AATGC, then s1 = A, s2 = A, s3 = T, s4 = G, and s5 = C, with n = 5.

First, let’s look at the denominator of the likelihood ratio. Under the hypothesis that the two sequences are just two random, unrelated sequences, we can calculate the denominator probability as follows (we’ll leave off the piece after the conditional symbol for brevity):

$$P(\text{Observed sequences} \mid \text{Random, unrelated}) = P\begin{pmatrix} s_1 & s_2 & \cdots & s_n \\ t_1 & t_2 & \cdots & t_n \end{pmatrix}$$

The symbols inside of the probability statement on the right represent the alignment of residues across the two sequences (i.e., $s_1$ is aligned with $t_1$, and so on).

We will also assume that the bases found in different positions are independent of one another, which is not too far from true for most DNA and protein sequences. With this assumption, we can further write the above probability as

$$P\begin{pmatrix} s_1 & s_2 & \cdots & s_n \\ t_1 & t_2 & \cdots & t_n \end{pmatrix} = P\begin{pmatrix} s_1 \\ t_1 \end{pmatrix} P\begin{pmatrix} s_2 \\ t_2 \end{pmatrix} \cdots P\begin{pmatrix} s_n \\ t_n \end{pmatrix}$$

Now, under the hypothesis that these are random and unrelated sequences, the probability that $s_i$ and $t_i$ will be in alignment is just the product of their individual probabilities, $P_{s_i}$ and $P_{t_i}$. For example, if I use a computer to randomly generate the two DNA sequences CTCA and ATAG, the probability that we have a C in position 1 of the first sequence and an A in position 1 of the second sequence is just $(P_{C_1})(P_{A_2})$, where $P_{C_1}$ is the probability in my simulation that I set for randomly selecting a C in the first sequence, and $P_{A_2}$ is the analogous probability for the second sequence.

Applying this to the previous equation now gives us

$$P\begin{pmatrix} s_1 \\ t_1 \end{pmatrix} P\begin{pmatrix} s_2 \\ t_2 \end{pmatrix} \cdots P\begin{pmatrix} s_n \\ t_n \end{pmatrix} = (P_{s_1} P_{t_1}) \times (P_{s_2} P_{t_2}) \times \cdots \times (P_{s_n} P_{t_n}) = \prod_{i=1}^{n} P_{s_i} P_{t_i} \qquad (5.1)$$

As a quick example, let's say those two sequences above (CTCA and ATAG) were generated by a computer program using the following rules: bases A, C, G, T were chosen for the first sequence with probabilities .2, .2, .3, .3 respectively, and were chosen for the second sequence with probabilities .25, .25, .25, .25 respectively. Then, using our above calculations:

$$P(\text{Observed sequences} \mid \text{Random, unrelated}) = (.2)(.25) \times (.3)(.25) \times (.2)(.25) \times (.2)(.25) = 0.000009375$$

Now, let’s look at the numerator of the likelihood ratio. Recall that in that case, the underlying hypothesis is that the sequences are related. So, we’d expect the probability of various pairs of bases or amino acids lining up to be more or less than the probability calculated based on random chance. For example, if two protein sequences have a fairly recent common evolutionary origin, and the original sequence was KLVYC, then we’d expect there to be a fairly large probability that the two evolved sequences would both have K’s in the first position, both L’s in the second position, and so on. This would certainly be more likely than the two sequences just being random selections of the twenty amino acids.

So, to calculate the numerator of the LR, the main difference is that we can't make the simplification that we made in (5.1) for the denominator probability. But otherwise, the development is the same, including the assumption of independence between residues. The calculation for the numerator is

$$P(\text{Observed sequences} \mid \text{Related}) = P\begin{pmatrix} s_1 & s_2 & \cdots & s_n \\ t_1 & t_2 & \cdots & t_n \end{pmatrix} = P_{s_1 t_1} P_{s_2 t_2} \cdots P_{s_n t_n} = \prod_{i=1}^{n} P_{s_i t_i}$$

Now we can put together the pieces of the LR. We have

$$LR = \frac{\prod_{i=1}^{n} P_{s_i t_i}}{\prod_{i=1}^{n} P_{s_i} P_{t_i}} = \prod_{i=1}^{n} \left( \frac{P_{s_i t_i}}{P_{s_i} P_{t_i}} \right)$$

It is common to take the log of the likelihood ratio in order to simplify calculations. Notice that this doesn't change any implications (since the log function is always increasing), but it can make computations easier, even on a computer, because summing log-probabilities is faster and more numerically stable than multiplying many small probabilities together. Taking the log, we get

$$\log LR = \sum_{i=1}^{n} \log \left( \frac{P_{s_i t_i}}{P_{s_i} P_{t_i}} \right)$$

We call this the log-likelihood ratio, or the log-odds. Notice that it is a sum over all positions in the sequences of the log of the ratio of two probabilities. This is what the scoring algorithms are based on. Typically, the base used for the log is 2, as this gives an answer that can be interpreted in bits.
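To see the computation concretely, here is a minimal Python sketch (our own illustration, not part of BLAST or FASTA); the two-letter alphabet and all probabilities are made-up stand-ins for a real substitution model:

    import math

    def log_odds_score(s, t, p_pair, p_single, base=2):
        # Sum of per-position log-odds: log( P(s_i,t_i) / (P(s_i) P(t_i)) ).
        # With base=2 the result is in bits.
        assert len(s) == len(t)
        return sum(math.log(p_pair[(a, b)] / (p_single[a] * p_single[b]), base)
                   for a, b in zip(s, t))

    # Illustrative (made-up) probabilities over a two-letter alphabet:
    p_single = {"A": 0.5, "C": 0.5}
    p_pair = {("A", "A"): 0.35, ("C", "C"): 0.35,
              ("A", "C"): 0.15, ("C", "A"): 0.15}

    print(log_odds_score("AACC", "AACA", p_pair, p_single))  # ~0.72 bits

A positive total means the aligned pair is, on the whole, more probable under the related model than under the random model.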

It is important to realize that in our alignment algorithms, when we assign a score to a match or a mismatch (such as using a score of 2 for a match between DNA sequences), what is really happening is that we have decided that the log-odds of that match is 2. And scoring a mismatch as -1 says that we are stating the log-odds for that mismatch is -1. In other words, any scoring system is just a way to assign values to the possible log-odds values that get added together in the above log-likelihood ratio.

As a more concrete example, let’s take a closer look at the BLOSUM50 matrix. First, we must state that the values in the BLOSUM50 matrix are not exactly the log-odds values, but have been scaled by the factor 3 and rounded off to make them easier to deal with. It should be noted that the scale factor of 3 is not standard, and other scale factors are used for other scoring matrices (e.g., BLOSUM62 uses 2 as the scale factor).

Notice the value assigned when amino acid R is aligned with itself is +7 in this matrix. This means, in an alignment algorithm, if we align an R with an R, we add 7 to the score for that alignment. What it is more directly saying from a probability standpoint is

$$3 \log_2 \left( \frac{P_{RR}}{P_R P_R} \right) = 7$$

(recall that the 3 is just a scale factor that is used). From this, we can directly calculate the value for $P_{RR}$ that the matrix infers, if we make an assumption about the value of $P_R$. Notice that the fact that a positive number is used here (i.e., 7) suggests that we believe that R will align with itself more often in related sequences than in unrelated sequences. Compare this to the BLOSUM50 matrix value of -4 for I aligning with D. This suggests that such an alignment is much less likely in related sequences than in unrelated sequences (as we'd expect).
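To make this concrete, suppose, purely as an illustrative assumption of ours, that $P_R = 0.05$. Solving the scaled log-odds equation for $P_{RR}$ gives

$$P_{RR} = P_R^2 \cdot 2^{7/3} = (0.05)^2 \times 2^{7/3} \approx 0.0025 \times 5.04 \approx 0.0126$$

so under the related-sequences model, an R aligned with an R is about five times as likely as under the random model.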

Thought provoking exercise: Although we haven’t discussed this in detail, the scoring matrix that would be used for a particular alignment problem depends on how related the sequences are believed to be. The BLOSUM50 matrix would be used, for example, for proteins that are thought to have a more distant evolutionary origin (more time has passed since the sequences diverged from each other), while the BLOSUM62 matrix would be used for proteins that are thought to have a more recent evolutionary origin (less time has passed since the sequences diverged from each other). With this in mind, would we expect the values in the BLOSUM50 matrix or the BLOSUM62 matrix to be higher in absolute value (regardless of the scale factor used)?

If you are interested, a good readable reference describing the development of the BLOSUM matrices and the PAM matrices is in the Ewens and Grant reference (sections 6.5.2 and 6.5.3).

5.4 Markov Chains

Markov chains are a powerful mathematical tool that allows one to build probability models for very complex processes. We will only cover them briefly in this course, but you should realize that there are full courses, many books, and a well-developed theory for Markov chain models. They are particularly useful for many biology-related problems, so it is useful to have at least a starting knowledge of them through this course. A more complete understanding would require knowledge of other areas of mathematics such as linear algebra and more complex probability concepts (which are not required for this course).

5.4.1 Basics

The general idea of a Markov chain model is to describe a process that moves from state to state in discrete steps (we will only discuss discrete Markov chains). A non-biological example of such a process might be a particular stock you are watching on the stock market. The process in that case is the working of the stock market itself, and the discrete steps we might consider are days. To keep things simple, we might describe the possible states the process can be in as either “stock price went down today”, “stock price went up today”, or “stock price stayed the same today”.

Let's refer to these three states as D, U, and S (for down, up, and same). Then, as we watch the process unfold over the course of, say, two weeks (10 trading days), we can write down the sequence of states the process visited. For example, we might notice the following sequence: DDDUSUUUDD. This tells us that the stock went down in price the first three days, then up, then the same, then up for three days, then finished by going down for two days.

Notice that the order of the sequence is extremely important. The sequence USUDUDDDUD describes a completely different outcome despite the fact that the stock went up, went down, and stayed the same for the same total number of days.

Also notice that the observed sequence was just one of many possibilities. Before the beginning of that 10 day period, we might not have much of a clue about what sequence was about to unfold. If we were about to invest in that stock, we might hope for UUUUUUUUUU, but we would not know for sure if that would happen.

Of course, a next logical question then is to ask, "what is the probability of the sequence being UUUUUUUUUU?" Markov chains give us a framework for building a probability model for such processes and calculating such probabilities (as well as answering many more interesting and complex questions about the process).

5.4.2 Notation and Setup

First, let's use the symbol $X$ to denote the process, or in other words, the sequence of states in the chain. The individual states visited by the chain will be denoted by $X = (x_1, x_2, x_3, \ldots, x_n)$. Each of the $x_i$ is a symbol that represents one of the possible states in the chain (such as U, D, or S in the stock market example). We can write the probability of observing this sequence using standard probability notation and the conditional probability law:

$$P(X) = P(x_1, x_2, \ldots, x_n) = P(x_n \mid x_{n-1}, x_{n-2}, \ldots, x_1)\, P(x_{n-1} \mid x_{n-2}, \ldots, x_1)\, P(x_{n-2} \mid x_{n-3}, \ldots, x_1) \cdots P(x_2 \mid x_1)\, P(x_1)$$

This just describes the probability of observing a particular sequence in the following way: starting from the right side, the overall probability is the chance of observing the state $x_1$ first, times the chance of the process being in state $x_2$ second, given that it was first in $x_1$, times the chance of the process being in state $x_3$ next, given that it first visited $x_1$ and $x_2$, and so on.

As it stands, this is actually a difficult thing to calculate because there are many complex conditional probabilities involved. A first-order Markov chain simplifies things by assuming that the chance that the process will be in a certain state at a certain point has only to do with the state it was in previously. Only the current state has anything to do with what state it will visit next. With this in mind, the above probability reduces to

$$P(X) = P(x_1, x_2, \ldots, x_n) = P(x_n \mid x_{n-1})\, P(x_{n-1} \mid x_{n-2})\, P(x_{n-2} \mid x_{n-3}) \cdots P(x_2 \mid x_1)\, P(x_1)$$

Now, the calculation is much simpler. All of the probabilities involved (other than the first) can be described as "the probability of the process visiting state $k$ next given that it is currently in state $j$."

So, in order to completely define this process, we need to write down these probabilities. They are referred to as transition probabilities because they represent the probability that the process will transition from one state to another at the next step. We collect all of these probabilities in a matrix which we call the transition matrix P for the process. This matrix is a square matrix in which each row represents a possible state of the chain, as does each column.

Let's notate the possible states by $S_1, \ldots, S_k$. The entry in the $(i,j)$th cell of the matrix, which we'll notate by $p_{ij}$, represents the probability that the process will transition from state $S_i$ to state $S_j$ at any particular time step. In general, the matrix takes the following form:

$$P = \begin{array}{c|cccc} & S_1 & S_2 & \cdots & S_k \\ \hline S_1 & p_{11} & p_{12} & \cdots & p_{1k} \\ S_2 & p_{21} & p_{22} & \cdots & p_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ S_k & p_{k1} & p_{k2} & \cdots & p_{kk} \end{array}$$

Notice that the probabilities must add to one across each row, since the process must transition somewhere at the next step.

We also need one other piece of information: the initial probabilities. The transition matrix itself doesn't tell us the probabilities associated with the initial state the chain will visit (since there is no previous state to base this information on). These initial probabilities will be $p_{01}, \ldots, p_{0k}$, and these must also add to 1.

As an example, let’s think again about the stock market process discussed above. Below is a transition matrix we might use to model this process (notice the rows add to 1):

$$P = \begin{array}{c|ccc} & D & U & S \\ \hline D & .5 & .3 & .2 \\ U & .3 & .6 & .1 \\ S & .4 & .4 & .2 \end{array}$$

This matrix suggests, for example, that if the stock price was down on a particular day, there is a 50% chance it will be down again the next day, a 30% chance that it will be up the next day, and a 20% chance that it will stay the same the next day. We can make similar interpretations for each row.

For the initial probabilities for this process, we might use 1/3, 1/3, 1/3 to represent the fact that the initial state will be unknown to us.

All of this information gives us everything we need to fully understand and analyze this Markov process. We can, for example, calculate the probability of the sequence DDDUSUUUDD that we listed above. It is

$$P(\text{DDDUSUUUDD}) = P(D)\, P(D \mid D)\, P(D \mid D)\, P(U \mid D)\, P(S \mid U)\, P(U \mid S)\, P(U \mid U)\, P(U \mid U)\, P(D \mid U)\, P(D \mid D)$$
$$= (1/3)(.5)(.5)(.3)(.1)(.4)(.6)(.6)(.3)(.5)$$

which is a small number, but that is to be expected considering there are $3^{10} = 59{,}049$ different possible sequences this process could have followed (three possible states at each of ten steps). More interesting would be comparing this probability to the probability of other possible sequences.
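A minimal Python sketch of this computation (our own illustration, using the transition matrix and initial probabilities above):

    # Transition and initial probabilities from the stock-market example.
    trans = {
        "D": {"D": 0.5, "U": 0.3, "S": 0.2},
        "U": {"D": 0.3, "U": 0.6, "S": 0.1},
        "S": {"D": 0.4, "U": 0.4, "S": 0.2},
    }
    init = {"D": 1 / 3, "U": 1 / 3, "S": 1 / 3}

    def chain_probability(states, trans, init):
        # P(x1) * P(x2|x1) * ... * P(xn|x(n-1)) for a first-order chain.
        p = init[states[0]]
        for prev, cur in zip(states, states[1:]):
            p *= trans[prev][cur]
        return p

    print(chain_probability("DDDUSUUUDD", trans, init))  # ~5.4e-05
    print(chain_probability("UUUUUUUUUU", trans, init))  # (1/3) * 0.6**9 ~ 3.4e-03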

Finally, it is possible to incorporate the initial probability distribution right into the transition matrix. We can do this by adding a new state to the process that we’ll call the begin state and will label as B. The process will always start in this state and will leave it immediately without returning. Our transition matrix for the stock market example could then be viewed as

$$P = \begin{array}{c|cccc} & B & D & U & S \\ \hline B & 0 & 1/3 & 1/3 & 1/3 \\ D & 0 & .5 & .3 & .2 \\ U & 0 & .3 & .6 & .1 \\ S & 0 & .4 & .4 & .2 \end{array}$$

5.4.3 Modeling DNA Sequences as Markov Chains

With this quick background on Markov chains, we can now turn back to bioinformatics and take a look at a couple of simple examples of their use in the field. Clearly, we can view a DNA sequence (or an amino acid sequence) as a Markov chain. The process is the DNA sequence itself, and the steps are represented by the bases as we move in a linear fashion from the 5' to 3' end of the DNA. The states, of course, are the four bases A, C, G, and T. In general, our transition matrix would look like the following

$$P = \begin{array}{c|cccc} & A & C & G & T \\ \hline A & p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ C & p_{CA} & p_{CC} & p_{CG} & p_{CT} \\ G & p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ T & p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{array}$$

where $p_{ij}$ is the probability that base $i$ will be followed by base $j$. The rows, as always, sum to 1.

Typically, what we will do is estimate these probabilities from real DNA sequences to help us better understand the process. For example, take the following DNA sequence that we saw from Assignment 4 (part of the HBB gene in humans)

1 acatttgctt ctgacacaac tgtgttcact agcaacctca aacagacacc atggtgcatc

61 tgactcctga ggagaagtct gccgttactg ccctgtgggg caaggtgaac gtggatgaag 121 ttggtggtga ggccctgggc aggctgctgg tggtctaccc ttggacccag aggttctttg 181 agtcctttgg ggatctgtcc actcctgatg ctgttatggg caaccctaag gtgaaggctc 241 atggcaagaa agtgctcggt gcctttagtg atggcctggc tcacctggac aacctcaagg 301 gcacctttgc cacactgagt gagctgcact gtgacaagct gcacgtggat cctgagaact 361 tcaggctcct gggcaacgtg ctggtctgtg tgctggccca tcactttggc aaagaattca 421 ccccaccagt gcaggctgcc tatcagaaag tggtggctgg tgtggctaat gccctggccc 481 acaagtatca ctaagctcgc tttcttgctg tccaatttct attaaaggtt cctttgttcc 541 ctaagtccaa ctactaaact gggggatatt atgaagggcc ttgagcatct ggattctgcc 601 taataaaaaa catttatttt cattgc

We can count up the number of times each base follows each other base and convert to percentages to get the following transition matrix

$$P = \begin{array}{c|cccc} & A & C & G & T \\ \hline A & .2993 & .2628 & .2482 & .1898 \\ C & .2821 & .2756 & .0385 & .4038 \\ G & .1879 & .2667 & .3152 & .2303 \\ T & .1198 & .2036 & .4371 & .2395 \end{array}$$

We notice, for example, that a CG pair is somewhat uncommon (as we have seen in an earlier example in the course, and will discuss further below), while a T followed by a G is more common than might be expected. These insights could be an effect of this particular DNA sequence and the DNA region it comes from, or they could be a more global effect true of DNA sequences in general.
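A small Python sketch of this estimation procedure (ours; the toy input stands in for the full HBB sequence above):

    from collections import Counter

    def estimate_transitions(seq):
        # Count each ordered pair of adjacent bases, then normalize each row.
        seq = seq.upper()
        counts = Counter(zip(seq, seq[1:]))
        bases = "ACGT"
        return {i: {j: counts[(i, j)] / sum(counts[(i, k)] for k in bases)
                    for j in bases}
                for i in bases}

    # Toy input; the full HBB sequence above would be handled the same way.
    P = estimate_transitions("acatttgcttctgacacaactgtgttcact")
    print(P["C"])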

5.4.4 CpG Islands

This example is taken from Durbin, et al. It relates to the fact, which we have now seen a couple of times, that the bases CG tend to not appear together in that order nearly as much as we’d expect. Here is an excerpt from Durbin regarding the situation and a question we can investigate:

In the human genome wherever the dinucleotide CG occurs (frequently written as CpG...), the C nu- cleotide is typically chemically modified by methylation. There is a relatively high chance of this methyl-C mutating into a T, with the consequence that in general CpG dinucleotides are rarer in the genome than would be expected from the independent probabilities of C and G. For biologically important reasons the methylation process is suppressed in short stretches of the genome, such as around the promoters or ’start’ regions of many genes. In these regions we see many more CpG dinucleotides than elsewhere, and in fact more C and G nucleotides in general. Such regions are called CpG islands (Bird, 1987). They are typically a few hundred to a few thousand bases long.

The question posed at the end is: if we have a DNA sequence for which we don't know whether or not it is a CpG island, what techniques can we use to decide if it is? The answer is based on analysis of two Markov chain processes and a likelihood ratio test, as we'll see below.

First, in order to do this, we need to have data from real DNA sequences which we can classify as either coming from a CpG island or not. For those sequences which we know to come from a CpG island, we'll estimate the transition probabilities of a Markov chain. Separately, for those sequences which we know to not come from a CpG island, we'll estimate the transition probabilities of a second Markov chain. These two chains are modelling two different processes: one models DNA sequences from CpG islands, and the other models DNA sequences from non-CpG islands.

Data from 48 sequences were used by Durbin, et al., and each of these sequences was known either to be from a CpG island or not. The resulting transition matrix estimates were

CpG Islands

$$P = \begin{array}{c|cccc} & A & C & G & T \\ \hline A & .180 & .274 & .426 & .120 \\ C & .171 & .368 & .274 & .188 \\ G & .161 & .339 & .375 & .125 \\ T & .079 & .355 & .384 & .182 \end{array}$$

Non-CpG Islands

$$Q = \begin{array}{c|cccc} & A & C & G & T \\ \hline A & .300 & .205 & .285 & .210 \\ C & .322 & .298 & .078 & .302 \\ G & .248 & .246 & .298 & .208 \\ T & .177 & .239 & .292 & .292 \end{array}$$

A quick look at these matrices shows some clear differences between these two types of regions in DNA sequences, particularly with regard to the CG sequence.

Now, suppose we have a new DNA sequence $X = (x_1, \ldots, x_n)$ which is an observed sequence from a Markov chain, and we'd like to know whether or not it seems to be a CpG island. This would help us determine whether the sequence is a promoter region, or some other area of DNA. For example, let's again use the 626 base pair sequence above from the HBB gene.

Our method for deciding will again be the idea of a log-likelihood ratio. We have two hypotheses here: either this sequence comes from a CpG island or not. We can calculate the likelihood that this sequence comes from a CpG island and the likelihood that this sequence does not, and take the ratio. Since we have two well-defined models above, one for each hypothesis, we can calculate these values. More specifically, our calculation will be:

$$\log LR = \log \frac{P(X \mid \text{CpG Island})}{P(X \mid \text{non-CpG Island})} = \log \frac{p_{x_0 x_1} p_{x_1 x_2} p_{x_2 x_3} \cdots p_{x_{n-1} x_n}}{q_{x_0 x_1} q_{x_1 x_2} q_{x_2 x_3} \cdots q_{x_{n-1} x_n}} = \sum_{i=1}^{n} \log \left( \frac{p_{x_{i-1} x_i}}{q_{x_{i-1} x_i}} \right)$$

This says that for each base in this new sequence, we need to calculate the transition probability to that base from the CpG island model and from the non-CpG island model, take the ratio, take the log, then add up over all the bases. As we have seen before, a larger value of this likelihood ratio suggests that the hypothesis in the numerator (namely that this sequence comes from a CpG island) is more correct, and vice versa if it is a small value.

We can make this calculation for the HBB sequence. Recall that the first few bases in this sequence are

acattt....

So the log-likelihood ratio calculation is (using .25 as the initial probabilities to start the chains)

$$\log LR = \log\left(\frac{.25}{.25}\right) + \log\left(\frac{.274}{.205}\right) + \log\left(\frac{.171}{.322}\right) + \log\left(\frac{.120}{.210}\right) + \log\left(\frac{.182}{.292}\right) + \log\left(\frac{.182}{.292}\right) + \cdots = -87.84$$

Since this value comes out negative, it suggests that the non-CpG island model does a better job of modelling this DNA sequence, and therefore suggests that the sequence comes from a non-CpG island. In this particular case, we know this to actually be correct because most of the sequence is in the coding region of the HBB gene.
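A Python sketch of this classifier, using the two matrices above (the helper name cpg_log_lr is our own; as in the hand calculation, the .25 initial probabilities cancel in the ratio and so are omitted):

    import math

    # Transition matrices from Durbin et al., as given above.
    cpg = {"A": {"A": .180, "C": .274, "G": .426, "T": .120},
           "C": {"A": .171, "C": .368, "G": .274, "T": .188},
           "G": {"A": .161, "C": .339, "G": .375, "T": .125},
           "T": {"A": .079, "C": .355, "G": .384, "T": .182}}
    non = {"A": {"A": .300, "C": .205, "G": .285, "T": .210},
           "C": {"A": .322, "C": .298, "G": .078, "T": .302},
           "G": {"A": .248, "C": .246, "G": .298, "T": .208},
           "T": {"A": .177, "C": .239, "G": .292, "T": .292}}

    def cpg_log_lr(seq):
        # Sum of log(p/q) over adjacent base pairs; positive favors the
        # CpG-island model, negative the non-island model.
        seq = seq.upper()
        return sum(math.log(cpg[a][b] / non[a][b]) for a, b in zip(seq, seq[1:]))

    print(cpg_log_lr("ACATTTGCTTCTGACACAAC"))  # negative: non-island-like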

5.4.5 Testing for Independence of Bases

We will introduce another example here, but leave the details for an assignment. We have discussed earlier in the course the idea of testing whether the bases in a DNA sequence are independent of one another. In a number of situations, we have had to make this independence assumption to make the mathematics easier and complete the problem without making it overly difficult. However, it is not always true, and in certain regions of DNA, it is certainly not true.

Markov chains give us a simple way of testing for independence between bases and modeling longer-range dependence among pairs of bases in a DNA sequence. For example, we can build a Markov chain model that allows the base at a particular position to depend on the base in the previous position, the previous two positions, or in general, any number of previous positions. If you think carefully about such a model, you will notice that it can still be described by a Markov model, but not necessarily a first-order model. The longer the range of dependence we want to account for, the higher the order of the Markov model we will need.

As we have seen, a first-order model uses four states (the four DNA bases) and defines the probability of observing a base at a certain position as dependent only on the base in the previous position. In the homework assignment, you will develop a method for analysing independence based on this framework.

It turns out that we can treat higher order models as first-order models as well if we redefine the states carefully. For example, suppose we wish to model a situation where the base found at a position depends on the previous two bases. This requires a second-order model if we continue to define the state space as consisting of the four DNA bases. However, if we redefine the state space to consist of all possible pairs of bases (i.e., 16 states such as AA, AC, AG, etc.), we can now think of this as once again a first-order model, but with $4^2 = 16$ states. Even though there are more states, this is often the simpler way to think about it because we can now rely on all the first-order Markov chain model theory.

As an example of redefining the states, consider the following DNA sequence:

acggcgtta

If we define the 16 states as all possible pairs of bases, we can now view this sequence as the following sequence of states in this new state space:

ac-cg-gg-gc-cg-gt-tt-ta

The initial state is ac, the chain then transitions to the state cg, and so on. You should note that this definition of states is such that some transitions are impossible. You will investigate this more in the homework assignment.
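A one-line Python sketch of this re-encoding (ours):

    def pair_states(seq):
        # Overlapping dinucleotides: a walk on the 16 pair states.  Note that
        # a transition from state xy to state zw is only possible when y == z.
        return [seq[i:i + 2] for i in range(len(seq) - 1)]

    print("-".join(pair_states("acggcgtta")))  # ac-cg-gg-gc-cg-gt-tt-ta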

5.5 Other Sequence Problems in Bioinformatics

There are many other categories and subcategories of problems relating to DNA or protein sequence analysis in bioinformatics. We will discuss a couple others here that are particularly interesting and solvable using techniques we have discussed in the course.

5.5.1 Gene Finding

Identifying genes is an important problem in bioinformatics. In a sentence, the problem can be stated as the following: "Given a long sequence of DNA from a certain genome, determine the portions of that sequence that code for a gene or genes." If we can do this, we can then translate the coding sequence into an amino acid sequence and further study its function.

In this section, we will discuss some basic concepts in gene finding (also called gene prediction), including the difference between finding genes in prokaryotic versus eukaryotic organisms and the types of patterns that can be searched for.

Prokaryotic versus Eukaryotic Organisms

At a very high level, organisms are classified as either prokaryotic or eukaryotic. Prokaryotic organisms are those whose cells do not have a nucleus. This set of organisms essentially comprises bacteria, and they are often single-celled. Eukaryotic organisms are those whose cells do have a nucleus. Eukaryotes comprise all other higher forms of life, from algae and fungi to plants, animals, and of course humans. Although they are simple organisms, prokaryotes (bacteria) are well-studied because their genomes are relatively easy to analyze (they are small and contain many tightly packed genes), because they are easy to manipulate, and because of their effect on humans.

The main biological difference that results from not having a nucleus is that genes in prokaryotes generally do not contain introns. In other words, when a gene is transcribed into mRNA, it is translated into protein in that same form. As we have seen, eukaryotic mRNA goes through another level of processing while still in the nucleus, namely, intron removal. The spliced mRNA containing only the exons is then transferred out of the nucleus and only then translated into protein.

Gene Finding in Prokaryotes

Because of the difference discussed above, gene finding in prokaryotes is much simpler than in eukaryotes. We will discuss this first.

First, we should note that many bacterial genomes have been completely sequenced (such as the common bacterium E. coli). They are relatively small genomes (on the order of 1 to 5 million bases) and have relatively few genes (on the order of a few thousand). In the cases where a complete sequence is available, this can be used as the basis for gene finding. In other cases, there typically is access to partial sequences on which we search for genes.

In prokaryotes, the process boils down to the following: given a stretch of DNA, search for the start codon AUG and follow it to the first stop codon, one of UAA, UAG, UGA (using the RNA notation of U instead of T). Since no intron removal is done, that sequence is a gene and can directly be translated into the corresponding amino acid sequence using the genetic code.

This sounds very easy. It isn't too difficult, but there is a bit more to consider. Mainly, not every AUG triplet found is a start codon, and not every UAA, UAG, or UGA is a stop codon. For example, the AUG sequence you find may actually span two codons, as in GGA-UGC. So, we have to consider that there are different open reading frames, or ORFs, in which we can read, or make sense of, the DNA sequence.

In fact, there are six possible ORFs for any stretch of DNA. Three are from the fact that we can begin reading and translating at either the first, second, or third available base. Each of these frames leads to a different sequence of codons and therefore a different sequence of amino acids. Note that starting at the fourth base is equivalent to starting at the first because codons are triplets of bases.

In fact, there are three other ORFs in a stretch of DNA. To realize this, we need to once again recall that DNA actually comes in double stranded form, with complementary bases lining up across the strands. Biologically, DNA is read, transcribed, and translated in the 5’ to 3’ direction. This can happen on either strand. The 5’ to 3’ direction on the complementary strand runs in the opposite direction. So for a particular stretch of double stranded DNA, there could be a gene on one strand reading left (5’) to right (3’), or one on the other strand reading right (5’) to left (3’). The figure below helps to show this.

3’ UAAGUACGCAAUACGGCUUGGGCGUAGGCAAUC 5’ 5’ AUUCAUGCGUUAUGCCGAACCCGCAUCCGUUAG 3’

So, the second set of three ORFs are read from the 5’ to 3’ direction on the complementary strand, giving a total of six possible open reading frames in a single stretch of DNA.
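A short Python sketch (ours) that enumerates all six frames of a DNA string:

    COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def six_frames(dna):
        # Three frames on the given strand plus three on the reverse
        # complement (the 5' to 3' reading of the complementary strand).
        dna = dna.upper()
        strands = (dna, dna.translate(COMPLEMENT)[::-1])
        return [[strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
                for strand in strands for offset in range(3)]

    for frame in six_frames("ATGCGTTATGCCGAA"):
        print(frame)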

When analyzing each ORF, we will search for a pattern where we see a start codon, a long stretch with no stop codons, and then finally a stop codon. Typically genes are not short - usually at least 300 bp. So, this would be a sign that we are reading in the correct frame and have found a true gene.

On the other hand, if we are reading in the incorrect frame, we would likely come across an AUG triplet, then find one of the stop codons not too far downstream (since the appearance of UAA, UAG, or UGA in a random sequence is likely to occur at some point). For this to be a real gene, it would be an unusually short one, and so we might conclude that it is not really a gene and just a region of non-coding DNA or is being read in the wrong ORF.

We can make some probabilistic calculations to support the above contention. Suppose we are reading in a particular ORF that is incorrect and not encoding a gene. We can consider the base sequence, and the resulting codon sequence, as being completely random. Consider a point where we come across the AUG start codon. In this completely random sequence, how long would it take to come across a stop codon?

As a simple analysis of this problem, we might assume that each of the 64 codons is equally likely to occur at any position. In that case, since there are three stop codons, we can say that any particular random codon has a 3/64 chance of being either UAA, UAG, or UGA. We can then model the random variable that records the first appearance of one of these three codons as a geometric random variable:

$$X \sim \text{Geometric}(p = 3/64)$$

where $X$ = the codon position where UAA, UAG, or UGA first appears (following an AUG) in a completely random sequence. We can calculate the expected value of $X$ as $64/3 \approx 21.3$, and a picture of its cumulative distribution function is below.

The first such codon is highly likely to occur (> 90% probability) prior to the 50th codon, and there is approximately a 99.2% probability that it will occur before the 100th codon. Therefore, if we are reading an ORF and find an apparently short gene, it gives strong credence to the hypothesis that we are just reading a random DNA sequence and have not found a true gene, since most real genes are known to be over 100 codons long.
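These tail probabilities are easy to check directly; a minimal Python sketch (ours):

    # X ~ Geometric(p = 3/64): position of the first stop codon in a random
    # reading frame, assuming all 64 codons are equally likely.
    p = 3 / 64

    def cdf(k):
        # P(X <= k) = 1 - (1 - p)**k
        return 1 - (1 - p) ** k

    print(64 / 3)        # expected position, ~21.3
    print(cdf(50))       # ~0.909: >90% chance of a stop by the 50th codon
    print(cdf(100))      # ~0.992
    print(1 - cdf(200))  # ~6.7e-05: chance a random frame runs 200+ codons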

However, of course, we could make a mistake. On one hand, there are bacterial genes shorter than 100 codons. We would incorrectly mislabel these as not being genes according to the above analysis, because we would assign a high probability that they arose from random junk DNA. Secondly, there is a small chance that a random sequence of DNA will have an apparent gene of length over 100 codons (about .8% according to our geometric model above). We might mislabel that region as an actual gene according to this analysis. However, if we find a very long "apparent gene", say 200+ codons, we can be almost certain that it is a real gene, since the probability that a random DNA sequence will run 200+ codons before encountering a stop codon is so small.

The other caveat to our simple analysis here is that 3/64 is a simple estimate of the probability that a codon will be one of the three. A more complete analysis would take into account other factors such as the underlying probabilities of each base in the particular genome, and any dependencies among bases from one position to the next.

A nice picture of a real gene search in E. coli over a stretch of 7500 bp is given below (from Mount, 2004). The six open reading frames are shown, with the position of AUG codons displayed as short bars and the position of stop codons shown as full bars. Notice the wealth of stop codons throughout most of the reading frames except the one labeled "3" at the top. In that frame, there is a start codon at position 1284 and we don't encounter the first stop codon until position 4355. This is a strong gene candidate, one having a length of nearly 3100 bp. In fact, this is known to be the gene called lacZ in E. coli.

Gene Finding in Eukaryotes

As we have mentioned, finding genes in eukaryotes (e.g., humans) is more difficult than in prokaryotes because a DNA sequence is not translated directly into an amino acid sequence. Introns are first removed. So, we would have to know where the introns are prior to making this translation. A nice picture from http://obesitygene.pbrc.edu/~eesnyder/presentations/GeneParser Presentation files/frame.htm (Snyder and Chepudira) shows the basic structure of eukaryotic gene regions.

Starting with a given eukaryotic DNA sequence, there are two basic steps to determining if there is a real gene located in the sequence and where it is:

1. Predict where the introns are and remove them from the sequence. Or, equivalently, predict the exons directly to build the sequence.

2. Now treat the sequence as we did with prokaryotes: search different ORF’s for characteristic patterns of real genes.

In each case, there are a number of methods and techniques that have been developed, and doing these correctly typically requires much time and many iterations. We will cover a couple of useful methods.

We’ll begin with the first item, detecting exons. This problem is particularly hard, especially in higher eukaryotes where a typical gene consists of a number of small exons that are separated by large introns, typically 10 to 100 times larger than the exons (Zhang, 1998). The methods for solving this problem are very complex, so we will focus on some basic ideas that are related to topics we have covered in the course.

Two key facts that we will make use of have been observed about genes and exons in different organisms. One is that the usage of the 61 protein-encoding codons varies from organism to organism, even for the codons that redundantly code for the same amino acid. We saw this before, and referred to the codon usage data at http://www.kazusa.or.jp/codon for different species. This data represents frequencies of codons in real genes. We can make use of this to score a potential exon as to whether the encoded codons make sense for that organism.

Similarly, it has been found that contiguous sequences of six bases (called hexamers or 6-mers) have characteristic frequencies within gene coding regions, and these frequencies are different from organism to organism. An example of these differences in humans from Zhang, 1998 is shown below. Not all the possible hexamers are shown in this chart, just the top and bottom 20. The top 20 are those that occur most frequently in exons compared to introns, and vice versa for the bottom 20. Notice that the top 20 are typically hexamers with repeated bases, such as tttttt and aaaaaa. Note that the two different graphs represent differences between genes high in the GC bases and low in the GC bases.

A methodology for using this information is based on GeneParser (Snyder and Stormo, 1995). Let's focus on the codon usage table as an example. We can construct a matrix that has the target DNA sequence along the left side and along the top (similar to the matrix we set up when aligning two sequences). Then, each cell in the matrix, for example the $(i, j)$ cell, represents a log-likelihood ratio score for the likelihood that the DNA fragment from positions $i$ through $j$ comes from an exon compared to an intron. High positive scores are predictive of exon locations, and high negative scores would be predictive of intron locations. Assembling the best predictive sequence of exons and introns is complex, but can be based on these scores.

We can also construct similar matrices based on other statistical measures that contrast known exons to known introns, like hexamer frequencies, dinucleotide frequencies, and length distributions (see Zhang 1998 for more data on these statistics).

As an example, let’s again make use of the codon usage table for humans from http://www.kazusa.or.jp/codon as a representation of the distribution of codons in human exons. The current version of that table is repeated here:

UUU 17.4(633626)   UCU 15.1(548455)   UAU 12.1(441486)   UGU 10.5(381094)
UUC 20.4(743002)   UCC 17.7(642035)   UAC 15.3(556798)   UGC 12.6(458929)
UUA  7.6(274788)   UCA 12.2(442404)   UAA  1.0( 37072)   UGA  1.6( 57452)
UUG 12.8(467035)   UCG  4.5(161781)   UAG  0.8( 28728)   UGG 13.2(480244)

CUU 13.1(475663)   CCU 17.4(634220)   CAU 10.8(392988)   CGU  4.6(166773)
CUC 19.7(715343)   CCC 19.9(723912)   CAC 15.1(548066)   CGC 10.6(384290)
CUA  7.1(259841)   CCA 16.9(612909)   CAA 12.2(442229)   CGA  6.2(224492)
CUG 39.9(1449753)  CCG  7.0(254350)   CAG 34.2(1243194)  CGG 11.5(417971)

AUU 15.8(575464)   ACU 13.0(473799)   AAU 16.8(610977)   AGU 12.1(441137)
AUC 20.9(760429)   ACC 19.0(689901)   AAC 19.1(693831)   AGC 19.4(706723)
AUA  7.4(270016)   ACA 15.0(544170)   AAA 24.2(879684)   AGA 12.0(434655)
AUG 22.1(801969)   ACG  6.1(220632)   AAG 32.0(1163126)  AGG 11.9(432954)

GUU 11.0(399567)   GCU 18.5(672416)   GAU 21.7(789799)   GGU 10.8(392298)
GUC 14.5(528840)   GCC 28.0(1018345)  GAC 25.2(914677)   GGC 22.4(814464)
GUA  7.1(257442)   GCA 15.9(579156)   GAA 28.7(1043166)  GGA 16.5(597986)
GUG 28.3(1028789)  GCG  7.5(271820)   GAG 39.6(1441162)  GGG 16.5(599428)

We will use a random equal-codon usage model to represent the distribution of codons in introns (although we could base the intron model on the distribution from known introns). Consider the DNA sequence

accgttg

The score we would assign in position (1,6) of the matrix would be the log-likelihood ratio corresponding to acc-gtt being an exon as opposed to an intron. The numerator based on the codon usage table would be (assuming independence between codons) $.019 \times .011 = .000209$, and the denominator would be $(1/64)^2 = .000244$, giving a log-likelihood ratio of $\log(.000209/.000244) = -0.1554$. This is indicative of this region being more likely an intron than an exon.
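A Python sketch of this scoring calculation (ours); only the two codons needed for the example are included, with frequencies read from the table above (per-thousand values, written with DNA letters in place of the table's RNA letters):

    import math

    # Probabilities for the two codons in the example, from the human
    # codon usage table (ACC: 19.0 and GUU: 11.0 per thousand).
    exon_prob = {"ACC": 0.019, "GTT": 0.011}
    intron_prob = 1 / 64  # equal-usage model for introns

    def exon_log_lr(codons):
        # log of (product of exon-model codon probabilities) over
        # (uniform intron-model probability for the same number of codons).
        num = math.prod(exon_prob[c] for c in codons)
        return math.log(num / intron_prob ** len(codons))

    print(exon_log_lr(["ACC", "GTT"]))  # ~ -0.155: slightly intron-like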

We could also use Markov chains to make these calculations. Two Markov chain transition matrices can be built to make these calculations, one for the exon model and one for the intron model. These Markov chains could also take into account dependencies from one codon to the next, therefore incorporating the hexamer frequencies directly into the model. For example, looking at the hexamer figure above, the transition probability from gaa to gaa (the bottom line on the left) would be quite small in the exon model to represent the fact that very few exons contain these two codons in succession.

Alternative methods for exon and intron detection include Hidden Markov Models, which are interesting extensions of Markov chains, and neural networks, which are a powerful machine learning method.

Now that we have identified exons, we can piece together the puzzle of the resulting mRNA. We still have to search ORFs for start and stop codons and see if our result makes sense. Typically, in eukaryotes, the first AUG in the sequence is the start codon, but other AUGs close to the 5' end but not necessarily first could be used. We can follow the appropriate ORF through to the first stop codon.

Of course, none of this guarantees that we have found a true gene. The exon detection process is still just a prediction, and is not likely to be perfect. For a real gene-finding problem, the researcher will likely take the predicted gene region (or related protein sequence) and search for homologies using BLAST or similar database similarity search programs. If no match is found, or only low-scoring ones, then we probably made a mistake with our exon detection and piecing together of the mRNA. If we find a strong match, we are probably on the right track.

5.5.2 Identification of Promoter Regions

Background

We have briefly mentioned the concept of the promoter region of a gene. It is a region upstream of the transcription start site that the RNA polymerase attaches to in order to begin the transcription process. Figure 5.2 gives an overview of the transcription process and related promoter sites.

Figure 5.2: Typical arrangement of promoters, transcripts, and termination sites (Hartl and Jones, 2005)

Typically, the promoter region is defined well by a certain DNA sequence pattern, or combinations of DNA patterns. For example, in many E.coli genes, there is a sequence about 35 bases upstream of the transcription start site, called the -35 sequence. Not all genes have the same sequence pattern at that point, but if you analyze the -35 sequence across many E.coli genes, the consensus sequence found is TTGACA. The consensus sequence is determined by taking the base that occurs the most at each position when analyzed over many genes. Also in E.coli, there is a region 10 bases upstream of the transcription start site called the -10 sequence or the TATA box, which also serves as a promoter region. The consensus sequence is TATAAT. Figure 5.3 gives an example of these regions for eight genes in E.coli.

Figure 5.3: Eight E.coli genes showing variable promoter sequences at -35 and -10 bp upstream of start site (Hartl and Jones, 2005)

Notice that for any particular gene, the promoter sequence may not be the same as the consensus promoter sequence, and it could be quite different. Close similarity to the consensus promoter sequence has been associated with the strength of binding of RNA polymerase to the promoter region. We would call such promoters strong promoters as opposed to weak promoters. Strong promoters have a better ability to begin the transcription process and therefore the related gene will get transcribed (and eventually translated into protein) in higher quantities. So, promoter regions also have an effect on the level of gene expression.

In humans, the TATA box is often found about 20 bases upstream of the transcription start site. But promoter regions, and related DNA sequence patterns, can be very complex. An example of a gene and related promoter region in humans is shown in Figure 5.4. The output is from GenBank associated with sequence accession number X84664, and is the stromelysin-3 gene. Notice the position of the TATA signal (as it is called in the listing) and the first exon and gene start point in this sequence.

Promoter Finding

As we have discussed when talking about finding genes, it is important to analyze a long DNA sequence, or even an entire chromosome or genome, to find genes. In addition to other methods we have discussed, we can find genes by searching for known promoter regions. If promoter region sequences were always exactly the same (i.e., perfectly matched the consensus sequence), this would be a simple computer algorithm to search for known consensus sequences. Unfortunately, as we have seen, true promoter sequences can vary from the consensus sequence, and so we have to allow for this in our search.

Let’s set the problem up mathematically. We have access to a long DNA search sequence of length L. We’ll refer to this sequence as S = (s1, s2, . . . , sL). This sequence can be a stretch of DNA known to be upstream of a gene or in an intergenic region, or is potentially an entire chromosome or genome. We will scan through this search sequence to search for possible promoter sites. We can then score each position in this sequence as to how likely it is to be the start of a promoter region. High scoring positions will require further investigation.

The first method we will discuss is fairly simple statistically. We will define a promoter target sequence based on the consensus sequence. Let $T = (t_1, t_2, \ldots, t_l)$ represent this consensus sequence with length $l$. In Figure 5.3, the -35 promoter consensus sequence has $l = 6$ with $T =$ TTGACA. Notice that this disregards information about the variation between actual promoter sequences that went into defining the consensus, but we'll take this simple approach for now.

Now, when we scan the long search sequence, we will score each position according to how many bases exactly match the target $T$. We'll call this a match score. This generates a score for the first $L - l + 1$ positions in the search sequence (the last $l - 1$ positions do not have enough remaining bases to match against the full target). For example, suppose we use the sequence $T$ above from the -35 consensus sequence in E.coli, and our search sequence $S$ is as given below (taken from a portion of the lac gene in Figure 5.3).

TAGGCACCCCAGGCTTTACACTTTA

This sequence $S$ is of length $L = 25$. Since our target is of length $l = 6$, we can score the first 20 positions of $S$. Those scores are below, annotated underneath the search sequence.

S: TAGGCACCCCAGGCTTTACACTTTA Score: 31211201021011523001-----

Notice that the actual promoter site scores a 5 and clearly stands out within this short sequence. However, in general, if we scan a long sequence of length $L$, we may expect to find high matching to the consensus sequence even within random non-promoter regions of DNA. So, a "high matching" score does not imply that the position must be a promoter region. Also, some low matching scores might be promoter regions. For example, some of the genes in Figure 5.3 have a match score, as defined above, as low as 2.

We want to be able to attach a significance level to these match scores so that we can focus our attention on the most significant scores. Doing so is fairly straightforward if we make assumptions about what a random DNA sequence looks like. In a simple model of random DNA, we can assume all bases are independent and equally frequent. For a random DNA sequence $S = (s_1, \ldots, s_l)$ of length $l$ under this model, the number of matches we expect to the target sequence follows a binomial distribution with parameters $n = l$ and $p = 1/4$. For $l = 6$, the match score distribution is

Match Score   Probability
     0           .178
     1           .356
     2           .297
     3           .132
     4           .033
     5           .004
     6           .000244

So in the lac gene example, the p-value associated with finding a match score of 5 is .004 + .0002 = .0042. This is small, and we would conclude that we have found an interesting site, unlikely to be simply from random DNA.
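A Python sketch (ours) of the scan and the binomial tail probability; note that computed exactly, the tail probability for a score of 5 is about .0046, while the .0042 above comes from the rounded table values:

    from math import comb

    TARGET = "TTGACA"  # the -35 consensus sequence
    S = "TAGGCACCCCAGGCTTTACACTTTA"

    def match_scores(s, target):
        # Exact-match count against the target at each scorable start position.
        l = len(target)
        return [sum(a == b for a, b in zip(s[i:i + l], target))
                for i in range(len(s) - l + 1)]

    def binom_tail(k, n=6, p=0.25):
        # P(score >= k) under the equal-frequency random-DNA model.
        return sum(comb(n, j) * p ** j * (1 - p) ** (n - j)
                   for j in range(k, n + 1))

    print(match_scores(S, TARGET))  # the window scoring 5 is the promoter site
    print(binom_tail(5))            # ~0.0046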

One difficulty with this method is that if our search sequence is quite long ($L$ is large), then we are likely to find many false positives. In other words, for $L = 1{,}000{,}000$ bp, even in completely random DNA we would expect to find 244 positions with a match score of 6 (a perfect match to the target promoter). The longer the target sequence is, the less of a problem this would become. Also, the equal frequencies model is not really appropriate, and so the binomial distribution is not the best choice. We could use a Markov chain to model more complex base probabilities and even dependencies between bases (think about this on your own).

The other difficulty with this simple method is that we have only based the score on the consensus sequence itself, and not its constituent component pieces, namely each individual sequence which produced the consensus. The information on the variation between different promoter regions is often stored in a position weight matrix, or PWM (Hertz and Stormo, 1999). These are also often called position specific score matrices, or PSSM.

There are various methods for deriving a PWM. We will discuss a simple one here. Let’s continue to base our example on the E.coli genes listed in Figure 5.3. First, we can develop an alignment matrix, which simply counts the number of times each base appears at each position among the genes we have information on (in this case we have 8 such genes). For this example, this matrix would be:

      1   2   3   4   5   6
A     0   0   0   5   1   6
C     0   1   1   2   4   0
G     3   0   6   0   0   0
T     5   7   1   1   3   2

Each column adds to 8 since there were 8 sequences.

Now we can derive an associated probability matrix that simply gives the percentage of times each base appeared in each position. This matrix is simply the values in the previous matrix divided by 8 for this example. The columns of this new matrix will add to one. This is the PWM (or at least one version of it - see Hertz and Stormo, 1999).

      1      2      3      4      5      6
A     0      0      0      .625   .125   .750
C     0      .125   .125   .250   .500   0
G     .375   0      .750   0      0      0
T     .625   .875   .125   .125   .375   .250

Now we can score a search sequence at each position according to how well it matches the promoter sequence by using the PWM as our basis for scoring, and the concept of likelihood ratios. For a search sequence $(s_1, \ldots, s_l)$, calculate the log-likelihood ratio as:

$$\log LR = \sum_{i=1}^{l} \log \frac{P(s_i \mid \text{real promoter})}{P(s_i \mid \text{random DNA})}$$

Then high-scoring positions according to this calculation are analyzed for further investigation. Determining the significance of such scores is a little more difficult, and often simulation methods are used.
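A Python sketch (ours) of building and using a PWM in this way; the two training sequences are placeholders rather than the eight real promoter sequences from Figure 5.3, and the floor argument is a crude stand-in for the pseudocounts a careful implementation would use:

    import math

    BASES = "ACGT"

    # Placeholder training set; the real calculation would use the eight
    # -35 promoter sequences from Figure 5.3.
    training = ["TTGACA", "TTTACA"]

    def build_pwm(seqs):
        # Column-wise base frequencies across the aligned training sequences.
        return [{b: sum(s[i] == b for s in seqs) / len(seqs) for b in BASES}
                for i in range(len(seqs[0]))]

    def pwm_log_lr(window, pwm, background=0.25, floor=1e-3):
        # Log-likelihood ratio of the window under the PWM versus uniform
        # random DNA; the floor crudely handles zero counts.
        return sum(math.log(max(pwm[i][b], floor) / background)
                   for i, b in enumerate(window))

    pwm = build_pwm(training)
    print(pwm_log_lr("TTGACA", pwm))  # high score: matches the training motif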

Figure 5.4: GenBank results for stromelysin-3 gene (some parts deleted for brevity)

LOCUS       HSSTROM3                1660 bp    DNA     linear   PRI 18-APR-2005
DEFINITION  H.sapiens stromelysin-3 gene.
ACCESSION   X84664
FEATURES             Location/Qualifiers
     source          1..1660
                     /organism="Homo sapiens"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9606"
                     /map="22q11.2"
                     /clone="111E10"
                     /clone_lib="LL22NC01"
     TATA_signal     1439..1443
     GC_signal       1445..1450
     exon            1470..1599
                     /number=1
     gene            1492..>1599
                     /gene="stromelysin-3"
     CDS             1492..>1599
                     /gene="stromelysin-3"
                     /codon_start=1
                     /product="stromelysin-3"
                     /translation="MAPAAWLRSAAARALLPPMLLLLLQPPPLLARALPP"
ORIGIN
        1 ...
      601 aagctgaaga actggccagt ccctgccata tgcctcactt tcccctggga cacattttaa
      661 tatccctttc ctggccaggt gcaatggctt ccccatgtaa tcccagcact ttgggaggcc
      721 aaggtgggca gatcacttga ggtcaggagt tcgagaccag cctggccaac atggtgaaac
      781 cccatctcta ctaaaaatac aaaaattagc caggcatggt ggcgcacgtc tgtggtccca
      841 gctacttggg aggctgaggt agcagaatcg tttgaacctg ggaggcggag gttgcagtga
      901 gtggagatca caccactgca ctcctgcctg ggagacagag tgagactgtg tctcaaaaaa
      961 taataataat aaaaaataaa ataatatccc tttcctcaca ggggctattg tgtcatcttc
     1021 tagaaggatc cgttgaggct ctgaggggtg ggggaacttg cttgtgggta ggaccacctg
     1081 tcagaggtca gaggtcaggc caccaaggag acccagtggg atgcgccttc caaaggtggg
     1141 ggtacggatg ggacccatga aacctgactc ctctcagact ctagccaagt ctaagacttt
     1201 ggacggccac cacccagagg agaaactgag acccagagcg gcacgggttg gccagggtca
     1261 accagcacca gatagggact ttggcagccc cggggcagga ccctgtctcc ggccctcgac
     1321 cccgctgggc cgtaccctcc ccgttcacct cccccacccg ggccgcggct gctaggagag
     1381 ttcagaacaa aaggcggcgg ggggcggggc cgaggcgggc cgggggtggg gcggaagcta
     1441 taaggggcgg cggcccggag cggcccagca agcccagcag ccccggggcg gatggctccg
     1501 gccgcctggc tccgcagcgc ggccgcgcgc gccctcctgc ccccgatgct gctgctgctg
     1561 ctccagccgc cgccgctgct ggcccgggct ctgccgccgg tgagtgcccg ccactcgccg
     1621 gccgctcctc gctgaggggg cgccgggcac gcgggctggg

Chapter 6

Microarray Analysis

6.1 Overview

This chapter focuses on the analysis of DNA microarray experiments, or just microarrays. The overall goal of such experiments is typically to conduct an analysis of gene expression levels across different experimental conditions. This can help us better determine the function of genes, as well as pinpoint genes that are associated with various conditions, such as having breast cancer, having leukemia, or being exposed to toxins. It will also allow us to better understand how genes work together, and in what situations they tend to be "turned on" or "turned off" at the same time.

6.2 Background on Gene Regulation and Expression

First, we need to review some more biological background in order to fully understand the purpose of microarray experiments and how they work. We will discuss the concepts of gene regulation and expression. Our biological discussions up until now have focused on the biological and molecular mechanisms of how genes are transcribed and translated. Now, we want to answer questions like why, when, and how often are they transcribed and translated.

An important concept to realize is that not all genes are active at all times in all cells. In other words, a human cell may have 25,000 genes encoded in the DNA of its constituent chromosomes, but that doesn't mean that all 25,000 take turns being transcribed and translated in that cell. Genes and their resulting products have particular functions, and so their transcription and translation are regulated as necessary to meet the needs of the cell and the organism as a whole. The amount of mRNA or protein present in a cell that results from transcribing and translating a gene is referred to as the expression level or gene expression for that gene in that cell.

Some genes are called housekeeping genes. These genes (more specifically, the resulting RNA and protein products) are necessary to run basic metabolic and cellular processes, and are typically active in all cells. For example, genes that produce RNA polymerases and those that produce transcription factors, both of which play a role in the process of gene transcription, are needed in every cell. Therefore, housekeeping genes are not differentially regulated from cell to cell, and tend to be transcribed/translated at nearly the same rate at all times, in all cells. Attempts have been made to identify all genes that would be considered housekeeping genes (see Eisenberg and Levanon, 2003 and the related website http://www.cgen.com/supp_info/Housekeeping_genes.html).

Some genes are regulated and expressed only in certain cells. For example, brain cells, muscle cells, skin cells, and liver cells all have very different functions that are specific to that particular kind of cell. For example, the genes necessary to make a brain cell process signals are "turned on" in brain cells, but "turned off" in muscle cells. The genes necessary to make muscle cells contract are turned on in muscle cells but turned off in other cells.

Other genes, called inducible genes, are expressed differently in response to different environmental conditions. For example, injecting a certain toxin into a cell may turn on genes that can respond to this threat. Genes that can assist in attacking an invading virus will be turned on in that situation. A change in the level of some substance entering a cell might trigger a response from the cell by producing more of a certain gene product. Also, genes are often expressed differently depending on the cell cycle. For example, they may be expressed at only certain stages of the cycle.

Other important environmental factors that cause differential reaction of genes are gender, presence of cancerous cells, presence of diseased cells, and other such factors.

Taken all together, what we would like to be able to do is understand what genes are turned on and turned off under different conditions. Another way to say this is that we want to learn what genes are differently expressed from one condition to another, and how large that change in expression is. We say that a gene which is expressed more under one condition compared to a control situation is up-regulated, and say that it is down-regulated in the opposite situation.

For example, it is certainly a primary focus of the medical field to understand the mechanisms of breast cancer. If we can understand what genes are differently expressed in breast cancer cells as opposed to non-cancerous cells, that would help us better understand breast cancer, and potentially engineer treatments for it. The same statements can be made for other diseases like leukemia or Hodgkin’s.

6.3 Overview of Microarray Experiments

Microarray experiments produce enormous amounts of data. We will now give an overview of microarray experiments and how they are conducted so that we have a good understanding of how all of this data is collected and what it means.

6.3.1 Primary Purpose and Key Questions

The primary purpose of microarray experiments is to do exactly what we stated above: detect and measure the different expression levels of genes across different conditions. Some examples of the different conditions we might compare are:

1. Diseased versus normal tissue
2. Different time periods (changes over time)
3. Before/after injection of a hormone
4. Before/after exposure to toxins
5. Before/after chemotherapy in cancer cells
6. Two different forms of leukemia
7. Male versus female

The types of questions we would like to answer from the data we collect generally fall into one of three categories:

1. Which genes are differently expressed from one condition to another?
2. What genes are co-expressed across conditions (i.e., which are up- or down-regulated at the same time)?
3. Can we classify (predict) the condition under which a tissue sample was taken?

As a result, we will better understand the genes that affect and play a role in these conditions.

6.3.2 Microarray Technology

The term microarray is used because a microarray experiment is conducted on a solid physical medium (such as a glass slide) on which an array, or grid, is set up to collect and observe the data. An example of what one of these microarrays looks like is in Figure 6.1.

Figure 6.1: Example of microarray technology (Stekel, 2003).

There are two main types of microarrays. One type is called spotted microarrays, and the other is called high-density oligonucleotide arrays. Essentially, the end result is the same whichever technology is used, in terms of the nature of the data collected and the final analyses that get conducted. However, the details of how each works and is constructed are different, as are some of the intermediate steps that need to be taken. We will focus specifically on spotted arrays as our basis for understanding microarray experiments and the data that is collected. Details of oligonucleotide arrays can easily be picked up on your own if necessary.

From Figure 6.1, we see that a spotted microarray truly resembles an array of spots. The glass slide, on which the microarray resides, can be seen to be structured in a 12 by 4 grid. Each of these 48 grids consists of a 12 by 16 array of spots, or positions. So, overall, this one microarray contains 12 × 4 × 12 × 16 = 9,216 spots. We will also refer to these as positions or features. Other spotted microarrays have a 32 by 32 array of spots within the overall 12 by 4 grid, giving a total of 49,152 positions.

To get a quick idea of the wealth of data collected during a microarray experiment, note that each position on a microarray is a point of data collection. From above, we see that a single microarray has from 9,000 to 50,000 such positions. A microarray experiment consists of a number of individual microarrays (possibly quite a few). So, we quickly wind up with lots and lots of data.

Let’s talk more about each of these positions on a microarray with a quick overview. Each position represents a different gene in the organism being studied. At a particular position on the microarray is a DNA probe from that gene. Essentially, this is part of the DNA sequence that codes for that gene. These sequences and the DNA probes can be obtained from DNA libraries. For much more information on DNA probes, and how the choice is made between bad and good ones, see Chapter 3 of Stekel (2003).

6.3.3 The Microarray Experiment

Now let’s discuss the steps in the microarray experiment. There are a number of differences in this procedure depending on the type of microarray and other factors, but the main ideas are the same.

Remember, the main idea is that we want to compare the expression levels of genes across different experimental conditions. Let’s assume that we are studying a certain tissue under two conditions and review the experimental process.

We start by extracting mRNA from the tissue from each of the two conditions. Why do we extract mRNA? The mRNA present in the cells of that tissue at any point in time is representative of the genes that have actually been transcribed by the cell and are currently active. Genes that are active at that time will have some or possibly many resulting mRNA molecules present, while genes that are not active at the time will have little or no resulting mRNA present. Understanding this concept is key to understanding microarray experiments.

Each mRNA molecule is then reverse transcribed into complementary DNA, or cDNA. This is the DNA sequence that is complementary to the starting mRNA (i.e., T’s instead of A’s, C’s instead of G’s, etc.). We will see in a moment why this is done.

Then, an important step is that the cDNA is labelled with fluorescent dyes. The dyes are green and red. The cDNA that came from the tissue subjected to the first condition is labelled with green, and cDNA that came from the tissue subjected to the second condition is labelled with red. This is how we will be able to differentiate between the conditions later.

Now, these cDNA samples are placed at each position on the microarray. After certain steps are taken to ensure the integrity of the experiment, hybridisation is allowed to occur. The cDNA placed at a spot on the array will now hybridise with the DNA probes that had previously been placed there from a particular gene. If that gene was active in that tissue, the cDNA will hybridise to the DNA probe (since they are complementary strands). If not, little or no hybridisation will occur.

Next, the array is washed. The main idea is that any un-hybridised DNA on the array will be washed away, leaving only the hybridised DNA. It is important to notice the effect of this, and the difference we’ll see between the two conditions. If the gene was active under the first condition, that position will be green in color. If the gene was active under the second condition, that position will be red. If it was about equally active under both conditions, we will see yellow (a mix of red and green).

Now, we analyze the resulting image with that in mind. We will discuss this in much more detail.

6.3.4 Image Analysis of the Microarray

The raw data from the microarray experiment is contained in the resulting color image of the array (typically obtained by a scanner). A single microarray image file can be up to 32MB in size for a typical experiment (Stekel, 2003), so data storage can quickly become an issue.

The important aspects of this image are the color and intensity of light at each position. A position (i.e., representing a gene) comprises a number of pixels in this image. An image processing program must determine which pixels in the image correspond to which position in the array. Even though the array is a set of seemingly regular, rectangular grids of positions, irregularities may be present that make this task more difficult (e.g., see Figure 6.2). The image analysis software must be calibrated in order to correctly locate the pixels associated with each position.

Once the pixels associated with a position are located, the image analysis software analyzes each position and reports a number of measurements. Results are reported separately for each fluorescent dye; the two dyes are called channels. Channel 1 typically refers to the Cy3 green dye, and channel 2 typically refers to the Cy5 red dye (although the definitions may be switched in some cases). In other words, the color of a position is read in two pieces: how much green and how much red.

Some of the most important information that is reported for each position is:

1. The mean and median channel intensity of the pixels in that position. The maximum possible value depends on the scanner and the microarray itself, but essentially this measures “how much green” and “how much red” is in that spot. We will refer to these reported values as Ch1Int and Ch2Int for the mean intensities, and Ch1Int(Median) and Ch2Int(Median) for the medians.

2. The background mean and median channel intensity. This measures the green and red intensity levels, but for pixels near the position, not in the position itself. This helps to account for the natural fluorescence of the glass slide and other possible experimental errors (as we’ll see later). We’ll refer to these values as Ch1BG and Ch2BG for the background mean intensities, and Ch1BG(Median) and Ch2BG(Median) for the background medians.

3. The standard deviations of the channel intensities in the position. This measures how constant the intensity is across all pixels in the position. A high standard deviation can be a sign of a problem, such as a bright dust speck in that position which produced very high intensities only for some of the pixels.

4. A flag that reports whether the data measured for the position is good or not (typically “0” means everything seems ok).

Figure 6.2: Irregular array example, with curved grid columns and uneven grid alignment and spot spacing (Stekel, 2003).

An example of real data collected can be found on the course website in the file “17183(updated).xls”. This is one array from an aging experiment in humans, and represents fibroblast cells. I have cleaned it up somewhat to remove many recorded fields that are unnecessary for our purposes.

6.3.5 Within Array Data Manipulation

Once data has been retrieved from the image and recorded, it must be analyzed and cleaned in a number of ways. One step is “within array data manipulation” which takes reported intensity values and standardizes them using techniques to be discussed in this section. The second major step is “between array data normalization” which normalizes data from one array to the next. That is the subject of the next section.

For data collected from a single microarray, we must control for many possible sources of experimental variation that could otherwise affect our results. This is first done by subtracting background intensities from channel intensities to get a net intensity for each channel. The typical calculations are:

Ch1NetInt = Ch1Int − Ch1BG(Median)

for channel 1, and

Ch2NetInt = Ch2Int − Ch2BG(Median)

for channel 2. If either of these net intensity values is not positive, the value is either reset to the lowest possible intensity value (one), or the gene is ignored. This will happen if the background intensity for the channel is higher than the intensity within the position. It could be an error or a local problem on the array, or it could represent a gene that has a very low expression level on that array for that channel.

Another possible source of experimental variation that we need to account for is the different levels of green and red dye that may have been introduced into the cDNA prior to placing it on the array. This could potentially skew green and red intensity levels away from the control situation. This potential bias is accounted for by computing a normalization factor for the array. The computation is based on a regression method which we will not discuss. For our purposes, the Normalization Factor will be reported, and leads to calculating the Channel 2 Normalized Net Intensity as follows:

Ch2NormalizedNetInt = Ch2NetInt / Normalization Factor

Notice that this is only computed for one of the channels (Channel 2), as this normalizes that channel to the same control level as the other. More information on the need for this normalization and the underlying regression technique can be found in Section 5.3 of Stekel (2003).

The next step is to take logs and compute a final log ratio for each position. The purpose of taking logs is to create a more even spread and a bell-shaped look to the distribution of the intensity data. Otherwise, the scale of raw intensities is very right-skewed, with most values in the hundreds, but a few in the 10,000 to 30,000 range. Typically log base 2 is used, and the following calculations are made:

G = log2(Ch1NetInt)

R = log2(Ch2NormalizedNetInt)

The variable G now represents the log net intensity of the green channel for that position, and R represents the log net intensity (after normalization) of the red channel for that position.

Finally, the value typically reported for a single position on the array (and therefore, for a gene) is the difference of the R and G values:

L = log2( Ch2NormalizedNetInt / Ch1NetInt ) = R − G

Notice that this value L for a gene is the log ratio of the net intensity values for the two channels.

Remember, we are making the above calculations for each position on the array. Putting it all together, for a single array, we have L1, . . . , Lg log ratio values, one for each of the g genes represented on the array. These values can be analyzed to give an idea as to whether a gene was up- or down-regulated across the channels on that array. Remember, the channels really represent two different experimental conditions, such as normal versus cancerous cells, or normal versus hormone-injected cells.
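To make the sequence of within-array calculations concrete, here is a minimal sketch in Python. This is just for illustration (these notes do not prescribe any particular software); the function and argument names are made up, and numpy is assumed to be available.

import numpy as np

def within_array_log_ratios(ch1_int, ch2_int, ch1_bg_med, ch2_bg_med, norm_factor):
    # All inputs except norm_factor are arrays with one entry per spot;
    # norm_factor is the single number reported for the array.
    ch1_net = ch1_int - ch1_bg_med       # Ch1NetInt = Ch1Int - Ch1BG(Median)
    ch2_net = ch2_int - ch2_bg_med       # Ch2NetInt = Ch2Int - Ch2BG(Median)
    ch1_net = np.maximum(ch1_net, 1.0)   # reset non-positive values to one
    ch2_net = np.maximum(ch2_net, 1.0)
    ch2_norm = ch2_net / norm_factor     # Ch2NormalizedNetInt
    G = np.log2(ch1_net)                 # log-2 green net intensity
    R = np.log2(ch2_norm)                # log-2 normalized red net intensity
    return R - G                         # L = R - G for each spot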

Our interpretation of one of these log ratios will be as follows:

Li near 0: Gene i equally regulated across conditions
Li near +1: Gene i up-regulated by a factor of 2 in the second condition compared to the first
Li near −1: Gene i up-regulated by a factor of 2 in the first condition compared to the second
Li near +2: Gene i up-regulated by a factor of 4 in the second condition compared to the first
Li near −2: Gene i up-regulated by a factor of 4 in the first condition compared to the second

and so on.
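In general, since Li is a base-2 log ratio, a value Li corresponds to a fold change of 2^Li between the conditions. For instance, Li = 0.58 indicates roughly a 1.5-fold (2^0.58 ≈ 1.5) up-regulation in the second condition, and Li = −0.58 the same factor in favor of the first.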

6.3.6 Between Array Normalization

Recall that the log ratio values computed above (and the resulting interpretations) are only for a single array. An individual array is typically used to measure the expression levels of many genes for a single subject across two conditions. Of course, a single subject does not provide good information on the truly significant differences that may be present between gene expressions and conditions. To measure variability across subjects (or time periods, or other experimental units), the microarray experiment is run a number of times. If we are talking about an experiment dealing with patients, for example, the overall experiment might result in one microarray for each of 20 patients (maybe 10 have cancer, and 10 are controls who do not have that cancer). So, for multiple subjects, gene expression is measured on different arrays.

Because of the very complex processes that occur to create and extract data from a single array, there is certainly the potential for unwanted experimental variability across arrays. For example, the total amount of red and green dye may have been different from one array to the next, or maybe the scanner settings were slightly different from one scanning session to the next. We need to account and control for these possible variations between different arrays by normalizing the resulting data.

The main technique for doing this is to perform standard normalization on the original G and R values from each array. For example, suppose Array 1 (call it A1) has values G1, . . . , Gg from the green channel, one for each gene. Let ḠA1 be the mean of the green channel values on this array, and let SG(A1) be the standard deviation of these values. Then, we compute a normalized G value for each gene (i = 1, . . . , g) on that array as follows:

Gi′ = (Gi − ḠA1) / SG(A1)

We then do the same for R1, . . . , Rg on that array to create normalized values R1′, . . . , Rg′. Then we do the same for each array, where the normalizing mean and standard deviation are computed from that array’s own base G and R values only.

The log ratio can now be computed from these normalized values, and the end result is a set of normalized log ratios across arrays. The notation we will use for these log ratios is Li1, . . . , Lin, where the first subscript runs over all genes measured in the study (i = 1, . . . , g) and the second subscript represents the array number (1, . . . , n). Each of these L values is a log ratio as discussed before, now just computed on the between-array normalized data:

Lij = Rij′ − Gij′
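The standardization above is just a per-array z-score, so it is short to code. A sketch in the same illustrative Python setting as before (function name and array layout are assumptions):

import numpy as np

def between_array_log_ratios(G, R):
    # G and R are g-by-n arrays of log net intensities:
    # one row per gene, one column per array.
    G_norm = (G - G.mean(axis=0)) / G.std(axis=0, ddof=1)  # per-array mean/sd
    R_norm = (R - R.mean(axis=0)) / R.std(axis=0, ddof=1)
    return R_norm - G_norm                                 # Lij = R'ij - G'ij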

6.3.7 Microarray Data on the Web

There are many microarray experiment datasets available publicly on the internet. One of the best and easiest to navigate is the Stanford Microarray Database (SMD) found at http://smd.stanford.edu/index.shtml. To get to microarray data, click Public Login on the left side. You then have many choices available in order to find an archived experiment you are interested in. Data can be viewed or downloaded, and even the microarray image itself can be viewed and interactively clicked. There is a wealth of information available here. You are encouraged to play around with some of these experiments.

6.4 Microarray Data Analysis

We will now discuss methods for answering the three key gene expression-related questions we earlier posed. The questions were

1. Which genes are differently expressed from one condition to another?

2. What genes are co-expressed across conditions (i.e., which are up- or down-regulated at the same time)?

3. Can we classify (predict) the condition under which a tissue sample was taken?

6.4.1 Identifying Differently Expressed Genes

In lecture, we will discuss three different experimental designs that are common in microarray experiments. Each leads to a slightly different view of the microarray data and the resulting log ratios. In the end, the particular design chosen leads to different options for conducting tests. For these designs, we will discuss using paired t-tests and non-pooled t-tests to identify genes that are differently expressed from one condition to the other. We will discuss these tests in class. A couple of examples are below. Data for these examples is taken from the Stanford Microarray Database and, to keep things simple, does not necessarily represent all the data available for those particular experiments (and so these do not represent definitive answers to the underlying questions). More information on these examples can be found in Perou, et al. (2000) and Golub, et al. (1999).

T-test Examples

Example 6.1: Paired t-test example: Breast Cancer before and after chemotherapy

Gene: T cell receptor alpha

Data: 4 patients, data has been normalized within and between arrays

L1A = 0.371   L1B = 1.390   L1 = 0.371 − 1.390 = −1.019
L2A = 0.680   L2B = 1.856   L2 = 0.680 − 1.856 = −1.177
L3A = 0.867   L3B = 0.712   L3 = 0.867 − 0.712 = 0.155
L4A = 0.019   L4B = 0.518   L4 = 0.019 − 0.518 = −0.500

The sample mean and standard deviation for L1, . . . , L4 are

L̄ = −0.635   SL = 0.601

so the test statistic is

T = −0.635 / (0.601/√4) = −2.115.

It has a t3 distribution, and the p-value is 0.1248. This is small, but we would not reject the null hypothesis, probably due to the small sample size. At this point, there is not enough evidence to suggest that this gene is differently regulated in breast cancer cells from before to after chemotherapy.
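The same paired t-test can be reproduced with standard software. A sketch using Python’s scipy (an illustration only, not the software used for these examples):

import numpy as np
from scipy import stats

LA = np.array([0.371, 0.680, 0.867, 0.019])  # before chemotherapy
LB = np.array([1.390, 1.856, 0.712, 0.518])  # after chemotherapy
t_stat, p_value = stats.ttest_rel(LA, LB)    # paired t-test on L = LA - LB
print(t_stat, p_value)                       # about -2.115 and 0.125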



Example 6.2: Non-pooled t-test example: ALL vs AML leukemia patients

Data: 5 ALL patients (condition 1) and 3 AML patients (condition 2). Data has been normalized within and between arrays.

ALL: L11 = 1.683, L12 = 1.700, L13 = 0.864, L14 = 1.907, L15 = 0.676

AML: L21 = 1.557, L22 = 1.839, L23 = 0.357

From these we calculate

L̄1 = 1.366   SL1 = 0.555
L̄2 = 1.251   SL2 = 0.787

giving us the non-pooled t-test test statistic

T = (1.366 − 1.251) / √( (0.555)²/5 + (0.787)²/3 ) = 0.222

This has approximately a t6 distribution, producing a p-value of 0.831. This is very high and we would not reject the null hypothesis, suggesting that this gene is not differently regulated in the two types of leukemia.
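The non-pooled test is equally easy to reproduce in software. Note that scipy’s Welch-Satterthwaite degrees of freedom differ slightly from the t6 approximation used above, so the p-value it reports will be close to, but not exactly, 0.831 (again a sketch for illustration):

import numpy as np
from scipy import stats

ALL = np.array([1.683, 1.700, 0.864, 1.907, 0.676])  # condition 1
AML = np.array([1.557, 1.839, 0.357])                # condition 2
t_stat, p_value = stats.ttest_ind(ALL, AML, equal_var=False)  # non-pooled
print(t_stat, p_value)                               # about 0.222 and 0.84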



Multiple Testing Issue

Remember, in a real microarray experiment, we would conduct tests like those above for all genes represented on the array (thousands to tens of thousands). We will compute a test statistic and want to make a rejection/non-rejection decision for each. But, since we are conducting so many tests, we are likely to have many false positives if we use the standard approach to testing each hypothesis.

For example, say there are g = 10,000 genes on the array. We conduct tests like the t-tests above on each, using an individual significance level of αI = .05 for each test. Even if all 10,000 genes were truly NOT differently expressed, our analysis would be expected to declare about 500 genes differently expressed. That is, we would have about 500 false positives. This problem comes about because we are conducting so many hypothesis tests.

Recall the standard chart that is used to understand possible errors in hypothesis testing:

                                        Truth
                             Gene is not diff. exp.    Gene is diff. exp.
Declare it not diff. exp.    ? (1)                     False Negative (2)
Declare it is diff. exp.     False Positive (3)        ? (4)

Historically, a common solution to the problem was to set αI much lower in order to control the family-wise error rate, or FWER, at α = .05. This approach attempts to minimize the probability of making even a single false positive across the many tests. This leads to a very small individual significance level αI for each test (in fact, αI = .05/10000 = .000005 in the case of 10,000 genes), and therefore very few rejections. What happens is that we have set the false positive rate so low that the false negative rate becomes quite high. So, while we have a very small probability of declaring a gene significant when it wasn’t, we have a very high probability of not declaring a gene significant when it is. We have traded off false positives for false negatives, which might not be very helpful in microarray analyses. Our goal is to find genes that are differently expressed, and we are probably willing to accept some false positives in order to correctly find such genes.

A newer approach to multiple testing controls the false discovery rate, or FDR. The FDR is defined as the number of incorrectly rejected hypotheses divided by the total number of rejected hypotheses. In the table above, it would be the number of times we are in box (3) divided by the total number in boxes (3) and (4). So, an FDR of .05 says that for every 100 times we reject the hypothesis (i.e., for every 100 genes we declare to be differently expressed), 5 of them will have been incorrectly rejected (i.e., 5 of those 100 genes would not truly have been differently expressed).

To use the FDR approach and what is called the step-up method, we decide on the false discovery rate we want to control and call it q. We then take all of our p-values from the set of tests we are conducting, rank them, and start with the smallest. Refer to this ordered list of p-values as P(1), . . . , P(i), . . . , P(g), where again g is the total number of genes. The i-th ordered p-value P(i) is compared to the critical value qi/g. We search for the largest i for which P(i) ≤ qi/g. Let m = max{ i : P(i) ≤ qi/g } be this largest value of i. We then reject all hypotheses related to genes 1, . . . , m as labeled in the ordered ranking.
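The step-up procedure is straightforward to implement. Here is a minimal sketch in Python (numpy assumed; the function name is made up):

import numpy as np

def fdr_step_up(p_values, q=0.05):
    # Step-up method: reject hypotheses 1..m in the ranked order,
    # where m is the largest i with P(i) <= q*i/g.
    p = np.asarray(p_values)
    g = len(p)
    order = np.argsort(p)                            # ranks the p-values
    below = p[order] <= q * np.arange(1, g + 1) / g  # compare to qi/g
    reject = np.zeros(g, dtype=bool)
    if below.any():
        m = np.max(np.nonzero(below)[0])             # largest qualifying rank
        reject[order[:m + 1]] = True
    return reject   # True = gene declared differently expressed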

An example of performing this step-up procedure is below. The total number of genes was g = 200, and the FDR was set at q = .05. The spreadsheet shows just the first 27 ranked genes (i.e., the 27 lowest p-values). In this simple example, just the first gene has a p-value that meets the criterion, so we claim only this one gene to be differentially expressed.

There are other methods for using FDR to conduct many multiple comparisons as well. See Lee (2004) for a good discussion of other methods.

Other Designs and Tests

It should be noted that there are many other statistical procedures for conducting tests regarding individual genes on the microarray. We have discussed the paired and non-pooled t-tests because they are fairly simple to use and also very common. The drawback of these t-tests is that the log-ratio data need to follow normal distributions and have no major outliers.

There are plenty of alternatives to t-tests if we wish to try other methods. Some are non-parametric tests, which base testing on ranks instead of t-statistics, and bootstrap analysis, which is more computationally intensive.

Also, we have presented just three different kinds of simple experimental designs that are common in microarray experiments. However, there are other design possibilities that are more complex and require different analysis techniques. Some examples:

1. Gene expression measured at 10 time points after injection of a hormone.

2. Compare four different types of cancer cells.

Such experiments are also common and are analyzed using techniques like analysis of variance (ANOVA) and related methods. If you are interested, Chapters 9 and 10 in Lee (2004) give a good discussion of more complex designs of microarray experiments and methods for analysis.

6.4.2 Identifying Genes with Patterns of Co-expression

We now turn to the problem of searching for genes that are co-expressed (or co-regulated) across conditions. This can help us understand the complex interactions that genes have and the underlying gene networks that control cellular functions or are associated with diseases. Answering this question comes down to determining which genes tend to be turned on at the same time under a certain condition. A couple of examples are below.

1. The two conditions may be normal cells versus B-cell Lymphoma cells (a type of cancer in the lymph node system). What genes are co-expressed in these cancer cells?

2. What genes are co-regulated over time after injection of a hormone into cells?

With respect to the first of these examples, a helpful view of the data collected from the microarray experiment is as shown below.

            Array 1      Array 2      ∙∙∙    Array n
            Subject 1    Subject 2    ∙∙∙    Subject n
Gene 1      L11          L12          ∙∙∙    L1n
Gene 2      L21          L22          ∙∙∙    L2n
  .           .            .                   .
Gene g      Lg1          Lg2          ∙∙∙    Lgn

The values Lij in this matrix are just as we have discussed before for microarray data: Lij is the log2 ratio of the expression under one condition (e.g., cancer cells) compared to the second condition (e.g., normal cells) for gene i in subject j.

Now, the goal is to search for similar patterns of gene expression across subjects. In analytic terms, we want to cluster (or group) together genes that appear to have a similar expression profile (as measured by the Lij) across subjects.

The methods we use to do this are called clustering methods. The basic concept is that we will calculate a distance metric between each pair of genes (that measures how similarly or dissimilarly they are expressed) and then use clustering methods to see which genes are “close to” each other according to this distance.

As motivation for understanding how clustering methods work, consider a simple situation where there are only two subjects, so we have expression measurements Li1 and Li2 for each gene i = 1, . . . , g. In this case we can make a two-dimensional scatterplot of the Li2 versus Li1 values, with a different point on the plot for each gene i. An example of such a plot is below.

In this two-dimensional example, we can think of distances between genes as the straight-line Euclidean distance between points on the plot. The distance will be small if the genes were expressed similarly in the two subjects, and will be large if they were expressed very differently in the two subjects. As the dotted ovals in the plot show, we can then look for groups or clusters of genes that have a similar expression across the two subjects (i.e., are close to each other). We might conclude in this small example that the genes represented within each of the oval clusters are similarly expressed across the conditions.

We need to discuss two further aspects of this problem. First, having just two subjects is nice for drawing scatterplots, but not realistic. So, we’ll have to calculate distances between genes regardless of how many subjects there are. Second, we need to discuss how to more rigorously perform the clustering (i.e., draw the ovals).

Distance Metrics

In the more general setting of having n subjects, we can define distance using the standard extension of Euclidean distance to n dimensions. Specifically, for any two genes with indices a and b, we define the Euclidean distance between them to be

dab = √( Σj=1..n (Laj − Lbj)² )

where the sum is over the n subjects, and this calculation can be made for any a ≠ b = 1, . . . , g.

A second common method for measuring the distance between two genes is based on correlation coefficients. The correlation coefficient between two genes measures the strength of their linear relationship across subjects. For two genes a and b it is defined as

rab = ( Σj=1..n Laj·Lbj − n·L̄a·L̄b ) / √( (Σj=1..n Laj² − n·L̄a²) · (Σj=1..n Lbj² − n·L̄b²) )

where L̄a and L̄b are the means across subjects for genes a and b, respectively. The above formula is Pearson’s correlation coefficient, and this calculation is provided by most analysis software programs, so we do not need to worry about it in too much detail. The main thing to understand, if you haven’t seen correlation coefficients before, is that the closer the value is to +1, the more positively correlated the two genes are; the closer to −1, the more negatively correlated they are; and the closer to 0, the more nearly uncorrelated they are.

Another common method for calculating correlation is Spearman’s correlation coefficient, which bases the calculation on the ranks of the data instead of the data itself. Otherwise, the equation above is used as stated.

Actually, the correlation coefficient does not work directly as a measure of distance, because it measures similarity between genes. So, to transform it into a distance measure, we do the following:

dab = 1 − rab²

There are other ways to measure distance as well, but these are fairly common in co-expression analysis of microarray data.
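Both distance measures are one-liners in practice. A sketch (Python/numpy assumed; function names are made up for illustration):

import numpy as np

def euclidean_distance(La, Lb):
    # d_ab = sqrt of the sum over subjects of (Laj - Lbj)^2
    return np.sqrt(np.sum((np.asarray(La) - np.asarray(Lb)) ** 2))

def correlation_distance(La, Lb):
    # d_ab = 1 - r_ab^2, with r_ab Pearson's correlation across subjects
    r = np.corrcoef(La, Lb)[0, 1]
    return 1.0 - r ** 2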

Clustering Methods

Now that we know how to compute a distance between each pair of genes as a measure of how differently or similarly they are expressed across subjects, we can discuss how to group together genes that have similar expression profiles. These are called clustering methods. There are a number of different clustering techniques, but we will focus on one called hierarchical clustering. The result of the clustering is a picture called a dendrogram, which puts genes near each other if they are similarly expressed, and further apart if they are not similarly expressed.

The basic algorithm for hierarchical clustering is:

1. Collect all the distances between all pairs of genes into a distance matrix.

2. Group together the two genes that are closest (smallest distance) in this matrix.

3. Recompute the distance matrix after combining the newly grouped genes into one. Now, return to Step 2 and continue until all genes have been grouped.

The recomputing of the distance matrix in Step 3 above can be done in a number of ways. The most common is called average linkage, which simply takes the average of the pairwise distances from one group of genes to the other.
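In practice the whole loop is rarely coded by hand; for instance, scipy bundles the algorithm. A sketch with made-up data, Euclidean distances, and average linkage:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
L = rng.normal(size=(20, 6))        # hypothetical g-by-n log-ratio matrix

distances = pdist(L, metric="euclidean")     # Step 1: all pairwise distances
tree = linkage(distances, method="average")  # Steps 2-3, repeated to the end

# dendrogram(tree) draws the resulting dendrogram (requires matplotlib)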

At each pass through Step 2, where we group together genes or clusters of genes, we add a new notation to our dendrogram to show the new grouping. These techniques will all be demonstrated in class using the very good examples from Stekel (2003).

Hierarchical clustering is perhaps the most common clustering technique used in microarray analysis, but there are other clustering techniques that are used as well. Of these, k-means clustering and self-organized maps are fairly popular and are discussed in the Stekel reference.

6.4.3 Classification of New Samples

The third and final important application of microarray data is its use in helping to classify new samples. In other words, microarray data has become the basis of useful and important diagnostic tools to determine whether an individual has a certain disease. For example, do they have cancer (colon, breast, prostate, etc.), or which type of leukemia do they have?

The idea behind using microarray data as a diagnostic tool is the following. In our experiment, which was based on a certain small number of test subjects, we analyzed two conditions (cancer cells vs. non-cancerous cells, for example). When we analyze the resulting microarray data, it may be that a certain pattern of gene expression for one, two, or many genes is very typical of cancerous cells, but extremely atypical of non-cancerous cells. This gene expression profile, then, can serve as a way to distinguish between the two.

Then, consider a single individual in the future for whom we do not know whether they have this cancer or not. We can measure this gene expression profile for that person from a microarray study, and then analyze it to determine whether it points in the direction of cancer or not, based on the patterns we saw in the prior study. We then classify the cells extracted from this individual as cancerous or not.

In our discussion, we will keep things fairly simple and assume that the classification we are making contains only two possible categories (for example, cancer versus not-cancer, or ALL versus AML leukemia). Extensions allow for situations with three or more possible categories (for example, all four forms of leukemia). We will refer to one category as category 1 and represent it with circles (◦) in pictures, and the other category as category 2 and represent it with stars (⋆).

As a basic way of understanding how we will go about looking for a method of classification, let’s draw a picture of the data for two genes at a time. For any such pair of genes, the resulting picture can be classified into one of three general categories: linearly separable, non-linearly separable, and non-separable (represented by two subcategories), as noted in Figure 6.3. Of course, we will generally be in the non-separable case for real data, but what we will look for is a pair of genes that does the best job of separating the two classes. For example, the bottom right panel of the figure is better than the bottom left panel. Also, there is no particular reason we need to focus on just two genes at a time. We could analyze the patterns in the resulting classification data for three, four, or any number of genes at once. But for two genes, we can at least plot the data as in Figure 6.3 and easily view the strength of the classification.

Notice the purpose of the dotted lines in the figures. We can define the line in each picture as a classifier. Individuals on one side of it are classified into one of the two categories, and individuals on the other side are classified into the other category. Sometimes, mistakes are made (such as in the bottom right panel). These are called classification errors, and the percentage of such mistakes is called the classification error rate. The objective now is to find a function that best separates the two classes in these pictures, or in higher dimensions, where “best” is defined as having the smallest classification error rate.

Note that the functional form of the classifier need not be simple. In the upper left panel of Figure 6.3 it is just a straight line. In the upper and lower right panels, it is a fairly simple curved line. However, in general, it can be highly complex and non-linear. For example, see Figure 6.4, where the data from the lower panels of Figure 6.3 are replicated, but complex non-linear classifiers are defined.

Figure 6.3: Four possible scenarios for a particular pair of genes.

These classifiers have no classification error, but may still not be preferred, for two reasons. One is that they are highly complex, and we typically want to find classifiers that are a little easier to understand and define. The second is that the apparent zero classification error rate is probably very misleading. If we were to take another sample of different subjects and classify them according to these functions found with the original subjects, we would certainly not have zero classification error (especially in the left panel situation). This situation has low, or no, error on the original dataset, but probably has high generalization error on new data we come across. It is really the generalization error that we want to minimize, since we will be making classifications for new patients based on the patterns we find among the original experimental subjects, whose status we already know.

Based on the above discussion, the problem of finding a classifier has two parts: finding the functional form for the classifier, and measuring its generalization error rate.

Methods for Finding Classifiers

People in the fields of statistics, machine learning, and related areas have developed many models and techniques for performing classification, going back many years, mostly not at all specific to microarray data. All of these techniques are potentially useful for microarray classification; we will briefly mention a few, and discuss further details of one or two. Another technique, called top scoring pair, has recently been developed specifically for microarray data and will be discussed as well.

Figure 6.4: Highly non-linear classifiers.

Below is a listing of techniques commonly used in these problems. We will discuss the basics of a few of them in class.

1. Logistic regression
2. Support vector machines
3. Decision trees
4. Nearest neighbor classifiers
5. Neural networks
6. Linear discriminant analysis

Another recently developed (here at Hopkins; see Geman, et al., 2004), and very simple, methodology is called Top Scoring Pair, or TSP. The basic idea of this method is that it attempts to find the single pair of genes from the microarray study that does the best job of discriminating between the two classes. To find this best discriminating, or top scoring, pair, we calculate a score for each pair of genes i and j (i ≠ j = 1, . . . , g). This score is calculated based on the observed expression levels of the two genes across the samples. The formula is:

Δij = P(Xi ≤ Xj | Class 1) − P(Xi ≤ Xj | Class 2)

where the X’s are gene expression measurements (e.g., log ratios of R to G values) from the microarray for each gene. For example, consider the following summary tables for the pair of genes (1,2) and the pair of genes (3,4), based on a sample of size 50.

           X1 ≤ X2    X1 > X2    Total                X3 ≤ X4    X3 > X4    Total
Class 1       27          3        30      Class 1       14         16        30
Class 2        2         18        20      Class 2       11          9        20

We would calculate the scores Δ12 and Δ34 as

Δ12 = 27/30 − 2/20 = .800

Δ34 = 14/30 − 11/20 = −.083

where, in practice, pairs are compared by the magnitude |Δij|.

Of these two pairs, we see that the combination of genes 1 and 2 provides the better discrimination between the two classes, and therefore the higher score. We then calculate this score for all g(g − 1)/2 possible pairs, and pick the pair with the highest score as the basis for making predictions. It is possible that several pairs of genes will tie for the highest score, in which case we use all such pairs.
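The score is simple to compute from data. A sketch (Python/numpy assumed; the function name is made up, and class labels are coded 1 and 2):

import numpy as np

def tsp_score(Xi, Xj, labels):
    # Delta_ij = P(Xi <= Xj | Class 1) - P(Xi <= Xj | Class 2),
    # estimated by the observed fractions within each class.
    Xi, Xj, labels = map(np.asarray, (Xi, Xj, labels))
    le = Xi <= Xj
    return le[labels == 1].mean() - le[labels == 2].mean()

# Pairs are then ranked by abs(tsp_score(...)) over all g(g-1)/2 pairs.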

Assuming only one pair has the highest score, the method for making a prediction for a future subject is fairly obvious. Say that genes i and j were found to be the TSP. From a microarray, we measure the expression levels of these genes, Xi and Xj, for the new individual. Then, we classify the individual into Class 1 if Xi ≤ Xj, and into Class 2 if Xi > Xj. Notice how this serves as a very simple diagnostic test for cancer or other diseases. For example, let’s say we have a patient and we want to determine (i.e., predict) if they have prostate cancer (and we label having cancer as Class 1 and not having it as Class 2). We perform a microarray study on the patient and measure expression levels Xi = 2.6 and Xj = 1.1 for the two genes that were our TSP. Since Xi > Xj, we are then fairly confident that this patient does not have prostate cancer according to our TSP model.

There are a number of advantages to this TSP method. First, it is very simple to derive and to understand. It can lead to a simple understanding of the biological significance of the role individual genes play in distinguishing between the classes. Other modeling methods, such as logistic regression or support vector machines, model very complex non-linear relationships between the expression levels of many genes at a time, making them very difficult to interpret.

A second important advantage, more from a statistical point of view, is that no distributional assumptions are necessary for this method. Since it is purely based on ranking the expression levels of genes (i.e., counting how many times one gene’s level was higher than the other’s), there are no assumptions about normality or anything else. Also, this means that between-array normalization is unnecessary, since all that normalization does is rescale values; it does not change their ordering.

Finally, it is based on simple probability concepts, so it is also easy to understand and analyze from a probabilistic viewpoint. It turns out that this methodology is about as accurate as, and in some cases more accurate than, much more complex methods that depend on many genes at once.

Methods for Estimating the Accuracy of a Classifier

We have mentioned that estimating the accuracy of a classifier is very important, and it comes down to estimating what we called its generalization error. In other words, how good will the resulting predictions be if I begin using the classifier to make predictions for new, previously unseen subjects? If you think carefully about this question, it might seem impossible to measure generalization error, because by definition we do not yet have data for these new, previously unseen subjects, so how can we determine how good the predictions will be? This is very true, so the answer is that we will have to make judicious use of the sample of size n that we do have in order to estimate the classifier’s generalization error rate.

There are a few ways to do this. The two most common are called the holdout method and cross-validation. The second of these, cross-validation, is much more useful when there are small sample sizes, which is generally the case with microarray data. So, we will focus on this method and describe it more fully in lecture; a small sketch is below.
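As a preview of the cross-validation idea, here is a minimal sketch of the leave-one-out variant in Python. The fit and predict callables stand in for whatever classifier is being evaluated; they are placeholders, not a fixed API.

import numpy as np

def loo_error_rate(X, y, fit, predict):
    # Leave-one-out cross-validation: each sample takes a turn as the
    # "new, previously unseen" subject, and the classifier is trained
    # on the remaining n - 1 samples.
    n = len(y)
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i           # hold out sample i
        model = fit(X[keep], y[keep])      # train on the rest
        if predict(model, X[i]) != y[i]:   # test on the held-out sample
            errors += 1
    return errors / n   # estimate of the generalization error rate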

Bibliography

[1] Weir, B.S. (1996) Genetic Data Analysis II.

[2] Stekel, D. (2003) Microarray Bioinformatics.

[3] Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.

[4] Perou, C.M., et al. (2000) Molecular portraits of human breast tumours. Nature 406:747-52.

[5] Golub, T.R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:531-37.

[6] Sherlock, G., et al. (2001) The Stanford Microarray Database. Nucleic Acids Research 29:152-55.

[7] Ewens, W., Grant, G. (2001) Statistical Methods in Bioinformatics.

[8] Lange, K. (2002) Mathematical and Statistical Methods for Genetic Analysis, 2nd ed.

[9] Lee, M.T. (2004) Analysis of Microarray Gene Expression Data.

[10] Evett, I., Weir, B.S. (1998) Interpreting DNA Evidence: Statistical Genetics for Forensic Scientists.

[11] Hartl, Daniel L. and Elizabeth W. Jones (2005) Genetics: Analysis of Genes and Genomes, 6th ed.

[12] Elrod, Susan and William Stansfield (2002) Schaum’s Outlines: Genetics, 4th ed.

[13] Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443-53.

[14] Gotoh, O. (1982) An improved algorithm for matching biological sequences. Journal of Molecular Biology 162: 705-08.

[15] Mount, David W. (2004) Bioinformatics: Sequence and Genome Analysis, 2nd edition.

[16] Taylor, Howard M. and Samuel Karlin (1984) An Introduction to Stochastic Modeling.

[17] Orengo, C.A., Jones, D.T, and J.M. Thornton (2003) Bioinformatics: Genes, Proteins, & Computers.

[18] Zhang, M.Q. (1998) Statistical features of human exons and their flanking regions. Human Molecular Genetics 7(5): 919-32.

[19] Snyder, E. E. and Stormo, G. D. J. (1995) Identification of coding regions in genomic DNA. Journal of Molecular Biology 248: 1-18.

[20] Hertz, G.Z. and G.D. Stormo (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15:563-77.

[21] Eisenberg, E. and E.Y. Levanon (2003) Human housekeeping genes are compact. Trends in Genetics 19:362-65.

[22] Geman, D., d’Avignon, C., Naiman, D., and Winslow, R. (2004) Classifying gene expression profiles from pairwise mRNA comparisons. Statistical Applications in Genetics and Molecular Biology 3(1).

[23] Xu, L., Tan, A-C., Naiman, D., Geman, D., and Winslow, R. (2005) Robust prostate cancer marker genes emerge from direct integration of inter-study microarray data. Bioinformatics 21: 3905-11.

[24] Tan, A-C., Naiman, D., Xu, L., Winslow, R., and Geman, D. (2005) Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics 21: 3896-3904.