Population Genetics

Population genetics Tutorial Peter Pfaffelhuber and Pleuni Pennings November 30, 2007 University of Munich Biocenter Großhaderner Straße 2 D-82152 Planegg-Martinsried 3 Copyright (c) 2007 Peter Pfaffelhuber and Pleuni Pennings. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License". Preface This manuscript is written for the course population genetics computer lab given at the LMU Biozentrum, Martinsried, in the summer 2007. The course consists of nine sections which contain lectures and computerlabs. The course is meant for you if you a biologist by training, but is certainly also useful if you are bio-informatician, mathematician or statistician with an interest in population genetics. We expect you to have a basic understanding of genetics, statistics and modelling, although we have introductions to these subjects in the course. You get insights into population genetics from a theoretical but also from an applied side. This results in developing theory on the one side and applying it to data and simu- lations on the other. After following the course you should have a good understanding of the main methods in molecular population genetics that are used to analyse data. If you want to extend you insights, you might wish to look at books like Durrett (2002), Gillespie (2004), Halliburton (2004), Hartl and Clark (2007), Hedrick (2005) or Nei (1987). We are grateful to Tina Hambuch, Anna Thanukos and Montgomery Slatkin, who developed former versions of this course from which we profited a lot. Agnes Rettelbach helped us to fit the exercises to the needs of the students and Ulrike Feldmann was a great help with the R-package labpopgen which comes with this course. Toby Johnson kindly provided material that originally appeared in (Johnson, 2005) which can now be found in Chapters 1 and 9. Peter Pfaffelhuber and Pleuni Pennings 5 Contents 1 Polymorphism in DNA 9 1.1 The life cycle of DNA . .9 1.2 Various kinds of data . 12 1.3 Divergence and estimating mutation rate . 13 2 The Wright-Fisher model and the neutral theory 21 2.1 The Wright-Fisher model . 22 2.2 Genetic Drift . 26 2.3 The coalescent . 30 2.4 Mutations in the infinite sites model . 35 3 Effective population size 39 3.1 The concept . 40 3.2 Examples . 45 3.3 Effects of population size on polymorphism . 48 4 Inbreeding and Structured populations 53 4.1 Hardy-Weinberg equilibrium . 53 4.2 Inbreeding . 54 4.3 Structured Populations . 58 4.4 Models for gene flow . 64 5 Genealogical trees and demographic models 69 5.1 Genealogical trees . 69 5.2 The frequency spectrum . 73 5.3 Demographic models . 77 5.4 The mismatch distribution . 82 6 Recombination and linkage disequilibrium 85 6.1 Molecular basis of recombination . 85 6.2 Modelling recombination . 88 6.3 Recombination and data . 93 6.4 Example: Linkage Disequilibrium due to admixture . 99 7 8 CONTENTS 7 Various forms of Selection 101 7.1 Selection Pressures . 101 7.2 Modelling selection . 104 7.3 Examples . 110 8 Selection and polymorphism 115 8.1 Mutation-Selection balance . 115 8.2 The fundamental Theorem of Selection . 118 8.3 Muller's Ratchet . 119 8.4 Hitchhiking . 123 9 Neutrality Tests 129 9.1 Statistical inference . 129 9.2 Tajima's D ................................... 133 9.3 Fu and Li's D .................................. 138 9.4 Fay and Wu's H ................................ 140 9.5 The HKA Test . 141 9.6 The McDonald{Kreitman Test . 147 A R: a short introduction 151 Chapter 1 Polymorphism in DNA DNA is now understood to be the material on which inheritance acts. Since the 1980s it is possible to obtain DNA sequences in an automated way. One round of sequencing - if properly executed and if the sequencer works well - now gives approximately 500-700 bases of a DNA stretch. Automated sequencers can read from 48 to 96 such fragments in one run. The whole procedure takes about one to three hours. As can be guessed from these numbers there is a flood of data generated by many labs all over the world. The main aim of this course is to give some hints how it is possible to make some sense out of these data, especially, if the sequences come from individuals of the same species. 1.1 The life cycle of DNA The processing of DNA in a cell of an individual runs through different stages. Since a double strand of DNA only contains the instructions how it can be processed, it must be read and then the instructions have to be carried out. The first step is called transcription and results in one strand of RNA per gene. The begin and end of the transcription determine a gene1. In this first step intragenic regions are cut out. The second step is called translation and results in proteins which the cell can use and process. For transcription exactly one DNA base is transcribed into one base of RNA but for translation three bases of RNA encode an amino acid. The combinations of the three base pairs are called codons and the translation table of which codon gives which amino acid is the genetic code. There is a certain redundancy in this mechanisms because there are 43 = 64 different codons, whereas only 20 different amino acids are encoded. That is because certain amino acids are represented by more than one set of three RNA bases. The third basepair of the codon is often redundant, which means that the amino acid is already determined by the first two RNA bases. In both of these steps not the whole source, but only part of the information is used. In the transcription phase only the genes are taken from the whole DNA and used to produce pre-mRNA. Then introns are excised and the exons spliced together. The result 1The name gene is certainly not uniquely used with this meaning. 9 10 CHAPTER 1. POLYMORPHISM IN DNA is mRNA, which only has the transcribed exon sequences. The resulting messenger or mRNA transcript is then translated into a polypeptide or protein on the basis of the genetic code. As DNA is the material of genetic inheritance it must somehow be transferred from an- cestor to descendant. Roughly speaking we can distinguish two reproduction mechanisms: sexual and asexual reproduction. Asexually reproducing individuals only have one parent. This means that the parent passes on its whole genetic material to the offspring. The main example for asexual reproduction is binary fission which occurs often in prokaryotes. An- other instance of asexual reproduction is budding which is the process by which offspring is grown directly from the parent or the use of somatic cell nuclear transfer for reproductive cloning. Different from asexual is sexual reproduction. Here all individuals have two parents. Each progeny receives a full genomic copy from each parent. That means that every individual has two copies of each chromosome which both carry the instructions the individual would need to build its proteins. So there is an excess of information stored in the individual. When an individual passes on its genetic material to its offspring it only gives one set of chromosomes (which is a mixture of material coming from its mother and its father). The child now has two sets of chromosomes again, one from its mother and one from its father. The reduction from a diploid set of chromosomes to a haploid one is done during meiosis, the process when gametes are produced which have half of the number of chromosomes found in the organism's diploid cells. When the DNA from two different gametes is combined, a diploid zygote is produced that can develop into a new individual. Both with asexual and sexual reproduction mutations can accumulate in a genome. These are 'typos' made when the DNA is copied. We distinguish between point mutations and indels. A point mutation is a site in the genome where the exact base A, C, G, or T, is exchanged by another one. Indels (which stands as an abbreviation for insertions and deletions) are mutations where some DNA bases are erroneously deleted from the genome or some other are inserted. Often we don't know whether a difference between two sequences is caused by an insertion or a deletion. The word indel is an easy way to say one of the two must have happened. As we want to analyse DNA data it is important to think about what data will look like. The dataset that we will look at in this section will be taken from two lines of the model organism Drosophila melanogaster. We will be using DNASP to analyse the data. A Drosophila line is in fact a population of identical individuals. By inbreeding a population will become more and more the same. This is very practical for sequencing experiments, because it means that you can keep the DNA you want to study though you kill individuals that carry this DNA. Exercise 1.1. Open DNASP and the file 055twolines.nex. You will find two sequences. When you open the file a summary of your data is displayed. Do you see where in the genome your data comes from? You can get an easy overview of your data in clicking Overview->Intraspecific Data. How many variable nucleotides (usually called point 1.1.

Load more