Molecular Ecology
Lectures 15 hours (45 min): 5 x 3
Written exam: test

Lab 15 hours: 5 x 3
The lab test must be passed before sitting the exam

3 x wet lab
2 x computer lab
http://www.eko.uj.edu.pl/molecol/index.php/dla-studentow/kursy/molecular-ecology

Textbooks
• Frankham et al.: Introduction to Conservation Genetics
• Freeland et al.: Molecular Ecology
• Futuyma:
• Hall: Phylogenetic Trees Made Easy

Lectures

• Molecular ecology: integration of ecology, evolutionary and molecular biology
• What are molecular markers?
• Why are molecular markers essential for the study of ecology and evolution?
• Most important molecular techniques
• Application of molecular techniques to address research problems and obtain information impossible to get by other means

Molecular Ecology
• Application of molecular genetic techniques to research questions in ecology, biological conservation and evolutionary/behavioral ecology
• The name appeared at the beginning of the 1990s
• Some techniques have been in use since the 1960s
• Rapid development was made possible by the introduction of various DNA-based molecular markers
• Not a simple application of molecular techniques to study "old" ecological problems

Molecular Ecology
Problems tractable mainly or only with molecular techniques:
• Parentage and kinship analysis
• Individual identification
• Non-invasive methods; analysis of biological traces
• Inferring history of populations and species with no morphological differentiation
• Genetic analysis of extinct species
• Quick-and-dirty, universal methods of biodiversity assessment at various scales

Molecular ecology is intimately linked with evolutionary theory
Molecular data allow testing crucial aspects of the Modern Synthesis
Predictions derived from the Modern Synthesis form the basis for inferring ecological and evolutionary processes from molecular data
• Neutral theory of molecular evolution
• Molecular clocks
• Population genetics
• Coalescence
• Ecological genomics
• Molecular basis of adaptation

Molecular Ecology

• Molecular Ecology (IF 6.3) and Molecular Ecology Resources (IF 7.4) are leading journals in the fields of ecology and evolution
• population structure and phylogeography
• reproductive strategies
• relatedness and kin selection
• sex allocation
• population genetic theory
• analytical methods development
• conservation genetics
• speciation genetics
• microbial biodiversity
• evolutionary dynamics of QTLs
• ecological interactions
• molecular adaptation and environmental genomics
• impact of genetically modified organisms

Why are molecular techniques so useful in ecology?
• Heritable traits are analyzed, traits with substantial phenotypic plasticity are avoided
• Universal – from viruses to man
• Usually provide more precise results and are quicker
• Identification of genes responsible for the analyzed features and mechanisms – genetic basis of ecological and evolutionary processes
• DNA-based techniques require minute amounts of biological material – can be nondestructive and even noninvasive

What are molecular markers and what are their uses?
Heritable and variable features of DNA, RNA or proteins useful for identification of:
• Position in the genome (genetic maps)
• Cells and tissues (e.g. cancer)
• Individuals (parentage analysis, forensics)
• Populations
• Species
• Taxa above the species level
Molecular markers enable inferences about evolutionary and ecological processes

Molecular markers provide diverse biological information about:
• Mating systems
• Routes and patterns of migration
• Centers of genetic diversity within species
• Biodiversity of (micro)organisms
• Adaptation on the molecular level
Many types of molecular markers

Choice of molecular markers
The mode of inheritance

Biparental:
• autosomal
Uniparental:
• on sex chromosomes
  W chromosome in birds transmitted only maternally
  Y chromosome in mammals transmitted only paternally
• in organelles
  Animal mitochondria maternally transmitted, in some plants paternally transmitted
  Chloroplasts usually maternally transmitted

Choice of molecular markers
Rate of evolution
Depending on the investigated problem, the most useful may be markers evolving
quickly:
• Individual identification
• Relatedness in family groups and populations
• Migrations (gene flow) between populations
at a moderate rate:
• History of a particular species
• Identification of conservation units
• Quick biodiversity assessment (DNA barcodes)
slowly:
• Deeper-level phylogeny
• of traits and genomes

Choice of molecular markers
Codominance vs. dominance
Codominant marker – the individual's complete genotype is observed – heterozygotes and homozygotes can be distinguished
Very useful, but used to be more difficult to develop than dominant markers
Dominant marker – presence of an allele masks other alleles – complete genotype cannot be observed
Easy to develop, tens or hundreds may be scored simultaneously, but limited information; additional assumptions necessary. Analysis challenging. Used to be popular, now better options available.

Choice of molecular markers
Technical limitations, cost, fashion

[Figure: popularity of marker types: allozymes, minisatellites, RAPD, RFLP, AFLP, microsatellites, SNP, sequencing; Schlotterer 2004]

Choice of molecular markers
Technical limitations, cost, fashion

[Figure: Seeb et al. 2011]

Protein electrophoresis – Allozymes, Isozymes
The speed of migration in an electric field depends on the protein's charge
The first biochemical technique widely used in population genetics: Drosophila pseudoobscura (Lewontin & Hubby 1966); human (Harris 1966)
[Figure: gel with lanes 1-4, genotypes labelled CC, AB, BB, AA; poles marked + and -]

• Even single amino acid substitutions may change the protein charge and affect its electrophoretic mobility; ca. 30% of substitutions cause detectable changes
• Such electrophoretically distinguishable variants of a protein are called alleles
• Not all nucleotide substitutions cause amino acid changes (because the genetic code is degenerate) – synonymous and nonsynonymous substitutions
• Protein electrophoresis detects only a fraction of genetic variation

Tissue or whole-organism homogenate -> electrophoresis -> specific staining
Electrophoresis media: starch, cellulose acetate, acrylamide
Stained proteins: enzymes (dozens), hemoglobin, transferrin
Photo: M. Ratkiewicz

Some enzymes are encoded by several genes (loci); their protein products can be distinguished in the gel
Banding pattern – monomeric protein
[Figure: gel with bands labelled Locus 1, Locus 2, Locus 3; photo: M. Ratkiewicz]

Many enzymes are dimers or tetramers – subunits encoded by a single gene
Heterozygote pattern – dimeric protein
dimer – 3-band heterozygotes (AA, AB and BB dimers)
tetramer – 5-band heterozygotes

Advantages:
• A relatively large number of loci may be examined (30-50)
• Inexpensive
• Simple Mendelian inheritance
• Mature analysis methods
• Almost all taxa can be studied

Applications:
• Assessment of genetic variation
• Genetic differentiation between populations
• Studying effects of natural selection

Disadvantages:
• Usually limited variation
• May be under selection (many applications require neutral markers)
• Comparison of results between labs/studies difficult
• Not amenable to automation
• Require fresh or frozen tissue
• Almost always destructive, controversial in rare and endangered species

• First technique which provided reliable information about levels of genetic variation in nature
• Thousands of studies on all taxa
• Theoretical and practical significance difficult to overestimate (neutral theory of molecular evolution, molecular clocks, molecular adaptations, parentage analysis, conservation law enforcement etc.)
• Replaced by the DNA-based techniques

DNA-based techniques

• Access to the actual genetic variation
• Virtually unlimited number of markers
• Apply to all organisms
• Many kinds/classes, depending on the investigated problem
• Almost all utilize PCR
• Usually nondestructive
• Often noninvasive
• Amenable to automation and miniaturization

Stages of DNA analysis
sampling -> DNA extraction -> amplification (PCR) of selected markers -> genotyping

Sampling for genetic analysis
Invasive methods – large amount of high-quality DNA
• destructive (individual is sacrificed!)
• nondestructive (sample of blood, tissues, hair, feathers, leaves)
Noninvasive methods – small amount of poor-quality DNA
Collecting biological traces in the field (feathers, hair, faeces, owl pellets etc.)
Appropriate preservation and storage: drying, freezing, alcohol, buffers, etc.

DNA extraction
• Disruption of extracellular matrix, disruption/digestion of the cell wall
  • Mechanical homogenization
  • Homogenization in liquid nitrogen
  • Chitinase treatment
  • Proteinase K digestion
• Cell lysis
  • Detergent in a slightly basic buffer
• DNA purification – removal of proteins, lipids, sugars
  • Organic extraction (phenol/chloroform) or salting-out, followed by alcohol precipitation of DNA
  • Binding of DNA on silica columns
• DNA storage
  • At +4 or -20 °C, in slightly basic, low-concentration buffer with EDTA (chelates Mg2+ ions)

DNA extraction
• Many commercial kits
• Choice of the extraction method depends on:
  • Organism
  • Tissue
  • Quality of the starting material
  • Preservation/storage conditions
  • Required DNA quantity
  • Required DNA quality
• DNA extraction is also possible from fossil material
  • ca. 500 thousand years from ice cores, tens of thousands of years from fossils
  • Only short fragments – tens of bp
  • Very demanding, requires highly specialized, space-quality labs
  • Multistep controls essential because of the risk of contamination with contemporary DNA

Assessing quality and quantity of extracted DNA
Electrophoresis in low-concentration (0.5 – 1%) agarose gels
DNA stained with ethidium bromide or other, less toxic stains (SYBR Gold, SYBR Safe, Gel Red etc.) and photographed under UV
[Figure: gel with good quality DNA, degraded DNA and a size marker (M)]

Assessing quality and quantity of extracted DNA
Spectrophotometry
Quantity: absorption at 260 nm wavelength
Purity: for pure DNA A260/A280 in the range of 1.8 – 2.0, A260/A230 > 2.0
Measurements from 1 ul of DNA extract at concentrations of 3 – 3000 ng/ul
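The spectrophotometric readings above can be turned into a concentration and a purity check with a few lines of code. The sketch below is only illustrative: the conversion factor of 50 ng/ul per A260 unit for double-stranded DNA (1 cm path length) is the usual rule of thumb and is not stated on the slide, and the function and key names are made up for this example. The purity thresholds are the ones quoted above.

```python
# Rule-of-thumb conversion (assumption, not from the slide): an A260 of 1.0
# corresponds to ~50 ng/ul of double-stranded DNA at a 1 cm path length.
DSDNA_NG_PER_UL_PER_A260 = 50.0


def dna_concentration(a260, dilution_factor=1.0):
    """Estimate dsDNA concentration (ng/ul) from absorbance at 260 nm."""
    return a260 * DSDNA_NG_PER_UL_PER_A260 * dilution_factor


def purity_report(a230, a260, a280):
    """Compute both purity ratios and compare them with the slide's thresholds."""
    ratio_280 = a260 / a280
    ratio_230 = a260 / a230
    return {
        "A260/A280": ratio_280,
        "A260/A230": ratio_230,
        "A260/A280_ok": 1.8 <= ratio_280 <= 2.0,
        "A260/A230_ok": ratio_230 > 2.0,
    }


if __name__ == "__main__":
    # Hypothetical readings for one sample.
    print(dna_concentration(0.5))
    print(purity_report(a230=0.22, a260=0.50, a280=0.27))
```

A sample failing either ratio is not necessarily unusable, but it is a signal to re-purify or to quantify with a DNA-specific method.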

Low DNA concentrations may be accurately measured with real-time PCR or in fluorimeters following staining with fluorescent stains (e.g. PicoGreen)
DNA- or RNA-specific measurement

How much DNA needed? What about quality?
• The number of genome copies in the extraction matters
  • Bacterium ~ 10^6 - 10^7 bp = 1 – 10 fg (1 – 10 x 10^-15 g) DNA
  • Human cell – ca. 6 x 10^9 bp = 6 pg (6 x 10^-12 g) DNA
• Typical "micro" extraction from mammal tissue (200 ul blood, 10^7 cells, 10-25 mg tissue) – 4 – 30 ug (4 – 30 x 10^-6 g) DNA
• 1 PCR reaction needs 1-100 ng of mammal DNA; one micro extraction is enough for > 1000 PCRs
• Standard extraction methods, good quality starting material: DNA fragments ~ 10 – 20 kb – sufficient for most applications
• Even strongly degraded DNA (fragments of tens to a few hundred bp) can in some situations be used in PCR-based experiments – increased risk of contamination

Whole-genome amplification
• How to get a large quantity of DNA and unbiased amplification from a minute starting amount of DNA?
• Amplified DNA will be only as good as the source (actually slightly worse)
• Isothermal amplification employing the φ29 bacteriophage DNA polymerase with strand displacement activity
• Priming with random hexamers
• Long fragments up to 100 kb
• No upper limit to the amplification level
• Possible amplification from a single cell
• Low error rate (10^-6)
Lasken & Egholm 2003

RNA extraction
• Expression analysis – single genes and transcriptomes
• Identification of Single Nucleotide Polymorphisms (SNP) in functional regions
• Expressed Sequence Tags (EST)
• Whole transcriptome sequencing and de novo assembly
• Transcriptome sequencing quickly provides large amounts of information for nonmodel species with large and complex genomes

RNA extraction
• RNA less stable than DNA
• RNases ubiquitous, active and very difficult to inactivate
• Tissue must be deep frozen or stored in RNAlater reagent
• Homogenization
• Extraction methods:
  • organic extraction: phenol-chloroform-guanidinium thiocyanate, TRIzol, RNAzol
  • silica and other columns

Assessing RNA amount and integrity
• spectrophotometry – as for DNA – RNA amount only
• agarose gel electrophoresis – distinct 18S and 28S RNA bands – mainly integrity
• lab on chip – Bioanalyzer – amount and integrity

RNA profiles on Bioanalyzer
[Figure: high quality RNA, RIN=8.6; partially degraded RNA, RIN=6.5; degraded RNA, RIN=2.7]

DNA profiling – DNA fingerprints
• Total genomic DNA digested with a restriction enzyme
• Products separated in an agarose gel – restriction fragment length polymorphism (RFLP)
• DNA transferred and bound to a membrane
• Hybridization with a labeled probe

DNA profiling – DNA fingerprints
• When a minisatellite sequence present in multiple copies in the genome is used as the probe, each individual produces a unique banding pattern
• Technique ideal for individual identification (unique profiles analogous to fingerprints)
• Technically demanding
• Requires lots of high quality DNA
• Historically important – the first DNA-based technique used for individual identification, in the pre-PCR era (from 1985)
• Member of a broader family of RFLP techniques based on Southern hybridization

Polymerase chain reaction – PCR
(Still) THE most important technique in molecular ecology
• Billions of copies of a precisely defined DNA fragment can be obtained from a complex mixture of DNA fragments, e.g. the human genome
• The fragment is defined by primers, short (15-30 bp) single-stranded DNA fragments of a known sequence, which anneal to the target template DNA; primers are chemically synthesized
• Amplification may be performed from a minute amount of template, even a single molecule

PCR
[Figure: PCR, 20-40 cycles. Annealing temperature!!!]
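The "billions of copies" claim follows directly from the cycle count: each cycle can at most double the number of target molecules. A minimal sketch of that arithmetic, assuming an idealized reaction with a constant per-cycle efficiency (real reactions plateau as reagents run out; the function name and the efficiency parameter are illustrative only):

```python
def pcr_copies(template_copies, cycles, efficiency=1.0):
    """Idealized copy number after `cycles` PCR cycles.

    efficiency = 1.0 means perfect doubling in every cycle; lower values
    model a fixed fraction of templates being copied per cycle.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return template_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    # One starting molecule, cycle numbers spanning the 20-40 range on the slide.
    for n in (20, 30, 40):
        print(n, f"{pcr_copies(1, n):.3g}")
```

With perfect doubling, 30 cycles already take a single template molecule past one billion copies, which is why contamination with even a few stray molecules matters.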

PCR
• Thermostable Taq polymerase or similar – because denaturation of double-stranded DNA occurs at >90°C
• Usually 100 – 5000 bp fragments amplified
• Special polymerases and their mixes may amplify fragments up to 20-50 kb
• Design of optimal primers – should efficiently amplify DNA, should not be self-complementary or complementary to each other
• Conserved and specific primers: 5'CTAACTGACCGTGGATCATG3'
• Degenerate primers: 5'CTAACTGAYCGTGGATCRTG3' (Y = C or T, R = A or G) is a mixture of 5'CTAACTGATCGTGGATCGTG3', 5'CTAACTGACCGTGGATCGTG3', 5'CTAACTGATCGTGGATCATG3' and 5'CTAACTGACCGTGGATCATG3'
• Multiple primer pairs to amplify various DNA fragments in a single reaction – multiplex PCR

Amplified Fragment Length Polymorphism (AFLP)
• Replaced RAPD
• Good repeatability
• Tens or hundreds of bands
• Usually 50 – 500 bp range analyzed
• Presence/absence of bands scored (biallelic loci)
• Dominant markers – heterozygotes cannot be distinguished from the "band present" homozygotes

AFLP
1. Restriction digestion and adapter ligation: hundreds of thousands of fragments

2. PCR I – preselective amplification: unlabeled primers complementary to the adapter sequence, with one selective nucleotide at the 3' end; 16 x reduction of the number of fragments, usually still too many

3. PCR II – selective amplification: the product of PCR I is amplified with primers containing two additional selective nucleotides – further complexity reduction (256 x)
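The 16 x and 256 x figures in steps 2 and 3 follow from counting selective nucleotides. A small sketch of that reasoning, assuming two primers (one per adapter), equal frequencies of the four bases, and a hypothetical starting pool of 300 000 fragments:

```python
def reduction_factor(selective_nt_per_primer, n_primers=2, alphabet_size=4):
    """Expected fold reduction in fragment number.

    Each selective nucleotide must match one specific base out of four, so
    every added position keeps roughly 1/4 of the fragments (assuming equal
    base frequencies, which real genomes only approximate).
    """
    return alphabet_size ** (selective_nt_per_primer * n_primers)


if __name__ == "__main__":
    preselective = reduction_factor(1)   # one selective nucleotide per primer
    selective = reduction_factor(2)      # two additional nucleotides per primer
    start = 300_000                      # hypothetical number of ligated fragments
    print(preselective, selective)
    print(start / preselective, start / (preselective * selective))
```

Under these assumptions a pool of hundreds of thousands of fragments is reduced to a number of bands that can be resolved in a single lane or capillary.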

Only one primer is labeled – only fragments with EcoRI sites will be visible

AFLP
• PCR II products used to be resolved in vertical acrylamide gels and visualized with autoradiography or silver-staining
• Now usually run in automated DNA sequencers
• Tens of bands per primer combination
• Up to 64 primer combinations can be used – thousands of AFLP markers for any species

Automated DNA analyzer (sequencer)
[Figure: laser and CCD camera, capillaries, gel, samples, anode, buffer, cathode]

AFLP
Sequences
• in single, few or multiple copies in the genome
• coding and noncoding
• randomly distributed in the genome
• moderately or highly polymorphic
Applications
• individual identification
• genetic variation in populations
• differentiation among populations (genetic structure)
• phylogeny of closely related species
• genome scans – detection of genomic fragments underlying adaptations

Microsatellites
• Tandem repeats of 2-5 nucleotides
• Apparently randomly distributed in the genome
• Excellent molecular markers due to extraordinary levels of variation – many alleles in a population
• Very high mutation rate (orders of magnitude higher than for substitutions)
• Complex mutation mechanisms, DNA polymerase slippage an important mechanism

[Figure: TA repeats with flanking sequences on both sides]
TCATGTACGTTGATATATATATATATATGTCCTGATGTTA

Microsatellites
Mutations often due to polymerase slippage
Ellegren 2004

Microsatellites
• codominant, simple inheritance
• high variation – even hundreds of alleles per locus in a population, usually tens; just a few loci enable individual identification with near certainty
• easy automation
• multiple loci may be co-amplified in multiplex PCR
• ideal for studying relatedness of individuals
• very well developed analytical and statistical framework

Microsatellites
Stepwise mutation model (SMM) for microsatellites
Each mutation adds or removes one repeat unit, so the size of a new allele (bp) depends on the sizes of existing alleles
For most microsatellite loci the mutational process is more complex, "mixed"

Microsatellites
Usually genotyped with an automated genetic analyzer
PCR with fluorescently labeled primers – 3-4 colors
[Figure: gel image, 32 individuals, with size standard; axis: fragment length]
[Figure: capillary traces for individuals 1-4 at locus 1 and locus 2; axes: fragment length, signal intensity]

Microsatellites - drawbacks
• Primers are specific for a single species or a group of closely related species – often must be developed de novo; costly, time-consuming
• Complex, mixed mutation model
• Size homoplasy: two alleles of identical size, e.g. 120 bp, could be derived from two different alleles, 118 or 122 bp (for 2 bp repeats)
• Null alleles – alleles which are not amplified by our primers, resulting in an apparent excess of homozygotes compared to the Hardy-Weinberg expectations

Microsatellite null alleles
Primers: TGACACGTGGACTAGAC and CAGACATGACTAGGAT
Allele                                     Amplification
ACTGTGCACCTGATCTG(AT)10GTCTGTACTGATCCTA    √
ACTGTGCACCTGATCTG(AT)17GTCTGTACTGATCCTA    √
ACTGTGCACCTGATCTG(AT)12GTCTGTACTGATCCTA    √
ACTGTGCACCTGATCTC(AT)12GTCTGTACTGATCCTA    null – no amplification (G replaced by C in the primer-binding site)
ACTGTGCACCTGATCTG(AT)15GTCTGTACTGATCCTA    √
ACTGTGCACCTGATCTG(AT)10GTCTGTACTGATCCTA    √

Microsatellite genotyping from noninvasively collected samples
• low amount of poor quality DNA may cause nonamplification or allelic dropout
• repeated analysis of the same sample necessary
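One way to use repeated analyses of the same sample is to build a consensus genotype and to accept an allele only if it has been seen more than once. The sketch below illustrates that idea only; the thresholds (two observations per allele, three concordant replicates for a homozygote) are placeholder assumptions, not the values recommended by any particular protocol, and the function name is hypothetical.

```python
from collections import Counter


def consensus_genotype(replicates, min_allele_obs=2, min_homozygote_reps=3):
    """Consensus genotype for one sample at one locus.

    replicates: list of tuples of allele sizes (bp); () marks a failed PCR.
    Returns a 2-tuple of alleles, or None if the evidence is ambiguous.
    """
    counts = Counter()
    for genotype in replicates:
        counts.update(set(genotype))  # count each allele once per replicate

    confirmed = sorted(a for a, n in counts.items() if n >= min_allele_obs)
    if len(confirmed) == 2:
        return tuple(confirmed)  # heterozygote: both alleles seen repeatedly
    if len(confirmed) == 1 and len(counts) == 1:
        allele = confirmed[0]
        if counts[allele] >= min_homozygote_reps:
            return (allele, allele)  # homozygote: no other allele ever seen
    return None  # dropout, contamination or too few replicates


if __name__ == "__main__":
    # Hypothetical replicate PCRs of one faecal sample.
    print(consensus_genotype([(118, 122), (118,), (118, 122), ()]))
    print(consensus_genotype([(118,), (118,), (118, 122)]))
```

Requiring more concordant replicates for homozygotes than for heterozygotes reflects the fact that allelic dropout makes a true heterozygote look like a homozygote, never the reverse.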

[Figure: successful PCR vs. allelic dropout; Taberlet et al. 1999]

DNA sequencing – Sanger method
• sequencing by synthesis and chain termination with dideoxynucleotides

DNA sequencing – Sanger method
Chromatogram from an automated sequencer

Sanger sequencing
• used for 35 years
• universal
• automated
• high quality – gold standard
• long reads (500-1000 bp)
• miniaturization and parallelization difficult – limited throughput
• costly
[Figure: sequencing reaction with labeled terminators, capillary electrophoresis, fluorescent detection]

DNA sequencing – Sanger method
Requires purified, homogeneous template, otherwise the sequence will be unreadable

If an amplicon contains multiple products, it must be cloned in bacteria and multiple clones sequenced

Time-consuming and costly

DNA sequence variation
• sequencing gives access to any part of the genome
• in the analysis of DNA variation in a population the same (homologous) nucleotide positions are compared between individuals – sequence alignment
[Figure: good alignment and poor alignment of homologous sequences; non-homologous sequences; nucleotide position]

DNA sequencing
• phylogeny reconstruction
• genomics
• phylogeography
• biodiversity assessment
• experimental evolution
• forensics
• medicine
The role of DNA sequencing is increasing steadily; new technologies are increasing output and decreasing cost – personal genomics, but also amazing prospects for the study of non-model organisms

Next Generation Sequencing

• Diverse technologies, billions of bp sequenced in a single run – whole eukaryotic genomes
• Complex DNA mixture can be sequenced
• Does not require preparation of individual templates
• Usually short reads
• Processing requires complex bioinformatic resources, both hardware and software
• New technologies have been introduced continually since 2005

Stages
• Template preparation
  • Fragmentation and fragment selection
  • Attaching adaptors
  • Selection of single stranded fragments
• (Clonal amplification)
  • required to increase signal strength
  • some technologies directly sequence single DNA molecules
• Sequencing
  • sequencing by synthesis
  • sequencing by ligation
  • diverse approaches to signal detection

454
[Figure: DNA shearing, adaptor ligation, fragment selection and denaturation; binding to beads (single fragment – single bead); emulsion PCR in microreactors; clonally amplified DNA, ca. 10^7 copies. Margulies et al. 2005]

454 pyrosequencing
[Figure: deposition of beads on an optic-fibre plate with 3.2 million wells; enzymes added; sequencer delivers reagents, one dNTP per cycle, >300 cycles; example read TCAGGTGGCATTGCCGCCTGGCCTTGTGCGA. Margulies et al. 2005, Metzker 2010]
Reads 600 - 700 bp, ca. 1 million sequences, 6-7 x 10^8 bp in total

454
• The oldest NGS technology
• Long reads
• Relatively fast (10-20 h)
• High cost
• Problems with homopolymer regions
www.454.com

Next generation sequencing – Ion Torrent
• The principle is similar to 454, but instead of light, the pH change caused by nucleotide incorporation is measured electronically
• Semiconductor chip instead of a glass-fibre plate
• Cheap and scalable:
  • No modified nucleotides
  • Electronic signal detection – no need for imaging
• Rapid improvement, currently > 200 bp reads, 10 Gb per run

Next generation sequencing - Illumina

Bridge amplification on a glass slide
Metzker 2010
Hundreds of millions of clusters, each with >10^3 identical DNA molecules (derived from a single molecule)

Illumina – bridge amplification
1. Primers attached to glass slide
2. Single-stranded template with adaptors added
3. Replication of the template
4. Denaturation – template attached to the slide
5. Buffer change – annealing to immobilized primer
6. Elongation (replication)
7. Denaturation, annealing and elongation repeated; after 35 cycles, a cluster of >10^3 molecules
8. One primer cut – clonal, single-stranded DNA fragments
Soldatov 2008

Next generation sequencing - Illumina
Sequencing with reversible terminators
• All four nucleotides, each fluorescently labelled and acting as a terminator, are added in every cycle – single-base extension
• Washing out unused terminators and imaging
• Dye and terminator are cleaved – chain can be extended again
Metzker 2010

Illumina
• Up to 600 billion (6 x 10^11) bp in a single run
• Each DNA fragment may be sequenced from both ends (e.g. 2 x 100 bp) – partially overcomes limitations of short reads
• 1.5 - 11 days depending on read length (36 – 100 (150) bp) and whether single-end or paired-end sequencing is used
• Most frequent errors – base substitutions
• Most important NGS technology
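To put the run output into perspective, it can be converted into mean sequencing depth, coverage = total bases / genome size. The sketch below assumes a haploid human genome of about 3 x 10^9 bp (half of the 6 x 10^9 bp per diploid cell quoted earlier), perfectly even coverage, and that every base is usable; the function names are illustrative.

```python
import math


def mean_coverage(total_bases, genome_size):
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size


def fragments_needed(target_coverage, genome_size, read_length, paired=True):
    """Number of sequenced fragments required to reach a target mean coverage."""
    bases_per_fragment = read_length * (2 if paired else 1)
    return math.ceil(target_coverage * genome_size / bases_per_fragment)


if __name__ == "__main__":
    human_haploid_bp = 3_000_000_000     # assumption: ~3 Gb haploid genome
    run_output_bp = 600_000_000_000      # 6 x 10^11 bp per run, from the slide
    print(mean_coverage(run_output_bp, human_haploid_bp))
    print(fragments_needed(30, human_haploid_bp, read_length=100))
```

In practice coverage is uneven and some reads are discarded, so the realized depth is lower than this idealized figure.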

[Figure: www.illumina.com]

Single molecule sequencing - PacBio
• DNA polymerase molecules immobilized in very small cells – zero-mode waveguides (ZMW)
• Incorporation of a nucleotide into the DNA strand generates a flash of colored light, which is recorded in real time
• Real single molecule sequencing – no amplification
Pacific Biosciences
Metzker 2010

Single molecule sequencing - PacBio
• Long reads > 1000 bp
• Very fast sequencing (minutes), but still limited throughput
• ZMW on disposable chips (SMRT cells)
• Huge error rate (ca. 15% for single reads) – analysis challenging, still at early stages
• Very complex, massive and expensive instruments

"Personal" sequencers
• cheaper, smaller versions of sequencing instruments tailored to the needs of single labs
• lower throughput
• cheaper
• faster
GS Junior (Roche, 454), MiSeq (Illumina), Ion Torrent (Life Technologies)

Comparison of platforms - throughput
Instrument            Run [h]  Reads [million]  Read length [bp]  Mb per run
454 FLX+              20       1                650               650
GS Junior             10       0.1              400               40
Illumina HS2000/2500  240      3000             100+100           600 000
MiSeq                 65       22               300+300           13 000
Ion Torrent 318       7        4                400               1500
Ion Proton            4        70               200               10 000
Pacific               2        0.03             >3 000            100-150

Comparison of platforms – cost ($)
Instrument       Reagents/run  $/Mb   Per unit cost (% of full run)
454 FLX+         6200          7      2000 (12%)
GS Junior        1100          22     1500 (100%)
Illumina HS2000  23500         0.04   2400 (6%)
MiSeq            1400          0.11   1800 (100%)
Ion Torrent      940           0.60   1200 (100%)
Ion Proton       1050          0.09   1300 (100%)
Pacific          >300          2-17   500 (100%)

Future of NGS?
• Sequencing of really long DNA fragments would facilitate assembly
• Nanopores
• Optical mapping
• Multiple technologies do and will coexist
Niedringhaus et al. 2011

Costs drop rapidly
[Figure: human genome sequencing cost; www.genome.gov]

Data flood
• Huge amounts of sequence data may be obtained quickly and cheaply
• Analysis and storage costs are the most important bottleneck
• World is full of underexplored sequence data
• Bioinformaticians

Data analysis – NGS bioinformatics
• De novo sequencing – we want to determine the genome sequence of a species with no reference genome available – ASSEMBLY
• Resequencing – genome sequence is available (reference genome) – MAPPING; resequencing of:
  • genomes of various individuals
  • tumor vs. normal
  • specific parts of the genome: protein coding, transcription factor binding, histone binding, methylated etc.
  • particular genes, e.g. BRCA1 gene – breast cancer risk

You can't assess data by eye - QC
• Automated methods to assess data quality
• Basecallers – determine bases and assign quality values
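The quality values assigned by basecallers are reported on the Phred scale, QV = -10 log10(P), where P is the probability that the basecall is wrong. A minimal sketch of the conversion in both directions, plus a deliberately simplified trimming rule (drop trailing bases below a threshold; real trimmers typically use sliding windows or running sums, and the threshold of 20 is an arbitrary choice):

```python
import math


def phred_from_error(p_error):
    """Phred quality value for a given probability of an incorrect basecall."""
    return -10.0 * math.log10(p_error)


def error_from_phred(qv):
    """Probability of an incorrect basecall for a given Phred quality value."""
    return 10.0 ** (-qv / 10.0)


def trim_3prime(qualities, min_qv=20):
    """Return the read length kept after removing low-quality trailing bases."""
    end = len(qualities)
    while end > 0 and qualities[end - 1] < min_qv:
        end -= 1
    return end


if __name__ == "__main__":
    for qv in (10, 20, 40):
        print(qv, error_from_phred(qv))
    # Hypothetical per-base qualities of one short read.
    print(trim_3prime([30, 35, 40, 22, 15, 8]))
```

Because the scale is logarithmic, every 10 points corresponds to a tenfold drop in the expected error rate.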

• Phred scores: QV = -10 log10(P), where P is the probability of an incorrect basecall; QV 10 – 1/10 chance of error, 20 – 1/100 chance of error, 40 – 1/10000 chance of error
• Trimming low quality bases

Genome assembly would be relatively simple if:
• reads were long – but they are short (100-1000 bp)
• there were no sequencing errors – but errors are relatively common (0.1-10%)
• genomes did not contain repeats – but eukaryotic genomes contain large amounts of repetitive sequences
Therefore de novo assembly is challenging and requires additional information:
• cloning in large-insert vectors
• mate-pair libraries

Repetitive elements in genomes
[Figure: 2 kb very similar or identical sequences (3 repeats) separated by unique sequences]
• repetitive elements originate by duplication
• duplications may be related to meiotic recombination or the activity of mobile elements
• the result is the presence of similar or identical sequences in various parts of the genome - paralogs

De novo assembly
[Figure: 2 kb very similar or identical sequences (3 repeats) separated by unique sequences; placement of reads uncertain (???)]
• if reads are short (100, 200 … bp) they cannot be assigned to a specific repeat, so the genomic region cannot be assembled
• solution: anchor reads in unique sequence – mate-pairs
• genomic DNA is fragmented into large pieces of known size (3 kb, 5 kb, 10 kb) – for each piece both ends are sequenced using short reads – we know the distance between the ends

Mate-pair libraries
[Figure: 2 kb very similar or identical sequences (3 repeats) separated by unique sequences]

Resequencing - mapping
• reads are mapped ("fitted") to the reference sequence – computationally much simpler, efficient algorithms exist
• mapping results are stored in SAM or BAM (binary SAM) format

Resequencing - variation
• Differences with respect to the reference – takes into account coverage, mapping quality, sequencing error – probabilistic model – results saved in a Variant Call Format (VCF) file

Genome browsers

• UCSC
• ENSEMBL
• NCBI MapViewer
• annotated genome
• visualisation of the genome
• adding custom info

Single Nucleotide Polymorphisms (SNP)
• In the human genome ca. 10-30 million SNPs with minor allele frequency > 1%
• Both coding and non-coding regions
• Many do not affect organism health or performance
• But some SNPs correlate with disease or other organism features
• Many genotyping technologies, including microarrays enabling simultaneous genotyping of millions of SNPs
• Efficient genotyping requires knowledge of the genomic context

Single Nucleotide Polymorphisms (SNP)
Microsequencing (e.g. SNaPshot)
All nucleotides added to the reaction are terminators
Single-base extension
Electrophoresis of fluorescently labeled extension products
[Figure: multiplex genotyping; genotype: GG homozygote]

Single Nucleotide Polymorphisms (SNP)
Microsequencing (e.g. MassArray)
• Only terminators
• Single-bp extension
• Simultaneous amplification and microsequencing of up to 36 SNPs
• Mass of the microsequencing products determined by mass spectrometry
• For each SNP, the mass of possible extension products is precisely known
• Mass of all products determined -> genotyping

Single Nucleotide Polymorphisms (SNP)
Golden Gate – hundreds of SNPs genotyped in a single reaction
Fan et al. 2006

Single Nucleotide Polymorphisms (SNP)
Commercially available microarrays (Affymetrix, Illumina) enable simultaneous genotyping of millions of SNPs in human and model species

Single Nucleotide Polymorphisms (SNP)
• All these genotyping techniques require prior SNP discovery – only known SNPs can be genotyped
• SNP discovery is performed by resequencing of genomic regions or whole genomes in a limited number of individuals – discovery panel
• Ascertainment bias – identified SNPs are dependent on the discovery panel
• Ascertainment bias can be a serious problem in association studies – correction difficult
• Whole genome resequencing, targeted resequencing

RAD-tags
• A fraction of the genome, fragments adjacent to restriction sites, is sequenced for all individuals
• SNPs are discovered in sequenced fragments
• Individuals can be barcoded with short sequence tags and pooled for sequencing
• Number of assayed SNPs determined by genome size and restriction enzyme used
  • NsiI: 342 cutting sites / Mb
  • NotI: 1 cutting site / Mb
  • and many other enzymes
• No ascertainment bias
• More powerful if reference genome available
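The dependence of marker number on genome size and enzyme choice can be sketched with the cut-site densities quoted above. The example treats those densities as if they applied uniformly to a hypothetical 500 Mb genome, and it assumes that each cut site yields two tags (one on either side); both are simplifications, and the names are illustrative.

```python
# Cut-site densities (sites per Mb) as quoted on the slide.
CUT_SITES_PER_MB = {"NsiI": 342, "NotI": 1}


def expected_cut_sites(enzyme, genome_size_mb):
    """Expected number of restriction sites in a genome of the given size."""
    return CUT_SITES_PER_MB[enzyme] * genome_size_mb


def expected_rad_tags(enzyme, genome_size_mb, tags_per_site=2):
    """Expected number of RAD tags, assuming tags flank both sides of each site."""
    return expected_cut_sites(enzyme, genome_size_mb) * tags_per_site


if __name__ == "__main__":
    genome_mb = 500  # hypothetical genome size
    for enzyme in CUT_SITES_PER_MB:
        print(enzyme, expected_cut_sites(enzyme, genome_mb),
              expected_rad_tags(enzyme, genome_mb))
```

A frequent cutter gives dense marker coverage at a higher sequencing cost per individual, while a rare cutter gives few markers but allows many more individuals to be pooled per run.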

Illumina sequencing RAD-tags

Hohenlohe et al. 2010 RAD-tags

• A fraction of genome, fragments adjacent to restriction sites, is sequenced for all individuals • SNPs are discovered in sequenced fragments • Individuals can be barcoded with short sequence tags and pooled for sequencing • Number of assayed SNPs determined by genome size and restriction enzyme used • Nsi I 342 cutting sites / Mb • Not I 1 cutting site / Mb, and many other enzymes • No ascertainment bias • More powerful if reference genome available SSCP (Single Strand Conformational Polymorphism)

SSCP is used for detecting mutations and genotyping of defined genomic fragments (PCR amplified) of identical length but differing in sequence

Single stranded DNA fragments of 100-400 bp have sequence- depended conformation, the rate of gel migration depends on the fragment conformation Single bp substitutions can be detected Targeted sequence capture

• Sequencing of multiple whole eukaryotic genomes still expensive • Data analysis and storage even more expensive and demanding • Often, only a fraction of the genome is of interest, e.g. all human exons comprise only 1.2% of the genome • These fragments of interest can be selectively captured on arrays with appropriate probes attached • Requires the knowledge of the genome sequence or at least sequence of the fragments we need • Captured fragments can be then sequenced • Tens of kb to tens of Mb Mamanova et al. 2010 Molecular biology techniques can be fully automated

Tens, hundreds or thousands of markers can be quickly genotyped in thousands of individuals

Utility of molecular markers for various research problems

Application                allozymes  AFLP  microsatellites  SNP  Sequences of nuclear genes  cpDNA and mtDNA sequences
Individual identification  +          ++    +++              +    -                           +
Genetic variation          ++         +     +++              ++   +                           -
Genetic differentiation    ++         +     +++              +++  ++                          ++
Gene flow (migration)      +          +     +++              ++   +                           +
Phylogeography             +          +     +                ++   +++                         +++
Hybridization              ++         ++    ++               +++  +++                         +++

Eukaryotes possess several genomes

Genome           Inheritance                        Structure  Mutation rate
nDNA             ♀ and ♂ (biparental)               Linear     moderate
cpDNA            ♀ (angiosperms), ♂ (gymnosperms)   Circular   low
mtDNA (animals)  ♀                                  Circular   high
mtDNA (plants)   ♀                                  Circular   low

Molecular markers located in each genome are useful for addressing various problems Applications of molecular methods (1)

• Identification of: • Sex • Individuals (also from biological traces) • Parentage • Species • Genetic variation and factors influencing it • Measures of variation • Variation and population processes and history • Gene flow and genetic structure of populations • Analysing dispersion with molecular markers • Measures of genetic structure and their biological interpretation Applications of molecular methods (2)

• Gene genealogies and genealogical inference • Historical demography • Gene flow among populations • Phylogeography • Conservation genetics – the application of molecular markers for conservation purposes • Genetically modified organisms and ecosystems • Adaptation at the molecular level Molecular sexing

Necessary in cases of: • the lack of sexual dimorphism • for determination of sex ratio in offspring • juveniles, larvae etc. • analysis of faeces and other traces

Sex-specific markers are not universal across taxa

Molecular sexing
Birds
• Genetic sex identification in species without sexual dimorphism
• Primary sex ratio (for embryos)

CHD genes

Intron A of the CHD1-Z and CHD1-W genes

[Figure: gel with lanes ZZ, ZZ, ZZ (males) and ZW, ZW, ZW, ZW (females); gel start and 400 bp / 200 bp size markers indicated]

Molecular sexing
Mammals
• Genetic sex identification from biological traces (e.g. faeces)
• Sex identification for juveniles

SRY gene, Y chromosome microsatellites

A gene occurring in both sexes (Zfy/Zfx) acts as the positive control for successful PCR

[Figure: gel with Zfy/Zfx (positive control) and Sry bands; lanes ♂ ♀ ♂ ♀; Bryja & Konecny 2003]

Individual identification

Even small number of polymorphic loci may form a large number of possible genotypes

$$G = \frac{n^2 (n+1)^2}{4} \quad \text{(for 2 loci)}$$

n = number of alleles per locus; for n = 4, G = 100

The number of possible genotypes for n bi-allelic loci is $3^n$

Polymorphic molecular markers are ideal for individual identification

Individual identification
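The genotype counts quoted for individual identification can be reproduced with a short calculation. This is only an illustrative sketch: it assumes independent loci and counts unordered diploid genotypes, n(n+1)/2 per locus, and the function names are hypothetical.

```python
from math import prod


def genotypes_per_locus(n_alleles: int) -> int:
    """Distinct diploid genotypes at one locus: n homozygotes + n(n-1)/2 heterozygotes."""
    return n_alleles * (n_alleles + 1) // 2


def multilocus_genotypes(alleles_per_locus) -> int:
    """Multilocus genotypes for independent loci: product of the per-locus counts."""
    return prod(genotypes_per_locus(n) for n in alleles_per_locus)


print(multilocus_genotypes([4, 4]))    # two loci with four alleles each
print(multilocus_genotypes([2] * 10))  # ten bi-allelic loci, i.e. 3**10
```

With a realistic panel, e.g. ten loci with eight alleles each, the count quickly exceeds the size of any natural population, which is why a modest number of polymorphic loci is enough to tell individuals apart.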

Distinguishing individuals in clonal organisms

Honey mushroom Armillaria bulbosa: a single clone may cover 15 ha, weigh 10 tonnes and be 1500 years old; RAPD-based analysis

Smith et al. 1992 Determination of the genetic mating system

Aspects of mating systems:
• Which individuals mate with which?
• How many individuals reproduce?
• How long do individuals stay together?
• Who cares for the offspring?

? Individual reproductive success

+ molecular markers

Behavioral system vs. genetic mating system Behavioral vs genetic mating system

Behavioral mating system – based on direct observation of pairing and mating

Genetic mating system – molecular identification of reproducing individuals and parentage of young individuals

? Behavioral mating system = Genetic mating system

Genetic identification of parents is essential Parentage analysis

Several approaches; the simplest and most frequently used are:
• Exclusion
  Based on the principles of Mendelian inheritance: in diploid species each parent transmits exactly one allele to each child at each autosomal locus, so parent and child share at least one allele per locus
• Categorical identification
  For each potential parent (from the pool of non-excluded individuals) the probability of parentage is calculated, and the individual with the highest probability is considered the parent; the probability of parentage is genotype-dependent

Identification is most straightforward if the genotype of one parent is known (e.g. seeds collected from a plant)
Parental genotypes may be inferred from the offspring genotypes

Exclusion
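The exclusion principle can be sketched as a per-locus compatibility check. Everything in this example is an assumption made for illustration: the genotypes are invented microsatellite allele sizes, the function names and the `max_mismatches` parameter are hypothetical, and genotyping errors and null alleles are ignored.

```python
def compatible(parent, child, other_parent=None):
    """True if `parent` could have transmitted one allele to `child` at a single locus.

    Genotypes are 2-tuples of allele labels. If the other parent's genotype is
    known, the child must be explainable as one allele from each parent.
    """
    if other_parent is None:
        return any(allele in child for allele in parent)
    c1, c2 = child
    return (c1 in parent and c2 in other_parent) or (c2 in parent and c1 in other_parent)


def non_excluded(candidates, child, mother=None, max_mismatches=0):
    """Names of candidate fathers that are not excluded across all loci."""
    kept = []
    for name, genotype in candidates.items():
        mismatches = sum(
            not compatible(genotype[locus], child[locus], mother[locus] if mother else None)
            for locus in child
        )
        if mismatches <= max_mismatches:
            kept.append(name)
    return kept


# Hypothetical two-locus genotypes (allele sizes in bp)
mother = {"loc1": (100, 104), "loc2": (150, 150)}
chick = {"loc1": (104, 108), "loc2": (150, 154)}
candidates = {
    "male_A": {"loc1": (108, 112), "loc2": (154, 158)},
    "male_B": {"loc1": (100, 104), "loc2": (150, 154)},
}

print(non_excluded(candidates, chick, mother))  # mother's genotype known
print(non_excluded(candidates, chick))          # mother's genotype unknown
```

In this toy data set male_B shares an allele with the chick at every locus, so he is excluded only once the mother's contribution is accounted for, which illustrates why a known parental genotype makes identification more straightforward.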

[Figure: genotypes at loci 1-4 of the male, the female (mother), chick 1 and chick 2]

Parentage analysis

• Highly polymorphic microsatellites are best; in some cases SNPs, exceptionally AFLP or allozymes
• Markers should be highly variable
• The number of markers should be substantial and depends on the level of polymorphism: 5-30 microsatellites, but > 50 SNPs
• Genotyping must be reliable; the genotyping error rate must be estimated and should be low
• Loci should be unlinked
• Appropriate field sampling is essential: as large a fraction of potential parents as possible should be sampled
• Biological characteristics of the species (e.g. brood size) must be taken into account
• Many computer programs: Cervus, FAP, Pasos, Colony etc.

Extra-pair parentage

• Detected in 90% of socially monogamous bird species
• In these species on average 11% of offspring derive from extra-pair matings; high variance, multiple factors involved
• Reed bunting (Emberiza schoeniclus): the female's social partner was not the father of 55% of chicks; 86% of broods contained chicks from extra-pair matings
• From the mid-1980s: minisatellites
• Now microsatellites: several to a dozen or so loci
• Very high efficiency and precision

Individual identification

• Criminal cases
• Identification of human remains (missing persons, catastrophes)
• Parentage analysis

Marker choice

• Markers used for human identification
  • Nuclear DNA
    • Microsatellites (STR), autosomal
    • Microsatellites (STR) on sex chromosomes (Y-STR, X-STR)
  • Mitochondrial DNA markers (hair, bones)
    • HV1, HV2
• Comparative material in the case determines the choice of markers and the strength of evidence
  • Sample from the suspect/victim
  • Personal belongings of the missing person (e.g. toothbrush)
  • Samples from both parents (all markers)
  • Only from the mother (nuclear DNA + mtDNA)
  • Only from the father (nuclear DNA including Y-STR)
  • More distant relatives, maternal and paternal
  • Offspring
• Animal traces
  • Cytochrome b, microsatellites

STR (Short Tandem Repeats) = microsatellites

Identification of Mikołaj Kopernik's (Nicolaus Copernicus) remains

• Bones found in the Frombork cathedral
• Face reconstruction based on the skull
• DNA profile from bones and teeth
• The profile was identical to the profile obtained from hair found in a book known to have been Kopernik's property

Bogdanowicz et al. 2009 Identification of species and their hybrids

There is no (and won’t be?) universally accepted species concept

Biological species concept vs. other definitions

Many species are morphologically indistinguishable (sibling, cryptic species), for microorganisms the problem is absolutely crucial Identification of species and their hybrids

• Species have unique, genetically based characteristics -> species differ genetically • Diagnostic marker – a given species has a unique, species specific allele(s) • Identification of interspecific hybrids is straightforward with diagnostic markers

[Figure: allele diagnostic for species A, allele diagnostic for species B, hybrid]

Identification of species and their hybrids

• If hybrids are fertile and backcross to the parental species, they may be heterozygous at only some diagnostic loci
• The more diagnostic markers the better; individuals with a small proportion of the genome derived from the other species can then be identified
• Uniparentally inherited markers (e.g. mtDNA) enable determining the direction of hybridization, e.g. ♀A x ♂B but not ♀B x ♂A

Identification of species and their hybrids

Methods which estimate, on the basis of a multilocus genotype, the fraction of the individual’s genome derived from each parental species

Based on probabilistic models and Bayes theorem – e.g. STRUCTURE

Lecis et al. 2006 DNA barcoding

• Standardized procedure of species identification through short (100-1000 bp), unique DNA sequences
• Appropriate gene
• Conserved primer sites – conserved flanking regions
• Sufficient variation in the target region
• Present in all species of a large taxonomic group (bacteria, Archaea, plants, animals, fungi)

DNA barcoding

Hajibabaei et al. 2007

• Sequence database is assembled for a taxonomic group (identification by an expert taxonomist) • Sequence from unknown organisms compared with the database • Full automation in the future? DNA barcoding

http://barcoding.si.edu http://www.barcodinglife.org

• Despite initial criticism, very good performance demonstrated in multiple groups (ca. 95% success rate) • Database assembling is a major limitation • But possible also without databases... DNA barcoding and NGS

• Sanger sequencing is slow and expensive
• DNA from environmental samples amplified with universal primers requires cloning to separate sequences from various species, which is even more cumbersome
• Next-generation sequencing (NGS) provides millions of sequencing reads derived from single DNA molecules (cell-free cloning), so quick analysis of entire ecosystems becomes possible

Valentini et al. 2009 DNA barcoding – applications

• Species identification in poorly known groups with subtle morphological differences: dipterans, lepidopterans, nematodes (less than 1% species described on the basis of morphology) • Species identification from biological traces: hair, faeces etc. – monitoring of endangered species: Siberian tiger, the Amur leopard • Species identification in highly processed products – meat, leather, caviar – monitoring of trade in endangered species DNA barcoding – applications

• palaeoecology: DNA from the 450 ky-old ice core suggests the presence of boreal forests in Greenland • Diet (and vegetation changes) of an extinct giant sloth (10 – 30 kya) • Diet of present-day species – analysis of stomach content or faeces

Valentini et al. 2009

Environmental genomics and DNA taxonomy
• Ribosomal RNA (rRNA) genes enable identification of microorganisms
• The vast majority of microorganisms cannot be cultured and are poorly known; the number of forms is unknown
• Sequencing of soil, sea water, sea floor deposits
• Sequences of defined rRNA fragments are used to determine the number of Molecular Operational Taxonomic Units (MOTU) using sequence divergence thresholds
• "Rare biosphere"

Huber et al. 2007

DNA barcoding and DNA taxonomy challenges
• A single genomic fragment: hybridization, introgression and incomplete lineage sorting pose problems
• The divergence threshold used for MOTU designation may not reflect biological reality
• The ratio of intraspecific to interspecific variation is usually poorly known
• Barcode length: standard barcodes are > 500 bp, but amplification of DNA fragments > 150 bp is often impossible from partially degraded DNA, so resolution is compromised

DNA biodiversity estimates - controversies

Searching sequence databases

• NCBI – GenBank and other databases • BLAST – quick identification of similar sequences • Searching through genomes • Searching protein databases Free access to the worldwide genetic resources

Enormous amounts of sequence data are generated every day, so there is a high chance that the desired fragment is already in the databases: start there!

Terms & definitions
• Locus (pl. loci) – the place on a chromosome where a given DNA segment, e.g. a given gene, is located; often used interchangeably with gene
• Allele – a variant of the gene distinguishable from other variants; sometimes used as the name for a gene copy (context-dependent)
  • In a population there may be multiple alleles at a locus
  • A diploid individual has at most two different alleles
• Gene copy – used for counting genes when we are not interested in the allelic state
  • A diploid individual has two copies of each autosomal gene
  • In a population of N diploid individuals there are 2N copies of each autosomal gene
• Phenotype – a characteristic of an organism or a group of organisms
  • Eye or hair color, blood group
• Genotype – genetic makeup at one or more genes
  • At locus A an individual may be the A1A1 homozygote or the A1A2 heterozygote
• Haplotype – DNA segment (usually non-recombining)

Gene (locus) and allele

• The place on a chromosome where the gene is located is called a locus
• The gene variant present at a given locus on a particular chromosome is called an allele
• At each locus one allele (gene copy) is inherited from each parent; exceptions: mtDNA, cpDNA and sex chromosomes
• Alleles (gene copies) may be identical (homozygote) or different (heterozygote)
• The allelic composition of an individual at a locus is its genotype

[Figure: homologous chromosomes carrying allele A and allele B of the ABO blood group gene (locus); heterozygote AB – blood group AB]

Terms & definitions

• Genotype frequency – proportion of a given genotype among assayed individuals (in the studied population)

• Two alleles A1 and A2, diploid species, autosomal gene: $P = N_{A1A1}/N$, $H = N_{A1A2}/N$, $Q = N_{A2A2}/N$
• Allele frequency – proportion of a given allele among all assayed gene copies

• Two alleles A1 and A2, diploid species, autosomal gene: homozygote frequency plus half of the heterozygote frequency (heterozygotes carry only a single copy of a given allele): $p = P + \tfrac{1}{2}H$, $q = Q + \tfrac{1}{2}H$, $q = 1 - p$

Equivalently, the number of alleles (gene copies) of a given type divided by the total number of alleles (gene copies) assayed:

$p = N_{A1}/2N$, $q = N_{A2}/2N$

Measures of genetic variation and parameters useful for inferring population processes

• Allele frequencies at assayed loci
• Proportion of polymorphic loci (P)
• Mean number of alleles per locus (A), allelic richness (R)
• Mean observed and expected heterozygosity, haplotype diversity (HO, HE, h)
• Nucleotide diversity (π)
• Departures from the Hardy-Weinberg equilibrium (HWE)
• Linkage disequilibrium (LD)

Proportion of polymorphic loci (P)

P = number of polymorphic loci/total number of assayed loci

Depends on the polymorphism criterion, e.g. 95%
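How the criterion changes P can be shown with a small sketch. It assumes the common convention that a locus counts as polymorphic when the frequency of its most common allele does not exceed the criterion; the allele frequencies and the function name are made up for illustration.

```python
def proportion_polymorphic(allele_freqs_by_locus, criterion=0.95):
    """P = polymorphic loci / assayed loci.

    A locus is treated as polymorphic when its most common allele has a
    frequency <= criterion (assumed convention).
    """
    polymorphic = sum(max(freqs) <= criterion for freqs in allele_freqs_by_locus)
    return polymorphic / len(allele_freqs_by_locus)


# Hypothetical allele frequencies at four loci
loci = [[1.0], [0.97, 0.03], [0.6, 0.4], [0.5, 0.3, 0.2]]

print(proportion_polymorphic(loci, criterion=0.95))
print(proportion_polymorphic(loci, criterion=0.99))
```

The second locus (0.97 / 0.03) is polymorphic under the 99% criterion but not under the 95% criterion, so the same data give two different values of P.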

No genetic variation at 47 protein loci in the cheetah (P = 0.0)

Mean number of alleles per locus (A)

A = total number of alleles / number of assayed loci
Sensitive to sample size
Allelic richness R = mean number of alleles per locus standardized to the smallest sample size

The most variable loci may have hundreds of alleles:
• microsatellites
• vertebrate major histocompatibility complex (MHC) genes

Hardy-Weinberg principle (equilibrium)

Relationship between allele and genotype frequencies

Two alleles with frequencies p (A1) and q (A2)
Genotype frequencies:

A1A1    A1A2    A2A2
$p^2$   $2pq$   $q^2$

No evolution: both allele and genotype frequencies will remain constant if:
• very large population (-> infinity)
• random mating
• no mutation
• no migration
• no selection

Equilibrium genotype frequencies are reached after one round of random mating
Genotype frequencies in natural populations are usually similar to H-W expectations
The null hypothesis in population genetics

Mean observed and expected heterozygosity

(HO and HE)

HO = proportion of heterozygous individuals in the sample

HE = proportion of heterozygotes expected from the Hardy-Weinberg principle Both can be computed for single loci and averaged across loci
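Both quantities follow directly from genotype counts. A minimal sketch for one biallelic locus, with hypothetical function names and illustrative counts of 4, 1 and 0 individuals for A1A1, A1A2 and A2A2:

```python
def allele_frequency_a1(n_a1a1, n_a1a2, n_a2a2):
    """p = copies of A1 / all gene copies (2N in a diploid sample)."""
    n = n_a1a1 + n_a1a2 + n_a2a2
    return (2 * n_a1a1 + n_a1a2) / (2 * n)


def observed_heterozygosity(n_a1a1, n_a1a2, n_a2a2):
    """HO = heterozygous individuals / all individuals."""
    return n_a1a2 / (n_a1a1 + n_a1a2 + n_a2a2)


def expected_heterozygosity(n_a1a1, n_a1a2, n_a2a2):
    """HE = 2pq under Hardy-Weinberg proportions."""
    p = allele_frequency_a1(n_a1a1, n_a1a2, n_a2a2)
    return 2 * p * (1 - p)


counts = (4, 1, 0)  # A1A1, A1A2, A2A2
print(allele_frequency_a1(*counts))
print(observed_heterozygosity(*counts))
print(round(expected_heterozygosity(*counts), 4))
```

For several loci the same functions would be applied per locus and the results averaged.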

A1A1: 4 individuals
A1A2: 1 individual
A2A2: 0 individuals

HO = 1/5 = 0.20
f(A1) = p = 0.9, f(A2) = q = 0.1
HE = 2pq = 2(0.9 x 0.1) = 0.18

Nucleotide diversity

Nucleotide diversity (π) – fraction of nucleotide positions differing between a pair of homologous DNA sequences picked randomly from the population; the mean of all pairwise comparisons; heterozygosity at the nucleotide level

$$\pi = \frac{\sum_{i<j} \pi_{ij}}{\binom{n}{2}} = \frac{\sum_{i=1}^{n} \sum_{j=i+1}^{n} \pi_{ij}}{n(n-1)/2}$$

$\pi_{ij}$ – fraction of nucleotide positions differing between sequences i and j

Pairwise differences (%) between sequences 1-11 (sequences 8, 9 and 10 are identical):

          1     2     3     4     5     6     7     8, 9, 10
2         0.13
3         0.59  0.55
4         0.67  0.63  0.25
5         0.80  0.84  0.55  0.46
6         0.80  0.67  0.38  0.46  0.59
7         0.84  0.71  0.50  0.59  0.63  0.21
8, 9, 10  1.13  1.10  0.88  0.97  0.59  0.59  0.38
11        1.12  1.18  0.97  1.05  0.84  0.67  0.46  0.42

π = 0.0065 = 0.65%

Haplotype diversity

• Expected heterozygosity (HE) is calculated from allele frequencies:

$$H_E = 1 - \sum_{i=1}^{n} p_i^2$$

• Therefore, the organism or locus does not need to be diploid to compute HE
• In the context of haploid species or markers (mtDNA, cpDNA, Y chromosome) HE is often termed haplotype diversity (h) and may be considered a measure of genetic diversity

Departures from Hardy-Weinberg (H-W) equilibrium

Genotype   Observed        Expected from H-W
A1A1       4 individuals   $p^2 = 0.9^2 = 0.81$ -> 4.05
A1A2       1 individual    $2pq = 2 \times 0.1 \times 0.9 = 0.18$ -> 0.9
A2A2       0 individuals   $q^2 = 0.1^2 = 0.01$ -> 0.05

f(A1) = p = 0.9, f(A2) = q = 0.1

Very good agreement with H-W expectations
Departures from H-W are usually tested with Fisher's exact test or its Monte Carlo versions

Causes of departures from H-W proportions

• Nonrandom mating • Inbreeding • Self-fertilization • Clonal reproduction • Wahlund effect (multiple populations differing in allele frequencies are considered a single population, heterozygote deficit) • Population structure • Contact zone of genetically differentiated forms • Selection • Null alleles Wahlund effect

Two isolated populations, each with H-W genotype proportions, differing in allele frequencies. The researcher (erroneously) assumes that (s)he is sampling a single panmictic population

                              Population I                  Population II
Allele frequencies            p = 0.9, q = 0.1              p = 0.1, q = 0.9
Proportion of heterozygotes   2pq = 2 x 0.9 x 0.1 = 0.18    2pq = 2 x 0.1 x 0.9 = 0.18

Our sample: po = 0.5, qo = 0.5
Expected heterozygosity: 2poqo = 2 x 0.5 x 0.5 = 0.5
Observed heterozygosity: 0.18
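The numbers of this example can be checked with a short calculation. The sketch assumes that all subpopulations contribute equally to the pooled sample (otherwise the pooled allele frequency would not be 0.5) and that each is in H-W proportions; the function name is hypothetical.

```python
def wahlund(p_values):
    """Observed vs. expected heterozygosity when equally sized subpopulations are pooled.

    p_values: frequency of allele A1 in each subpopulation.
    Returns (observed heterozygosity in the pooled sample,
             heterozygosity expected from the pooled allele frequency).
    """
    k = len(p_values)
    h_observed = sum(2 * p * (1 - p) for p in p_values) / k
    p_pooled = sum(p_values) / k
    h_expected = 2 * p_pooled * (1 - p_pooled)
    return h_observed, h_expected


h_obs, h_exp = wahlund([0.9, 0.1])
print(round(h_obs, 4), round(h_exp, 4))
print("heterozygote deficit:", round(h_exp - h_obs, 4))
```

When all subpopulations share the same allele frequency, the two values coincide and no deficit appears.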

Apparent deficit of heterozygotes caused by subdivision

Genetic drift

[Figure: genetic drift over time]

Drift is stronger in small populations, ~1/Ne

Effective population size (Ne)

• Ne is the size of an ideal population (constant size, non-overlapping generations, hermaphroditism, average reproductive success identical for all classes of individuals) that would lose genetic variation at the same rate as the examined population
• Usually much smaller than the census population size (N) because:
  • Not all individuals reproduce
  • Individuals vary in reproductive success
  • The sex ratio among reproducing individuals usually departs from 1:1
  • Population size fluctuates over generations

Frankham et al. 2010

Ne affected by sex ratio
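The effect of an unequal sex ratio among breeders is commonly written as $N_e = 4 N_m N_f / (N_m + N_f)$, where $N_m$ and $N_f$ are the numbers of reproducing males and females. A minimal sketch with a hypothetical function name:

```python
def ne_from_sex_ratio(n_males, n_females):
    """Effective population size given the numbers of breeding males and females."""
    return 4 * n_males * n_females / (n_males + n_females)


print(ne_from_sex_ratio(50, 50))             # equal sex ratio: Ne equals the number of breeders
print(round(ne_from_sex_ratio(1, 100), 2))   # one male, 100 females
```

Ne is limited by the rarer sex: with a single breeding male, Ne stays below 4 no matter how many females reproduce.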

$$N_e = \frac{4 N_m N_f}{N_m + N_f}$$

[Figure: Ne/N plotted against % males (0 to 100%); Ne/N reaches 1.0 at a 1:1 sex ratio]

Ne affected by sex ratio
• The northern elephant seal Mirounga angustirostris: harem breeding system; almost extirpated at the end of the 19th century; it is possible that all females were fertilized by a single male or very few males
• If 100 females and 1 male reproduce, then Ne = 4 x 1 x 100/(100 + 1) ≈ 4
• Ne is affected by the strength of sexual selection
• If sexual selection is stronger in one sex, the effective population size of the parts of the genome transmitted by this sex will be lower; here Y-chromosome Ne is lower than mtDNA Ne

Mating system affects Ne

The house cat – feral populations

Mating system   Ne/N
Promiscuous     0.42
Polygynous      0.33

Kaeuffer et al. 2004

Ne affected by population size changes
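Over several generations of fluctuating size, long-term Ne is the harmonic mean of the per-generation sizes, $N_e = t / \sum_i (1/N_i)$. A minimal sketch comparing it with the arithmetic mean (the function name is hypothetical):

```python
def long_term_ne(sizes):
    """Harmonic mean of population sizes across generations."""
    return len(sizes) / sum(1 / n for n in sizes)


sizes = [1000, 700, 200, 15, 100]
print("arithmetic mean:", sum(sizes) / len(sizes))
print("harmonic mean (long-term Ne):", round(long_term_ne(sizes), 1))
```

The single generation with 15 individuals dominates the result, which is why short bottlenecks have such a lasting effect on Ne.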

• Population size changes are the most important factor affecting the long-term Ne:

$$N_e = \frac{t}{\sum_{i=1}^{t} \frac{1}{N_i}}$$

• This is the harmonic mean of the population sizes across t generations
• The harmonic mean is smaller than the arithmetic mean
• For population sizes across generations of 1000, 700, 200, 15 and 100, the arithmetic mean equals 403 but Ne (the harmonic mean) equals 59

Estimating Ne with molecular methods

• Loss of heterozygosity in molecular markers across generations • Changes in allele frequencies across generations • Decrease in allelic richness • Long-term Ne can be estimated from population samples of DNA sequences The bottleneck effect

Low microsatellite variation in the isolated populations of the Alpine newt from Góry Świętokrzyskie (the Holy Cross Mts.)

[Figure with a time axis; Pabijan & Babik 2006]

The bottleneck effect

Strong selection may maintain variation even in very small populations, counteracting drift

The island fox Urocyon littoralis, off the coast of California
• The most monomorphic animal: lack of variation in microsatellites and DNA fingerprints
• Polymorphic MHC genes: immunity to pathogens? MHC-based mating? Variation maintained by selection

Aguilar et al. 2004

How much variation is lost depends on:

• Strength of the bottleneck • Duration of the bottleneck • Post-bottleneck gene flow

The European bison – restored from 12 (7) founders, but some variation in microsatellites and MHC is still maintained; a single-generation bottleneck

Linkage disequilibrium

• Linkage disequilibrium (LD) = nonrandom association of alleles at different loci, caused by:
  • physical linkage
  • admixture (migration, hybridization)
  • natural selection
  • drift
• Under free recombination LD is expected to decay exponentially; its effects last longer than departures from H-W at a single locus, which makes LD informative

Substantial population size is necessary for the maintenance of variation

On the shorter time-scale this prevents the loss of variation due to drift
On the longer time-scale it retains adaptive potential: standing genetic variation is important for adaptation

Genetic variation is necessary for natural selection

Local populations of almost all species show some differentiation – genetic structure

The simplest reason: limited dispersal abilities and philopatry

Selander’s studies in 1970s of the mice populations from multiple barns in a single farm Factors affecting genetic structure of populations

• Gene flow (migration) reduces genetic differentiation
• Drift increases differentiation
• Selection may either increase (local adaptations) or decrease (balancing selection) differentiation
• Mutations increase differentiation

Various expectations for various groups

Plants:
- Pollinated by insects of limited mobility: higher differentiation among populations, low variation within populations
- Wind-pollinated: low differentiation among populations
Birds and some insects with high dispersal abilities usually have weaker genetic structure than less mobile animals (like amphibians)

F statistics and population subdivision

Heterozygosity on three levels

Biallelic locus (easily extended to multiallelic):

$H_I$ – observed heterozygosity (from genotyping)

$H_S$ – expected heterozygosity in the i-th subpopulation, computed from its allele frequencies

$$H_S = 2 p_i q_i$$

$H_T$ – expected heterozygosity of the entire population, computed from the average (across subpopulations) allele frequencies

$$H_T = 2 p_o q_o$$

F statistics and population subdivision

Departures from the expected heterozygosity due to the nonrandom mating within subpopulations

$$F_{IS} = 1 - \frac{H_I}{H_S} = \frac{H_S - H_I}{H_S}$$

F statistics and population subdivision

Loss of heterozygosity due to the action of drift in subpopulations
Measure of genetic differentiation between populations
Variance of allele frequencies across subpopulations

$$F_{ST} = 1 - \frac{H_S}{H_T} = \frac{H_T - H_S}{H_T}$$

F statistics and population subdivision

Loss of heterozygosity due to the nonrandom mating within populations and population subdivision

FIT is less important than FIS and FST: it carries no independent information and is less intuitive

$$F_{IT} = 1 - \frac{H_I}{H_T} = \frac{H_T - H_I}{H_T}$$

Wright's (1978) rules of thumb for FST

0.00 - 0.05: low differentiation
0.05 - 0.15: moderate
0.15 - 0.25: high
0.25 - 1.00: very high

Caution!!! The maximum FST depends on the within population variation

Max FST = 1 - HS

Wright's scale useless for modern analyses
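The three F statistics and the upper bound on FST are simple functions of the three heterozygosities. The sketch below uses hypothetical function names and, as an illustration, the values HI = HS = 0.18 and HT = 0.5, i.e. two subpopulations in H-W proportions with allele frequencies 0.9 and 0.1.

```python
def f_statistics(h_i, h_s, h_t):
    """Return (FIS, FST, FIT) from observed (HI), subpopulation (HS) and total (HT) heterozygosity."""
    f_is = 1 - h_i / h_s
    f_st = 1 - h_s / h_t
    f_it = 1 - h_i / h_t
    return f_is, f_st, f_it


def max_fst(h_s):
    """Upper bound on FST imposed by within-population variation."""
    return 1 - h_s


f_is, f_st, f_it = f_statistics(h_i=0.18, h_s=0.18, h_t=0.5)
print(round(f_is, 3), round(f_st, 3), round(f_it, 3))
print("max FST:", round(max_fst(0.18), 3))

# The three statistics are linked: (1 - FIT) = (1 - FIS)(1 - FST)
print(abs((1 - f_it) - (1 - f_is) * (1 - f_st)) < 1e-12)
```

With highly variable markers such as microsatellites HS is often around 0.8 or more, so FST cannot exceed about 0.2 even between completely differentiated populations, which is the reason for the caution about fixed thresholds.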

Standardized FST

FST and detecting genes underlying local adaptations

Drift affects all loci; there is a genome-wide FST

Loci underlying local adaptations, or linked to such loci, would have higher FST

Genomic scans
Outlier loci

Taberlet 2007

Studying dispersion in natural populations

• Direct methods (field studies) • Direct observation of individuals in the field • Marking/tagging-based methods: capture – mark – recapture (CMR) • Telemetry • Fenced areas with monitored exits • Indirect methods utilizing molecular techniques Drawbacks of direct methods

• Limited area • Marking/tagging challenging in most species • Difficulties with tagging all individuals • Difficulties with recapture – emigration? Death? • Direct methods measure movements; usually no information about the reproductive success • Various direct methods provide very different estimates Which direct method is the best? Discrepancies between methods

Fraction of marmots (Marmota flaviventris) identified with CMR (empty bars) and telemetry (hatched bars)

[Figure: identified individuals (%) against distance (km), separate panels for males and females]

Males – mean dispersion: 0.77 km (CMR); 2.56 km (telemetry)
Females – mean dispersion: 0.51 km (CMR); 1.44 km (telemetry)

Koenig et al. 1996

Indirect methods

• Identification of migrants and their progeny based on multilocus genotypes • Requires genetic differentiation among populations, subtle differences are enough • Probabilistic interpretation • Descendants of migrants may be identified in further generations • Resolution depends on the number of markers and the extent of differentiation among populations • Markers: microsatellites, AFLP , SNP Identification of migrants

Methods which estimate, on the basis of a multilocus genotype, the fraction of the individual’s genome derived from each population

Based on probabilistic models and Bayes theorem – e.g. STRUCTURE – useful both in identification of interspecific hybrids and gene flow between slightly differentiated populations

Lecis et al. 2006 Sex-biased dispersion

• Biparentally inherited markers (allozymes, autosomal microsatellites, AFLP, autosomal SNPs)
  • Comparing males and females after dispersion but before reproduction
  • The signal disappears in the progeny
• Comparison of markers with contrasting modes of inheritance
  • Comparison of uniparentally inherited markers with autosomal markers
  • Comparison of maternally and paternally transmitted markers

Sex-biased dispersion and FST

Comparison of genetic differentiation among populations ( FST ) calculated from autosomal markers separately for each sex

Species                                   FST females  FST males  Source
White-toothed shrew (Crocidura russula)   0.025        0.145      Balloux et al. 1998
Canadian otter (Lontra canadensis)        0.131        0.064      Blundell et al. 2002
Marine iguana (Amblyrhynchus cristatus)   0.120        0.090      Rassmann et al. 1997

The dispersing sex is the one with the smaller population differentiation

Sex biased dispersion – markers with different inheritance modes

MARKERS: • mtDNA • genes (e.g. SRY ) or Y-chromosome microsatellites

COMPARISONS: Variation in uniparentally inherited markers (mtDNA, SRY ) ↓↑ Variation in autosomal markers (allozymes, microsatellites)

Variation in maternally transmitted markers (mtDNA) ↓↑ Variation in paternally transmitted markers (e.g. SRY ) Uniparentally inherited vs. autosomal markers

Comparison of FST for mtDNA and microsatellites

Species                                FST mtDNA  FST microsatellites  Dispersing sex  Source
Mouse-eared bat (Myotis bechsteinii)   0.961      0.015                males           Kerth et al. 2002
Jaguar (Panthera onca)                 0.342      0.045                males           Eizirik et al. 2001
Common shrew (Sorex araneus)           0.030      0.103                females         Balloux et al. 2000

Comparison of direct and indirect methods for studying dispersion

Direct methods Measure mobility of organism during the course of study

Indirect methods Measure effective dispersion , with successful reproduction, able to detect signals of the historical dispersion Comparison of direct and indirect methods for studying dispersion

[Figure: percentage of individuals dispersing by capture site, estimated with CMR and with indirect methods 1-3; Berry et al. 2004]

CMR: > 7 years; indirect methods: < 3 months

Dispersion – conclusions

• Direct methods have serious limitations • Molecular methods enable identification of migrants as well as the effective migration • Several approaches for the study of sex-biased dispersion • The application of molecular markers saves time and effort • Wherever possible, direct and indirect methods should be combined FST as a measure of gene flow

$$Nm \approx \frac{1 - F_{ST}}{4 F_{ST}}$$

Nm = the equilibrium per-generation number of effective migrants between two populations

If Nm < 1, genetic differences between populations will be fixed

Relationship valid under the island migration model

Multiple assumptions, which are rarely met in nature
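The conversion from FST to Nm is a one-line calculation. The sketch below (hypothetical function name) simply evaluates Nm ≈ (1 - FST) / (4 FST); it inherits every assumption of the island model and is not a recommendation to use the estimate.

```python
def nm_from_fst(fst):
    """Equilibrium number of effective migrants per generation under the island model."""
    if not 0 < fst < 1:
        raise ValueError("FST must lie strictly between 0 and 1")
    return (1 - fst) / (4 * fst)


for fst in (0.01, 0.05, 0.2, 0.5):
    print(fst, round(nm_from_fst(fst), 2))
```

The relationship is strongly non-linear: small errors in a low FST translate into large changes in Nm.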

Nm estimates calculated from FST should be treated with extreme caution

Genetic structure in fishes

Increases from marine, through anadromous, to freshwater fish: 1.6, 3.7 and 29.4%; in general, freshwater environments are much more fragmented

Isolation by distance (IBD)

Genetic differentiation (genetic distance) between populations correlates with geographic distance
• Usually the species range is much larger than individual dispersal abilities
• Geographically closer populations are often more similar genetically (philopatry, recent common history)
• Model relevant for populations at migration-drift equilibrium

Isolation by distance (IBD)

Correlation between estimated gene flow and geographic distance in the moor frog south of the Carpathians, no correlation north of the Carpathians – recent expansion

Rafiński & Babik 2000

No correlation between genetic differentiation and geographic distance

• Pairwise FST low – recent expansion or ongoing extensive gene flow

• Pairwise FST high – drift much stronger than gene flow

Isolation by distance (IBD)

[Figure: four panels of pairwise FST against log km (Hutchison & Templeton 1999):
 - No IBD, strong migration
 - IBD, drift-migration equilibrium
 - No IBD, drift outweighs migration
 - "Broken" IBD, drift outweighs migration at the larger geographic scale]

Habitat structure and dispersal

• Ornate dragon ( Ctenophorus ornatus ) lives on granite outcrops in W Australia • Some outcrops surrounded by bush (Reserve), others by arable land (Cleared land) • No reduction in habitat size – outcrops not suitable for agriculture • 22 microsatellite loci • Comparison of gene flow between outcrops in reserve and in cleared land Habitat structure and dispersal

Levy et al. 2010

Structure

• How many genetically distinct groups inhabit the area?
• K is the assumed number of groups (populations)
• The algorithm estimates allele frequencies characteristic of each of the K groups at multiple loci and simultaneously estimates, for each individual, the fraction of its genome derived from each group
• The appropriate K is selected by comparing the fit of the model to the data for various K values
• Structure allows not only determining the number of populations and assigning individuals to populations, but also identifying migrants, their progeny and interspecific hybrids

Lecis et al. 2006

Genomes contain historical information
• Species age
• Population size and its changes
• Ancestral population size
• Demographic aspects of speciation
• Patterns of historical and contemporary gene flow between populations and species
• Recombination rate
This information can be extracted

Gene genealogies

Gene genealogies = genealogies = gene trees = trees Representation of relationships between a sample of sequences from contemporary populations or closely related species Genealogies can be inferred if there is variation in the sample Genealogies exist also for invariable samples! Gene trees are not species trees

[Figure: two panels: gene tree identical to the species tree; gene tree different from the species tree, polymorphism in the ancestral species]

Gene genealogies are affected by demography

[Figure: genealogies under constant size, bottleneck and exponential growth; Barton et al. 2007]

Sequence data are powerful and convenient
• Available for any organism
• Provide genealogical information
• The only source of information when fossils/subfossils are absent
• Fossils are of limited value on the recent timescale
  • But they may be used for tracing range changes, e.g. pollen
• Combining genetic and fossil information in formulating and testing hypotheses

A framework to extract information from sequences is needed

Coalescent theory

• Is about ancestry of a sample of sequences, NOT individuals • In diploid sexuals number of ancestors increases back in time, each individual has two parents • A gene copy (sequence) has always a single parent • Sometimes two gene copies have a common parent – they COALESCE

Felsenstein www Coalescent theory

• Coalescent is about ancestry of a sample • Time goes back to the Most Recent Common Ancestor (MRCA) of a sample Coalescent is about ancestry of a sample

• We trace the ancestry of the sample as lineages merge (coalesce)
• The rate of coalescence depends on the number of lineages (i) and the population size (N): $\frac{i(i-1)}{4N}$
• In small populations coalescence is faster!

Kuhner www

Coalescent theory

• Coalescent is about ancestry of a sample
• Time goes back to the Most Recent Common Ancestor (MRCA) of a sample
• Probability of a given genealogy for a given population size (N) can be computed

Probability of a genealogy can be computed
• For a given N:

$$P(G \mid N) = \prod_i \frac{1}{2N} \exp\left(-\frac{i(i-1)}{4N}\, u_i\right)$$

• But we have to know the $u_i$'s (the intervals $u_2$, $u_3$, $u_4$ marked on the genealogy in the figure)
• We cannot measure time
• But we can measure divergence
• Instead of N we have a composite parameter

$$\Theta = 4\mu N$$

Coalescent theory

• Coalescent is about ancestry of a sample • Time goes back to the Most Recent Common Ancestor (MRCA) of a sample • Probability of a given genealogy for a given population size ( N) can be computed • Derived for the Wright-Fisher model… Wright-Fisher model

• Very simple but useful • Population size N constant • Non-overlapping generations • Next generation is drawn from an effectively infinite pool of gametes • All mutations are neutral
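Under these Wright-Fisher assumptions the genealogy probability from the earlier slide, P(G | N) = ∏ (1/(2N)) exp(−i(i−1)u_i/(4N)), can be evaluated directly. The sketch below assumes, unrealistically, that the interval lengths u_i (generations during which the sample has i lineages) are known; the slides stress that in practice only divergence is observed, which is why Θ = 4µN replaces N. The function names and the example intervals are illustrative only.

```python
import math

def log_prob_genealogy(intervals, N):
    """log P(G | N) for coalescent intervals given in generations.

    intervals: dict mapping i (number of lineages) -> u_i (interval length)
    N:         diploid population size
    """
    return sum(
        -math.log(2 * N) - i * (i - 1) * u / (4 * N)
        for i, u in intervals.items()
    )

def ml_population_size(intervals):
    """N that maximizes log P(G | N); obtained by setting the derivative to zero."""
    k = len(intervals)  # number of coalescent events
    return sum(i * (i - 1) * u for i, u in intervals.items()) / (4 * k)

# Hypothetical genealogy of 5 sequences whose intervals equal their
# expected lengths 4N / (i(i-1)) for N = 1000.
intervals = {i: 4 * 1000 / (i * (i - 1)) for i in range(2, 6)}
n_hat = ml_population_size(intervals)
print(n_hat, log_prob_genealogy(intervals, n_hat))
```

With intervals of exactly their expected length the maximum sits at the N used to build them; real genealogies vary widely around that expectation.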

Kuhner www Coalescent theory

• Coalescent is about ancestry of a sample • Time goes back to the Most Recent Common Ancestor (MRCA) of a sample • Probability of a given genealogy for a given population size ( N) can be computed • Derived for the Wright-Fisher model… • But holds for many other models when effective population size Ne is used instead of N • Coalescent is a powerful simulation tool Coalescent is a stochastic process

• Different parts of a genome in sexual organisms have different histories
• Trees for different genes in a given population differ greatly for stochastic reasons

Hein et al. 2005

Add more loci, not individuals!

Coalescent is a powerful simulation tool
• Only ancestry of a sample has to be simulated
• Sample sizes can be small

Increasing sample sizes doesn’t help much

• A few sequences usually represent the full tree depth for a given locus
• Sampling more sequences adds mainly lineages close to the tips and gives little additional information

Coalescent is a powerful simulation tool
• Only ancestry of a sample has to be simulated
• Sample sizes can be small
• Many scenarios can be simulated – hypothesis testing
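The claim that a few sequences already span most of the tree depth follows from the coalescence rate i(i−1)/(4N) given earlier. A minimal sketch, assuming the rate is a per-generation probability so that the expected time with i lineages is its reciprocal, 4N/(i(i−1)); the function names and the value of N are illustrative.

```python
def expected_waiting_time(i, N):
    """Expected generations during which the sample has i lineages."""
    return 4 * N / (i * (i - 1))

def expected_tmrca(n, N):
    """Expected time to the MRCA of n sequences: sum over i = n, ..., 2."""
    return sum(expected_waiting_time(i, N) for i in range(2, n + 1))

N = 1000
for n in (2, 5, 10, 100):
    depth = expected_tmrca(n, N)
    # The sum telescopes to 4N(1 - 1/n), so the depth approaches 4N as n grows.
    print(n, round(depth), round(depth / (4 * N), 3))
```

Two sequences are expected to reach half of the limiting depth of 4N generations and ten sequences 90% of it, so further sampling mostly adds short branches near the tips.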

Knowles & Carstens 2007 Carstens & Richards 2007 Bayesian inference

° aim: estimate parameter value – its probability distribution
° prior: prior information about the parameter value
° likelihood: information contained in the data, extracted using the likelihood function
° posterior probability of the parameter value: combination of prior and likelihood
° calculating likelihood for genetic data is computationally expensive or even impossible

[Figure: prior, likelihood and posterior drawn as probability curves over the parameter value]

Approximate Bayesian Computations (ABC)
° comparison between models
° estimation of model parameters
° we want to compare competing models, and estimate parameters of the better (best) one

[Figure: two competing models. Isolation: an ancestral population of size NA splits at time t into populations of sizes N1 and N2. Isolation with Migration (IM): the same split with migration rates m12 and m21]

ABC

° compute summary statistics from the observed data: Fst, shared polymorphisms, fixed differences etc. ° simulate datasets under the competing models, values of model parameters for simulations from priors ° compute summary statistics from simulated datasets ° retain simulations which produced summary statistics similar to observed ° proportion of simulations retained under a model approximates its posterior probability – model selection ° parameter values that generated retained simulations approximate posterior – parameter estimation ABC

° idea: those models (parameter values) which generate data similar to the observed data reflect the evolutionary process
° straightforward logic
° no need to calculate the likelihood function
° flexible: even very complex models may be evaluated, because coalescent simulators can simulate data under complex models
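The rejection logic above can be sketched in a few lines. This is a toy example rather than a published ABC pipeline: it assumes a single constant-size population, a uniform prior on Θ = 4µN, and one summary statistic, the number of segregating sites S (the slides list Fst, shared polymorphisms and fixed differences instead). Time is measured in units of 2N generations, so the coalescence rate with i lineages is i(i−1)/2 and mutations fall on the tree at rate Θ/2 per unit of branch length. The observed value, prior range and tolerance are hypothetical.

```python
import random

def simulate_segregating_sites(n, theta, rng):
    """Simulate the number of segregating sites for n sequences."""
    tree_length = 0.0
    for i in range(n, 1, -1):
        wait = rng.expovariate(i * (i - 1) / 2)  # time with i lineages
        tree_length += i * wait                  # i branches exist during it
    expected_mutations = theta / 2 * tree_length
    # Poisson draw: count unit-rate arrivals before `expected_mutations`.
    count = 0
    arrival = rng.expovariate(1.0)
    while arrival < expected_mutations:
        count += 1
        arrival += rng.expovariate(1.0)
    return count

def abc_rejection(s_obs, n, n_sims=5000, prior_max=20.0, tol=1, seed=1):
    """Keep prior draws of theta whose simulated S is within tol of s_obs."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_sims):
        theta = rng.uniform(0.0, prior_max)      # draw from the prior
        s_sim = simulate_segregating_sites(n, theta, rng)
        if abs(s_sim - s_obs) <= tol:            # similar to observed: retain
            accepted.append(theta)
    return accepted

posterior = abc_rejection(s_obs=10, n=10)
posterior_mean = sum(posterior) / len(posterior)
print(len(posterior), round(posterior_mean, 2))
```

Model selection works the same way: simulate under each competing model and compare the proportions of retained simulations.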

[Figure: ABC workflow. Collect samples, generate sequence data (wet lab/bioinformatics), compute summary statistics from observed data. In parallel: formulate models (M1, M2, M3), simulate data under the models (parameter values from the prior), compute summary statistics from simulated data. Compare simulated with observed statistics (≈ ?), then select & refine the model, estimate its parameters and learn about biological processes]

Phylogenetic trees

Both trees contain identical information; topology only

[Figure: two drawings of the same tree of taxa A–E with outgroup O; nodes and branches labelled]

Phylogenetic trees

This tree contains information about branch lengths, that is, the “amount of evolution” along each branch

[Figure: additive tree of taxa A–E with outgroup O; scale bar 0.01]

Phylogenetic trees

Here branch lengths are also shown, but under the assumption that each taxon is equally distant from the common ancestor – branch lengths are time-scaled

[Figure: ultrametric tree of taxa A–E with outgroup O; axis from 0.14 to 0.00]

Phylogenetic trees

Unrooted trees: both are identical and the “origin” or root can be placed anywhere

[Figure: two unrooted trees of taxa A–E and O; scale bars 0.01]

Phylogenetic trees

Rooted trees – these trees are very different, because they differ in root location

[Figure: two rooted trees of taxa A–E and O with the root marked; scale bars 0.01]

Phylogenetic trees

Usually a tree is rooted with an outgroup

[Figure: tree of taxa A–E rooted with outgroup O; root and outgroup labelled; scale bar 0.01]

Phylogenetic trees

Mono-, para- and polyphyly

[Figures: tree of taxa A–E with outgroup O, with the group in question marked]

Monophyletic group – common ancestor and all its descendants. Only monophyletic groups are natural

Paraphyletic group – common ancestor and only some of its descendants

Polyphyletic group – taxa derived from different ancestors are grouped together

Phylogenetic trees

A single phylogenetic tree can be drawn in various ways

[Figures: the same tree of taxa A–E with outgroup O drawn in several layouts; scale bars 0.01]

Phylogenetic trees

• Reconstruct relationships; are based on similarity
• Can be unrooted or rooted
• A single tree can be visualized in various ways
• The branching order is called topology
• Branch lengths may be uninformative, may show the amount of evolutionary change, or may be time-scaled
• Enormous number of possible topologies: for 10 taxa about 2 million unrooted (over 34 million rooted), for 20 taxa about 2 × 10^20 unrooted

Phylogenetic trees are inferred from various kinds of data
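The topology counts in the summary above can be reproduced with the standard double-factorial formulas for strictly bifurcating trees, which the slide does not state: (2n−5)!! unrooted and (2n−3)!! rooted topologies for n taxa. A short sketch with illustrative function names:

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (or 2)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def unrooted_topologies(n):
    """Number of unrooted bifurcating topologies for n >= 3 taxa."""
    return double_factorial(2 * n - 5)

def rooted_topologies(n):
    """Number of rooted bifurcating topologies for n >= 2 taxa."""
    return double_factorial(2 * n - 3)

for n in (5, 10, 20):
    print(n, unrooted_topologies(n), rooted_topologies(n))
```

Each added taxon multiplies the count by an ever larger odd number, which is why exhaustive tree searches quickly become infeasible.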

• Morphological characters • Behavioral characters • Life history traits • Allele frequencies (e.g. allozymes, microsatellites) • RFLP, AFLP data • Protein sequences • DNA/RNA sequences

Most important are DNA and protein sequences Past climatic changes – facts

• gradual cooling in the Cenozoic • onset of the Pleistocene glaciations ca 2.7 - 2.4 my BP

Lambeck et al. 2002 Zachos et al. 2001 Past climatic changes – facts

• cyclic climatic changes in the Pleistocene – long glacials and short interglacials beginning ca. 2.7 -2.4 Ma • palaeotemperature reconstructions from the Antarctic and Greenland ice cores and marine sediments

[Figure: palaeotemperature reconstructions from ice cores and from marine sediments]

EPICA 2004

• oscillations of increasing amplitude and periodicity changing from 41 to 100 ky began ca. 700 ky BP Past climatic changes – facts

• The last glacial period began ca. 115-110 ky BP • Its coldest part, the last glacial maximum (LGM) ca. 25-15 ky BP

Hewitt 2000 Tarasov et al. 2000 Organisms and climate change

• dramatic climatic changes may occur abruptly, even in less than 100 years
• such rapid climatic changes are much too fast for organisms to adapt
• organisms have to shift their ranges, tracking suitable climatic conditions, or go extinct
• temperate and boreal species were particularly affected by the Pleistocene climatic oscillations
• their Pleistocene history has been characterized by multiple expansions and contractions of their geographic ranges

Organisms and climate change

Populations from the southern peninsulas gradually expanded their ranges as the climate ameliorated

With the onset of the next glacial cycle the populations in the north went extinct

Thus, populations were constantly present only in the refugial areas; they have differentiated Phylogeography

• geographic distribution of genealogical lineages • range changes • population size changes • responses to climate changes • uses information contained in the genomes of contemporary organisms Why animal mtDNA?

• In the pre-PCR era easily accessible; restriction maps of the whole mtDNA molecule
• PCR – universal primers and sequencing
• No recombination
• Maternally inherited – matrilineal genealogies
• Rapid evolution
• Effective population size (Ne) up to 4 times smaller than for nuclear genes – faster lineage sorting
• Majority of mutations neutral?

Distribution of mtDNA lineages and phylogeography

Lansman et al. 1983 Problems with mtDNA

• mtDNA is a single “locus”
• Gene phylogenies are not organismal phylogenies – inferences based on a single gene should be treated with caution

Gene trees are not species trees

[Figure: left, gene tree identical to the species tree; right, gene tree different from the species tree owing to polymorphism in the ancestral species]

Problems with mtDNA
• Information on maternal lineages only
• Selection – fixation of a beneficial mutation erases all historical information from before the selective sweep
• Nuclear pseudogenes – numtDNA, numts

Nuclear genes in phylogeography
Increasingly common use, but:
• Technical problems, gene families and paralogs
• Recombination – analytical problems
• Usually slower evolution – less information
• Selection and linkage
• Larger Ne – slower lineage sorting

The use of nuclear sequences is an absolute must in modern phylogeography, but mtDNA remains extremely useful for quickly obtaining an initial, preliminary picture

Phylogeography of newts

[Figure labels: L. v. vulgaris, L. v. ampelensis, L. v. lantzi, L. v. schmidtlerorum, L. v. meridionalis, L. v. graecus, L. v. kosswigi]

The smooth and Carpathian newts

High variation, several old mtDNA lineages

[Figure: mtDNA tree labelled with the values 1.0, 1.9, 1.0, 2.6, 3.0, 1.3, 3.4, 4.5]

Babik et al. 2005

Phylogeography of newts

The distribution of mtDNA lineages indicates the presence of the most important refugia in Central Europe Babik et al. 2005 Nuclear phylogeography of newts

• sequences of 74 nuclear genes
• little admixture
• nuclear genes more concordant with morphology
• extensive introgression of mtDNA

Comparative phylogeography
• Abies alba (silver fir): limited colonization from two refugia
• Picea abies (Norway spruce): colonization from the east (cold-tolerant species)
• Fagus sylvatica (common beech): from a single refugium
• Quercus sp. (white oaks): from three refugia

[Figure: colonization maps with labels Q1, Q2, Q3, F1, F2]

Comparative phylogeography

Various colonization patterns In general, species react individually

Hewitt 2000 Natural selection

• Differential survival and reproduction of individuals depending on their heritable (genetically determined) features
• The measure of an individual’s “quality” is its fitness, often measured as lifetime reproductive success: the ability to produce as many offspring of the highest possible quality as possible
• Fitness can be attributed to an individual, an allele, or a genotype
• Natural selection is the only known mechanism producing adaptations, i.e. features improving the functioning of organisms in their environment
• Selection is certainly crucial at the organismal level, but what about the molecular level?

Neutralists vs selectionists

Neutralists claim that most variation at the molecular level does not affect fitness (is neutral); thus drift is the most important mechanism of molecular evolution

Selectionists claim that a substantial fraction of the observed molecular variation has been shaped by natural selection

Possibly, the truth is somewhere in the middle; huge amounts of data from population genomics should help

Caution! Both agree that most mutations in functional regions are unconditionally deleterious and as such are removed by selection -> the controversy is about the variation observed in populations

Detecting selection in protein-coding genes

• Due to the degeneracy of the genetic code, protein-coding sequences contain both synonymous and nonsynonymous sites
• One can calculate the number of substitutions per synonymous site (dS) and per nonsynonymous site (dN); their ratio is dN/dS = ω
• Under neutrality the expected ω = 1
• ω < 1: excess of synonymous substitutions, purifying selection
• ω > 1: excess of nonsynonymous substitutions, positive selection
• Most proteins are under purifying selection; its strength depends on the degree of functional constraint imposed on the protein’s sequence; some proteins are extremely conserved, e.g. histones, ubiquitin

Detecting selection in protein-coding genes
• Even in genes evolving under positive selection most amino acid substitutions will be deleterious – thus the average dN/dS is rarely > 1, although such cases are known
• Methods identifying single codons under positive selection
• Positive selection may affect many genes, but act only for short time periods
• FOXP2 in the human lineage
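The ω ratio defined on these slides can be computed from counts of differences and sites. The sketch below assumes the counting has already been done (for example with a Nei-Gojobori style procedure, which is not described in the slides) and applies a Jukes-Cantor correction for multiple hits; all counts are hypothetical, and no significance test is included.

```python
import math

def jukes_cantor(p):
    """Correct a proportion of differing sites p for multiple substitutions."""
    return -0.75 * math.log(1 - 4 * p / 3)

def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (dN, dS, omega) from difference and site counts."""
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor(syn_diffs / syn_sites)
    return dn, ds, dn / ds

def interpret(omega):
    """Map omega onto the categories used on the slide."""
    if omega < 1:
        return "purifying selection"
    if omega > 1:
        return "positive selection"
    return "neutral"

# Hypothetical pairwise comparison of two coding sequences.
dn, ds, omega = dn_ds(nonsyn_diffs=6, nonsyn_sites=300, syn_diffs=10, syn_sites=100)
print(round(dn, 4), round(ds, 4), round(omega, 3), interpret(omega))
```

A gene-wide ω averages over all codons, which is why the codon-level methods mentioned above are needed to find the few sites under positive selection.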

Enard et al. 2002 Detecting selection in protein-coding genes

• Genes evolving primarily under positive selection are involved in: • Avoiding mating with relatives – self-incompatibility systems in plants • Fighting pathogen assault – major histocompatibility complex (MHC) genes in vertebrates • Reproduction and intersexual conflict– species-specific interaction between sperm and egg, proteins in seminal fluids of Drosophila • Alleles of these genes evolving under positive selection are very often also under balancing selection, which maintains in populations many old and highly divergent alleles • Such allelic lineages are often older than species (transspecific polymorphism) Balancing selection and transspecific polymorphism

• Self-incompatibility genes in plants prevent matings between relatives • Rare alleles favoured, high variation in populations • Allelic lineages maintained by selection for long time • Phylogeny of alleles and species discordant

Richman & Kohn 1999 Major Histocompatibility Complex (MHC)

• Gene-dense region in genomes of all jawed vertebrates • Immune system and other genes • MHC in ecology & evolution ≈ classical MHC class I and II • Key components of the adaptive immune response MHC class I and II

• Present pathogen antigens to T lymphocytes -> mount adaptive immune response • Amino acids in the Antigen Binding Sites (ABS) determine which antigens can be presented • Each MHC molecule presents a limited set of antigens

[Figure: MHC I molecule and MHC II molecule]

Janeway et al. 2001 MHC genotype and disease

• Some MHC genotypes/alleles correlate with disease or immunity to disease: • HIV • Malaria • Parasites • Fitness consequences of such associations and effects on population viability poorly known • Low MHC variation has been linked to disease: cheetah, cottontop tamarins wikipedia Signatures of selection in MHC genes

• Almost universally detected in ABS • Selection: • Recurrent • Positive • Balancing • Signatures: • Excess of nonsynonymous (replacement) substitutions in ABS, dN/dS > 1 • Extreme variation – hundreds of alleles in a population • Trans-species polymorphism – old allelic lineages

Paradigm of adaptation at the molecular level Balancing selection on MHC genes

• Pathogens appear most important
• Sexual selection, kin recognition/inbreeding avoidance and maternal-fetal interactions may contribute
• Predominant form of balancing selection unclear:
• Heterozygote advantage
• Rare allele advantage (negative frequency-dependent selection)
• Fluctuating selection
These mechanisms generate similar predictions – distinguishing between forms of selection is difficult or even impossible

MHC
MHC class II in the European bison: Białowieża population, restored from 7 individuals; 4 MHC II alleles retained, very divergent, clear signatures of positive selection in ABS

Radwan et al. 2007

MHC
Beaver MHC class II: very divergent alleles retained in the species as a whole, but no variation within populations; historically positive selection, recently the bottleneck effect – the role of drift

Babik et al. 2005 Genetic differentiation and selection

• Genes responsible for local adaptations will be more differentiated between populations • Genes under uniform balancing selection will show less differentiation
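Differentiation in this sense is usually summarized with FST. A minimal sketch for one biallelic locus, assuming equally weighted populations and the heterozygosity-based definition FST = (HT − HS)/HT, which the slides do not spell out; the allele frequencies are hypothetical.

```python
def fst_biallelic(freqs):
    """FST at one biallelic locus from the frequency of one allele per population."""
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)                          # expected H in the pooled population
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)  # mean within-population H
    return (h_t - h_s) / h_t if h_t > 0 else 0.0

# A locus under local adaptation: very different frequencies.
print(fst_biallelic([0.1, 0.9]))
# A locus under uniform balancing selection: similar frequencies.
print(fst_biallelic([0.5, 0.5]))
```

Genome scans compare such per-locus values with the genome-wide distribution and flag the outliers in either direction.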

[Figure: FST plotted against heterozygosity (H); FST in human genes]

Aguilar & Garza 2006, Barton et al. 2007

Selective sweeps

• A new beneficial mutation will be driven to fixation
• Variants present in this region on other chromosomes will be lost
• Variants closely linked to the beneficial mutation, on the chromosome on which it originated, will hitchhike to high frequency
• The size of the sweep region depends on the strength of selection and the recombination rate

Nielsen et al. 2007 Selective sweeps

Locally (on a chromosome) reduced variation and increased linkage disequilibrium are signatures of selective sweeps

Nielsen 2005 The age of beneficial mutations can be inferred • Each beneficial mutation appears in some „genomic context” • There will be some region of linkage disequilibrium (LD) • From the extent of LD and recombination, the age of the mutation can be inferred
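One simple way to see how LD and recombination together date a mutation is the textbook decay of linkage disequilibrium, D_t = D_0 (1 − r)^t, solved for t. This is an assumed model for illustration, not necessarily the method behind the cited figure; it ignores drift, selection and variation in the recombination rate, and the numbers are hypothetical.

```python
import math

def ld_after(d0, r, generations):
    """Expected LD after a number of generations with recombination fraction r."""
    return d0 * (1 - r) ** generations

def mutation_age(d_now, d0, r):
    """Generations needed for LD to decay from d0 to d_now."""
    return math.log(d_now / d0) / math.log(1 - r)

# A marker 1 cM from the mutation (r = 0.01), initially in complete LD.
d0 = 1.0
d_now = ld_after(d0, r=0.01, generations=100)
age = mutation_age(d_now, d0, r=0.01)
print(round(d_now, 3), round(age, 1))
```

Markers at several distances give several estimates of the same age, which is why the extent of the LD region is informative.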

Barton et al. 2007 Detecting positive selection in the genome Summary

• Strong selection leaves genomic signatures • Various signatures are retained for different amounts of time: • Long (my): increased fraction of functional changes • Moderate (hundreds ky) – reduction of variation caused by selective sweeps • Rather short (tens ky) – high frequency of derived alleles • Short (tens ky) – genetic differentiation among populations • Very short (several ky) – long haplotypes

Sabeti et al. 2006 Endangered species Conservation genetics

• The application of genetic, and particularly molecular methods for conservation of species, species assemblages and biomes • Provides information how to optimally use limited resources for conservation measures • Heterozygosity and population viability, inbreeding depression • Identification of relatives in breeding programmes • Identification of genetic diversity hotspots and delineation of the conservation units, MU, ESU • Identification of processed products derived from protected species – implementation of conservation laws Conservation genetics What kinds of questions are being asked? Variation, inbreeding and extinction risk

• Genetic variation is essential for adaptation • Majority (80%) of endangered species have lower variation than related non-endangered species • Increased inbreeding increases extinction risk • In small populations inbreeding increases, variation decreases, populations become more vulnerable to stochastic factors – extinction vortex

Frankham et al. 2010 Low heterozygosity is (often) bad • Molecular markers are best suited for assessing genome-wide heterozygosity (H) • Low H may be a consequence of population size reductions – bottlenecks • Very low variation in the cheetah • FIP virus caused 50-60% mortality in captive cheetahs compared to 1% in the domestic cat • High juvenile mortality, developmental defects • Gir Forest lions: low testosterone, abnormal sperm • Endangered Puerto Rico pigeon: reduced clutch size • Fish Poeciliopsis occidentalis : all assayed fitness components (survival, growth rate, fecundity and developmental stability) positively correlated with allozyme heterozygosity Population bottlenecks cause both the loss of variation and the reduction of evolutionary potential

Frankham et al. 2010 Inbreeding depression

• Inbreeding depression (ID) is the reduction of survival, fecundity or growth rate as a consequence of mating among relatives
• ID is particularly important in conservation genetics because the inbreeding level in small populations is often substantial
• Genetically, inbred populations have reduced heterozygosity (increased homozygosity)
• Explanation – two hypotheses:
• Dominance: the reduction of fitness is caused by rare, recessive deleterious alleles whose effects are exposed when homozygous; species with a long history of inbreeding should fare better (purging)
• Overdominance: multilocus heterozygosity per se increases fitness
• More support for the dominance hypothesis

Inbreeding depression

Prediction from the dominance theory is that ‘purging of deleterious mutations’ should be observed

Barton et al. 2007 Inbreeding depression

• Widespread among organisms NOT reproducing by self-fertilization
• Has a substantial stochastic component
• Usually more pronounced under stressful conditions
• Usually more pronounced in the wild than in captivity
• Does not occur in haploid organisms or in genes without dominance or overdominance
• ID can be reversed in small populations by introducing individuals from other populations (of course only if they mate and produce progeny with residents) – genetic rescue

50/500 principle and MVP

Minimum Viable Population (MVP) – the minimum population size that allows a population to survive and to adapt to changing conditions. The MVP should correspond to an effective population size (Ne) of 50/500 (50 in the short term, 500 in the long term)

But, if on average Ne = 1/10 N, then N (MVP) should be 500/5000

Frankham et al. 2010 Managing captive populations

Usually the target of management is retention of 90% initial genetic variation for 100 y Compromise necessary because of limited resources
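The 90% over 100 years target can be turned into a required effective size if one assumes the standard drift expectation for heterozygosity, H_t/H_0 = (1 − 1/(2Ne))^t with t in generations. That formula is not stated on the slide, and the generation length used below is a hypothetical value.

```python
def retained_heterozygosity(ne, generations):
    """Expected fraction of initial heterozygosity left after drift."""
    return (1 - 1 / (2 * ne)) ** generations

def ne_required(fraction, generations):
    """Effective size needed to retain `fraction` of heterozygosity."""
    return 1 / (2 * (1 - fraction ** (1 / generations)))

years = 100
generation_length = 10  # hypothetical, in years
generations = years / generation_length
ne = ne_required(0.9, generations)
print(round(ne, 1), round(retained_heterozygosity(ne, generations), 3))
```

Species with short generations pass through more generations in 100 years and therefore need a larger Ne to meet the same target, which is where the compromise with limited resources comes in.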

Frankham et al. 2010 Outbreeding depression

Decreased fitness of progeny from mating between individuals from genetically divergent populations

Alpine ibex (Capra ibex) in the Tatra Mts – admixture of individuals from Turkey and Egypt caused the young to be born in February, resulting in 100% mortality from cold and starvation (Beebee & Rowe 2004)

Outbreeding depression is rare in interpopulation crosses. Translocations can be extremely effective in conservation, but carry risks related to the transmission of pathogens and the swamping of local adaptations

Taxonomic uncertainties

• Knowledge of the evolutionary relationships of populations and groups of populations within species is extremely valuable in conservation
• Sometimes inadequate knowledge of taxonomy may lead to species extinction -> cryptic species
• Sometimes excessive effort is devoted to the conservation of populations which are not really distinct from other, non-threatened populations
• Resolving taxonomic uncertainties should be the first step in conservation strategies – molecular methods

MU and ESU

• Evolutionarily Significant Units (ESU) – populations or groups of populations whose evolutionary history is largely distinct from that of other such units (Ryder 1986). ESUs are the main sources of historical (and adaptive) genetic diversity within species, and as such should be given priority in conservation
• If a population exchanges so few migrants with other populations that it becomes demographically independent of them, it is assigned the status of a Management Unit (MU) – a conservation unit of lower rank than an ESU
• If, as is often the case, sufficient data on migration are not available, an MU may be provisionally defined by significant differences in allele frequencies at neutral marker loci, e.g. mtDNA

ESU
• ESUs are delineated primarily with molecular methods; one of the most popular definitions assigns ESU status to groups of populations which are “reciprocally monophyletic in mtDNA and differ significantly in the frequencies of nuclear alleles” (Moritz 1994)
• ESU and MU represent different levels of intraspecific differentiation
• Example: the green sea turtle
• Breeding colonies are MUs
• Populations from different oceanic basins constitute separate ESUs

Implementation of conservation laws

• Identification of the biological source of blood, meat, feathers and processed goods derived from endangered or illegally obtained species
• Molecular markers enable species identification even in very difficult cases
• Since the 1970s allozymes have been used for monitoring closed seasons and protected areas in fisheries
• 1986 moratorium on whale hunting
• Using a database of mtDNA sequences, compliance with the moratorium was monitored by checking whale products from markets in SE Asia, Japan and Scandinavia – illegal hunting detected
• Internet program of whale DNA surveillance
• 1995 Ashland (Oregon) – U.S. Fish and Wildlife Service animal forensics lab

Diagnosing genetic problems

• How large is the population (Ne)?
• Has it gone through bottleneck(s)?
• Has it lost genetic variation?
• Is inbreeding depression apparent?
• What does the geographic distribution look like?
• Is the population fragmented to such an extent that gene flow between fragments is impeded/impossible?

Genetic rescue
If a population is small and suffers from inbreeding depression, translocations of individuals from other populations may be very helpful

Genetic rescue

Frankham et al. 2010 Translocations • Helpful in saving populations of endangered species • Continuous monitoring • Before translocation: • Which individuals should be translocated • How many? • How often? • From where? • When to stop? • Translocations should be monitored both via observation and with molecular markers Captive breeding programs (CBP)

• In captivity for species particularly endangered in the wild; special crossing schemes to retain the maximum fraction of variation • The ultimate goal is reintroduction to the natural environment • Stages of CBP: • Detecting the danger of extinction • Establishment of the captive population • Attaining „safe” population size • Management in captivity • Selection of individuals for reintroduction • Management in nature

Frankham et al. 2010

Captive breeding programs (CBP)
• Molecular markers enable the reconstruction of pedigrees and the correction of errors in studbooks: northern bald ibis (Geronticus eremita), Arabian oryx (Oryx leucoryx)
• Paternity identification in the CBP of the whooping crane (Grus americana)
• Sex identification in kaki (Himantopus novaezelandiae) in a CBP initiated from just 12 animals
• Non-invasive sex determination of the last individual of the Spix’s macaw (Cyanopsitta spixii) surviving in nature. It turned out to be a male, so a female from captivity was provided

Molecular ecology and Genetically Modified Organisms (GMO)

Potential effects of transgenic organisms on natural ecosystems can be investigated only by means of molecular techniques Transgenic organisms have genetic markers enabling quick and efficient detection of transgenes (usually real-time PCR-based detection) Real Time PCR

Stages of amplification during PCR Real Time PCR

The initial amount of template can be measured precisely if the increase of the amplified DNA is followed during the exponential phase. SYBR Green binds to double-stranded DNA Advantages of Real Time PCR

• In normal PCR product is assayed post-PCR • In RT PCR the assay is performed in real time during exponential amplification • Wide dynamic range of detection • No need of post-PCR analyses • Precise estimation of the starting amount of template, even 2-fold differences detectable Molecular ecology and Genetically Modified Organisms (GMO)

• Resistant to pests and diseases
• Fast growth
• Vaccines in food
• Plants without toxins
• Better taste and quality?
• Pesticide-free soil and water
• More than 1 million sq km of fields with GMO
• In the USA > 80% of corn, soya and cotton is GMO

Potential environmental threats of GMO

• More dangerous pests and pathogens • “Improvements” of existing pests due to hybridization with GM relatives • Dangerous for other than target species: soil organisms, insects, birds and other animals • Disturbances in ecosystem functioning through the multiple indirect and direct effects • Irreversible changes in biodiversity or genetic diversity of species Direct ecological effects of GMO

Sanvido et al. 2007 Indirect ecological effects of GMO

Sanvido et al. 2007 Molecular mapping of quantitative traits

• Many fitness-related traits are quantitative: body mass, growth rate, fecundity, aggression level etc.
• Usually multigenic
• Analysis of quantitative traits uses statistical techniques (correlation, covariance, analysis of variance) and is based on measuring correlations between relatives – knowledge of the genetic architecture of traits is not necessary
• Quantitative Trait Loci (QTL) analysis identifies genomic fragments responsible for variation in a quantitative trait

Molecular mapping of quantitative traits

• How many genomic regions affect variation of a trait? • Are some of these regions more important than others?

Barton et al. 2007 Molecular mapping of quantitative traits

• A dense genetic map, usually based on microsatellites or SNPs is essential

Barton et al. 2007 • Selection for the trait of interest Molecular mapping of quantitative traits

Wing shape in Drosophila

Barton et al. 2007

Genes underlying complex traits – QTL and candidate genes
• Freshwater and marine stickleback populations diverged < 15 kya
• In marine sticklebacks a well-developed pelvic girdle supports the ventral spines – an adaptation
• QTL analysis identified a chromosomal region explaining 13–44% of the pelvic girdle and spine variation
• This region contains the Pitx1 gene, which affects the development of the pelvis and hind limbs in the mouse
• Its role has been confirmed with in situ hybridization
• No differences in the coding region – expression level and localization differ

Genes underlying complex traits – QTL and candidate genes

Barton et al. 2007 Molecular mapping of quantitative traits

• QTL analysis has many limitations
• Loci of small effect remain undetected
• Effects of large-effect loci are overestimated
• Limited resolution, ca. 20 cM ~ 20 Mb; the recombination rate is an inherent limitation, and the identified region may contain many genes
• Looking for correlations between a trait and genetic markers in populations may be useful – association mapping

Genetic basis of adaptation

• Convincing identification of the genetic basis of adaptation is challenging • Ideally connections between a given variant (allele), genotype , phenotype and fitness should be demonstrated

Barrett & Hoekstra 2011 Detecting genes underlying adaptation

Barrett & Hoekstra 2011 Detecting genes underlying adaptation

Barrett & Hoekstra 2011 Eda in sticklebacks – adaptation from standing variation

The Eda gene encodes the protein ectodysplasin, a signal molecule involved in the development of ectodermal structures

In sticklebacks distinct Eda allelic lineages determine the presence or absence of bony plates

Colosimo et al. 2005 Eda in sticklebacks – adaptation from standing variation

Each allelic lineage is monophyletic; alleles for low plate number occur in marine populations at low frequencies

During colonization of freshwater environments their frequency increases – they do not originate de novo – adaptation from standing genetic variation

Colosimo et al. 2005 Genomics in search for adaptations

• Approaches based on QTL and candidate genes have provided useful information but have limitations
• Alternative: scan for genome-wide differentiation between populations from contrasting environments or with contrasting phenotypic traits
• Candidate genes are not necessary
• Differentiated genomic regions will often be of moderate size because of historical recombination, so they contain a limited number of candidate genes
• These regions are just candidates; experimental validation is required
Adaptation to fresh water in sticklebacks

Barrett & Hoekstra 2011
Which parts of the genome differ? Are the differences repeatable?
Sequenced and assembled genome, ca. 460 Mb
Hohenlohe et al. 2010
Adaptation to fresh water in sticklebacks

• 2 oceanic populations
• 3 freshwater populations
• 20 individuals per population
• > 40 000 RAD-tags
• Genotype assigned probabilistically at each nucleotide position
• > 45 000 SNPs
• RAD-tags mapped to the genome
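Once SNPs are mapped to the genome, differentiation can be summarized along each linkage group. The sketch below is a simplified illustration: it uses Nei's F_ST = (H_T − H_S) / H_T for a biallelic SNP in two equally weighted populations and plain non-overlapping window averages. It is not necessarily the estimator or smoothing used by Hohenlohe et al., and all positions and allele frequencies are invented.

```python
def fst_biallelic(p1, p2):
    """Nei's F_ST for one biallelic SNP, from the allele frequencies in two
    equally weighted populations: (H_T - H_S) / H_T."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # monomorphic in both populations: no differentiation
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


def windowed_mean(positions, values, window, step):
    """Average values in windows along a linkage group.

    Returns (window_start, mean) for every window containing at least one SNP.
    """
    if not positions:
        return []
    out = []
    start = 0
    last = max(positions)
    while start <= last:
        inside = [v for pos, v in zip(positions, values)
                  if start <= pos < start + window]
        if inside:
            out.append((start, sum(inside) / len(inside)))
        start += step
    return out


# Hypothetical SNP positions (bp) and allele frequencies in one oceanic and
# one freshwater population.
positions = [10_000, 60_000, 120_000, 180_000, 240_000]
oceanic = [0.50, 0.45, 0.10, 0.05, 0.55]
freshwater = [0.55, 0.50, 0.90, 0.95, 0.50]

fst = [fst_biallelic(p, q) for p, q in zip(oceanic, freshwater)]
windows = windowed_mean(positions, fst, window=100_000, step=100_000)
for start, mean_fst in windows:
    print(f"{start:>7} bp  mean F_ST = {mean_fst:.3f}")
```

Running the same scan for each oceanic–freshwater comparison and asking whether the same windows stand out is one way to address whether the differences are repeatable.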

Hohenlohe et al. 2010
Adaptation to fresh water in sticklebacks

High heterozygosity, high nucleotide diversity and low differentiation among populations indicate balancing selection on linkage group III, in a region containing many genes involved in the immune response.

Hohenlohe et al. 2010
Adaptation to fresh water in sticklebacks
Figure: FST in four comparisons (between oceanic; oceanic – freshwater 1; oceanic – freshwater 2; oceanic – freshwater 3)
Population genomics of adaptation to heavy metals
• Arabidopsis lyrata – populations from normal and serpentine soils
• Which genes are responsible for adaptation to serpentine soils?
• Small (ca. 200 Mb), sequenced genome

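Figure: sampling design with two population pairs (Pair 1, Pair 2), each comprising a serpentine and a normal population; labels "nearby" and "distant"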

Turner et al. 2010
Population genomics of adaptation to heavy metals
• 25 plants per population, equal DNA amounts pooled
• Pools sequenced on Illumina separately for each population, to an average coverage of 10× per population
• Enough coverage to detect alleles (SNPs) with large frequency differences between populations
• SNPs showing such differences in both pairs are good candidates for adaptation loci
• Ca. 80 candidate genes were found, many involved in heavy metal detoxification and calcium-magnesium transport
• Three candidate genes were also studied in Scottish populations, where they showed similar differentiation
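The logic of the pooled design can be sketched as follows: estimate allele frequencies from read counts in each pool, keep SNPs with a large serpentine–normal difference, and intersect the outliers from the two pairs. This is an illustration under stated assumptions, not the pipeline of Turner et al.; the thresholds, SNP names and read counts are hypothetical.

```python
def frequency_difference(serpentine, normal):
    """Absolute allele-frequency difference between two pools at one SNP.

    Each argument is (reads carrying the alternative allele, total reads).
    """
    alt_s, depth_s = serpentine
    alt_n, depth_n = normal
    return abs(alt_s / depth_s - alt_n / depth_n)


def outliers(pair_counts, min_diff, min_depth):
    """SNPs whose frequency difference within one population pair is at
    least min_diff, skipping sites covered by fewer than min_depth reads
    in either pool."""
    hits = set()
    for snp, (serp, norm) in pair_counts.items():
        if serp[1] < min_depth or norm[1] < min_depth:
            continue  # too few reads for a usable frequency estimate
        if frequency_difference(serp, norm) >= min_diff:
            hits.add(snp)
    return hits


# Hypothetical read counts: snp -> ((alt, depth) serpentine, (alt, depth) normal)
pair1 = {
    "snp_A": ((10, 11), (1, 10)),
    "snp_B": ((9, 10), (8, 9)),
    "snp_C": ((12, 12), (0, 9)),
    "snp_D": ((3, 4), (0, 12)),   # low depth in the serpentine pool
}
pair2 = {
    "snp_A": ((9, 10), (0, 11)),
    "snp_B": ((5, 10), (5, 10)),
    "snp_C": ((4, 10), (3, 10)),
    "snp_D": ((10, 10), (0, 10)),
}

# Thresholds are illustrative assumptions.
shared = outliers(pair1, min_diff=0.7, min_depth=8) & \
         outliers(pair2, min_diff=0.7, min_depth=8)
print(sorted(shared))
```

Requiring the same SNP to be an outlier in both pairs filters out differences that arose by drift in a single pair, which is why repeated differentiation makes a stronger candidate.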

Turner et al. 2010
Future of molecular ecology?

• Accessibility of markers for any organism
• Fewer technical limitations
• Faster laboratory analyses
• Data storage and analysis more challenging
Time devoted to project stages

Figure: project stages (experimental design, sample collection, preparation for sequencing, sequencing, data analysis) compared between the Sanger sequencing era and next generation molecular ecology