Population genomics and comparative genomics
This lecture: Population genomics Next lecture: Comparative genomics: Daniel Jeffares Email [email protected]
Updated 19/4/2020
This lecture: population genomics
• What is population genomics?
• How we gather the data.
• What we can find out from population genomics (that we can’t from population genetics).
• Concepts:
  • Genetic diversity (origin, and movement through space and time)
  • Genome-wide summary statistics: π, allele frequencies (MAF, DAF), Tajima’s D
  • Population structure
  • Purifying selection: expectations and observations
  • Adaptive selection: expectations and observations
  • Balancing selection: expectations and observations
  • Polygenic selection and genome-scale data
  • Linkage of alleles on chromosomes
• Case studies:
  • Selective sweeps in malaria parasites
  • Altitude adaptation in Tibetans
Comparative genomics
Next lecture: comparative genomics
• What is comparative genomics?
• How we gather the data.
• What we can find out from comparative genomics.
• Concepts:
  • Diversity within species gives rise to divergence between species
  • Evolutionary rates
  • Purifying selection (constraint): expectations and observations
  • Adaptive evolution: expectations and observations, tests for selection
  • Polygenic selection and genome-scale data
• Case studies:
  • Evolutionary constraint in mammalian genomes
  • The McDonald-Kreitman test and evolution in the human genome
All the articles that I mention can be found here: https://paperpile.com/shared/RYh93p
Population genomics
What is population genomics?
• Population genetics is the study of genetic variation within species.
• Population genomics extends this to the study of variation within species using whole-genome data.
• Population genomics:
  • The data are more challenging to gather, more expensive, and more challenging to analyse
  • Genome-scale data produce a more comprehensive picture of:
    • Demography (population size, migration, population structuring)
    • Natural selection (purifying, adaptive, balancing)
How we gather population genomics data
1. Hypothesis/query
2. Sample collection and DNA extraction
   a) Choose geographic/habitat region of interest
   b) Gather hundreds to thousands of individuals (strains/subjects) from within a species (sometimes tens of thousands)
   c) Extract genomic DNA from each individual
3. Genome sequencing
   a) Aim: to obtain sequence data covering the entire genome at 5x to 40x coverage
      • Coverage: how many reads per site
   b) Sequence genome using ‘short read’ technology (usually Illumina)
   c) Main issue: cost/base
[Figure: reads from one individual, aligned to a reference genome, shown at 2x coverage and at 4x coverage]
4. Read mapping and ‘variant calling’
   a) Locate genetic variants (sites in the genome that differ between individuals, polymorphisms)
      • e.g. SNPs, indels etc.
   b) By mapping (aligning) sequence reads to a reference genome, and identifying sites in the genome that differ
5. Segregating genetic variants are (usually) the final data set
   a) A list of positions that vary (alleles/polymorphisms/variants)
6. Analysis:
   a) Describing demography
   b) Detecting selection
   c) Quantitative genetics (like GWAS)
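The coverage target in step 3 can be turned into a rough sequencing budget. The sketch below assumes the usual approximation that average coverage is (number of reads × read length) / genome size. The genome size, read length and read count are hypothetical numbers chosen for illustration, not values from the lecture.

```python
import math


def expected_coverage(n_reads, read_length, genome_size):
    """Average number of reads covering each site of the genome."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: a 12 Mb genome sequenced with 150-base reads.
genome_size = 12_000_000
read_length = 150

print(expected_coverage(1_600_000, read_length, genome_size))
print(reads_needed(40, read_length, genome_size))
```

This is an average only: real coverage varies along the genome, so some sites will have fewer reads than the mean.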
[Figure: reads from one individual (some other strain/individual we are comparing to the reference genome), aligned to the reference genome. One site is highlighted where the bases G and T differ between the reference and the reads.]
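The idea behind variant calling in step 4 can be sketched as a toy 'pileup' check: at each position, compare the bases seen in the aligned reads with the reference base. This is a deliberately naive illustration, not how production variant callers work. The function name `call_site`, the thresholds `min_depth` and `min_fraction`, and the example data are all hypothetical.

```python
from collections import Counter


def call_site(ref_base, read_bases, min_depth=3, min_fraction=0.8):
    """Return the alternative base if the reads support a variant, else None."""
    depth = len(read_bases)
    if depth < min_depth:
        return None  # too few reads to trust this site
    base, count = Counter(read_bases).most_common(1)[0]
    if base != ref_base and count / depth >= min_fraction:
        return base
    return None


# Hypothetical reference and read bases, keyed by 0-based position.
reference = "ACGTGA"
pileup = {
    1: ["C", "C", "C", "C"],  # reads match the reference
    2: ["A"],                 # differs, but only 1x coverage
    4: ["T", "T", "T"],       # reference has G, reads have T
}

variants = {}
for position, bases in pileup.items():
    alt = call_site(reference[position], bases)
    if alt is not None:
        variants[position] = alt

print(variants)
```

The low-coverage site is skipped, which is one reason the coverage target matters: with too few reads a sequencing error cannot be told apart from a real variant.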
Sequencing technology now
At this point in time this is what we do:
Sanger sequencing (using ABI machines):
• Technology of choice for low- to medium-output sequencing
• E.g. checking plasmids, small-scale population surveys

Illumina:
• Technology of choice for genome re-sequencing (of populations)
• Widely used for small de novo genome assemblies, RNA-seq (sequencing transcriptomes), ChIP-seq, 3C, metagenomics

Pacific Biosciences (PacBio):
• One of the technologies of choice for genome assemblies
• Produces fairly long, fairly accurate reads

Oxford Nanopore:
• One of the technologies of choice for genome assemblies
• Produces the longest reads
• Has the worst error rate
• Is developing very fast
Sequencing technology develops rapidly. Next year this slide may be different.
The falling cost of sequencing has transformed evolutionary biology.

What we can find out from population genomics (that we can’t find out from population genetics)
• Demography
  • Estimates of population size
  • Estimates of population size through time (backwards only!)
  • Population structure (which individuals are more/less closely related)
  • Migration and ‘gene flow’ between populations (also past migration)
  • Inbreeding/outbreeding rates
• Selection
  • Which regions of the genome are subject to strong purifying (negative) selection
    • Functional analysis of genomes
  • Recent/ongoing events of adaptation (positive selection)
• Quantitative genetics
  • GWAS: which alleles contribute to traits?
Demography, selection and quantitative genetics are intimately related.

Concepts in population genomics
Concept: genetic diversity
• Polymorphisms/alleles/variants: sites in a genome that differ between individuals of a species
  • Single nucleotide polymorphisms (SNPs)
  • Small insertion/deletions (indels)
  • Transposon insertions
  • ‘Structural’ variants: duplications, rearrangements, large insertions/deletions
• Initial origin: a mutation in one individual
  • All polymorphisms begin their existence in just one individual
• Polymorphisms then move through space and time, within the population
• Their frequency in the population will change (from rare to common)
[Figure: a genomic site/gene from many individuals within a species (individuals 1 to 6), with example sequences ATCCCG-TAAATTTT, AGCCCG-TAAATTTT and AGCCCGTTAAAGTTTT]
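As a small illustration of allele frequency, the sketch below counts the alleles seen at one site across a sample of individuals. The six bases are hypothetical, and each individual is assumed to contribute one (haploid) genome. A new mutation carried by a single individual starts at frequency 1/n in a sample of n genomes.

```python
from collections import Counter


def allele_frequencies(site_bases):
    """Map each allele at one site to its frequency in the sample."""
    counts = Counter(site_bases)
    total = len(site_bases)
    return {allele: count / total for allele, count in counts.items()}


# Hypothetical site in six individuals: one carries a new T mutation.
site = ["T", "G", "G", "G", "G", "G"]
frequencies = allele_frequencies(site)
print(frequencies)

# Several generations later the same allele might be common (again hypothetical).
later = ["T", "T", "T", "T", "G", "T"]
print(allele_frequencies(later))
```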
[Figure, built up over several slides: a genomic site/gene from many individuals within a species, sampled from one population and from another population]
• Most variants are rare
• Variants tend to remain within their original population
• Physically linked variants tend to travel together
Many signals can be found within polymorphism data. The patterns of polymorphism data are complex. Hence: summary statistics.

Concept: genome-wide summary statistics
• Average pairwise difference (‘diversity’, π):
  • If we compare every sequence to every other, what is the average number of differences?
• The number of segregating sites (S)
• Allele frequencies:
  • Minor allele frequency (MAF)
  • Derived allele frequency (DAF)
• Each of these statistics can be described as:
  • A histogram (a distribution)
  • Plots along chromosomes
[Figure: a genomic site/gene from many individuals within a species, with the ancestral state marked]
ATCCCG-TAAATTTT
AGCCCG-TAAAGTTT
AGCCCGTTAAATTTT
AGCCCGTTAAATTTT
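The statistics above can be computed by hand for the four aligned sequences on this slide. The sketch below is one simple way to do it, under two assumptions that are mine rather than the lecture's: alignment columns containing a gap are skipped (so only SNPs are counted), and the minor allele frequency is taken as the frequency of the least common allele at a site. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

alignment = [
    "ATCCCG-TAAATTTT",
    "AGCCCG-TAAAGTTT",
    "AGCCCGTTAAATTTT",
    "AGCCCGTTAAATTTT",
]


def ungapped_columns(seqs):
    """Alignment columns (one tuple per site) that contain no gap."""
    return [col for col in zip(*seqs) if "-" not in col]


def segregating_columns(seqs):
    """Ungapped columns where at least two different bases are present."""
    return [col for col in ungapped_columns(seqs) if len(set(col)) > 1]


def pairwise_diversity(seqs):
    """pi: average number of differences over all pairs of sequences."""
    pairs = list(combinations(range(len(seqs)), 2))
    differences = sum(
        col[i] != col[j] for col in ungapped_columns(seqs) for i, j in pairs
    )
    return differences / len(pairs)


def minor_allele_frequency(col):
    """Frequency of the least common allele in one column."""
    return min(Counter(col).values()) / len(col)


S = len(segregating_columns(alignment))
pi = pairwise_diversity(alignment)
mafs = [minor_allele_frequency(col) for col in segregating_columns(alignment)]
print(S, pi, pi / len(ungapped_columns(alignment)), mafs)
```

Dividing π by the number of sites compared gives the per-site value that is usually reported. Derived allele frequency would additionally need the ancestral base at each site, which this sketch does not attempt to infer.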
See:
Nucleotide diversity (π) on Wikipedia: https://en.wikipedia.org/wiki/Nucleotide_diversity
Watterson estimator of θ on Wikipedia: https://en.wikipedia.org/wiki/Watterson_estimator

Concept: genome-wide summary statistics: Tajima’s D
Tajima’s D uses summary statistics to detect selection (or demography)
• S is the number of segregating sites; n is the number of sequences compared
• θ is estimated from the number of segregating sites: $\theta = S \Big/ \sum_{i=1}^{n-1} \frac{1}{i}$
  • For six sequences the sum is $1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \frac{1}{5}$
• At a neutral site, $\pi = \theta$
• Fumio Tajima worked out that $\pi \neq \theta$ in certain circumstances
• Tajima’s D is approximately $\pi - \theta$
• When D is negative: more rare alleles than expected
  • Selective sweep (or expanding population)
• When D is positive: too few rare alleles
  • Balancing selection, shrinking population or population structure
• In theory, $\theta = \pi = 4N\mu$, where N = population size and μ = mutation rate
  • So π can give us an estimate of population size.
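The slide gives D as approximately π − θ. The published statistic (Tajima 1989) also divides that difference by an estimate of its standard deviation, so that values are comparable between data sets. The sketch below follows that standard formula. The function names are hypothetical, and the example inputs (π = 1, S = 2, n = 4) are illustrative values rather than results from the lecture.

```python
import math


def watterson_theta(S, n):
    """theta estimated from S segregating sites among n sequences."""
    a1 = sum(1 / i for i in range(1, n))
    return S / a1


def tajimas_d(pi, S, n):
    """Tajima's D: (pi - theta) scaled by its estimated standard deviation."""
    if n < 2:
        raise ValueError("need at least two sequences")
    if S == 0:
        return math.nan  # D is undefined without segregating sites
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))


# Illustrative inputs: pi = 1 difference per pair, 2 segregating sites, 4 sequences.
theta = watterson_theta(2, 4)
d = tajimas_d(1.0, 2, 4)
print(theta, d)
```

Here π is slightly smaller than θ, so D is negative: the segregating sites are carried by rare alleles. With only four sequences a value this close to zero would not be evidence of anything. The example is only meant to show the arithmetic.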
Tajima’s D on Wikipedia: https://en.wikipedia.org/wiki/Tajima%27s_D
Excellent explanation of π, θ and Tajima’s D on YouTube: https://www.youtube.com/watch?v=wiyay4YMq2A
[Figure: allele patterns in one population and in another population for each case below]
• Tajima’s D = 0: neutrally evolving, stable population
• When D is negative: more rare alleles than expected
  • selective sweep
  • or expanding population
• When D is positive: more common alleles than expected
  • balancing selection
  • shrinking population
  • population structure
Concept: Population structure
Leslie 2015
Variants tend to remain within their original population
[Figure: variants in one population and in another population]
• When people, animals, plants or microbes move about, they carry their DNA with them.
• Every individual carries its history within its DNA.
• Because we know the ‘rules’ (mutation, drift, recombination), population movements can be modelled.
• Because populations can have small contributions from one or more other populations, their relationships can’t always be drawn on a phylogenetic tree.
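One very simple way to see that variants tend to stay within their original population is to count which variants are private to one population and which are shared. The sketch below does only that. It is not one of the model-based methods used in published studies. The variant IDs and individuals are hypothetical.

```python
def variant_sharing(pop_a, pop_b):
    """Split variants into private-to-A, private-to-B and shared.

    Each population is a list of sets, one set of variant IDs per individual.
    """
    seen_a = set().union(*pop_a)
    seen_b = set().union(*pop_b)
    return {
        "private_a": seen_a - seen_b,
        "private_b": seen_b - seen_a,
        "shared": seen_a & seen_b,
    }


# Hypothetical variant IDs carried by three individuals per population.
population_a = [{"v1", "v2"}, {"v1", "v3"}, {"v1"}]
population_b = [{"v4"}, {"v1", "v4", "v5"}, {"v5"}]

sharing = variant_sharing(population_a, population_b)
for label, variant_ids in sharing.items():
    print(label, sorted(variant_ids))
```

In this toy data most variants are private, and the single shared variant could reflect either common ancestry or a small contribution from one population to the other, which is why such patterns do not always fit on a tree.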
Concept: Purifying selection – expectations and observations
Purifying selection is the loss of deleterious (harmful) variants. This process will:
• Reduce diversity in regions that are important
• Increase the proportion of rare alleles

[Figure panels: (a) diversity (× 10−2), recombination rate (LDU/Mb) and transposon insertions (number of strains with insertion), each plotted against chromosome position (Mb) for Chr. 1, Chr. 2 and Chr. 3; panels (b) and (c), with the annotation r = –0.50, P < 1 × 10−16]

Figure 3 Relationships between genetic diversity and genome function. (a) Main features of diversity in the genome, with chromosome scale on the x axis and the mitochondrial genome on the right edge. Top, diversity (Watterson’s θ) calculated using SNPs. Middle, recombination rate (scale, LDU/Mb × 10−3 above the x axis and log(1 + LDU/Mb) below the x axis). The six major recombination hotspots are indicated with red dots. Bottom, sites of Tf family LTR insertion (insertions present in all strains are shown in light blue) in the group of 57 non-clonal strains. (b) Diversity described by genome annotation. Distribution of Watterson’s θ values for each centile of the genome, using only sites annotated as exons (EXO), 5′ and 3′ UTRs (5UT and 3UT), introns (INT), lncRNAs (RNA), unannotated regions (NIL), LTRs of Tf2 family transposons (LTR), and onefold-degenerate (1FD) and fourfold-degenerate (4FD) sites in exons. Protein-coding categories have red borders. The horizontal red lines correspond to the median and interquartile range for fourfold-degenerate sites; annotation classes with diversity significantly lower than the diversity for this proxy for neutral sites are shaded gray. One-sided paired Mann-Whitney U test P values in comparison to the fourfold-degenerate sites were as follows: exons, UTRs and onefold-degenerate sites, P < 2 × 10−16; introns, P = 1 × 10−6; lncRNAs, unannotated regions and LTRs, P > 0.05 (whiskers define the