ADAPTIVE AND NEUTRAL EVOLUTIONARY INSIGHTS FROM STATISTICAL ANALYSES OF POPULATION GENETIC DATA

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF BIOLOGY AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Julie Marie Granka March 2013

© 2013 by Julie Marie Granka. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/md970sh0103

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Marcus Feldman, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Hunter Fraser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Dmitri Petrov

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Hua Tang

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

Organismic evolution involves both selective and neutral forces, although their relative contributions are often unknown. This thesis proposes novel statistical methods for analyzing genetic data from a variety of organisms, including yeast, Mycobacterium tuberculosis, and humans. The chapters of this thesis provide complementary perspectives on the relative roles of selection and neutrality, from the molecular to the population level, and present various statistical tools for genetic data analysis. Chapter 2 proposes a maximum-likelihood-based method with which to classify and identify interactions, or epistasis, between pairs of genes. Chapter 3 details a study of genetic data from Mycobacterium tuberculosis isolated from Aboriginal Canadian communities; our analyses suggest that the bacterium spread to these communities via the Canadian fur trade in the 18th and 19th centuries. Chapter 4 discusses the detection of signatures of natural selection in the genomes of 12 diverse African human populations, and proposes novel considerations for identifying biological functions under selection and for comparing signals of selection between populations. Finally, Chapter 5 details the inference of the genetic basis and evolutionary history of light skin pigmentation and short stature in the genetically diverse ≠Khomani Bushmen of the Kalahari Desert of South Africa, believed to be one of the world’s most ancient human populations. These chapters emphasize that a more complete understanding of the evolutionary history of humans and other organisms requires not only the consideration of neutral and selective processes, but also both phenotypic and genetic information. The statistical methods and approaches presented in the following chapters have the potential to improve inferences of natural selection and demography from genetic data, as well as provide insight into the relative roles of both.

Acknowledgements

So many people have been tremendously supportive during my years as a graduate student at Stanford. I would like to extend my gratitude to them.

First, I would like to thank my advisor Dr. Marc Feldman. Marc has been a constant source of support, encouragement, and knowledge. I have learned so much from Marc about conducting research, formulating scientific questions, and drawing meaningful conclusions. I thank Marc for giving me the opportunity to be involved in so many exciting and unique research projects, and I truly appreciate the freedom he allowed me in pursuing my own intellectual interests.

My committee members Dr. Hua Tang, Dr. Dmitri Petrov, and Dr. Hunter Fraser have been inspirations to me during the course of my program, and I thank them for their ongoing support and feedback. A special thanks to Dr. Carlos Bustamante for introducing me to population genetics many years ago and for his continued guidance. I would also like to thank Dr. Daniel Fisher, the external chair on my thesis committee.

Dr. Brenna Henn has been an incredible mentor to me over the past several years. I have always been able to turn to Brenna for advice – receiving in return not only answers and assistance, but motivation to tackle even greater issues. Brenna has taught me a great deal, and I can’t thank her enough.

My sincere thanks to Dr. Caitlin Pepperell, whose enthusiasm and curiosity were contagious. I learned a great deal from Caitlin about the process of scientific inquiry, as well as the value of interdisciplinary collaboration. I would also like to thank Dr. Hong Gao, who worked with me on my first project as a graduate student.

Many thanks to my collaborators Dr. Jeff Kidd and Chris Gignoux for their helpful advice and suggestions, as well as Dr. Mike Palmer for his support in running jobs on our computer cluster. I would also like to thank my collaborators from the University of Stellenbosch in South Africa; in particular, Dr. Eileen Hoal, Dr. Marlo Möller and Dr. Cedric Werely for working with me in the field. Much of my work would not be possible without the support of the South African San Institute (SASI), our kind, patient and knowledgeable interpreters, and the ≠Khomani San community for their support and interest in our research.

The administrative staff of the Biology Department, including Dan King, Valerie Kiszka, Claudia Ortega, Matt Pinheiro, and Monica Bernal, has been extremely helpful in guiding me through the Ph.D. program. I would especially like to thank both Larry Bond and Jim Collins for their help and assistance while I have been at Stanford.

Not least, I extend a huge thank you to all of my labmates, who are among my closest friends. Graduate school would have been far less enjoyable without our hallway and doorway discussions and their inventive ice creams and frozen yogurts. In particular, I want to thank past and current Feldman Lab members Michal Arbilly, Lily Blair, Elhanan Borenstein, Amanda Casto, Oana Carja, Nicole Creanza, Laurel Fogarty, Robert Furrow, my officemate Philip Greenspoon, Edith Katsnelson, my former officemate Laurent Lehmann, Misha Lipatov, Mike McLaren, Ricardo Pinho, Allison Rhines, Deb Rogers, Marcel Salathé, Jeremy van Cleve and Daniel Weissman. I would also like to thank Nandita Garud, Philipp Messer, Leithen M’Gonigle, Dane Klinger, and the amazing friends who have supported me prior to and during graduate school.

Finally, I extend a huge thank you to my family. My parents, Rachele and Edward Granka, have been nothing but encouraging and supportive, and I thank them for their infinite caring, advice, and understanding. A thank you isn’t enough for my sister Laura Granka and her endless sage advice, humor, motivation, and friendship. I also thank my Uncle Lu for his thoughtful support, as well as the rest of my family: uncles, aunts, cousins, and loving grandparents.

Contents

Abstract
Acknowledgements

1 Introduction

2 On the Classification of Epistatic Interactions
  2.1 Introduction
  2.2 Methods
    2.2.1 Epistatic Subtype Selection
    2.2.2 Generation of Simulated Data
    2.2.3 Analysis of Experimental Data
  2.3 Application to Simulated Data
    2.3.1 Estimation of Epistatic Measures
    2.3.2 Bias and Variance of the Epistatic Parameter (ε)
  2.4 Application to Experimental Data
    2.4.1 St Onge et al. [255] Dataset
    2.4.2 Jasnos and Korona [131] Dataset
  2.5 Discussion
  2.6 Acknowledgements
  2.7 Figures and Tables

3 Dispersal of Mycobacterium tuberculosis via the Canadian fur trade
  3.1 Introduction
  3.2 Results
  3.3 Discussion
  3.4 Materials and Methods
    3.4.1 Population Descriptions
    3.4.2 Genotyping Methods
    3.4.3 Mutation Rate Estimate
    3.4.4 Statistical Calculations
    3.4.5 Demographic Modeling
  3.5 Acknowledgments
  3.6 Figures and Tables

4 Limited Evidence for Classic Selective Sweeps in African Populations
  4.1 Introduction
  4.2 Methods
    4.2.1 Samples
    4.2.2 Single SNP Statistics
    4.2.3 Haplotype Statistics
    4.2.4 Coalescent Simulations
    4.2.5 Annotation Enrichment for Haplotype Statistics
  4.3 Results
    4.3.1 Single SNP Statistics
    4.3.2 Haplotype Statistics
    4.3.3 Enrichment of Biological Functions
  4.4 Discussion
    4.4.1 Population Differentiation
    4.4.2 Comparison of Empirical Tail Windows
    4.4.3 Enrichment of Biological Functions
    4.4.4 Complicating Factors for Studies of Selection
  4.5 Data Access
  4.6 Acknowledgments
  4.7 Figures and Tables

5 Evolution of Height and Skin Pigmentation in the ≠Khomani Bushmen
  5.1 Introduction
  5.2 Methods
    5.2.1 Samples
    5.2.2 Relatedness
    5.2.3 Ancestry Estimation
    5.2.4 Statistical Analyses
  5.3 Results
    5.3.1 Ancestry
    5.3.2 Phenotypes
  5.4 Discussion
    5.4.1 Ancestry Estimation
    5.4.2 Height
    5.4.3 Skin Pigmentation
    5.4.4 Conclusions
  5.5 Acknowledgments
  5.6 Figures and Tables

A Supplemental Materials: On the Classification of Genetic Interactions
  A.1 Supplemental Methods
    A.1.1 Derivation of MLEs for Multiplicative and Minimum Epistasis
    A.1.2 Simulation of Double Mutant Fitness Values
    A.1.3 Details of Analysis of Experimental Data
  A.2 Supplemental Results
    A.2.1 Additional Analyses of Simulated Data
    A.2.2 Additional Analyses of Experimental Data
    A.2.3 Consideration of Alternative Model Selection Procedures
  A.3 Supplemental Figures and Tables

B Supplemental Materials: Dispersal of Mycobacterium tuberculosis via the Canadian fur trade
  B.1 Supplemental Background Material
    B.1.1 A Brief History of the Canadian Fur Trade, 1640-1870
    B.1.2 Western Canadian Indigenous Populations in the Industrial Era, 1870-present
  B.2 Supplemental Methods
    B.2.1 Aboriginal Population Descriptions
    B.2.2 Real-time PCR Determination of Bacterial Lineage
    B.2.3 Mutation Rate Estimation
    B.2.4 Statistical Calculations
    B.2.5 Rejection Sampling Details
  B.3 Supplemental Results
    B.3.1 Genetic Timing Estimates
    B.3.2 Patterns of Genetic Differentiation
    B.3.3 Estimates of Population Diversity
    B.3.4 Rejection Sampling Validation Results
    B.3.5 Results of Rejection Sampling Procedures
    B.3.6 Historical Estimates of Migration between Quebec and the Northwest, 1710-1870
  B.4 Supplemental Figures and Tables

C Supplemental Materials: Limited Evidence for Classic Selective Sweeps in African Populations
  C.1 Supplemental Methods
    C.1.1 Samples
    C.1.2 Population Differentiation
    C.1.3 Choice of Several Haplotype Statistic Parameters
    C.1.4 Comparison of Empirical Tail Windows
    C.1.5 Neutral Coalescent Simulations
    C.1.6 Selective Sweep Coalescent Simulations
    C.1.7 Gene Ontology Enrichment
  C.2 Supplemental Results
    C.2.1 Population Differentiation
    C.2.2 Haplotype Statistics
    C.2.3 Gene Ontology Enrichments
  C.3 Supplemental Figures and Tables

Bibliography

List of Tables

2.1 Fractions of simulations that recover the true model (null or epistatic) among the six models (three null models and three epistatic models) with different values of the epistasis coefficient, fitness standard deviation, and mean fitnesses of single mutants
2.2 Summary of gene pairs with the indicated epistatic subtypes, inferred using the FDR procedure with the BIC method
2.3 Comparison of validation measures for each dataset for different variations of the FDR and BIC procedure, considering only a subset of epistatic subtypes with their corresponding null models

3.1 Within- and between-lineage AMOVA based on 12 minisatellite loci for M. tuberculosis from Quebec, Ontario, Saskatchewan, and Alberta
3.2 Genetic timing estimates for DS6Quebec M. tuberculosis lineage in calendar years

4.1 Significantly enriched Gene Ontology (GO) terms in the top empirical 0.1% of windows in each population
4.2 Significantly enriched GO terms in the top empirical 0.1% of windows in each population, continued

5.1 Pearson correlations of self-reported measures of ancestry with ancestries inferred for each individual
5.2 Means (and standard deviations) of measured phenotypes for all individuals and the subset of genotyped individuals
5.3 Estimates of heritability (h²) of height, skin pigmentation, and tanning (wrist minus inner arm pigmentation) using various methods
5.4 Coefficients for regressions of height and pigmentation on ancestries
5.5 Top SNPs associated with height from a genome-wide association study using EMMAX

A.1 Fractions of simulations that recover the true model among the six models when allowing both beneficial and deleterious mutations
A.2 Fractions of simulations that recover the true model among the six models, with different values of the epistasis coefficient and mean fitnesses of single mutants; number of replicates is equal to 10
A.3 Fractions of simulations that recover the true model among the six models with different values of the epistasis coefficient and mean fitnesses of single mutants; number of replicates is equal to 2
A.4 Summary of gene pairs with the indicated epistatic subtypes, inferred using the FDR procedure with the BIC method which considers all three epistatic subtypes and their corresponding null models (Study S, fitnesses measured in the absence of MMS)
A.5 Comparison of validation measures for each dataset for different variations of the FDR and BIC procedure, considering only a subset of epistatic subtypes with their corresponding null models (Study S, fitnesses measured in the absence of MMS)
A.6 Comparison of validation measures for each dataset for different variations of the FDR and BIC procedure, considering only a subset of epistatic subtypes with their corresponding null models; significance assessed by permutation

B.1 Frequencies of M. tuberculosis lineages in four populations
B.2 Fur trade migrants, Quebec to the Northwest, 1710-1870
B.3 Estimated mutation rates for M. tuberculosis minisatellite loci, per locus and per transmission generation

C.1 Populations analyzed
C.2 Mean FST values between pairs of populations
C.3 Mantel correlations between matrices of the indicated variables for each haplotype statistic
C.4 Significantly enriched Gene Ontology (GO) terms in the top empirical 1% and 5% of windows in each population

List of Figures

2.1 Distribution of the epistasis values (ε) for significant epistatic pairs in the Jasnos and Korona [131] dataset

3.1 Fur trade geography, regional M. tuberculosis lineage frequencies and historical timeline
3.2 Minimum spanning tree of minisatellite haplotypes (12 loci) from M. tuberculosis within the DS6Quebec lineage, in QU, ON, SK, AB
3.3 Genetic diversity of M. tuberculosis populations

4.1 Permutation procedure used to assess significance of enrichment of Gene Ontology (GO) terms in the top x% of 100 kb genomic windows
4.2 Enrichment of genic and non-genic SNPs for values of FST and for average values of derived allele frequency difference (|δ|) between pairs of populations
4.3 Extreme |δ| values versus mean FST for pairs of populations
4.4 P-values for the 5 most extreme 100 kb genomic windows according to the XP-EHH MKK statistic in each population
4.5 Number of shared 100 kb genomic windows in the top empirical 1% for iHS vs. mean FST for pairs of populations
4.6 Number of shared 100 kb genomic windows in the top empirical 1% for XP-EHH vs. mean FST for pairs of populations

5.1 Inferred pedigrees for a subset of related sampled individuals
5.2 Individual ancestry proportions for the indicated number of ancestral populations inferred for phenotyped individuals using ADMIXTURE
5.3 Cross-validation errors and log likelihoods for models estimated using ADMIXTURE for various numbers of ancestral populations
5.4 Ancestry proportions versus age of phenotyped ≠Khomani Bushmen individuals
5.5 Violin plots of height and skin pigmentation measurements for genotyped individuals
5.6 Results from a linear mixed model regression of height on sex and age
5.7 Results from a linear mixed model regression of tanning (wrist minus inner arm pigmentation) on sex and age
5.8 Results from a linear mixed model regression of height on sex, age, and proportion European ancestry
5.9 Results from linear mixed model regressions of inner arm skin pigmentation on proportion ancestry, for various ancestries
5.10 Genetic relationships for pairs of individuals inferred from EMMAX using the Balding-Nichols method
5.11 QQ plots of −log10 SNP p-values for genome-wide association studies of height and inner arm pigmentation using EMMAX
5.12 Manhattan plots of −log10 SNP p-values for height association using EMMAX
5.13 Effect sizes by genotype and allele frequencies in worldwide HGDP populations for top height SNPs

A.1 Power of our method to uncover the true (simulated) epistatic model
A.2 Power of our method to uncover the true (simulated) epistatic model assuming ε_ij = 0.3, σ_ij = 0.1, and μ_i = μ_j = 0.8
A.3 The distribution of the mean squared error (MSE) of the MLE of the epistatic parameter under the additive, multiplicative, and minimum epistatic models as well as the corresponding null model
A.4 Venn diagram comparing the overlap of epistatic pairs found examining the St Onge et al. [255] dataset using the authors’ original method and our BIC method
A.5 Distribution of the epistasis values (ε) for significant epistatic pairs, determined using the FDR procedure with the BIC method for all gene pairs of the St Onge et al. [255] dataset in the presence of MMS
A.6 Several epistatic networks constructed based on the St Onge et al. [255] dataset
A.7 Several epistatic networks constructed based on the Jasnos and Korona [131] dataset
A.8 Quantile-Quantile plot of the single-mutant fitness values for several genes analyzed in the St Onge et al. [255] dataset
A.9 Quantile-Quantile plot of the single-mutant fitness values for several genes analyzed in the Jasnos and Korona [131] dataset

B.1 M. tb. minimum spanning trees based on minisatellite data and lineage typing
B.2 Diversity calculations obtained from minisatellite data
B.3 Distribution of minisatellite haplotype diversities obtained from the permutation procedure under a random distribution of isolates to populations
B.4 Results from rejection sampling procedure in RS
B.5 Validation results for rejection sampling procedure

C.1 Enrichment of genic and non-genic SNPs for values of |δ| between pairs of populations
C.2 Enrichment of genic and non-genic SNPs for values of δ between selected pairs of populations
C.3 Enrichment of genic and non-genic SNPs for values of δ between the HGDP Yoruba and HapMap Tuscans
C.4 Extreme values of SNP differentiation over all SNPs vs. mean FST for all pairs of populations
C.5 P-values for the 5 most extreme 100 kb genomic windows according to the iHS statistic in each population
C.6 P-values for the 5 most extreme 100 kb genomic windows according to the XP-EHH CEU statistic in each population
C.7 P-values for the 5 most extreme 100 kb genomic windows according to the XP-EHH YRI statistic in each population
C.8 P-values for the 5 most extreme 100 kb genomic windows according to the XP-EHH KHB statistic in each population
C.9 Number of shared 100 kb genomic windows in the top empirical 0.1% vs. mean FST for pairs of populations
C.10 Number of shared 100 kb genomic windows in the top empirical 1% vs. correlations of window statistics over all genomic windows for pairs of populations
C.11 Correlations of window statistics over all 100 kb genomic windows vs. average Bantu ancestry for pairs of populations
C.12 Demographic model simulated using msms
C.13 Number of overlapping 100 kb windows in the empirical tails of XP-EHH for pairs of populations simulated under neutrality
C.14 Number of overlapping 100 kb windows in the empirical tails of iHS for pairs of populations simulated under neutrality
C.15 Number of overlapping 100 kb windows and true positive rate in the empirical tails of XP-EHH for pairs of populations simulated under recent selection
C.16 Number of overlapping 100 kb windows and true positive rate in the empirical tails of iHS for pairs of populations simulated under recent selection
C.17 Number of overlapping 100 kb windows and true positive rate in the empirical tails of XP-EHH for pairs of populations simulated under less recent selection
C.18 Number of overlapping 100 kb windows and true positive rate in the empirical tails of iHS for pairs of populations simulated under less recent selection

1 Introduction

Organismic evolution involves both selective and neutral forces, although their relative contributions are often unknown. Accurate interpretation of genetic data requires a consideration of the two, their interaction, and their underlying causes. In this thesis, I use novel statistical methods to analyze genetic data from a variety of organisms, including yeast, Mycobacterium tuberculosis, and humans. The chapters of this thesis provide complementary perspectives on the relative roles of selection and neutrality, from the molecular to the population level, and present various statistical tools for genetic data analysis.

Natural selection acts on phenotypes, which are shaped by complex interactions among genes and between genes and the environment. Understanding how these interactions affect phenotypes can improve our understanding of selection. Chapter 2 of this thesis, On the Classification of Genetic Interactions, proposes a new method with which to classify and identify interactions, or epistasis, between pairs of genes. Given a set of single and double mutant phenotypes, various mathematical definitions of epistasis have been used in population genetics. We suggest that various models of genetic interaction can be considered simultaneously in a genetic system, and develop a method to classify genetic interactions using a likelihood framework and model selection criteria. Our method has high accuracy in identifying genetic interactions of various types, and when applied to experimental data collected in yeast (Saccharomyces cerevisiae), identifies novel genetic interactions. Our method provides an alternative approach for studying epistasis, providing another view of how selection may act on phenotypes.

Population-level studies, incorporating genetic data from a large number of members of a species, can provide a wealth of information about that species’ evolutionary history. While understanding natural selection is often an aim of such studies, demographic analyses, especially in the case of pathogens, can be equally enlightening. In Chapter 3, Dispersal of Mycobacterium tuberculosis via the Canadian fur trade, we study genetic data from Mycobacterium tuberculosis, the obligate pathogen that causes human tuberculosis (TB). With a goal of understanding its spread, we examine M. tuberculosis isolated from several contemporaneous Aboriginal Canadian populations which experienced severe TB epidemics in the late 19th century. We use a variety of computational approaches to compare genetic samples of M. tuberculosis between populations and develop an Approximate Bayesian Computation procedure to estimate parameters of a demographic expansion of M. tuberculosis coincident with epidemics. In so doing, we find support for spread of M. tuberculosis to indigenous communities via the fur trade in the 18th and 19th centuries. In our analysis of M. tuberculosis migration, we are able to shed light on the dynamics of pathogen spread and the effect of a pathogen’s evolutionary history on its host.

Along with demographic forces, natural selection also leaves its signature on the genetics of populations. Identifying putative signatures of selection using computational methods applied to genome-wide data can generate hypotheses about a population’s underlying selective pressures. In Chapter 4, Limited Evidence for Classic Selective Sweeps in African Populations, we search for signatures of selection in the genomes of 12 diverse African human populations. First, to interpret the genome-wide scans, we compare genomic regions identified to be under selection between populations. Using novel statistical approaches, we find evidence for false positive signals of selection, the presence of which may bias inter-population comparisons if demographic history is not thoroughly considered. In Chapter 4, we also develop a novel permutation approach to identify biological functions that are significantly associated with regions of the genome under selection. Improving upon previous methods, we account for a number of genomic factors that may have biased the conclusions of previous analyses. While we identify several interesting signals of selection in African populations, our analyses underscore the importance of accurately specifying null models and neutral evolutionary history when searching for signatures of selection.

Finally, together with genetic data, data on phenotypic variation can provide insight into the molecular basis and evolutionary history of complex phenotypes. In Chapter 5, Evolution of Height and Skin Pigmentation in the ≠Khomani Bushmen, we study a phenotypically and genetically diverse KhoeSan population of the Kalahari Desert of South Africa, believed to be one of the world’s most ancient human populations. Using phenotypic data from over 100 individuals and corresponding genetic data from almost 90 individuals, we perform statistical analyses to characterize the genetic basis of their light skin pigmentation and short stature. We find that continental ancestry strongly affects skin pigmentation variation, and in a genome-wide association study, identify several loci of large effect that contribute to variation in stature in the ≠Khomani Bushmen. While we do not arrive at definitive conclusions regarding the evolutionary history of human pigmentation and height, our study demonstrates the utility of studying phenotypic and genetic data from diverse endogamous human populations.

In conclusion, these studies emphasize that a more complete understanding of the evolutionary history of humans and other organisms requires not only the consideration of neutral and selective processes, but also both phenotypic and genetic information. The statistical methods and approaches presented in the following chapters have the potential to improve inferences of natural selection and demography from population genetic data, as well as provide insight into the relative roles of both.

Development of the likelihood framework for identifying and classifying epistatic interactions in Chapter 2 was performed by both myself and Dr. Hong Gao, my collaborator on the project. Dr. Gao coded and implemented the method, and carried out the simulations and associated analyses to test the method’s performance. I applied our method to previously-published datasets of yeast phenotypes, validated the method in these datasets, and carried out subsequent interpretations of analyzed data. Dr. Gao and I co-wrote and revised the manuscript. Each of us wrote the sections of the manuscript relevant to our work, and with input from Dr. Marc Feldman I primarily wrote the Introduction and Discussion.

All analyses and interpretations of Chapter 3 were performed by both myself and Dr. Caitlin Pepperell, my collaborator on the project. Data were collected by Dr. Pepperell and other collaborators. I performed a majority of the statistical analyses of M. tuberculosis genetic data: performing simulations to calibrate a mutation rate, calculating all genetic timing estimates, calculating measures of diversity, implementing and analyzing the Approximate Bayesian Computation procedure for demographic modeling, and calculating genetic estimates of migration. Other statistical analyses of genetic networks, Analysis of Molecular Variance (AMOVA), the map of Figure 3.1, and all historical analyses were performed by Dr. Pepperell. I wrote the sections of the Methods, Supplemental Methods, Results, and Supplemental Results that were relevant to my own work. While both Dr. Pepperell and I interpreted all results, Dr. Pepperell wrote a majority of the main manuscript, with substantial input from me.

In Chapter 4, I performed almost all statistical analyses, with input from Dr. Brenna Henn and other co-authors. Preliminary calculations of XP-EHH were performed by Chris Gignoux, and iHS calculations were performed by Dr. Jeffrey Kidd. Dr. Kidd also provided several scripts and analysis tools, including those for making summary plots of genomic regions under selection colored by p-values and annotated by their genes. Dr. Henn provided assistance regarding interpretation of results. I wrote the manuscript with input from all co-authors; Dr. Henn and Dr. Feldman also provided extensive edits.

In Chapter 5, the analyzed data were collected by a number of individuals over a number of years. Samples were collected by Dr. Brenna Henn and Dr. Gignoux in 2006; additional samples and phenotypic measurements were collected by me (2010 and 2011), Dr. Henn (2010), Dr. Marlo Möller (2011), and Dr. Cedric J. Werely (2011) of the University of Stellenbosch. Interviews for age verification on a subset of individuals were conducted by Justin Myrick (2012). DNA extraction was performed by individuals in the group of Dr. Eileen Hoal. I performed all analyses, with the exception of identity-by-descent analyses for estimation of relatedness (performed by Dr. Henn). I wrote the entire chapter, with input from Dr. Henn and Dr. Feldman.

2 On the Classification of Epistatic Interactions

Hong Gao*†, Julie M. Granka§†, Marcus W. Feldman§

* Department of Genetics, Stanford University, Stanford, CA, USA
§ Department of Biology, Stanford University, Stanford, CA, USA
† Equal contributions to this work.

Genetics. 2010 Mar; 184(3):827-37.

Abstract

Modern genome-wide association studies are characterized by the problem of “missing heritability.” Epistasis, or genetic interaction, has been suggested as a possible explanation for the relatively small contribution of single significant associations to the fraction of variance explained. Of particular concern to investigators of genetic interactions is how to best represent and define epistasis. Previous studies have found that the use of different quantitative definitions for genetic interaction can lead to different conclusions when constructing genetic interaction networks and when addressing evolutionary questions. We suggest that instead, multiple representations of epistasis, or epistatic “subtypes,” may be valid within a given system. Selecting among these epistatic subtypes may provide additional insight into the biological and functional relationships among pairs of genes. In this study, we propose maximum likelihood and model selection methods in a hypothesis-testing framework to choose epistatic subtypes which best represent functional relationships for pairs of genes based on fitness data from both single and double mutants in haploid systems. We gauge the performance of our method with extensive simulations under various interaction scenarios. Our approach performs reasonably well in detecting the most likely epistatic subtype for pairs of genes, as well as in reducing bias when estimating the epistatic parameter (ε). We apply our approach to two available datasets from yeast (Saccharomyces cerevisiae), and demonstrate through overlap of our identified epistatic pairs with experimentally-verified interactions and functional links that our results are likely of biological significance in understanding interaction mechanisms. We anticipate that our method will improve detection of epistatic interactions and will help to unravel the mysteries of complex biological systems.
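As a minimal illustration of why the choice of quantitative definition matters, the following sketch computes the epistatic deviation for one gene pair under three subtypes. It assumes the commonly used textbook forms of additive, multiplicative, and minimum epistasis, with fitnesses scaled so that the wild type equals 1; these formulas, the function names, and the fitness values are illustrative assumptions and are not taken from the Methods of this chapter. The sketch does not implement the likelihood or model selection procedure itself.

```python
# Illustrative sketch only. The three "expected fitness" formulas are the
# commonly used definitions of additive, multiplicative, and minimum
# epistasis (wild-type fitness scaled to 1); they are assumptions here.

def expected_double_fitness(w_i, w_j, subtype):
    """Expected double-mutant fitness if the two genes do not interact."""
    if subtype == "additive":
        return w_i + w_j - 1.0
    if subtype == "multiplicative":
        return w_i * w_j
    if subtype == "minimum":
        return min(w_i, w_j)
    raise ValueError(f"unknown epistatic subtype: {subtype}")


def epistasis_by_subtype(w_i, w_j, w_ij):
    """Return epsilon = observed minus expected double-mutant fitness."""
    subtypes = ("additive", "multiplicative", "minimum")
    return {s: w_ij - expected_double_fitness(w_i, w_j, s) for s in subtypes}


if __name__ == "__main__":
    # Hypothetical mean fitnesses for two single mutants and their double mutant.
    for subtype, eps in epistasis_by_subtype(0.8, 0.8, 0.64).items():
        print(f"{subtype:>14}: epsilon = {eps:+.3f}")
```

With these hypothetical values the same gene pair shows no interaction under the multiplicative definition, a positive deviation under the additive definition, and a negative deviation under the minimum definition, which is the kind of disagreement that motivates selecting among subtypes rather than fixing one in advance.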


2.1 Introduction

Understanding the nature of genetic interactions is crucial to obtaining a more complete picture of complex biological systems and their evolution. The discovery of genetic interactions has been the goal of many researchers studying a number of model systems, including but not limited to Saccharomyces cerevisiae, , and Escherichia coli [38, 39, 60, 65, 131, 197, 235, 245, 255, 279, 312, 316]. Recently, high-throughput experimental approaches, such as epistatic mini-array profiles (E-MAPs) and genetic interaction analysis technology for E. coli (GIANT-coli), have enabled the study of epistasis on a large scale [49, 50, 240, 241, 282]. However, it remains unclear whether the computational and statistical methods currently in use to identify these interactions are indeed the most appropriate.

The study of genetic interaction, or “epistasis,” has had a long and somewhat convoluted history. Bateson [16] first used the term “epistasis” to describe the ability of a gene at one locus to “mask” the mutational influence of a gene at another locus [54]. The term “epistacy” was later coined by Fisher [78] to denote the statistical deviation of multilocus genotype values from an additive linear model for the value of a phenotype [205, 206]. These origins are the basis for the two main current interpretations of epistasis. The first, as introduced by Bateson [16], is the “biological,” “physiological,” or “compositional” form of epistasis, concerned with the influence of an individual’s genetic background on an allele’s effect on phenotype [47, 54, 176, 205, 206]. The second interpretation, attributed to Fisher, is “statistical” epistasis, which in its linear regression framework places the phenomenon of epistasis in the context of a population [176, 206, 292, 293, 302]. Each of these approaches is equally valid in studying genetic interactions; however, confusion still exists about how best to reconcile the methods and results of the two [9, 54, 158, 176, 205, 206].
Aside from the distinction between the statistical and physiological definitions of epistasis, there exist inconsistencies when studying solely physiological epistasis. For categorical traits, physiological epistasis is clear as a “masking” effect. When non-categorical or numerical traits are measured, epistasis is defined as the deviation of the phenotype of the multiple mutant from that expected under independence of the underlying genes. The “expectation” of the phenotype under independence, that is, in the absence of epistasis, is not defined consistently between studies. For clarity, consider epistasis between pairs of genes, and without loss of generality, consider fitness as the phenotype. The first commonly-used definition of independence, originating from additivity, defines the effect of two independent mutations to be equal to the sum of the individual mutational effects. A second, motivated by the use of fitness as a phenotype, defines the effect of the two mutations as the product of the individual effects [64, 69, 206]. A third definition of independence has been referred to as “Minimum,” where alleles at two loci are independent if the double mutant has the same fitness as the less-fit single mutant. Mani et al. [165] claim that this has been used when identifying pairwise epistasis by searching for synthetic lethal double mutants [59, 197, 198, 278, 279]. A fourth is the “Log” definition presented by Mani et al. [165] and Sanjuan and Elena [236]. The less-frequently used “scaled $\epsilon$” [245] measure of epistasis takes the multiplicative definition of independence with a scaling factor.
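The three main definitions of independence can be made concrete with a short sketch. The Python below is illustrative only and is not the authors' code; the function name and the fitness values are hypothetical, and the additive expectation is written in the wild-type-scaled form $w_i + w_j - 1$ that the Methods use for the additive null model.

```python
def expected_double_mutant_fitness(w_i, w_j, definition):
    """Expected double-mutant fitness if the two genes act independently.

    Fitnesses are relative to wild type (wild-type fitness = 1).
    """
    if definition == "additive":
        return w_i + w_j - 1.0
    if definition == "multiplicative":
        return w_i * w_j
    if definition == "minimum":
        return min(w_i, w_j)
    raise ValueError(f"unknown definition: {definition}")


# Hypothetical fitness values, chosen only for illustration.
w_i, w_j = 0.9, 0.6
observed_w_ij = 0.58

for definition in ("additive", "multiplicative", "minimum"):
    expected = expected_double_mutant_fitness(w_i, w_j, definition)
    deviation = observed_w_ij - expected
    print(f"{definition:>14}: expected {expected:.2f}, deviation {deviation:+.2f}")
```

With these made-up values, the same observed double mutant fitness lies above the additive and multiplicative expectations but below the minimum expectation, so the inferred sign of the deviation depends on the definition chosen.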

These different definitions of independence are partly due to distinct measurement “scales.” For some traits, a multiplicative definition of independence may be necessary to identify epistasis between two genes, whereas for other traits, additivity may be appropriate [75, 165, 206, 292]. An interaction found under one independence definition may not necessarily be found under another, leading to different biological conclusions [165]. Mani et al. [165] suggest that there may be an “ideal” definition of independence for all gene pairs for identifying functional relationships. However, it is plausible that different representations of independence for two genes may reflect different biological properties of the relationship [146, 229]: “Two categories of general interest [the additive and multiplicative definitions, respectively] are those in which etiologic factors act interchangeably in the same step in a multistep process, or alternatively act at different steps in the process” [229, p. 468]. In some cases, the discovery of “epistasis” may merely be an artifact of using an incorrect null model [146]. It may be necessary to represent “independence” differently, resulting in different statistical measures of interactions, for different pairs of genes depending on their functions.

Previous studies have suggested that different pairs of loci may have different modes of interaction and have attempted to sub-classify genetic interactions into regulatory hierarchies and mutually exclusive “interaction subtypes” to elucidate underlying biological properties [8, 65, 255]. We suggest that epistatic relationships can be divided into several “subtypes,” or forms, corresponding to the aforementioned definitions of independence. As a particular gene pair may deviate from independence according to several criteria, we do not claim that these subtypes are necessarily mutually exclusive.
We attempt to select the most likely epistatic subtype which is the best statistical representation of the relationship between two genes. To further sub-classify interactions, epistasis among deleterious mutations can take one of two commonly used forms: positive (equivalently, alleviating, antagonistic, or buffering) epistasis, where the phenotype of the double mutant is less severe than expected under independence; and negative (equivalently, aggravating, synergistic, or synthetic) epistasis, where the phenotype is more severe than expected [49, 64, 165, 245].

Another objective of such distinctions is to reduce the bias of the estimator of the epistatic parameter ($\epsilon$), which measures the extent and direction of epistasis for a given gene pair. Mani et al. [165], assuming that the overall distribution of $\epsilon$ should be centered around 0, find that inaccurately choosing a definition of independence can result in increased bias when estimating $\epsilon$. For example, using the minimum definition results in the most severe bias when single mutants have moderate fitness effects, and the additive definition results in the largest positive bias when at least one gene has an extreme fitness defect [165]. Therefore, it is important to select an optimal estimator for $\epsilon$ for each pair of genes from among the subtypes of epistatic interactions.

Epistasis may be important to consider in genomic association studies, as a gene with a weak main effect may only be identified through its interaction with another gene or other genes [55, 56, 83, 174, 175]. Epistasis has also been studied extensively in the context of the evolution of sex and recombination. The mutational deterministic hypothesis proposes that the evolution of sex and recombination would be favored by negative epistatic interactions [76, 144]; many other studies have also examined the importance of the form of epistasis [38, 64, 69, 141, 163, 196]. Indeed, according to Mani et al. [165, p. 3466], “the choice of definition [of epistasis] alters conclusions relevant to the adaptive value of sex and recombination.”

Given fitness data from single and double mutants in haploid organisms, we implement a likelihood method to determine the subtype which is the best statistical representation of the epistatic interaction for pairs of genes. We use maximum likelihood estimation and the Bayesian Information Criterion (BIC) [244] with a likelihood ratio test to select the most appropriate null or epistatic model for each putative interaction. We conduct extensive simulations to gauge the performance of our method, and demonstrate that it performs reasonably well under various interaction scenarios. We apply our method to two datasets with fitness measurements obtained from yeast [131, 255], whose authors assume only multiplicative epistasis for all interactions. By examining functional links and experimentally-validated interactions among epistatic pairs, we demonstrate that our results are biologically meaningful. Studying a random selection of genes, we find that minimum epistasis is more prevalent than both additive and multiplicative epistasis, and that the mean of the overall distribution of $\epsilon$ is not significantly different from zero, in contrast to the predominantly positive epistasis reported by Jasnos and Korona [131]. For genes in a particular pathway, we advise selecting among fewer epistatic subtypes. We believe that our method of epistatic subtype classification will aid in understanding genetic interactions and their properties.

2.2 Methods

We consider epistasis among $L$ genes (represented by $g_1, g_2, \ldots, g_L$) given the phenotypes of each of the possible single and double mutant haplotypes, each with a given number of replicates. The phenotype we study is fitness scaled by wildtype, obtained either through experiments or simulation [49, 50, 131, 240, 241, 245, 255]. The fitness of the wildtype genotype $g_0$ is 1, and the relative fitnesses of single mutants $g_i$ ($1 \le i \le L$) and double mutants $g_{ij}$ ($1 \le i < j \le L$) are denoted by $w_i$ and $w_{ij}$, respectively. Supposing that we have $N$ replicates of fitness values for each mutant, the superscript $n$ represents measurements in the $n$-th replicate ($1 \le n \le N$). We denote by $\sigma_i^2$ and $\sigma_{ij}^2$ the variances in fitnesses of single mutants $g_i$ and double mutants $g_{ij}$, respectively. We define $\epsilon_{ij}$ as the epistatic coefficient of a genetic interaction between genes $g_i$ and $g_j$.

2.2.1 Epistatic Subtype Selection

Pairwise physiological epistasis for measured numerical traits is defined as the deviation of the observed phenotype of the double mutant from the phenotype expected when the alleles at each locus act independently. As mentioned above, the additive and multiplicative models are widely used to represent independence, while the minimum model is used by investigators searching for synthetic lethal double mutants [165]. Thus, we focus predominantly on these three models, as do Mani et al. [165]. The “Log” model is practically equivalent to the multiplicative model for deleterious mutations [165]; since we generally focus on deleterious mutations, we do not consider this model. We do not include a model based on the “scaled $\epsilon$” measure of epistasis [245], as it was found to be less meaningful than the multiplicative model in describing interactions [255].

We thus use six models: the three models of independence (additive, multiplicative, and minimum) and the three models (“subtypes”) of epistasis (additive, multiplicative, and minimum epistasis), which incorporate a deviation from the independence models. We have three major questions. First, when data follow one of the three models of independence, do we incorrectly infer the presence of epistasis? Second, can we detect epistasis between two genes if they interact? Third, can we accurately determine which “subtype” is the most appropriate for a particular pair of genes if they interact?

The three null models are as follows. If genes $g_i$ and $g_j$ act additively, i.e., there is no additive epistasis, the null model for the double mutant fitness in the $n$-th experiment ($w_{ij}^n$) is a normal distribution with mean $w_i^n + w_j^n - 1$ and variance $\sigma_{ij}^2$. If $g_i$ and $g_j$ act multiplicatively, $w_{ij}^n$ is a normal random variable with mean $w_i^n w_j^n$ and variance $\sigma_{ij}^2$. The third model specifying the absence of epistasis assumes that $w_{ij}^n$ is normal with mean $\min(w_i^n, w_j^n)$ and variance $\sigma_{ij}^2$. We emphasize that the minimum model described here is a model of independence.

In the presence of additive epistasis $\epsilon_{ij}$, the probability distribution of the double mutant fitness $w_{ij}$ in the $n$-th experiment, given $w_i^n$, $w_j^n$, and variance $\sigma_{ij}^2$, is:

$$P(w_{ij}^n = x \mid w_i^n, w_j^n, \epsilon_{ij}, \sigma_{ij}^2) \sim N(w_i^n + w_j^n - 1 + \epsilon_{ij},\ \sigma_{ij}^2) \tag{2.1}$$

When $g_i$ and $g_j$ exhibit multiplicative epistasis $\epsilon_{ij}$, $w_{ij}^n$ has probability distribution:

$$P(w_{ij}^n = x \mid w_i^n, w_j^n, \epsilon_{ij}, \sigma_{ij}^2) \sim N(w_i^n w_j^n (1 + \epsilon_{ij}),\ \sigma_{ij}^2) \tag{2.2}$$

This form of multiplicative epistasis is slightly different from that of other authors, who suggest $w_{ij} = w_i w_j + \epsilon_{ij}$ [131, 165, 245, 255]; our formulation is simply a rescaling of this formula. Our motivation for Equation 2.2 is that it relates more closely to the large body of research representing the fitness of an individual with $k$ mutations as $w_k = e^{-sk^{1+\epsilon}}$ [64, 69, 196, 302].

In the case of minimum epistasis, $w_{ij}^n$ is also distributed normally:

$$P(w_{ij}^n = x \mid w_i^n, w_j^n, \epsilon_{ij}, \sigma_{ij}^2) \sim N(\min(w_i^n, w_j^n) + \epsilon_{ij},\ \sigma_{ij}^2) \tag{2.3}$$
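As a minimal sketch of Equations 2.1 to 2.3, the Python below computes the model mean for each subtype and the normal log likelihood of replicate double mutant fitnesses for a single gene pair. It is not the authors' implementation; the function names and the example data are hypothetical, and setting `eps = 0` recovers the corresponding null model.

```python
import math


def model_mean(w_i, w_j, subtype, eps=0.0):
    """Mean double-mutant fitness under Equations 2.1-2.3 (eps = 0: null model)."""
    if subtype == "additive":
        return w_i + w_j - 1.0 + eps
    if subtype == "multiplicative":
        return w_i * w_j * (1.0 + eps)
    if subtype == "minimum":
        return min(w_i, w_j) + eps
    raise ValueError(f"unknown subtype: {subtype}")


def log_likelihood(w_i, w_j, w_ij, subtype, eps, sigma2):
    """Normal log likelihood of N replicate double-mutant fitnesses for one pair."""
    total = 0.0
    for wi_n, wj_n, wij_n in zip(w_i, w_j, w_ij):
        resid = wij_n - model_mean(wi_n, wj_n, subtype, eps)
        total += -0.5 * math.log(2.0 * math.pi * sigma2) - resid ** 2 / (2.0 * sigma2)
    return total


# Hypothetical replicate fitnesses for one gene pair (illustration only).
w_i = [0.90, 0.85, 0.88]
w_j = [0.60, 0.65, 0.62]
w_ij = [0.45, 0.50, 0.47]
for subtype in ("additive", "multiplicative", "minimum"):
    print(subtype, round(log_likelihood(w_i, w_j, w_ij, subtype, 0.0, 0.01), 3))
```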

An epistatic “subtype” refers to one of the three epistatic models shown in the equations above. To calculate the likelihood of the observed double mutant fitnesses, we assume that each pair of genes is independent of all other pairs, i.e., there are no interactions of higher order. The log likelihood of the data based on $N$ replicates is

$$\log L = \sum_{n=1}^{N} \sum_{(i,j):\, 1 \le i < j \le L} \Big[ \log P_E(w_{ij}^n = x \mid w_i^n, w_j^n, \epsilon_{ij}, \sigma_{ij}^2) \cdot I\{g_i \text{ interacts with } g_j\} + \log P_N(w_{ij}^n = x \mid w_i^n, w_j^n, \sigma_{ij}^2) \cdot I\{g_i \text{ does not interact with } g_j\} \Big], \tag{2.4}$$

where $I\{\Phi\}$ is an indicator function equal to one if $\Phi$ is true and zero otherwise, and $P_E$ and $P_N$ are the probabilities (given above) under the epistatic and null models, respectively. Since we assume that each pair of genes is independent of all other pairs, we study each separately.

We use the Bayesian Information Criterion (BIC) [244], a commonly used model selection criterion, to determine the most likely interaction subtype, if any, of each pair. To calculate the BIC, we first estimate the parameters $\epsilon_{ij}$ and $\sigma_{ij}^2$ using Maximum Likelihood Estimation (MLE). For additive epistasis, the ML estimators (MLEs) for $\epsilon_{ij}$ and $\sigma_{ij}^2$ are

$$\begin{aligned} \hat{\epsilon}^{(a)}_{ij\,\mathrm{MLE}} &= \bar{w}_{ij} - \bar{w}_i - \bar{w}_j + 1 \\ \hat{\sigma}^2_{ij\,\mathrm{MLE}} &= \frac{1}{N} \sum_{n=1}^{N} \left( w_{ij}^n - w_i^n - w_j^n + 1 - \hat{\epsilon}^{(a)}_{ij\,\mathrm{MLE}} \right)^2, \end{aligned} \tag{2.5}$$

where $\bar{w}_{\cdot}$ denotes the mean fitness of genotype $g_{\cdot}$ over all $N$ replicates, and the right superscript $(a)$ indicates that the estimator is for additive epistasis. The MLEs for multiplicative and minimum epistasis are given in Appendix A. We select the model that is the best statistical representation of the interaction for each pair of genes using the BIC, defined as

$$\begin{aligned} \mathrm{BIC} &= n_p \log(N) - 2 \log L_M \\ &= n_p \log(N) + N\left(1 + \log(2\pi \hat{\sigma}^2_{ij\,\mathrm{MLE}})\right), \end{aligned} \tag{2.6}$$

where $L_M$ and $n_p$ are the maximum likelihood and the number of free parameters, respectively, in the given model. For the three null models, $n_p$ is one, as only $\sigma_{ij}^2$ is estimated; for the three epistatic models, $n_p = 2$, as both $\sigma_{ij}^2$ and $\epsilon_{ij}$ are estimated. The model with the smallest BIC is selected as optimal, rewarding models with a higher likelihood but penalizing for additional parameters. For each pair, we perform a likelihood ratio test (LRT) of the epistatic model with the lowest BIC against its corresponding null model as an optional step to assess the significance of epistasis. The LRT statistic is given as

$$\lambda = -2\left[\log L_{M_{\mathrm{null}}} - \log L_{M_{\mathrm{epistatic}}}\right]. \tag{2.7}$$
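The estimation and selection steps in Equations 2.5 to 2.7 can be sketched for a single gene pair as follows. This Python is illustrative only and is not the authors' R code. The additive estimator follows Equation 2.5; the estimators used here for the multiplicative and minimum subtypes are assumptions derived by maximizing the normal likelihoods in Equations 2.2 and 2.3, because the Appendix A formulas are not reproduced in this section. All names and data values are hypothetical.

```python
import math

SUBTYPES = ("additive", "multiplicative", "minimum")


def independence_expectation(wi, wj, subtype):
    """Expected double-mutant fitness under one of the three null models."""
    if subtype == "additive":
        return wi + wj - 1.0
    if subtype == "multiplicative":
        return wi * wj
    if subtype == "minimum":
        return min(wi, wj)
    raise ValueError(f"unknown subtype: {subtype}")


def fit_model(w_i, w_j, w_ij, subtype, epistatic):
    """Return (eps_hat, sigma2_hat, max_log_lik, bic) for one of the six models."""
    n = len(w_ij)
    base = [independence_expectation(a, b, subtype) for a, b in zip(w_i, w_j)]
    if not epistatic:
        eps_hat = 0.0
    elif subtype == "multiplicative":
        # Assumed estimator: least-squares solution of w_ij ~ base * (1 + eps).
        eps_hat = sum(b * (y - b) for b, y in zip(base, w_ij)) / sum(b * b for b in base)
    else:
        # Equation 2.5 for additive; the same mean-deviation form is assumed for minimum.
        eps_hat = sum(y - b for b, y in zip(base, w_ij)) / n
    if subtype == "multiplicative":
        fitted = [b * (1.0 + eps_hat) for b in base]
    else:
        fitted = [b + eps_hat for b in base]
    sigma2_hat = sum((y - f) ** 2 for y, f in zip(w_ij, fitted)) / n
    sigma2_hat = max(sigma2_hat, 1e-12)  # numerical guard against log(0); not in the source
    max_log_lik = -0.5 * n * (1.0 + math.log(2.0 * math.pi * sigma2_hat))
    n_params = 2 if epistatic else 1
    bic = n_params * math.log(n) - 2.0 * max_log_lik  # Equation 2.6
    return eps_hat, sigma2_hat, max_log_lik, bic


def select_model(w_i, w_j, w_ij):
    """Pick the lowest-BIC model; return the LRT statistic if it is epistatic."""
    fits = {
        (subtype, epistatic): fit_model(w_i, w_j, w_ij, subtype, epistatic)
        for subtype in SUBTYPES
        for epistatic in (False, True)
    }
    best = min(fits, key=lambda key: fits[key][3])
    lrt_stat = None
    if best[1]:
        null_fit = fits[(best[0], False)]
        lrt_stat = -2.0 * (null_fit[2] - fits[best][2])  # Equation 2.7
    return best, fits[best], lrt_stat


# Hypothetical replicate fitnesses for one gene pair (illustration only).
w_i = [0.90, 0.80, 0.85, 0.95]
w_j = [0.50, 0.60, 0.55, 0.45]
w_ij = [0.31, 0.39, 0.355, 0.245]
best, best_fit, lrt_stat = select_model(w_i, w_j, w_ij)
print(best, round(best_fit[0], 3), lrt_stat)
```

The sketch fits all six models to the same replicates so that the BIC comparison in Equation 2.6 and the statistic in Equation 2.7 use a common set of residual variances.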

Under the null hypothesis, $\lambda$ is approximately $\chi^2$ distributed with one degree of freedom (for the one additional parameter $\epsilon$ estimated in the epistatic model).

When identifying epistasis for many gene pairs, the problem of simultaneously testing multiple hypotheses arises. We consider three commonly-used multiple-testing corrections, which use the p-values obtained from the LRT above. The first is the Bonferroni correction, which sets the chosen significance level divided by the number of tests (in our case, the number of putatively epistatic gene pairs) as a p-value cutoff. This is generally considered to be the most conservative procedure to control the family-wise error rate (the probability of rejecting at least one true null hypothesis among all tests). The second multiple-testing correction, proposed by Benjamini and Hochberg [26], controls the false discovery rate (FDR, or the expected proportion of null hypotheses that are falsely rejected). Third, we control the positive false discovery rate (pFDR), which is slightly more lenient than the FDR procedure, as suggested by Storey [258]. We advocate determining significance using the FDR, which is intermediate between the other two procedures (Sections 2.3.1 and 2.4). We implement our methods in R (http://www.r-project.org, code available upon request).
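A minimal sketch of the testing step follows, assuming the LRT p-value is taken from the upper tail of a $\chi^2$ distribution with one degree of freedom and that the FDR is controlled with the standard Benjamini and Hochberg step-up rule. The Python is illustrative rather than the authors' R implementation, the function names and p-values are hypothetical, and the pFDR procedure of Storey is not sketched here.

```python
import math


def lrt_p_value(loglik_null, loglik_epistatic):
    """P-value of lambda = -2 [log L_null - log L_epistatic] against chi-square(1)."""
    lam = max(0.0, -2.0 * (loglik_null - loglik_epistatic))
    # For one degree of freedom, P(chi2 > lam) = erfc(sqrt(lam / 2)).
    return math.erfc(math.sqrt(lam / 2.0))


def bonferroni_reject(p_values, alpha=0.05):
    """Reject when p <= alpha / (number of tests)."""
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Step-up rule: reject the k smallest p-values, k = max rank with p <= alpha * rank / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            largest_rank = rank
    rejected = [False] * m
    for idx in order[:largest_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for four gene pairs (illustration only).
p_values = [0.001, 0.02, 0.03, 0.2]
print(bonferroni_reject(p_values))
print(benjamini_hochberg_reject(p_values))
```

On these made-up p-values the Bonferroni cutoff rejects fewer hypotheses than the step-up rule, in line with the text's description of Bonferroni as the more conservative correction.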

2.2.2 Generation of Simulated Data

We carried out extensive simulations to assess the performance of our method. For simplicity, we assume that each mutation is deleterious, i.e., the fitnesses of single mutants cannot be greater than that of wild type ($w_0 = 1$). (In Appendix A and Table A.1, we consider relaxing this assumption to include beneficial mutations.) We sample the fitnesses of single mutants $g_i$ and $g_j$ from a truncated normal distribution on $[0, 1]$ with mean $\mu$ and standard deviation $\sigma$, i.e., $w_i \sim N(\mu_i, \sigma_i^2)_{[0,1]}$ and $w_j \sim N(\mu_j, \sigma_j^2)_{[0,1]}$. Although many deletion strains found in nature have a wildtype fitness, we simulate with this support in order to encompass the entire range of fitness values possible for deleterious mutants. We choose the support for double mutant fitnesses to be $[0, 1.2]$; since both synergistic and antagonistic epistasis may be possible, the fitness of double mutants may be greater than 1. We simulate $w_{ij}$ given $w_i$ and $w_j$ following the definitions presented in Section 2.2.1. For example, if genes $g_i$ and $g_j$ act additively and independently, $w_{ij} \sim N(w_i + w_j - 1, \sigma_{ij}^2)_{[0,1.2]}$; under additive epistasis, $w_{ij} \sim N(w_i + w_j - 1 + \epsilon_{ij}, \sigma_{ij}^2)_{[0,1.2]}$. Distributions under multiplicative and minimum null and epistatic models are given in Appendix A.
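The simulation scheme for the additive case can be sketched as below. This is a hedged Python illustration, not the authors' code: rejection sampling is one simple way to draw from a truncated normal and is an assumption here, as are the function names, the per-replicate resampling of single mutant fitnesses, and every parameter value.

```python
import random


def truncated_normal(rng, mean, sd, low, high):
    """Draw from N(mean, sd^2) restricted to [low, high] by rejection sampling.

    This can be slow if [low, high] lies far in the tail of the distribution.
    """
    while True:
        x = rng.gauss(mean, sd)
        if low <= x <= high:
            return x


def simulate_pair(rng, mu_i, mu_j, sd_single, sd_double, eps, n_replicates):
    """Simulate replicates for one gene pair under additive epistasis.

    Setting eps = 0 gives the additive null (independence) model.
    """
    w_i, w_j, w_ij = [], [], []
    for _ in range(n_replicates):
        wi = truncated_normal(rng, mu_i, sd_single, 0.0, 1.0)
        wj = truncated_normal(rng, mu_j, sd_single, 0.0, 1.0)
        wij = truncated_normal(rng, wi + wj - 1.0 + eps, sd_double, 0.0, 1.2)
        w_i.append(wi)
        w_j.append(wj)
        w_ij.append(wij)
    return w_i, w_j, w_ij


# Hypothetical parameter values (illustration only).
rng = random.Random(2010)
w_i, w_j, w_ij = simulate_pair(rng, 0.8, 0.7, 0.1, 0.05, -0.1, 50)
print(len(w_ij), min(w_ij) >= 0.0, max(w_ij) <= 1.2)
```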

2.2.3 Analysis of Experimental Data

We analyze two datasets of fitness measurements in yeast (Saccharomyces cerevisiae) [131, 255]. In addition to choosing the appropriate epistatic subtype based on the BIC, we perform an LRT to test epistatic models against their corresponding null models and correct for multiple testing with the three methods described in Section 2.2.1. For all tests, we select an α significance level of 0.05.

The first dataset, from St Onge et al. [255], included 26 non-essential genes known to confer resistance to the DNA-damaging agent methyl methanesulfonate (MMS), and contained double-deletion strains for all pairs that were possible to construct (323 double mutant strains). The authors measured the fitnesses of single and double mutants in media both with and without MMS. For both media types, while single mutants have approximately 10 replicate fitness measurements, double mutants have only two. Since our method assumes the same number of replicates for single and double mutants, we apply it to two single mutant fitness measurements sampled from those available. We repeat this sampling 1000 times, and select an epistatic subtype for the pairs that were chosen as epistatic in at least 900 out of the 1000 replicates (see Appendix A for details).

In the second study, Jasnos and Korona [131] performed 639 random crosses among 758 randomly selected gene deletions to create a series of double mutants and assayed the growth fitnesses of each single and double mutant. As an equal number of replicates is available for both single and double mutants (n ≈ 5), we apply our method to this dataset directly.

We perform several analyses to validate the use of our method. First, we examine the number of epistatic pairs sharing “specific” functional links, a term used by both St Onge et al. [255] and Tong et al. [279]. St Onge et al. [255] define these to be genes that share Gene Ontology (GO) terms [270] with which fewer than 30 genes are associated; Tong et al. [279] define the cutoff for specificity to be less than or equal to 200 associated genes. We examine our data using these and several intermediate cutoffs. Second, we count epistatic pairs sharing the broader GO Slim terms (obtained from the Yeast Genome Database, http://www.yeastgenome.org, accessed 2009-09-11). To assess the significance of the number of specific functional links or shared GO Slim terms among inferred epistatic pairs, we implement both Fisher’s Exact Test and a permutation procedure (Table A.6), which produce comparable results. Third, we examine overlap of inferred epistatic pairs with experimentally-verified physical interactions in yeast (BIOGRID database [257], http://www.thebiogrid.org, Release 2.0.56; see Appendix A).
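The enrichment test can be illustrated with a small sketch. The Python below assumes a one-sided Fisher's Exact Test for over-representation of functional links among epistatic pairs, computed directly from the hypergeometric distribution; whether the authors used a one-sided or two-sided test is not stated in this section. The function name and the counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_pairs, n_linked, n_epistatic, n_epistatic_linked):
    """One-sided Fisher's exact test for enrichment.

    Returns P(X >= n_epistatic_linked), where X is the number of functionally
    linked pairs among n_epistatic pairs drawn without replacement from
    n_pairs tested pairs, of which n_linked carry a functional link.
    """
    upper = min(n_linked, n_epistatic)
    favourable = sum(
        comb(n_linked, k) * comb(n_pairs - n_linked, n_epistatic - k)
        for k in range(n_epistatic_linked, upper + 1)
    )
    return favourable / comb(n_pairs, n_epistatic)


# Hypothetical counts (illustration only): 10 tested pairs, 3 with a functional
# link, 4 inferred to be epistatic, 2 of which carry a functional link.
print(round(enrichment_p_value(10, 3, 4, 2), 4))
```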

2.3 Application to Simulated Data

2.3.1 Estimation of Epistatic Measures

We perform 10,000 simulations following Section 2.2.2 under each of the six models (three null models and three epistatic models) for each set of parameters shown in Table 2.1. For each simulation, we sample 50 fitness replicates (N = 50), obtain the BIC for each of the six models, and select the model with the lowest BIC. We compute accuracy as the proportion of simulations in which the model under which the data are simulated is selected (see also Tables A.1, A.2, and A.3).

Table 2.1 demonstrates that our method can select the correct null or epistatic model for various parameter combinations. When simulating under one of the three null models, our method identifies the correct model with almost 100% accuracy. Overall, the minimum null model is identified with higher accuracy than either the additive or multiplicative null models. Identification of the minimum epistatic subtype is consistently the most accurate (near 100%), while the accuracy of selecting the additive and multiplicative epistatic subtypes varies with the epistatic coefficient $\epsilon_{ij}$.

If two single mutants have the same mean fitness, our method is less able to detect multiplicative epistasis than either additive or minimum epistasis (Table 2.1). Introducing a fairly large difference between single mutant fitness values slightly increases the accuracy of selecting multiplicative epistasis when data are simulated under this definition. For all epistatic models, the probability of selecting the appropriate model generally increases with the absolute value of $\epsilon_{ij}$: the larger the absolute value of $\epsilon_{ij}$, the larger is the difference between the expected double mutant fitness under the correct model and the other models. We also have little ability to consistently distinguish between the additive and multiplicative models. See also Figures A.1 and A.2.

Testing hypotheses with the LRT and FDR procedure [26] gives nearly identical accuracy to that described above (results not shown). With data simulated under the null model, as expected, the FDR procedure slightly decreases the false positive rate; for data simulated under an epistatic model, power decreases by at most 1%. Although the false positive rate decreases when using the Bonferroni correction, we obtain slightly lower power than with the FDR; the pFDR [258] does not greatly improve power over the FDR (results not shown). The FDR procedure thus seems preferable.

2.3.2 Bias and Variance of the Epistatic Parameter ($\epsilon$)

Mani et al. [165] show that using different definitions of independence results in different estimators of the epistatic parameter ($\epsilon$), some with larger biases and variances than others. To evaluate the bias and variance of $\hat{\epsilon}_{\mathrm{MLE}}$ estimated with our method, we perform simulations and compute its mean squared error (MSE) under each epistatic model. Because the FDR procedure does not greatly affect the accuracy of our inference (see above), we do not implement it in these simulations. For the $n$-th simulation, $\mathrm{MSE} = (\hat{\epsilon}^{\,n}_{\mathrm{MLE}} - \epsilon_n)^2$, where $\epsilon_n$ is the true value used in the $n$-th simulation and $\hat{\epsilon}^{\,n}_{\mathrm{MLE}}$ is estimated under the specific epistatic model. If the simulated $|\epsilon|$ is large, the MSE of $\hat{\epsilon}_{\mathrm{MLE}}$ is the smallest and is nearly zero when the true epistatic model is selected (Figure A.3). When the simulated $|\epsilon|$ decreases, we find that while the MSE is always small when the correct model is selected, at times the MSE is even lower under an incorrect model (Figure A.3). In general, our method seems to reduce the bias and variance of the estimator of $\epsilon$.
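As a small illustration of why the MSE summarizes both bias and variance, the sketch below averages the per-simulation squared errors and checks the standard decomposition MSE = bias² + variance. Averaging over simulations with a common true value is an assumption made for this example; the function name and the estimates are hypothetical.

```python
def mse_bias_variance(estimates, true_value):
    """Average squared error of the estimates, with its bias and variance parts."""
    n = len(estimates)
    mean_estimate = sum(estimates) / n
    bias = mean_estimate - true_value
    variance = sum((e - mean_estimate) ** 2 for e in estimates) / n
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return mse, bias, variance


# Hypothetical estimates of epsilon from repeated simulations with true value 0.1.
estimates = [0.08, 0.12, 0.15, 0.09, 0.11]
mse, bias, variance = mse_bias_variance(estimates, 0.1)
print(abs(mse - (bias ** 2 + variance)) < 1e-12)
```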

2.4 Application to Experimental Data

Epistasis has been studied extensively in yeast (Saccharomyces cerevisiae), a model organism for which generating a large number of mutants and examining growth phenotypes is experimentally tractable. Of interest are studies measuring single and double mutant fitnesses on a continuous scale. With our proposed method, we re-analyze two datasets by detecting epistasis and its subtype for each pair of mutants [131, 255].

2.4.1 St Onge et al. [255] Dataset

St Onge et al. [255] examined 26 non-essential genes known to confer resistance to MMS, constructed 323 double-deletion strains (all but two of the total possible pairs), and assumed the multiplicative form of epistasis for all interactions (Section 2.2.3). Following these authors, we focus on single and double mutant fitnesses measured in the presence of MMS. (For results in the absence of MMS, see Table A.4.) Using the resampling method described in Section 2.2.3 and Appendix A, 222 gene pairs pass the cutoff of having epistasis inferred in at least 900 out of 1000 replicates. This does not include five synthetic lethal gene pairs. Hypothesis testing and a multiple-testing procedure (for 222 simultaneous hypotheses) are necessary to determine the final epistatic pairs.

To select one among the three multiple-testing procedures, we follow St Onge et al. [255] and examine gene pairs that share “specific” functional links (see Section 2.2.3). The Bonferroni method is likely too conservative, yielding only 25 significantly epistatic pairs with only one functional link among them; alternatively, the pFDR procedure appears to be too lenient in rejecting independence for all 222 pairs. Therefore, we use the FDR procedure (although the number of functional links is not significant) and detect 193 epistatic pairs, of which 5 (2.6%) are synthetic lethal, 19 (9.8%) have additive epistasis, 33 (17.1%) have multiplicative epistasis, and 136 (70.5%) have minimum epistasis (Table 2.2, Table A.4). We find 29 gene pairs with positive (alleviating) epistasis, and 159 gene pairs with negative (aggravating) epistasis.

To further validate the use of our method and the FDR procedure, we assess by Fisher’s Exact Test the significance of an enrichment of both Biological Process and all GO Slim term links among epistatic pairs, neither of which is significant ([270]; http://www.yeastgenome.org; Table 2.3). Our discovered epistatic interactions do overlap with experimentally-verified physical interactions (BIOGRID database [257]; Tables 2.3, A.5, and A.6). Although some of the previously unidentified interactions that we identify could be false positives, many are likely to be new discoveries.

The epistatic subtypes we consider are not necessarily mutually exclusive. To more fully assess the assumptions of our method, we also consider several of the possible subsets of the epistatic subtypes (and their corresponding null models) in our procedure. As the minimum epistatic subtype was the most frequently selected in this dataset, we first exclude both the minimum null model and the minimum epistatic model from our procedure (i.e., we select from among four rather than six models for a pair; Table 2.3, column 3). In Table 2.3, we also examine our method’s performance when it selects from only additive epistasis and independence (column 4), only multiplicative epistasis and independence (column 5), or only minimum epistasis and independence (column 6). Allowing for different subsets of subtypes does not greatly affect the number of inferred epistatic pairs with experimentally identified interactions (Table 2.3, Table A.5).
However, there are a significant number of epistatic pairs with functional links only when the minimum epistatic subtype is not included (also see Table A.5). It is not immediately clear which epistatic subtypes are the most appropriate for these data, although including the minimum subtype may not be appropriate [165] (Section 2.5). Although it may be best to consider fewer epistatic subtypes for this specific dataset, we report our results including all three epistatic subtypes and their corresponding null models (Table 2.3).

Our method identifies epistasis for 88 of the same pairs as St Onge et al. [255], although we identify 105 epistatic pairs not identified by the original authors (Figure A.4, Table A.5). St Onge et al. [255] find that epistatic pairs with a functional link have a positively-shifted distribution of epistasis. We find no such shift in epistasis values (Table 2.2, Figure A.5). We also demonstrate (as described in Section 2.3.2) that our method seems to reduce bias of the estimator of the epistatic parameter ($\epsilon$) (Table 2.2). (Note that in the absence of MMS, we find that $\epsilon$ is negative and significantly different from 0; Table A.4.) When considering only a subset of the epistatic subtypes, however, we find $\epsilon$ to be positive and significantly different from zero (results not shown). See Appendix A and Figure A.6 for additional discussion of the epistatic pairs we identify.

2.4.2 Jasnos and Korona [131] Dataset

The Jasnos and Korona [131] dataset included 758 yeast gene deletions known to cause growth defects and reports fitnesses of only a sparse subset of all possible gene pairs (≈ 0.2% of the possible pairwise genotypes, or 639 pairs out of $\binom{758}{2}$). Because the authors do not identify epistatic pairs in a hypothesis-testing framework, we cannot explicitly compare our conclusions with theirs.
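The sparsity figure quoted above can be checked with a few lines of arithmetic. The Python below is only a worked check of the numbers stated in the text; the variable names are illustrative.

```python
from math import comb

n_genes = 758
n_tested_pairs = 639
n_possible_pairs = comb(n_genes, 2)  # 758 * 757 / 2
coverage = n_tested_pairs / n_possible_pairs
print(n_possible_pairs, f"{coverage:.2%}")  # prints: 286903 0.22%
```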

To validate our method, we examine gene pairs which have “specific” functional links (Section 2.2.3). When defining a functional link using GO terms [270] with fewer than 30 genes associated with them, only 1 of 639 tested gene pairs has a functional link. Raising the threshold of associated genes to 50 and 100, the number of tested pairs with functional links rises only to 3 and 9, respectively. Because of the large number of random genes and sparse number of gene pairs in this dataset, we follow Tong et al. [279] and select GO terms that have associated with them less than or equal to 200 genes. Under this cutoff, 25 of the 639 tested pairs have a functional link. Only the FDR multiple-testing procedure results in a significant enrichment of functional links among epistatic pairs (Table 2.3); thus, it is likely the most appropriate of the three multiple-testing methods.

With the FDR procedure we find 352 significant epistatic pairs, of which 35 (9.9%) have additive epistasis, 63 (17.9%) have multiplicative epistasis, and 254 (72.2%) have minimum epistasis (Table 2.2). These proportions of inferred subtypes suggest that the authors’ original restriction to multiplicative epistasis may be inappropriate. We find 141 gene pairs with positive epistasis, and 211 gene pairs with negative epistasis.

We do not find a significant number of epistatic pairs with shared GO Slim Biological Process terms (Section 2.2.3), but do when considering all shared GO Slim terms (Table 2.3). However, the significant enrichment of functional links among epistatic pairs suggests that the epistasis found with the BIC and FDR procedures is valid and of biological significance. We also find an overlap of epistatic pairs with experimentally-validated physical interactions from the BIOGRID database.

As with the St Onge et al. [255] dataset, we also consider some of the possible subsets of the three epistatic subtypes (and their corresponding null models) in our model selection procedure (Table 2.3).
In contrast to the St Onge et al. [255] dataset, using all three epistatic subtypes results in a significant number of epistatic pairs with functional links; this measure is not significant when using any of the other subsets of the subtypes. This suggests that our proposed method with three epistatic subtypes may indeed be the most appropriate for datasets with randomly selected genes.

We examined the distribution of the estimated values of the epistatic parameter (ε) for all pairs with significant epistasis. Jasnos and Korona [131], in assuming only multiplicative epistasis, conclude that epistasis is predominantly positive. However, we find that the estimated mean of epistasis is not significantly different from zero (two-sided t-test, p-value = 0.9578; Figure 2.1 and Table 2.2). This also supports the claim that our method reduces bias of the estimator of ε. We report in further detail on the epistatic pairs identified in Appendix A and Figure A.7.
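The contrast between the FDR and Bonferroni procedures referred to in this section can be illustrated with a minimal sketch. It assumes the FDR procedure is the standard Benjamini-Hochberg step-up rule; the p-values below are hypothetical and are not taken from either dataset.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking hypotheses rejected at FDR level alpha
    (Benjamini-Hochberg step-up rule)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


def bonferroni(p_values, alpha=0.05):
    """Return booleans marking hypotheses rejected under Bonferroni correction."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Hypothetical per-pair p-values, for illustration only.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
n_fdr = sum(benjamini_hochberg(p_vals))
n_bonf = sum(bonferroni(p_vals))
print(n_fdr, n_bonf)
```

In this toy input the step-up rule rejects more hypotheses than Bonferroni, which is the sense in which the FDR procedure is the less stringent of the two.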

2.5 Discussion

We present a likelihood estimation method using the Bayesian Information Criterion (BIC) which, given fitness data from single and double mutants, determines the most appropriate epistatic subtype for pairs of genes that interact. Through extensive simulations, we demonstrate that selection of the epistatic subtype is extremely important, and that our method can indeed do so accurately. We demonstrate through the analysis

of two datasets that the epistasis we discover is likely biologically meaningful. Besides the detection of genetic interactions and clarification of their functions, accurate assessment of epistasis is also important for evolutionary theory. For example, the distribution of epistatic effects has important implications for the role of epistasis in the evolution of sex and recombination. We demonstrate that our method reduces bias of the epistatic parameter ε.

Our approach is novel for several reasons. First, while in reality different subtypes and representations of epistasis may be best for different pairs of loci, classical studies have usually assumed only one subtype of epistasis for all pairs. Our method selects the most likely subtype for pairs of genes, providing a more precise picture of the loci actually exhibiting epistasis and insight into their modes of genetic interaction. Second, some past studies of epistasis have not clearly distinguished between the epistatic and independence models. Jasnos and Korona [131], for example, calculate values of ε without formally assessing whether the perceived epistasis is merely statistical noise. With the presented BIC and LRT procedures, we distinguish between the epistatic and null models in a hypothesis-testing framework and address the multiple-hypothesis testing problem with the FDR [26].

Applying our method to two yeast datasets [131, 255], we detect both novel and experimentally-verified epistatic interactions. We believe that an enrichment of “specific” functional links among epistatic pairs is likely the most direct indication of biological importance of inferred epistasis ([255, 279]), as genetic interactions probably occur on a fine biological scale (Section 2.2.3). Including all three epistatic subtypes and their corresponding null models in our procedure results in a significant enrichment of functional links among identified epistatic pairs for the Jasnos and Korona [131] dataset, but not the St Onge et al.
[255] dataset (both with and without MMS). Performance for the St Onge et al. [255] dataset is improved when only the additive or multiplicative subtypes, or both, are considered (Table 2.3). This apparent difference between datasets may be due to the fact that the St Onge et al. [255] dataset includes genes known to be involved in conferring resistance to MMS, whereas the Jasnos and Korona [131] dataset measured a random selection of genes and gene pairs. In examining epistasis among genes with similar functions, it thus may be more appropriate to consider fewer epistatic subtypes. This difference between the datasets may also be due to the low sample size for double mutants in the St Onge et al. [255] dataset.

We find predominantly minimum epistasis, i.e., a deviation from the minimum null model, when considering all three subtypes in both datasets. This is likely valid for the Jasnos and Korona [131] dataset, but it remains unclear which subtypes are the most appropriate for the St Onge et al. [255] dataset. Mani et al. [165] suggest that the minimum definition of independence may not always be useful. Examining another dataset with randomly selected genes and gene pairs will help to assess the performance of our method.

The slight overlap of epistatic pairs inferred using different subsets of the three subtypes emphasizes that the subtypes we select are not necessarily mutually exclusive (Table 2.3). Our aim is to select the statistically best-fitting epistatic subtype for pairs of genes in order to gain additional insight into their mode of interaction. For example, a double mutant can deviate from both additive and multiplicative independence. Selecting the

additive epistatic subtype with our method suggests that additive epistasis may be a more appropriate and useful way to represent the relationship between the two genes. The different results obtained when considering different subtypes again emphasize the importance of selecting among them [165].

While Jasnos and Korona [131] find a skew towards positive epistasis, we find estimates of epistasis to be centered around zero (Table 2.2). For both datasets, however, the mean of additive epistasis ($\bar{\varepsilon}$) is positive and significantly different from 0, whereas the mean of minimum epistasis ($\bar{\varepsilon}$) is negative and significantly different from 0 (Table 2.2). Applying our method to datasets with a larger number of genes may provide a more accurate view of the types and strengths of epistasis often found in nature.

Inference about epistatic subtypes is affected by many factors, including the number of replicate experiments and the variance of mutant fitnesses. Accuracy clearly increases with the number of replicate measurements, but in practice fewer than 10 replicates are performed [131, 255]. We assess our method’s accuracy when the number of replicates is decreased from 50 to 10 and to 2 (Tables A.2 and A.3). As expected, accuracy decreases, but varies depending on the type of epistasis examined (Appendix A). Our failure to identify epistasis for several experimentally-identified interacting pairs in the St Onge et al. [255] dataset is likely due to our low power with only two double-mutant fitness measurements. A large variance in mutant fitnesses also decreases the accuracy of our method. Our simulations use 0.05 as the standard deviation of mutant fitnesses, which is the typical value found for single and double mutants in the Jasnos and Korona [131] dataset.
To assess the effect of variance of fitness on accuracy, we ran simulations with the standard deviation increased to 0.1; this slightly reduces the ability to distinguish the true epistatic model from the others (Table 2.1).

We also address the approximation of the distribution of the LRT statistic (λ) to the $\chi^2_1$ distribution under the null hypothesis. For both datasets, very few functional links exist among epistatic pairs inferred with the Bonferroni method. This indicates that epistatic pairs with functional links have somewhat larger p-values, and are only significant under less stringent multiple-testing protocols. The $\chi^2$ approximation may not always be appropriate, as the measured fitnesses sometimes deviate from normality in both datasets (Appendix A, Figures A.8 and A.9). This may also explain our inability to detect some experimentally-validated interactions. We urge measuring a large number of replicates to meet the normality requirement and to increase the power of identifying epistatic interactions.

Our method is not limited to studying fitness as a phenotype and can be applied to any trait, although it is necessary to scale the phenotype of interest by the wildtype value. Our likelihood-based method is only applicable to studies of experimental data where multiple replicates are measured; it is not immediately applicable to studies of computationally-predicted fitness values, with only one measurement [245]. However, adding random noise terms to each predicted value to create multiple “replicates” could allow our method to be applied in the future. In addition, it is important to note that the epistasis identified with this method is a property of the alleles at each locus; it is conceivable that when examining other pairs of alleles at the tested loci, different results could be obtained. We also note that our method is only applicable to haploid systems.
Identifying epistasis and constructing double mutants is much more difficult in diploid systems, as

more complex statistical issues arise due to dominance and other effects [79, 93, 156, 196, 293]. Since current methods to identify epistasis in haploid systems are still in their infancy, it is important to first develop and perfect these methods before addressing diploid systems.

Our method is one contribution to a wide research field studying genetic interactions and to a long history of the study of epistasis [12, 48, 99, 226]. As mentioned above, our method leaves many open questions. Not all of the genetic interactions that we find may translate into physical interactions, and many researchers have pursued the question of how genetic and protein-protein interactions are related [152, 243, 306]. The translation of discoveries of epistasis into discoveries of functional interactions is a very intriguing, albeit complex, problem. We expect that our method will shed light on intrinsic properties of epistasis and may help to address these broader issues.
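The comparison between a null (independence) model and its epistatic counterpart discussed in this section can be illustrated with a deliberately simplified sketch. It is not the full likelihood of this chapter. It assumes that the single-mutant mean fitnesses are known constants, that double-mutant replicates are normally distributed, and that the three null expectations take the commonly used forms shown in `NULL_MODELS`. The function name and the replicate values are hypothetical.

```python
import math

# Assumed forms of the expected double-mutant fitness under independence.
NULL_MODELS = {
    "additive": lambda wi, wj: wi + wj - 1.0,
    "multiplicative": lambda wi, wj: wi * wj,
    "minimum": lambda wi, wj: min(wi, wj),
}


def normal_loglik(xs, mean, var):
    """Log-likelihood of xs under a normal distribution with given mean and variance."""
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi * var) - sum((x - mean) ** 2 for x in xs) / (2 * var)


def compare_null_vs_epistatic(double_fitness, wi, wj, null="multiplicative"):
    """Compare one null model with its epistatic counterpart via BIC and an LRT."""
    n = len(double_fitness)
    expected = NULL_MODELS[null](wi, wj)
    # Null model: mean fixed at the independence expectation; variance is free (k = 1).
    var0 = sum((x - expected) ** 2 for x in double_fitness) / n
    ll0 = normal_loglik(double_fitness, expected, var0)
    # Epistatic model: mean = expectation + epsilon; epsilon and variance are free (k = 2).
    mean1 = sum(double_fitness) / n
    epsilon = mean1 - expected
    var1 = sum((x - mean1) ** 2 for x in double_fitness) / n
    ll1 = normal_loglik(double_fitness, mean1, var1)
    bic_null = 1 * math.log(n) - 2 * ll0
    bic_epi = 2 * math.log(n) - 2 * ll1
    # LRT statistic, compared with a chi-square distribution with 1 degree of freedom.
    lam = max(2 * (ll1 - ll0), 0.0)
    p_value = math.erfc(math.sqrt(lam / 2))  # survival function of chi-square(1)
    return {"epsilon": epsilon, "bic_null": bic_null, "bic_epi": bic_epi,
            "lambda": lam, "p_value": p_value}


# Hypothetical replicate measurements of a double mutant (scaled by wildtype fitness).
result = compare_null_vs_epistatic([0.74, 0.71, 0.78, 0.73, 0.76], wi=0.8, wj=0.8)
print(result)
```

With so few replicates the chi-square approximation is crude, which is the concern raised above about normality and the number of replicate measurements.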

2.6 Acknowledgements

We thank Elhanan Borenstein for his helpful comments and suggestions, as well as two anonymous reviewers and the editor of Genetics. This research project is supported by National Institutes of Health (NIH) grant GM28016 to MWF, NIH grant RO1 GM073059 to Hua Tang, and a Stanford Dean’s postdoctoral fellowship to HG. JMG is supported by the Stanford Genome Training Program, which is funded by the NIH (academic year 2008-2009), and by a National Science Foundation (NSF) Graduate Research Fellowship (grant number DGE-0645962) (2009-2010).

2.7 Figures and Tables

[Figure 2.1: histogram; x-axis: ε (−1 to 3), y-axis: Frequency (0 to 150)]

Figure 2.1: Distribution of the epistasis values (ε) for significant epistatic pairs in the Jasnos and Korona [131] dataset, determined using the FDR procedure and the BIC method including all three epistatic subtypes and their corresponding null models. Mean of ε is -0.0009, with a standard deviation of 0.3177; median value is -0.0587. A similar figure is shown in Figure 3 of Jasnos and Korona [131].

Table 2.1: Fractions of simulations that recover the true model (null or epistatic) among the six models (three null models and three epistatic models) with different values of the epistasis coefficient (εij), fitness standard deviation (σ = σi = σj = σij), and mean fitnesses of single mutants (µi) and (µj). Null-model fractions are reported once per (σ, µi, µj) setting and are repeated on each row of that setting.

σ     µi   µj   $P_N^{(a)}$  $P_N^{(p)}$  $P_N^{(m)}$  εij   $P_E^{(a)}$  $P_E^{(p)}$  $P_E^{(m)}$
0.05  0.8  0.8  0.986        0.983        0.994        -0.3  0.996        0.906        0.929
0.05  0.8  0.8  0.986        0.983        0.994        -0.2  0.992        0.902        0.958
0.05  0.8  0.8  0.986        0.983        0.994        -0.1  0.965        0.873        0.981
0.05  0.8  0.8  0.986        0.983        0.994         0.1  0.739        0.720        0.997
0.05  0.8  0.8  0.986        0.983        0.994         0.2  0.591        0.605        0.996
0.05  0.8  0.8  0.986        0.983        0.994         0.3  0.746        0.610        0.999
0.05  0.9  0.6  0.988        0.987        0.992        -0.3  0.999        0.958        0.978
0.05  0.9  0.6  0.988        0.987        0.992        -0.2  0.995        0.939        0.965
0.05  0.9  0.6  0.988        0.987        0.992        -0.1  0.986        0.901        0.972
0.05  0.9  0.6  0.988        0.987        0.992         0.1  0.864        0.832        0.994
0.05  0.9  0.6  0.988        0.987        0.992         0.2  0.830        0.825        0.999
0.05  0.9  0.6  0.988        0.987        0.992         0.3  0.888        0.821        0.999
0.1   0.8  0.8  0.955        0.952        0.993        -0.3  0.978        0.870        0.915
0.1   0.8  0.8  0.955        0.952        0.993        -0.2  0.961        0.866        0.884
0.1   0.8  0.8  0.955        0.952        0.993        -0.1  0.953        0.758        0.896
0.1   0.8  0.8  0.955        0.952        0.993         0.1  0.754        0.711        0.995
0.1   0.8  0.8  0.955        0.952        0.993         0.2  0.689        0.618        0.997
0.1   0.8  0.8  0.955        0.952        0.993         0.3  0.908        0.517        1.00
0.1   0.9  0.6  0.973        0.971        0.991        -0.3  0.780        0.942        0.946
0.1   0.9  0.6  0.973        0.971        0.991        -0.2  0.947        0.910        0.945
0.1   0.9  0.6  0.973        0.971        0.991        -0.1  0.963        0.744        0.915
0.1   0.9  0.6  0.973        0.971        0.991         0.1  0.833        0.731        0.987
0.1   0.9  0.6  0.973        0.971        0.991         0.2  0.804        0.758        0.994
0.1   0.9  0.6  0.973        0.971        0.991         0.3  0.882        0.781        0.997

$P_N^{(a)}$ is the fraction of simulations (simulated under the additive null model) with the additive null model attaining the minimum value of the BIC among the six models. $P_N^{(p)}$ and $P_N^{(m)}$ are defined in a manner similar to $P_N^{(a)}$ for the multiplicative and minimum models, respectively. $P_E^{(a)}$ represents the power of our method, i.e. the percentage of simulations (simulated under the additive epistatic model with epistasis coefficient εij) with the additive epistatic model attaining the minimum value of the BIC among the six models. $P_E^{(p)}$ and $P_E^{(m)}$ are defined in a manner similar to $P_E^{(a)}$ for the multiplicative and minimum models, respectively. 50 replicate fitness values are simulated.

Table 2.2: Summary of gene pairs with the indicated epistatic subtypes, inferred using the FDR procedure with the BIC method, which considers all three epistatic subtypes and their corresponding null models.

Epistatic Subtype  Statistic                    Study S      Study J
All                count (%)                    193 (100%)   352 (100%)
All                mean $\bar{\varepsilon}$     -0.060       -0.001
All                median q0.5                  -0.096       -0.059
Additive           count (%)                    19 (9.8%)    35 (9.9%)
Additive           mean $\bar{\varepsilon}$     0.115*       0.193***
Additive           median q0.5                  0.131        0.188
Multiplicative     count (%)                    33 (17.1%)   63 (17.9%)
Multiplicative     mean $\bar{\varepsilon}$     0.048        0.017
Multiplicative     median q0.5                  -0.166       -0.115
Minimum            count (%)                    136 (70.5%)  254 (72.2%)
Minimum            mean $\bar{\varepsilon}$     -0.111***    -0.032**
Minimum            median q0.5                  -0.091       -0.065

Numbers are the counts of each type, and percentages are given of the total number of epistatic pairs. The mean ($\bar{\varepsilon}$) and median (q0.5) of the epistatic parameter (ε) are given for each subtype, with “*” indicating that the mean of ε is significantly different from 0 (p-value ≤ 0.05; “**” implies p-value ≤ 0.01, “***” implies p-value ≤ 0.001). Study S refers to the St Onge et al. [255] dataset measured in the presence of MMS, and Study J refers to the Jasnos and Korona [131] dataset. (For Study S, 5 of the epistatic pairs are synthetic lethal and are not shown; as a result, percentages do not sum to 100%).

Table 2.3: Comparison of validation measures for each dataset for different variations of the FDR and BIC procedure, considering only a subset of epistatic subtypes with their corresponding null models: all epistatic subtypes (A, P, M), only the additive and multiplicative subtypes (A, P), only the additive (A), only the multiplicative (P), or only the minimum (M) subtype (see text for details). Numbers in parentheses after each count indicate p-values by Fisher’s Exact Test, and “*” indicates significance. Study J refers to the Jasnos and Korona [131] dataset, and Study S refers to the St Onge et al. [255] dataset measured in the presence of MMS. Numbers in parentheses in the row labels indicate the total number of tested pairs and the total number of each type of link found in each complete dataset.

Subtypes Considered in BIC Procedure:        A, P, M         A, P             A               P               M
Number of subtypes                           3               2                1               1               1

Study J
Number Found (636)                           352             273              263             231             329
Functional Links (25)                        19 (0.0255)*    13 (0.2320)      11 (0.4689)     10 (0.4227)     15 (0.2619)
GO Slim Terms (Biological Process) (115)     69 (0.1573)     50 (0.4874)      55 (0.0736)     44 (0.3534)     68 (0.04902)*
GO Slim Terms (All) (369)                    224 (0.0009)*   172 (0.01654)*   160 (0.1297)    146 (0.0273)*   213 (0.0003)*
Experimentally Identified (3)                3               2                1               2               3

Study S
Number Found (323)                           193             192              247             171             243
Functional Links (36)                        21 (0.6450)     29 (0.0041)*     34 (0.0031)*    29 (0.0003)*    24 (0.9256)
GO Slim Terms (Biological Process) (283)     174 (0.0657)    174 (0.03656)*   223 (0.0010)*   153 (0.1825)    213 (0.5534)
GO Slim Terms (All) (307)                    185 (0.2866)    182 (0.6926)     237 (0.1472)    162 (0.6997)    231 (0.5908)
Experimentally Identified (29)               17              22               24              23              21

Chapter 3

Dispersal of Mycobacterium tuberculosis via the Canadian fur trade

Caitlin S. Pepperell*†, Julie M. Granka§†, David C. Alexander**, Marcel A. Behr§§, Linda Chui^a, Janet Gordon^b, Jennifer L. Guthrie**, Frances B. Jamieson**, Deanne Langlois-Klassen^c, Richard Long^c, Dao Nguyen§§, Wendy Wobeser^d, Marcus W. Feldman§

* Division of Infectious Diseases, Stanford University, Stanford, CA 94305
§ Department of Biology, Stanford University, Stanford, CA, USA
** Ontario Agency for Health Protection and Promotion, ON Canada M9P 3T1
§§ Department of Medicine, McGill University, QC Canada H3G 1A4
a Provincial Laboratory of Public Health, AB Canada T6G 2J2
b Sioux Lookout First Nations Health Authority, ON Canada P8T 1B8
c Department of Medicine, University of Alberta, AB Canada T6G 2J3
d Division of Infectious Diseases, Queen’s University, ON Canada K7L 3N6
† Equal contributions to this work.

Proc. Natl. Acad. Sci. USA. 2011 Apr 19; 108(16):6526-31.

Abstract

Patterns of gene flow can have marked effects on the evolution of populations. To better understand the migration dynamics of Mycobacterium tuberculosis, we studied genetic data from European M. tuberculosis lineages currently circulating in Aboriginal and French Canadian communities. A single M. tuberculosis lineage, characterized by the DS6Quebec genomic deletion, is at highest frequency among Aboriginal populations in Ontario, Saskatchewan, and Alberta; this bacterial lineage is also dominant among tuberculosis (TB) cases in French Canadians resident in Quebec. Substantial contact between these human populations is limited to a specific historical era (1710-1870), during which individuals from these populations met to barter furs. Statistical analyses of extant M. tuberculosis minisatellite data are consistent with Quebec as a source population for M. tuberculosis gene flow into Aboriginal populations during the fur trade era. Historical and genetic analyses suggest that tiny M. tuberculosis populations persisted for ≈100 years among indigenous populations and subsequently expanded in the late 19th century after environmental changes favoring the pathogen. Our study suggests that spread of TB can occur by two asynchronous processes: (i) dispersal of M. tuberculosis by minimal numbers of human migrants, during which small pathogen populations are sustained by ongoing migration and slow disease dynamics, and (ii) expansion of the M. tuberculosis population facilitated by shifts in host ecology. If generalizable, these migration dynamics can help explain the low DNA sequence diversity observed among isolates of M. tuberculosis and the difficulties in global elimination of tuberculosis, as small, widely dispersed pathogen populations are difficult both to detect and to eradicate.

3.1 Introduction

Migration of human pathogens between host populations is of clear biomedical interest, as global spread of infectious diseases is predicated on pathogen dispersal. For obligate human pathogens, the paths of migration are shaped by social networks among hosts, and, at a broader scale, by human migrations and patterns of comingling. The shape and magnitude of these processes may be inferred from genetic data from both extant populations of humans [216, 219] and micro-organisms carried by humans [173]. Statistical genetic inferences about pathogen migration have been buttressed by a variety of independent data types, such as linguistic [173], epidemiological [286], and ecological [31] data.

Human tuberculosis (TB) is caused by Mycobacterium tuberculosis (M. tuberculosis). By current estimates, M. tuberculosis infects one third of the world’s population, of which a minority progress to disease and account for 9-10 million new, transmissible cases per year [301]. Genetic data from M. tuberculosis are characterized by low DNA sequence diversity [2] and significant population subdivision, both at continental [107, 111] and fine geographic scales [203]. TB is characterized by variable transmissibility, depending on environmental conditions, and irregular disease dynamics [7, 124]. Variation in temporal dynamics of disease is introduced by the phenomenon of clinical latency, whereby a host may be infected with M. tuberculosis for decades before reactivating the infection, becoming ill, and thus able to transmit the infection to others. These and other aspects of disease ecology would be expected to affect migration of M. tuberculosis among host populations. Broad outlines of M. tuberculosis migratory history have been described based on the geographic range of specific bacterial clades and lineages [107]. However, very little is known about M. tuberculosis migratory patterns and rates, population dynamics during migration, or effects of migratory history on M. tuberculosis population genetic diversity.

In this paper, we present evidence suggesting that M. tuberculosis dispersed into Canadian Aboriginal populations as a result of contact with European fur traders in the 18th century. Although contact between the populations involved small numbers of individuals, it was extensive enough to result in the development of Métis society, of both European and Aboriginal ancestry (see Appendix B for more historical details). Large scale TB epidemics were not evident among Western Canadian Aboriginal populations until the late 19th and 20th centuries [58, 77, 161, 203], suggesting that epidemics may be uncoupled from the process of M. tuberculosis dispersal.

Our analyses of historical, epidemiological, and M. tuberculosis genetic data suggest: 1) M. tuberculosis may be spread by small numbers of human migrants, 2) M. tuberculosis populations can persist at low levels over historical time scales, 3) these small bacterial populations may be sustained by ongoing migration and possibly by slow disease dynamics, and 4) shifts in host ecology favoring the pathogen may be accompanied by bacterial population expansions, with marked effects on genetics of M. tuberculosis populations.

3.2 Results

Results of M. tuberculosis population screening for lineage-defining polymorphisms [88, 185] are shown in Figure 3.1 (and Table B.1). DS6Quebec is the most frequent lineage among bacteria from all four populations: the French Canadian population of Quebec (QU), as well as Aboriginal populations in Ontario (ON), Saskatchewan (SK), and Alberta (AB). Within SK, DS6Quebec was found in all eight intra-provincial regions (fine-scale geographic data available only for this population, see [203] for definitions of within-SK regions). Frequencies of other M. tuberculosis lineages were more variable among populations. A minimum spanning tree, based on 12 minisatellite loci from bacteria with the DS6Quebec polymorphism, reveals a distinctly star-like network structure (Figure 3.2), with little evidence of population differentiation. Networks based on minisatellite data within other common lineages (Rd 182, Rd 219 and H37Rv-like) were not star-like (see Figure B.1). Lineage-specific patterns were also evident in analyses of population genetic structure (Analysis of Molecular Variance, AMOVA, Table 3.1). Overall, a modest level of differentiation was observed among populations (FST = 0.08). However, differentiation varied substantially between lineages, with the least within the DS6Quebec lineage (FST = 0.04) and the greatest within the H37Rv-like lineage (FST = 0.52). These results suggest 1) the DS6Quebec lineage was introduced to these populations in the relatively remote past, resulting in a similarly high frequency across populations, 2) M. tuberculosis belonging to the DS6Quebec lineage were, or are, exchanged freely among study populations, 3) the DS6Quebec lineage underwent rapid population expansion, reflected in a star-like network topology, possibly coincident with its introduction to new host populations, and 4) there is temporal and/or spatial variation in patterns of bacterial migration among human populations, reflected in variable differentiation within bacterial lineages.

Although the QU population is now geographically and socially isolated from ON, SK and AB populations, there is a history of comingling in the context of the Canadian fur trade (Figure 3.1). Table 3.2 outlines the expected history of the DS6Quebec lineage, assuming it was introduced to Canada by a French immigrant to Quebec, and dispersed to Aboriginal populations via fur trade transportation and social networks. Estimates derived from M. tuberculosis minisatellite data of the time to most recent common ancestor (TMRCA [23]) and divergence time between bacterial populations (TD [314]) suggest that genetic exchange between QU and Western Aboriginal M. tuberculosis populations occurred over approximately 100 years; timing of this exchange is consistent with historical human migrations connected with the trade in furs (additional estimates in Appendix B).

Levels of overall genetic diversity of M. tuberculosis restriction fragment length polymorphism (RFLP [284]) haplotypes from four populations are shown in rarefaction curves in Figure 3.3A. Published data were available from MB, and are included along with QU, SK and AB; data from ON are not included, due to its small sample size (see Appendix B and Figure B.3). The number of distinct haplotypes is highest in QU, consistent with it being a source population for pathogen migration. Levels of diversity are lowest in SK and MB, with intermediate diversity in AB. Diversity thus does not decrease in an East-to-West pattern, as we might expect with a serial founder effect [216] originating in QU.

Post-European-contact epidemics of TB among Canadian indigenous populations have been associated

with dramatic social, economic and environmental changes that characterized the industrial era (late 19th century and beyond, see Appendix B). There is evidence of regional differences in the pace of these changes: published analyses of archival fur trade documents indicate that some indigenous populations remained remote from the cash based industrial economy as late as the 1920s [217]. Regions within SK and MB fall into this historically remote/traditional category, whereas AB does not contain any such regions (Figure 3.1). Further development of transportation networks (starting with the widespread use of bush planes in the 1930s) later allowed commercial development of previously remote regions [217].

Given associations between industrialization, attendant shifts in host ecology, and TB epidemic expansion among indigenous populations (see Appendix B), we hypothesized that M. tuberculosis diversity in SK and MB was low relative to AB as a result of more recent M. tuberculosis population expansion in remote/traditional areas. To test this hypothesis, minisatellite and RFLP haplotypes from SK were divided into historically “remote” (RS) and “non-remote” (NRS) regions according to the classification of trading districts outlined in [217]. Consistent with this classification, diversity in NRS is similar to AB (containing no remote regions), with the lowest diversity observed in RS (Figure 3.3B and Appendix B). Permutation procedures identified RS and QU as having lower and greater diversity, respectively, than would be expected under a random distribution of minisatellite haplotypes among populations (see Appendix B).
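The rarefaction curves referred to above compare haplotype diversity across samples of different sizes. One common way to compute such a curve is the analytical expectation of the number of distinct haplotypes in a random subsample drawn without replacement. The sketch below assumes that approach, which may differ from the resampling used for Figure 3.3, and the haplotype counts are hypothetical.

```python
from math import comb


def expected_distinct(counts, n):
    """Expected number of distinct haplotypes in a random subsample of size n,
    drawn without replacement from a sample with the given per-haplotype counts."""
    total_n = sum(counts)
    if not 0 <= n <= total_n:
        raise ValueError("subsample size must lie between 0 and the sample size")
    all_subsamples = comb(total_n, n)
    # A haplotype with count c is absent from the subsample with probability
    # C(total_n - c, n) / C(total_n, n); comb returns 0 when n > total_n - c.
    return sum(1 - comb(total_n - c, n) / all_subsamples for c in counts)


# Hypothetical haplotype counts for one population (not data from this study).
counts = [5, 3, 1, 1]
curve = [expected_distinct(counts, n) for n in range(1, sum(counts) + 1)]
print(curve)
```

Evaluating each population's curve at a common subsample size allows diversity to be compared between populations with unequal sample sizes.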

We analyzed DS6Quebec minisatellite data with rejection sampling, an Approximate Bayesian Computation (ABC) method, to assess whether a demographic model allowing for an expansion in RS (in which the historical bacterial effective population size, Ne, was smaller than the current Ne) is more likely than a null model of constant size. We estimated the parameter ω, the ratio of historical Ne to contemporary Ne (see Methods and Appendix B). Results using this method indicate that the posterior probability of the expansion model (versus the null constant size model) is 1; based on simulations under the null model, the p-value of this posterior probability is 0. We estimate ω to be 0.077 (95% credibility interval 0.012-0.291); see Figure B.4A. Translating these results to absolute numbers (current Ne = 30) generates an estimate of effective number of TB cases in RS prior to 1930 (the time fixed for the expansion, based on the social history described above) equal to 2.345 individuals (95% credibility interval: 0.351-8.803). See Appendix B for parameter estimation under alternative mutation rate assumptions, all of which reject the constant size model in favor of the expansion model.
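The logic of ABC rejection sampling with model choice, as used above, can be sketched generically. The simulator below is a toy placeholder and is not a coalescent or minisatellite mutation model. The priors, tolerance, summary statistic and observed value are all assumptions made for illustration.

```python
import random


def toy_simulate(omega, rng):
    """Placeholder simulator returning one summary statistic.
    A real analysis would simulate minisatellite data under the demographic model."""
    return omega + rng.gauss(0.0, 0.05)


def abc_rejection(s_obs, n_draws=20000, tol=0.02, seed=1):
    """Draw (model, omega) from the prior, simulate, and keep draws whose
    simulated summary statistic lies within tol of the observed one."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        # Equal prior probability for each model (an assumption of this sketch).
        model = rng.choice(["constant", "expansion"])
        # omega = historical Ne / current Ne; fixed at 1 under constant size.
        omega = 1.0 if model == "constant" else rng.uniform(0.0, 1.0)
        if abs(toy_simulate(omega, rng) - s_obs) <= tol:
            accepted.append((model, omega))
    return accepted


accepted = abc_rejection(s_obs=0.1)
omegas = sorted(w for m, w in accepted if m == "expansion")
# Posterior model probability: share of accepted draws from the expansion model.
post_prob_expansion = len(omegas) / len(accepted)
# Approximate 95% credibility interval from the accepted omega values.
lo = omegas[int(0.025 * len(omegas))]
hi = omegas[min(len(omegas) - 1, int(0.975 * len(omegas)))]
print(post_prob_expansion, lo, hi)
```

In this toy setting the constant-size model essentially never reproduces the observed statistic, so all accepted draws come from the expansion model, and the accepted ω values approximate its posterior distribution.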

Based on the size and dynamics of the population involved in the French (Montreal-based) trade in the West prior to 1870, and prevalence of TB in European cities in the 18th century, we estimated the absolute number of M. tuberculosis infections transmitted to Western indigenous populations (see Appendix B). The historical estimate of Nm, the product of population size (N) and fraction of the indigenous M. tuberculosis population replaced by immigrants (m) per generation, is 0.16. In an island model, Nm < 1 results in substantial population differentiation due to genetic drift [307]. Estimates of Nm derived from bacterial minisatellite data were higher than this historical estimate: for pooled lineages, Nm derived from FST (AMOVA) is 6.16, and Slatkin’s private alleles method [15] generated an estimate of 3.62 (empirical 95% CI 2.47-4.72).
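As a rough consistency check on the FST-based figure, one can apply the island-model relation between FST and Nm. The haploid form below is an assumption, since the exact estimator is not restated in this section.

```latex
F_{ST} \approx \frac{1}{1 + 2Nm}
\quad\Longrightarrow\quad
Nm \approx \frac{1}{2}\left(\frac{1}{F_{ST}} - 1\right),
\qquad
F_{ST} = 0.08 \;\Rightarrow\; Nm \approx \frac{1}{2}\,(12.5 - 1) = 5.75 .
```

Under this assumed form, the reported Nm of 6.16 corresponds to FST ≈ 0.075, which is compatible with the rounded overall value of 0.08 quoted earlier in the Results.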

3.3 Discussion

The recent history of human migration to Western Canada informs analyses of M. tuberculosis genetic data presented here. Most importantly, substantial contact between French Canadian and Western indigenous populations is limited to a specific historical period (the fur trade era). The early bound on this period of contact is provided by the dates of incursion of fur traders into the Western interior (approximately 1710 [120]), while the later bound is provided by historical and demographic analyses indicating that Westward migration of French Canadians ceased in the latter half of the 19th century for economic and other reasons [92]. Based on the observed pattern of M. tuberculosis lineage frequencies in this historical context, as well as bacterial population genetic diversities consistent with QU as a source population, genetic timing estimates, and patterns of genetic differentiation between populations, we infer that the DS6Quebec lineage was dispersed to indigenous populations by French Canadian fur traders, about a century before epidemic forms of TB were manifest in Western Aboriginal communities.

Several features of the human migration associated with trade in furs are noteworthy. First, the absolute number of human migrants was small: by our estimate, 5,419 individuals migrated from East to West, during 160 years of trade between Montreal and Western Canadian indigenous populations (see Appendix B and Table B.2). These early migrations are dwarfed by population movements of the late 19th and early 20th centuries. Facilitated by massive inter-provincial and international migration to the Canadian prairies - from which French Canadians were largely absent - the census of Western Canada grew from 110,000 in 1871 to 1,750,000 in 1911 [283]. Between 1900 and 1917, a total of 1,671,414 foreign-born individuals migrated to Western Canada from the UK, the Ukraine, Russia, Germany, Austria, Hungary, Norway, Denmark, China and elsewhere [294].
Despite the enormous number of migrants from regions of high TB incidence, who would be expected to introduce a broad range of Euro-American and East Asian M. tuberculosis lineages [111], bacterial populations in Western indigenous communities are dominated by the DS6Quebec lineage, present at a similar frequency in the French Canadian population of Quebec (Figure 3.1).

The apparent lack of M. tuberculosis gene flow from 19th century homesteaders to indigenous populations may be explained by social distance between populations. International migrants to the prairies, particularly those facing language and cultural barriers to assimilation, were socially and geographically isolated [294]. By the late 19th century, prairie indigenous populations were also socially segregated, at times forcibly [161, 294], from the non-Aboriginal population. This is in contrast to the earlier fur trade era, characterized by inter-marriage and trading collaborations between European immigrants and First Nations groups ([82] and Appendix B). Although contact between populations involved small numbers of individuals, we speculate that close social ties between sending and receiving host populations permitted migration of M. tuberculosis through the fur trade. Given that transmission of M. tuberculosis requires sustained, close contact (exemplified by efficient transmission in high density shared living environments such as prisons [124]), this observation is likely to be generalizable, with structure of global M. tuberculosis populations strongly influenced by the social architecture of host populations.

Although by no means a disease-free interval for Native peoples (there were devastating epidemics of smallpox and other infectious diseases), epidemics of TB were not a feature of the fur trade era (1710-1870). TB epidemics among Western Canadian indigenous populations occurred later, starting in the late 1800s [58, 77, 124, 161]. Expansion of rail and steamship networks into Western Canada in the late 19th century permitted agricultural development, industrial-scale extraction of natural resources, development of government institutions and mass immigration. TB epidemics were among a chain of sequelae for Aboriginal populations that included displacement, loss of traditional food sources, crowding and institutionalization ([84] and Appendix B).

Some indigenous populations (e.g. RS, parts of MB) remained remote from the evolving industrial economy into the 1920s ([217], shaded areas of Figure 3.1). We find that M. tuberculosis genetic diversity in RS is the lowest, and diversity in QU the highest (Figure 3.3); we did not have detailed geographic data that would allow us to classify MB M. tuberculosis strains according to the same scheme as SK (i.e. NRS vs RS). We note that molecular epidemiological studies of other historically remote/traditional regions are consistent with low M. tuberculosis population genetic diversity [145, 186].

Differences in genetic diversity may result from two distinct phenomena. First, the process of M. tuberculosis migration is likely to involve serial founder events, such that diversity is highest in the source population (QU), and lower in the subsequently founded populations (ON, RS, NRS, AB, MB). Another explanation is that QU, the oldest M. tuberculosis population (founded by 17th-18th century migrants from France, see Table 3.2), may be closer to an equilibrium distribution of haplotype frequencies; the youngest populations may still show evidence of a “founder flush” [266].
A historically remote/traditional population, RS was likely shielded from ecological antecedents of epidemic TB until the regional incursion of air travel networks in the 1930s [217], and thus had the most recent expansion in its pathogen population. Demographic modeling with Approximate Bayesian Computation demonstrated a very high probability of expansion in the RS M. tuberculosis population since 1930; the magnitude of the bacterial expansion was approximately 13-fold.

Although our historical estimates of M. tuberculosis migration rates were low (direct Nm = 0.16), the lack of genetic differentiation among populations (DS6Quebec FST = 0.04) and estimates of Nm from the genetic data (indirect Nm = 3.62) would imply a high rate of M. tuberculosis migration. Slatkin has observed that indirect (genetic) estimates of gene flow may be higher than direct measurements of population dispersal, especially in populations characterized by colonization-extinction-re-colonization dynamics, which can limit population differentiation even when rates of migration are low [251, 252]. Taken together, our results suggest that absolute numbers of M. tuberculosis migration events were low (Nm < 1, an order of magnitude estimate based on historical data), and patterns of genetic differentiation have been affected by unstable dynamics of small, pre-industrial M. tuberculosis populations.

We have delineated two asynchronous processes involved in the spread of TB from European immigrants to Native Canadians (Figure 3.1B). The first is dispersal of the etiologic agent, M. tuberculosis, populations of which were sustained at very low levels (Ne ≈ 2) for approximately 100 years by small numbers of human migrants who had intimate, sustained contact with susceptible hosts. In addition to sustained migration, variable transmission dynamics of TB may have cushioned small bacterial populations against extinction [33]. The second process is expansion of the bacterial population, following a shift in host ecology favoring the pathogen. We find evidence of bacterial population expansions in the DS6Quebec M. tuberculosis haplotype network (Figure 3.2), patterns of genetic diversity (Figure 3.3), and coalescent-based demographic analysis of bacterial minisatellite data.

This is a study of a specific historical phenomenon, and it is unknown whether the observed patterns of M. tuberculosis migration are applicable to other settings. However, the fact that pathogen migration can be asynchronous with epidemiological phenomena is likely to render control of TB more difficult, and could help explain global persistence of the pathogen despite extensive efforts at eradication. As an obligate human pathogen that requires specific environmental conditions and sustained contact for transmission, barriers to geographic spread of M. tuberculosis are largely social, and therefore somewhat fluid. Perseverance of tiny bacterial populations would permit gradual accrual of a large geographic range despite significant population subdivision. As a result, by the time that the spread of TB is obvious, M. tuberculosis populations may already be well established.

3.4 Materials and Methods

3.4.1 Population Descriptions

Clinical isolates of M. tuberculosis in this study are from five Canadian populations: the French Canadian population of Quebec (QU), and Aboriginal populations in Ontario (ON), Manitoba (MB), Saskatchewan (SK), and Alberta (AB). The QU sample (n = 297) is described in [185]. The ON sample (n = 45) derives from TB cases that occurred between 1997 and 2009 in First Nations communities in a single region of the province (total population approx. 25,000). MB data (n = 163) are from a published study [32]: total number of TB cases, number of different bacterial RFLP haplotypes, number of singletons and the configuration distribution of the five most common M. tuberculosis haplotypes from First Nations reserve communities in MB (1992-1999) are reported. We made the conservative assumption that the remaining (unreported) haplotypes were doubletons. The SK sample (n = 444) is described in Pepperell et al. [203]. AB samples (n = 283) are from TB cases among First Nations individuals in the province, excluding urban areas, from 1990-2008.

3.4.2 Genotyping Methods

Isolates of M. tuberculosis were genotyped by RFLP, based on the number and location of IS6110 elements [284]. The number of tandem repeats at 12 minisatellite loci was also determined for each isolate using a standard methodology [263]. M. tuberculosis lineage typing, based on genomic deletions [88, 185], was done with real-time PCR (details in Appendix B).

3.4.3 Mutation Rate Estimate

We estimated a mutation rate (µ) per transmission generation [203] for minisatellite loci based on simulations using empirical estimates of intra-host M. tuberculosis population dynamics and in vitro estimates of mutations per cell doubling for bacterial minisatellite loci. Given a known per-year RFLP mutation rate [265], we made additional estimates by examining the number of RFLP types per minisatellite haplotype and comparing θ (2Nµ) estimates from RFLP and minisatellite data. All estimates were of a similar order of magnitude and were consistent with published estimates [223]; we used 0.001 mutations per locus (see Appendix B and Table B.3).

3.4.4 Statistical Calculations

Genetic differentiation. Analysis of Molecular Variance (AMOVA, implemented in Arlequin version 3.5 [74]) was used to calculate FST values from M. tuberculosis minisatellite haplotype data.

Network analysis. BioNumerics 5.0 (Applied Maths, Kortrijk, Belgium) was used to generate minimum spanning trees of M. tuberculosis minisatellite haplotypes. This program implements the Prim-Jarnik algorithm; the BURST priority rule maximizing single and double locus variants was used during network searches.

Genetic timing estimates. We estimated TMRCA of the DS6Quebec lineage in each population using minisatellite genotypes and the method of Ytime [23]. We selected as the root the haplotype at the center of the DS6Quebec network (Figure 3.2), present at high frequency in QU, SK and AB. We obtained bootstrap confidence intervals with Ytime assuming a constant population size and neutral demography (see Appendix B). Divergence times for DS6Quebec isolates between pairs of populations were estimated from minisatellite genotypes by the TD estimator, which is robust to population size changes and weak gene flow [314] (see Appendix B). This procedure requires an estimate of V0, the average variance in repeat number in the ancestral population at the time of divergence; we estimated TD using three different values of V0 for each pair of populations (see Results and Appendix B). Alternative mutation rate and demographic assumptions have little effect on our main conclusions (Appendix B).

Population genetic diversity. We assessed diversity of minisatellite and RFLP haplotypes in each population, for all lineages, by calculating the number of distinct haplotypes using rarefaction [136], calculating haplotype diversity with re-sampling to the lowest sample size to correct for unequal sample sizes [247], and implementing a novel permutation procedure to assess whether observed diversities of minisatellite haplotypes could be explained by a random partitioning of isolates from all populations (details in Appendix B).

Migration (Nm). Under a simple island model, assuming an equilibrium population where each “island” (QU, SK, and AB) has equal size N, we applied Slatkin’s private alleles method with minisatellite haplotype data from all lineages to estimate Nm, where m is the probability that an individual is a migrant each generation [251, 253]. Nm estimates were converted to FST values, and vice versa, using the standard expectation for a haploid population, FST = 1/(1 + 2Nm) [253] (see Appendix B).
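The conversion between Nm and FST stated above can be sketched in a few lines of Python. This is a minimal illustration of the haploid expectation FST = 1/(1 + 2Nm) only; the function names are ours and are not part of Arlequin or any other program cited in this chapter.

```python
def fst_from_nm(nm: float) -> float:
    """Expected FST under the haploid island model: FST = 1 / (1 + 2Nm)."""
    if nm < 0:
        raise ValueError("Nm must be non-negative")
    return 1.0 / (1.0 + 2.0 * nm)


def nm_from_fst(fst: float) -> float:
    """Inverse of the same expectation: Nm = (1 / FST - 1) / 2."""
    if not 0.0 < fst <= 1.0:
        raise ValueError("FST must lie in (0, 1]")
    return (1.0 / fst - 1.0) / 2.0


if __name__ == "__main__":
    # Direct and indirect Nm values quoted in the Discussion of this chapter.
    for nm in (0.16, 3.62):
        print(f"Nm = {nm:.2f} -> expected FST = {fst_from_nm(nm):.3f}")
```

Because the relationship is strongly nonlinear, small changes in FST near zero correspond to large changes in the implied Nm, which is one reason indirect estimates should be read as order-of-magnitude values.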

3.4.5 Demographic Modeling

Motivated by the results of the diversity calculations, we used minisatellite genotypes and rejection sampling, an Approximate Bayesian Computation (ABC) procedure, to estimate the factor (ω) by which the size of the DS6Quebec lineage in Remote Saskatchewan (RS), prior to its likely expansion, is related to its current size. Assuming a constant population size after the expansion, we estimated the current effective population size of RS as 544 (multiplying the harmonic mean of fluctuating DS6Quebec TB case counts in RS from 1986-2004 by 18, the number of years over which exhaustive samples were obtained). This epidemiological estimate of Ne was interpreted in the standard population genetic manner; see Appendix B for further discussion. One million coalescent simulations of minisatellite haplotypes under a population expansion model were performed using SIMCOAL 2.0 [73, 151]. The time of expansion was fixed at 54 generations, with ω drawn from a prior Uniform distribution on [0.01, 1]. The simulated data were summarized with four statistics (number of distinct and singleton haplotypes, haplotype diversity, and mean variance in locus repeat sizes), which produced reliable estimates in our method validation and have been used in previous studies [17, 70, 71, 219] (see Appendix B). Values of ω resulting in simulations with summary statistics within a threshold Euclidean distance from those observed in RS were retained to produce a posterior distribution. We also estimated the posterior probabilities of the population expansion and null constant size models by the acceptance probabilities of each model. We tested the method’s performance using simulated data with known expansion sizes. The method is most powerful for pronounced expansions (ω ≈ 0.1); with high accuracy, it can both estimate ω and distinguish between the population expansion and constant size models (see Appendix B and Table B.5).
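The logic of the rejection step can be illustrated with a toy Python sketch. It is not the analysis pipeline used here: `toy_simulate` is a hypothetical stand-in for a coalescent simulator such as SIMCOAL, it returns two invented summary statistics rather than the four described above, and the per-statistic scaling of the Euclidean distance is our assumption. Only the Uniform prior on [0.01, 1] and the accept-within-threshold rule are taken from the text.

```python
import math
import random


def toy_simulate(omega, rng):
    """Hypothetical stand-in for a coalescent simulator (NOT SIMCOAL).

    Returns two noisy summary statistics whose expectations grow with omega.
    """
    stat_a = omega + rng.gauss(0.0, 0.05)
    stat_b = 10.0 * omega + rng.gauss(0.0, 0.5)
    return (stat_a, stat_b)


def abc_rejection(observed, n_sims, threshold, scales, seed=0):
    """Rejection ABC: keep prior draws of omega whose simulated summaries
    lie within `threshold` (scaled Euclidean distance) of the observed ones."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_sims):
        omega = rng.uniform(0.01, 1.0)  # prior stated in the text
        simulated = toy_simulate(omega, rng)
        distance = math.sqrt(sum(((s - o) / sc) ** 2
                                 for s, o, sc in zip(simulated, observed, scales)))
        if distance <= threshold:
            accepted.append(omega)
    return accepted


if __name__ == "__main__":
    # Illustrative "observed" summaries, chosen to correspond to omega near 0.1.
    posterior = abc_rejection((0.1, 1.0), n_sims=20000, threshold=0.5,
                              scales=(0.05, 0.5), seed=1)
    print(len(posterior), sum(posterior) / len(posterior))
```

The retained values of ω approximate the posterior distribution; under this scheme, running the same loop for a competing model and comparing the fractions of accepted simulations gives the relative model support described above.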

3.5 Acknowledgments

We thank T MacMillan and D Popescu for assistance with data collection; A Avelar, T Van, E Heinemeyer and C-Y Wu for technical assistance; and G Dolganov for advice and assistance with real time PCR assays. This work was supported by NIH grant 5K08AI67458-2 to CP, National Institutes of Health (NIH) grant GM28016 to MWF and a National Science Foundation (NSF) Graduate Research Fellowship (grant number DGE-0645962) to JMG.

3.6 Figures and Tables

[Figure 3.1 image. Panel A: map of Canada with legend entries for Fur Trade Transportation Networks (Canoe Route, E-NW; Canoe Route, Hudson Bay access; Railway; Steamboat Route), M. tuberculosis Lineages (DS6Quebec, H37Rv-like, Rd 115, Rd 174, Rd 183, Rd 182, Rd 219, Non Euro-American) and Regional character of trade, ca. 1920 (Remote/Traditional). Panel B: timeline marked 1608, 1710, 1760, 1820, 1885 and 1930, with bars for M.tb diffusion and M.tb expansion.]

Figure 3.1: Fur trade geography, regional M. tuberculosis lineage frequencies and historical timeline. A) Map of Canada from Natural Resources Canada [182]. The main fur trade canoe route from the St. Lawrence River to the Beaufort Sea (Montreal route) is in light blue. Canoe routes from Hudson’s Bay to the interior are shown in red. Geography of canoe routes is based on a map by A. Ray in [120]; this was checked against a map of fur trade posts [181]. Geography of railways, steamship lines and areas classified as remote/traditional on the basis of archival evidence (all ca. 1920) are from Figure 38 in [120] and [217]. Proportional lineage frequencies of M. tuberculosis isolates are shown as pie charts in the corresponding province (see also Table B.2). Nomenclature of M. tuberculosis lineages is from [88] and [185]. Pie charts are the same size for clarity, although total sample sizes differed between populations. Lineage frequencies were unavailable for MB. B) Events shown in the timeline are the founding of New France (Quebec) in 1608; incursion of fur traders to the Northwest around 1710; British conquest of New France in 1760 (and end of migration from France to Quebec); merger of North West Company and Hudson’s Bay Company (HBC) in 1820 (with subsequent abandonment of Montreal route in favor of Hudson’s Bay routes); completion of the Canadian Pacific Railway (CPR) in 1885, by which time Western buffalo herds were severely depleted; and, finally, widespread use of bush planes to reach remote areas, starting in the 1930s. Gray boxes indicate the estimated timing of processes in M. tuberculosis demographic history: dispersal of M. tuberculosis to indigenous populations (light gray), and expansion of M. tuberculosis populations as a result of shifts in host ecology (dark gray).

[Figure 3.2 image. Legend entries: QU, ON, SK, AB. Scale bar: 1 repeat.]

Figure 3.2: Minimum spanning tree of minisatellite haplotypes (12 loci) from M. tuberculosis within the DS6Quebec lineage, in QU, ON, SK, AB. The size of nodes is proportional to the frequency of TB cases associated with that M. tuberculosis haplotype. Scale is indicated by the line at the bottom of the figure, which represents a difference of one minisatellite repeat. Minisatellites not available from MB.

[Figure 3.3 image. Panel A: number of distinct RFLP types (y-axis) against number of sampled chromosomes (x-axis) for QU, MB, SK and AB. Panel B: diversity (y-axis) by population (x-axis: QU, MB, NRS, RS, AB).]

Figure 3.3: Genetic diversity of M. tuberculosis populations. A) Number of distinct RFLP haplotypes as a function of the number of sampled chromosomes obtained using rarefaction (see Methods). Populations include QU, MB, SK, and AB; ON is not included due to its low sample size (see Appendix B). Every fourth data point is presented for clarity. B) RFLP haplotype diversity (Shannon index), correcting for sample size by repeatedly sampling the total number of isolates in the smallest sample from each population (NRS, n = 123). Boxplots indicate values obtained over all samples; value for the smallest sample (NRS) is indicated by a line. Populations are as in part A), except that SK is split into Remote Saskatchewan (RS) and Non-Remote Saskatchewan (NRS). Although MB contains both remote and non-remote regions (Figure 3.1), detailed geographic data were not available for this sample; diversity shown here is for the entire sample.
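The two diversity summaries in Figure 3.3 can be illustrated with a short Python sketch. The rarefaction function uses the standard analytical expectation for the number of distinct types in a subsample; the resampling function draws subsamples without replacement, which is our assumption about the procedure, and the function names and toy inputs are illustrative only.

```python
import math
import random
from collections import Counter


def rarefied_richness(counts, n):
    """Expected number of distinct haplotypes in a random subsample of n
    isolates, given per-haplotype counts (analytical rarefaction)."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("subsample size must be between 0 and the sample size")
    denom = math.comb(total, n)
    # A haplotype with c copies is missed with probability C(total - c, n) / C(total, n).
    return sum(1 - math.comb(total - c, n) / denom for c in counts)


def shannon_index(haplotypes):
    """Shannon diversity H = -sum(p_i * ln p_i) over haplotype frequencies."""
    n = len(haplotypes)
    return -sum((c / n) * math.log(c / n) for c in Counter(haplotypes).values())


def resampled_diversity(haplotypes, size, n_reps=1000, seed=0):
    """Shannon index of n_reps random subsamples of `size` isolates, so that
    populations with unequal sample sizes are compared at equal depth."""
    rng = random.Random(seed)
    return [shannon_index(rng.sample(haplotypes, size)) for _ in range(n_reps)]


if __name__ == "__main__":
    toy = ["h1"] * 6 + ["h2"] * 3 + ["h3"]  # invented haplotype labels
    print(rarefied_richness([6, 3, 1], 5))
    print(sum(resampled_diversity(toy, size=5, n_reps=200)) / 200)
```

In practice, `size` would be set to the number of isolates in the smallest population sample, and the distribution of resampled values for each population would be summarized as in panel B.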

Table 3.1: Within- and between-lineage AMOVA based on 12 minisatellite loci for M. tuberculosis from Quebec, Ontario, Saskatchewan, and Alberta.

Bacterial lineage   Source of variation   D.f.    SS        Variance component   % Variation   FST
DS6Quebec           Among populations     3       16.25     0.04                 3.81          0.04
                    Within populations    583     521.26    0.89                 96.19
H37Rv-like          Among populations     3       181.68    0.89                 51.56         0.52
                    Within populations    306     256.44    0.84                 48.44
Rd 182              Among populations     3       8.26      0.33                 22.26         0.22
                    Within populations    59      67.95     1.15                 77.74
Rd 219              Among populations     3       14.28     0.25                 20.97         0.21
                    Within populations    65      60.68     0.93                 79.03
All lineages        Among populations     3       75.54     0.10                 7.95          0.08
                    Within populations    1,040   1227.16   1.18                 92.05

D.f., degrees of freedom; FST, fraction of variation among populations; SS, sum of squares.

Table 3.2: Genetic timing estimates for DS6Quebec M. tuberculosis lineage in calendar years.

Parameter     Historical correlate          Point estimate(s)*   Confidence interval†   Historical prediction‡
TMRCA (QU)    French migration to Quebec    1740                 1709-1771              1608-1760
TMRCA (SK)    Fur trade expansion West      1797                 1777-1815              1730-1870
TMRCA (AB)                                  1779                 1751-1804              1750-1870
TD (QU/SK)    Separation of populations     1789, 1884, 1910     1788-1979              1870
TD (QU/AB)                                  1779, 1901, 1885     1805-1966              1870

*Three point estimates are shown for TD. The first is the earliest bound for divergence, where a variance of 0 is assumed for minisatellite repeat number in the ancestral population. For the second estimate, we assume that variance in the ancestral population is equal to the variance of haplotypes found presently in the QU population and shared with the founded population (SK or AB). The third point estimate is based on an assumption that variance in the ancestral population was equal to the variance found presently in haplotypes in the founded population (SK or AB) that are also found in QU.
†Results presented for TMRCA assume a star-like genealogy. TMRCA confidence intervals are larger if constant population size or exponential growth are assumed (see Appendix B).
‡Estimate of timing of M. tuberculosis population events, based on historical (nongenetic) data. References for date ranges are Charbonneau [46] for timing of migration to Quebec from France and Innis [120] for other dates. The earlier bounds for fur trade expansion into Saskatchewan and Alberta are based on the time of establishment of fur trade posts in these regions. Note that this timing is later than expansion into the Northwest as a whole (ca. 1710), which includes Western Ontario and Manitoba. Timing of separation of populations is based on analyses of Innis [120] indicating that the Hudson’s Bay Company shifted to less labor-intensive methods of extracting furs (and thus had no need of French Canadian voyageurs), starting about this time. Analyses of interprovincial migration from the 1870s to the early 20th century are also consistent with minimal migration of French Canadians to the Western provinces [92].

4 Limited Evidence for Classic Selective Sweeps in African Populations

Julie M. Granka*, Brenna M. Henn§, Christopher R. Gignoux**, Jeffrey M. Kidd§§, Carlos D. Bustamante§, Marcus W. Feldman*

* Department of Biology, Stanford University, Stanford, CA, USA
§ Department of Genetics, Stanford University, Stanford, CA, USA
** University of California, San Francisco, CA, USA
§§ Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA

Genetics. 2012 Nov; 192(3):1049-64.

Abstract

While hundreds of loci have been identified as reflecting strong positive selection in human populations, connections between candidate loci and specific selective pressures often remain obscure. This study investigates broader patterns of selection in African populations, which are underrepresented despite their potential to offer key insights into human adaptation. We scan for hard selective sweeps using several haplotype and allele frequency statistics with a dataset of nearly 500,000 genome-wide single nucleotide polymorphisms in 12 highly diverged African populations that span a range of environments and subsistence strategies. We find that positive selection does not appear to be a strong determinant of allele frequency differentiation among these African populations. Haplotype statistics do identify putatively-selected regions that are shared across African populations. However, as assessed by extensive simulations, patterns of haplotype sharing between African populations follow neutral expectations and suggest that tails of the empirical distributions contain false positive signals. After highlighting several genomic regions where positive selection can be inferred with higher confidence, we use a novel method to identify biological functions enriched among populations’ empirical tail genomic windows, such as immune response in agricultural groups. In general, however, it seems that current methods for selection scans are poorly suited to populations which, like the African populations in this study, are affected by ascertainment bias, have low levels of linkage disequilibrium, possibly old selective sweeps, and potentially reduced phasing accuracy. Additionally, population history can confound the interpretation of selection statistics, suggesting greater care is needed in attributing broad genetic patterns to human adaptation.

CHAPTER 4. SELECTIVE SWEEPS AND AFRICAN POPULATIONS

4.1 Introduction

Elucidating the selective pressures that human populations have encountered, as well as the means by which they have adapted to them, are central aims of evolutionary biology and anthropology. Recently, statistical methods in population genetics have been applied to genome-wide polymorphism data to identify genetic loci that may have experienced natural selection. Such inferences have primarily been made under the model of a hard selective sweep, where a new allele rapidly rises to fixation within a population due to positive selection [3, 207, 212, 231, 232, 233, 234, 291, 303]. While these genome-wide scans have detected hundreds of loci as focal sites of selective sweeps, most connections between the loci and their selective pressures remain unknown. Lack of phenotypic or functional information, the relative effect of background selection, demography, and statistical noise have been suggested as some of the causes of the lack of agreement between studies [3, 105, 106, 269].

In light of these issues, it is crucial to understand broader patterns of adaptation in human populations before identifying individual putatively selected loci [52, 207]. One strategy is to compare across populations the regions of the genome that appear to be undergoing positive selection while incorporating knowledge of populations’ environments and lifestyles, providing insight into the selective forces acting on the same loci [95, 96, 97, 147, 207, 291]. A second approach is to search for biological functions that could be under strong positive selection within populations, identifying functions that are more frequently associated with putatively-selected regions of the genome than expected by chance (see, for example, [291] and [170]).

Using both approaches, we sought a more complete picture of adaptation in human African populations. Previous studies [130, 147, 233, 291] have reported strong signals of positive selection in African populations. Local adaptation is expected to be common within Africa, as African populations practice a range of subsistence strategies and inhabit diverse environments to which adaptation may have occurred [40, 61, 62]. Selection may also have acted more efficiently in African populations, many of which have maintained large effective population sizes for the past 50,000 years, in contrast to populations outside of Africa [40, 103, 148, 267]. While African populations would appear ideal for understanding the extent of positive selection in humans, they have usually been underrepresented in such studies.

In a set of 12 highly diverged hunter-gatherer, pastoralist, and agriculturalist African populations, we search for genomic evidence of positive selection. We analyze genome-wide data of nearly half a million single nucleotide polymorphisms (SNPs), and include populations from the Human Genome Diversity Project (HGDP) [157], HapMap [121, 122, 123] and four hunter-gatherer populations for which thorough selection scans have not yet been conducted [103, 242]. As potential candidates for positive selection, we identify outliers in the empirical genomic distributions of several commonly-used selection statistics, including measures of population differentiation and hard selective sweeps.

4.2 Methods

4.2.1 Samples

We performed selection scans on genomic data from individuals sampled from 12 African populations using a set of nearly 500,000 single nucleotide polymorphisms (SNPs). Our sample of African populations includes agriculturalists, pastoralists, and hunter-gatherers, as well as individuals from rainforests, deserts, and other environments. Populations included from the Centre d’Étude du Polymorphisme Humain (CEPH)-HGDP (sample sizes in parentheses) were the South African Bantu (8), Kenyan Bantu (11), Biaka Pygmies (22), Mbuti Pygmies (13), Mandenka (22), Namibian San (5), and Yoruba (21); Mozabites were excluded [157, 227]. We also included 12 Namibian San individuals from Schuster et al. [242], who were combined with the HGDP Namibian San for a total of 17 Namibian San individuals; we also included additional individuals from the Hadza and Sandawe of Tanzania and the ≠Khomani Bushmen of South Africa, described by Henn et al. [103]. As in Henn et al. [103], we removed closely related individuals, leaving 17 Hadza, 28 Sandawe, and 31 ≠Khomani Bushmen. (For FST calculations, we removed admixed individuals identified in Henn et al. [103], leaving 11 Hadza and 23 ≠Khomani Bushmen). HapMap Maasai trio parents from Kinyawa, Kenya (MKK) (46 individuals) and Yoruba trio parents from Ibadan, Nigeria (YRI) (100 individuals) were also analyzed [121, 122, 123]. (As Pemberton et al. [201] found evidence of undocumented relatedness between some Maasai individuals, we removed a subset from the original dataset; see Appendix C Methods). For some analyses, we included HapMap CEPH Utah residents with ancestry from northern and western Europe (CEU) trio parents (88 individuals). HapMap Tuscans (88 individuals) were included for pairwise mean FST and population differentiation calculations [103]. For a summary, see Table C.1.

After merging the SNPs genotyped in the HapMap samples, in the CEPH-HGDP samples on the Illumina 650K platform, in the samples from Henn et al. [103] on the Illumina 550K platform, and sequence data from Schuster et al. [242], a total of 461,767 SNPs remained. For haplotype-based selection statistics (see below), we pruned additional SNPs and phased genotype data from 461,154 SNPs using BEAGLE [36] (as in Henn et al. [103], see Appendix C Methods). Known parent-offspring trios from the HapMap YRI and ≠Khomani Bushmen were used as seeds, with all default parameters, to increase phasing performance [37]. All populations were phased in this manner except for the HapMap CEU, YRI, and MKK, for which phase information was available from the HapMap website. 457,682 autosomal SNPs were used for analyses of single SNP selection statistics (see below).

Ancestral alleles for all SNPs were determined by their state in chimpanzee; chimpanzee positions were obtained from the Ensembl database (http://www.biomart.org, accessed August 2011). If the ancestral state was unavailable or ambiguous, the SNP was removed. All datasets were merged to Human Genome Build hg18 (NCBI Build 36). When needed, we used the recombination map available through the HapMap [121, 122, 123].

4.2.2 Single SNP Statistics

We calculated genome-wide (over all SNPs) mean FST, a measure of population differentiation, between all pairs of African populations, and for each African population against the HapMap Tuscans, using the program GENEPOP [220, 230]. GENEPOP was also used to calculate per-SNP FST values between all pairs of African populations. We also calculated FST over all populations for each SNP as in Akey et al. [4], following Weir and Cockerham [299]. For these analyses, we included only the HGDP Yorubans (not the HapMap YRI due to their close relation to the HGDP population), for a total of 11 African populations.

We classified a SNP as genic if it was within 2 kb of a gene, and non-genic otherwise. Gene locations were obtained from University of California, Santa Cruz (UCSC) refFlat mappings of RefSeq genes to hg18 (downloaded in May 2010, http://www.genome.ucsc.edu). We arranged values of FST calculated over all populations for each SNP in bins with increments of 0.05. Within each bin, an enrichment value was calculated as the proportion of genic (or nongenic) SNPs in that bin divided by the genome-wide proportion of genic (or nongenic) SNPs. To assess significance, we resampled 200 kb segments of the genome and re-calculated enrichments in each FST bin [52]. Upper and lower 2.5% values over 1,000 bootstrap samples were used to obtain 95% confidence intervals.

We also calculated the absolute value of the derived allele frequency difference (|δ|) between all pairs of populations (55, or 11 choose 2, pairs). We averaged the |δ| values over all pairs for each SNP (to create one measure of differentiation) and calculated genic enrichments (in bin sizes of 0.03), assessing significance as above. We also calculated enrichments in bins of (non-absolute) δ values for individual pairs of African populations, as well as between the HGDP Yorubans and HapMap Tuscans, again assessing significance by bootstrapping (see Appendix C).

For all population pairs, we evaluated the relationships between the top 99.99% tail values of |δ| (over all SNPs) and mean FST (averaged over all SNPs), and fitted a curve using the lowess() function in R (http://www.r-project.org) (see Appendix C Methods for additional analyses). We simulated the expected relationship under neutrality between the top 99.99% tail of |δ| values and mean FST using the “beta-binomial” method of Balding [11] (also used in [52]), where allele frequencies in each population are drawn from a beta-binomial distribution whose variance is a function of the FST between the two populations (see Appendix C Methods). For each pair of populations, we repeated this simulation independently for 457,682 loci (the number of autosomal SNPs), and determined the 99.99th percentile value of |δ| over all loci. For each population pair we plotted the mean and standard deviation of the 99.99% |δ| values over 25 simulations. A best fit curve through the simulated mean values was drawn using the lowess() function in R.
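The genic enrichment calculation described in this section can be sketched in Python as follows. This is an illustrative outline rather than the code used for the analysis: the function name is ours, negative FST estimates are assigned to the lowest bin, and a small tolerance is added before truncating to a bin index; both of these choices are assumptions. The bootstrap over 200 kb segments is not shown.

```python
from collections import defaultdict


def genic_enrichment(fst_values, is_genic, bin_width=0.05):
    """Per-bin enrichment of genic SNPs.

    enrichment(bin) = (proportion of SNPs in the bin that are genic)
                      / (genome-wide proportion of SNPs that are genic)
    """
    if len(fst_values) != len(is_genic) or not fst_values:
        raise ValueError("inputs must be non-empty and of equal length")
    overall = sum(is_genic) / len(is_genic)
    if overall == 0:
        raise ValueError("no genic SNPs in the data set")
    totals = defaultdict(int)
    genic = defaultdict(int)
    for fst, g in zip(fst_values, is_genic):
        # Assumption: negative FST estimates are treated as 0 for binning.
        b = int(max(fst, 0.0) / bin_width + 1e-9)
        totals[b] += 1
        genic[b] += int(g)
    return {b: (genic[b] / totals[b]) / overall for b in sorted(totals)}


if __name__ == "__main__":
    # Invented toy data: six SNPs, half of them genic.
    fst = [0.01, 0.02, 0.03, 0.04, 0.41, 0.42]
    flags = [True, False, False, False, True, True]
    print(genic_enrichment(fst, flags))
```

An enrichment above 1 in the highest FST bins would indicate that strongly differentiated SNPs are more often genic than expected from the genome-wide proportion; the same function applies to nongenic SNPs by inverting the flags, and to |δ| by changing `bin_width` to 0.03.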

4.2.3 Haplotype Statistics

We calculated several haplotype statistics widely used for detection of selective sweeps: the iHS (Integrated Haplotype Score) [291], which identifies ongoing and incomplete sweeps, and the Cross Population Extended Haplotype Homozygosity Test (XP-EHH) [233], which detects complete or nearly complete sweeps occurring in one population but not in another “reference” population (see Appendix C Methods for more details). Both statistics were calculated using scripts from the HGDP Selection Browser website [207] (http://hgdp.uchicago.edu). Phased genotypes, ancestral SNP states, and recombination maps were determined as described previously. Reference populations used for the XP-EHH calculations were the HapMap CEU (“XP-EHH CEU”), HapMap YRI (“XP-EHH YRI”), HapMap Maasai (“XP-EHH MKK”), and ≠Khomani Bushmen (“XP-EHH KHB”) (see Appendix C for details). Using a more closely-related African reference population with a relatively more stable demographic history avoids artifacts caused by an out-of-Africa expansion, which could strongly affect the XP-EHH CEU statistic. iHS and XP-EHH CEU were calculated for all 12 populations, XP-EHH YRI for ten (HGDP Yorubans and HapMap YRI were excluded), and XP-EHH MKK and XP-EHH KHB for 11 (Maasai and ≠Khomani Bushmen, respectively, were excluded) (see Table C.1). Per-SNP statistics were normalized in bins of SNPs with similar minor allele frequencies (in increments of 0.05) for iHS as in Voight et al. [291], and over all SNPs for XP-EHH as in Sabeti et al. [233].
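The frequency-binned normalization of iHS can be illustrated with the following Python sketch. It assumes that normalization means subtracting the bin mean and dividing by the bin standard deviation; the function name, the use of the population standard deviation, and the toy values are our assumptions and do not reproduce the HGDP Selection Browser scripts.

```python
import statistics
from collections import defaultdict


def standardize_in_bins(scores, freqs, bin_width=0.05):
    """Z-score each unstandardized score against SNPs in the same
    allele-frequency bin (bins of width `bin_width`)."""
    if len(scores) != len(freqs):
        raise ValueError("scores and freqs must have equal length")
    bins = defaultdict(list)
    for i, f in enumerate(freqs):
        bins[int(f / bin_width + 1e-9)].append(i)
    out = [0.0] * len(scores)
    for idx in bins.values():
        vals = [scores[i] for i in idx]
        mu = statistics.mean(vals)
        sd = statistics.pstdev(vals)
        for i in idx:
            # Bins with no variation are left at 0 to avoid division by zero.
            out[i] = (scores[i] - mu) / sd if sd > 0 else 0.0
    return out


if __name__ == "__main__":
    raw = [1.0, 3.0, 10.0, 14.0]      # invented unstandardized scores
    maf = [0.11, 0.12, 0.31, 0.32]    # invented allele frequencies
    print(standardize_in_bins(raw, maf))
```

Normalizing XP-EHH over all SNPs corresponds to the special case of a single bin, for example by passing `bin_width=1.0` with frequencies below 1.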

Windows and Haplotype Statistics

We broke the genome into non-overlapping windows of 100 kb to identify regions of the genome under selection. Because of the lower linkage disequilibrium (LD) in African populations [103], we reduced the 200 kb window size used by Pickrell et al. [207] (see Appendix C Methods; see also Conrad et al. [51]). As a summary statistic for each window, we used the maximum XP-EHH value of all SNPs within a window; for iHS, we used the proportion of SNPs in the window with |iHS| > 2 [207]. The variance of SNP statistics within 100 kb windows was significantly smaller than that among randomly selected SNPs (see Appendix C Methods and Results), confirming that the windows effectively grouped SNPs with more similar statistic values. We binned genomic windows according to their numbers of SNPs in increments of ten SNPs (combining the few windows with ≥ 50 SNPs into one bin). We then obtained an empirical p-value for each window based on its ranking (according to its window statistic) among the windows in its SNP-count bin. (See Appendix C and [52].)
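
A minimal Python sketch of the window summaries and the rank-based empirical p-values follows. It is illustrative only: the function names are hypothetical, positions are assumed to lie on a single chromosome, and the empirical p-value is taken to be the fraction of windows in the same SNP-count bin whose statistic is at least as large.

```python
import numpy as np

def window_summaries(pos, xpehh, ihs, size=100_000):
    """Per-window SNP count, maximum XP-EHH, and proportion of SNPs with |iHS| > 2."""
    win = np.asarray(pos) // size
    xpehh = np.asarray(xpehh, dtype=float)
    ihs = np.asarray(ihs, dtype=float)
    ids = np.unique(win)
    n_snps = np.array([(win == w).sum() for w in ids])
    max_xpehh = np.array([xpehh[win == w].max() for w in ids])
    frac_ihs = np.array([(np.abs(ihs[win == w]) > 2).mean() for w in ids])
    return ids, n_snps, max_xpehh, frac_ihs

def empirical_pvalues(stat, n_snps, step=10, cap=50):
    """Rank each window among windows with a similar number of SNPs.

    Windows are binned in steps of `step` SNPs; all windows with at least
    `cap` SNPs share one bin.
    """
    stat = np.asarray(stat, dtype=float)
    density_bin = np.minimum(np.asarray(n_snps), cap) // step
    p = np.empty(len(stat))
    for b in np.unique(density_bin):
        m = density_bin == b
        s = stat[m]
        # For window i: number of windows j in the bin with s[j] >= s[i].
        ranks = (s[None, :] >= s[:, None]).sum(axis=1)
        p[m] = ranks / m.sum()
    return p
```

Ranking within SNP-count bins keeps windows with many SNPs, which tend to reach more extreme maxima, from dominating the tail. The pairwise comparison is quadratic in the bin size and is written for clarity rather than speed.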

Overlap of Signals Between Populations

As a summary, we first obtained the five genomic windows with the lowest empirical p-values in each population, and their p-values in all other populations, for iHS, XP-EHH CEU, XP-EHH YRI, XP-EHH MKK, and XP-EHH KHB (see Appendix C Methods for more details). To quantify the similarity of selection signals between pairs of populations, we counted the number of 100 kb genomic regions present in the top empirical 1% (and 0.1%) of both populations of the pair using the intersectBed program in the program suite BEDTools [215]. For each haplotype statistic, this resulted in a matrix of values over all population pairs. For each population pair we also computed a correlation between the two populations’ genome-wide window statistic values to construct another matrix for each haplotype statistic. We calculated a series of Mantel correlations of these matrices with a matrix of between-population FST values; we obtained p-values by comparing observed correlations to those of 5,000 permutations, where rows and columns of the original matrices were randomly permuted, using the mantel.rtest() function in R. See Appendix C for additional analyses. We also tested the relationship of the matrix of overlaps of empirical tail windows with a matrix of average measures of Bantu ancestry for the populations of each pair (with a Mantel test as above). To calculate this ancestry measure, we first averaged the proportions of Bantu ancestry, inferred from the population structure analyses of Henn et al. [103], over all individuals in each population separately. Then, for each population pair, we averaged the two population Bantu ancestry values.
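
The overlap count and the Mantel permutation test can be sketched in Python as below. This is neither intersectBed nor a reimplementation of mantel.rtest(): it assumes the two populations' p-values are already aligned on the same windows, uses the Pearson correlation of the upper triangles, and reports a two-sided permutation p-value, all of which are choices made for illustration.

```python
import numpy as np

def tail_overlap(p_a, p_b, cutoff=0.01):
    """Number of windows in the empirical tail of both populations.

    Assumption: p_a[i] and p_b[i] refer to the same genomic window.
    """
    p_a, p_b = np.asarray(p_a), np.asarray(p_b)
    return int(np.sum((p_a <= cutoff) & (p_b <= cutoff)))

def mantel(x, y, n_perm=5000, seed=0):
    """Mantel correlation between two symmetric population-by-population matrices.

    Rows and columns of `x` are permuted together; the p-value is two-sided.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    iu = np.triu_indices_from(x, k=1)          # each population pair once
    r_obs = np.corrcoef(x[iu], y[iu])[0, 1]
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(x.shape[0])
        r = np.corrcoef(x[np.ix_(perm, perm)][iu], y[iu])[0, 1]
        hits += int(abs(r) >= abs(r_obs))
    return r_obs, (hits + 1) / (n_perm + 1)
```

Permuting rows and columns jointly preserves the dependence among the pairwise entries that share a population, which is why an ordinary correlation test on the matrix entries would not be appropriate here.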

4.2.4 Coalescent Simulations

We performed genome-wide simulations of XP-EHH and iHS using the coalescent simulation program msms [72] under models that were neutral and models that included positive selection. We simulated a three-population model: one population served as the reference population for XP-EHH, and the other two were the populations for which XP-EHH and iHS were calculated. We assumed an equilibrium model and used the formula $(1 - F_{ST}) = \left(1 - \frac{1}{2N_e}\right)^{t}$, where t is the divergence time in generations, to translate FST values to divergence times [115]. To mimic the XP-EHH CEU statistic, we simulated an FST of 0.15 between the reference population and the two daughter (African) populations (see FST between African populations and Tuscans, Table C.2). We simulated daughter population divergence times to correspond to FST values of 0.01, 0.05, and 0.10, spanning the range of FST values between the African populations in our dataset (Table C.2). We assumed 12,000 as the effective size (Ne) of each population ([238]; see Appendix C and Figure C.12 for details). We did not rigorously model the ascertainment strategy of the Illumina genotyping platform, as we were not primarily interested in its precise effect [68]. As a proxy, we first removed all simulated SNPs with frequency less than 0.05 in the “European” reference population. We calculated allele frequencies for each SNP across all African populations of our dataset and grouped them by minor allele frequency (MAF) in bins of size 0.05; we then randomly thinned the simulated SNPs to match this MAF distribution and observed SNP density [233]. Although this is only an approximation to the properties of the SNPs of our dataset, it should be sufficient for a window-based analysis of haplotype statistics [51]. Using these simulated and pruned SNP data, in each of the two daughter populations we calculated iHS and XP-EHH using the first population as a reference. 
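
As a worked illustration of the conversion between FST and divergence time, the relation $(1 - F_{ST}) = (1 - \frac{1}{2N_e})^{t}$ can be solved for t. The short Python sketch below does this for the FST values and effective size given above; the function names are hypothetical.

```python
import math

def divergence_time(fst, ne):
    """Solve (1 - FST) = (1 - 1/(2 Ne))**t for t, in generations."""
    return math.log(1.0 - fst) / math.log(1.0 - 1.0 / (2.0 * ne))

def fst_from_time(t, ne):
    """Inverse relation: expected FST after t generations of divergence."""
    return 1.0 - (1.0 - 1.0 / (2.0 * ne)) ** t

for fst in (0.01, 0.05, 0.10, 0.15):
    print(fst, round(divergence_time(fst, 12_000)))
```

With Ne = 12,000, these FST values correspond to divergence times on the order of 240, 1,230, 2,530, and 3,900 generations, respectively. These are approximate figures from the formula alone, not values reported in the text.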
We followed the same procedure as for observed data: we normalized statistic values across the genome, broke the genome into 100 kb windows, and calculated empirical p-values for windows based on their numbers of SNPs. We calculated the overlap in tail windows of the two daughter populations using the p-value cutoffs 0.05, 0.01, 0.005, and 0.001. We repeated this genome-wide simulation for 100 independent replicates for each daughter-population FST value (0.01, 0.05, and 0.10). We simulated 20 haplotypes from each population. For neutral simulations, we simulated segments of 10 Mb and concatenated 300 of these segments to mimic an entire genome (3,000 Mb in total). We also simulated data under a model of a selective sweep occurring in the first daughter population; the second daughter population and the reference population were assumed to evolve neutrally. Since we used an empirical outlier approach, we simulated only 50% of the genome under this model of selection (300 5-Mb segments); we simulated another 150 10-Mb segments under neutrality for a total of 3,000 Mb (see Appendix C Methods). For simulations under selection, the position in the middle of each 5-Mb segment (at 2.5 Mb) was the selected site, although the site was not sampled. The positive selection strength of the selected allele was s = 0.2; while unrealistic, this allowed for efficient simulation of very strong selective sweeps that could be detected with a high true positive rate by both iHS and XP-EHH (see Appendix C Methods). Based on preliminary simulations, we set the time of selection as 62.4 generations before the present, again allowing both XP-EHH and iHS to have reasonable ability to detect the sweeps (see Appendix C Methods). We only included simulations where the frequency of the selected allele was non-zero at sampling time (i.e., it had not been lost), and where the selected allele had arisen only once (i.e., at a frequency of $\frac{1}{N}$ for a hard sweep); we discarded all others. See Appendix C Methods for additional details. For each simulation, we calculated the true positive rate as the percentage of genomic windows in each empirical tail that were simulated under selection (i.e., that were genomic windows within a 5-Mb selection simulation segment). Because signals of selection may not extend across an entire 5-Mb segment, this may be an overestimate of the true positive rate, and an underestimate of the number of false positives. Limitations of the msms program at the time of analysis prevented reliable simulation of parallel selective sweeps at the same site in both daughter populations (Gregory Ewing, personal correspondence).
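
The true positive rate defined above reduces to a simple proportion once each 100 kb window is labeled by the kind of segment it came from. The Python sketch below is illustrative: the function names and the ordering of segments in the concatenated genome (selected segments first) are assumptions, not details from the simulations.

```python
def window_labels(n_sel=300, sel_mb=5, n_neut=150, neut_mb=10, window_kb=100):
    """True for each window that lies in a segment simulated under selection.

    Assumption: the selected segments are placed before the neutral ones.
    """
    per_sel = sel_mb * 1000 // window_kb     # windows per selected segment
    per_neut = neut_mb * 1000 // window_kb   # windows per neutral segment
    return [True] * (n_sel * per_sel) + [False] * (n_neut * per_neut)

def true_positive_rate(p_values, under_selection, cutoff):
    """Fraction of empirical-tail windows that were simulated under selection."""
    tail = [sel for p, sel in zip(p_values, under_selection) if p <= cutoff]
    return sum(tail) / len(tail) if tail else float("nan")
```

With the segment counts above, half of the 30,000 windows are labeled as selected, so a tail drawn at random would have an expected true positive rate of 0.5 under this definition.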

4.2.5 Gene Annotation Enrichment for Haplotype Statistics

To determine whether Gene Ontology (GO) biological process terms (http://www.geneontology.org) were significantly over-represented within populations in genomic regions with the most extreme selection statistics, we implemented a permutation approach similar to that of Begun et al. [22] (see Figure 4.1 and Appendix C Methods). To assess significance in the top x% of genomic windows, we randomly sampled x% of genomic windows from the genome, accounting for the number of SNPs within each window as when assigning p-values to observed windows (see Figure 4.1), and obtained the number of occurrences of each GO term. The distribution of the number of occurrences of a GO term over 20,000 of these randomizations was used as a null distribution to obtain a p-value for its number of occurrences in the observed top x%. In addition to examining all GO terms, we examined a subset of terms present in the PANTHER database (≈ 260 terms, associated with 28,173 genes), studied by Voight et al. [291] (http://www.pantherdb.org; downloaded in November 2010). To account for multiple hypothesis testing of many GO terms, for each population we controlled the false discovery rate (FDR) at the 5% level using the procedure of Benjamini and Hochberg [26]. Python and R scripts used to perform the analysis are available upon request; see Appendix C Methods for more details. We focused on enrichments in the 0.1% empirical tail for each haplotype statistic for each population, based on results from between-population comparisons of empirical tail windows (see Results and Table C.3). We merged adjacent empirical tail genomic windows and did not report GO terms that achieved significance only because a gene spanned multiple contiguous windows. Since significant terms associated with only one genomic window merely indicate that they are rare, we also only reported terms associated with more than one window. To investigate the difference in results obtained when using alternative methods, we replicated the GO enrichment method used by Voight et al. [291] for iHS in the HapMap YRI (see the Appendix C Methods).
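
The permutation test and the false discovery rate control can be sketched in Python as follows. This is not the analysis code referred to above. It assumes that "accounting for the number of SNPs" means drawing random windows matched to the observed tail by SNP-count bin, it adds one to the numerator and denominator of the permutation p-value, and it omits the merging of adjacent windows; all names are hypothetical.

```python
import numpy as np
from collections import Counter

def go_permutation_pvalues(window_terms, density_bin, tail_idx,
                           n_perm=20_000, seed=0):
    """Permutation p-value for each GO term seen in the observed tail windows.

    window_terms: list with one set of GO terms per genomic window.
    density_bin:  SNP-count bin of each window.
    tail_idx:     indices of the observed empirical-tail windows.
    """
    rng = np.random.default_rng(seed)
    density_bin = np.asarray(density_bin)
    tail_idx = list(tail_idx)
    observed = Counter(t for i in tail_idx for t in window_terms[i])
    # How many tail windows fall in each SNP-count bin (assumed matching rule).
    need = Counter(density_bin[tail_idx].tolist())
    pools = {b: np.flatnonzero(density_bin == b) for b in need}
    exceed = Counter()
    for _ in range(n_perm):
        draw = np.concatenate([rng.choice(pools[b], size=k, replace=False)
                               for b, k in need.items()])
        null = Counter(t for i in draw for t in window_terms[i])
        for term, obs in observed.items():
            if null[term] >= obs:
                exceed[term] += 1
    return {t: (exceed[t] + 1) / (n_perm + 1) for t in observed}

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, len(p) + 1) / len(p)
    passed = p[order] <= thresh
    k = np.max(np.flatnonzero(passed)) + 1 if passed.any() else 0
    reject = np.zeros(len(p), dtype=bool)
    reject[order[:k]] = True
    return reject
```

Because random windows are drawn from the real genome, clustering of functionally related genes and differences in gene length are carried into the null distribution rather than being assumed away.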

4.3 Results

4.3.1 Single SNP Statistics

We first examined population differentiation of single SNPs. We calculated per-SNP autosomal FST over 11 African populations (see Methods, Tables C.1 and C.2), and plotted the enrichment of genic and nongenic SNPs, in comparison to their genome-wide frequencies, in bins of FST values (Figure 4.2A). Frequent strong positive selection on genic regions would be expected to cause genic SNPs to display higher population differentiation on average than nongenic SNPs [13, 52]. However, SNPs with extreme FST were not significantly more often genic than nongenic. At intermediate FST values (≈ 0.2 - 0.3), a slight enrichment of genic SNPs was significant.

FST calculated across all 11 populations could have poor power to identify loci under positive selection if populations have experienced mostly local, or independent, selection. For each SNP we also calculated the derived allele frequency difference (δ) between all (55, or $\binom{11}{2}$) pairs of populations, following Coop et al. [52] (Figures C.1A and B). Less than half of the pairs of populations exhibited |δ| values greater than 0.8, though most highly differentiated SNPs were genic (Figure C.1C). The most highly differentiated SNPs (|δ| > 0.8) were found in comparisons with the Hadza and Mbuti Pygmies, where one allele was nearly completely absent (see also Figure C.2D). There were also very few SNPs (often less than 5) in each population pair’s most extreme δ value bins. As a result, when assessing significance of enrichment for individual pairs, a majority of the bootstrap samples of genomic regions did not even contain SNPs with extreme δ values – invalidating the bootstrap confidence intervals (see Appendix C Results and Figure C.2). Thus, for no pair did we find a significant enrichment of genic SNPs for the most extreme δ values. Genic enrichments in the next-most-extreme δ bins, with valid bootstrap confidence intervals, were not significant.

Upon averaging |δ| over all pairs of populations to obtain a measure of differentiation similar to FST for each SNP, enrichment of genic SNPs at high values was also not significant (Figure 4.2B). There was a significant enrichment of genic SNPs at low averaged |δ| values, possibly due to purifying selection; however, this enrichment was not observed for FST calculated over all 11 populations (Figure 4.2A; see also Figure C.1C). We also examined genic enrichments for δ between the HapMap Tuscans and HGDP Yorubans, as a comparison to the analysis of Coop et al. [52] for Perlegen Type A SNPs between the HapMap CEU and YRI (Hinds et al. [110]; Figure C.3). Unlike among African populations, confidence intervals were valid, since all bootstrap samples contained SNPs in each δ bin. However, a genic enrichment among SNPs with high δ was not significant for either the Tuscans or the Yorubans.

For each African population pair, we examined the relationship between their mean FST (calculated over all SNPs) and the upper 99.99% tail value (over all SNPs) of |δ| (Figure 4.3). As mean FST between African populations increased, the 99.99% tail value of |δ| also increased (as in Coop et al. [52]; see also Figure C.4). Under a neutral equilibrium model, we simulated extreme |δ| values for all pairs of populations given their observed pairwise FST values (following Coop et al. [52]; see Methods). The simulated data, as well as the lowess smoothed curve, corresponded closely to observed data (Figure 4.3).
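
The neutral expectation for the extreme tail of |δ| can be illustrated with a small Python sketch in the spirit of the beta-binomial approach cited in the Methods. Several ingredients are assumptions made only for illustration and are not taken from the text: ancestral allele frequencies are drawn uniformly, each population's frequency is drawn from a beta distribution with the pairwise FST as its variance parameter, and 40 chromosomes are sampled per population.

```python
import numpy as np

def simulate_delta_tail(fst, n_loci=457_682, n_chrom=40, q=99.99, seed=0):
    """Upper-tail percentile of |delta| between two populations under neutrality.

    Assumptions (illustrative): uniform ancestral frequencies on (0.05, 0.95),
    beta-distributed population frequencies with parameters
    p(1 - F)/F and (1 - p)(1 - F)/F, and binomial sampling of n_chrom chromosomes.
    """
    rng = np.random.default_rng(seed)
    p_anc = rng.uniform(0.05, 0.95, size=n_loci)
    a = p_anc * (1.0 - fst) / fst
    b = (1.0 - p_anc) * (1.0 - fst) / fst
    freqs = []
    for _ in range(2):                        # two populations per pair
        p_pop = rng.beta(a, b)                # drift around the ancestral frequency
        freqs.append(rng.binomial(n_chrom, p_pop) / n_chrom)
    return float(np.percentile(np.abs(freqs[0] - freqs[1]), q))
```

Repeating the call with different seeds for a given FST and taking the mean and standard deviation mirrors the summary over replicate simulations described in the Methods. The sketch reproduces the qualitative pattern that the tail of |δ| rises with FST even without selection.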

4.3.2 Haplotype Statistics

Haplotype-based statistics iHS [291] and XP-EHH [233] calculated with phased genotype data may be less biased by SNP ascertainment than SNP-based measurements such as FST [51]. For a total of four statistics, we used HapMap CEU, HapMap YRI, HapMap MKK, and the ≠Khomani Bushmen (KHB) as reference populations (see Methods). Candidate genomic windows for positive selection should have selection statistic values in the tails of an empirical distribution across all genomic windows, if such selection is relatively rare [106].

Comparison of Empirical Tail Windows

Genome-wide selection scans often compare candidate signals across multiple populations [103, 207]. An example of this is Figure 4.4, which shows the top five 100 kb genomic windows with the lowest empirical p-values for XP-EHH MKK in each population (see also Figures C.5, C.6, C.7, C.8). For an analysis of patterns of sharing of selective sweep statistics beyond what can be obtained from Figure 4.4, for every pair of populations we calculated the number of 100 kb windows in the 1% tail of the empirical distributions of both populations (approximately 250 tail windows). As a population pair’s mean FST increased (i.e., as populations became more diverged), the number of windows appearing in the top 1% of both populations decreased predictably. Mantel correlations between matrices of mean FST and number of overlapping 1% tail windows between pairs of populations were significant for iHS, XP-EHH CEU, XP-EHH MKK, and XP-EHH KHB (all except XP-EHH YRI) (Figures 4.5 and 4.6 and Table C.3).

We observed a similar, though less pronounced, relationship between mean FST and extent of sharing for the 0.1% tail (Figure C.9 and Table C.3). For most statistics, the extent of sharing of candidate sweep signals thus appeared to be difficult to separate from levels of population divergence (as measured by FST). As overlap of tail genomic regions for the XP-EHH YRI statistic did not appear to be related to population divergence, we suspected that patterns might be more related to the Bantu expansion, which resulted in extensive admixture within Africa over the past several thousand years [24, 28]. We found a significant negative Mantel correlation of tail overlap (in the top 1%) with the average Bantu ancestry of the populations in the pair (p-value = 0.0006; see Table C.3 and Figure C.11). The more gene flow a pair of populations had with the Bantu (for which the YRI are considered a proxy for a source population), the fewer empirical tail genomic windows they shared. We then assessed whether between-population patterns in the empirical tails, expected to reflect positive selection, were qualitatively different from between-population patterns across all genomic windows (summarized by the correlation of two populations’ window statistic values across the whole genome). For all statistics, there was a significant positive Mantel correlation between 1% tail overlap and correlation of the two populations’ genome-wide window statistics (Table C.3, Figure C.10). Outlying genomic windows thus did not exhibit between-population patterns that were different from those of genome-wide windows. A more extreme tail may reveal more meaningful signatures of selection, as overlap in the top 0.1% was less related to the genome-wide correlation of two populations’ window statistics (Table C.3; see Appendix C Results for related analyses).

Neutral Coalescent Simulations

To understand the relationship between FST and the extent of overlap of empirical tail genomic windows under neutrality, we performed neutral coalescent simulations of genome-wide SNP data (see Methods). XP-EHH and iHS were calculated for two daughter populations with various divergence times, and a third outgroup “reference” population was simulated to mimic the XP-EHH CEU statistic (Figure C.12).

As the two daughter populations became more highly diverged (i.e., had a larger FST value), the number of windows overlapping in their 1% empirical tails for XP-EHH decreased predictably (Figure 4.6 and Figure C.13). This trend was also apparent for iHS, although there was no significant difference in overlap for different FST values (Figure 4.5 and Figure C.14). This negative relationship between tail overlap and FST was even stronger for less extreme empirical tails (5%), and weaker for more extreme tails (0.1%), as in our observed data; this is likely because more extreme tails contained fewer windows (Figures C.13 and C.14). Thus, depending on their levels of divergence, populations can be expected to share some genomic windows in their empirical tails, even without positive selection acting in parallel on those windows. In both observed and simulated neutral data, there was a similar relationship between population divergence and the extent of sharing of outlying genomic windows between populations. However, there were more overlapping tail windows in the observed data than in the neutral simulations for both XP-EHH and iHS (Figures 4.5 and 4.6). While the simulations were modeled after the XP-EHH CEU statistic, simulated overlap values were comparable to those observed for XP-EHH KHB.

Selective Sweep Coalescent Simulations

To understand the relationship between FST and the extent of overlap of empirical tail genomic windows given selective sweeps, we performed coalescent simulations as above with hard sweeps occurring in one of the two daughter populations, and the other populations evolving neutrally (see Methods). We simulated sweeps with a relatively high selection coefficient (s = 0.2) to enable them to be detected by both XP-EHH and iHS and to allow more efficient simulations (see Appendix C Methods). We also simulated a relatively recent selection time (≈ 62 generations ago), preventing most selected alleles from reaching fixation so that they could be detected by iHS (Appendix C Methods, Figures C.15 and C.16; Voight et al. [291] and Sabeti et al. [233]).

For both XP-EHH and iHS, there was no relationship between FST and the number of 100 kb windows overlapping in the tails of the daughter populations. The extent of overlap appeared consistent with that expected by chance under independence (Figures 4.5, 4.6, C.15 and C.16). For XP-EHH, the number of overlapping windows was considerably lower than that of observed data and neutral simulations (Figure 4.6); for iHS, the number of overlaps was only slightly lower (Figure 4.5). In simulations of less recent positive selection (144 generations ago), the true positive rate of detecting sweeps for iHS was low – genomic windows simulated under neutrality, as well as selection, appeared in the empirical tails (Figure C.18, see also Figure C.17). Because of these false positive windows, the relationship of FST with the extent of sharing of tail windows between populations was more apparent (Figure C.18, see Appendix C Results). With the same simulation parameters for XP-EHH, the true positive rate remained high, and there was no relationship between FST and overlap of tail windows.
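
The overlap "expected by chance under independence" can be made concrete with a short calculation. If each of two populations has K tail windows chosen independently among N windows, the number shared is hypergeometric with mean K²/N. The Python sketch below computes this and checks it by simulation; the figure of 25,000 windows is an assumption inferred from the roughly 250 windows in a 1% tail, and the function names are hypothetical.

```python
import numpy as np

def expected_overlap(n_windows, tail_fraction):
    """Mean number of shared tail windows if the two tails are independent."""
    k = round(n_windows * tail_fraction)
    return k * k / n_windows

def simulated_overlap(n_windows, tail_fraction, n_rep=2000, seed=0):
    """Monte Carlo check: draw two independent tails and count shared windows."""
    rng = np.random.default_rng(seed)
    k = round(n_windows * tail_fraction)
    counts = [
        len(np.intersect1d(rng.choice(n_windows, k, replace=False),
                           rng.choice(n_windows, k, replace=False)))
        for _ in range(n_rep)
    ]
    return float(np.mean(counts))
```

Under these assumptions, two independent 1% tails of 250 windows each would share about 2.5 windows on average, which gives a baseline against which overlap counts can be judged.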

4.3.3 Enrichment of Biological Functions

Another approach used to integrate results of genome-wide selection scans asks whether any biological functions (Gene Ontology biological process terms, or GO terms) appear to be more frequent targets of positive selection than expected by chance (http://www.geneontology.org). Within each population, we used a novel permutation procedure (see Methods and Figure 4.1) to search for biological functions significantly associated with genes in outlying genomic windows (according to values of iHS, XP-EHH CEU, YRI, MKK and KHB). We focused on the 0.1% tail, motivated by the lower correlation of between-population patterns of this more extreme tail with those of the whole genome (see previous section and Table C.3). Many GO terms appeared to be under strong selection within populations (Tables 4.1 and 4.2 and Table C.4). For example, the Mbuti Pygmies showed an enrichment of genes related to phosphate transport, and the Namibian San showed an enrichment for terms related to apoptosis (Table 4.1). In the Mandenka, several GO terms related to antigen processing were enriched (Table 4.2). Several terms appeared to be under selection in multiple populations. In the Biaka Pygmies, Mbuti Pygmies, ≠Khomani Bushmen, Mandenka, and HapMap YRI, terms related to transcription were enriched primarily in the tails for XP-EHH CEU (Table C.4). In the Mandenka and Yoruba, immune response was significantly enriched in top windows for XP-EHH YRI and XP-EHH MKK, respectively. In the Namibian San and the Yoruba, the term brain development was enriched, while the ≠Khomani Bushmen and HapMap YRI both showed enrichment for ubiquitin cycle. The term sodium ion transport was enriched in both the Kenyan Bantu and HapMap YRI, which could motivate future studies in light of the salt retention hypothesis (Table C.4; Thompson et al. [271]). 
No GO terms were significant in the Hadza for any statistic, and terms were only significant for the Kenyan Bantu at less extreme p-value cutoffs (see Appendix C Results and Table C.4).

We found no significantly enriched terms in the empirical tails of iHS for the HapMap YRI at any significance level (Table 4.2). In contrast, in their iHS analysis of the same population, Voight et al. [291] found enrichment of several terms including olfaction and MHC-1 mediated immunity. There could be several reasons for this lack of agreement. While we searched for enrichment among genomic windows, Voight et al. [291] searched for enrichments among genes using statistics from 50 contiguous SNPs (see Appendix C Methods); they also analyzed a larger set of SNPs. Since we were able to replicate the results of Voight et al. [291] using our SNP dataset and their enrichment procedure (see Appendix C Methods and Results), the primary cause of the difference in results appeared to be the statistical method for assessing enrichment.

4.4 Discussion

We computed statistics for hard selective sweeps and patterns of SNP variation in a diverse set of human African populations following the approaches of many recent genome-wide scans for positive selection [3, 207, 231, 232, 233, 234, 291]. Patterns of allele frequency differentiation and sharing of haplotype selection signals among African populations were consistent with expectations under neutrality, suggesting the presence of false positive signals in empirical tails. We confirmed that the effects of positive selection do contribute to some empirical tail loci by identifying biological functions significantly associated with them using a novel permutation procedure. However, population history can confound the interpretation of selection statistics, suggesting that greater care is needed in attributing broad genetic patterns to human adaptation.

4.4.1 Population Differentiation

Positive selection does not appear to have substantially shaped present-day allele frequency differences among the African populations in our dataset (Figures 4.2 and 4.3 and Figures C.1 and C.2). SNP density may partially explain this finding. While Coop et al. [52] found a significant enrichment of genic SNPs among those highly differentiated between HapMap CEU and YRI for Perlegen Type A SNPs (which have a uniform ascertainment based on a multi-ethnic discovery set; Hinds et al. [110]), our analysis of just the Illumina 550K genotype data for HGDP Yorubans and HapMap Tuscans showed no significant enrichment (Figure C.3). We were also unable to identify a significant genic enrichment for SNPs with extreme δ values for any pair of African populations (Figures C.1A, C.1B, C.3 and Appendix C Results). This suggests that in addition to SNP density, ascertainment bias could contribute to our inability to identify significant genic enrichments. In another analysis we discovered phenomena that may be unique to African populations. Extreme values of allele frequency differences between pairs of African populations were closely related to a pair’s overall divergence (as measured by mean FST), in near perfect concordance with data simulated under a neutral equilibrium model (Figure 4.3 and Figure C.4). (Since selective sweeps are believed to have been relatively rare in recent human evolution, mean FST is likely a valid measure of divergence [106].) In contrast, a study of worldwide populations using a comparable set of SNPs [52] suggested a stronger role of positive selection in increasing levels of allele frequency differentiation beyond that predicted by population history. In addition, most strong signals of differentiation in our dataset resulted from extreme allele frequencies in the Hadza and Mbuti Pygmies, which have each undergone recent extreme bottlenecks (see Henn et al. [103], Appendix C Results, and Figure C.2D). Our findings from SNP array data do not support positive selection as a more important determinant of allele frequency differentiation among African populations than population history. This cautions that population differentiation measures used with an empirical outlier approach could lead to false positive selection signals in this dataset.

4.4.2 Comparison of Empirical Tail Windows

With the haplotype statistics iHS and XP-EHH, we examined empirically-outlying genomic windows of 100 kb as candidates for selection. A common strategy used to interpret the generated lists of putatively selected loci is to compare them between populations (see, for example, Figure 4.4). Such comparisons have been used to generate hypotheses, such as shared selective pressures, to explain why particular regions of the genome might be under selection in multiple populations [52, 135, 207]. Analyzing coalescent simulations (performed with msms [72]) under models of both neutrality and local adaptation, we find that conclusions from comparative methods can be misleading if there are a large number of false positive signals. Under neutrality, empirical tail genomic windows are expected to be shared between populations, with the amount of sharing determined by a population pair’s level of divergence (Figures 4.5 and 4.6). In simulations under a model of local adaptation, with selective sweeps occurring in only one population of a pair, there was no relationship between divergence and the extent of sharing of outlier haplotype statistics if the statistics had a high ability to detect sweeps (i.e., a high true positive rate and a low false negative rate). Under this scenario, the amount of sharing of haplotype signals between populations was that expected by chance under independence regardless of FST value (Figures 4.5, 4.6, C.15, and C.16). In contrast, if haplotype statistics had insufficient ability to detect selection (due in our simulations to the timing of selection), a slight negative relationship was expected (Figure C.18). This occurred because false positive neutral genomic windows, rather than only those under selection, appeared in empirical tails. 
Lower power and a higher false positive rate may indeed be expected for sweeps with more realistic parameter values than those we simulated (chosen for efficiency and to maximize the ability of iHS and XP-EHH to detect selection, see Methods) [231, 233, 291]. Due to limitations of msms at the time of analysis, we were unable to simulate selection under a model of parallel, convergent selective sweeps at the same locus in multiple populations (see Methods). Given the above results, if statistics have a high true positive rate, empirical tail signals shared between populations should correctly reflect populations’ true amount of convergent adaptation and appear to have no relationship with population history. In African populations, we observed patterns of between-population divergence and overlap of outlier loci that were very similar to those simulated under neutrality (Figures 4.5 and 4.6; while simulated data did not precisely match observed data, differences were likely due to the simulations’ simplicity and non-uniform SNP density of observed data). The observed strong negative relationship between FST and extent of overlap of tail windows suggests that many signals in the 1% empirical tails may not be driven primarily by local (or convergent) adaptation – under which only a weak relationship would be expected (Figures 4.5 and 4.6). (In the more extreme 0.1% tail, this negative relationship was less pronounced.) Furthermore, between-population empirical tail overlap for the top 1% seemed to follow genome-wide correlations (Figure C.10 and Table C.3). In light of our simulations, false positive outlier loci seem to be a plausible explanation for the close relationship of population history with selective sweep patterns in African populations. Our results agree with Coop et al. [52] and Pickrell et al. [207], who found that selective sweep signals tend to cluster by broad geographic and continental regions, and who emphasize that demography can influence haplotype selection statistics (see also [51, 129]). When methods do not fully or accurately identify signals of selection, sharing of selection signals between populations can have an important contribution from population history. Between-population patterns of observed data and data simulated under both selection and neutrality can thus illuminate the prevalence of false positives in observed data. Previous studies may have spuriously attributed a common sweep signal among populations to selection because they did not examine these broader patterns. In light of this, summaries such as Figure 4.4 can be misleading (see Pickrell et al. [207]). Determining significance of selection with empirical p-values has also elicited criticism; since the true proportion of the genome that is under natural selection is unknown, so are the most appropriate empirical cutoffs (see Results; [3, 106, 159, 269, 272, 273]). 
Finally, it is not usual to calculate XP-EHH using multiple, diverse reference populations, as we do with XP-EHH versus Europeans, Yorubans, Maasai and ≠Khomani San. Previous studies of XP-EHH in Africans have primarily used an out-of-Africa reference population (i.e., XP-EHH CEU), which identifies different signals of selection from when using a more closely-related reference population (see Appendix C Results). Divergence time of the reference population also affects the amount of sharing between populations of outlying genomic windows: fewer tail windows are shared for XP-EHH using as references the Maasai and Yorubans, which are more recently diverged compared to Europeans and ≠Khomani Bushmen (see Figure 4.6 and Figures C.5, C.6, C.7, C.8). As seen with the Yorubans as a reference, gene flow may also play a role (Figure C.11). Thus, divergence time and demography of the reference population should be considered when interpreting comparisons of XP-EHH between populations.

4.4.3 Enrichment of Biological Functions

As an alternative approach to identifying important signals of selection, we searched for biological functions over-represented in the tail of genomic window distributions for the iHS and XP-EHH. Motivated by the haplotype statistic analyses and simulations, which suggested a high rate of false positives in the 1% empirical tail, we selected the top 0.1% as a more appropriate cutoff (see Results). Using a statistically rigorous permutation procedure (see Methods), we detected enrichment of many Gene Ontology (GO) biological process terms in outlying genomic windows of nearly all studied populations (except for the Hadza, likely because of their recent severe bottleneck; Henn et al. [103]); see Tables 4.1 and 4.2. Many of these enrichments suggest previously-unreported selective pressures within individual populations (see Results): examples include the terms blood coagulation in the Sandawe of Tanzania and response to virus in the Maasai. Such findings increase our confidence that some signals inferred using selective sweep statistics do indeed identify areas of the genome that are under selection in African populations. Immune response, for instance, was significantly enriched in empirical tail genomic windows of both the Mandenka and Yoruba, for which other terms such as B-cell differentiation and antigen processing and presentation were also enriched. Since the Mandenka and Yoruba both inhabit western Africa, this may indicate strong pressure from infectious or parasite-related diseases in this region, or may reflect recent selective pressure associated with adoption of agriculture only 5,000 years ago [89]. These enrichments agree with recent studies, which have highlighted pathogens as a primary driver of recent human evolution (Fumagalli et al. [87], Novembre and Han [190]), and which have presented evidence that HLA and other disease-related genes are under selection in West Africans [29]. By preserving the genomic features of our dataset, our permutation method accounts for important factors such as non-uniform SNP density, different gene lengths, the existence of regions of the genome where genes with similar functions are clustered, and the method used to assign empirical p-values to genomic windows (Figure 4.1). While our method follows that of Begun et al. [22], permutation methods are not often implemented for GO enrichment analyses in human genetics [170].
Those commonly used are designed for microarray studies, which present different statistical challenges from analyses of genomic windows, within which several genes can exist [117, 118, 170]. Fumagalli et al. [87], for example, used the method GONOME [256]; although this method accounts for gene size, it does not account for gene clustering within genomic windows (see also [80, 96, 97, 114, 153, 262, 296]). While McLean et al. [169] developed a method for enrichments among genomic segments, it is not appropriate for larger genomic windows containing multiple genes. Studies have acknowledged that these biases could result in false positive signals of enrichment. Voight et al. [291], for example, suggested that gene clusters like the HLA region may have contributed to their significant results (see also Metspalu et al. [170]). In fact, our results for outlying genomic regions for iHS in the HapMap YRI differed from those of Voight et al. [291], likely due to our use of different statistical methodology.

4.4.4 Complicating Factors for Studies of Selection

It has been suggested that selection may be more common outside of Africa than within Africa [52, 113, 134, 140, 179, 259, 303]. While we have identified plausible signals of selection in our African dataset, demographic and technical factors complicate the application of selection statistics.

The absence of observed genic enrichment for highly differentiated polymorphisms may be due to ascertainment bias, as the SNPs analyzed here were ascertained primarily in European populations and are unlikely to be in LD with unique alleles that may be under selection in Africans [68, 103, 188, 291]. A similar explanation has been invoked to explain the failure of genome-wide association studies (GWAS) in some African populations, e.g. for studies of malaria [184]. The lower LD of African populations may also make it difficult to identify SNPs undergoing allele frequency differentiation due to selection, to detect signatures of increased LD due to a sweep, and to infer chromosomal phasing accurately [103]. To address some of these concerns related to LD, we analyzed genomic windows of 100 kb, in contrast to previous studies that have used windows of twice this size [103, 170, 207]. Our analyses suggest that SNP density can also affect statistical significance, even with a relatively large dataset of nearly 500,000 SNPs and in populations where ascertainment bias is minimal (Figure C.3). Additionally, small sample sizes in some populations may have reduced our ability to accurately estimate allele frequency differentiation or detect hard selective sweeps. The power of the iHS and XP-EHH statistics to detect selective sweeps in African populations may also be limited if the selective events have occurred too far in the past (greater than ≈30,000 years ago; Sabeti et al. [231]). One such example could be the transition to a desert environment in the ≠Khomani Bushmen, which may have exerted selective pressure too long ago to be detected. All of these factors could contribute to the difficulty of unequivocally inferring positive selection.

While SNP density and ascertainment bias on existing genotyping arrays are influential, whole-genome sequencing may not be a panacea for the issues we raise (preliminary results; see also Lachance et al. [147]). The sensitivity of widely-used selective sweep statistics to population history, selection timing, phasing error, lower LD, and sample size will likely remain problematic.
Although we focus exclusively on hard selective sweeps, other models of selection, such as polygenic adaptation or adaptation based on standing genetic variation, also have the potential to uncover signals of positive selection [53, 87, 95, 96, 97, 106, 211, 212, 268]. Awareness of the issues raised here should also improve analyses of these alternative models of selection.

4.5 Data Access

All genotype data has been published in previous work and is publicly available. See Henn et al. [103], International HapMap 3 Consortium et al. [121], International HapMap Consortium [122], International HapMap Consortium et al. [123], Li et al. [157], Schuster et al. [242].

4.6 Acknowledgments

We would like to thank Nandita Garud, Laura Rodríguez-Botigué, Philip Greenspoon, Philipp Messer, David Enard, Martin Sikora, and Jonathan Pritchard for helpful comments and discussions. J.M.G. is supported by the National Science Foundation (NSF) Graduate Research Fellowship, grant number DGE-0645962. B.M.H. and C.D.B. are supported by National Institutes of Health (NIH) grant R01HG003229. C.R.G. is supported in part by NIH Training Grant T32 GM007175 and the University of California, San Francisco Chancellor’s Research Fellowship. J.M.K. is supported by NIH Grant T32HG000044. This research is supported in part by NIH grant GM28016 to M.W.F.

4.7 Figures and Tables

[Figure 4.1 (flowchart): (1) take all non-overlapping 100 kb windows with > 0 SNPs among the SNPs used in the statistic calculation for the population; (2) bin windows by number of SNPs per window (1-10 SNPs, 11-20 SNPs, ..., > 50 SNPs); (3) randomly select x% of windows from each bin; (4) count GO term occurrences over all random windows, and repeat 20,000 times; (5) for each GO term, obtain a p-value for the observed count from the distribution of the number of occurrences over the 20,000 permutations; (6) apply FDR correction to obtain significantly enriched GO terms.]

Figure 4.1: Permutation procedure used to assess significance of enrichment of Gene Ontology (GO) terms in the top x% of 100 kb genomic windows. Procedure is performed for a particular statistic (iHS, XP-EHH CEU, XP-EHH YRI, XP-EHH MKK, or XP-EHH KHB) for a particular population. Note that the third and fourth boxes demonstrate the method used to assign p-values to genomic windows based on the number of SNPs per window; see Methods for details.
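
The procedure in Figure 4.1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the analyses: the function name `go_permutation_pvalues` and its arguments are hypothetical, the bin edges at 30 and 40 SNPs are guesses for the bins elided in the figure, at least one window is drawn per bin, a term is counted once per window that contains an annotated gene (following the "Windows" column of Table 4.1), and the p-value is taken to be the fraction of permutations whose count is at least the observed count.

```python
import numpy as np
from collections import Counter

def go_permutation_pvalues(snp_counts, window_terms, observed_counts,
                           frac=0.001, bin_edges=(10, 20, 30, 40, 50),
                           n_perm=20000, seed=0):
    """Permutation p-values for GO term counts in tail windows.

    snp_counts      : number of SNPs in each window (all > 0)
    window_terms    : list of sets; GO terms annotated to genes in each window
    observed_counts : dict mapping GO term -> number of observed tail windows
    """
    rng = np.random.default_rng(seed)
    snp_counts = np.asarray(snp_counts)
    # Bins: 1-10, 11-20, ..., > 50 SNPs per window (inner edges are assumed).
    bins = np.digitize(snp_counts, bin_edges, right=True)
    bin_members = [np.flatnonzero(bins == b) for b in np.unique(bins)]
    exceed = Counter()
    for _ in range(n_perm):
        null = Counter()
        for members in bin_members:
            # Draw the same fraction of windows from every SNP-density bin.
            k = max(1, int(round(frac * members.size)))
            for w in rng.choice(members, size=k, replace=False):
                null.update(window_terms[w])  # one count per window per term
        for term, obs in observed_counts.items():
            if null[term] >= obs:
                exceed[term] += 1
    return {term: exceed[term] / n_perm for term in observed_counts}

# Toy example: 100 windows in a single SNP bin, 10% sampled per permutation.
toy_terms = [{"GO:A"} for _ in range(100)]
toy_terms[0] = {"GO:A", "GO:B"}
toy_p = go_permutation_pvalues([5] * 100, toy_terms, {"GO:A": 10, "GO:B": 5},
                               frac=0.1, n_perm=200)
print(toy_p)
```

Because whole windows are resampled within SNP-density bins, genes that are clustered in one window contribute a single count, which is the property the Discussion highlights.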

[Figure 4.2: two panels. (A) Enrichment (y-axis, 0.0 to 2.0) versus FST (x-axis, 0.0 to 0.5) for genic and non-genic SNPs. (B) Enrichment (y-axis, 0.0 to 2.0) versus average |δ| (x-axis, 0.0 to 0.5) for genic and non-genic SNPs.]

Figure 4.2: Enrichment of genic and non-genic SNPs for values of FST and for average values of derived allele frequency difference (|δ|) between pairs of populations. (A) Enrichments of genic and non-genic SNPs for values of FST (calculated across all 11 populations for each SNP), divided into bins of width 0.05. Dashed lines indicate upper and lower 95% confidence intervals for each of the genic and non-genic categories, assessed by bootstrap re-sampling (see Methods). (B) Enrichments for average |δ| values (|δ| is calculated between all pairs of populations for each SNP; average is taken over the pairs), divided into bins of width 0.03. Dashed lines are as in (A).

[Figure 4.3: scatter plot of the upper 99.99% tail value of |δ| (y-axis) versus mean FST (x-axis) for observed and simulated data.]

Figure 4.3: Extreme |δ| values versus mean FST for pairs of populations. Each point is a different population pair (55 pairs in total). Y-axis is the upper 99.99% tail value (over all autosomal SNPs) of derived allele frequency difference (|δ|) between the pair of populations; X-axis is mean FST (over all SNPs) between the two populations. Black points are observed data; red points are the mean of data points simulated using the beta-binomial method (described in Methods) over 25 independent simulations (error bars indicate +/- 2 standard deviations over the simulations). Best-fit lowess curves are drawn through the points of the same color.

chrm | begin | end | Biaka | Mbuti | Hadza | Sandawe | Namib. San | Khom. Bushm. | SA Bantu | Kenyan Bantu | Mandenka | Yoruba | Genes
chr9 | 7E+07 | 7.3E+07 | ###### | NONE | NONE | 0.0699 | NONE | NONE | NONE | 0.0633 | ###### | ###### | TRPM3
chr16 | 7E+07 | 7.2E+07 | ###### | 0.0194 | 0.0121 | 0.0409 | 0.0012 | 0.0236 | 0.0062 | NONE | NONE | 0.0669 | ZFHX3
chr5 | 1E+08 | 1.5E+08 | 0.0002 | NONE | NONE | NONE | 0.0188 | 0.0018 | NONE | 0.0238 | NONE | NONE | SPINK1,JAKMIP2
chr7 | 1E+08 | 1.4E+08 | 0.0002 | 0.0002 | 0.0019 | 0.0005 | 0.0022 | 0.0029 | 0.0048 | 0.0103 | 0.02 | 0.0099 | TPK1
chr9 | 3E+06 | 2900000 | 0.0004 | 0.0596 | NONE | NONE | NONE | 0.0199 | NONE | 0.0354 | 0.0305 | NONE | KCNV2,KIAA0020
chr4 | 1E+08 | 1.5E+08 | 0.0005 | ###### | 0.0334 | NONE | 0.008 | 0.0006 | 0.0206 | NONE | NONE | NONE | SMAD1,C4orf51,MMAA
chr11 | 7E+07 | 7.3E+07 | 0.0048 | ###### | 0.0745 | 0.0314 | NONE | 0.0051 | NONE | 0.0375 | NONE | 0.0308 | MIR139,INPPL1,FCHSD2...
chr12 | 1E+08 | 1.3E+08 | NONE | 0.0002 | NONE | 0.062 | NONE | NONE | NONE | 0.0118 | NONE | NONE | PIWIL1,RIMBP2
chr12 | 2E+07 | 1.7E+07 | 0.0028 | 0.0003 | NONE | NONE | 0.0027 | 0.0043 | NONE | 0.0166 | 0.0504 | NONE | LMO3
chr1 | 6E+07 | 5.7E+07 | 0.0367 | NONE | ###### | 0.0494 | 0.0714 | 0.0583 | 0.0651 | NONE | NONE | NONE | C8B,C8A,C1orf168
chr1 | 2E+08 | 2E+08 | NONE | NONE | ###### | 0.0752 | NONE | NONE | 0.0496 | ###### | 0.0019 | 0.0011 | DENND1B,CRB1
chr14 | 6E+07 | 6.7E+07 | 0.0131 | NONE | ###### | NONE | NONE | NONE | 0.0857 | NONE | 0.0227 | 0.0118 | LOC645431,C14orf53,GPHN,FUT8
chr15 | 4E+07 | 3.6E+07 | NONE | NONE | 0.0002 | 0.0539 | NONE | NONE | NONE | NONE | 0.0511 | 0.0267 |
chr5 | 1E+08 | 1E+08 | 0.0994 | NONE | 0.0003 | NONE | NONE | 0.031 | NONE | NONE | NONE | NONE |
chr4 | 4E+07 | 3.7E+07 | NONE | 0.022 | 0.0651 | ###### | 0.0854 | 0.0622 | 0.0169 | 0.0882 | NONE | NONE |
chr3 | 1E+08 | 1.2E+08 | 0.0194 | NONE | NONE | ###### | 0.0171 | NONE | NONE | NONE | 0.0749 | NONE | ZBTB20
chr11 | 1E+08 | 1.3E+08 | NONE | NONE | NONE | ###### | NONE | 0.0286 | 0.0463 | NONE | NONE | NONE |
chr11 | 6E+07 | 5.6E+07 | NONE | NONE | NONE | 0.0002 | NONE | NONE | 0.0202 | NONE | 0.0205 | NONE | OR8K1,OR8U1,OR5M11...
chr5 | 2E+07 | 2.4E+07 | NONE | NONE | NONE | 0.0004 | NONE | NONE | 0.0547 | 0.0642 | 0.055 | 0.0853 | PRDM9
chr17 | 1E+06 | 1700000 | NONE | 0.0336 | ###### | 0.0265 | ###### | ###### | 0.0452 | ###### | 0.0089 | 0.0086 | TLCD2,MIR22,SERPINF2...
chr4 | 1E+08 | 1.1E+08 | NONE | 0.0079 | 0.0227 | NONE | ###### | 0.0012 | NONE | NONE | NONE | NONE |
chr14 | 8E+07 | 8.5E+07 | 0.0011 | 0.0079 | 0.0103 | 0.0054 | ###### | 0.0011 | 0.0249 | NONE | 0.0175 | 0.0343 |
chr12 | 6E+07 | 6.5E+07 | 0.0424 | 0.0296 | 0.0705 | 0.055 | 0.0002 | 0.0206 | NONE | NONE | 0.0459 | NONE | TMBIM4,LLPH,HELB,IRAK3
chr2 | 7E+07 | 7.3E+07 | NONE | NONE | NONE | 0.0467 | 0.0003 | 0.0021 | 0.0162 | 0.0046 | 0.0119 | 0.0629 | EXOC6B,SPR,EMX1
chr22 | 4E+07 | 4.1E+07 | 0.0399 | 0.0332 | NONE | ###### | 0.0268 | ###### | 0.0674 | NONE | 0.0759 | 0.0183 | SERHL,TCF20,RRP7B...
chr6 | 6E+07 | 5.7E+07 | NONE | 0.0268 | NONE | 0.0699 | 0.0235 | ###### | NONE | 0.0645 | NONE | NONE | RAB23,DST,BAG2,ZNF451...
chr1 | 7E+07 | 6.7E+07 | NONE | 0.0225 | NONE | NONE | 0.0458 | ###### | NONE | 0.0416 | NONE | NONE | SGIP1,INSL5,TCTEX1D1...
chr8 | 6E+07 | 5.7E+07 | NONE | NONE | NONE | NONE | 0.001 | 0.0002 | NONE | NONE | NONE | NONE | LYN,TGS1,TMEM68
chr4 | 9E+07 | 8.6E+07 | NONE | 0.0202 | NONE | NONE | 0.0017 | 0.0003 | 0.0234 | NONE | NONE | NONE | WDFY3,C4orf12,CDS1
chr10 | 1E+08 | 1.2E+08 | NONE | 0.0322 | 0.0173 | 0.0267 | 0.0315 | NONE | ###### | NONE | 0.0907 | NONE | C10orf84
chr17 | 3E+07 | 2.7E+07 | 0.0033 | 0.0075 | NONE | 0.0778 | 0.0647 | 0.0356 | ###### | NONE | 0.0035 | 0.0004 | OMG,EVI2A,ATAD5,ADAP2...
chr18 | 2E+07 | 1.8E+07 | NONE | 0.0432 | 0.0816 | NONE | 0.043 | NONE | ###### | NONE | NONE | NONE | CTAGE1
chr17 | 6E+07 | 6.5E+07 | NONE | 0.0109 | NONE | NONE | NONE | NONE | 0.0002 | NONE | 0.0249 | 0.0468 | ABCA8,ABCA9
chr2 | 2E+08 | 2.3E+08 | 0.0679 | 0.058 | NONE | 0.0845 | NONE | NONE | 0.0003 | 0.0937 | 0.0503 | NONE | COL4A4,RHBDD1,IRS1
chr6 | 9E+07 | 9.1E+07 | NONE | NONE | NONE | NONE | NONE | 0.0359 | 0.0078 | ###### | 0.0896 | 0.0048 | BACH2
chr14 | 7E+07 | 7.1E+07 | ###### | 0.0944 | NONE | 0.0312 | 0.0297 | 0.0391 | NONE | ###### | ###### | ###### | LOC145474,SIPA1L1...
chr13 | 5E+07 | 4.7E+07 | NONE | NONE | NONE | 0.0471 | 0.0679 | NONE | 0.0039 | 0.0002 | ###### | 0.0157 | HTR2A,LRCH1,ESD
chr16 | 2E+07 | 2.4E+07 | NONE | 0.0747 | NONE | 0.0142 | NONE | 0.0699 | 0.0783 | 0.0002 | NONE | 0.0417 | UBFD1,PLK1,GGA2,CHP2...
chr8 | 2E+07 | 2.3E+07 | NONE | 0.0267 | NONE | 0.0591 | NONE | NONE | NONE | 0.0003 | NONE | NONE | FLJ14107,C8orf58,BIN3...
chr15 | 9E+07 | 9.1E+07 | 0.0054 | 0.0413 | NONE | 0.0925 | NONE | NONE | NONE | NONE | ###### | NONE | CHD2,RGMA
chr19 | 5E+07 | 4.8E+07 | NONE | NONE | NONE | NONE | NONE | NONE | 0.0876 | ###### | ###### | 0.0152 | GSK3A,ERF,CEACAM5,CIC....
chr2 | 2E+08 | 2.4E+08 | 0.0452 | 0.0109 | 0.0901 | NONE | NONE | NONE | NONE | ###### | 0.0002 | 0.0236 | RBM44,RAMP1,LRRFIP1
chr3 | 6E+07 | 5.7E+07 | NONE | NONE | NONE | NONE | NONE | NONE | NONE | NONE | 0.0003 | 0.0001 | CCDC66,ARHGEF3,C3orf63
chr2 | 2E+08 | 1.6E+08 | 0.027 | 0.0364 | NONE | ###### | 0.024 | 0.0055 | NONE | 0.0844 | 0.0003 | 0.0245 | KCNJ3
chr11 | 1E+07 | 1E+07 | NONE | NONE | 0.0792 | NONE | NONE | NONE | 0.0159 | NONE | 0.0004 | ###### | SBF2,SWAP70,ADM
chr2 | 2E+08 | 1.7E+08 | NONE | 0.064 | NONE | 0.0975 | NONE | 0.0513 | NONE | NONE | NONE | ###### | LOC91149,RAPGEF4
chr17 | 3E+06 | 3700000 | NONE | NONE | NONE | NONE | NONE | NONE | NONE | 0.002 | 0.0186 | ###### | TRPV1,TAX1BP3,CTNS...
chr3 | 9E+07 | 8.8E+07 | 0.0581 | NONE | 0.0288 | 0.0029 | NONE | 0.0204 | NONE | 0.0224 | 0.0241 | 0.0004 | POU1F1,CHMP2B
chr3 | 5E+07 | 4.6E+07 | 0.006 | 0.0012 | 0.0681 | 0.0244 | NONE | 0.0797 | 0.0126 | 0.0465 | 0.0006 | 0.0089 | FYCO1,CCR5,SLC6A20...
chr13 | 6E+07 | 5.7E+07 | NONE | NONE | NONE | 0.0398 | 0.0779 | NONE | NONE | 0.0285 | 0.0004 | 0.0007 | PRR20A,PRR20B,PRR20E...
chr16 | 2E+07 | 2.3E+07 | NONE | NONE | 0.0037 | NONE | NONE | NONE | 0.019 | 0.0539 | ###### | ###### | HS3ST2,USP31
chr17 | 2E+07 | 1.7E+07 | 0.0462 | 0.0569 | 0.0253 | NONE | NONE | NONE | NONE | NONE | 0.0043 | 0.0298 | TNFRSF13B,MPRIP,PLD6...
chr6 | 2E+08 | 1.6E+08 | 0.0072 | NONE | 0.0098 | 0.0069 | 0.0783 | NONE | 0.0066 | 0.0006 | 0.0238 | 0.0047 | TMEM181,SERAC1,TULP4...

Figure 4.4: P-values for the 5 most extreme 100 kb genomic windows according to the XP-EHH MKK statistic in each population. Red indicates p < 0.01, orange indicates p < 0.05, yellow indicates p < 0.10, and white indicates p > 0.10. “Genes” column lists the genes located within the indicated windows.

[Figure 4.5: five panels, each plotting the number of overlapping windows (1%) against FST. (A) iHS, observed data versus mean FST (significance marker ***; labeled pairs: MDK/YOR, MKK/SWE, MKK/MDK, BKN/BSA). (B, C) Simulated Neutral. (D, E) Simulated Local Selection.
Key: BPY − Biaka Pygmy; MPY − Mbuti Pygmy; HDZ − Hadza; SWE − Sandawe; SNB − Namibian San; KHB − Khomani Bushmen; BSA − Bantu South Africa; BKN − Bantu Kenya; MKK − Maasai; MDK − Mandenka; YOR − Yoruba.]

Figure 4.5: Number of shared 100 kb genomic windows in the top empirical 1% for iHS vs. mean FST for pairs of populations. X-axis is mean FST, Y-axis indicates the number of shared windows in the top empirical 1%. (A) Each point is a different population pair (some points are labeled, see key). A best-fit lowess curve is drawn through the points. Significance of the p-value for the Mantel correlation between the x and y variables is indicated in the upper right corner: “NS”: not significant, “*”: < 0.05, “**”: < 0.01, “***”: < 0.001. (B) Boxplot of overlap in the top empirical 1% of iHS between populations with the indicated simulated FST values, for 100 independent neutral simulations. (C) Means +/- standard errors of overlap, for the neutral simulations of part (B). (D) Boxplot of overlap in the top empirical 1% of iHS between populations with the indicated simulated FST values, for 100 independent simulations of a selective sweep in only one population with selection strength s = 20% and time of selection 62 generations ago. (E) Means +/- standard errors of overlap, for the simulations in part (D).
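
The Mantel correlation reported in panel (A) can be illustrated with a generic permutation implementation. This sketch assumes a Pearson correlation between the upper triangles of two symmetric population-by-population matrices, a two-sided test, and joint permutation of rows and columns of one matrix; the specific variant and number of permutations used for the figure are not stated in this section, and the function name `mantel_test` is hypothetical.

```python
import numpy as np

def mantel_test(x, y, n_perm=9999, seed=0):
    """Two-sided Mantel test between two symmetric matrices.

    x, y : (n, n) symmetric arrays, e.g. pairwise mean FST and pairwise
           counts of shared tail windows. Returns (r_observed, p_value).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.shape[0]
    iu = np.triu_indices(n, k=1)          # each population pair once
    r_obs = np.corrcoef(x[iu], y[iu])[0, 1]
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)
        # Permute population labels: rows and columns of y move together.
        y_perm = y[np.ix_(perm, perm)]
        r = np.corrcoef(x[iu], y_perm[iu])[0, 1]
        if abs(r) >= abs(r_obs) - 1e-12:
            hits += 1
    # The observed arrangement is counted as one of the permutations.
    return r_obs, (hits + 1) / (n_perm + 1)

# Toy example: overlap decreases linearly with a made-up FST matrix.
pos = np.array([0.0, 1.0, 3.0, 7.0, 12.0, 20.0, 33.0])
fst = np.abs(pos[:, None] - pos[None, :]) / 100.0
overlap = 40.0 - 100.0 * fst
r, p = mantel_test(fst, overlap, n_perm=999, seed=1)
print(round(r, 3), p < 0.05)
```

Permuting population labels, rather than individual pairs, respects the non-independence of pairwise values that share a population.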

[Figure 4.6: six panels, each plotting the number of overlapping windows (1%) against FST. (A) XP-EHH CEU, (B) XP-EHH YRI, (C) XP-EHH MKK, and (D) XP-EHH KHB show observed data versus mean FST, with some population pairs labeled (e.g., MDK/YOR, SNB/KHB, BKN/MKK, MKK/SWE, BPY/SNB, BSA/SWE) and Mantel significance markers in each panel. (E) Simulated Neutral. (F) Simulated Local Selection.]

Figure 4.6: Number of shared 100 kb genomic windows in the top empirical 1% for XP-EHH vs. mean FST for pairs of populations. X-axis is mean FST, Y-axis indicates the number of shared windows in the top empirical 1%. (A-D) Each point is a different population pair (some points are labeled, see key and legend of Figure 4.5). (E) Boxplot of overlap in the top empirical 1% of XP-EHH between populations with the indicated simulated FST values, for 100 independent neutral simulations. (F) Boxplot of overlap in the top empirical 1% of XP-EHH between populations with the indicated simulated FST values, for 100 independent simulations of a selective sweep only in one population with selection strength s = 20% and time of selection 62 generations ago.

Table 4.1: Significantly enriched Gene Ontology (GO) terms in the top empirical 0.1% of windows in each population.

Population | Statistic | GO term* | Name of GO term | p value | Windows | Genes
Biaka Pygmy | XP-EHH CEU | GO:0045893 | positive regulation of transcription, DNA-dependent | 0.0002 | 8 | TGFB1I1, TCF3, PCBD2, NSD1, TFEB, FOXK1, CREB5, SMARCD3
Biaka Pygmy | XP-EHH KHB | GO:0006412 | translation | 0.01035 | 3 | AC107956.8, EIF4A1, AL121593.10
Mbuti Pygmy | XP-EHH CEU | GO:0006366 (P) | transcription from RNA polymerase II promoter | 0.00085 | 3 | HLF, NFIX, NFKB1
Mbuti Pygmy | XP-EHH CEU | GO:0007186 | G-protein coupled receptor protein signaling pathway | 0.02015 | 4 | CENTD2, NFIX, GPR35, NFKB1
Mbuti Pygmy | XP-EHH MKK | GO:0006817 | phosphate transport | 0.0067 | 2 | COL9A3, COL6A1
Sandawe | XP-EHH YRI | GO:0007596 | blood coagulation | 0.01185 | 2 | ADORA2A, IL22RA2
Sandawe | XP-EHH KHB | GO:0046058 | cAMP metabolic process | 0.0001 | 2 | PTH, PTHLH
Namibian San | XP-EHH CEU | GO:0042981 | regulation of apoptosis | 0.00775 | 2 | TNFSF5IP1, MALT1
Namibian San | XP-EHH YRI | GO:0006916 | anti-apoptosis | 0.0135 | 2 | MALT1, YWHAZ
Namibian San | XP-EHH YRI | GO:0006754 | ATP biosynthetic process | 0.0002 | 2 | ATP5B, SLC25A13, ATP6V1B2
Namibian San | XP-EHH MKK | GO:0001764 | neuron migration | 0.00525 | 2 | ITGA3, NTN1, MYH10, PAFAH1B1, YWHAE, CDK5R1, RELN, TWIST1
Namibian San | XP-EHH MKK | GO:0030036 | actin cytoskeleton organization | 0.01555 | 2 | CAP1, CRK
Namibian San | XP-EHH MKK | GO:0007420 | brain development | 0.00905 | 2 | PPT1, RELN
≠Khomani Bushmen | XP-EHH YRI | GO:0043687 | post-translational protein modification | 0.00135 | 2 | UBE2L3, UBE2E2
≠Khomani Bushmen | XP-EHH YRI | GO:0051246 | regulation of protein metabolic process | 0.00135 | 2 | UBE2L3, UBE2E2
≠Khomani Bushmen | XP-EHH YRI | GO:0006457 | protein folding | 0.0143 | 2 | DNAJC1, PPIL2
≠Khomani Bushmen | XP-EHH YRI | GO:0006512 | ubiquitin cycle | 0.0193 | 3 | HECW2, UBE2L3, UBE2E2
SA Bantu | XP-EHH MKK | GO:0006814 | sodium ion transport | 0.01905 | 2 | SLC4A5, ACCN5
Maasai | iHS | GO:0009615 | response to virus | 0.00225 | 2 | CREBZF, IFNA5
Maasai | iHS | GO:0007584 | response to nutrient | 0.0014 | 2 | LCT, UGT1A8
Maasai | iHS | GO:0042493 | response to drug | 0.00375 | 2 | LCT, UGT1A8
Maasai | XP-EHH YRI | GO:0006487 | protein N-linked glycosylation | 0.00045 | 2 | MGAT5, KEL

Significance determined as described in Methods and Figure 4.1, using an FDR cutoff of 0.05, within each population. A population is not shown if no terms were significantly enriched (i.e., the Hadza and Kenyan Bantu). The “Statistic” column indicates the statistic resulting in significance. *Next to a GO term, “(P)” indicates that the term was only significant when examining the PANTHER subset of terms; if nothing is next to the term, it indicates significance only when all GO terms were examined. We count the number of windows in the 0.1% empirical tail to which each term was associated (“Windows” column); rather than reporting those windows, we report the genes within those windows that were associated with the given GO term (in the “Genes” column).
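
The FDR step mentioned in the table note could look like the following sketch. The text states only that an FDR cutoff of 0.05 was applied within each population; the use of the Benjamini-Hochberg step-up procedure here is an assumption, and the function name `benjamini_hochberg` is hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Boolean mask of p-values declared significant at FDR level `alpha`.

    Assumption: Benjamini-Hochberg step-up procedure applied to all GO-term
    p-values from one population.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The i-th smallest p-value is compared with alpha * i / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    keep = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.flatnonzero(passed))   # largest rank that passes
        keep[order[:k + 1]] = True           # everything up to that rank
    return keep

# Toy p-values for five GO terms in one population.
mask = benjamini_hochberg([0.9, 0.001, 0.041, 0.008, 0.039], alpha=0.05)
print(mask.tolist())
```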

Table 4.2: Significantly enriched GO terms in the top empirical 0.1% of windows in each population, continued.

Population | Statistic | GO term* | Name of GO term | p value | Windows | Genes
Mandenka | XP-EHH CEU | GO:0007588 | excretion | 0.00125 | 2 | HMOX1, KCNK5
Mandenka | XP-EHH CEU | GO:0006813 | potassium ion transport | 0.00855 | 3 | ABCC8, KCNJ11, KCNJ12, KCNK5
Mandenka | XP-EHH CEU | GO:0045893 | positive regulation of transcription, DNA-dependent | 0.00805 | 2 | MAP2K3, TCF3
Mandenka | XP-EHH CEU | GO:0016311 | dephosphorylation | 0.01335 | 2 | PTPRS, PTPRN2
Mandenka | XP-EHH YRI | GO:0002504 (also P) | antigen processing and presentation of peptide or polysaccharide antigen via MHC Class II | 0.0 | 2 | HLA-DRB5, HLA-DRB9, HLA-DQB1, HLA-DQA1
Mandenka | XP-EHH YRI | GO:0019882 (also P) | antigen processing and presentation | 0.00075 | 2 | HLA-DRB5, HLA-DRA, HLA-DRB9, HLA-DQB1, HLA-DQA1
Mandenka | XP-EHH YRI | GO:0006955 (P) | immune response | 0.00575 | 3 | IL5, HLA-DRB5, HLA-DRA, HLA-DRB9, HLA-DQB1, HLA-DQA1
Mandenka | XP-EHH KHB | GO:0050850 | positive regulation of calcium-mediated signaling | 0.0 | 2 | FCER1, P2RX5
Mandenka | XP-EHH KHB | GO:0009991 | response to extracellular stimulus | 0.0 | 2 | RASGRP4, RPS19
Yoruba | XP-EHH MKK | GO:0051056 | regulation of small GTPase mediated signal transduction | 0.0053 | 2 | NF1, RAPGEF4
Yoruba | XP-EHH MKK | GO:0042552 | myelination | 0.0085 | 2 | SBF2, GAL3ST1
Yoruba | XP-EHH MKK | GO:0050890 | cognition | 5 × 10^-5 | 2 | CTNS, NF1
Yoruba | XP-EHH MKK | GO:0007420 | brain development | 0.00825 | 2 | CTNS, NF1
Yoruba | XP-EHH MKK | GO:0007507 | heart development | 0.01335 | 2 | PKP2, NF1
Yoruba | XP-EHH KHB | GO:0050850 | positive regulation of calcium-mediated signaling | 0.0002 | 2 | FCER1A, P2RX5
Yoruba | XP-EHH KHB | GO:0009653 | anatomical structure morphogenesis | 0.006 | 2 | GCM1, ROD1
Yoruba | XP-EHH KHB | GO:0007243 | intracellular protein kinase cascade | 0.0091 | 2 | GSG2, ICK
HapMap YRI | XP-EHH CEU | GO:0007588 | excretion | 0.0013 | 2 | SCNN1B, HMOX1
HapMap YRI | XP-EHH CEU | GO:0001666 | response to hypoxia | 0.00395 | 2 | SCNN1B, HMOX1
HapMap YRI | XP-EHH CEU | GO:0006814 | sodium ion transport | 0.02065 | 2 | SCNN1B, ACCN5
HapMap YRI | XP-EHH CEU | GO:0006355 | regulation of transcription, DNA-dependent | 0.00835 | 7 | CAMTA1, SCMH1, TCF3, JMJD2B, MCM5, AC093376.5, DLX5, DLX6
HapMap YRI | XP-EHH MKK | GO:0048593 | camera-type eye morphogenesis | 0.00015 | 2 | NF1, RING1
HapMap YRI | XP-EHH MKK | GO:0030199 | collagen fibril organization | 0.0003 | 2 | NF1, COL11A2
HapMap YRI | XP-EHH MKK | GO:0030183 | B cell differentiation | 0.00095 | 2 | CD79A, POU1F1
HapMap YRI | XP-EHH MKK | GO:0006836 | neurotransmitter transport | 0.00185 | 2 | SLC32A1, SLC6A20
HapMap YRI | XP-EHH MKK | GO:0006512 | ubiquitin cycle | 0.0021 | 4 | USP31, RING1, TULP4, HECW1
HapMap YRI | XP-EHH MKK | GO:0006869 (also P) | transport | 0.00385 | 2 | OSBP2, APOL5
HapMap YRI | XP-EHH MKK | GO:0006955 (also P) | immune response | 0.00525 | 3 | CD79A, CCR9, AL645940.4, AL645931.7, HLA-DPB1, HLA-DPA1
HapMap YRI | XP-EHH KHB | GO:0050850 | positive regulation of calcium-mediated signaling | 0.0001 | 2 | FCER1A, P2RX5
HapMap YRI | XP-EHH KHB | GO:0051056 | regulation of small GTPase-mediated signal transduction | 0.00505 | 2 | SIPA1L3, RASGRP4

5

Evolution of Height and Skin Pigmentation in the ≠Khomani Bushmen

Julie M. Granka∗, Christopher R. Gignoux§, Cedric J. Werely†, Marlo Möller†, Justin Myrick∗∗, Jeffrey M. Kidd§§, Eileen G. Hoal†, Carlos D. Bustamante††, Marcus W. Feldman∗, Brenna M. Henn‡

∗ Department of Biology, Stanford University, Stanford, CA, USA
§ University of California, San Francisco, CA, USA
† Division of Molecular Biology and Human Genetics, University of Stellenbosch, South Africa
∗∗ Department of Anthropology, University of California, Los Angeles, CA, USA
§§ Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
†† Department of Genetics, Stanford University, Stanford, CA, USA
‡ Department of Ecology and Evolution, State University of New York, Stony Brook, NY, USA

Abstract

No two traits are as immediately recognizable for their variability among and within human populations as height and skin pigmentation. Many evolutionary hypotheses have been proposed to explain their variability, but relatively little is known about the genetic basis of these traits across populations. The KhoeSan hunter-gatherers of southern Africa, believed to be one of the world’s most ancient human populations, have been the focus of extensive anthropological research for their subsistence strategy, “click” language, and great genetic diversity. While anthropologists have also documented the short stature and light skin pigmentation of the KhoeSan in comparison to most other African populations, the genetic basis of these phenotypes in this population remains completely unknown. For the first time, we analyze the heritability of height and skin pigmentation in 103 ≠Khomani Bushmen individuals of the Northern Cape of South Africa, and perform a genome-wide association analysis for these traits using almost 300,000 single nucleotide polymorphisms (SNPs). Using both genetic and kinship methods, we obtain estimates of heritability for height that agree with previous studies (≈ 0.80), find no evidence for heritability of darkening of skin pigmentation (tanning), and find that most variation in innate skin pigmentation (≈ 99%) has the potential to be explained by the SNPs in our dataset. Variation in amounts of European, Bantu, and KhoeSan ancestry contributes strongly to variation in innate skin pigmentation; however, continental ancestry does not contribute significantly to variation in height or tanning. Finally, in a genome-wide association study, we identify several loci associated with height that have large effect sizes of ≈ 5-10 cm and are located near genes previously identified to be associated with height. We find no loci with a significant effect on skin pigmentation variation; because of skin pigmentation’s high heritability and its strong relationship with ancestry, we believe admixture mapping to be more appropriate for future research. Our study emphasizes that analyses of phenotypically and genetically diverse endogamous populations have the potential to reveal novel insights into the genetic basis and evolutionary history of complex traits.

5.1 Introduction

A major aim of research in human genetics has been to characterize phenotypic variation by estimating the relative roles of genes and the environment, as well as natural selection, in affecting complex traits. No two traits are as immediately recognizable for their variability among and within human populations as height and skin pigmentation. Both easily measurable, the traits have been studied historically by physical anthropologists, and more recently by human geneticists using various methods including genome-wide association studies (GWAS), candidate gene approaches, animal models, and inferences of natural selection [14, 142, 149]. Despite their relatively frequent study, much remains to be understood about the genetic loci underlying their variation, as well as the role of natural selection and adaptation in maintaining this variation across and within human populations.

Human height shows great diversity among human populations. While the average height of European males is approximately 178 cm [287], average heights of “Pygmy” males of Africa, Southeast Asia, and South America range from 140-160 cm [204]. The latter variation has led to extensive research and a number of evolutionary explanations for the apparent convergence of short stature in Pygmies. One hypothesis is that a limitation in food resources may have exerted selective pressure for reduced body size to facilitate a reduction in caloric intake [204, 248]. Smaller stature could also have been an adaptation for increased mobility for tasks such as foraging, tree climbing, or maneuvering through dense forest cover [204]. Though it fails to explain the evolution of Pygmies in non-tropical environments, another hypothesis suggests that since thermoregulation by evaporative cooling is less effective in tropical environments, shorter stature may have evolved to generate less body heat [204]. A final hypothesis suggests that the Pygmy phenotype is a life-history trade-off: because of high mortality, growth cessation occurs earlier to allow for earlier reproduction [19, 171]. Interestingly, taller stature has itself been proposed to have been under positive selection in European populations [281]. Uncertainty regarding the environments of ancestral human populations complicates the assessment of the selective forces that have acted on human height.

Knowing whether and why certain populations may have evolved towards shorter or taller stature can be aided by understanding height’s genetic basis. Candidate gene studies and genome-wide association studies have identified many loci associated with stature in human populations [20, 130, 149, 183, 192, 281, 311]. However, although height is known to be highly heritable (narrow sense heritability is believed to be ≈ 80%), the proportion of phenotypic variance explained by associated genetic variants is only ≈ 5-10% [149, 310, 313]. Characterizing the genetic basis of height in diverse populations is crucial to reconstructing its evolutionary history.

Unlike human genetic variation in general, a majority of variation in human skin pigmentation can be explained by between-population structure [168]. Skin pigmentation is highly correlated with latitude, with equatorial populations having generally darker pigmentation [14, 222]. Due to this apparent relationship with ultraviolet radiation (UVR), a number of evolutionary hypotheses for variation in pigmentation have been proposed. Darker skin can provide a protection against sunburn and cancer, as well as against photolysis of the nutrient folate [127, 168, 199, 260]. Thus, in geographic regions with high levels of UVR, dark pigmentation could have been a response to any number of these selective pressures. Under low UVR conditions where sufficient Vitamin D is not acquired through diet, lighter skin may have evolved as an adaptation to ensure adequate Vitamin D synthesis [199]. Additional selective pressures on pigmentation are possible, including bacterial defense and sexual selection [199, 260].

As with height, human skin pigmentation is believed to be highly heritable [14]. A number of genes associated with human pigmentation, many with strong signals of selection, have been identified primarily in European and Asian populations [42, 52, 125, 168, 180, 207, 233, 260]. Again, debate regarding the selective pressures underlying human skin pigmentation could be resolved with an even more thorough characterization of its genetic basis, but the genetic determinants of pigmentation identified from a larger global sample will be needed [14, 125, 126, 168, 180, 199, 221, 222, 260].

We focus our study of height and pigmentation on “Bushmen” hunter-gatherers of the Kalahari desert, believed to be one of the world’s most ancient human populations based on their high genetic diversity and low levels of LD [101, 103, 143, 178, 242, 274, 275]. While the KhoeSan have long been studied by linguists for their Khoisan language (characterized by the presence of click consonants), anthropologists have also stressed the short “pygmoid” stature of the KhoeSan in comparison to Europeans and central African Bantu speakers [133, 193, 194, 239, 277, 300, 308]. Anthropologists have also emphasized the unique skin pigmentation of the KhoeSan, yellow in tone and relatively light, intermediate between that of European and central African populations [133, 193, 194, 204, 239, 277, 285, 300, 308]. Despite their potential to contribute to knowledge of possibly ancestral variation in human height and pigmentation, the KhoeSan have not been studied jointly for their phenotypic and genetic variation.

We examine height and skin pigmentation with genome-wide single nucleotide polymorphism (SNP) data in a sample of ≠Khomani Bushmen, a KhoeSan population in the Northern Cape of South Africa. The ≠Khomani Bushmen include two Khoisan-speaking groups: the N|u, who traditionally practiced a hunter-gatherer lifestyle, and the Nama, a widespread KhoeSan group associated with the spread of pastoralism in southern Africa. That the ≠Khomani Bushmen have experienced extensive admixture with neighboring Bantu groups during the past 800 years, as well as with European Dutch settlers during the past 500 years, can be used to further explore genetic contributions to phenotypic variation via admixture patterns [202]. For example, given knowledge of worldwide height and pigmentation variation, we hypothesize that taller stature may be correlated with increased European and Bantu ancestry, shorter stature with increased KhoeSan ancestry, dark pigmentation with increased Bantu ancestry, and light pigmentation with increased European and KhoeSan ancestry. The phenotypic diversity and unique population and admixture history of the ≠Khomani Bushmen make them an ideal population for studying the evolutionary history of, as well as the contribution of different ancestries to, height and pigmentation.

In a dataset of 103 individuals, we estimate heritability, explore relationships of skin pigmentation and height with genetic ancestry, and perform genome-wide association studies to identify loci associated with each phenotype (height, innate skin pigmentation, and darkening of skin pigmentation (tanning)). Our study underscores that analyses of phenotypic and genetic data in diverse human populations can provide novel insights into these two well-studied phenotypes.

5.2 Methods

5.2.1 Samples

Sampling of the ≠Khomani Bushmen took place in the Northern Cape of South Africa in the southern Kalahari Desert region (near Upington and neighboring villages) in 2006, 2010, and 2011. Institutional review board (IRB) approval was obtained from both Stanford University and the University of Stellenbosch, South Africa. ≠Khomani Bushmen N|u-speaking individuals, local community leaders, traditional leaders, nonprofit organizations, and a legal counselor were all consulted regarding the aims of the research before collection of DNA [103]. All individuals gave signed written consent, with a witness present, before participating. Individuals collected in 2006 were re-consented under an updated protocol. Ethnographic interviews of all individuals were conducted, including questions about age, language, place of birth, and ethnic group of the individual and of his/her mother, maternal grandmother, father, and paternal grandfather. We recorded the relationships between any sampled individuals if revealed during the interview. Ages of older individuals were verified with separate interviews regarding reproductive history. DNA was obtained via saliva, collected using Oragene saliva collection kits (DNAGenotek, Ontario, Canada).

A portable reflectance spectrophotometer (DermaSpectrometer DSMII ColorMeter, Cortex Technology, Hadsund, Denmark) was used to measure skin pigmentation (melanin content, M = log10(1/[% red reflectance])) in 103 individuals. The device was standardized to 0, as suggested by the manufacturer, twice a day while sampling. Five measurements of M were taken on each of the left and right upper inner arms (areas less affected by sunlight and representative of constitutive skin pigmentation), as well as on the dorsal side of the left or right wrist (an area more exposed to sunlight) for a measure of tanning [250]. Using an anthropometer, two to five replicate measurements of height (in centimeters) were taken of 102 individuals (the height of one individual could not be measured); measurements standing against a wall were not possible. Measurements of all individuals were taken by J.M.G., and recorded by a different individual. For the remainder of the analyses, we examined the trimmed phenotype means (highest and lowest values removed); we also averaged the inner arm skin pigmentation measurements over the two arms.

Eighteen individuals were previously genotyped on the Illumina 550K platform [103] and 68 individuals were genotyped on the Illumina OmniExpress array, for a total of 86 genotyped and phenotyped individuals. (Some individuals were genotyped on the 550K platform but had no corresponding phenotype data, for a total of 103 genotyped individuals.) Overlap between the two genotype datasets for all genotyped individuals was determined to be 298,346 SNPs using the program PLINK (version 1.07; Purcell et al. [214]; http://pngu.mgh.harvard.edu/purcell/plink). Both datasets were standardized to Human Genome Build hg19 (NCBI GRCh37).
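As a minimal sketch of the phenotype summary step described above (the text does not give the code actually used), the trimmed mean and the averaging over the two arms could be computed as follows. The function names and replicate values are hypothetical, and leaving fewer than three replicates untrimmed is an assumption, since the text does not say how individuals with only two height replicates were handled.

```python
from statistics import mean


def trimmed_mean(replicates):
    """Mean of the replicates after dropping one highest and one lowest value."""
    values = sorted(replicates)
    if len(values) < 3:
        # Assumption: with fewer than three replicates, nothing is trimmed.
        return mean(values)
    return mean(values[1:-1])


def constitutive_pigmentation(left_arm, right_arm):
    """Average the trimmed means of the left and right inner-arm M readings."""
    return (trimmed_mean(left_arm) + trimmed_mean(right_arm)) / 2


# Hypothetical readings for one individual (five M readings per arm).
left = [52.1, 53.0, 52.6, 51.9, 58.4]
right = [51.7, 52.2, 52.9, 52.4, 52.0]
arm_m = constitutive_pigmentation(left, right)
```

Trimming one value from each end limits the influence of a single aberrant reading without discarding most of the replicates.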

5.2.2 Relatedness

Pedigrees of all sampled individuals were constructed based on relationships recorded during ethnographic interviews. Estimates of identity by descent (IBD) from SNPs were calculated using the program BEAGLE [35]. These estimates were used to verify all relationships inferred from ethnographic data (as in Henn et al. [104]) and to infer and confirm additional ones. Pedigree plots and kinship matrices were constructed using the kinship package in R (http://www.r-project.org).

5.2.3 Ancestry Estimation

For estimation of individual ancestry proportions, data were merged with additional samples that could be informative about the admixture history of the ≠Khomani Bushmen. Populations included from the Centre d’Étude du Polymorphisme Humain Human Genome Diversity Project (CEPH-HGDP) (sample sizes in parentheses) were the South African Bantu (8), Kenyan Bantu (11), Namibian San (5), Mozabites (29), and French (28). We also included data of 12 Namibian San individuals from Schuster et al. [242], as well as individuals from the Hadza (17) and Sandawe (28) of Tanzania, described by Henn et al. [103]. Populations included from the HapMap were: Yoruba trio parents from Ibadan, Nigeria (YRI) (100), CEPH Utah residents with ancestry from northern and western Europe (CEU) trio parents (88), and Maasai trio parents from Kinyawa, Kenya (MKK) (46 individuals; related individuals NA21384, NA21475, NA21399, NA21365, NA21362, NA21382, NA21423, NA21453, NA21615, and NA21634 were removed; see Pemberton et al. [201]) [121, 122, 123].

After merging the SNPs genotyped in the HapMap, CEPH-HGDP, and South African samples, and removing SNPs with a missing genotype rate > 5% and SNPs with minor allele frequency < 1% (i.e. singletons), a total of 284,279 SNPs remained. All datasets were merged to Human Genome Build hg19 as above. We ran the program ADMIXTURE [5] on this merged dataset to estimate ancestral clusters and admixture proportions in all individuals. Only SNPs not in close linkage disequilibrium (LD) (r^2 < 0.9) were analyzed (all other SNPs were removed using the program PLINK), leaving a total of 259,506 SNPs. Since our dataset of ≠Khomani Bushmen included many related individuals (see Figure 5.1), we ran ADMIXTURE in six separate runs of unrelated individuals, using all default parameters in unsupervised mode. This ensured that sets of related individuals were not identified as ancestral clusters.
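The per-SNP quality filters described above (missing genotype rate > 5% and minor allele frequency < 1% removed) can be illustrated with a short sketch. This is not the pipeline used in the study; the genotype coding (0, 1, or 2 copies of one allele, with None for a missing call) and the function name are assumptions made for illustration.

```python
def snp_passes_qc(calls, max_missing=0.05, min_maf=0.01):
    """Return True if one SNP passes the missingness and MAF thresholds.

    calls: genotype calls for a single SNP across individuals, coded as
    0/1/2 copies of one allele, or None when the call is missing
    (hypothetical coding chosen for this sketch).
    """
    called = [g for g in calls if g is not None]
    missing_rate = 1 - len(called) / len(calls)
    if not called or missing_rate > max_missing:
        return False
    allele_freq = sum(called) / (2 * len(called))
    maf = min(allele_freq, 1 - allele_freq)
    return maf >= min_maf


# A single heterozygote among 100 individuals has frequency 1/200 = 0.005,
# so the SNP is removed by the 1% frequency threshold.
singleton = [1] + [0] * 99
```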
We examined a range of numbers of ancestral populations (K), from 4 to 13, and found that a large number of ancestral clusters (K > 7) produced cluster inferences that were unstable. We chose K = 6 as the best representation of the number of ancestral populations for the ≠Khomani Bushmen and other included populations (see Results). We repeated the ADMIXTURE inference 25 times for each of the six separate runs, and averaged the results over the 25 runs using the program CLUMPP (with 100 random input sequences) [128]. These averaged values were then aggregated from the six runs into one plot.

To verify the accuracy of the above ancestry estimation, we compared the estimated ancestry proportions against each individual’s self-reported ancestry based on ethnographic interviews. For each individual, we created a measure of ancestry based on the reported ancestries of their maternal and paternal relatives. For an ancestry A, the self-reported measure R was calculated as:

R(A) = 0.5 (I_{m=A} + I_{f=A}) + 0.25 (I_{gma=A} + I_{gpa=A}), (5.1)

where I_{x=A} indicates whether relative x (m = mother; f = father; gma = maternal grandmother; gpa = paternal grandfather) was reported to be of ancestry A. Ancestries (represented by A) were: European (including the reported ethnicities “Dutch” and “Coloured”); European/Griqua (including the previous ethnicities as well as “Griqua,” another coloured South African group); San (including the reported ethnicities “Bushmen,” “San,” “≠Khomani Bushmen,” and other synonyms); San/Nama (the previous ethnicities as well as “Nama”); and Bantu (including the ethnicities “Tswana,” “Black,” “Zulu,” “!Xhosa,” and “Ovambo”). The statistical package R was used to assess significance of the Pearson correlation of the self-report measures (R(A)) with ancestry as inferred by ADMIXTURE. These correlations were examined for all 103 genotyped individuals, including those for whom phenotype information was absent.
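To make Equation 5.1 concrete, the following sketch evaluates R(A) for one individual. The dictionary of ethnicity labels follows the groupings listed above, except that the unspecified “other synonyms” for San are omitted; the function and variable names are hypothetical. Note that, as written, the measure ranges from 0 (no listed relative of ancestry A) to 1.5 (all four relatives of ancestry A).

```python
# Ethnicity labels per ancestry category, as listed in the text.
# Assumption: unspecified synonyms for San are not included here.
ANCESTRY_GROUPS = {
    "European": {"Dutch", "Coloured"},
    "European/Griqua": {"Dutch", "Coloured", "Griqua"},
    "San": {"Bushmen", "San", "≠Khomani Bushmen"},
    "San/Nama": {"Bushmen", "San", "≠Khomani Bushmen", "Nama"},
    "Bantu": {"Tswana", "Black", "Zulu", "!Xhosa", "Ovambo"},
}


def self_reported_ancestry(ancestry, mother, father,
                           maternal_grandmother, paternal_grandfather):
    """Evaluate R(A) of Equation 5.1 from four reported ethnicities."""
    labels = ANCESTRY_GROUPS[ancestry]

    def indicator(ethnicity):
        return 1 if ethnicity in labels else 0

    return (0.5 * (indicator(mother) + indicator(father))
            + 0.25 * (indicator(maternal_grandmother)
                      + indicator(paternal_grandfather)))


# Hypothetical individual: San mother and maternal grandmother,
# Dutch father and paternal grandfather.
r_san = self_reported_ancestry("San", "San", "Dutch", "San", "Dutch")
r_eur = self_reported_ancestry("European", "San", "Dutch", "San", "Dutch")
```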

5.2.4 Statistical Analyses

Violin plots for height and skin pigmentation measurements of genotyped ≠Khomani Bushmen individuals were drawn using the vioplot() function in the vioplot package in R. As a measure of the amount of tanning of an individual during their lifetime, we subtracted each individual’s inner arm pigmentation measurement from their wrist pigmentation measurement. Regressions of height, inner arm skin pigmentation, and tanning (wrist minus inner arm pigmentation) on sex, age, and ancestry proportions were performed using a linear mixed model, allowing all individuals to be included despite close relatedness (see Figure 5.1). The model for the phenotype, Y, is:

Y = Xβ + Zb + ε, (5.2)

where X and Z are vectors of the fixed and random effects, respectively. Sex, age, and ancestry comprise the fixed effects, and all individuals comprise Z, the random effects. The β coefficients estimated for this model include β_sex, β_age, and β_ancestry for a given ancestry. The random coefficient b is distributed as N(0, σ_b^2 Θ), where the matrix Θ corresponds to the kinship matrix (between pairs of individuals) constructed from the inferred pedigrees (see Figure 5.1). The error term, ε, is distributed as N(0, σ_ε^2). Parameters β, σ_b^2, and σ_ε^2 were estimated using maximum likelihood with the lmekin() function in the package kinship in R, for all 86 phenotyped and genotyped individuals (85 individuals for height).

Estimates of narrow-sense heritability (h^2), the proportion of phenotypic variance explained by additive genetic variance, were calculated using several methods (for a discussion see Visscher et al. [288] and Zaitlen and Kraft [313]). The first method used the linear mixed model above (Equation 5.2). h^2 can be estimated as σ_b^2/(σ_b^2 + σ_ε^2), or the fraction of phenotypic variance explained by the kinship matrix. For height, fixed effect covariates of sex and age were included for estimation of h^2; for pigmentation, no covariates were included (see Results). This analysis was performed on all 103 individuals with phenotypic information (102 individuals for height).

We also estimated heritability from the regression of offspring phenotype on parent phenotype for 18 parent/offspring pairs for which height and skin pigmentation information was available; here, h^2 is equal to twice the regression coefficient [162]. For height, h^2 was calculated using the residuals from the regression of height on sex and age (residuals from the mixed model of Equation 5.2; see Results); h^2 for pigmentation was obtained from the averaged arm values without accounting for covariates (see Results). Since not all parent/offspring pairs were independent (for instance, multiple children from the same parent were included in our dataset), we obtained heritability estimates using all parent/offspring pairs, as well as several independent sets of pairs.
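The single-parent regression estimate described above can be sketched in a few lines: fit an ordinary least-squares slope of offspring phenotype on parent phenotype and double it. The code below is illustrative only; the function name and the toy phenotype values are hypothetical, and no standard error is computed.

```python
def parent_offspring_heritability(parent_values, offspring_values):
    """Estimate h^2 as twice the OLS slope of offspring on single parent."""
    n = len(parent_values)
    mean_parent = sum(parent_values) / n
    mean_offspring = sum(offspring_values) / n
    covariance = sum((p - mean_parent) * (o - mean_offspring)
                     for p, o in zip(parent_values, offspring_values))
    parent_variance = sum((p - mean_parent) ** 2 for p in parent_values)
    slope = covariance / parent_variance
    return 2 * slope


# Hypothetical covariate-adjusted residuals (cm) for four pairs.
parents = [-2.0, -1.0, 1.0, 2.0]
offspring = [-0.8, -0.4, 0.4, 0.8]
h2 = parent_offspring_heritability(parents, offspring)
```

The factor of two reflects that an offspring shares half of its additive genetic variation with a single parent. With few pairs, as in this study, the slope is estimated imprecisely and the doubling amplifies that uncertainty.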

We also estimated heritability using the GCTA method of Yang et al. [310], which calculates a lower bound for h^2 as the proportion of trait variability explained by genotyped SNPs. This estimate can also be interpreted as an upper bound for the fraction of phenotypic variance that can be explained by SNPs identified by a GWAS (see Zaitlen and Kraft [313]). The model of this method takes a similar form to Equation 5.2; however, the random coefficient b is distributed as N(0, σ_b^2 Θ), where Θ is represented by a “genetic relationship matrix” calculated from genotyped SNPs instead of by the kinship matrix. We first estimated the genetic relationship matrix (between pairs of individuals) using all default parameters of GCTA, and with this matrix, we then used GCTA to estimate the variance parameters and calculate heritability by restricted maximum likelihood. For height, we used sex and age as covariates; for pigmentation, we used no covariates. Because this analysis requires genetic information, it was performed on only 86 phenotyped and genotyped individuals (85 individuals for height).
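For intuition about what a SNP-based genetic relationship matrix contains, the sketch below computes the average, over SNPs, of the product of two individuals' standardized allele counts, (x_ij − 2p_i)(x_ik − 2p_i) / (2p_i(1 − p_i)). This follows the general form of the estimator described by Yang et al., but it is a simplified illustration rather than GCTA's implementation: allele frequencies are taken from the sample itself, the same expression is used on the diagonal, missing genotypes are not handled, and monomorphic SNPs are skipped. All names are hypothetical.

```python
def genetic_relationship_matrix(genotypes):
    """Compute a simple SNP-based relationship matrix.

    genotypes: list of individuals, each a list of 0/1/2 allele counts
    (no missing values; at least one SNP must be polymorphic).
    """
    n_ind = len(genotypes)
    n_snp = len(genotypes[0])
    # Sample allele frequency of the counted allele at each SNP.
    freqs = [sum(ind[i] for ind in genotypes) / (2 * n_ind)
             for i in range(n_snp)]
    # Monomorphic SNPs carry no information and would divide by zero.
    used = [i for i, p in enumerate(freqs) if 0 < p < 1]
    standardized = [
        [(ind[i] - 2 * freqs[i]) / (2 * freqs[i] * (1 - freqs[i])) ** 0.5
         for i in used]
        for ind in genotypes
    ]
    return [
        [sum(a * b for a, b in zip(standardized[j], standardized[k]))
         / len(used)
         for k in range(n_ind)]
        for j in range(n_ind)
    ]


# Toy data: four individuals, three SNPs.
toy = [[0, 1, 2], [2, 1, 0], [1, 1, 1], [1, 1, 1]]
grm = genetic_relationship_matrix(toy)
```

Large off-diagonal entries indicate pairs that share more alleles than expected from the sample frequencies, which is why such a matrix can stand in for a pedigree-based kinship matrix in the mixed model.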

For each phenotype, we performed a genome-wide association study on 297,963 SNPs (with minor allele frequency > 0.01) in all phenotyped ≠Khomani Bushmen individuals, estimating a linear mixed model for each SNP using the program EMMAX [138] to account for substructure and relatedness of individuals [209]. The linear mixed model has a similar form to Equation 5.2, but includes SNP genotype as a fixed effect (with corresponding coefficient β_SNP); as above, a pairwise genetic relatedness matrix between individuals, approximated from genotyped SNPs, is used to represent the matrix Θ (see Kang et al. [138]). We estimated this pairwise genetic relatedness matrix between individuals using the Balding-Nichols (BN) model with EMMAX, as recommended by the authors [138], and plotted and clustered the matrix using the heatmap.2() function from the R package gplots. Parameters of the regression model were then estimated for each SNP by restricted maximum likelihood using EMMAX. For height, sex and age were included as covariates; for pigmentation, ancestry proportions in the San, European, and/or Bantu clusters (see Results) were included as covariates. Quantile-Quantile (QQ) plots of estimated p-values were plotted using the qqunif() function in the R package gap. Plots of candidate regions were created using LocusZoom (http://csg.sph.umich.edu/locuszoom/; Pruim et al. [213]), and recombination rates were plotted using 1000 Genomes Pilot 1 SNP calls for 59 Yoruba (YRI) individuals (November 2010; 1000 Genomes Project Consortium et al. [1]).

To identify the quantitative effect of identified loci on the phenotypes (see Results), we estimated effect sizes under a model where mean(Y_BB) = −a, mean(Y_Bb) = d, and mean(Y_bb) = a, where Y_XX is the phenotype given genotype XX at the locus; B and b are the alternative alleles at a locus [162]. For estimation of a and d for height loci, we set Y as the residuals from a regression of height on sex and age from a linear mixed model using the kinship matrix estimated from the inferred pedigrees (Figure 5.1) as the variance/covariance matrix Θ (see above).
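A short sketch shows how a and d follow from three genotype means. It assumes the usual convention that genotypic values are measured relative to the midpoint of the two homozygote means; this convention is not stated explicitly in the text, but it reproduces the values reported in the Results. The function name is hypothetical.

```python
def additive_dominance_effects(mean_low_hom, mean_het, mean_high_hom):
    """Return (a, d) from the three genotype means.

    Assumption: values are expressed relative to the midpoint of the two
    homozygote means, so that the homozygotes sit at -a and +a and the
    heterozygote sits at d.
    """
    midpoint = (mean_low_hom + mean_high_hom) / 2
    a = (mean_high_hom - mean_low_hom) / 2
    d = mean_het - midpoint
    return a, d


# Height residual means (cm) reported in the Results for rs9315851:
# TT = -11.538, CT = -2.457, CC = 1.631.
a, d = additive_dominance_effects(-11.538, -2.457, 1.631)
```

For these inputs, a is about 6.58 cm and d is about 2.50 cm, matching the reported a = 6.584 cm and d = 2.497 up to rounding. A positive d means the heterozygote mean lies closer to the taller homozygote.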

5.3 Results

5.3.1 Ancestry

Pedigrees of sampled individuals are shown in Figure 5.1, and include relationships between 50 genotyped and phenotyped individuals based on IBD and ethnographic information; the remaining 36 genotyped and phenotyped individuals were not related to others in the dataset. Relationships between individuals include parent/offspring, avuncular, grandparent/grandchild, cousins, and others.

For each individual, ancestral clusters and admixture proportions in these clusters were estimated using the program ADMIXTURE [5] with various values of K, the number of ancestral populations (Figure 5.2). Because our sample included related individuals, we ran ADMIXTURE in six separate runs of unrelated individuals as described in the Methods. The K with the lowest cross-validation error was 4, 5, or 6, depending on the run (Figure 5.3); cross-validation error increased greatly for K > 6 for all runs. We believe results at K = 6 to be most representative of the admixture of the ≠Khomani Bushmen (Figure 5.2). At K = 6, major admixture clusters corresponded to European (orange cluster), Bantu (light blue), Eastern African (blue), Mozabite (green), Hadza (yellow), and San (purple). In the ≠Khomani Bushmen, the major ancestries represented are Bantu, San, and European, with a minor amount of Eastern African ancestry. For phenotyped individuals, the mean amount of San ancestry per individual was 0.717 (standard deviation (sd) = 0.19); mean European ancestry was 0.129 (sd = 0.15), mean Bantu ancestry was 0.094 (sd = 0.114), and mean East African ancestry was 0.049 (sd = 0.043).

At K = 7, all but one of the six runs of unrelated ≠Khomani Bushmen individuals resulted in a split between the Namibian San (purple cluster) and South African San (pink). The F_ST between these two clusters as calculated by ADMIXTURE had an average of 0.054 across the five runs; this F_ST is low compared to F_ST between other inferred clusters (see also Henn et al. [103]).

While collecting samples, we noticed that older individuals tended to be less admixed than younger individuals. In order to statistically assess this bias in our sampling, we performed a simple linear regression of ancestry on age for phenotyped individuals using R. We found a significant positive relationship between age and the proportion of San ancestry (p = 0.007) and a significant negative relationship between age and the proportion of European ancestry (p = 0.045) (see Figure 5.4). Similarly, Pearson correlations between ancestry and age were significant for the proportion of San ancestry (r = 0.295; p = 0.006) and European ancestry (r = -0.262; p = 0.015), but not for Bantu ancestry (r = -0.06; p = 0.560) or East African ancestry (r = -0.211; p = 0.051).

We also tested the concordance of our genetic admixture estimates with self-reported ancestry obtained from ethnographic interviews for all genotyped individuals (Table 5.1). Despite the crudeness of our self-reported ancestry measure, which included information about only one grandparent on each of the maternal and paternal sides (see Equation 5.1), we found significant correlations between self-reported European, Bantu, and San ancestry and genetic estimates. Thus, the ancestries estimated by ADMIXTURE appear to be appropriate.

5.3.2 Phenotypes

Measurements of height and skin pigmentation (M) are shown for genotyped individuals in Figure 5.5 and for all phenotyped individuals in Table 5.2. For genotyped individuals, male and female heights were significantly different; average female and male heights were 151.11 cm and 163.64 cm, respectively (Wilcoxon test p-value = 7.43 × 10^−10). Skin pigmentation (M) did not differ by sex (Wilcoxon test p-value = 0.832); however, wrist minus inner arm pigmentation, as a measurement of tanning, was significantly different between males and females (Wilcoxon test p-value = 0.003). All three phenotypes were roughly normally distributed (Figure 5.5); for each phenotype, Kolmogorov-Smirnov tests performed in R showed no significant difference from a normal distribution (all p-values > 0.05 across all individuals as well as separately for males and females).

We used a linear mixed model to estimate the effects of sex and age on each phenotype for genotyped individuals (see Methods). For height, sex and age were significant (β_male = 12.632, p = 0; β_age = -0.098, p = 0.003), and were included as covariates in subsequent analyses (see Figure 5.6). An interaction term between sex and age was not significant. Age and sex were not significant covariates for arm skin pigmentation. For measurements of wrist minus inner arm pigmentation (tanning), both sex and age were significant predictors, with males and older individuals having a greater positive difference between their wrist and arm pigmentations (β_male = 6.375, p = 0.001; β_age = 0.176, p = 0) (see Figure 5.7).

We made several estimates of narrow-sense heritability, or h^2, for both height and skin pigmentation (Table 5.3). Heritabilities for height estimated from parent/offspring pairs (after accounting for sex and age) ranged from 0.608 to 0.895, with very large standard errors, likely due to the small number of pairs (see Table 5.3). Heritabilities for height using a mixed model given our inferred pedigree produced a higher estimate (ĥ^2 = 0.999). The lower-bound ĥ^2 estimate calculated using GCTA is 0.588, though with a large standard error, similar to the value of 0.45 reported by Yang et al. [310].

Arm pigmentation heritability estimates using parent/offspring regressions ranged from 0.409 to 0.808, again with large standard errors (Table 5.3). Pigmentation heritability was estimated to be 0.999 using the inferred pedigree with a linear mixed model; in agreement with this estimate, the lower-bound ĥ^2 estimate using GCTA was 0.999 and highly significant. A large proportion of variation in skin pigmentation appears to be explained by genotyped SNPs. In contrast, for measurements of tanning (wrist minus inner arm pigmentation), heritabilities using parent/offspring pairs are all of small magnitude, sometimes negative, and highly variable (Table 5.3). The proportion of variance explained by genotyped SNPs for the tanning measurement using the GCTA method (0.385) is the smallest of all three phenotypes, and is not significant. Thus, the genetic architecture of tanning seems to be very different from that of height and constitutive skin pigmentation.

Using the mixed model regression of Equation 5.2, we tested whether ancestry proportions estimated from ADMIXTURE were significant predictors of height and skin pigmentation. A separate regression of height accounting for sex and age was performed on each ancestry; Figure 5.8 shows the results of the regression of height on sex, age, and proportion European ancestry. No ancestry (European, Bantu, San, or East African) was a significant predictor of height (Table 5.4), and the estimated β_ancestry coefficients were near 0 for all ancestries. Given the relationship between ancestry and age in our dataset (Figure 5.4), we also tested for an interaction between each ancestry and age; however, neither the interaction terms nor ancestry terms were significant (results not shown). When more than one ancestry was included in the model, no ancestries were significant predictors of height (results not shown).

In contrast, ancestry does appear to have a strong effect on skin pigmentation variation. In separate regressions of pigmentation on each ancestry (not including sex and age as covariates), Bantu and European ancestry proportions were both significant; San and East African ancestry were not significant (Table 5.4). As expected, Bantu ancestry significantly increased skin pigmentation, whereas European ancestry significantly decreased skin pigmentation (Figure 5.9). Combinations of Bantu and San, European and San, and Bantu and European ancestries also yielded significant regression coefficients. However, a model including all three ancestries gave no significant coefficients, likely due to the covariance between the proportions (Table 5.4). No regression models including East African ancestry gave significant regressions (results not shown). Agreeing with our low estimates of heritability for the tanning measurement (wrist minus inner arm pigmentation), after accounting for sex and age we found no significant effect of any ancestry on the amount of tanning (Table 5.4).

We then asked whether there are large effect SNPs correlated with skin pigmentation and height in the ≠Khomani Bushmen. In order to correct for population and family structure in a GWAS framework, we used the program EMMAX [138].
First, a genetic relationship matrix between all pairs of individuals was calculated with EMMAX using the Balding-Nichols model in order to account for relatedness of individuals and substructure (see Figure 5.10). Pairs of individuals with high values of genetic relatedness correspond to relatives in the pedigree of Figure 5.1, indicating that the method can appropriately account for the inclusion of relatives. We then used EMMAX to calculate a p-value and β coefficient for each SNP in a mixed model regression (as in Equation 5.2), accounting for the genetic relationship matrix in Figure 5.10.

For height, we included sex and age as covariates, and the QQ plot of Figure 5.11A shows that p-values were uniformly distributed. Several SNPs had p-values slightly more extreme than expected by chance (see Table 5.5). However, no p-values fell below the genome-wide significance cutoff of 1.68 × 10^−7 (−log10(p) = 6.77; based on a Bonferroni correction at the α = 0.05 level) (Figure 5.12A).

The top SNP, rs9315851 on chromosome 13, is in an intron in the gene VWA8, which is associated with ATP binding and ATPase activity according to the Gene Ontology (http://amigo.geneontology.org) (Table 5.5 and Figure 5.12B). The effect allele (C), associated with taller stature, is the ancestral allele, and is at high frequency in the ≠Khomani Bushmen (79%) (Figure 5.13A and D). Interestingly, the gene VWA8 has been associated with body weight and body mass index (BMI) in the Framingham Heart Study [81]. Located 428,145 bases away (at position 42,631,719) in the gene DGKH is SNP rs6561030, found to be associated with body height in Japanese individuals [192]. Means of residuals of height accounting for sex and age by rs9315851 SNP genotype suggest a strong effect of the locus: CC (1.631 cm); CT (-2.457 cm); TT (-11.538 cm) (Figure 5.13A). Under a quantitative genetic model for the phenotype (see Methods), a = 6.584 cm, suggesting a large effect, and d = 2.497, suggesting dominance of the C allele. However, only three individuals had the TT genotype, compared to 30 and 52 with the CT and CC genotypes, respectively.

The second top SNP, rs2967055 on chromosome 5, is located in an intronic region surrounded by many genes (Table 5.5 and Figure 5.12C). Again, the allele associated with taller stature (G) is the ancestral allele, and at high frequency (Figure 5.13B and E). Interestingly, this SNP is 247 bases away from rs2937550 (at position 36,473,237), which was found to be associated with height in a meta-analysis of African Americans [183]. One of the genes located near rs2967055 is SLC1A3, which, in addition to being associated with episodic ataxia and glutamate transport, was also found to be associated with height in the NHLBI Family Heart Study (dbGaP Study Accessions: phs000221 and phs000501). SNPs in SLC1A3 have also been found to be associated with BMI and waist-hip ratio in the same study, as well as with LDL cholesterol levels [139]. Another nearby gene is NIPBL, which is associated with abnormal skeletal growth [183]. Means of the residuals of height accounting for sex and age again suggest a strong effect of the locus: GG (1.205 cm); AG (-4.894 cm); AA (-8.835 cm) (Figure 5.13B; under the described quantitative genetic model (see Methods), a = 5.020 cm and d = -1.077, suggesting some dominance of the A allele). As with the previous example, a majority of individuals (65) had the GG genotype, while only 19 had the AG genotype and only 1 had the AA genotype.

Finally, the third top SNP, rs2531691 on chromosome 10, is in an intron of the gene ENO4 (Table 5.5). The only variation in this SNP exists in sub-Saharan Africa and the derived allele is absent in the HGDP Namibian San; HGDP populations outside of Africa are nearly fixed for the T (ancestral) allele (Figure 5.13F). As with the previous associated SNPs, the ancestral allele is associated with taller stature and is at high frequency (97%) in our sample of ≠Khomani Bushmen. According to the Gene Ontology, associations with ENO4 include glycolysis and magnesium ion binding. SNPs in ENO4 have been found to be associated with heart disease and heart failure [150]. While the mean of height residuals (after accounting for age and sex) for TT individuals (0.432 cm) is greater than for CT individuals (-9.610 cm), there are only 6 individuals with the CT genotype, and no CC homozygotes (Figure 5.13C). Nonetheless, this suggests a large effect of the locus.

In our genome-wide association study for pigmentation using EMMAX, we do not account for sex and age as covariates; however, we do account for ancestry (see Table 5.4). Because of the non-independence of European, Bantu, and San ancestry proportions, only two ancestries were included at a time (European and Bantu, Bantu and San, or European and San); we also attempted to simultaneously include all three ancestries as covariates. In Figure 5.11B, we show a QQ plot of p-values when including only European and San ancestries as covariates; QQ plots for all other ancestry combinations are similar. There are no outlying SNPs with extreme p-values based on the QQ plots of Figure 5.11B, and no SNP p-values fell below the genome-wide significance cutoff.
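The genome-wide significance cutoff used in these association analyses can be checked with a short calculation. The sketch assumes that the number of tests in the Bonferroni correction equals the 297,963 SNPs analyzed in the GWAS (see Methods); this is consistent with the reported cutoff of 1.68 × 10^−7 but is not stated explicitly in the text.

```python
import math

ALPHA = 0.05
# Assumption: one test per SNP analyzed in the GWAS (see Methods).
N_SNPS = 297_963

# Bonferroni correction: divide the family-wise error rate by the
# number of tests performed.
cutoff = ALPHA / N_SNPS
neg_log10_cutoff = -math.log10(cutoff)

print(f"cutoff = {cutoff:.3g}, -log10(cutoff) = {neg_log10_cutoff:.2f}")
```

Because neighboring SNPs are correlated through linkage disequilibrium, a Bonferroni threshold based on the raw SNP count is conservative.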

5.4 Discussion

Height and skin pigmentation vary greatly within and between human populations. While they are both easily measurable traits, a great deal remains to be understood about their evolutionary history and genetic basis. Given their high genetic diversity, short stature, and light skin pigmentation, Southern African KhoeSan hunter-gatherers can provide novel insight into the ancestral features and genetic architecture of these phenotypes. In our study of the ≠Khomani Bushmen of the Northern Cape of South Africa, we leverage the population’s recent admixture with Bantu and European populations to better characterize the genetic contribution to these two traits.

5.4.1 Ancestry Estimation

Genetic estimates of ancestry reveal that the ≠Khomani Bushmen are an admixed population, with many individuals having appreciable amounts of European and Bantu admixture (as previously reported by Henn et al. [103]). Genetic admixture estimates are consistent with self-reported ethnicity information, supporting the conclusion that genetic ancestry measurements accurately reflect the admixture history of this population [202, 239, 277] (Table 5.1).

5.4.2 Height

With average female and male adult heights of 151.78 cm and 163.90 cm, respectively, our measurements of height in the ≠Khomani Bushmen were slightly greater than previous measurements in various KhoeSan populations of the Kalahari (Table 5.2). For instance, measurements of southern KhoeSan populations reported by Schapera [239], Tobias [277] and Dart [57] were 5-10 cm shorter for both males and females. Our estimates agreed more closely with measurements taken in more northerly KhoeSan and Nama (or “Hottentot”) populations, although our sample had slightly greater mean measurements [193, 194, 239, 277, 300, 308]. These measurement differences can be explained by the inclusion of Nama individuals in our sample, our inclusion of admixed individuals, differences in ages of sampled individuals, or sampling bias for shorter individuals in previous studies [277]. Previous studies did, however, identify a large difference between male and female heights of ≈10 cm, as in our study (Table 5.2). As expected, height of the ≠Khomani Bushmen individuals was greater than that of most Pygmy populations, for whom adult male height is generally 140-160 cm [18, 20, 130, 204].

We used several methods to obtain estimates of narrow-sense heritability (h²) for height (Table 5.3). While the standard errors of our estimates were large because of the small sample size, our point estimates agreed with previous estimates (≈80%). Our results suggest that height in the ≠Khomani Bushmen, as in other populations, has a large contribution from additive genetic variance. However, the SNPs genotyped in our study appear to explain only a moderate proportion of this variance (as found by Yang et al. [310]; see GCTA estimate of Table 5.3). In addition to genetic factors, environmental and nutritional factors may contribute to variance in height [193, 202].
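To illustrate how a parent/offspring estimate such as those in Table 5.3 can be computed, the following Python sketch regresses offspring phenotype on the phenotype of a single parent and doubles the slope. This is a textbook estimator and only an assumption about the general approach; the procedure actually used is described in the Methods (Section 5.2.4), and the function name and example values are hypothetical.

```python
import numpy as np

def parent_offspring_h2(parent, offspring):
    """Estimate h^2 as twice the slope of offspring on single-parent phenotype.

    Returns (h2, standard_error). Assumes one parent per offspring and
    independent pairs; relatedness among pairs is ignored in this sketch.
    """
    x = np.asarray(parent, dtype=float)
    y = np.asarray(offspring, dtype=float)
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    # Standard error of the slope from the residual variance (n - 2 d.f.).
    se_slope = np.sqrt(np.sum(residuals ** 2) / (n - 2) / sxx)
    return 2.0 * slope, 2.0 * se_slope

# Hypothetical, made-up phenotypes in cm (not data from this study).
parent = [150.0, 155.0, 160.0, 165.0, 170.0]
offspring = [156.0, 155.0, 159.0, 161.0, 162.0]
h2, se = parent_offspring_h2(parent, offspring)
```

Because the estimate is a scaled regression slope, it is not constrained to lie between 0 and 1, which is why small samples can yield negative values such as some of the tanning estimates in Table 5.3.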

Interestingly, we found no effect of ancestry on height in the ≠Khomani Bushmen (Table 5.4 and Figure 5.8). This is in contrast to Pygmy populations, for whom other studies have claimed a correlation between non-Pygmy ancestry and increased height [18, 130]. However, although Becker et al. [18] found a significant positive relationship between non-Pygmy ancestry and height when pooling across several Pygmy populations from Cameroon, Gabon, and Central African Republic (including Aka, Baka, and Bongo Pygmies), they did not find a significant relationship within Pygmy populations. In addition, though Jarvis et al. [130] found a significant association of height with non-Pygmy ancestry, their results may be biased by the inclusion of multiple Pygmy populations (Baka, Bakola, and Bedzan), as well as Bantu-speaking populations from Cameroon. That we were able to detect a significant relationship between ancestry and skin pigmentation (see Results) suggests that the absence of a relationship for height is not necessarily due to a lack of power from small sample size. Since we obtained relatively high point estimates of height heritability, it seems that the loci affecting height are not strongly differentiated by ancestry. Differences in height among ancestral populations (European, Bantu, and KhoeSan) may not be great enough to produce a significant signal.
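The partial regression coefficients reported in Table 5.4 and Figure 5.8 come from the mixed model of Equation 5.2. As a simplified illustration, the sketch below fits only the fixed-effects part by ordinary least squares and ignores the kinship random effect; this simplification is an assumption made to keep the example short, and the function name and simulated values are hypothetical.

```python
import numpy as np

def partial_coefficients(y, *covariates):
    """Least-squares coefficients for y ~ intercept + covariates.

    Returns an array [intercept, beta_1, ..., beta_k]. Unlike the mixed model
    used in the text, this sketch has no random effect for relatedness.
    """
    y = np.asarray(y, dtype=float)
    columns = [np.ones(y.size)] + [np.asarray(c, dtype=float) for c in covariates]
    design = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

# Simulated individuals (not study data): height depends on sex and age only.
rng = np.random.default_rng(0)
n = 200
sex = rng.integers(0, 2, size=n)            # 1 = male, 0 = female
age = rng.uniform(20, 90, size=n)           # years
european = rng.uniform(0.0, 0.6, size=n)    # ancestry proportion, no true effect
height = 155.0 + 12.0 * sex - 0.1 * age + rng.normal(0.0, 5.0, size=n)

intercept, b_sex, b_age, b_european = partial_coefficients(height, sex, age, european)
```

Each coefficient is the expected change in height per unit change in that covariate with the others held fixed, which is the sense in which the ancestry coefficient is "partial" with respect to sex and age.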

In a genome-wide association study, we identified several loci of large effect potentially related to height in the ≠Khomani Bushmen (Table 5.5 and Figure 5.13A, B and C). While none of these loci reached genome-wide significance, several SNPs appeared to be outliers from a uniform distribution of p-values (based on the QQ plot of Figure 5.11A). Promisingly, the associations we identified are located near or within genes that have previously been found to have modest associations with height or other body size characteristics (see Results and Figure 5.12).
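The QQ comparison referred to here can be sketched as follows. Under the null hypothesis, p-values are uniform on (0, 1), so the i-th smallest of n p-values is expected to lie near i/(n + 1). That plotting position is one common convention and an assumption here, as are the function name and the example p-values.

```python
import numpy as np

def qq_points(pvalues):
    """Return (expected, observed) -log10 p-values for a QQ plot.

    Both arrays are ordered from the smallest p-value (largest -log10 value)
    to the largest p-value.
    """
    observed_p = np.sort(np.asarray(pvalues, dtype=float))
    n = observed_p.size
    # Expected order statistics of n uniform(0, 1) draws: i / (n + 1).
    expected_p = np.arange(1, n + 1) / (n + 1)
    return -np.log10(expected_p), -np.log10(observed_p)

# Hypothetical p-values, for illustration only.
expected, observed = qq_points([0.5, 0.01, 0.2, 0.9, 0.03])
# Points with observed well above expected are outliers from the null.
excess = observed - expected
```

SNPs whose observed values rise above the diagonal at the upper end of such a plot are the "outliers from a uniform distribution" described in the text.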

Differences in height between genotypes for the identified loci suggest large effects for the quantitative contribution of each allele, with possible dominance effects. Effect sizes under an additive quantitative genetic model (see Methods) were very large (≈ 5-10 cm), much higher than effect sizes estimated from previous studies (≈ 0.5 cm) (Figure 5.13A, B and C) [154, 183, 237, 297]. We note that for all three identified SNPs, the effect allele associated with taller height is at very high frequency in the ≠Khomani Bushmen; the few heterozygotes and homozygotes with the “short” allele could bias effect size estimates (Figure 5.13D, E, and F). Additionally, it is possible that unaccounted-for substructure (due either to admixture or to relatedness) could cause spurious associations. Nonetheless, for each association (Table 5.5), the effect allele associated with increased height is the ancestral allele and the major allele in the ≠Khomani Bushmen. The high frequency of the “taller” allele in the ≠Khomani Bushmen seems surprising given their shorter stature, and indicates the likely existence of additional loci underlying the overall short stature of the ≠Khomani Bushmen.
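A rough calculation shows why so few individuals carry the “short” alleles. The sketch below assumes Hardy-Weinberg proportions (an assumption that admixture and relatedness in this sample may violate) and takes the 85 genotyped individuals with height measurements in Table 5.2 as the sample size, which may differ from the sample actually analyzed. The effect allele frequencies are those of Table 5.5; the function name is hypothetical.

```python
def expected_genotype_counts(effect_freq, n_individuals):
    """Expected genotype counts under Hardy-Weinberg proportions."""
    p = effect_freq
    q = 1.0 - p
    return {
        "two effect alleles": n_individuals * p * p,
        "one effect allele": n_individuals * 2.0 * p * q,
        "no effect alleles": n_individuals * q * q,
    }

# Effect allele frequencies from Table 5.5; n = 85 from Table 5.2.
frequencies = {"rs9315851": 0.788, "rs2967055": 0.878, "rs2531691": 0.965}
for snp, freq in frequencies.items():
    counts = expected_genotype_counts(freq, 85)
    print(snp, {genotype: round(count, 1) for genotype, count in counts.items()})
```

For rs2531691, fewer than one individual is expected to carry no copies of the effect allele under these assumptions, so the estimated effect rests almost entirely on a handful of heterozygotes.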

The prevailing view in human height research is that many loci of small effect contribute to height variation, so that large sample sizes are required to detect genetic associations [287]. Our study of a modest number of individuals suggests that loci with large effect sizes may also contribute to human height variation. Loci of smaller effect likely still contribute to height variation in the ≠Khomani Bushmen, as supported by our estimate that common tag SNPs on the Illumina arrays explain 58% of the variance of height in our sample (see also Visscher [287]). Because the allele associated with taller stature is the ancestral allele for all three identified height SNPs, shorter stature could be a derived trait in the ≠Khomani Bushmen (Figure 5.13). A number of hypotheses regarding the evolution of human Pygmy short stature may be relevant to KhoeSan populations, such as the evolution of short stature for increased foraging mobility [204], though other hypotheses such as thermoregulation in a rainforest environment and life-history tradeoffs are not applicable. While our analyses are a promising first step, further research is necessary to understand more fully the evolution of KhoeSan shorter stature. Additional samples would allow us to verify the loci we have identified and to identify other loci of large or small effect that may be associated with height variation.

5.4.3 Skin Pigmentation

Skin pigmentation measurements of the ≠Khomani Bushmen as measured by a spectrophotometer (mean M ≈ 54) are much greater than those of Europeans (≈ 20 < M < 30), just above those observed in Asians (≈ 40 < M < 50), in the low range of those of African-Americans (≈ 50 < M < 90), and lower than those observed in Melanesians (≈ 60 < M < 90) (measurements taken with the same measurement tool; Candille et al. [42], Norton et al. [189], Shriver and Parra [250]) (Table 5.2). Comparison of our measurements of skin pigmentation with previous studies of the Kalahari KhoeSan, which primarily reported qualitative measurements, was not possible [239, 300]. Age was not significantly related to skin pigmentation, suggesting that measurements taken on the upper inner arm reflect innate pigmentation rather than UV exposure. While previous studies have adjusted for effects of sex in pigmentation measurements, we found no significant effect of sex on inner arm pigmentation [42, 189]. Interestingly, in our study of tanning (wrist minus inner arm pigmentation) (Figure 5.5), males and older individuals tended to have a greater positive difference between the two pigmentation measurements (Figure 5.7). Age can be seen as a proxy for the amount of time that an individual has been exposed to UV radiation, so that older individuals show greater tanning. The effect of sex suggests that males in this population may experience more sun exposure than females, perhaps due to spending more time outdoors. Though less likely, it could also imply a greater intrinsic tanning ability of males compared to females.

We obtained slightly higher estimates of heritability for innate pigmentation than those of previous studies, which have ranged between 55% and 80% [85, 225] (Table 5.3). Though heritability estimates calculated from parent/offspring regressions in our study had large standard errors (as for height), the high point estimates emphasize a strong genetic contribution to pigmentation. The GCTA method of Yang et al. [310] yielded a significant heritability estimate of 0.99, suggesting that a large proportion of variability in pigmentation may be explained by the genotyped SNPs. In contrast, tanning (wrist minus inner arm pigmentation) did not appear to be heritable, especially when compared to heritability estimates obtained for height and innate arm skin pigmentation (Table 5.3) (the large difference in estimates between independent sets of parent/offspring pairs is due to several outlying pairs). Variability in tanning among ≠Khomani Bushmen individuals is likely to have a larger contribution from the environment and sun exposure than from genetic factors. Comparing the heritabilities of height, arm pigmentation, and tanning suggests that, despite large standard errors, the point estimates of heritability that we obtain can be informative about the traits’ genetic bases.

Much of the heritability of innate pigmentation appears to be explained by individuals’ ancestries, as expected based on worldwide distributions of pigmentation phenotypes [199, 222]. As hypothesized, Bantu ancestry is significantly associated with darker pigmentation, and European ancestry with lighter pigmentation (Table 5.4 and Figure 5.9). While San ancestry alone is not a significant predictor of pigmentation, it is significant when included in a regression along with either European or Bantu ancestry, suggesting that there may exist KhoeSan-specific loci that contribute to variation in pigmentation. Given the strong associations of innate pigmentation with ancestry, genetic loci affecting pigmentation appear to be strongly differentiated by ethnicity [168]. In contrast to innate pigmentation, the amount of tanning does not appear to be affected by ancestry, in agreement with its low estimated heritability.

Because of the identified associations with ancestry, we included ancestry proportions as covariates in a genome-wide association study of innate arm pigmentation. This analysis was complicated, however, by the non-independence of the Bantu, European, and San ancestry proportions. Because of this collinearity, only two of the three ancestries were included as covariates at a time; however, this may not account for all genetic structure (see Figure 5.11B). We were unable to identify any loci significantly associated with pigmentation.

Our inability to identify loci associated with pigmentation in a GWAS is surprising, especially given the strong associations of pigmentation with ethnicity and the high heritability of pigmentation estimated using the GCTA method. While it is possible that our small sample size reduces our ability to identify novel associations, future studies using alternative methods may be more effective. First, instead of using ancestry proportions as covariates, accounting for ancestry using principal components calculated from genetic data may be more appropriate [10, 208]. Since principal components are uncorrelated by construction, they would allow the underlying effects of ancestry to be incorporated without the collinearity described above. The strong effect of ancestry on pigmentation also suggests that admixture mapping will be an attractive tool for identifying loci associated with pigmentation in this population in the future [246, 249, 304]. With such an approach, KhoeSan-specific, European-specific, or Bantu-specific loci contributing to variation in pigmentation could be discovered. Given that genetic and phenotypic analyses of the KhoeSan have never before been performed, KhoeSan-specific pigmentation loci may reveal novel genes associated with pigmentation.

The ancestral nature of the KhoeSan makes them a particularly attractive population for further understanding human pigmentation. Not only would the discovery of novel genes provide insight into pigmentation’s genetic basis, but identifying the ages of alleles associated with lighter or darker pigmentation could reveal the evolutionary history of human pigmentation across worldwide populations [25, 168].
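The collinearity among ancestry proportions noted in this section can be seen in a small numerical sketch. If the three ancestry proportions of each individual summed exactly to one, a design matrix containing an intercept and all three proportions would be rank deficient, whereas dropping any one ancestry restores full column rank. The exact sum-to-one constraint is an assumption for illustration (in the data analyzed here the dependence is likely only approximate, since an East African component is also inferred), and the proportions below are made up.

```python
import numpy as np

# Made-up ancestry proportions (European, Bantu, San); each row sums to one.
ancestry = np.array([
    [0.20, 0.10, 0.70],
    [0.10, 0.40, 0.50],
    [0.05, 0.05, 0.90],
    [0.30, 0.10, 0.60],
    [0.00, 0.20, 0.80],
])
intercept = np.ones((ancestry.shape[0], 1))

# Intercept plus all three ancestries: the columns are linearly dependent,
# because the three proportions add up to the intercept column.
full_design = np.hstack([intercept, ancestry])
# Intercept plus two ancestries: the dependence is removed.
reduced_design = np.hstack([intercept, ancestry[:, :2]])

full_rank = np.linalg.matrix_rank(full_design)        # fewer than its 4 columns
reduced_rank = np.linalg.matrix_rank(reduced_design)  # equal to its 3 columns
```

When the rank is smaller than the number of columns, the regression coefficients are not uniquely determined, which is the motivation for including only two of the three ancestries at a time.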

5.4.4 Conclusions

Despite our small sample size, our analysis of 103 ≠Khomani Bushmen reveals high heritability of both skin pigmentation and height, strong associations between ancestry and innate pigmentation, and several promising loci of large effect associated with height. We recognize that our power would be improved with a larger set of unrelated individuals as well as a denser set of genotypes, potentially obtained through imputation [166]. Nonetheless, this study emphasizes that analyses of phenotypically and genetically diverse individuals from endogamous populations have the potential to provide novel insights into the genetic basis and evolutionary history of complex traits – even those as often-studied as height and skin pigmentation.

5.5 Acknowledgments

We would like to thank the ≠Khomani San community, the South African San Institute (SASI), and our translators from the ≠Khomani San community for their assistance and support in carrying out this research. We would also like to thank Sandra Beleza, Oana Carja, and Karla Mendoza for helpful comments and discussions. We would like to acknowledge the Morrison Institute for Population Studies at Stanford University and a Stanford University Department of Biology SCORE grant for partially funding this research. J.M.G. is supported by the National Science Foundation (NSF) Graduate Research Fellowship, grant number DGE-0645962. B.M.H. is supported by NIH grant R01HG003229. This research was supported in part by NSF grant GM28016 to M.W.F.

5.6 Figures and Tables

[Figure 5.1 image: pedigree diagrams labeled with anonymized individual codes. Legend: Not Phenotyped; Genotyped 550; Not Genotyped; Genotyped Omni.]

Figure 5.1: Inferred pedigrees for a subset of related sampled individuals. Relationships are inferred from ethnographic interviews and IBD estimation (see Methods). Circles are females, squares are males. Numbers below the circles or squares indicate the individual’s anonymized code; codes with an “X” denote individuals who were not sampled. Shading indicates the genotyping platform, and a slash indicates that the individual was not phenotyped (see legend). Note: some individuals were genotyped but not phenotyped; other individuals were phenotyped but not genotyped.

[Figure 5.2 image: four stacked bar plots (K = 4, 5, 6, 7) with y-axes from 0.0 to 1.0. Populations shown: CEU, Yoruba, Hadza, French, Maasai, Mozabite, Bantu_S, Sandawe, BantuK, HGDP Namibian San, Schuster Namibian San, ≠Khomani Bushmen (550K), ≠Khomani Bushmen (Omni).]

Figure 5.2: Individual ancestry proportions for the indicated number of ancestral populations inferred for phenotyped individuals using ADMIXTURE. Each vertical bar is an individual, whose ancestry is indicated by its colors. Individuals are sorted by population, listed below the plot. K, the number of ancestral populations, ranges from 4 to 7. Inferences were made for six runs of unrelated individuals, and then aggregated into one plot (see Methods). At K = 7, one of the six runs did not infer the same ancestral clusters as the others; as a result, there are fewer individuals represented in the bottom panel (see Results).

[Figure 5.3 image: top panel, cross-validation error versus K value; bottom panel, log likelihood versus K value. K ranges from 4 to 12, with one line for each of runs 1 to 6.]

Figure 5.3: Cross-validation errors and log likelihoods for models estimated using ADMIXTURE for various numbers of ancestral populations. Each colored line indicates an independent run of unrelated ≠Khomani Bushmen individuals, as described in the Methods (Section 5.2.3). X-axes: K, the number of ancestral populations. Top panel y-axis: cross-validation error as inferred by ADMIXTURE [5]. Bottom panel y-axis: log likelihood of model.

[Figure 5.4 image: four scatterplots of ancestry proportion versus age, one each for Bantu, European, San, and East African ancestry.]

Figure 5.4: Ancestry proportions versus age of phenotyped ≠Khomani Bushmen individuals. Admixture proportions are inferred by ADMIXTURE, as in Figure 5.2. Regressions are simple linear regressions. Estimated coefficients and associated p-values are indicated in the corners of each plot.

Figure 5.5: Violin plots of height and skin pigmentation measurements for genotyped individuals. (A) Height (cm), (B) average arm pigmentation (M), and (C) tanning defined as wrist pigmentation minus inner arm pigmentation (Wrist M - Arm M) for all individuals, males, and females. Boxplots are drawn within the blue shaded regions, and the estimated kernel densities are drawn along each side.

[Figure 5.6 image: height versus sex (left panel; β̂_male = 12.549, p = 0) and height versus age (right panel; β̂_age = −0.098, p = 0.003).]

Figure 5.6: Results from a linear mixed model regression of height on sex and age. Lines indicate the partial regression coefficients of height on sex (left panel), and height on age (right panel). β̂_male and β̂_age coefficients, with their p-values, are located in the upper left and lower left corners of the respective plots. Points in red are males, black are females. Mixed model equation is in Equation 5.2.

[Figure 5.7 image: M Wrist − M Arm versus sex (left panel; β̂_male = 6.375, p = 0.001) and versus age (right panel; β̂_age = 0.176, p = 0).]

Figure 5.7: Results from a linear mixed model regression of tanning (wrist minus inner arm pigmentation) on sex and age. Lines indicate the partial regression coefficients of tanning on sex (left panel), and tanning on age (right panel). β̂_male and β̂_age coefficients, with their p-values, are located in the upper left and lower left corners of the respective plots. Points in red are males, black are females. Mixed model equation is in Equation 5.2.

[Figure 5.8 image: height versus proportion European ancestry (β̂_European = −0.365, p = 0.937), sex (β̂_male = 12.555, p = 0), and age (β̂_age = −0.098, p = 0.004).]

Figure 5.8: Results from a linear mixed model regression of height on sex, age, and proportion European ancestry. Lines indicate the partial regression coefficients of height on European ancestry (top left panel), sex (top right panel), and age (bottom left panel), using the mixed model equation of Equation 5.2. β̂_European, β̂_male, and β̂_age coefficients, with their p-values, are located in the corners of the respective plots. Points in red are males, black are females.

[Figure 5.9 image: average inner arm M versus proportion European ancestry (β̂_Eur. = −24.978, p = 0.001), Bantu ancestry (β̂_Bantu = 33.103, p = 0.001), San ancestry (β̂_San = 2.746, p = 0.655), and Eastern African ancestry (β̂_EastAfr. = 0.32, p = 0.991).]

Figure 5.9: Results from linear mixed model regressions of inner arm skin pigmentation on proportion ancestry, for various ancestries. Each plot is a separate regression model for each ancestry, using the mixed model equation of Equation 5.2. Lines indicate the regression coefficients of the model. β̂_Eur., β̂_Bantu, β̂_San, and β̂_EastAfr., with their p-values, are located in the corners of the respective plots.

Figure 5.10: Genetic relationships for pairs of individuals inferred from EMMAX using the Balding-Nichols method. Higher values in the matrix correspond to greater relatedness between individuals. Individuals in the matrix were clustered to facilitate interpretation. Anonymized codes for each individual are indicated on the bottom and right sides of the matrix.

Figure 5.11: QQ plots of −log10 SNP p-values for genome-wide association studies of height and inner arm pigmentation using EMMAX. Y-axes indicate observed −log10(p-value), and x-axes indicate expected −log10(p-value) based on the uniform distribution. (A) Height analysis, where sex and age are included as covariates. (B) Inner arm pigmentation analysis, where the proportion European and the proportion San ancestry are included as covariates.

[Figure 5.12 image. Panel B: position on chr13 (41.5 to 43 Mb); genes listed below the plot: FOXO1, ELF1, KBTBD7, OR7E37P, KIAA0564, DGKH, TNFSF11, MIR320D1, WBP4, NAA16, AKAP11, MRPS31, KBTBD6, C13orf15, SLC25A15, MTRF1, SUGT1P3, MIR621. Panel C: position on chr5 (35.5 to 37 Mb); genes listed below the plot: SPEF2, CAPSL, LMBRD2, SLC1A3, NIPBL, C5orf42, WDR70, IL7R, UGT3A2, RANBP3L, NUP155, UGT3A1, SKP2, MIR580, C5orf33.]

Figure 5.12: Manhattan plots of −log10 SNP p-values for height association using EMMAX. Y-axis indicates observed −log10(p-value). Covariates included are sex and age. (A) Whole genome. SNPs along the x-axis are ordered by chromosome and position, and shaded by chromosome. (B) 1000 kb region surrounding rs9315851 (in yellow). Y-axis also indicates recombination rate (cM/Mb) estimated from African 1000 Genomes data. LD (r²) with rs9315851 is indicated for all other genotyped SNPs (see legend). Locations of nearby genes are listed below the plot. (C) 1000 kb region surrounding rs2967055; see legend for part (B).

[Figure 5.13 image. Panels A to C: height residuals (y-axis) versus number of C alleles at rs9315851 (0, 1, 2), number of G alleles at rs2967055 (0, 1, 2), and number of T alleles at rs2531691 (1, 2). Panels D to F: maps of allele frequencies.]

Figure 5.13: Effect sizes by genotype and allele frequencies in worldwide HGDP populations for top height SNPs. A), B), C): Boxplots of effect sizes for the indicated SNPs: rs9315851, rs2967055, and rs2531691, respectively. Y-axes indicate the residuals of height after accounting for sex and age using a linear mixed model (see Methods); x-axis indicates the genotype for the plotted individuals. D), E), F): Map of allele frequencies indicated by pie charts in worldwide HGDP populations for rs9315851, rs2967055, and rs2531691, respectively (from http://hgdp.uchicago.edu).

Table 5.1: Pearson correlations of self-reported measures of ancestry with ancestries inferred by ADMIXTURE for each individual.

Self-Report Measure(a)    Correlation(b)    p-value
European*                 0.545             2.6x10^-9
European/Griqua*          0.514             2.7x10^-8
San†                      0.336             0.0005
San/Nama†                 0.387             5.4x10^-5
Bantu**                   0.231             0.019

(a) Self-reported measure is defined in the Methods, Equation 5.1.
(b) Correlations are performed against genetic ancestries inferred by ADMIXTURE (Figure 5.2); the genetic ancestry against which the measure is correlated is given by the symbol: * – European, † – San, and ** – Bantu.

Table 5.2: Means (and standard deviations) of measured phenotypes for all individuals and the subset of genotyped individuals.

Phenotype                             Overall          Females          Males

All Individuals (n = 103)(a)
Height (cm) (n = 102)                 155.23 (7.95)    151.78 (5.18)    163.90 (7.08)
Inner Arm Pigmentation (M)            55.01 (10.59)    55.41 (11.24)    54.00 (8.80)
Wrist Pigmentation (M)                70.96 (12.37)    69.12 (12.04)    75.64 (12.14)
Wrist - Inner Arm Pigmentation (M)    15.94 (9.60)     13.71 (9.06)     21.64 (8.67)

Genotyped Individuals (n = 86)
Height (cm) (n = 85)                  154.94 (8.17)    151.11 (5.02)    163.64 (7.24)
Inner Arm Pigmentation (M)            54.35 (10.54)    54.66 (10.54)    53.65 (8.72)
Wrist Pigmentation (M)                70.71 (12.40)    68.99 (12.46)    74.67 (11.54)
Wrist - Inner Arm Pigmentation (M)    16.35 (9.29)     14.33 (8.90)     21.02 (8.62)

(a) Sample size (n) for each set of individuals applies to each phenotype unless otherwise indicated.

Table 5.3: Estimates of heritability (h²) of height, inner arm skin pigmentation, and tanning (wrist minus inner arm pigmentation) using various methods.

Method(a)                                             ĥ²_height (S.E.)(b)    ĥ²_pigm (S.E.)     ĥ²_tanning (S.E.)
Parent/Offspring(c) (18 non-independent pairs)        0.608 (0.454)          0.808 (0.5426)     -0.025 (0.6778)
Parent/Offspring (12 independent pairs – set 1)       0.852 (0.843)          0.528 (0.685)      0.110 (0.7712)
Parent/Offspring (12 independent pairs – set 2)       0.895 (0.489)          0.4085 (0.5704)    -0.415 (0.546)
Mixed Model(d) (kinship matrix from pedigree)         0.999                  0.999              0.455
GCTA [310](d,e)                                       0.588 (0.379)          0.999 (0.173)      0.385 (0.272)

(a) See Methods (Section 5.2.4) for details of the indicated methods.
(b) S.E. indicates the standard error of the estimate; no standard errors could be calculated for the mixed model method.
(c) For height and tanning (wrist minus inner arm pigmentation), parent/offspring heritability estimates were calculated using the residuals from a regression of height on sex and age (see Methods). Non-adjusted values were used for inner arm pigmentation.
(d) For height and tanning, both the mixed model and GCTA methods account for sex and age as covariates. No covariates were included for analyses of inner arm pigmentation.
(e) Only genotyped individuals were included for the GCTA analysis; all other estimates include all individuals, even those who were not genotyped.

Table 5.4: Coefficients for regressions of height, pigmentation, and tanning on ancestries.

Phenotype                    Model(a)                    European     Bantu       San         East African
Height(b)                    European Only               -0.365
                             Bantu Only                               -1.053
                             San Only                                             -0.211
                             East African Only                                                8.369
Inner Arm Pigmentation(c)    European Only               -24.98***
                             Bantu Only                               33.10***
                             San Only                                             2.75
                             East African Only                                                0.320
                             European and Bantu          -21.01**     28.54**
                             European and San            -51.93***                -27.70***
                             Bantu and San                            45.11***    15.93*
                             European, Bantu, and San    -43.70       8.33        -20.43
Tanning (Wrist - Arm)(b)     European Only               3.73
                             Bantu Only                               2.25
                             San Only                                             -3.73
                             East African Only                                                8.57

(a) The “Model” column indicates the ancestries that are included in the model. P-values from the regression are indicated after each β̂_ancestry coefficient as follows: ***: p < 0.001; ** p < 0.01; * p < 0.05.
(b) For height and tanning (wrist minus inner arm pigmentation), coefficients are calculated from a model also including sex and age (i.e., β̂_ancestry values are partial regression coefficients); β̂_sex and β̂_age are not shown.
(c) For pigmentation, for models with more than one ancestry, β̂_ancestry coefficients are the partial regression coefficients with their associated p-values.

Table 5.5: Top SNPs associated with height from a genome-wide association study using EMMAX.

SNP          Location        Position     Effect (Other)(a)    Frequency(b)    β_SNP (S.E.)        p-value
rs9315851    chr 13q14.11    42203574     C* (T)               0.788           5.3686 (1.0752)     3.066x10^-7
rs2967055    chr 5p13        36472990     G* (A)               0.878           5.9926 (1.26013)    1.462x10^-6
rs2531691    chr 10q25.3     118639142    T* (C)               0.965           10.8652 (2.5348)    6.093x10^-6

(a) Effect allele indicates the allele with the associated β_SNP effect. * denotes the ancestral allele as indicated by dbSNP.
(b) Frequency of the effect allele in the ≠Khomani Bushmen.

A Supplemental Materials: On the Classification of Genetic Interactions

From Genetics. 2010 Mar; 184(3):827-37.

A.1 Supplemental Methods

A.1.1 Derivation of MLEs for Multiplicative and Minimum Epistasis

For multiplicative epistasis, the Maximum Likelihood (ML) estimators for the epistasis coefficient and fitness variance of double mutants are

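\[
\hat{\epsilon}^{(p)}_{ij,\mathrm{MLE}} = \frac{\sum_{n=1}^{N} w_{ij}^{n} w_{i}^{n} w_{j}^{n}}{\sum_{n=1}^{N} (w_{i}^{n})^{2} (w_{j}^{n})^{2}} - 1,
\qquad
\hat{\sigma}^{2\,(p)}_{ij,\mathrm{MLE}} = \frac{1}{N} \sum_{n=1}^{N} \left( w_{ij}^{n} - w_{i}^{n} w_{j}^{n} \left( 1 + \hat{\epsilon}^{(p)}_{ij,\mathrm{MLE}} \right) \right)^{2},
\tag{A.1}
\]

where the right superscript (p) indicates that the estimator is for multiplicative epistasis. For minimum epistasis, the ML estimators for the epistasis coefficient and fitness variance of double mutants are

\[
\hat{\epsilon}^{(m)}_{ij,\mathrm{MLE}} = \frac{1}{N} \sum_{n=1}^{N} \left( w_{ij}^{n} - \min(w_{i}^{n}, w_{j}^{n}) \right),
\qquad
\hat{\sigma}^{2\,(m)}_{ij,\mathrm{MLE}} = \frac{1}{N} \sum_{n=1}^{N} \left( w_{ij}^{n} - \min(w_{i}^{n}, w_{j}^{n}) - \hat{\epsilon}^{(m)}_{ij,\mathrm{MLE}} \right)^{2},
\tag{A.2}
\]

where the right superscript (m) indicates that the estimator is for minimum epistasis. Formulas for the additive MLEs are in Chapter 2.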
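Equations A.1 and A.2 translate directly into code. The sketch below assumes that the fitness measurements are supplied as equal-length arrays of N paired replicates and that the multiplicative estimator is the least-squares ratio over replicates; the function names are hypothetical, and this is not the implementation used for the published analyses.

```python
import numpy as np

def multiplicative_mle(w_i, w_j, w_ij):
    """ML estimates (epsilon, variance) under multiplicative epistasis.

    Model assumed: w_ij^n ~ Normal(w_i^n * w_j^n * (1 + epsilon), sigma^2).
    """
    w_i, w_j, w_ij = (np.asarray(a, dtype=float) for a in (w_i, w_j, w_ij))
    product = w_i * w_j
    epsilon = np.sum(w_ij * product) / np.sum(product ** 2) - 1.0
    variance = np.mean((w_ij - product * (1.0 + epsilon)) ** 2)
    return epsilon, variance

def minimum_mle(w_i, w_j, w_ij):
    """ML estimates (epsilon, variance) under minimum epistasis.

    Model assumed: w_ij^n ~ Normal(min(w_i^n, w_j^n) + epsilon, sigma^2).
    """
    w_i, w_j, w_ij = (np.asarray(a, dtype=float) for a in (w_i, w_j, w_ij))
    baseline = np.minimum(w_i, w_j)
    epsilon = np.mean(w_ij - baseline)
    variance = np.mean((w_ij - baseline - epsilon) ** 2)
    return epsilon, variance

# Hypothetical replicate fitness values, for illustration only.
w_i = np.array([0.80, 0.85, 0.90])
w_j = np.array([0.60, 0.55, 0.65])
w_ij = np.array([0.55, 0.50, 0.62])
eps_p, var_p = multiplicative_mle(w_i, w_j, w_ij)
eps_m, var_m = minimum_mle(w_i, w_j, w_ij)
```

Both variance estimates divide by N rather than N − 1, as in the equations, so they are maximum-likelihood rather than unbiased estimates.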

A.1.2 Simulation of Double Mutant Fitness Values

When simulating double mutant fitness values, the support is [0, 1.2], allowing for both synergistic and antagonistic epistasis. If $g_i$ and $g_j$ act multiplicatively, that is, there is no multiplicative epistasis, $w_{ij} \sim N(w_i w_j, \sigma_{ij}^2)_{[0,1.2]}$, and if $g_i$ and $g_j$ act according to the minimum definition of independence, $w_{ij} \sim N(\min(w_i, w_j), \sigma_{ij}^2)_{[0,1.2]}$. (The distribution under additive independence is shown in Chapter 2.) For multiplicative epistasis, $w_{ij} \sim N(w_i w_j (1 + \epsilon_{ij}), \sigma_{ij}^2)_{[0,1.2]}$, and for minimum epistasis, $w_{ij} \sim N(\min(w_i, w_j) + \epsilon_{ij}, \sigma_{ij}^2)_{[0,1.2]}$. (The distribution for additive epistasis is given in Chapter 2.)
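The simulation scheme in this subsection could be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: the subscript [0, 1.2] is read here as a normal distribution truncated to that interval, truncation is done by simple rejection sampling, and the function names and the `model` argument are hypothetical. Note that the function takes the standard deviation, the square root of $\sigma_{ij}^2$.

```python
import random

def truncated_normal(mean, sd, low, high, rng):
    """Draw from N(mean, sd^2) restricted to [low, high] by rejection."""
    while True:
        x = rng.gauss(mean, sd)
        if low <= x <= high:
            return x

def simulate_double_mutant(w_i, w_j, eps, sd, model, n, seed=0,
                           low=0.0, high=1.2):
    """Simulate n double-mutant fitness values.

    model is "multiplicative" or "minimum" (hypothetical labels);
    eps = 0 gives the corresponding null model of independence.
    """
    rng = random.Random(seed)
    if model == "multiplicative":
        mean = w_i * w_j * (1.0 + eps)
    elif model == "minimum":
        mean = min(w_i, w_j) + eps
    else:
        raise ValueError("model must be 'multiplicative' or 'minimum'")
    return [truncated_normal(mean, sd, low, high, rng) for _ in range(n)]

# Example: 50 replicates with negative multiplicative epistasis.
values = simulate_double_mutant(0.8, 0.8, -0.3, 0.05, "multiplicative", 50)
```

Rejection sampling is adequate when the mean lies well inside the support, as it does for the parameter values used in the tables of this appendix; it would become slow if the mean were far outside [0, 1.2].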

A.1.3 Details of Analysis of Experimental Data

The first dataset (St Onge et al. [255]) included 26 non-essential genes known to confer resistance to the DNA-damaging agent methyl methanesulfonate (MMS), and contained double-deletion strains for all pairs that were possible to construct (a total of 323 double mutant strains). The authors measured the fitnesses of single and double mutants in media both with and without MMS. For both media types, single mutants have approximately 10 replicate fitness measurements, whereas double mutants have only two. We therefore modify our procedure to analyze this dataset. For each gene in each pair, we randomly partition its single-mutant fitness values into two parts and use the mean of each part as one of the two single-mutant fitness replicates. Because this partitioning is random, we repeat the procedure 1000 times, each time determining the subtype of epistasis for each pair. We then select gene pairs for which one of the three epistatic subtypes was chosen in at least 900 of the replicates, and take as the final epistatic subtype for each pair the one chosen in the majority of the replicates. Changing the cutoff to either 950 or 800 replicates to create this “high-confidence” epistatic set does not greatly affect our results. To obtain the final values, we average the LRT statistic ($\lambda$) and $\epsilon$ over the replicates in which the given subtype was selected.

When examining pairs of genes which share “specific” functional links (defined in Chapter 2), we obtain the appropriate Gene Ontology (GO) terms in all three GO categories [270] for the genes of interest with AmiGO (for the St Onge et al. [255] data, GO database release 2009-09-08; for the Jasnos and Korona [131] data, GO database release 2009-09-22).

In order to assess significance of both specific functional links and shared GO Slim terms among the inferred epistatic pairs, we implement both Fisher’s Exact Test and a permutation procedure with 5000 iterations. For each permutation, we randomly partition the n gene pairs originally tested for epistasis into two groups, x pairs and n − x pairs, where x is the number of inferred epistatic pairs. We then count the number out of the x pairs that share either a functional link or a GO Slim term link. We set the p-value as the proportion of iterations in which the number of links exceeds our observation.

To examine only experimentally-verified interactions in the BIOGRID database [257], we ignored interactions inferred quantitatively from phenotype data and whose “evidence codes” were one of the following: “Dosage Growth Defect,” “Dosage Lethality,” “Dosage Rescue,” “Phenotypic Enhancement,” “Phenotypic Suppression,” “Synthetic Defect,” “Synthetic Haploinsufficiency,” “Synthetic Lethality,” and “Synthetic Rescue.” We focused only on experimentally-validated (physical) interactions, identified by co-localization, co-purification, etc.

In addition, we note that in our formulation (see Chapter 2), the multiplicative epistatic model is undefined when double mutants have a fitness of 0 ($\hat{\epsilon}_{ij} = -1$ and $\hat{\sigma}^2_{ij} = 0$). In our current implementation of the method, we do not attempt to further sub-classify these interactions as either additive or minimum epistasis. Because synthetic lethals have a double mutant fitness of 0, we consider any further sub-classification unnecessary.
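The permutation procedure described in this subsection for assessing the number of functional links among epistatic pairs can be sketched as below. This is a minimal illustration, not the study's code: the function name and the two boolean input lists are hypothetical, and the strict inequality follows the wording "exceeds our observation".

```python
import random

def permutation_p_value(has_link, is_epistatic, iterations=5000, seed=0):
    """Permutation p-value for the number of links among epistatic pairs.

    has_link[k]     -- True if tested gene pair k shares a link
    is_epistatic[k] -- True if gene pair k was inferred to be epistatic
    Returns (observed link count, p-value).
    """
    rng = random.Random(seed)
    n = len(has_link)
    x = sum(1 for e in is_epistatic if e)
    observed = sum(1 for l, e in zip(has_link, is_epistatic) if l and e)
    indices = list(range(n))
    exceed = 0
    for _ in range(iterations):
        # Randomly choose x of the n tested pairs (the other n - x form
        # the second group) and count links among the chosen pairs.
        chosen = rng.sample(indices, x)
        links = sum(1 for k in chosen if has_link[k])
        if links > observed:
            exceed += 1
    return observed, exceed / iterations
```

With a strict inequality, a permutation that merely ties the observed count does not contribute to the p-value, so this estimate can be somewhat smaller than the one-sided Fisher's Exact Test p-value computed on the same counts.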

A.2 Supplemental Results

A.2.1 Additional Analyses of Simulated Data

We assess the accuracy of inferring the correct epistatic subtype when decreasing the sample size to 10 (Table A.2; see also Table 2.1, Chapter 2). The ability to distinguish null models from epistatic models is high, with accuracy of approximately 0.85 to 0.9 for various values of $\epsilon$. While the accuracy of inferring minimum epistasis decreases slightly (to about 80%), inference of multiplicative epistasis is heavily affected by a reduction of sample size (accuracy of 0.4 to 0.6). The inference of additive epistasis is robust to a decrease in replicates when $\epsilon$ is negative; however, when $\epsilon$ is positive, accuracy decreases substantially (0.5 to 0.6). See also Table A.3.

We also relaxed our restriction to deleterious mutations to assess the effect of a more complex fitness landscape on the estimation of epistasis. As illustrated in Table A.1, our method is robust to beneficial mutations, which do not decrease the accuracy of inferring the correct epistatic subtype. If beneficial mutations are of specific interest, in the future it may be worthwhile to consider the Log model as an epistatic subtype [165, 236], as its exclusion may only be warranted for deleterious mutations. We also note that when at least one of the single mutant fitness values in a pair is 1, the predictions under each null model are equivalent and we have no power to detect the correct epistatic subtype (not shown).

Another practical issue is that the accuracy of constructing an epistatic network (indicating all pairwise interactions) decreases substantially as the number of genes increases. Our method is effective in detecting the existence and type of epistasis for one gene pair (accuracy is approximately 0.95 for minimum epistasis; Table 2.1, Chapter 2). However, for more than 10 genes we must select the epistatic subtype for more than 45 pairs, decreasing the accuracy of the full network to at most $0.95^{45} \approx 0.1$. This issue arises in all large-scale studies of epistasis, and is not unique to our method. To improve performance, we advocate controlling the FDR.
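The arithmetic behind the network-accuracy bound, and one common way of controlling the FDR, can be sketched as follows. Two assumptions are made here and are not confirmed by the text: the bound treats the classifications of different gene pairs as independent, and the FDR procedure is taken to be the Benjamini-Hochberg step-up procedure. Function names are hypothetical.

```python
from math import comb

def network_accuracy(per_pair_accuracy, n_genes):
    """Probability that every pair in an n-gene network is classified
    correctly, assuming independent pairs (an assumption of this sketch)."""
    return per_pair_accuracy ** comb(n_genes, 2)

def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans (same order as p_values) marking the
    hypotheses rejected at FDR level q.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    cutoff_rank = 0
    for rank, k in enumerate(order, start=1):
        # Largest rank whose p-value falls under the BH line q * rank / m.
        if p_values[k] <= q * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for k in order[:cutoff_rank]:
        rejected[k] = True
    return rejected

# 10 genes give comb(10, 2) = 45 pairs; 0.95 ** 45 is roughly 0.1.
print(comb(10, 2), round(network_accuracy(0.95, 10), 3))
```

The first function makes the point of the paragraph concrete: even a per-pair accuracy of 0.95 leaves only about a one-in-ten chance of an entirely correct 10-gene network, which is why an error rate defined over the whole set of tests is the more useful target.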

A.2.2 Additional Analyses of Experimental Data

To select one among the three possible multiple-hypothesis testing procedures for the St Onge et al. [255] dataset, we examine the number of inferred epistatic pairs which share “specific” functional links (described in Chapter 2). 36 of the 323 pairs examined in the dataset share a specific functional link (including five synthetic lethals in the dataset). When using the Bonferroni correction, out of the 25 gene pairs for which we identify epistasis, only 1 gene pair has a functional link. (See Discussion in Chapter 2 for implications of this result.) Implementing the FDR procedure [26], we find 188 gene pairs with epistasis (not including synthetic lethal double mutants), of which 20 have a functional link. St Onge et al. [255] discover 133 gene pairs with significant epistasis; as they find this to be a reasonable number, 188 pairs also appears reasonable. Finally, with the pFDR procedure [258], we find 222 gene pairs with epistasis; this procedure rejects neutrality for all pairs for which an epistatic model was originally selected (for the pFDR, the value above which p-values are estimated to be uniformly distributed is 0.1). In order to assess significance of these values, we perform Fisher’s exact test and a permutation procedure (described in Chapter 2); however, the number of functional links observed among epistatic pairs is not significant for any of the multiple-testing measures.

We show several examples of epistatic networks found with our method, which considers all three epistatic subtypes and their corresponding null models (with the FDR procedure), when examining the St Onge et al. [255] dataset. In Figure A.6(b), we show the epistatic interactions inferred between all pairs of genes with functional links, as well as the type of epistasis for each connection. We find a mixture of positive and negative epistasis, and slightly different results than the original authors. In Figure A.6(c), we show, following the original authors, the homologous recombination pathway including the genes RAD51, RAD52, RAD54, RAD55 and RAD57. While St Onge et al. [255] find 10 epistatic pairs among these genes, we find only 7 epistatic pairs, all of which are classified as minimum epistasis. In Figure A.6(d), we present epistasis found for genes of the SHU complex (SHU1, SHU2, CSM2 and PSY3), as do St Onge et al. [255]. We do not find PSY3 to interact with any of the genes in this pathway, and find a mixture of multiplicative and minimum epistasis.

In constructing the “high-confidence” set of epistatic pairs through sample replications (Chapter 2), we are likely more conservative than St Onge et al. [255] in identifying epistasis. In addition, these authors do not implement any multiple-hypothesis testing procedures. This probably gives their method slightly more power than ours, but at the expense of more false positive epistatic pairs.

We also searched for over-representation of GO Slim terms among genes involved in each of the additive, multiplicative, and minimum subtypes, as well as in positive and negative epistasis; however, we found no significant results (The Gene Ontology Consortium [270], http://www.yeastgenome.org).

We next examine the Jasnos and Korona [131] dataset and look for “specific” functional links that exist among epistatic pairs inferred with each of the three multiple hypothesis testing procedures. We examine Gene Ontology (GO) [270] terms with fewer than 200 genes associated with them (described in Chapter 2) and find 25 links among all 636 tested pairs. Using the Bonferroni procedure, 56 epistatic pairs are significant, of which 5 have a functional link. The FDR procedure results in 352 significant epistatic pairs, of which 19 have a functional link. For the pFDR, we detect 471 epistatic pairs, of which 21 have a functional link.
When assessing significance of the number of functional links among epistatic pairs by permutation (see Chapter 2), we find that only the FDR results in a significant number of functional links (p-value = 0.024).

The epistatic networks discovered with our method and the FDR procedure in the Jasnos and Korona [131] dataset are small and isolated, likely due to the examination of only a subset of the possible pairwise genotypes in the study. We annotate functions of the genes in the networks according to the yeast Gene Ontology (GO) Slim Biological Process terms (following Jasnos and Korona [131], see Chapter 2). Four examples of constructed epistatic networks denoting pairwise epistatic interactions are shown in Figure A.7. Figure A.7(b) shows minimum epistatic interactions between three genes involved in the biological processes of mitochondrion organization and translation, all of which are functionally linked (as defined above) with the GO term mitochondrial protein formation and synthesis (GO:0032543); MRPL37 and MSK1 are also linked through the mitochondria (GO:0005739). These functional links suggest that the epistatic interactions we detect may in fact be meaningful.

In Figure A.7(c), we show that genes IFM1 and SNT309 have alleviating positive epistasis; we also find a positive minimum epistatic interaction between SNT309 and BUD28. We find a larger linked group of six genes, among which there are two pairs of genes that have epistasis and share a functional link (Figure A.7(d)). In this network, we find a mixture of both minimum and additive epistasis, and both positive and negative epistasis. The functional links and shared GO Slim annotations in this network suggest that this novel epistasis is likely biologically significant.

Finally, Figure A.7(e) illustrates an example of a larger group of twelve genes connected by three types of epistasis. Of the twelve genes, four (RDH54, HCM1, CIN8 and SGS1) have roles in the cell cycle. It is possible that other genes in this network, such as SWA2, may also be involved in this process, although they have not yet been studied. A specific functional link also exists between genes SGS1 and CIN8, which are both involved in mitotic chromosome segregation. We believe that these, along with the other interactions identified, are true epistatic interactions.

Although we search for significant enrichment of particular GO terms amongst genes involved in additive, multiplicative, and minimum epistasis, we fail to find any, indicating that the subtypes of epistasis are likely distributed over many functional classes of genes. This lack of significance is also likely due to the small number of double mutants examined.

As discussed in Chapter 2, we assess the normality of the replicate fitness data for both studied datasets. A deviation from normality may suggest that the $\chi^2$ approximation for the distribution of the likelihood ratio test statistic under the null hypothesis is inappropriate in some cases. Double-mutant fitness values of the St Onge et al. [255] dataset certainly deviate from normality, as only two replicates are measured (Figure A.8). In Figure A.8 we also show QQ plots for several of the genes of the SHU complex, CSM2, CSM3, and PSY3 (presented in St Onge et al. [255]). Although functional links exist, we do not find PSY3 to interact with either gene; this could be due to the deviation of the CSM3 single-mutant fitness values from normality. For the Jasnos and Korona [131] dataset, we examine the QQ plots of the genes NCS2 and SGO1 (Figure A.9); although a previous interaction has been identified for this pair, we do not detect epistasis. The deviation from normality of the single-mutant fitness values of SGO1 and the small number of double-mutant fitness replicates could explain our inability to detect this interaction. The distributions of fitness values for several other genes show similar deviations from normality (not shown).

A.2.3 Consideration of Alternative Model Selection Procedures

To explore whether the accuracy of inferring the epistatic subtype for pairs of genes could be improved, we also tried two alternatives to the BIC procedure. The first method is similar to cross-validation. The sample was split randomly into a modeling set and a testing set. We fit each of the six models to the modeling set and computed the mean squared error of each model on the testing set. After repeating this procedure 1000 times, we selected the model with the smallest accumulated mean squared error. Overall, this method failed to outperform the BIC (data not shown).

The difficulty with testing for epistasis is that the null hypothesis can be any one of the three null models of independence, and is thus not well-specified. We explored using the Expectation-Maximization (EM) algorithm [63] for this problem in order to perform a different likelihood ratio test. Under the null model of no epistasis, double mutants can be distributed as a mixture of the three null models; under the epistatic model, the distribution is a mixture of the three epistatic subtypes. Using the EM algorithm, we maximized the likelihood under each model, and performed a likelihood ratio test. If the null hypothesis of independence was rejected, we selected the epistatic measure for the gene pair as the mixture component that had the highest estimated proportion. However, we found that this method resulted in poorer performance (data not shown). Thus, it appears that the simpler BIC method currently implemented is the most appropriate method for this estimation problem.
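The split-sample alternative described in this subsection might look like the sketch below. Several details are assumptions made only for illustration and are not taken from the text: the additive expectation is written as w_i + w_j − 1 (the additive model itself is defined in Chapter 2), the split is half and half, replicates are paired by index, and all function and model names are hypothetical.

```python
import random

def _bases(w_i, w_j):
    # Expected double-mutant fitness under each null model of independence.
    return {
        "additive": [a + b - 1.0 for a, b in zip(w_i, w_j)],  # assumed form
        "multiplicative": [a * b for a, b in zip(w_i, w_j)],
        "minimum": [min(a, b) for a, b in zip(w_i, w_j)],
    }

def _fit_epsilon(name, base, w_ij):
    # Least-squares (equivalently, normal ML) estimate of epsilon.
    if name == "multiplicative":
        num = sum(d * b for d, b in zip(w_ij, base))
        den = sum(b * b for b in base)
        return num / den - 1.0
    return sum(d - b for d, b in zip(w_ij, base)) / len(base)

def _predict(name, base, eps):
    if name == "multiplicative":
        return [b * (1.0 + eps) for b in base]
    return [b + eps for b in base]

def select_model(w_i, w_j, w_ij, repeats=1000, seed=0):
    """Pick the model with the smallest accumulated test-set MSE."""
    rng = random.Random(seed)
    bases = _bases(w_i, w_j)
    total = {f"{name}_{kind}": 0.0
             for name in bases for kind in ("null", "epistatic")}
    idx = list(range(len(w_ij)))
    n_test = max(1, len(idx) // 2)
    for _ in range(repeats):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        for name, base in bases.items():
            eps = _fit_epsilon(name, [base[k] for k in train],
                               [w_ij[k] for k in train])
            # Null models fix epsilon at 0; epistatic models use the fit.
            for kind, e in (("null", 0.0), ("epistatic", eps)):
                pred = _predict(name, [base[k] for k in test], e)
                mse = sum((w_ij[k] - p) ** 2
                          for k, p in zip(test, pred)) / len(test)
                total[f"{name}_{kind}"] += mse
    return min(total, key=total.get), total
```

Unlike the BIC, this criterion has no explicit penalty for the extra parameter of the epistatic models; overfitting is discouraged only through the held-out error, which is noisy when very few replicates are available.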

A.3 Supplemental Figures and Tables

[Figure A.1 image: three density plots of BIC differences. Panels: (A) Simulated with Additive Epistasis, (B) Simulated with Multiplicative Epistasis, (C) Simulated with Minimum Epistasis. x-axis: BIC Difference; y-axis: Density. Each panel shows the BIC of the simulated epistatic model minus the BIC of each of the other five models.]

Figure A.1: Power of our method to uncover the true (simulated) epistatic model assuming $\epsilon_{ij} = -0.3$, $\sigma_{ij} = 0.1$, and $\mu_i = \mu_j = 0.8$. (A) The area under each curve and less than zero represents the percentage of simulations with the additive epistatic model favored over the other models shown in the legend. The red, blue, yellow, green, and purple lines represent the comparison of the BIC between the additive epistatic model and the multiplicative epistatic model, minimum epistatic model, additive null model, multiplicative null model, and minimum null model, respectively. (B) and (C) are shown in a manner similar to (A) for the multiplicative and minimum epistatic models, respectively.

[Figure A.2 image: three density plots of BIC differences. Panels: (A) Simulated with Additive Epistasis, (B) Simulated with Multiplicative Epistasis, (C) Simulated with Minimum Epistasis. x-axis: BIC Difference; y-axis: Density.]

Figure A.2: Power of our method to uncover the true (simulated) epistatic model assuming $\epsilon_{ij} = 0.3$, $\sigma_{ij} = 0.1$, and $\mu_i = \mu_j = 0.8$. Legends are as defined in Figure A.1.

[Figure A.3 image: four rows, (a) to (d), of three density plots each. Panels in each row: (1) Simulated with Additive Epistasis, (2) Simulated with Multiplicative Epistasis, (3) Simulated with Minimum Epistasis. x-axis: MSE; y-axis: Density. Curves in each panel: MSE_Add, MSE_Prod, MSE_Min, MSE_Null.]

Figure A.3: The distribution of the mean squared error (MSE) of the MLE of the epistatic parameter under the additive, multiplicative, and minimum epistatic models as well as the corresponding null model (with mean fitnesses of single mutants $\mu_i = 0.9$ and $\mu_j = 0.6$). (a)1-3 show the distribution of MSE assuming epistatic coefficient $\epsilon_{ij} = -0.3$; (b)1-3 assume $\epsilon_{ij} = -0.1$; (c)1-3 assume $\epsilon_{ij} = 0.1$; (d)1-3 assume $\epsilon_{ij} = 0.3$. The red, blue, and yellow lines represent the distribution of the MSE of $\hat{\epsilon}_{\mathrm{MLE}}$ under the additive, multiplicative, and minimum epistatic models, respectively; the green line represents the MSE of $\hat{\epsilon}_{\mathrm{MLE}}$ under the null model (of the simulated definition).

[Figure A.4 image: Venn diagram. Regions, left to right: St Onge et al. method only, 88 pairs (9 experimentally identified); both methods, 25 pairs; BIC method only, 105 pairs (3 experimentally identified).]

Figure A.4: Venn diagram comparing the overlap of epistatic pairs found examining the St Onge et al. [255] dataset using the authors’ original method and our BIC method (including all three epistatic subtypes and their corresponding null models with the FDR procedure). For the pairs identified by only one method, we indicate the number for which an interaction has already been experimentally identified (obtained using the BIOGRID database).

[Figure A.5 image: histogram. x-axis: ε (range approximately −1 to 4); y-axis: Number of gene pairs. Series: non-GO-linked pairs and GO-linked pairs.]

Figure A.5: Distribution of the epistasis values ($\epsilon$) for significant epistatic pairs, determined using the FDR procedure with the BIC method (including all three epistatic subtypes and their corresponding null models) for all gene pairs of the St Onge et al. [255] dataset in the presence of MMS (bin size = 0.05). $\epsilon$ values for gene pairs with specific functional links as determined by GO terms (explained in text) are shown in red (mean of -0.060, not significantly different than 0), and for those without specific functional links are shown in black. A similar figure is shown in St Onge et al. [255].

[Figure A.6 image: (a) legend of epistatic subtypes (Additive, Multiplicative, Minimum); (b), (c), (d) network diagrams whose nodes include RAD5, RAD18, RAD51, RAD52, RAD54, RAD55, RAD57, RAD59, RTT101, SHU1, SHU2, SGS1, CSM2, CSM3, MMS1, and MMS4, with edges labeled “+”, “-”, “I”, or “Lethal”.]

Figure A.6: Several epistatic networks constructed based on the St Onge et al. [255] dataset (using the FDR procedure with the BIC method including all three epistatic subtypes and their corresponding null models). Legend in (a) denotes the possible subtypes for each epistatic interaction. (b) depicts all epistatic interactions that were found between genes that also share “specific” functional links (defined in text); note that epistatic interactions were inferred between other genes in this network that do not have a specific functional link (not shown). (c) represents all inferred interactions in the RAD homologous recombination pathway. (d) represents all inferred interactions in the SHU pathway. “+” sign indicates positive epistasis, “-” sign indicates negative epistasis, and “I” indicates that the interaction has been experimentally identified.

[Figure A.7 image: (a) legend of epistatic subtypes (Additive, Multiplicative, Minimum) and of GO Slim Biological Processes (Chromosome Organization, Cytoskeleton Organization, Mitochondrion Organization, Transcription, Translation, DNA Metabolic Processes, RNA Metabolic Processes, Cell Cycle, Cellular Component Morphogenesis, Cellular Homeostasis, Cytokinesis, Response to Stress, Unknown/Other); (b) network of MRPL31, MRPL37, and MSK1; (c) network of IFM1, BUD28, and SNT309; (d) and (e) networks whose nodes are FYV3, SWA2, FYV7, RPS4B, RAD6, SGS1, PET111, KRE22, PRO1, SRB8, SNF5, RDH54, CIN8, CAP2, NPT1, MRPL32, HCM1, and JEN1.]

Figure A.7: Several epistatic networks constructed based on the Jasnos and Korona [131] dataset (using the FDR procedure with the BIC method including all three epistatic subtypes and their corresponding null models). Legend in (a) shows each type of epistatic interaction and each color which corresponds to one biological process as listed by the Gene Ontology (GO Slim terms). (b), (c), (d), and (e) are examples of inferred three-gene, three-gene, six-gene, and twelve-gene networks, respectively. “+” sign indicates positive epistasis, “-” sign indicates negative epistasis, “*” indicates that the gene pair has a specific functional link (described in text).

[Figure A.8 image: six quantile-quantile plots. Top row panels: CSM2, CSM3, PSY3; bottom row panels: CSM2/CSM3, CSM2/PSY3, CSM3/PSY3. x-axis: Theoretical Quantiles; y-axis: Sample Quantiles.]

Figure A.8: Quantile-Quantile plot of the single-mutant fitness values for the genes CSM2, CSM3, and PSY3 (top row), and the double-mutant fitness values for CSM2/CSM3, CSM2/PSY3, and CSM3/PSY3 (bottom row) (St Onge et al. [255] dataset). We identify an interaction between CSM2 and CSM3, but not for the other two pairs, when analyzing this dataset.

[Figure A.9 image: three quantile-quantile plots. Top row panels: NCS2, SGO1; bottom row panel: NCS2/SGO1. x-axis: Theoretical Quantiles; y-axis: Sample Quantiles.]

Figure A.9: Quantile-Quantile plot of the single-mutant fitness values for the genes NCS2 and SGO1 (top row), and the double-mutant fitness values of NCS2/SGO1 (bottom row) (Jasnos and Korona [131] dataset). We do not identify an interaction between these two genes.

Table A.1: Fractions of simulations that recover the true model among the six models when allowing both beneficial and deleterious mutations, i.e., the range for fitness of single mutants is [0.0, 1.2] and the range for fitness of double mutants is [0.0, 1.5].

µi    µj    PN(a)   PN(p)   PN(m)   ε_ij   PE(a)   PE(p)   PE(m)
0.8   0.8   0.987   0.986   0.993   -0.3   0.996   0.903   0.929
0.8   0.8                           -0.2   0.991   0.899   0.956
0.8   0.8                           -0.1   0.964   0.869   0.982
0.8   0.8                            0.1   0.742   0.719   0.997
0.8   0.8                            0.2   0.581   0.602   0.998
0.8   0.8                            0.3   0.740   0.613   0.998
0.9   0.6   0.989   0.987   0.994   -0.3   0.999   0.964   0.982
0.9   0.6                           -0.2   0.997   0.948   0.971
0.9   0.6                           -0.1   0.986   0.901   0.979
0.9   0.6                            0.1   0.878   0.855   0.997
0.9   0.6                            0.2   0.836   0.843   0.999
0.9   0.6                            0.3   0.891   0.837   0.999

The PN values are given once for each (µi, µj) block. For all the simulations, the fitness standard deviation σi = σj = σij = 0.05. The rest of the parameters are defined in Table 2.1, Chapter 2. 50 replicate fitness values are simulated.

Table A.2: Fractions of simulations that recover the true model among the six models, with different values of the epistasis coefficient (ε_ij) and mean fitnesses of single mutants (µi) and (µj); number of fitness replicates is equal to 10.

µi    µj    PN(a)   PN(p)   PN(m)   ε_ij   PE(a)   PE(p)   PE(m)
0.8   0.8   0.842   0.824   0.926   -0.3   0.836   0.574   0.705
0.8   0.8                           -0.2   0.807   0.533   0.732
0.8   0.8                           -0.1   0.758   0.396   0.762
0.8   0.8                            0.1   0.59    0.423   0.871
0.8   0.8                            0.2   0.472   0.335   0.875
0.8   0.8                            0.3   0.491   0.506   0.898
0.9   0.6   0.856   0.821   0.903   -0.3   0.85    0.629   0.742
0.9   0.6                           -0.2   0.833   0.589   0.753
0.9   0.6                           -0.1   0.796   0.405   0.732
0.9   0.6                            0.1   0.677   0.374   0.848
0.9   0.6                            0.2   0.64    0.487   0.873
0.9   0.6                            0.3   0.645   0.527   0.884

The PN values are given once for each (µi, µj) block. For all simulations, the fitness standard deviation σi = σj = σij = 0.05. The rest of the parameters are defined in Table 2.1, Chapter 2.

Table A.3: Fractions of simulations that recover the true model among the six models with different values of the epistasis coefficient (ε_ij) and mean fitnesses of single mutants (µi) and (µj); number of replicates is equal to 2.

µi    µj    PN(a)   PN(p)   PN(m)   ε_ij   PE(a)   PE(p)   PE(m)
0.8   0.8   0.261   0.195   0.338   -0.3   0.527   0.315   0.469
0.8   0.8                           -0.2   0.519   0.285   0.511
0.8   0.8                           -0.1   0.498   0.232   0.533
0.8   0.8                            0.1   0.444   0.175   0.591
0.8   0.8                            0.2   0.36    0.199   0.599
0.8   0.8                            0.3   0.255   0.344   0.603
0.9   0.6   0.273   0.161   0.301   -0.3   0.514   0.341   0.455
0.9   0.6                           -0.2   0.509   0.302   0.494
0.9   0.6                           -0.1   0.512   0.231   0.505
0.9   0.6                            0.1   0.492   0.173   0.567
0.9   0.6                            0.2   0.446   0.202   0.557
0.9   0.6                            0.3   0.394   0.274   0.555

The PN values are given once for each (µi, µj) block. For all the simulations, the fitness standard deviation σi = σj = σij = 0.05. The rest of the parameters are defined in Table 2.1, Chapter 2.

Table A.4: Summary of gene pairs with the indicated epistatic subtypes, inferred using the FDR procedure with the BIC method which considers all three epistatic subtypes and their corresponding null models (St Onge et al. [255] dataset, with fitnesses measured in the absence of MMS).

Epistatic Subtype   Study S (No MMS)   Mean ε       Median ε (q0.5)
All                 171 (100%)         -0.063***    -0.057
Additive            4 (2.3%)           0.049*       0.074
Multiplicative      7 (4.1%)           -0.109**     -0.084
Minimum             155 (90.6%)        -0.063***    -0.057

Numbers are the counts of each type, and percentages are given of the total number of epistatic pairs. The mean and median (q0.5) of the epistatic parameter (ε) are given for each subtype, with “*” indicating that the mean of ε is significantly different from 0 (p-value ≤ 0.05; “**” implies p-value ≤ 0.01, “***” implies p-value ≤ 0.001). Note that 5 of the epistatic pairs are synthetic lethal (not shown in table).

Table A.5: Comparison of validation measures for each dataset for different variations of the FDR and BIC procedure, considering only a subset of epistatic subtypes with their corresponding null models: all epistatic subtypes (A, P, M), only the additive and multiplicative subtypes (A, P), only the additive (A), only the multiplicative (P), or only the minimum (M) subtype (see text for details) for Study S, fitnesses measured in the absence of MMS. Numbers in parentheses indicate p-values by Fisher’s Exact Test, “*” indicates significance. Study S refers to the St Onge et al. [255] dataset, measured in the absence of MMS. Numbers in parentheses in the row labels indicate the total number of tested pairs and the total number of each type of link found in each complete dataset.

                                           Subtypes Considered in BIC Procedure
Study S (No MMS)                           A, P, M         A, P            A               P               M
Number Found (323)                         171             120             156             99              203
Functional Links (36)                      20 (0.4391)     20 (0.0135)*    25 (0.0056)*    16 (0.0460)*    18 (0.9682)
GO Slim Terms (All) (307)                  165 (0.1558)    115 (0.4151)    149 (0.4548)    95 (0.4241)     196 (0.08893)
Experimentally Identified (29)             14              14              18              12              14
GO Slim Terms (Biological Process) (283)   158 (0.0058)*   114 (0.0010)*   145 (0.0037)*   95 (0.0011)*    186 (0.0043)*

Table A.6: Comparison of validation measures for each dataset for different variations of the FDR and BIC procedure, considering only a subset of epistatic subtypes with their corresponding null models: all epistatic subtypes (A, P, M), only the additive and multiplicative subtypes (A, P), only the additive (A), only the multiplicative (P), or only the minimum (M) subtype (see text for details). Numbers in parentheses indicate p-values by permutation, “*” indicates significance. Study S refers to the St Onge et al. [255] dataset. Study J refers to the Jasnos and Korona [131] dataset. Numbers in parentheses in the row labels indicate the total number of tested pairs and the total number of each type of link found in each complete dataset.

                                           Subtypes Considered in BIC Procedure
Study S (No MMS)                           A, P, M         A, P            A               P               M
Number Found (323)                         171             120             156             99              203
Functional Links (36)                      20 (0.2986)     20 (0.0052)*    25 (0.0018)*    16 (0.0258)*    18 (0.9198)
GO Slim Terms (All) (307)                  165 (0.1520)    115 (0.4182)    149 (0.4510)    95 (0.4264)     196 (0.088)
Experimentally Identified (29)             14              14              18              12              14
GO Slim Terms (Biological Process) (283)   158 (0.0052)*   114 (0.0008)*   145 (0.0036)*   95 (0.001)*     186 (0.004)*

Study S (MMS)                              A, P, M         A, P            A               P               M
Number Found (323)                         193             192             247             171             243
Functional Links (36)                      21 (0.473)      29 (0.0004)*    34 (0.0)*       29 (0.0)*       24 (0.8102)
GO Slim Terms (All) (307)                  185 (0.2806)    182 (0.6940)    237 (0.1458)    162 (0.7006)    231 (0.5794)
Experimentally Identified (29)             14              22              24              23              21
GO Slim Terms (Biological Process) (283)   174 (0.0652)    174 (0.0368)*   223 (0.0108)*   153 (0.1822)    213 (0.5560)

Study J                                    A, P, M         A, P            A               P               M
Number Found (636)                         352             273             263             231             329
Functional Links (25)                      19 (0.0240)*    13 (0.2274)     11 (0.4700)     10 (0.4216)     15 (0.2466)
GO Slim Terms (All) (369)                  224 (0.0220)*   172 (0.1180)    160 (0.4112)    146 (0.1172)    213 (0.007)*
Experimentally Identified (3)              3               2               1               2               3
GO Slim Terms (Biological Process) (115)   69 (0.2720)     50 (0.6226)     55 (0.1328)     44 (0.4726)     68 (0.0)*

B Supplemental Materials: Dispersal of Mycobacterium tuberculosis via the Canadian fur trade

From Proc. Natl. Acad. Sci. USA. 2011 Apr 19; 108(16):6526-31.

B.1 Supplemental Background Material

B.1.1 A Brief History of the Canadian Fur Trade, 1640-1870

The Canadian fur trade involved the barter of European goods for animal furs. European goods were imported (and furs exported) from Montreal, in present-day Quebec, and Hudson’s Bay. In this report, we focus on the Montreal route, which would have been the means of dispersal of M. tuberculosis (M. tb.) from French Canadians to indigenous peoples of the Canadian interior (it is likely that British immigrants using the Hudson’s Bay route also transmitted M. tb. to Aboriginal populations). We do not know whether Aboriginal populations involved in fur trading had been exposed previously to M. tb. There is evidence of TB among some indigenous Canadian populations prior to contact with European immigrants, but the extent and severity of the disease are not known [295]. Trade in furs necessitated a network of transportation routes over a vast geographic area that eventually encompassed what are now referred to as Ontario (ON), Manitoba (MB), Saskatchewan (SK), Alberta (AB), British Columbia (BC), and the Northwest Territories (NWT). The earliest means of transportation was the canoe, and movement of goods through lakes, rivers and portages (gaps in the navigable system of rivers and lakes) required canoemen (voyageurs), guides, translators, navigators, administrators and others. Employees of fur trading companies were male; voyageurs, with “few exceptions... were French” [120]. Fur traders stayed in the Canadian interior for variable lengths of time (discussed further in the Supplemental Results (Section B.3)), and contact with indigenous populations commonly involved sexual relationships with Native women, some of which were formalized as marriages, and others that involved cohabitation and procreation outside of marriage [82]. Native women in these relationships played an important role in fur trade commerce (e.g. by processing furs), although this role was not acknowledged formally [82]. Children born of these alliances became the Métis.
In a recent census, almost 400,000 Canadians identified themselves as Métis [41].

B.1.2 Western Canadian Indigenous Populations in the Industrial Era, 1870-present

While historians debate the degree to which traditional fur trade activities affected indigenous Canadian populations, it is clear that Euro-Canadian society had a profound impact on Native Canadians in the late 19th century and beyond [84]. Industrial scale extraction of resources such as buffalo, furs and fish, as well as agricultural development, resulted in destruction of traditional food sources for Native peoples and their displacement from the land [84]. Legal agreements between the Canadian government and Native leaders resulted in confinement of Native populations to specific tracts of land (“reserves”), where subsistence was difficult if not impossible in some cases [84, 218]. Attempts to acculturate Native Canadian populations included creation of residential educational institutions that housed large numbers of children in conditions highly conducive to the spread of tuberculosis [172]. All of these influences promote high rates of tuberculosis by the following mechanisms: malnutrition, which increases penetrance of disease following infection [45, 160]; increased population density, which supports large scale epidemics of infectious disease; and crowded, inadequate housing, which is known to promote transmission of tuberculosis [30, 44, 100, 164]. Epidemics of tuberculosis followed the environmental and social changes described above; they were severe enough that some worried Native Canadian populations would succumb completely to tuberculosis [58, 161]. Native Canadians are still disproportionately affected by TB [67], but the disease is no longer associated with the very high mortality rate observed in the 19th and early 20th centuries.

B.2 Supplemental Methods

B.2.1 Aboriginal Population Descriptions

The M. tb. isolates from ON, MB, SK, and AB analyzed in this study were originally collected from reserves (lands formally designated for the use of specific First Nations), Métis communities (definition from Canada census, based on > 25% of total population identified as Métis), and rural communities not formally associated with specific Aboriginal populations. We did not include data from Aboriginal populations living in urban areas. ON samples are from a single region in Ontario, whereas the population base for each of the other samples was the entire province. M. tb. samples analyzed in this study were collected from Métis populations and a range of First Nations encompassing a diversity of Aboriginal languages, ethnicities and cultures found in the four provinces.

B.2.2 Real-time PCR Determination of Bacterial Lineage

Primer sequences for real-time PCR determination of M. tb. lineage (defined by polymorphisms described in [88, 185]) can be found at http://www.pnas.org. Two sets of primers and probes were designed for each polymorphism. Genomic DNA from M. tb. strains was tested in parallel for the wild type and the mutant sequence at each locus. If the strain was wild type at a particular locus, there would be a product from the former reaction and none from the latter (and vice versa for strains with the mutation). Wild type primer probe sets are designated with “(37)” in the name, after the laboratory reference strain H37Rv. Nomenclature of lineage-defining genomic deletions is adopted from [88, 280]. The mutant primer probe sets for Rd 219 and Rd 193 are not specific (due to technical constraints imposed by the nature of the flanking sequence), so mutants were identified by absence of product with the wild type primer probe set, plus a product with the mutant primer probe set. Strains that were wild type at these loci would generate a product with both primer probe sets.

For DS6Quebec, we designed a primer probe set for one flank of the deletion. We also used a primer probe set, courtesy of Gregory Dolganov, that was within the deletion (Rv 1760). Isolates with DS6Quebec were identified by presence of a product with the flanking primer probe set, and absence of a product with the primer probe set within the deletion (i.e. Rv 1760). Spoligotyping data (presence or absence of 43 oligonucleotide spacers at a direct repeat locus [137]) were available for some populations (QU, ON, SK): DS6Quebec lineage was further confirmed for M. tb. isolates from these populations by absence of spacers 9 and 10 [88]. The results of lineage typing were extrapolated to isolates from the same population with an identical RFLP (± 1 band) and/or minisatellite genotype.

B.2.3 Mutation Rate Estimation

As we have sampled only one M. tuberculosis isolate per infected individual in this study, the appropriate timescale for mutation is per transmission generation, rather than per M. tuberculosis cell doubling. A transmission generation (estimated to be 1.2 years in length [203], also consistent with [264]) is defined as the time from initial infection by M. tb. to the time that the infection spreads to another individual.

To estimate a per transmission generation mutation rate for minisatellite loci, we performed simulations in SIMCOAL 2.0 [73, 151]. Since no per generation mutation rate estimate in M. tuberculosis has been made previously for minisatellite loci, we assumed a mutation rate of 2.3 × 10^-5 per locus per generation, based on estimates for minisatellite loci from parallel serial passaging experiments in Yersinia pestis and Escherichia coli O157:H7 [289, 290]. In the simulations, we also assumed a simple stepwise mutation model (SMM) for the minisatellite loci, where a mutation at a locus only increases or decreases the repeat number by one [177, 195]. Proceeding forward in time in the simulations, we began with one bacterium, as animal experiments suggest that most infections are initiated by a single droplet [124, 224], which, as a result of its small size, cannot contain more than a few bacilli [34]. Based on radiographic studies of pulmonary tuberculosis, we assume that the disease involves an entire lung lobe by the time it is extensive enough to result in transmission [119]. We estimated the weight of adult lung lobes from autopsy studies [132]. Finally, for an estimate of the number of M. tb. bacilli/gram of infected lung tissue, we relied on Canetti’s classic microscopy studies of autopsy specimens (we used the average result across different clinical-pathological classifications of specimens) [43]. Based on these data sources, our final estimate of the M. tb. population size, at the time an infected human host transmits that infection, is 10^7 bacilli.

In the simulation experiments, we simulated exponential doubling until 10^7 bacteria were present. At this point, we sampled 3 bacteria (an estimate of the number of bacilli contained in a single infectious droplet [34]) and counted the number of mutations in the sample compared to the original genotype. Repeating this procedure 1000 times, we obtained an estimate of 0.00125 mutations per locus per transmission generation. This agrees with the concept of a “transmission bottleneck,” where most mutations in the host’s bacterial population are lost during transmission [98].

To validate the above rate, we made use of the known RFLP mutation rate in M. tb. of approximately 0.287 events per year for a bacterial strain with 10 copies of IS6110 [6, 228, 265]; note that we multiplied this mutation rate by 1.2 to convert the per-year RFLP mutation rate to per transmission generation. We first counted the number of minisatellite haplotypes per RFLP haplotype (for isolates from QU, SK, and the MIRU-VNTRplus database [116]) to calibrate a minisatellite mutation rate. We also estimated θ = 2Nµ using homozygosity (H = 1/(1 + θ)) for all RFLP and all minisatellite haplotypes in QU using Arlequin [74]; the ratio of the θ estimates should equal the ratio of the mutation rates for the two types of loci. Mutation rate estimates are shown in Table B.3, and are all of a similar order of magnitude. As mentioned in Chapter 3, we settled on a mutation rate of 0.001 per locus per transmission generation. This lower rate is motivated by the fact that the RFLP mutation rates are pedigree-derived, which may be too fast for historical time scales [102, 112]. Our final estimate is also within the range from a recent report analyzing mutation rates at these loci [223] ([0.00084-0.018] per locus per generation assuming a generation time of 1.2 years).
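The homozygosity-based calibration described above can be sketched in a few lines. This is a minimal illustration, not the Arlequin implementation: the function names are hypothetical, the homozygosity estimator is the plain sum of squared frequencies (Arlequin may apply a sample-size correction), and the θ values passed to `calibrated_rate` would come from real haplotype data rather than from anything shown here.

```python
from collections import Counter

GENERATION_YEARS = 1.2       # transmission generation length quoted in the text
RFLP_RATE_PER_YEAR = 0.287   # RFLP mutation rate quoted in the text

# Per-year RFLP rate converted to a per-transmission-generation rate.
rflp_rate_per_generation = RFLP_RATE_PER_YEAR * GENERATION_YEARS


def homozygosity(haplotypes):
    """Sum of squared haplotype frequencies (assumed uncorrected estimator)."""
    n = len(haplotypes)
    return sum((count / n) ** 2 for count in Counter(haplotypes).values())


def theta_from_homozygosity(h):
    """Invert H = 1 / (1 + theta) to obtain theta = 1 / H - 1."""
    if not 0.0 < h <= 1.0:
        raise ValueError("homozygosity must lie in (0, 1]")
    return 1.0 / h - 1.0


def calibrated_rate(reference_rate, theta_reference, theta_target):
    """Scale a known rate by the ratio of theta estimates.

    With theta = 2 * N * mu and the same N for both marker types, the ratio
    of theta values equals the ratio of the mutation rates.
    """
    return reference_rate * theta_target / theta_reference
```

Because both marker types are scored on the same isolates, N cancels in the ratio, which is why only the two θ estimates and one known rate are needed.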

B.2.4 Statistical Calculations

Given the limited documentation of horizontal gene transfer in M. tb. [107, 111, 254, 263, 305], we used haplotype statistics, rather than single locus statistics, in a majority of our analyses. To correct for sample size differences, we used rarefaction to obtain the number of distinct minisatellite and RFLP haplotypes (and the number of private minisatellite haplotypes) in each population (as described in [136]). Note that the number of private haplotypes is not calculated for RFLP haplotypes, which could not be compared across populations. We also calculated diversity of minisatellite and RFLP haplotypes in each population as $-\sum \pi \log(\pi)$ (where π is the haplotype frequency and the sum is over all haplotypes; [247]). In order to correct for unequal sample sizes, we sampled x individuals without replacement from each population (where x is the number of individuals of the smallest sample), calculated diversity, and repeated this 5000 times.

In addition, we used a permutation procedure to assess whether the observed population diversities (calculated using the same formula) could be explained by randomly partitioning isolates to all populations. We pooled all haplotypes (from QU, ON, RS, NRS, AB), and for each population with sample size n, calculated diversity for n minisatellite haplotypes sampled with replacement from the pool. For each population, an empirical distribution was generated for 1000 samples and compared to the observed diversity. We also examined the effect of excluding QU haplotypes from the pool and repeating the same procedure. In all of the above procedures, we removed any haplotype with at least one missing genotype position.

In order to reduce noise in the following calculations, we focused only on isolates from the DS6Quebec lineage, as they share a more recent common ancestor than do isolates from all lineages and are thus less likely to be affected by homoplasy.
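The diversity calculation and the two resampling schemes described above (subsampling without replacement to equalize sample sizes, and drawing from the pooled haplotypes with replacement to build a null distribution) might be sketched as follows. The function names are hypothetical, the natural logarithm is an assumption (the text does not state the base), and haplotypes are represented simply as hashable labels.

```python
import math
import random
from collections import Counter


def shannon_diversity(haplotypes):
    """Return -sum(pi * log(pi)) over haplotype frequencies pi."""
    n = len(haplotypes)
    return -sum((c / n) * math.log(c / n) for c in Counter(haplotypes).values())


def subsampled_diversities(haplotypes, x, reps=5000, rng=None):
    """Diversity of x isolates drawn without replacement, repeated reps times."""
    rng = rng or random.Random(0)
    return [shannon_diversity(rng.sample(haplotypes, x)) for _ in range(reps)]


def pooled_null_diversities(pool, n, reps=1000, rng=None):
    """Null distribution: diversity of n haplotypes drawn with replacement from the pool."""
    rng = rng or random.Random(0)
    return [shannon_diversity(rng.choices(pool, k=n)) for _ in range(reps)]
```

An observed population diversity can then be compared against the list returned by `pooled_null_diversities`, for example by locating it among the sorted null values.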
We used the method of Ytime to calculate the time to the most recent common ancestor (TMRCA) of the DS6Quebec lineage in each population using minisatellite haplotypes [23]. This calculation, which assumes a simple SMM, involves dividing the average squared difference (ASD) in repeat length between each sample minisatellite haplotype and the putative root haplotype by the mutation rate (0.001 per locus per generation). We eliminated missing data at each locus and allowed loci to have different sample sizes in our calculation; this is in contrast to the Ytime implementation, which does not allow missing data. Changing the root haplotype slightly alters the obtained estimates. However, the chosen root haplotype presented in Chapter 3 (the center of the DS6Quebec haplotype network, see Figure 3.2 of Chapter 3) does seem to be the most appropriate given the star-like structure of the network (see Results).

While assumptions of demography do not affect point estimates of the TMRCA, they do affect the confidence intervals obtained for the estimates. In addition to calculating confidence intervals under the assumption of a star-like genealogy (presented in Chapter 3), we obtain confidence intervals under the assumption of constant population size and exponential population growth (shown in the Supplemental Results (Section B.3)). We also assess sensitivity of the results to the mutation rate assumed by examining estimates using mutation rates of 0.005 and 0.01, both within the range of values in Table B.3.

To calculate the divergence time between two populations “A” and “B,” we used Zhivotovsky’s [90, 91, 314, 315] TD estimator using minisatellite genotypes of the DS6Quebec lineage in each population. The estimate is calculated as $T_D = \bar{t}_D / (2w)$, where the average is taken over each locus and w is the average mutation rate per locus (0.001 per locus per generation) (see Equation 1 of [314]). For each locus, $t_D = D_1 - 2V_0$, where $D_1 = (r_A - r_B)^2 + V_A + V_B$. V0 is the estimated variance in repeat number in the ancestral population at the time of divergence, and in population X, rX is the mean and VX is the variance in repeat number. V0 is typically unknown; results reported indicate the value of V0 used. A value of V0 = 0 assumes no ancestral allele size variance and provides an absolute earliest bound for TD. To obtain intermediate estimates, we examined the haplotypes shared between populations “A” and “B” and calculated the variance of these haplotypes in either population; in the results, these estimates for V0 are presented as Var(A) and Var(B), respectively. Standard errors are obtained using the standard deviations of the locus tD estimates [314].
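As an illustration of the two timing estimators described above, the following sketch computes an ASD-based TMRCA and a TD-style divergence time from haplotypes stored as tuples of repeat counts. It is not the Ytime program. Several choices are assumptions made here for concreteness: missing alleles are coded as `None`, the ASD is averaged within each locus and then across loci, population (rather than sample) variances are used, and the function names are hypothetical.

```python
from statistics import mean, pvariance


def asd_tmrca(haplotypes, root, mu=0.001):
    """ASD between sampled haplotypes and the root haplotype, divided by mu.

    Missing alleles (None) are skipped locus by locus, so loci may have
    different sample sizes.
    """
    locus_asd = []
    for locus, root_allele in enumerate(root):
        observed = [h[locus] for h in haplotypes if h[locus] is not None]
        if observed:
            locus_asd.append(mean([(a - root_allele) ** 2 for a in observed]))
    return mean(locus_asd) / mu


def divergence_time(pop_a, pop_b, w=0.001, v0=0.0):
    """T_D = mean over loci of (D1 - 2*V0), divided by 2w.

    D1 = (r_A - r_B)^2 + V_A + V_B, with r the mean and V the variance of
    repeat number at a locus; complete haplotypes are assumed.
    """
    locus_t = []
    for alleles_a, alleles_b in zip(zip(*pop_a), zip(*pop_b)):
        d1 = ((mean(alleles_a) - mean(alleles_b)) ** 2
              + pvariance(alleles_a) + pvariance(alleles_b))
        locus_t.append(d1 - 2 * v0)
    return mean(locus_t) / (2 * w)
```

Both functions return times in generations; with the 1.2-year transmission generation used in this study, multiplying by 1.2 converts them to years.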

As for the TMRCA estimates, we assess sensitivity of the estimates of TD to mutation rate by performing estimation under the mutation rates of 0.005 and 0.01.

Under the assumption of a simple island model, estimates of Nm, where m is the probability that each gamete is an immigrant each generation and Nm is the absolute number of migrants each generation, were obtained using Slatkin’s private alleles method for minisatellite haplotypes [251, 253]. Estimates were obtained using the equation $\log_{10}(p(1)) = a \log_{10}(Nm) + b$, where p(1) is the proportion of private alleles, and a and b are pre-determined constants for a reference sample size of 50 ([253], Equation 14). To account for unequal sample sizes of the “islands” (QU, SK, and AB), we calculated Nm for a sample of 50 individuals (without replacement) from each island and obtained a distribution of Nm estimates over 1000 samples. The upper and lower 2.5% tails of the Nm estimates were used as an empirical 95% confidence interval. For these calculations, we removed any haplotype missing a genotype in at least one position. We exclude ON from this calculation due to its low sample size; MB is not included since minisatellite genotypes were unavailable.
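The private alleles relationship above can be inverted for Nm directly, and the empirical interval is then read off the resampled estimates. The sketch below shows only that arithmetic. The constants a and b must be taken from [253] for the reference sample size and are left as arguments; the function names and the percentile indexing rule are illustrative assumptions.

```python
import math


def nm_from_private_alleles(p1, a, b):
    """Solve log10(p1) = a * log10(Nm) + b for Nm."""
    if p1 <= 0:
        raise ValueError("p1 must be positive")
    return 10 ** ((math.log10(p1) - b) / a)


def empirical_interval(estimates, lower=0.025, upper=0.975):
    """Percentile interval from resampled estimates (nearest-rank indexing)."""
    ordered = sorted(estimates)
    last = len(ordered) - 1
    return ordered[round(lower * last)], ordered[round(upper * last)]
```

In use, `nm_from_private_alleles` would be applied to each of the 1000 subsamples of 50 individuals per island, and `empirical_interval` to the resulting list of Nm values.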

As a note, in our study we refer to M. tb. Ne in the standard population genetic context. We define effective population size at the “epidemiological level,” as do Reyes and Tanaka [223]. As with molecular epidemiological studies, we analyze a single M. tb. isolate from each individual with TB, and do not attempt to model bacterial genetic diversity within individual hosts. Where each infected host is associated with a single pathogen, Ne of the pathogen should be proportional to the number of infected hosts (i.e. prevalence), and a number of studies have demonstrated congruence between the demographic history inferred from pathogen genetic data and temporal trends in prevalence (e.g. [191] for a bacterial infection and [27] for a viral illness). A recent simulation study [86] found pathogen Ne to be better approximated by disease incidence. In this study (see Supplemental Results (Section B.3) below), we used the harmonic mean of annual incident TB cases as an estimate of M. tb. Ne.
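A small sketch of the Ne calculation follows, under the reading that the harmonic mean of annual case counts is scaled by the lineage frequency and by the number of sampling years (as is done for the RS population in the implementation details of Section B.2.5). The function name and any case counts supplied to it are hypothetical; no real surveillance data are reproduced here.

```python
from statistics import harmonic_mean


def effective_size(annual_cases, lineage_frequency, n_years):
    """Harmonic mean of annual incident cases, scaled by lineage frequency and years sampled."""
    return harmonic_mean(annual_cases) * lineage_frequency * n_years
```

The harmonic mean is dominated by the smallest yearly counts, which is the usual reason it is preferred over the arithmetic mean when summarizing a fluctuating population size.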

B.2.5 Rejection Sampling Details

Background

Under the Bayesian framework for parameter inference, we assume data D is generated by a probabilistic model governed by the parameters θ with prior density f(θ). Of interest is the posterior distribution, f(θ|D) ∝ f(D|θ)f(θ), where f(D|θ) is the data likelihood. For complicated models, this likelihood is difficult to compute [155, 167, 187]. For example, the likelihood for data generated under the coalescent framework, as in this study, can only be obtained after conditioning on a particular genealogy, G, and integrating over the entire space of possible genealogies $\mathcal{G}$; i.e., $f(D|\theta) = \int_{\mathcal{G}} f(D|G, \theta)\, f(G|\theta)\, dG$ [187]. This is an integration over an extremely large space, and is impractical to carry out.

A number of Monte-Carlo methods have been proposed for sampling from the parameter posterior distribution, f(θ|D). Available packages employing Markov-Chain Monte Carlo methods have been used with high accuracy and success [21, 108, 109, 187]; however, these methods do not enable inference of more complicated demographic scenarios or inference with multiple types of genetic loci [298]. In our study, we attempted to use IM [108, 109, 187] for parameter inference, and observed poor performance and convergence. Alternatively, an Approximate Bayesian Computation (ABC) method, rejection sampling, has been shown to produce accurate and interpretable results for inference of coalescent models [71, 200, 210, 219, 298]. In the rejection sampling framework, rather than conditioning on the full data D, several statistics S are used to summarize the full data.

The general approach is to first calculate the observed values of the chosen summary statistics (Sobs) from the data. Then:

1. Draw θ′ from its prior distribution, f(θ).

2. Simulate data D under θ′ using the coalescent and calculate summary statistics S.

3. Accept θ′ if the Euclidean distance ||S − Sobs|| < ε. (Rather than an absolute cutoff ε, in our implementation the threshold was chosen as a particular quantile (Pδ) of all performed simulations (1,000,000 in total).)

4. Repeat steps 1-3 until a desired number of samples of θ are accepted. (In our implementation, we performed 1,000,000 simulations, of which a percentage (Pδ) were accepted.)

5. Use the accepted simulations to form a posterior distribution for θ.
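The five steps above can be sketched as a generic rejection sampler with a quantile-based acceptance threshold. This is a toy illustration only: `toy_simulator` is a stand-in that returns a noisy mean rather than running a coalescent simulation (the study used SIMCOAL 2.0), and all function names, the seed, and the numbers in the usage lines are assumptions chosen for the example.

```python
import math
import random


def abc_rejection(observed, draw_prior, simulate_stats, n_sims, keep_fraction, rng):
    """Rejection sampling with a quantile threshold on the Euclidean distance."""
    scored = []
    for _ in range(n_sims):
        theta = draw_prior(rng)                              # step 1: draw from the prior
        stats = simulate_stats(theta, rng)                   # step 2: simulate and summarize
        scored.append((math.dist(stats, observed), theta))   # distance ||S - Sobs||
    scored.sort(key=lambda pair: pair[0])
    n_keep = max(1, round(keep_fraction * n_sims))           # steps 3-4: keep the closest fraction
    return [theta for _, theta in scored[:n_keep]]           # step 5: accepted draws


def draw_omega(rng):
    """Uniform prior on [0.01, 1]."""
    return rng.uniform(0.01, 1.0)


def toy_simulator(omega, rng, n=20):
    """Placeholder simulator (NOT a coalescent): mean of noisy draws centred on omega."""
    values = [rng.gauss(omega, 0.1) for _ in range(n)]
    return (sum(values) / n,)


rng = random.Random(42)
posterior = abc_rejection((0.5,), draw_omega, toy_simulator, 2000, 0.05, rng)
posterior_mean = sum(posterior) / len(posterior)
```

Sorting all simulations by distance and keeping a fixed fraction is equivalent to choosing ε as the corresponding quantile of the simulated distances, which is the variant described in step 3.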

The outlined approach (hereafter referred to as the “Rejection” approach) assumes that the likelihood of the statistics f(S|θ) is constant in the range where S ≈ Sobs (for more details, see [155]). Recently, a number of modifications to the above procedure have been made that do not make this assumption [17, 70, 71, 155]. One method [155] proposes a general linear model for the summary statistics given the parameters (θ), i.e., S|θ = Cθ + c0 + ε with ε ∼ N(0, σs). The coefficients C and c0 can be estimated by ordinary least squares, and are used when calculating the posterior distribution of θ in step 5 [155]. The authors claim that this method (hereafter referred to as the “GLM” method) takes into account the local linearity and strong collinearity of the summary statistics.

The rejection sampling framework described also allows for selection between demographic models [70, 71, 210, 219]. A model’s posterior probability can be estimated by the proportion of simulations under the model that are accepted when also considering the distances ||S − Sobs|| from other models; see details below.
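For a single parameter θ, the ordinary least squares estimates of C and c0 in the linear model above reduce to a slope and an intercept per summary statistic. The sketch below shows only that fitting step on retained simulations; it does not reproduce the posterior calculation of the GLM method in [155] or the ABCestimator program, and the function name is hypothetical.

```python
def ols_fit(thetas, stats):
    """Least-squares slope C_j and intercept c0_j for each summary statistic j.

    thetas: list of parameter values from retained simulations.
    stats:  list of tuples, one tuple of summary statistics per simulation.
    """
    n = len(thetas)
    mean_theta = sum(thetas) / n
    ss_theta = sum((t - mean_theta) ** 2 for t in thetas)
    slopes, intercepts = [], []
    for column in zip(*stats):
        mean_s = sum(column) / n
        cov = sum((t - mean_theta) * (s - mean_s) for t, s in zip(thetas, column))
        slope = cov / ss_theta
        slopes.append(slope)
        intercepts.append(mean_s - slope * mean_theta)
    return slopes, intercepts
```

The residuals from these fits would then supply an estimate of the error covariance used by the GLM approach.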

Implementation details

We present in Chapter 3 the rejection sampling procedure for demographic modeling of an expansion in the Remote Saskatchewan (RS) population. Based on the diversity results (see Chapter 3 and Supplemental Results (Section B.3)), it appears likely that the RS population was historically smaller than its current size, and thus likely underwent a recent expansion. Based on timing of historical events (i.e. expanded transportation networks) that would have opened these areas for development and thus rendered host populations susceptible to epidemic TB (see Supplemental Historical Background for details (Section B.1)), we estimate the expansion time in RS to be around 1930. This is equal to 54 transmission generations (1.2 years in length, as described earlier) prior to the midpoint of the interval in which study samples were collected (1995). We estimated the expansion parameter ω, which is equal to the ratio of the historical effective population size (prior to the expansion) to the current effective population size. The expansion described by the parameter ω in this model is within a reasonable time frame for estimation, as found from previous work using the rejection sampling procedure (see below, in the section Exploration of Alternative Demographic Models (Section B.2.5)). In order to reduce noise in parameter estimation, we focus only on isolates from the DS6Quebec lineage.

For the proposed demographic expansion model, we fix the time of expansion at 54 generations, and draw ω from a prior Uniform distribution on [0.01, 1] (step 1 in the Background section (B.2.5) above). Assuming a constant size in RS after the expansion, the current effective population size was estimated to be 544. As described in Chapter 3, this estimate was based on the harmonic mean of RS TB case counts from 1986-2004, which was then multiplied by the frequency of the DS6Quebec lineage and 18, the number of years over which exhaustive samples were taken (see the “Statistical Calculations” section (B.2.4) above for an interpretation of Ne). As in step 2 above and using these demographic parameters, we performed 1,000,000 coalescent simulations, with a sample size of 195 individuals, of minisatellite haplotypes under a SMM (with a mutation rate of 0.001 per locus per transmission generation and using SIMCOAL 2.0 [73, 151]).

Also as part of step 2, we calculated summary statistics from the observed and simulated minisatellite data using Python scripts. Haplotype statistics (such as the number of distinct haplotypes, the number of singleton haplotypes, and haplotype diversity), rather than only single locus statistics, were again employed given the limited documentation of horizontal gene transfer in M. tb. We also calculated the mean variance in repeat size (over loci and isolates) as a summary statistic, which has desirable theoretical properties [90, 315]. These statistics have also been used in previous studies [17, 70, 71, 219]. When calculating summary statistics for individuals with one missing allele in RS (38 individuals), we imputed the missing allele as the most frequent allele at that position. Analogously, in the coalescent simulations, we randomly masked and imputed one position in 38 individuals. This imputation has little quantitative effect on the results.

For steps 3 and 4 above, to increase speed, the program ABCestimator was used to calculate Euclidean distances between the observed and simulated summary statistics (||S − Sobs||) and to output a list of simulated parameter sets accepted under a particular distance threshold Pδ [298]. Results were not sensitive to the Pδ threshold; we used a cutoff of 5,000 retained simulations out of 1,000,000 simulations (Pδ = 0.005). For step 5, the ABCestimator program was used to obtain posterior distributions under the “GLM” post-rejection approach; posterior distributions under the regular “Rejection” approach were obtained using the density() function in R (see Background section (B.2.5)).

We also performed model selection under the rejection sampling framework by estimating the posterior probability of a population expansion model versus a null constant size model. We first performed an additional 1,000,000 simulations under a null constant size model. Then, we ranked the 2,000,000 simulations of both models (1,000,000 from each model) according to their Euclidean distances ||S − Sobs|| from the observed summary statistics. (Note that for the constant size model, no parameter is estimated, and only the Euclidean distances ||S − Sobs|| are computed.) We calculated the posterior probability of each model as the proportion of simulations of each model in the smallest 500 (of the 2,000,000) distance values; results were not sensitive to these cutoffs. The method naturally penalizes for models with more parameters, as the parameter space is less explored for these models (see [298]). We performed this analysis using scripts in R.
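The model selection step described above amounts to pooling the distances from both models, keeping the smallest ones, and counting how many came from each model. A minimal sketch follows (the original analysis used R scripts; the function name and the dictionary layout are assumptions made here).

```python
def model_posterior_probabilities(distances_by_model, n_closest=500):
    """Proportion of each model among the n_closest smallest distances.

    distances_by_model maps a model name to the list of Euclidean distances
    ||S - Sobs|| of its simulations; equal numbers of simulations per model
    are assumed, as in the text.
    """
    pooled = [(d, name) for name, dists in distances_by_model.items() for d in dists]
    pooled.sort(key=lambda pair: pair[0])
    closest = pooled[:n_closest]
    return {
        name: sum(1 for _, model in closest if model == name) / len(closest)
        for name in distances_by_model
    }
```

With the cutoff of 0.5 used later in the validation procedures, the model whose proportion exceeds one half would be the one selected.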

Assessment of Sensitivity to Mutation Rate

We also assessed the sensitivity of the ABC procedure to the assumed mutation rate. In addition to performing parameter estimation assuming a mutation rate of 0.001 per locus per generation, we followed the identical estimation procedure using mutation rates of 0.0005, 0.005, and 0.01. These mutation rate estimates are in the range of values presented in Table B.3 and are also in the range of values estimated by Reyes and Tanaka (2010) ([0.00084-0.018] per locus per generation assuming a generation time of 1.2 years) [223]. In addition to using alternative point estimates of the mutation rate, we allowed the mutation rate to have a Uniform prior on [0.005, 0.02] (again motivated by the range of estimates described above). We follow the same procedure in estimating ω by examining ω values in the top Pδ of simulations with the lowest distance values; analogously, we examined the mutation rate values in the top Pδ of simulations to co-estimate the mutation rate parameter.

Method Validation Procedures

We validated the proposed method for inference of the expansion parameter (ω) in the one-population model (where again, ω is the ratio of the historical population size before the expansion to the current population size). First, for 100,000 coalescent simulations with ω ∼ U[0.01, 1] (as described above), we examined the correlations between summary statistics, and with ω. Then, to assess the accuracy of parameter estimation, we simulated datasets with known values of ω (300 simulated datasets each for ω = 0.1, 0.5 and 0.8) under an identical expansion model, again using SIMCOAL 2.0. To estimate the ω parameter for these simulated datasets, we employed the rejection sampling procedure with the same 1,000,000 simulations (under the expansion model) used for inference of the observed RS data. Using a tolerance of Pδ = 0.005 as described above, means and modes of the posterior distribution for ω were obtained for each dataset using both the Rejection and GLM methods. The mean squared error was calculated for each dataset as $\mathrm{MSE} = \sum (\omega - \hat{\omega})^2$, where ω is the true assumed value (ω = 0.1, 0.5 or 0.8) and ω̂ is the estimated value (taken as the posterior mean). Standard errors of the MSE were calculated by bootstrapping over all datasets.

To assess the performance of model selection, we followed the procedure used for the RS data and used the 1,000,000 simulations under the expansion and constant size models (2,000,000 simulations in total). The ability to select the expansion model over the constant size model was assessed for each of the 300 simulations under each assumed ω (0.1, 0.5, and 0.8, as above). We also performed model selection for 300 datasets simulated under a constant size model (ω = 1). For each of the 300 datasets (for each ω value), we calculated the posterior probability of both models as described (by examining the proportion of each model in the smallest 500 (out of 2,000,000) Euclidean distance values). A posterior probability above the liberal cutoff value of 0.5 was used to select one model over another.
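The accuracy assessment above might be sketched as follows. Two points are assumptions rather than details confirmed by the text: the squared errors are averaged over datasets (the formula in the text shows only the sum), and the bootstrap resamples whole datasets with replacement and takes the standard deviation of the recomputed MSE values. Function names are hypothetical.

```python
import random
from statistics import mean, stdev


def mean_squared_error(true_omega, estimates):
    """Average of (omega - omega_hat)^2 over datasets (averaging is an assumption)."""
    return mean([(true_omega - e) ** 2 for e in estimates])


def bootstrap_se(true_omega, estimates, reps=1000, rng=None):
    """Standard error of the MSE from resampling datasets with replacement."""
    rng = rng or random.Random(0)
    resampled = [
        mean_squared_error(true_omega, rng.choices(estimates, k=len(estimates)))
        for _ in range(reps)
    ]
    return stdev(resampled)
```

Here `estimates` would hold the 300 posterior means obtained for one assumed value of ω.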

Exploration of Alternative Demographic Models

Originally, our goal was to estimate the parameters of a bottleneck in the Non-Remote Saskatchewan (NRS) population when founded from Quebec. We intended to compare several demographic models: the first

model was a founder model with a bottleneck of size NBN in the NRS population when founded from QU; the second model allowed for bi-directional migration at rate m during the bottleneck. We fixed the

time of the bottleneck (T1 = 164) and the time of population expansion (T2 = 100) based on historical timings

described in Chapter3 (see Table 3.2). The parameters to be estimated were NBN (ratio of the bottleneck size to the current size) and m (the rate of migration between the two populations, in the migration model only).

Uniform priors were assumed for NBN ([0.01, 1]) and m ([0.001, 1]). As described for the RS inference,

current effective population sizes (Ne) in each population were estimated from historical TB incidence data: because we assumed constant population sizes (before and after the expansion in NRS and at all times in QU), we took the harmonic mean of case counts and multiplied this by the DS6Quebec lineage frequency and the number of years over which samples were collected to obtain the Ne of each population (as for the RS

Ne estimation).

We performed simulations of 12 minisatellite loci (with the fixed parameters T1, T2, NeQU and NeSK, and variable parameters NBN ∼ U[0.01, 1] and m ∼ U[0.001, 1]) for use in the rejection sampling procedure. The summary statistics S of the minisatellite data used in the procedure provided information not only about the relationship between the two populations but also about each individual population, and have been used in previous studies [17, 70, 71, 219]. As in the RS inference, we examined haplotype-based statistics such as haplotype diversity, as well as the proportion of private and shared minisatellite haplotypes out of the total number of distinct haplotypes. We also calculated (δµ)², the average squared distance in allele size between populations, and the variance in allele size within populations [90, 315]. As in the RS inference, for individuals with one missing allele in NRS, we imputed it as the most frequent allele at that position for these calculations.

After implementing the rejection sampling approach and method validation procedures as described previously (modified as above), we were unable to estimate even NBN alone with reasonable accuracy. We were also unable to estimate m accurately or to perform model selection between the bottleneck and migration models. We explored possible reasons for the poor performance of the method (for example, by allowing for only uni-directional migration and by decreasing the ranges of the prior distributions). We discovered differences in performance when the fixed timings of the model (the founding time, T1, and the expansion time, T2) were changed. Performance was consistently poorer for demographic models in which either event occurred further in the past; this is likely due to the high mutation rate of the minisatellite loci. Accuracy improved considerably when the demographic events occurred within 25-50 generations of the present. Given the fixed historical times of the bottleneck and expansion in NRS, we concluded that we were unable to estimate the parameters of the proposed model with our method and data.
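As an illustration of the two allele-size statistics named above, the sketch below computes a between-population squared distance and a within-population variance from haplotypes stored as tuples of repeat counts. It assumes the common definition of the between-population statistic (the squared difference in mean repeat number, averaged over loci); the function names and toy haplotypes are hypothetical.

```python
from statistics import mean, variance

def delta_mu_squared(pop_a, pop_b):
    """Average over loci of the squared difference in mean repeat number."""
    n_loci = len(pop_a[0])
    squared_diffs = []
    for locus in range(n_loci):
        mu_a = mean(h[locus] for h in pop_a)
        mu_b = mean(h[locus] for h in pop_b)
        squared_diffs.append((mu_a - mu_b) ** 2)
    return mean(squared_diffs)

def mean_allele_size_variance(pop):
    """Sample variance in repeat number at each locus, averaged over loci."""
    n_loci = len(pop[0])
    return mean(variance([h[locus] for h in pop]) for locus in range(n_loci))

# Toy two-locus haplotypes (hypothetical repeat counts).
pop_a = [(2, 3), (4, 5)]
pop_b = [(4, 3), (6, 5)]
print(delta_mu_squared(pop_a, pop_b), mean_allele_size_variance(pop_a))
```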

B.3 Supplemental Results

B.3.1 Genetic Timing Estimates

We calculated estimates of TMRCA for additional populations (other than QU and SK, presented in Chapter 3) using the method of Ytime. Point estimates and 95% confidence intervals, in generations from the present, are as follows: Non-Remote Saskatchewan (NRS): 150.4 (24.4, 376.7); Remote Saskatchewan (RS): 192.7 (38.4, 385.1); and Alberta (AB): 149.3 (45.3, 417.2). These results also follow historical expectations under the Canadian fur trade (as described in Chapter 3), with calendar year estimates for the preceding TMRCA point estimates approximately equal to 1814, 1791, and 1779, respectively. Since the DS6Quebec network is not perfectly star-like (not all haplotypes are derived immediately from the root haplotype), the accuracy of TMRCA estimates may be affected by this violation of Ytime model assumptions.

We also explored the sensitivity of TMRCA estimates to the assumed root haplotype by obtaining estimates with alternative root haplotypes. Alternative root haplotypes generally increased the TMRCA estimates, supporting the choice of the haplotype at the center of the DS6Quebec network (haplotype “233325153324,” at high frequencies in all populations, results presented in Chapter 3) as the most appropriate root. For example, using instead the root haplotype “233325153325” (at low frequency in QU but the next most frequent haplotype in both SK and AB) increases the TMRCA in QU from 212.5 to 298.2 and in SK from 164.8 to 209.9 generations from the present.

For the populations for which TMRCA is presented in Chapter 3 (QU, AB, SK), we obtained point estimates under alternative mutation rate estimates. As expected, the estimated times scale inversely with the mutation rate, so that higher rates give more recent dates: for mutation rates of 0.005 and 0.01, respectively, the TMRCA dates for QU are 1944 and 1969, for SK are 1955 and 1975, and for AB are 1952 and 1973. Although TMRCA estimates are certainly sensitive to mutation rate, our conclusions regarding spread of the DS6Quebec lineage via the fur trade do not rely on these estimates. We cite other strong evidence (lineage frequencies, haplotype networks and genetic differentiation, and genetic diversity) in support of the fur trade hypothesis, none of which depends on these timing estimates. In fact, given their large confidence intervals, these timing estimates do not contradict our conclusions.

We calculated additional confidence intervals for the TMRCA estimates under different demographic assumptions (using the original mutation rate of 0.001). First, we obtained estimates under an exponential growth model with r = 0.0473. (Given the point estimate obtained from the ABC procedure of the ratio of the current to ancestral population size in RS (1/ω = 12.886), we estimate r = 0.0473 given exponential growth over 54 generations.) Though this rate may not hold for all populations, it is well-motivated in that it assumes a similar rate of growth over a similar time period in all populations. Confidence intervals under this demography were, in calendar years: QU (1435-1932); SK (1552-1960); AB (1516-1950). These results are very similar to those obtained under the assumption of a constant population size: QU (1414-1934); SK (1537-1952); AB (1495-1941). We then calculated TMRCA confidence intervals under the assumption of a “star-like” genealogy. For all populations, the confidence intervals of the timing estimates are much narrower, likely because actually specifying a haplotype genealogy reduces uncertainty in the TMRCA estimation. Confidence intervals for each population under a star-like genealogy were, in calendar years: QU (1709-1771); SK (1777-1815); AB (1751-1804). Thus, under the assumption of a star-like genealogy and a mutation rate of 0.001, these TMRCA estimates are quite consistent with introduction of the DS6Quebec lineage to indigenous populations via the fur trade.
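The growth rate quoted above follows from solving N_current = N_ancestral · exp(r · t) for r. A short check of that arithmetic, using the size ratio (12.886) and duration (54 generations) given in the text; the function name is ours:

```python
import math

def exponential_growth_rate(size_ratio, generations):
    """Rate r such that N_current = N_ancestral * exp(r * generations)."""
    return math.log(size_ratio) / generations

r = exponential_growth_rate(12.886, 54)
print(round(r, 4))
```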

We also estimated the divergence time (TD) between NRS and QU, using the method of Zhivotovsky [314] as described in Chapter 3 and Supplemental Methods (Section B.2). Estimates using several values of V0 were obtained: 1) V0 = 0, 2) variance in NRS of haplotypes shared between NRS and QU, and 3) variance in QU of haplotypes shared between NRS and QU. For these V0 values, the respective divergence time estimates between QU and NRS in generations from the present (with corresponding calendar years) are 167.5 (1794), 83.2 (1895), and 89.6 (1887). Standard deviations of these estimates are 71.3, 30.5, and 31.5, respectively. Although confidence intervals and ranges of possible TD values are large, historical timing estimates of the fur trade are within these ranges, as described in Chapter 3.

For the population pairs for which TD is presented in Chapter 3 (QU/SK and QU/AB), we obtained point estimates under alternative mutation rate estimates. Using a higher mutation rate of 0.005, the three point estimates (in the same order as in Chapter 3 Table 3.2 and above) are more recent: QU/SK: 1954, 1973, 1978; QU/AB: 1952, 1976, 1973. Using an even higher mutation rate of 0.01, the three point estimates are: QU/SK: 1974, 1984, 1986; QU/AB: 1973, 1986, 1984. As mentioned for the TMRCA estimates, although the mutation rate does greatly affect the TD estimates obtained, our conclusions do not rely on these timing estimates, nor do these more recent timing estimates contradict our conclusions.
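The conversions used in this section can be reproduced with two small helpers. The sketch assumes the generation time (1.2 years) and reference year (1995, the midpoint of the study period) quoted in Section B.3.6, and assumes that a time estimate is inversely proportional to the mutation rate. With these assumptions it recovers the calendar years given above for the QU/NRS divergence times and the QU and SK TMRCA dates under alternative mutation rates. Function and constant names are ours.

```python
GENERATION_TIME_YEARS = 1.2  # transmission generation time (Section B.3.6)
REFERENCE_YEAR = 1995        # midpoint of the study period (Section B.3.6)

def to_calendar_year(generations_ago):
    """Convert a time in generations before the present to a calendar year."""
    return REFERENCE_YEAR - generations_ago * GENERATION_TIME_YEARS

def rescale_for_mutation_rate(generations_ago, base_rate, new_rate):
    """Assumes the time estimate is inversely proportional to mutation rate."""
    return generations_ago * base_rate / new_rate

# QU/NRS divergence times for the three V0 choices (generations).
td_years = [to_calendar_year(t) for t in (167.5, 83.2, 89.6)]

# QU TMRCA (212.5 generations at a rate of 0.001) under the higher rates.
qu_years = [
    to_calendar_year(rescale_for_mutation_rate(212.5, 0.001, rate))
    for rate in (0.005, 0.01)
]
print([round(y) for y in td_years], qu_years)
```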

B.3.2 Patterns of Genetic Differentiation

We observed lineage-specific patterns of M. tb. genetic differentiation, with as much as 52% of total variation (H37Rv-like lineage) partitioned among populations. Differentiation of lineages other than DS6Quebec may be due to the fact that contact occurred between small segments of sending and receiving populations, and/or the temporary nature of contact (i.e. other lineages were introduced later than DS6Quebec). Along with recurrent population bottlenecks, such patchy and stuttering M. tb. migration dynamics may help explain low levels of DNA sequence diversity observed in M. tb. populations.

B.3.3 Estimates of Population Diversity

Diversity calculations from different combinations of populations and other genetic loci gave results consistent with the RFLP results described in Chapter 3. Using rarefaction to examine the number of distinct RFLP types in each population (as in Figure 3.3A of Chapter 3) on splitting SK into RS and NRS, we find that QU has the highest, and RS and MB the lowest, numbers of distinct haplotypes; NRS, AB, and ON all have intermediate numbers of distinct haplotypes. Calculating diversity of RFLP types without splitting the SK population (with re-sampling to the size of the smallest sample to correct for sample size differences, as in Figure 3.3B in Chapter 3), we find results consistent with Figure 3.3A of Chapter 3: SK and MB have the lowest, QU has the highest, and AB has intermediate diversity. Note that for the calculation of diversities using this re-sampling procedure, we did not include ON due to its small sample size (n = 45).

Rarefaction using minisatellite haplotypes for both the number of distinct and the number of private haplotypes produces a pattern entirely consistent with RFLP results when splitting SK into NRS and RS (Figure B.2A and B.2B) and when not splitting the SK population; the ordering of the populations is similar for both the number of private and the number of distinct haplotypes. We also find that minisatellite haplotype diversity data follow the patterns described above and presented in Chapter 3 (not splitting SK, Figure B.2C; splitting SK, Figure B.2D).

Results from the permutation procedure assessing whether the observed minisatellite haplotype diversity in each population can be explained by a random partitioning of haplotypes into each population are shown in Figure B.3. In Figure B.3A, we include QU isolates in the pool of common haplotypes (along with NRS, RS, AB, and ON isolates), and find that when isolates are randomly distributed among populations, the observed diversity in QU is higher than expected; in RS, it is lower than expected. We then exclude QU isolates from the pool of haplotypes (Figure B.3B), and find that when isolates are randomly distributed, only the diversity in RS is lower than expected. (Again, minisatellite haplotypes are not available from MB strains.)
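For readers unfamiliar with rarefaction, the sketch below uses the classical analytic formula for the expected number of distinct types in a random subsample of n items drawn without replacement. This is one standard way to compute a rarefaction curve; the Methods in Chapter 3 may instead use repeated re-sampling, so treat this as an illustration rather than the procedure used here. The haplotype counts are hypothetical.

```python
from math import comb

def rarefied_distinct(counts, n):
    """Expected number of distinct haplotypes in a subsample of n isolates.

    counts: number of isolates carrying each haplotype in the full sample.
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the total sample size")
    denom = comb(total, n)
    # A haplotype with c copies is absent with probability C(total - c, n) / C(total, n).
    return sum(1 - comb(total - c, n) / denom for c in counts)

# Hypothetical sample: four haplotypes carried by 5, 3, 1 and 1 isolates.
counts = [5, 3, 1, 1]
curve = [rarefied_distinct(counts, n) for n in range(0, sum(counts) + 1)]
print([round(x, 2) for x in curve])
```

Comparing populations at a common n on such curves removes the advantage that larger samples would otherwise have in the raw number of distinct haplotypes.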

Based on these diversity and permutation calculations using multiple genetic loci, we confirm the conclusion stated in Chapter 3 that the observed genetic diversities are consistent with the pattern of social networks created as a byproduct of the French Canadian fur trade. Consistently, QU has the highest diversity (and highest number of private haplotypes), as expected from its role as a source population for the isolates in all other populations. Diversity and the number of private haplotypes are generally intermediate in the founded populations of AB, NRS, and ON. Finally, diversity and the number of private haplotypes are lowest in historically remote regions (parts of MB and RS, see Figure 3.1A of Chapter 3), consistent with M. tb. from these regions undergoing a more recent expansion in population size.

B.3.4 Rejection Sampling Validation Results

First, we examined the correlations between the summary statistics S (number of distinct haplotypes, number of singleton haplotypes, haplotype diversity, and mean variance in repeat sizes) used in the rejection sampling procedure for the expansion model in RS. Diversity and the number of distinct haplotypes are highly correlated (correlation of 0.8888), as are the number of singletons and the number of distinct haplotypes (0.8765). Variance in allele size is only moderately correlated with the number of distinct haplotypes and the number of singletons (correlations of 0.6032 and 0.5818, respectively); haplotype diversity and the number of singletons have a modest correlation of 0.6625. Below, we discuss performance of the rejection sampling procedure in estimating ω in order to assess whether these correlations warrant the removal of any of the summary statistics.

Motivated by the above correlations, we examined the accuracy of the ω estimates obtained when using only two summary statistics (the variance in allele size and haplotype diversity), rather than all four correlated statistics (including the number of singletons and the number of distinct haplotypes). For the Rejection procedure, we find that removing the two summary statistics (the number of singletons and distinct haplotypes) has little effect on the ω estimates, as assessed by mean squared error, or MSE (Figure B.5A; conclusions are similar for the GLM results). Because of this, we decided to use all four summary statistics in our subsequent analyses. As expected, accuracy of the procedure is consistently higher for more pronounced expansions (low ω), and accuracy decreases as the historical population size before the expansion becomes more similar to the current size (ω close to 1).

In Figure B.5C, we show estimates of ω from the 300 test simulations for each of the assumed ω values. For pronounced expansions (ω = 0.1), both posterior means and posterior modes are accurately centered on low ω values for the Rejection approach. For intermediate ω values, posterior modes are almost uniformly distributed across the prior interval, indicating more noise in the estimated posterior densities. True ω values near 1 result in generally higher posterior mean and mode estimates, although the spread of ω estimates is quite large. For all ω values, we find that the post-rejection GLM procedure proposed by Wegmann et al. [298] does not significantly improve performance. For the remainder of our results, we use only estimates obtained from the Rejection, and not the GLM, procedure.

Next, we examine the ability of the method to distinguish between the expansion model and a null constant population size model based on the posterior probabilities of each model (selecting a particular model if its posterior probability is above 0.5). Interestingly, the combinations of summary statistics used slightly alter the model selection results (Figure B.5B); another motivation for choosing to use all four summary statistics, rather than only diversity and variance in allele size, is that it generally improves model selection performance for the expansion models. Model selection is 100% accurate for pronounced expansions (ω = 0.1), and as expected, becomes less accurate as the expansions become milder (with ω = 0.8, the expansion model is selected over the constant size model only 60% of the time). Model posterior probabilities are close to 0.5 for mild expansions (histograms of the actual posterior probabilities from each simulation are shown in Figure B.5D), indicating that as the population expansions become less pronounced, the two models become harder to distinguish. The false positive rate of our method (selecting the expansion model when data are simulated under the null constant size model) is ≈ 40% when using all four statistics (Figure B.5B). Despite this high rate, we are able to confidently assign p-values to the posterior probabilities of the expansion model. This is done by using a null distribution of posterior probabilities estimated from data simulated under the null hypothesis of constant population size (Figure B.5D shows the converse, bottom right panel); see results below.
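The general shape of a rejection-sampling ABC procedure with model choice can be sketched as below. This is a toy illustration only: the simulator is a stand-in that returns two noisy statistics depending on ω (it is not a coalescent simulator), the statistics are not standardized, and the names, prior, and tolerance are hypothetical. It shows the two quantities discussed above: a posterior estimate of ω from the retained expansion simulations, and the posterior probability of the expansion model as the share of retained simulations that came from that model.

```python
import random

def toy_simulate(omega, rng):
    """Stand-in simulator: two noisy summary statistics that grow with omega."""
    return (omega + rng.gauss(0, 0.1), 2 * omega + rng.gauss(0, 0.2))

def abc_rejection(observed, n_sims=5000, keep_frac=0.02, seed=1):
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sims):
        # Equal prior weight on the two models; omega = 1 is the constant size model.
        if rng.random() < 0.5:
            model, omega = "expansion", rng.uniform(0.0, 1.0)
        else:
            model, omega = "constant", 1.0
        stats = toy_simulate(omega, rng)
        dist = sum((s - o) ** 2 for s, o in zip(stats, observed)) ** 0.5
        sims.append((dist, model, omega))
    # Retain the simulations whose statistics are closest to the observed ones.
    sims.sort(key=lambda item: item[0])
    kept = sims[: max(1, int(keep_frac * n_sims))]
    omegas = [w for _, m, w in kept if m == "expansion"]
    post_prob_expansion = len(omegas) / len(kept)
    post_mean_omega = sum(omegas) / len(omegas) if omegas else None
    return post_prob_expansion, post_mean_omega

prob, omega_hat = abc_rejection(observed=(0.1, 0.2))
print(prob, omega_hat)
```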

B.3.5 Results of Rejection Sampling Procedures

Summary statistics and observed values calculated for the imputed RS genotypes are as follows: number of distinct haplotypes (14); number of singletons (6); minisatellite haplotype diversity (1.6925); and average variance in allele size over loci (0.1077). These four summary statistics used with the rejection procedure outlined previously gave the posterior distribution for ω shown in Figure B.4A. We also find that the observed summary statistics for RS are similar to those for the retained simulations, with slightly more variation for variance in allele size and the number of distinct haplotypes (Figure B.4B). With the methods described above, we calculated a posterior probability of 1 for the expansion model versus the null constant size model. As a null distribution for this posterior probability, we used the posterior probabilities of the expansion model estimated for each of the 300 constant size model simulations (described above). From this null distribution, we determined that the p-value of our observed posterior probability (1) is 0; thus, the expansion model with the additional parameter ω is indeed significant in the RS population.

As stated in the Methods (Chapter 3), we assessed the sensitivity of the procedure with the observed RS data to the minisatellite mutation rate assumed. First, in the coalescent simulations used for inference, we used different point estimates of the mutation rate (0.0005, 0.005, and 0.01; i.e., both above and below the current assumed value of 0.001). From these simulations, we inferred the expansion parameter ω for the observed data, as well as the posterior probability of the expansion model and its corresponding p-value. For mutation rates of 0.005 and 0.01, nearly all observed summary statistics are outside of the range of simulated summary statistics for both the expansion and constant size models (for example, the observed value of 14 distinct haplotypes is far lower than the number of distinct haplotypes in any simulation with these mutation rates, in contrast to Figure B.4B). This indicates that these mutation rates are likely too high to produce the patterns of diversity that we observe. For all alternative mutation rates, the 95% credibility intervals of ω for the observed data are generally wider than those obtained using a mutation rate of 0.001. Point estimates and 95% credibility intervals for ω are as follows for the indicated mutation rates: 0.0005: ω = 0.384 [0.08, 0.89]; 0.005: ω = 0.144 [0, 0.76]; 0.01: ω = 0.321 [0.02, 0.94]. The wider credibility intervals for ω under these mutation rates are expected. In our analyses examining alternative demographic scenarios (see previous section), we inferred that the mutation rate of 0.001 allowed us to accurately infer events between 25-50 generations in the past. The mutation rate of 0.0005 is thus likely too slow to allow accurate inference for events occurring relatively recently. Aside from producing summary statistics largely inconsistent with the observed data, the high mutation rates likely result in too much homoplasy to accurately infer events at this time scale, resulting in wider credibility intervals.

For all mutation rates, we examine the posterior probability of the expansion model over the constant size model for our observed data. In all cases, the expansion model is significant. Posterior probabilities and p-values (in parentheses) for each mutation rate are as follows: 0.0005: 0.852 (0.02); 0.005: 1 (0); 0.01: 0.976 (0). Thus, for the observed RS data, we can reject the constant size model regardless of mutation rate.

We also assessed the effects of drawing the mutation rate for the ABC procedure from a uniform distribution on [0.0005, 0.02] (mirroring the range of Reyes and Tanaka [223]). In performing estimation for ω in a similar manner, the estimate (ω = 0.234) had a wide credibility interval ([0.016, 0.824]), likely reflecting the increased uncertainty in the mutation rate parameter. When estimated in conjunction with ω, the mutation rate parameter had a posterior credibility interval of [0.0005, 0.0012]. Notably, this credibility interval for the mutation rate includes our assumed rate of 0.001. This result should be qualified: given that our method examines only four summary statistics, a confident estimation of two parameters from these statistics may not be possible. However, we do conclude that our assumed mutation rate of 0.001 is in fact reasonable based on this joint estimation using the ABC procedure. For the observed data, the posterior probability of the expansion model over the constant size model (0.96) was again significant (p-value = 0.0033).

In all cases, regardless of mutation rate, the constant size model is rejected in favor of the expansion model in RS. Due to the timing of the expansion event of interest (54 generations in the past), estimates of the ω parameter have wider credibility intervals at alternative mutation rates for the reasons described above. Nevertheless, our conclusions regarding the expansion in RS are robust to alternative assumptions of minisatellite mutation rates.
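The p-values reported above are empirical: the observed posterior probability is compared with the posterior probabilities obtained from the 300 simulations under the constant size model. A minimal sketch of that comparison follows, assuming the p-value is the fraction of null values at least as large as the observed one (the function name and the null values are hypothetical). With 300 null simulations the smallest non-zero p-value is 1/300 ≈ 0.0033, which is consistent with the value reported for the joint estimation.

```python
def empirical_p_value(observed, null_values):
    """Fraction of null values that are at least as large as the observed value."""
    return sum(v >= observed for v in null_values) / len(null_values)

# Hypothetical null distribution: 300 posterior probabilities, one of them extreme.
null_probs = [0.3] * 299 + [0.97]
print(round(empirical_p_value(0.96, null_probs), 4))
print(empirical_p_value(1.0, null_probs))
```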

B.3.6 Historical Estimates of Migration between Quebec and the Northwest, 1710-1870

In order to estimate the number of French Canadian migrants to Western provinces, we have relied on fur trade documents (employee records), as there are no reliable censuses for this region during this time period. Fur trade documents refer to the “Northwest” (NW), which encompasses the region described in the Supplemental background material (Section B.1) (present day Ontario west of “Grand Portage” at the Western edge of Lake Superior, Manitoba, Saskatchewan, Alberta, British Columbia, the Northwest Territories).

Expansion of the fur trade into this region (from the Montreal route) occurred after 1710 [120]. From 1710-1800, trade in the region increased. Between 1720 and 1730, the annual employee count in the Northwest was ≈ 49 men. This increased to ≈ 140 men employed annually in the years from 1739-1752. Employee accounts from 1773 and 1798 were 205 and 917 employees, respectively. For the period 1710-1800, we use the average of these censuses, 328, as an estimate of the number of employees per year. From 1800 to 1820, trade remained relatively stable [120]. For this period we use an account from 1805 of 1,610 men. After the merger of the two major fur trading companies (the North West Company and Hudson’s Bay Company) in 1820, trade was consolidated and the Montreal route was abandoned, although a number of French Canadian voyageurs were kept on by the Hudson’s Bay Company. For this period we use as an estimate of the annual count of employees the number of voyageurs employed by the Hudson’s Bay Company in 1857, equal to 500 men (all preceding figures are from Innis [120] and references within). After 1870, traditional forms of fur trading concluded [120], as did Westward migration of French Canadians [92].

In order to estimate the absolute number of migrants to the Northwest using these employee counts, it is necessary to have an estimate of employee turnover: that is, how long did fur trade employees remain in the interior before dying, or returning to Quebec? A contract extended over at least two years, as the length and difficulty of the journey precluded outbound and return travel in a single season. Contracts for low level employees (including voyageurs) were generally a few years in length, but could extend for as long as 6 years [120]. Employees commonly extended their residence in the interior over multiple contracts. In order to estimate the average length of time spent in the interior, we used the published analysis of a 1785-6 account book from a Saskatchewan fur trade post [66]. The authors of this study cross-referenced names and demographic information from European traders mentioned in the account book with other archival material and were thus able to describe the career trajectories of the 105 individuals named therein. For each individual, we calculated the number of years spent in the interior, assuming that it was two years for individuals who appear in this journal but for whom no other records exist. Under these assumptions, European traders spent an average of 16 years in the interior.

Calculations performed to obtain our historical estimate of Nm (absolute number of M. tuberculosis infections introduced per transmission generation) are shown in Table B.2, with the columns corresponding to the three historical periods over which estimates were obtained. We assume that the population turned over every 16 years, so that the total number of migrants per population is the product of the annual census and the number of turnovers expected for the period. With these assumptions, the total number of migrants from 1710-1870 (over all three periods) is equal to 964 individuals per population (i.e. to each of ON, MB, SK, AB, BC and NWT, assuming that migrants were divided equally among receiving populations). Based on 1% annual tuberculosis mortality in European cities in the mid 18th century [94], 50% mortality among cases and disease duration of 2 years [261], we estimate a prevalence of 4% of infectious tuberculosis among the French Canadian colonists. This may be high, since the healthy may be over-represented among immigrants [46]. With 4% prevalence of tuberculosis disease among (human) migrants to the Northwest, 39 infections would have been introduced to each population.

For the purpose of comparison with genetic analyses of migration based on an island model, contact under this model extends from 1710 to 1995 (midpoint of the study period). Dividing this 285-year span by the generation time of 1.2 years results in an estimate of 237.5 transmission generations of contact. Finally, our historical estimate of Nm is equal to the absolute number of migrant infections (39) divided by the number of generations (237.5); this yields Nm = 0.16. If we assume that the number of migrants to each receiving population was proportional to the number of regional trading posts (instead of assuming that migration was equal to each population), historical estimates of Nm range between 0.04 (BC) and 0.25 (SK, AB). These calculations are made under a number of simplifying assumptions; as a result, we intend this result to be an order of magnitude estimate (i.e., historical Nm < 1).
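The chain of arithmetic leading to Nm = 0.16 can be retraced as follows, using only figures quoted in the text and in Table B.2. Two points are our reading rather than statements from the source: the 4% prevalence is taken as incidence (annual mortality divided by case fatality) multiplied by disease duration, and migrant totals are rounded per phase to match the table. Variable names are ours.

```python
RESIDENCE_YEARS = 16         # average time spent in the interior
GENERATION_TIME_YEARS = 1.2  # transmission generation time

# Annual Northwest employee counts quoted for 1710-1800; their mean gives 328.
early_counts = [49, 140, 205, 917]
growth_census = round(sum(early_counts) / len(early_counts))

# (annual census, receiving populations, years) for each phase in Table B.2.
phases = {
    "growth": (growth_census, 5, 90),
    "stabilization": (1610, 6, 20),
    "contraction": (500, 6, 50),
}

# Migrants per population = census per population x number of turnovers.
migrants_per_pop = {
    name: round(census / n_pops * years / RESIDENCE_YEARS)
    for name, (census, n_pops, years) in phases.items()
}
total_migrants = sum(migrants_per_pop.values())

# Our reading: prevalence = (annual mortality / case fatality) x duration.
prevalence = (0.01 / 0.50) * 2
infections = round(total_migrants * prevalence)

generations = (1995 - 1710) / GENERATION_TIME_YEARS
nm = infections / generations
print(migrants_per_pop, total_migrants, infections, generations, round(nm, 2))
```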

B.4 Supplemental Figures and Tables

[Figure B.1 image: panels A, B, and C]

Figure B.1: M. tb. minimum spanning trees based on minisatellite data and lineage typing. M. tb. isolates from QU are in red, ON in yellow, SK in green and AB in purple (note minisatellite genotypes unavailable for MB). The Rd 219 lineage is shown in panel A, Rd 182 is shown in panel B, and the “H37Rv-like” lineage is shown in panel C. H37Rv-like isolates have the katG CTG to CGG mutation, indicating that they are members of the Euro-American clade, but lack any of the genomic deletions that would identify them with a specific lineage within the clade.

Pepperell et al. www.pnas.org/cgi/content/short/1016708108

Figure B.2: Diversity calculations obtained from minisatellite data. A) Number of distinct minisatellite haplotypes as a function of the number of sampled chromosomes obtained using rarefaction (see Methods, Chapter 3). Populations are Quebec (QU), Ontario (ON), Non-Remote Saskatchewan (NRS), Remote Saskatchewan (RS), and Alberta (AB) (MB not included since minisatellite genotypes not available). B) Number of private minisatellite haplotypes as a function of the number of sampled chromosomes; as in part A. C) Minisatellite haplotype diversity, corrected for sample size differences by re-sampling to the size of the smallest sample (AB, n = 266), as described in text. Populations are Quebec (QU), Saskatchewan (SK), and Alberta (AB). D) Minisatellite haplotype diversity, corrected for sample size differences by re-sampling to the size of the smallest sample (NRS, n = 107). Populations are Quebec (QU), Non-Remote Saskatchewan (NRS), Remote Saskatchewan (RS), and Alberta (AB).

Figure B.3: Distribution of minisatellite haplotype diversities obtained from the permutation procedure under a random distribution of isolates to populations (described in Supplemental Methods (Section B.2)), with single vertical lines indicating the observed diversities in each population. A) Histograms are obtained by pooling isolates from all populations (QU, ON, AB, NRS, RS), and for each population, randomly selecting the indicated number of isolates (n) and calculating haplotype diversity; this is repeated 1000 times. B) Histograms are obtained from pooling isolates from only the populations Ontario, Alberta, Non-Remote Saskatchewan, and Remote Saskatchewan (not Quebec).
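The permutation procedure summarized in this caption can be sketched as below: pool the isolates, repeatedly draw n of them at random without replacement, and record the diversity of each draw to build a null distribution against which the observed diversity is compared. The diversity index is an assumption; the Shannon index is used here purely for illustration, since this appendix does not restate the definition. The pooled and observed haplotypes are hypothetical.

```python
import math
import random
from collections import Counter

def shannon_diversity(haplotypes):
    """Shannon index of haplotype frequencies (illustrative choice of index)."""
    n = len(haplotypes)
    return -sum((c / n) * math.log(c / n) for c in Counter(haplotypes).values())

def permutation_null(pooled, n, n_reps=1000, diversity=shannon_diversity, seed=0):
    """Diversity of n isolates drawn at random from the pool, repeated n_reps times."""
    rng = random.Random(seed)
    return [diversity(rng.sample(pooled, n)) for _ in range(n_reps)]

# Hypothetical pool of 100 isolates and a low-diversity population of 20 isolates.
pooled = ["h1"] * 50 + ["h2"] * 30 + [f"s{i}" for i in range(20)]
observed = shannon_diversity(["h1"] * 18 + ["h2"] * 2)

null = permutation_null(pooled, n=20)
p_low = sum(d <= observed for d in null) / len(null)
print(round(observed, 3), p_low)
```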

[Figure B.4 image. Panel A: probability density (PDF) of ω, with the cumulative density (CDF) of ω as an inset. Panel B: frequency histograms of Diversity, Var(Allele Size), Distinct Haplos, and Singletons.]

Figure B.4: Results from rejection sampling procedure in RS. A) Posterior density (main figure) and cumulative posterior density (inset) of ω, the size of RS prior to its expansion in 1930 (relative to the current population size). After translating the ω estimates to absolute numbers, the current size of RS is 30.222 and the historical size was 2.345; 95% Credibility Interval: [0.351, 8.803] (obtained from the cumulative density). B) Summary statistics from the retained simulations used to generate the posterior distribution for ω in the Rejection procedure. Red vertical lines indicate the observed summary statistics (given in Supplemental Results (Section B.3)).

Figure B.5: Validation results for rejection sampling procedure. A) Mean squared errors (MSEs) for ω (estimated using rejection sampling) when data are simulated under different true values assumed for ω (x-axis); ω = 1 is the constant size model. Different colors indicate the different statistics used for parameter inference (“All Statistics” includes the variance in allele size, haplotype diversity, the number of singletons, and the number of distinct haplotypes). Solid lines indicate use of the rejection procedure, and dashed lines indicate the GLM procedure (described in Supplemental Methods (Section B.2.5)). B) Model selection performance (y-axis), given as a proportion of the 300 simulations for which the correct demographic model was chosen (i.e., had a posterior probability greater than 0.5); as in part A, this is plotted against the true assumed ω value of the simulation (x-axis). For true ω values of 0.1, 0.5, and 0.8 (expansion model), the y-axis indicates power; for ω = 1 (constant size model), the y-axis indicates 1 minus the false positive rate. Different color lines indicate the different statistics used, as in part A. C) Rejection estimates of ω obtained from 300 simulations with the indicated assumed true ω values (left to right: ω = 0.1, 0.5, and 0.8). Histograms of the posterior means and the posterior modes are shown. D) Histograms of posterior probabilities of the expansion demographic model obtained for the 300 simulations under various true assumed ω values (expansion model: ω = 0.1, 0.5, and 0.8; constant size model: ω = 1).

Table B.1: Frequencies of M. tuberculosis lineages in four populations.

                           Population
Lineage                    QU n (%)    ON n (%)    SK n (%)    AB n (%)
DS6Quebec                  144 (48)    17 (38)     276 (62)    157 (55)
H37Rv-like*                53 (18)     14 (31)     152 (34)    96 (34)
Rd 115                     5 (2)       0           1 (0.2)     2 (0.7)
Rd 174                     1 (0.3)     0           0           0
Rd 182                     55 (19)     1 (2)       3 (0.7)     4 (1)
Rd 183                     1 (0.3)     5 (11)      0           2 (0.7)
Rd 219                     32 (11)     7 (16)      12 (3)      19 (7)
Non-Euro-American clade    6 (2)       1 (2)       0           3 (1)
Total sample               297         45          444         283

*These isolates have the katG CTG to CGG mutation, indicating that they are in the Euro-American clade, but they do not have any of the lineage defining genomic deletions (DS6Quebec, Rd 115, and so forth).

Table B.2: Fur trade migrants, Quebec to the Northwest, 1710-1870.

                                  Phase of Fur Trade
Population Data                   Growth         Stabilization    Contraction
                                  (1710-1800)    (1800-1820)      (1820-1870)
Northwest (NW) census*            328            1,610            500
Populations within NW†            5              6                6
Census per population             66             268              83
Time (years)                      90             20               50
Employee turnovers                5.625          1.25             3.125
Total migrants per population     369            335              260

*Annual census of French Canadian fur trade employees resident in the Northwest.
†The number of distinct populations within the NW. For 1710-1800, this includes Western Ontario, Manitoba, Saskatchewan, Alberta and the Northwest Territories. After 1800, there was significant trading activity in British Columbia as well.

Table B.3: Estimated mutation rates for M. tuberculosis minisatellite loci, per locus and per transmission generation; mutation rates on which the estimates are based are shown also, with appropriate units.

Method                                           Mutation rate used                                       Estimate
Transmission simulations                         Y. pestis and E. coli (per generation): 2.3 x 10^-5      0.00125
Comparison with RFLP (SK and QU)                 RFLP (per year): 0.287                                   0.0074
Comparison with RFLP (MIRU-VNTRplus database)    As above                                                 0.0115
θRFLP:θMIRU (QU)                                 As above                                                 0.0069

Appendix C

Supplemental Materials: Limited Evidence for Classic Selective Sweeps in African Populations

From Genetics. 2012 Nov; 192(3):1049-64.

C.1 Supplemental Methods

C.1.1 Samples

Table C.1 lists details of the samples used for each analysis. HapMap Maasai trio parents from Kinyawa, Kenya (MKK) (46 individuals) were included for population differentiation and haplotype analyses [121, 122, 123]. As Pemberton et al. [201] found evidence of undocumented relatedness between individuals in the Maasai, we removed a subset of the related individuals (NA21384, NA21475, NA21399, NA21365, NA21362, NA21382, NA21423, NA21453, NA21615, and NA21634), leaving a total of 46 Maasai individuals. For haplotype statistic calculations, we removed SNPs with a missing genotype rate > 10%, SNPs with minor allele frequency < 0.5% (i.e., singletons), and SNPs out of Hardy-Weinberg equilibrium (assessed in each population independently; p < 0.001).

C.1.2 Population Differentiation

We first analyzed patterns of population differentiation of all autosomal SNPs for evidence of local adaptation driven by positive selection. Differences in allele frequencies between populations measured by statistics such as FST may have the power to detect selective events that have occurred less than 50,000 to 75,000 years in the past [4, 231].

We calculated a per-SNP FST over all 11 populations (all populations except the HapMap YRI; see Methods, Chapter 4), a per-SNP derived allele frequency difference (δ) between all pairs of populations, and mean FST averaged over all genotyped SNPs between all pairs of populations. For these calculations, several admixed individuals were removed based on their inferred proportions of ancestry in ancestral clusters (see Figure 1 of Henn et al. [103]). Excluded individuals in the Hadza were BAR09, END13, END15, END21, END22, and END07; in the ≠Khomani Bushmen, excluded individuals were SA25, SA59, SA39, SA58, SA45, SA49, SA40, and SA03 (leaving 11 Hadza and 23 ≠Khomani Bushmen) (Table C.1).

Enrichments of genic and non-genic SNPs (see Chapter 4) were calculated for the absolute value of the derived allele frequency difference (|δ|) between pairs of populations, for each pair separately (55, or $\binom{11}{2}$, pairs), as in Coop et al. [52] (using a bin size of 0.1). An enrichment in a given bin was calculated as the proportion of genic (or non-genic) SNPs in that bin divided by the proportion of genic (or non-genic) SNPs among all genotyped SNPs. For a clearer picture, we counted the number of population pairs for which each given |δ| value was defined (some closely related populations, for instance, may not have SNPs with a |δ| as high as 0.9). We then counted the number of pairs for which, at each |δ| value, there appeared to be an enrichment of genic SNPs (an enrichment value > 1). We also averaged the enrichment values of genic and non-genic SNPs in each |δ| bin over all 55 population pairs as another summary of enrichment. Here, we examined the absolute value of δ because, when examining multiple population pairs, the polarization of a SNP is arbitrary.

We then examined more closely population differentiation between several individual population pairs (out of the 55 total pairs). For each pair, we calculated δ (since we examined each pair individually, we did not take the absolute value) and assessed the significance of genic and non-genic enrichments using bootstrapping. Briefly, bootstrapping involved resampling 200 kb regions of the genome with replacement, and re-calculating enrichment values in each bootstrap sample to generate confidence intervals (see Chapter 4).

Over all 55 pairs of populations, we examined the relationships of the maximum (over all SNPs) per-SNP FST and |δ| values between population pairs (as well as the top 99.99% tail values of FST and |δ|), versus mean pairwise FST. A best fit curve between the extreme FST (or |δ|) values and mean FST was drawn using the lowess function in R (http://www.r-project.org), as in Coop et al. [52].

We simulated the expected relationship under neutrality between the top 99.99% tail of |δ| values and mean FST using the “beta-binomial” method of Balding [11] (also used in Coop et al. [52]). The model assumes that a diverged pair of populations were once part of a common ancestral population, with the ancestral frequency of some allele equal to p. In the model, the allele frequency in each population follows a beta distribution with mean p and variance p(1 − p)FST, where FST is that between the two populations. Within each population, the sampled allele frequencies follow a binomial distribution, with the success probability parameter determined by the value drawn from the beta distribution. Since the ancestral allele frequency of each SNP is unknown, we first sampled an ancestral allele frequency p for each SNP from a uniform distribution on [0, 1] (as in Coop et al. [52]). We then simulated the sample allele frequencies in each population using the beta-binomial distribution and this value of p to calculate |δ|. We repeated this simulation of |δ| for the number of SNPs in our dataset to obtain a simulated value of the 99.99% tail (see Chapter 4 Methods).
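As an illustration of the beta-binomial simulation described above, the following Python sketch draws one |δ| value per simulated SNP and reads off an upper-tail quantile. It is a minimal sketch, not the code used for the analyses: the function names, the sample sizes (40 chromosomes per population), the FST value, the number of simulated SNPs, and the quantile rule are all assumptions chosen for illustration. The beta parameters a = p(1 − FST)/FST and b = (1 − p)(1 − FST)/FST follow from requiring mean p and variance p(1 − p)FST.

```python
import random


def beta_params(p, fst):
    """Beta(a, b) parameters giving mean p and variance p * (1 - p) * fst."""
    scale = (1.0 - fst) / fst
    return p * scale, (1.0 - p) * scale


def simulate_abs_delta(fst, n1, n2, rng):
    """Simulate |delta| for one SNP; n1 and n2 are sampled chromosome counts."""
    # Ancestral frequency is uniform on [0, 1]; nudge away from exact 0 or 1
    # so that both beta parameters stay strictly positive.
    p = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
    a, b = beta_params(p, fst)
    sample_freqs = []
    for n in (n1, n2):
        pop_freq = rng.betavariate(a, b)  # population allele frequency
        count = sum(rng.random() < pop_freq for _ in range(n))  # binomial draw
        sample_freqs.append(count / n)
    return abs(sample_freqs[0] - sample_freqs[1])


def tail_quantile(values, q):
    """Value at quantile q of the sorted values (simple index rule, assumed)."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


rng = random.Random(1)
sims = [simulate_abs_delta(0.05, 40, 40, rng) for _ in range(5000)]
tail = tail_quantile(sims, 0.9999)
```

In practice the number of simulated SNPs would match the number of SNPs in the dataset, and the simulation would be repeated for each population pair using its own mean FST.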

C.1.3 Choice of Several Haplotype Statistic Parameters

We used two haplotype statistics to search for patterns of selective sweeps: the Integrated Haplotype Score (iHS) [291] and the Cross Population Extended Haplotype Homozygosity Test (XP-EHH) [233]. For the XP-EHH, we used the HapMap CEU, HapMap YRI, HapMap MKK, and ≠Khomani Bushmen as reference populations.

For the results presented in Chapter 4, we binned the genome into 100 kb non-overlapping windows, and assigned to each a statistic based on the per-SNP haplotype statistics within that window. For XP-EHH, the window statistic used was the maximum XP-EHH value within the window (as in Pickrell et al. [207]), shown to be a powerful summary statistic to identify selection. For iHS, we used the proportion of SNPs in that window with |iHS| > 2 (as in Voight et al. [291]). We assigned empirical p-values to windows based on their numbers of SNPs by binning windows in increments of 10 SNPs per window. A p-value for a given window w was obtained as the proportion of windows within that bin with a statistic greater than or equal to the statistic of window w.

We assessed the sensitivity of results to the size of genomic windows used, to the bin sizes (in SNPs per window) for assigning p-values to windows, and to the XP-EHH reference population. Our selection of these parameters is described in detail below.
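The empirical p-value assignment can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the input layout (a list of SNP-count and statistic pairs), and the exact bin boundaries (integer division by the bin width, with all windows at or above a cap pooled into one bin) are assumptions rather than details taken from the analysis code.

```python
from collections import defaultdict


def empirical_p_values(windows, bin_width=10, max_snps=50):
    """windows: list of (n_snps, statistic) pairs; returns one p-value each.

    Windows are grouped by SNP count in steps of bin_width; windows with
    max_snps or more SNPs share a single bin (boundaries are an assumption).
    """
    def bin_of(n_snps):
        return min(n_snps, max_snps) // bin_width

    stats_by_bin = defaultdict(list)
    for n_snps, stat in windows:
        stats_by_bin[bin_of(n_snps)].append(stat)

    p_values = []
    for n_snps, stat in windows:
        peers = stats_by_bin[bin_of(n_snps)]
        # Proportion of windows in the same bin with a statistic >= this one.
        p_values.append(sum(s >= stat for s in peers) / len(peers))
    return p_values


example = [(5, 1.0), (7, 3.0), (8, 2.0), (12, 0.5), (55, 4.0), (90, 5.0)]
example_p = empirical_p_values(example)
```

Because each window is compared only with windows of similar SNP density, a window is not ranked as extreme simply because it contains more SNPs.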

100 kb vs. 200 kb window size

We studied two sizes for genomic windows: 200 kb, following Pickrell et al. [207], and 100 kb, following Voight et al. [291] and due to the lower linkage disequilibrium (LD) in African populations. To explore how strongly the choice of window size affected which windows appeared in the most extreme tails of the empirical distribution, we compared the top empirical windows obtained using each of the two window sizes. We examined iHS and XP-EHH using HGDP Europeans as a reference population (as in Pickrell et al. [207]). (Note that although this particular reference population was not used for our main results, these analyses of window size should be insensitive to the reference population used.) We performed calculations only for the ≠Khomani Bushmen, Sandawe, and Hadza, since genome-wide scans for selection in these populations have not been conducted previously.

We obtained an empirical p-value for each window based on the statistic value of each window and accounting for the number of SNPs within each window. For both 100 kb and 200 kb windows, p-values were obtained by binning windows in increments of 20 SNPs (as in Pickrell et al. [207]). Windows with ≥ 100 SNPs were combined into 1 bin, as few genomic windows contained ≥ 100 SNPs.

The top 1% of genomic windows were compared for each population between the method using 200 kb windows and the method using 100 kb windows. As expected, the top 1% of 100 kb windows contained approximately twice as many windows as the top 1% of 200 kb genomic windows. The proportion of 200 kb windows that did not overlap any 100 kb genomic window was near 0% for XP-EHH, and was ≈40% for the iHS. The proportion of 100 kb windows that did not overlap any 200 kb window (in other words, windows not identified by the 200 kb approach) was ≈30-50% for XP-EHH, and ≈40-70% for the iHS. Results did not change greatly for XP-EHH when we examined the top 0.1% or top 5% of genomic windows. When only the top 0.1% of windows were examined for iHS, the percentage of 100 kb windows that did not overlap a 200 kb window increased to ≈80%.

The results from XP-EHH indicate that selection signals can be localized to 100 kb within nearly all 200 kb windows (indicated by the ≈0% of 200 kb windows without an overlapping 100 kb window). The results from iHS suggest that this statistic is more influenced by the choice of window size; this is likely because the window statistic is the fraction of SNPs in a window with |iHS| > 2 (rather than the maximum value of the statistic, as it is for XP-EHH). Because of the lower levels of LD typically found in African populations [103], we chose the smaller size of 100 kb.

Binning by 20 vs. 10 SNPs per window

Another parameter of interest is the most appropriate bin size (in numbers of SNPs) within which to calculate empirical p-values for genomic windows. As in the previous analysis, we first binned windows in increments of 20 SNPs per window, combining all windows with ≥ 100 SNPs into one bin. As an alternative, we also binned windows in increments of 10 SNPs per window; for these bins, all windows with ≥ 50 SNPs were combined into one bin, because few 100 kb genomic windows contain ≥ 50 SNPs. Again, the only populations for which we calculated these statistics were the ≠Khomani Bushmen, Sandawe, and Hadza, and we used HGDP Europeans as a reference for XP-EHH (as previously).

For each population, the top 1% of genomic windows using increments of 20 SNPs per bin were compared with those using increments of 10 SNPs. The binning strategy does seem to have an effect on the empirical tail windows. For both XP-EHH and iHS calculations, ≈10% of the top 1% genomic windows from the 20 SNP binning method did not overlap with any of the top 1% of windows obtained using the 10 SNP method, and vice versa. With increments of 20 SNPs per window, a majority of windows were in the lowest SNP-count bin (0-20 SNPs), while with increments of 10 SNPs the number of 100 kb windows per bin was more consistent across bins. Thus, we decreased the SNP increment of bins to 10 SNPs per bin.

Selection of XP-EHH Reference Populations

Finally, we selected the most appropriate populations to use as reference populations for the XP-EHH statistic. Originally, we selected from the following reference populations: (i) all HGDP European populations, pooled together (152 individuals), as in Pickrell et al. [207]; (ii) HapMap CEU Trio Parents (88 individuals); (iii) HGDP Yorubans (21 individuals); (iv) HapMap YRI Trio Parents (100 individuals). The motivation for including both a European population and one of Yoruban origin was to search for selective sweeps of different ages. With trio information, HapMap individuals have more reliable phase inferences than those of the HGDP.

As with our other analyses of haplotype statistic parameters (window size and bin size), we compared the top 1% of 100 kb genomic windows for the XP-EHH between those obtained using different reference populations. Again, we calculated the statistics only for the ≠Khomani Bushmen, Sandawe, and Hadza.

HGDP Europeans vs. HapMap CEU. We first compared the top 1% of windows for XP-EHH using HGDP Europeans as a reference population with XP-EHH using HapMap CEU as a reference population. For all populations, ≈20-30% of the top 1% of windows were unique when using HapMap CEU as a reference, while ≈25-35% of the top 1% were unique when using HGDP Europeans. Thus, while a majority of the windows did show consistent signals using either reference population, ≈25% of the top windows depended on the reference population used.

HGDP Yorubans vs. HapMap YRI. When we compared the calculations using either HGDP Yorubans or HapMap YRI as reference populations, between 40-50% of windows in the top 1% were unique to the calculations using each reference population. The difference in empirical tails between calculations using each of the two Yoruban reference populations was much greater than the difference between calculations using each of the two European reference populations (described above); this could be due to differences in the quality of phase information.

HGDP Europeans vs. Yorubans. We also assessed the overlap between the top 1% of windows obtained using HGDP Europeans as a reference population versus those using HGDP Yorubans as a reference population. The overlap between the two methods was not high, ranging from 7-20% for all populations. As expected, using Europeans as a reference population resulted in different signals from those when using an African reference population.

Due to greater confidence in phasing, we selected the HapMap CEU and HapMap YRI as the European and Yoruban reference populations, respectively. We also used the HapMap Maasai (MKK) and ≠Khomani Bushmen (KHB) as reference populations. For these two reference populations, we had no alternative population against which to compare results (as we did for the HapMap CEU vs. HGDP Europeans and HapMap YRI vs. HGDP Yorubans).

Variation of Statistics within Windows

To ensure that breaking the genome into windows (as done by Voight et al. [291] and Pickrell et al. [207]) was a valid approach, we determined whether the variance of statistics within 100 kb windows was less than the variance among randomly selected SNPs. We examined the XP-EHH statistic in the HapMap YRI using HapMap CEU as a reference (as conducted in our main study); similar results should hold for all other sampled populations and statistics.

We used two methods to assess the validity of the windowing approach. First, we performed a Kruskal-Wallis test independently for each chromosome. We assigned to each SNP a categorical variable to indicate the identity of the 100 kb window in which it was located. We then determined whether there was a significant effect of window on the XP-EHH value of each SNP. We found the window effect to be significant, with a p-value of 0 for all chromosomes.

We also used a permutation approach to assess whether the variance of statistic values among SNPs within the same window was significantly less than the variance among groups of SNPs that were not within the same genomic window. To do so, we obtained the variance of the SNP XP-EHH values within each window, and then averaged the variance over all windows (average variance = 0.0554). For a null distribution, we sampled groups of non-adjacent SNPs on the same chromosome, with replacement, to mimic windows chromosome-wide. Since each genomic window contained a different number of SNPs, we used the same distribution of numbers of SNPs per window when drawing random groups of SNPs. We then computed a variance within each random group of SNPs, and averaged the variances across all random groups to obtain one draw of the average variance under the null hypothesis. We repeated this procedure 1,000 times. Our observed average variance within 100 kb windows was less than all of the 1,000 draws of average variances under the null hypothesis (empirical p-value = 0, i.e., p < 0.001 given 1,000 draws). Thus, from both analyses, it appears that breaking the genome into 100 kb windows effectively groups together SNPs with more similar statistic values, as desired.
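The permutation comparison of within-window variance can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the original analysis code: it treats a single chromosome, draws random groups with replacement from all SNP values without enforcing non-adjacency, and uses the population variance; the function names and toy data are hypothetical.

```python
import random
from statistics import pvariance


def mean_within_group_variance(groups):
    """Average, over groups, of the variance of the values within each group."""
    return sum(pvariance(g) for g in groups) / len(groups)


def permutation_p_value(windows, n_draws=1000, seed=0):
    """windows: list of lists of per-SNP statistic values on one chromosome.

    Returns the observed mean within-window variance and the proportion of
    null draws whose mean within-group variance is <= the observed value.
    """
    rng = random.Random(seed)
    observed = mean_within_group_variance(windows)
    pool = [value for window in windows for value in window]
    sizes = [len(window) for window in windows]  # keep SNP-count distribution
    hits = 0
    for _ in range(n_draws):
        null_groups = [rng.choices(pool, k=size) for size in sizes]
        if mean_within_group_variance(null_groups) <= observed:
            hits += 1
    return observed, hits / n_draws


toy_windows = [[0.0, 0.1, 0.05], [5.0, 5.1, 4.9], [10.0, 10.2, 9.9]]
toy_observed, toy_p = permutation_p_value(toy_windows, n_draws=200)
```

A small p-value here indicates that SNPs in the same window have more similar statistic values than randomly grouped SNPs, which is the property the windowing approach relies on.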

C.1.4 Comparison of Empirical Tail Windows

As a summary, we obtained the five genomic windows with the lowest empirical p-values in each population, and their p-values in all other populations, for iHS, XP-EHH CEU, XP-EHH YRI, XP-EHH MKK, and XP-EHH KHB. To do so, we first merged adjacent windows in the top 10% of each population using the program mergeBed in the program suite BEDTools [215]. If a top window in one population overlapped with a window in at least the top 10% of another population, we noted its p-value.

For each of iHS, XP-EHH CEU, YRI, MKK, and KHB, we performed a Mantel correlation between a matrix of mean FST over all autosomal SNPs between all pairs of populations and matrices of (i) correlations of window statistics across all genomic windows between pairs of populations, (ii) overlap in the top 1% empirical tail of windows between pairs of populations, and (iii) overlap in the top 0.1%. We also performed Mantel correlations between a matrix of the correlations of window statistics across all genomic windows between pairs of populations ((i) above), and matrices of both (iv) the overlap in the top 1% between pairs of populations and (v) overlap in the top 0.1%. We made scatter-plots of these five comparisons for each statistic for a visual representation of the Mantel correlations (see Supplemental Results (Section C.2)).

We also performed Mantel correlations between a matrix of the average Bantu ancestry of each population pair (see Methods, Chapter 4) and matrices of (vi) overlap in the top 1% tail between pairs of populations, and (vii) the overall correlation of window statistics between pairs of populations. Note that the Namibian San are not included in this analysis, as Bantu ancestral proportions for all individuals are not available in Henn et al. [103].

C.1.5 Neutral Coalescent Simulations

We performed neutral coalescent simulations of a three-population model and compared empirical tail windows for iHS and XP-EHH between populations (see Figure C.12). We assumed an Ne of 12,000 in each population, and obtained divergence times under an equilibrium model given values of FST using the formula $(1 - F_{ST}) = \left(1 - \frac{1}{2N_e}\right)^{t}$, where t is the divergence time in generations [115]. Divergence time between the reference population and populations 1 and 2 was 3900 generations (FST = 0.15); divergence times between populations 1 and 2 were: 2529 (FST = 0.10), 1231 (FST = 0.05), and 241 (FST = 0.01) generations. We assumed a uniform mutation rate of 1.5 × 10⁻⁸ per bp per generation (see Schaffner et al. [238]), and a uniform recombination rate of 10⁻⁸ per bp per generation. We also assumed uniform recombination to translate the msms results to physical and genetic positions, required as input for XP-EHH and iHS calculations.
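The divergence times quoted above can be checked by solving the equilibrium relationship (1 − FST) = (1 − 1/(2Ne))^t for t. The short Python sketch below does this; the function name is hypothetical and the code is only a worked check of the arithmetic.

```python
import math


def divergence_time(fst, ne):
    """Solve (1 - fst) = (1 - 1 / (2 * ne)) ** t for t, in generations."""
    return math.log(1.0 - fst) / math.log(1.0 - 1.0 / (2.0 * ne))


# Rounded divergence times for Ne = 12,000 and the FST values used above.
times = {fst: round(divergence_time(fst, 12000)) for fst in (0.15, 0.10, 0.05, 0.01)}
print(times)  # {0.15: 3900, 0.1: 2529, 0.05: 1231, 0.01: 241}
```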

An example command line for msms corresponding to FST = 0.10 between the two daughter populations (Populations 1 and 2 in Figure C.12) is below. The simulation is for one 10-Mb segment, where population “1” is the reference population, and populations “2” and “3” are the two daughter populations. Scaling is in units of 4Ne generations.

    msms -ms 60 1 -t 7200 -I 3 20 20 20 0 -ej 0.053 3 2 -ej 0.081 1 2 -r 4800 10000000
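The scaled values in this command follow from the parameters stated in the text (Ne = 12,000, a mutation rate of 1.5 × 10⁻⁸ and a recombination rate of 10⁻⁸ per bp per generation, and a 10-Mb segment). The Python sketch below recomputes them as a check; the function name and argument names are hypothetical, and it relies only on the standard coalescent scalings θ = 4Neμ·L, ρ = 4Ner·L, and time = generations / (4Ne).

```python
def scaled_parameters(ne, mu, rec, length_bp, split_generations):
    """Return (theta, rho, scaled split times) in units of 4 * ne generations."""
    theta = 4 * ne * mu * length_bp   # population-scaled mutation rate
    rho = 4 * ne * rec * length_bp    # population-scaled recombination rate
    times = [t / (4 * ne) for t in split_generations]
    return theta, rho, times


theta, rho, (t_daughters, t_reference) = scaled_parameters(
    ne=12000, mu=1.5e-8, rec=1e-8, length_bp=10_000_000,
    split_generations=[2529, 3900],
)
```

Up to rounding, these reproduce the 7200 and 4800 passed to -t and -r, and the 0.053 and 0.081 passed to the two -ej arguments.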

C.1.6 Selective Sweep Coalescent Simulations

For simulations under positive natural selection, we simulated a selective sweep in the first daughter population, and allowed the parent population and second daughter population to evolve neutrally (all parameters remained as in the neutral simulations, see above and Chapter 4). Segments of 5 Mb were simulated, since generally the signals of long haplotypes were contained within this region (the selected allele was in the center of each segment). In order to ensure that each simulated sweep was a hard sweep, we retained only the simulations for which the “origin count” of the selected allele was 1, and in which the selected allele was not lost.

We ran preliminary analyses and simulations in msms to determine which parameters of selection were the most appropriate. First, we chose a selection coefficient of s = 0.2. Smaller selection coefficients resulted in a large number of unusable simulations (with < 5% of all simulations fitting the above criteria); ≈ 5-10% of all simulations fit the above criteria for s = 0.2.

We examined several different timings for the start of selection of the allele. The first was 144 generations ago. For XP-EHH, this timing resulted in a very high true positive rate (see description of calculation below) for detecting selective sweeps, but for iHS, the true positive rate was much lower (see Supplemental Results (Section C.2)). This is because in nearly all simulations, the selected allele had reached fixation, resulting in lower power for iHS (see Results and Sabeti et al. [233], Voight et al. [291]). For these simulations, we simulated 80 5-Mb segments under selection, and 260 10-Mb segments under neutrality (for a total genomic length of 3000 Mb).

Through additional simulations, we found that a more recent selection time of 62.4 generations ago led to a more intermediate final frequency (10-90%) of the selected allele, allowing iHS to have greater power [233]. As a result, the true positive rate for both iHS and XP-EHH was high with this timing (see Supplemental Results (Section C.2)). For these simulations, we simulated 300 5-Mb segments (50% of the simulated “genome”) to ensure that we simulated enough genomic regions with selection; the remaining 1500 Mb were simulated under neutrality.

The selection options in msms that were used to simulate selection in the first daughter population (62.4 generations ago) are below (where timing is in units of 4Ne generations; arguments are used in conjunction with the demographic parameters, specified above, modified to simulate 5 Mb):

    -SI 0.0013 3 0 0.00001 0 -Sp 0.5 -Sc 0 2 9600 4800 0 -N 12000 -oOC -oTrace -Smu 0

As for the neutral coalescent simulations, for each of the 100 genome-wide simulations we calculated iHS and XP-EHH, binned the genome into windows of 100 kb, and ranked windows to obtain empirical tails. To examine the performance, or true positive rate, of the iHS and XP-EHH statistics, for a given empirical tail we calculated the proportion of its 100 kb windows that were within a selective sweep (non-neutral) segment. This rough measure indicates the ability of the statistics to detect signals of selection in 100 kb windows that are within 2.5 Mb of a selected site (since the selected site was located in the center of each simulated 5-Mb region). Depending on the exact lengths of haplotype signals produced from the sweeps, this may be an overestimate of the true positive rate, and an underestimate of the false positive rate.
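The true positive rate described here is a simple proportion, which the following Python sketch makes explicit. The data layout (each tail window labeled with the identifier of the simulated segment it came from) and all names are assumptions made for illustration.

```python
def true_positive_rate(tail_windows, sweep_segments):
    """Fraction of empirical-tail windows located on a sweep segment.

    tail_windows: list of (segment_id, window_start_bp) pairs.
    sweep_segments: set of segment ids that were simulated under selection.
    """
    if not tail_windows:
        return 0.0
    hits = sum(1 for segment_id, _start in tail_windows if segment_id in sweep_segments)
    return hits / len(tail_windows)


tail = [("s1", 0), ("s1", 100_000), ("n4", 300_000), ("s2", 2_500_000)]
rate = true_positive_rate(tail, {"s1", "s2"})
```

Any window on a 5-Mb sweep segment counts as a true positive regardless of its distance from the selected site, which is why the measure can overestimate the true positive rate.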

C.1.7 Gene Ontology Enrichment

Replication of Voight et al. 2006 Results

We followed the procedure of Voight et al. [291], who searched for Gene Ontology (GO) term enrichment in the HapMap YRI for the iHS statistic. First, we obtained the 50 closest SNPs centered around each gene (obtained from the UCSC refFlat mappings of RefSeq genes to hg18; downloaded in May 2010, http://www.genome.ucsc.edu). If there were more than 50 SNPs located within a gene, we kept all of the SNPs. Then, we assigned to each gene a summary statistic, calculated as the proportion of its 50 SNPs with normalized |iHS| > 2. We ranked the genes in order of their summary statistics, and examined enrichments in the top 10% of genes. We assigned GO terms to genes using the biomaRt package in R (http://www.r-project.org; described in more detail below). We also pruned terms to include only those present in the PANTHER database (http://www.pantherdb.org), as Voight et al. [291] use this dataset. Note that in our analysis, we examined ≈ 500,000 SNPs; this is only a subset of the SNPs analyzed by Voight et al. [291]. Then, we assessed enrichments for each PANTHER biological process GO term using the hypergeometric distribution. The hypergeometric distribution for a random variable X can be written as:

\[ P(X = k) = \frac{\binom{r}{k}\binom{n-r}{m-k}}{\binom{n}{m}}. \tag{C.1} \]

Here, for a particular GO term, k is the number of genes in the top 10% of genes with that GO term, r is the number of genes annotated with the GO term in the genome overall, n is the total number of genes in the genome, and m is the number of genes in the top empirical 10%. To calculate a p-value for a particular GO term, given its count of occurrences in the top empirical 10% of genes (k), we calculate P(X ≥ k) by summing P(X = i) for i ∈ [k, r] if r ≤ m and i ∈ [k, m] if m ≤ r.
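As a worked illustration of this p-value, the Python sketch below evaluates the hypergeometric probability with exact binomial coefficients and sums the upper tail up to min(r, m). The function names and the toy numbers are hypothetical; the symbols k, r, n, and m are those defined above.

```python
from math import comb


def hypergeom_pmf(k, r, n, m):
    """P(X = k): k of the m top genes carry a term that annotates r of n genes."""
    return comb(r, k) * comb(n - r, m - k) / comb(n, m)


def enrichment_p_value(k, r, n, m):
    """P(X >= k), summing from k up to min(r, m)."""
    return sum(hypergeom_pmf(i, r, n, m) for i in range(k, min(r, m) + 1))


# Toy case: 10 genes, 4 annotated with the term, 3 in the top set, 2 observed.
toy_p_value = enrichment_p_value(k=2, r=4, n=10, m=3)
```

For the toy case the tail sums to (6 × 6 + 4 × 1) / 120 = 1/3.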

Haplotype Statistics

We explain in more detail our novel permutation approach for assessing significance of GO terms within populations, illustrated in Figure 4.1 of Chapter 4. The method is particularly designed for analyses of genomic windows, within which several genes can exist. Significance is assessed for the top x% of 100 kb genomic windows for a particular haplotype statistic (iHS or XP-EHH); in our analyses, we examine the top 0.1%, 1%, and 5%. (We focus on the top 0.1% in Chapter 4.)

1. Observed empirical tails. For the top x% of genomic windows, we determined the GO terms associ- ated with the genes in each genomic window. (The biomaRt package in R was used to perform these lookups, using “ensembl_mart_51” to ensure that the genomic coordinates (Build 36) were consistent). Adjacent genomic windows were merged, so that a signal > 100 kb would have each of its GO terms counted only once. (This helps to reduce possible dependence between 100 kb windows.) For each unique GO term, we counted its number of occurrences.

2. Genome-wide windows. We determined the associated GO terms for each genomic window containing at least one SNP with a statistic value (using biomaRt, as above). (Some SNPs were removed in the XP-EHH and iHS calculations due to certain criteria. For instance, for iHS, the minor allele frequency of each SNP must be ≥ 5%; Voight et al. [291] also stated, “if the region spanned by EHH > 0.05 reached a chromosome end or the start of a gap > 200 kb, then no iHS value was reported for the core SNP.” Similarly for XP-EHH, Sabeti et al. [233] stated, “if there is no SNP with such an EHH between 0.03 and 0.05, the XP-EHH test is skipped.” Thus, not all genotyped SNPs resulted in a calculated XP-EHH or iHS value.)

We then binned the above windows by the number of SNPs per window, as done when calculating p- values for the observed data (described in section “Choice of Several Haplotype Statistic Parameters”).

3. Randomization step. In each SNP count bin, we randomly selected x% of windows, and combined the top windows from all bins to create a “random top x%” set of windows. Then, for each unique GO term, we recorded its number of occurrences across all of the windows in the “random top x%.”

We repeated this randomization step 20,000 times, recording for each GO term its number of occur- rences in each of the randomizations to form the null distribution.

4. P value calculation. For each GO term G, we calculated a p-value for over-representation of G in the observed x% tail as the proportion of the 20,000 permutations for which the number of occurrences of G is greater than or equal to the observed number of occurrences.

5. Multiple hypothesis testing correction. To account for multiple hypothesis testing, we controlled the false discovery rate (FDR) at a 5% level using the procedure of Benjamini and Hochberg [26]. This is less conservative than the Bonferroni correction, but more conservative than using no multiple hypothesis testing correction.

In some cases, we analyzed only a subset of all GO terms that are found in the PANTHER database, as in Voight et al. [291]. This only affects our assignment of GO terms to windows in steps 1 and 2 above.
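The five steps above can be sketched compactly in Python. This is a minimal illustration under stated assumptions rather than the analysis code: windows are dictionaries with hypothetical keys "snp_bin" and "go_terms" (a set of terms), merging of adjacent windows is omitted, and at least one window is drawn per bin even when x% of a bin rounds to zero. The Benjamini-Hochberg step is the standard step-up procedure.

```python
import random
from collections import Counter, defaultdict


def go_counts(windows):
    """Number of windows in which each GO term occurs."""
    counts = Counter()
    for window in windows:
        counts.update(window["go_terms"])  # a set, so one count per window
    return counts


def permutation_p_values(observed_top, all_windows, frac, n_perm=20000, seed=0):
    """P-value per observed GO term from random 'top x%' sets drawn within bins."""
    rng = random.Random(seed)
    observed = go_counts(observed_top)
    by_bin = defaultdict(list)
    for window in all_windows:
        by_bin[window["snp_bin"]].append(window)
    exceed = Counter()
    for _ in range(n_perm):
        sample = []
        for members in by_bin.values():
            k = max(1, round(frac * len(members)))  # assumption: at least one
            sample.extend(rng.sample(members, k))
        null = go_counts(sample)
        for term, obs in observed.items():
            if null[term] >= obs:
                exceed[term] += 1
    return {term: exceed[term] / n_perm for term in observed}


def benjamini_hochberg(p_values, alpha=0.05):
    """Set of terms rejected when controlling the FDR at level alpha."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    cutoff = 0
    for rank, (_term, p) in enumerate(ranked, start=1):
        if p <= alpha * rank / len(ranked):
            cutoff = rank
    return {term for term, _p in ranked[:cutoff]}
```

Drawing the random sets within SNP-count bins mirrors the way the observed p-values were assigned, so that gene-dense or SNP-dense windows are not over-represented in the null distribution.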

Top FST Values

To assess whether any GO terms were enriched among SNPs with the most extreme values of FST (calculated across all 11 populations), we first followed the approach described previously for genomic windows. For each SNP, we found the gene, if any, in which it was located, and obtained the GO terms associated with that gene. We then randomly selected x% of all SNPs, and counted the occurrence of each GO term; we repeated this 20,000 times to obtain a null distribution for the counts of each GO term. We obtained p-values for observed counts of GO terms from this null distribution with an FDR correction. To be more conservative when identifying over-represented terms among top genomic windows, we restricted our results to terms associated with extreme FST SNPs occurring in different, not the same, genes.

In addition, we searched for significant GO terms with the program GREAT [169], which uses a binomial test to assess significance among a foreground set of genomic regions (in our case, SNPs with large FST values). We used the SNPs genotyped in our dataset as a background set of genomic regions to control for non-uniform SNP density and ascertainment. Using the parameters described below, GREAT first infers gene regulatory domains, and then the GO terms associated with a given genomic region based on the domains it overlaps. It calculates the percentage of the genome annotated with each GO term; then, it counts the SNPs in the tails that are associated. A binomial test over the genomic regions is performed to obtain a p-value for each GO term, using an FDR correction (see McLean et al. [169] for more details). (This program was not used for 100 kb genomic windows, as the genomic regions of GREAT are required to be smaller.)

GREAT allows for a number of “association rule” parameters, which set the boundaries of gene regulatory regions. We first used the default settings of the “basal plus extension” rule, allowing each gene to be associated with a basal regulatory domain a minimum distance of 5 kb upstream and 1 kb downstream of the transcription start site (TSS). This domain is extended both upstream and downstream until the nearest gene’s basal domain, and no more than 1000 kb away from the gene; any SNPs within this domain are associated with the gene. We then allowed the domain to be extended no more than 50 kb or 0 kb away (instead of 1000 kb).

C.2 Supplemental Results

C.2.1 Population Differentiation

Pairwise FST estimates averaged over all autosomal SNPs agreed with those of Henn et al. [103], who calculated FST between inferred ancestral populations (Table C.2). On average, our estimates were slightly lower, as we included admixed individuals with ancestry in several ancestral populations.

We calculated for each SNP the absolute values of derived allele frequency difference (|δ|) between all (55, or $\binom{11}{2}$) pairs of populations, following Coop et al. [52]. Not all population pairs exhibited high |δ| values (Figures C.1A and C). For some pairs of populations, there was an enrichment of genic SNPs at high |δ| values, but for others, there was no enrichment (Figure C.1A). Among the population pairs for which genic SNPs were enriched among the highest |δ| values for the pair were the Kenya Bantu and Maasai; the South African Bantu and the ≠Khomani Bushmen and Sandawe; the Hadza and the Maasai; the Mbuti and the: Hadza, Maasai, Mandenka, Namibian San, ≠Khomani Bushmen, Sandawe, and Yoruba; and the ≠Khomani Bushmen and Yoruba. Interestingly, all population pairs had SNPs with allele frequency differences in the smallest bin (0 < |δ| < 0.1), with an apparent raw genic enrichment of SNPs for all (Figure C.1C). Though this enrichment of genic SNPs may not be significant in all populations, it may indicate the action of purifying selection. When enrichment values in each |δ| value bin (in Figure C.1A) were averaged over all population pairs, there was only a slight increase in enrichment of genic SNPs at extreme values (Figure C.1B).

We examined several of the population pairs individually for evidence of genic enrichment, using a bootstrapping procedure to assign significance (described in the Supplemental Methods (Section C.1) and Chapter 4), and present the results in Figure C.2. For individual population pairs, we examined the signed (non-absolute) value of δ. In most cases, the SNPs in the most extreme δ value bins were indeed genic, but there were very few (< 5) such SNPs. In the bootstrapping procedure, this small number of SNPs with extreme δ values was problematic: in general, fewer than 95% of the 1,000 bootstrap samples actually contained a SNP with a value of δ in the highest (or lowest) bin (Figure C.2). Thus, the existence of SNPs in this bin was itself not significant, and so we could not assign significance to genic enrichment in these bins. (In Figure C.2C and D, while > 95% of bootstrap samples were valid, enrichment was not significant.) Interestingly, the next-most-extreme δ bin was never significant according to the 95% bootstrap confidence intervals. While it may be interesting that a majority of the most strongly differentiated SNPs were genic (and that the few bootstrap samples that actually contained these SNPs showed an enrichment), we cannot state with statistical certainty that there is evidence of genic enrichment of strongly differentiated SNPs among pairs of African populations.
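The difficulty with sparsely populated bins can be illustrated with a minimal bootstrap sketch. It assumes that individual SNPs are resampled with replacement and that enrichment is the genic fraction in the bin divided by the genic fraction overall; the procedure actually used is described in Chapter 4 and may differ (for example, by resampling genomic regions). The function name and arguments are hypothetical.

```python
import numpy as np


def bootstrap_bin_enrichment(delta, is_genic, lo, hi, n_boot=1000, seed=0):
    """Bootstrap the genic enrichment of SNPs with lo <= delta <= hi.

    Returns (proportion of replicates in which the bin is non-empty,
             (2.5th, 97.5th) percentiles of enrichment over those replicates).
    Assumption: SNPs are resampled independently with replacement.
    """
    rng = np.random.default_rng(seed)
    delta = np.asarray(delta, float)
    is_genic = np.asarray(is_genic, dtype=bool)
    n = delta.size
    values = []
    for _ in range(n_boot):
        sample = rng.integers(0, n, size=n)
        d, g = delta[sample], is_genic[sample]
        in_bin = (d >= lo) & (d <= hi)
        # Enrichment is defined only if the bin is non-empty.
        if in_bin.any() and g.any():
            values.append(g[in_bin].mean() / g.mean())
    prop_defined = len(values) / n_boot
    if values:
        ci = tuple(np.percentile(values, [2.5, 97.5]))
    else:
        ci = (float("nan"), float("nan"))
    return prop_defined, ci


# Toy data (made up): 1,999 weakly differentiated SNPs plus one extreme SNP.
toy_delta = np.append(np.linspace(-0.5, 0.5, 1999), 0.95)
toy_genic = np.arange(2000) % 2 == 1   # alternate nongenic/genic; last is genic
prop, ci = bootstrap_bin_enrichment(toy_delta, toy_genic, lo=0.9, hi=1.0)
```

Under this simple resampling assumption, a bin containing k of n SNPs is non-empty in a fraction 1 − (1 − 1/n)^(nk) ≈ 1 − e^(−k) of replicates: roughly 0.63 for k = 1, 0.86 for k = 2, and 0.95 for k = 3. A bin therefore needs about three or more SNPs before at least 95% of replicates have a defined enrichment value.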

We also examined the SNPs with |δ| > 0.8 in at least one population pair. In most cases, these high values arose because one allele was completely fixed in, or absent from, the Mbuti Pygmies or Hadza. However, the high-|δ| SNPs between these two populations did not show a genic enrichment (Figure C.2D). The lack of genic enrichment among the many highly differentiated SNPs is likely because both populations have undergone severe bottlenecks in recent history [103].

We also analyzed population differentiation between the HGDP Tuscans and Yorubans (Figure C.3). Here, while we found that the bootstrap confidence intervals were valid, genic enrichment was not significant for either population. See Chapter 4 for discussion of these results.

Although the analyses here (see also Figure C.4) and in Chapter 4 (Figures 4.2 and 4.3) indicated that allele frequency differentiation is potentially strongly affected by population history, there were several interesting genes within which several SNPs lay in the empirical top 1% of all FST values. Among these were the lactase gene (LCT), involved in lactase persistence; several genes known to be involved in skin pigmentation (HERC2, HPS2, KIT, MITF, NF1, OCA2, PTGER3); and other genes known to be involved in disease resistance (HLA, LARGE, TLR5, TLR6). However, as differentiation of these SNPs may not necessarily be due to positive selection (see Chapter 4 and above), we did not explore these signals further.
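Identifying genes with several SNPs in an empirical tail amounts to a quantile cutoff followed by a count per gene. The sketch below is illustrative only; the function name, the `min_snps` threshold, and the toy inputs are assumptions rather than the settings used for the analysis above.

```python
from collections import Counter

import numpy as np


def genes_with_tail_snps(fst, genes, tail=0.01, min_snps=2):
    """Genes containing at least min_snps SNPs in the empirical top tail of FST.

    fst: per-SNP FST values.
    genes: per-SNP gene label, or None for SNPs outside genes.
    """
    fst = np.asarray(fst, float)
    cutoff = np.quantile(fst, 1.0 - tail)   # empirical tail threshold
    counts = Counter(
        g for f, g in zip(fst, genes) if g is not None and f >= cutoff
    )
    return cutoff, {g: c for g, c in counts.items() if c >= min_snps}


# Toy example (made-up values): 100 SNPs with FST = 0.00, 0.01, ..., 0.99.
toy_fst = [i / 100 for i in range(100)]
toy_genes = [None] * 100
toy_genes[95] = toy_genes[96] = toy_genes[99] = "geneA"
toy_genes[97] = "geneB"
toy_cutoff, toy_hits = genes_with_tail_snps(toy_fst, toy_genes, tail=0.05)
```

Because the cutoff is purely empirical, a fixed fraction of SNPs always falls in the tail whether or not selection has acted, which is one reason such outliers are not by themselves evidence of positive selection.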

C.2.2 Haplotype Statistics

Overlap of Windows Between Populations

There was considerable overlap between populations of top windows for the XP-EHH CEU statistic (Figure C.6), but less overlap for iHS (Figure C.5), and even less for XP-EHH MKK, YRI, and KHB (Chapter 4 Figure 4.4 and Figures C.7 and C.8). Note that our results for the iHS and XP-EHH CEU were slightly different from those of Pickrell et al. [207] and Henn et al. [103], as they examined genomic windows of 200 kb.

We examined the relationship between mean FST and extent of sharing of the top 0.1% of windows between pairs of populations (see Supplemental Methods (Section C.1)). We saw results similar to the top 1% (see Chapter 4 Results) for the XP-EHH CEU, XP-EHH MKK, and XP-EHH KHB, although the relationship between these two variables appeared weaker than for the top 1% of windows (Figure C.9). Mantel correlations remained significant for these statistics, but not for the iHS or XP-EHH YRI (Table C.3). (Since the maximum number of windows shared in the top 0.1% for any two populations was 1 for the iHS and XP-EHH YRI, there may be little power to detect any relationship.) For each of iHS, XP-EHH CEU, YRI, MKK, and KHB, we also computed a correlation of the selection statistics per window across all genomic windows between all pairs of populations (denoted “correlation of window statistics”). For all five statistics, including the XP-EHH YRI, the number of windows in the top 1% of both populations of a pair increased as the correlation of window statistics for that pair increased, with a significant Mantel correlation between these two variables (Table C.3, Figure C.10). For the top 0.1% of genomic windows, XP-EHH CEU, XP-EHH MKK, and XP-EHH KHB still exhibited significant relationships between the correlation of window statistics and the number of shared windows in the top 0.1%, though to a lesser degree than for the top 1% (Table C.3). For the iHS and XP-EHH YRI, no more than 1 window was shared between any pair of populations, again resulting in non-significant Mantel correlations (Table C.3).
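The two quantities compared here, the number of top-tail windows shared by a pair of populations and a Mantel correlation between pairwise matrices, can be sketched as follows. This is a minimal illustration, not the analysis code: it assumes that larger statistic values are more extreme, that no values are missing, and that the Mantel test uses a Pearson correlation with a two-sided permutation p-value. Function names are hypothetical.

```python
import numpy as np


def shared_top_windows(stats, tail=0.01):
    """Count windows in the top `tail` fraction of both populations of a pair.

    stats: array of shape (n_populations, n_windows) of window statistics.
    Returns an n_populations x n_populations matrix of shared-window counts.
    """
    stats = np.asarray(stats, float)
    n_pops, n_win = stats.shape
    k = max(1, int(round(tail * n_win)))
    top = np.zeros((n_pops, n_win), dtype=int)
    for i in range(n_pops):
        top[i, np.argsort(stats[i])[-k:]] = 1   # mark the k largest windows
    return top @ top.T


def mantel(x, y, n_perm=999, seed=0):
    """Mantel test between two symmetric matrices (e.g., FST and sharing).

    Rows and columns of x are permuted jointly; the p-value is two-sided.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    upper = np.triu_indices_from(x, k=1)
    r_obs = np.corrcoef(x[upper], y[upper])[0, 1]
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(x.shape[0])
        r = np.corrcoef(x[p][:, p][upper], y[upper])[0, 1]
        if abs(r) >= abs(r_obs):
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)


# Toy statistics for three populations and ten windows (made-up values).
toy_stats = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
    [1, 2, 3, 4, 5, 6, 7, 8, 10, 9],
]
toy_shared = shared_top_windows(toy_stats, tail=0.2)
```

The permutation step shuffles population labels rather than individual matrix entries because the 55 pairwise values are not independent; each population contributes to ten of them. When the shared-window counts are all 0 or 1, as for the iHS and XP-EHH YRI in the top 0.1%, such a test has little variance to work with.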

We also found that more closely related population pairs (those with lower mean FST) had more highly correlated window statistic values for iHS, XP-EHH CEU, XP-EHH MKK, and XP-EHH KHB (all except for XP-EHH YRI). Mantel tests for these four statistics indicated a significant correlation between the two variables (Table C.3). This negative trend between FST and correlations of window statistics suggests that values of these selection statistics in the genome as a whole correspond closely with population divergence. However, the XP-EHH YRI statistic appeared to behave in a fundamentally different manner from the other statistics. We observed no relationship between FST and the correlations of window statistics between pairs of populations, nor between FST and the extent of overlap in empirical tails. Since the empirical tails did seem to be related to overall correlations of window statistics, however, the different behavior of XP-EHH YRI may not necessarily indicate that between-population patterns are driven by shared selective pressures. Interestingly, we found that the correlations of window statistics across the whole genome appeared to be related to the average proportion of Bantu ancestry. The correlation of window statistics between pairs of populations for XP-EHH YRI decreased as the average proportion of inferred Bantu ancestry for each population in the pair increased (p-value = 0.009; see Figure C.11 and Table C.3). We found a similar result for sharing in the 1% empirical tail (Table C.3). The greater amount of sharing of windows between populations with low amounts of Bantu ancestry appears to be an artifact of using a population with a large amount of Bantu ancestry as a reference.

Further Examination of Shared Windows

Population history appeared to be difficult to separate from the extent of sharing of candidate signals (see previous section). However, positive selection on the same genomic regions might be able to explain an increased number of shared windows beyond a “baseline” expected from shared ancestry. We examined the pairs of populations that appeared to have a greater than expected number of windows shared in their top 0.1% when compared to other population pairs with similar values of FST (Figure C.9). The Mandenka and Yoruba were an outlying population pair for the XP-EHH MKK, with five common windows in their top 0.1%. Of these, two adjacent regions on chromosome 11 (9,800,000-10,100,000) were outliers only in the Mandenka, the Yoruba, and the South African Bantu. This region contains the gene SBF2, mutations in which are known to be associated with Charcot-Marie-Tooth disease (characterized by abnormal myelin sheaths) [309].

For the XP-EHH YRI, the Sandawe, Hadza, and Maasai all shared a region on chromosome 2 (136,300,000-136,400,000) containing the genes MCM6, LCT and DARS. This region was also found in at least the top empirical 5% of the Mbuti and Biaka Pygmies, as well as the ≠Khomani Bushmen. Because this region appeared to be significant in hunter-gatherer, pastoralist, and agricultural populations, sharing may not be due to convergent adaptation of lactase persistence but rather to another unknown selective pressure or to a sweep at another gene, as suggested by Tishkoff et al. [276]. A region shared in the top 0.1% of the Kenyan Bantu and Maasai for XP-EHH YRI was on chromosome 2 (38,800,000-38,900,000), and was otherwise found only in the top 5% tail of the Mandenka. Among other genes, this region contains GALM, involved in galactose metabolism. Thus, a potentially diet-related trait could be under selection in all three of these agricultural/pastoralist populations (Kenyan Bantu, Maasai, Mandenka).

There were several outlying population pairs for the XP-EHH MKK (Figure C.9). For the Mandenka and Yoruba, among the five shared windows in their top 0.1% was a window that was not in the top empirical 5% of any other population except the Kenyan Bantu; however, it contains no known genes (chromosome 13: 56,900,000-57,000,000). Two adjacent windows on chromosome 3 (56,500,000-56,700,000) were not in the top empirical 5% of any other tested population. This region contains the genes CCDC66 and C3orf13, whose functions are unknown. For iHS, one pair with a shared window was the Sandawe and Maasai; this window (chromosome 11: 84,600,000-84,700,000) was not an outlier in any other population and contains DLG2, potentially involved in synapse function. The ≠Khomani Bushmen and Maasai shared an empirical outlying region for XP-EHH YRI, which was otherwise found only in the empirical tail windows of the Hadza. This region on chromosome 2 (196,900,000-197,000,000) contains the gene HECW2, possibly related to ubiquitination.

Coalescent Simulations

For data simulated under the neutral coalescent, in nearly all empirical tails for XP-EHH, there appeared to be a strong relationship between the FST of the two daughter populations and the extent of sharing of windows in their empirical tails (Figure C.13). For iHS, this relationship was less strong, though still apparent (Figure C.14).

For data simulated under selection for XP-EHH, there was generally no relationship between the number of overlapping windows and FST, regardless of timing of selection (Figures C.15 and C.17). In Figure C.17, where selection is less recent, there was a more apparent relationship between the two quantities (especially for the 0.001 tail). While our calculated true positive rate was still 100%, this observation may arise because our estimate of the true positive rate is an overestimate (see Supplemental Methods (Section C.1)). This observation may also be due to the inadequate number of regions simulated for this selection timing (80 5-Mb regions, versus 300 5-Mb regions for the more recent selective sweeps in Figure C.15).

For data simulated under selection for iHS, there was a more apparent relationship of the number of overlapping tail windows with FST for older selective sweeps (Figure C.18) than for recent sweeps (Figure C.16). The true positive rate for iHS was nearly 100% for more recent sweeps (62.4 generations ago), where the selected alleles had not yet swept to fixation. However, for older sweeps (144 generations ago), the selected allele had generally risen to fixation, resulting in a drop in power [233, 291] and a decrease of the true positive rate for iHS to as low as 60%. This caused the empirical tails to generally contain more neutral, false positive genomic windows. The presence of false positive signals in empirical tails likely explains the apparent relationship between the extent of sharing of empirical tail windows and a population pair’s FST (Figure C.18).

C.2.3 Gene Ontology Enrichments

Replication of Voight et al. 2006 Results

We replicated the procedure of Voight et al. [291] to assess enrichment of Gene Ontology (GO) terms in the top 10% of genes according to iHS calculations in the HapMap YRI. Voight et al. [291] reported enrichment for steroid metabolism (GO:0008202), MHC-I mediated immunity (GO:0032393 and GO:0032394), chemosensory perception (GO:0007606), olfaction (GO:0007608), and peroxisome transport (GO:0043574). Repeating their analysis with their methodology, we also found enrichments for MHC-I mediated immunity (p-value = 0.002) and olfaction (p-value = 0.02625). Although we did not find an enrichment for chemosensory perception, this is likely due to a change in the GO database (http://amigo.geneontology.org, accessed June 2011): surprisingly, we found very few genes in the GO database associated with chemosensory perception. We also found a higher p-value for steroid metabolism than did Voight et al. [291] (p-value = 0.6684). This is again likely due to the GO database: while we found many genes associated with the term when we looked up the term in the database, the term was not associated with any of them when the genes were looked up in the database (as done for our analysis). Finally, the term peroxisome transport was not in the PANTHER database, though Voight et al. [291] reported it to be. Thus, most differences in results when replicating the method of Voight et al. [291] were likely due to changes in the GO database since their study. In addition, we examined only a subset of the SNPs analyzed by Voight et al. [291], resulting in slightly different p-values. While we found several other significant terms when using the method of Voight et al. [291], none of them were significant when using our more rigorous permutation method (described in the Supplemental Methods (Section C.1) and in Chapter 4 Methods and Results).

Haplotype Statistics

Using our permutation method, we found that all haplotype statistics (iHS, and XP-EHH CEU, YRI, MKK and KHB) exhibited at least some signatures of GO term enrichment. In addition, GO terms appeared to be enriched in all empirical tails for the statistics (the top 0.1%, 1%, and 5%) (Table C.4 and Chapter 4 Tables 4.1 and 4.2). However, terms that appeared enriched in a less extreme tail (e.g., top 1%) were not necessarily enriched in a more extreme tail (e.g., top 0.1%), potentially complicating the choice of the most appropriate cutoff. Our neutral simulations also highlight that if an improper empirical tail is analyzed, attributing sharing of signals between populations to positive selection could be largely inaccurate (see also Figures C.13 and C.14). In Chapter 4 (Tables 4.1 and 4.2), we present results from the top 0.1%, motivated by the results of Table C.3.

FST Values

We performed a similar Gene Ontology (GO) term enrichment analysis for SNPs with the top empirical values of FST (calculated across all 11 populations). We first used a permutation method, analogous to the one performed for genomic windows and haplotype statistics, but found no significant results (see Supplemental Methods (Section C.1)). We also used the program GREAT [169] to search for enrichment of terms associated with SNPs with the largest values of FST. Our results varied depending on the “extension” rules determining the association of a SNP with a particular gene (see Supplemental Methods (Section C.1)). When we allowed the domain associated with a gene to extend up to 1000 kb beyond the basal domain, we found five significant GO terms with multiple gene associations: regulation of saliva secretion (GO:0046877), regulation of ATPase activity (GO:0043462), positive regulation of ATPase activity (GO:0032781), tricarboxylic acid cycle (GO:0006099), and coenzyme catabolic process (GO:0009109). However, when we allowed an extension of only 50 kb, none of these terms were significant, and instead we found associations of a different nature: spermatogenesis (GO:0007283), regulation of estrogen receptor signaling pathway (GO:0033146), chromatin modification (GO:0016568), regulation of cytokine biosynthetic process (GO:0042089), gamete generation (GO:0007276), muscle filament sliding (GO:0030049), atrial cardiac muscle tissue morphogenesis (GO:0055009), and male sex determination (GO:0030238). When we required that a SNP be located within a gene in order to be associated with it (0 kb extension), we found no enriched GO terms, consistent with our permutation analysis. Since the GREAT analysis depended heavily on certain parameters, and our permutation procedure found no evidence of over-representation for any GO terms, there did not appear to be any particular functions associated with SNPs that were strongly differentiated between African populations (as measured by FST). This agrees with Figures 4.2 and 4.3 of Chapter 4 and Figures C.2 and C.4, which suggest that selection is not a primary driver of extreme allele frequency differentiation between African populations for the SNPs in our dataset.

C.3 Supplemental Figures and Tables

[Figure C.1 plot. Panels A and B: Enrichment (y-axis, 0.0 to 3.0) against |δ| (x-axis, 0.0 to 1.0), with Genic and Nongenic series. Panel C: Number of Population Pairs (y-axis) against |δ| (x-axis, 0.0 to 1.0), with series “Pairs with Defined Value” and “Pairs with Genic Enrichment (>1)”.]

Figure C.1: Enrichment of genic and non-genic SNPs for values of |δ| between pairs of populations. Enrichments of genic and nongenic SNPs are calculated within each |δ| value bin (width 0.1). (A) Enrichments calculated separately for each pair of populations among the 11 populations (55 pairs in total). Each line indicates a unique population pair. (B) Enrichments of genic and non-genic SNPs averaged over population pairs – i.e., an average of the lines in panel (A), for genic and non-genic enrichments separately. (C) Summary of panel (A). Black line indicates the number of population pairs with a defined value of |δ| for each bin (i.e., the number of population pairs where there exists at least one SNP in the given |δ| bin). Red line indicates the number of population pairs where there appears to be an enrichment of genic SNPs (enrichment value > 1) in the given |δ| bin.

[Figure C.2 plot. Four panels, each showing Enrichment (y-axis, 0.0 to 3.0) against δ (x-axis, −1.0 to 1.0) with Genic and Nongenic series: (A) Kenyan Bantu − Maasai, with proportions 0.631 and 0.87 printed above the extreme bins; (B) Khomani Bushmen − Yoruba, with proportions 0.855 and 0.855; (C) Mbuti − Namibian San, with proportion 0.953; (D) Hadza − Mbuti, with no proportion printed.]

Figure C.2: Enrichment of genic and non-genic SNPs for values of δ between selected pairs of populations. Enrichments of genic and nongenic SNPs are calculated within each δ value bin, as in Figure 4.2 and Chapter 4 (width 0.1). Bootstrap confidence intervals are indicated by dotted lines (see Chapter 4). SNPs where the derived allele is fixed in the first population and absent in the second are depicted on the right side of the plot (near 1); SNPs where the derived allele is fixed in the second population and absent in the first are depicted on the left side of the plot (near -1). The number above the most extreme δ bins indicates the proportion of bootstrap simulations where an enrichment value was defined for the given bin (see Results (Chapter 4)); if no value is listed, the proportion is 1. Values less than 0.95 can be considered insignificant. (A) Kenyan Bantu and Maasai. (B) ≠Khomani Bushmen and Yoruba. (C) Mbuti and Namibian San. (D) Hadza and Mbuti Pygmies.

[Figure C.3 plot. Single panel titled “Yoruba − Tuscan”: Enrichment (y-axis, 0.0 to 3.0) against δ (x-axis, −1.0 to 1.0), with Genic and Nongenic series.]

Figure C.3: Enrichment of genic and non-genic SNPs for values of δ between the HGDP Yoruba and HapMap Tuscans. Enrichments of genic and nongenic SNPs are calculated within each δ value bin, with bootstrap confidence intervals, as in Figures 4.2 and C.2 and Chapter 4. SNPs where the derived allele is fixed in the Yorubans and absent in the Tuscans are depicted on the right hand side of the plot (near 1); SNPs where the derived allele is fixed in the Tuscans and absent in the Yorubans are depicted on the left hand side of the plot (near -1).

[Figure C.4 plot. Three stacked panels, each plotted against Mean Fst (x-axis, approximately 0.05 to 0.15). Y-axes, top to bottom: Max Delta; Max Fst; 99.99% Fst.]

Figure C.4: Extreme values of SNP differentiation over all SNPs vs. mean FST for all pairs of populations. Each point is a population pair; there are 55 pairs in total. Line is the best-fit lowess curve through the points. Top panel Y-axis: maximum value over all SNPs of derived allele frequency difference (|δ|) between pairs of populations; Middle panel Y-axis: maximum value over all SNPs of FST between pairs of populations; Bottom panel Y-axis: upper 99.99% tail cutoff value over all SNPs of FST between pairs of populations.

| chrm | begin | end | Biaka | Mbuti | Hadza | Sandawe | Namib. San | Khomani | SA Bantu | Ken. Bantu | Maasai | Mandenka | Yoruba | Genes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chr18 | 40900000 | 41100000 | 9.13E-05 | NONE | NONE | 0.001771 | NONE | NONE | 0.02083722 | 0.02374101 | 8.01E-02 | 2.23E-02 | 0.005048 | SETBP1 |
| chr7 | 66500000 | 67000000 | 1.83E-04 | 0.085318 | NONE | NONE | NONE | NONE | 0.01748089 | 0.0323741 | NONE | 0.05097721 | 0.021826 | |
| chr3 | 29300000 | 30200000 | 0.000249 | NONE | NONE | NONE | 0.02847 | NONE | 0.08655025 | 0.03009592 | 0.03639515 | 0.02231551 | 0.015369 | RBMS3 |
| chr12 | 18700000 | 19000000 | 0.000274 | NONE | NONE | 0.03081 | NONE | NONE | NONE | NONE | NONE | 0.06336125 | NONE | PLCZ1,CAPZA3 |
| chr6 | 18100000 | 19200000 | 0.000365 | 0.003322 | NONE | NONE | NONE | NONE | NONE | 0.08921143 | NONE | 0.01774295 | NONE | MIR548A1,KDM1B,NHLRC1,TPMT... |
| chr5 | 76600000 | 76700000 | NONE | 9.49E-05 | NONE | 0.000197 | NONE | NONE | NONE | NONE | NONE | NONE | NONE | PDE8B |
| chr21 | 46400000 | 46800000 | 0.002054 | 1.90E-04 | NONE | NONE | NONE | 0.06726844 | NONE | 0.04548743 | 0.03841794 | NONE | NONE | C21orf56,LSS,MCM3AP,PCNT... |
| chr13 | 42800000 | 43200000 | NONE | 0.000285 | 0.019878 | NONE | NONE | NONE | NONE | NONE | NONE | NONE | 0.032331 | ENOX1 |
| chr20 | 15200000 | 15300000 | NONE | 0.000303 | NONE | NONE | NONE | NONE | 0.06117455 | NONE | NONE | 0.06230667 | 0.01694 | MACROD2 |
| chr4 | 16500000 | 16900000 | NONE | 0.00038 | NONE | NONE | 0.00548 | 0.09366779 | 0.07099571 | NONE | NONE | NONE | 0.02468 | LDB2 |
| chr11 | 17400000 | 19200000 | 0.021129 | 0.038124 | 9.51E-05 | NONE | 0.093875 | 0.07475279 | NONE | NONE | NONE | NONE | NONE | SAAL1,IGSF22,TSG101,SPTY2D18... |
| chr11 | 20000000 | 21200000 | NONE | 0.038124 | 1.90E-04 | 0.06646 | NONE | 0.03823823 | NONE | NONE | 0.03856563 | 0.07819687 | NONE | HTATIP2,NELL1,NAV2,PRMT3... |
| chr11 | 4300000 | 5100000 | NONE | 0.058094 | 0.000259 | NONE | 0.087104 | 0.06208203 | NONE | NONE | 0.0067659 | 0.02596111 | 0.011891 | OR51D1,OR51S1,OR52A4,OR52K1... |
| chr2 | 26800000 | 30100000 | 0.004838 | 0.028896 | 0.000285 | 0.018922 | 0.007785 | 0.00312982 | 0.0591299 | 0.02868339 | 0.05051014 | 0.02334997 | 0.062664 | PPP1CB,ZNF513,CCDC121,PLB1... |
| chr11 | 3700000 | 4200000 | 0.005888 | 0.078201 | 0.00038 | NONE | NONE | 0.02503856 | NONE | NONE | 0.03681959 | 0.07047645 | 0.099348 | NUP98,STIM1,RHOG,PGAP2,RRM1 |
| chr7 | 124000000 | 124700000 | 0.040031 | 0.022887 | NONE | 9.08E-05 | 0.07989 | 0.00629034 | NONE | 0.00185386 | NONE | 0.00180131 | 0.037901 | GPR37,POT1 |
| chr11 | 84500000 | 84800000 | NONE | 0.010113 | NONE | 1.82E-04 | NONE | NONE | NONE | 0.09926855 | 0.00018205 | 0.07047645 | 0.099705 | DLG2 |
| chr12 | 44900000 | 45700000 | 0.035924 | 5.70E-02 | 0.090735 | 0.000318 | 0.023484 | 0.02503856 | 0.00778482 | NONE | 0.01914599 | 0.00121589 | 0.041139 | SLC38A4,SLC38A2,SLC38A1 |
| chr5 | 42900000 | 43300000 | NONE | NONE | 0.04456 | 0.000318 | NONE | 0.01369863 | 0.03967972 | 0.04508882 | 0.00646277 | 0.02571377 | 0.002898 | LOC153684,MGC42105,ZNF131... |
| chr19 | 38600000 | 38700000 | 0.024603 | NONE | NONE | NONE | 9.65E-05 | 1.09E-03 | NONE | NONE | NONE | NONE | NONE | PEPD |
| chr14 | 83400000 | 83700000 | 0.033433 | NONE | 0.05954 | NONE | 1.93E-04 | 0.01632949 | NONE | 0.00076867 | NONE | 0.08661069 | 0.068647 | |
| chr6 | 21800000 | 22200000 | 0.015976 | NONE | NONE | NONE | 3.08E-04 | NONE | NONE | NONE | 0.0494897 | NONE | NONE | FLJ22536 |
| chr15 | 49800000 | 49900000 | NONE | NONE | NONE | NONE | 0.000338 | 0.08436905 | NONE | NONE | NONE | NONE | NONE | SCG3,TMOD2,LYSMD2 |
| chr4 | 170300000 | 170600000 | NONE | NONE | NONE | 0.002229 | 0.000338 | 0.01818924 | NONE | NONE | NONE | NONE | NONE | SH3RF1,NEK1 |
| chr6 | 27000000 | 27900000 | 0.026383 | 0.019598 | NONE | 5.29E-02 | 0.041185 | 9.07E-05 | 0.02024021 | 0.06588297 | NONE | 0.02265154 | 0.045599 | ZNF204P,FKSG83,C6orf41,PRSS16... |
| chr14 | 53500000 | 53700000 | NONE | NONE | NONE | NONE | NONE | 1.81E-04 | NONE | NONE | NONE | NONE | NONE | |
| chr3 | 172900000 | 173100000 | NONE | NONE | 0.099274 | NONE | NONE | 2.23E-04 | NONE | NONE | NONE | 0.06336125 | NONE | PLD1,TMEM212 |
| chr6 | 30100000 | 30800000 | 0.0875 | 0.023156 | NONE | 0.027947 | 0.003426 | 0.00027216 | 0.01236655 | NONE | 0.02398507 | 0.00551978 | 0.017949 | TRIM31,C6orf136,NCRNA00171... |
| chr2 | 69600000 | 69700000 | 0.053683 | NONE | NONE | 0.036936 | NONE | 0.00036288 | NONE | 0.09861639 | NONE | NONE | NONE | AAK1,SNORA36C |
| chr8 | 132700000 | 133800000 | NONE | NONE | 0.093114 | 0.013477 | 0.012014 | 0.03442801 | 9.32E-05 | NONE | NONE | NONE | 0.068647 | TMEM71,HPYR1,HHLA1,EFR3A... |
| chr11 | 42300000 | 43000000 | 0.018121 | NONE | 0.055783 | 0.004855 | 0.07989 | 0.01632949 | 1.86E-04 | 0.07505878 | 0.020071 | 0.02265154 | 0.028754 | |
| chr10 | 111900000 | 112200000 | NONE | 0.018608 | NONE | NONE | NONE | NONE | 2.80E-04 | NONE | NONE | NONE | NONE | SMNDC1,MXI1 |
| chr12 | 39000000 | 39500000 | NONE | NONE | NONE | 0.019681 | NONE | NONE | 0.00030637 | NONE | NONE | NONE | NONE | CNTN1,LRRK2 |
| chr3 | 60900000 | 61200000 | 0.021362 | NONE | 0.078039 | 0.047554 | NONE | 0.01369863 | 0.00037293 | NONE | NONE | 0.05097721 | 0.057553 | FHIT |
| chr5 | 162000000 | 162200000 | NONE | NONE | NONE | NONE | NONE | NONE | 0.01491702 | 9.04E-05 | 0.00373202 | NONE | 0.002898 | |
| chr14 | 77000000 | 78300000 | 4.84E-03 | NONE | 0.05954 | 0.019648 | 0.023883 | NONE | NONE | 2.26E-04 | 5.42E-02 | 3.73E-02 | 0.068647 | SNW1,ADCK1,C14orf156,AHSA1... |
| chr5 | 149900000 | 150900000 | 0.099069 | 0.078201 | NONE | NONE | NONE | 0.02844053 | NONE | 0.00022608 | NONE | 0.01332973 | NONE | TNIP1,GM2A,SLC36A1,NDST1... |
| chr6 | 165900000 | 167100000 | 0.02958 | 0.023298 | 0.004851 | NONE | NONE | NONE | NONE | 0.00023981 | 0.02740189 | 0.08274414 | 0.051604 | T,MIR1913,PDE10A,C6orf176,SFT2D1... |
| chr14 | 80100000 | 80700000 | 0.068357 | 0.030227 | 0.008893 | NONE | 0.016935 | 0.01369863 | NONE | 0.00036173 | NONE | NONE | 0.006619 | TSHR,C14orf145 |
| chr2 | 136000000 | 136900000 | NONE | 0.088974 | 0.064105 | 0.00186 | NONE | 0.00426381 | NONE | NONE | 9.10E-05 | NONE | NONE | ZRANB3,LCT,MCM6,CXCR4,R3HDM1,MIR128-1,UBXN4,DARS |
| chr2 | 134200000 | 135200000 | NONE | NONE | NONE | 0.023224 | NONE | 0.00490415 | 0.03538132 | NONE | 1.93E-04 | 0.04882899 | 0.032084 | TMEM163,LOC151162,MGAT5 |
| chr1 | 148800000 | 149200000 | 0.085155 | 0.088974 | NONE | NONE | 0.00548 | NONE | NONE | NONE | 0.00031859 | 0.09506378 | 0.001404 | ADAMTSL4,ARNT,HORMAD1... |
| chr4 | 60700000 | 61000000 | NONE | NONE | 0.056051 | NONE | 0.015134 | NONE | NONE | NONE | 0.000319 | NONE | 0.057553 | |
| chr22 | 34300000 | 35000000 | 0.007944 | NONE | 0.016001 | 5.99E-02 | NONE | NONE | NONE | NONE | 0.083512 | 9.01E-05 | 0.001268 | APOL6,APOL1,APOL4,APOL3... |
| chr7 | 150800000 | 151000000 | NONE | NONE | NONE | NONE | NONE | NONE | 0.046662 | NONE | 0.067398 | 2.21E-04 | 0.040947 | PRKAG2,RHEB |
| chr11 | 124800000 | 125200000 | 1.07E-02 | NONE | 0.099439 | 0.004855 | 0.079658 | 0.013699 | NONE | 0.098616 | 9.63E-02 | 2.25E-04 | 0.068647 | FEZ1,EI24,PKNOX2,PATE3,STT3A... |
| chr14 | 35600000 | 35800000 | NONE | NONE | 0.04456 | NONE | NONE | NONE | NONE | 0.015012 | 0.051156 | 2.25E-04 | 0.018113 | |
| chr9 | 11700000 | 12000000 | NONE | NONE | NONE | NONE | NONE | NONE | 0.01011 | 0.003118 | NONE | 0.000442 | 0.011891 | |
| chr22 | 29800000 | 30600000 | 0.085155 | 0.022887 | 0.023283 | NONE | NONE | 0.05355 | 0.008897 | NONE | 0.038418 | NONE | 9.06E-05 | LIMK2,C22orf30,INPP5J,RNF185... |
| chr4 | 34500000 | 34600000 | NONE | 0.001946 | 0.076555 | NONE | NONE | NONE | NONE | NONE | NONE | NONE | 0.000181 | |
| chr11 | 9800000 | 10300000 | 1.47E-02 | 0.002325 | 0.001037 | 0.040884 | NONE | 0.07033 | NONE | NONE | NONE | 3.42E-03 | 0.000224 | ADM,SBF2 |
| chr5 | 178400000 | 178500000 | NONE | 0.030227 | NONE | 0.090162 | NONE | NONE | NONE | 0.006466 | NONE | 0.001216 | 0.000272 | ZNF354C,ADAMTS2 |
| chr1 | 200600000 | 201100000 | NONE | 0.036585 | NONE | NONE | 0.001566 | 0.05355 | 0.035381 | 0.006466 | 0.004724 | 0.003197 | 0.000362 | SYT2,PPP1R12B,KDM5B,LOC148709 |

Figure C.5: P-values for the 5 most extreme 100 kb genomic windows according to the iHS statistic in each population. Red indicates p < 0.01, orange indicates p < 0.05, yellow indicates p < 0.10, and white indicates p > 0.10. “Genes” column lists the genes located within the indicated windows.

| chrm | begin | end | Biaka | Mbuti | Hadza | Sandawe | Namib. San | Khomani | SA Bantu | Ken. Bantu | Maasai | Mandenka | Yoruba | Genes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chr7 | 96400000 | 96600000 | 9.29E-05 | 0.002388 | 0.01208 | 0.000743 | 0.011514 | 0.03240182 | 0.002392681 | 0.001301962 | 2.78E-04 | 9.29E-05 | 9.29E-05 | DLX5,ACN9,DLX6AS,DLX6 |
| chr11 | 34200000 | 34400000 | 1.85E-04 | 0.002222 | 0.029063 | 0.002961 | 0.000185 | 0.002035529 | 0.040095751 | 0.047989624 | 0.010171999 | 0.076082932 | 0.0175893 | ABTB2 |
| chr16 | 26000000 | 26200000 | 0.000186 | 0.001395 | 0.027052 | 0.015874 | 0.004925 | 0.007613035 | 0.013489627 | 0.010973682 | 0.049081462 | 0.000464296 | 0.0010216 | HS3ST4 |
| chr13 | 1.08E+08 | 107800000 | 0.000279 | 0.002418 | 0.000837 | 0.007334 | 0.001208 | 0.016433015 | 0.002325798 | 0.007997768 | 0.017628503 | 0.005757266 | 0.0169035 | TNFSF13B |
| chr16 | 31300000 | 31400000 | 0.00028 | 0.010816 | 0.0899 | NONE | 0.022044 | 0.032623915 | 0.082899367 | 0.030648109 | 0.04994404 | 0.088932529 | 0.0766531 | ITGAD,COX6A2,ARMC5,ITGAX... |
| chr2 | 1.15E+08 | 115600000 | NONE | 9.30E-05 | 0.003533 | NONE | 0.064957 | 0.009469873 | NONE | NONE | NONE | NONE | 0.0204328 | DPP10 |
| chr11 | 72100000 | 72600000 | 0.000421 | 1.40E-04 | 0.087478 | 0.006734 | 0.030202 | 0.000464209 | 0.094892548 | 0.020645401 | 0.023659306 | 0.053208283 | 0.0302777 | FCHSD2,ARAP1,STARD10,ATG16L2 |
| chr17 | 50600000 | 50800000 | 0.010737 | 0.000185 | 0.005368 | 0.020174 | 0.009996 | 0.048667654 | 0.017794254 | 0.000926441 | 0.035509525 | 0.009255831 | 0.0292538 | HLF |
| chr8 | 1700000 | 1900000 | 0.028369 | 0.000186 | NONE | NONE | NONE | NONE | NONE | NONE | NONE | NONE | 0.0992908 | ARHGEF10,MIR596,CLN8 |
| chr20 | 37800000 | 37900000 | 0.028882 | 0.000279 | 4.49E-02 | 0.051801 | 0.00985 | 0.005013462 | 0.007256489 | NONE | 0.023102616 | 0.004642957 | 0.0017647 | |
| chr8 | 71000000 | 71200000 | 0.001207 | 0.003813 | 9.30E-05 | 0.000186 | 0.007155 | 0.002692415 | 0.005116755 | 3.81E-03 | 0.000185563 | 0.001857183 | NONE | PRDM14,NCOA2 |
| chr2 | 1.12E+08 | 112200000 | 0.022151 | 0.001545 | 1.40E-04 | 0.007716 | 0.001264 | 0.008541025 | 0.017311752 | 0.029101645 | 0.027700056 | 0.041660822 | 0.0308859 | LOC541471,BCL2L11 |
| chr8 | 22800000 | 22900000 | 0.003147 | NONE | 0.000185 | NONE | 0.019252 | NONE | 0.086746988 | 0.042871757 | 0.004993527 | 0.013836011 | 0.0020367 | PEBP4 |
| chr1 | 56900000 | 57300000 | 0.009287 | 0.05663 | 0.000186 | 0.020052 | 0.011012 | 0.003713676 | 0.004279468 | 0.035617967 | 0.007236964 | 0.006685858 | 0.0295347 | DAB1,C8B,C8A,C1orf168,PRKAA2 |
| chr4 | 1.04E+08 | 103800000 | 0.033086 | 0.000843 | 0.000281 | 4.77E-02 | 0.067537 | 0.039484738 | 0.070372977 | 0.027555181 | 0.009093453 | 0.017674288 | 0.0301839 | NFKB1,MANBA |
| chr11 | 1.22E+08 | 122000000 | 0.005479 | 0.03608 | 0.011713 | 9.28E-05 | 0.000558 | 0.000835577 | 0.000744255 | 0.011159676 | 9.28E-05 | 0.010214505 | 0.0027863 | |
| chr19 | 5000000 | 5500000 | 0.002243 | 0.015311 | 0.001023 | 1.40E-04 | 6.04E-03 | 0.007560907 | 0.000140746 | 0.002530578 | 0.000139899 | 0.000140272 | 0.0001404 | PTPRS,ZNRF4,KDM4B |
| chr10 | 91900000 | 92100000 | 0.005924 | 2.78E-03 | 0.018686 | 0.000185 | 0.016105 | NONE | 0.001297498 | 0.000185288 | 0.000184945 | 0.000185117 | 0.0001852 | |
| chr4 | 1.7E+08 | 170100000 | 0.01588 | 2.79E-02 | NONE | 0.000278 | NONE | 0.001114103 | 0.08912457 | 0.026125625 | 0.059565782 | 0.089237627 | 0.0149531 | PALLD |
| chr1 | 90000000 | 90400000 | 0.019038 | NONE | 5.31E-02 | 0.022837 | 9.29E-05 | 9.28E-04 | 0.003163085 | 5.68E-02 | 0.035257005 | 0.000371437 | 0.0020433 | ZNF326,GEMIN8P4,LRRC8D |
| chr4 | 1.12E+08 | 112800000 | 0.03399 | 0.005338 | 0.020087 | 0.035462 | 1.40E-04 | 0.000280034 | 0.013303563 | 0.082209616 | 0.017164595 | 0.083573219 | 0.0425383 | |
| chr12 | 14000000 | 14200000 | 0.054978 | 0.025665 | 0.020824 | 0.014575 | 1.86E-04 | 9.28E-05 | 0.024653456 | 0.002045941 | 0.012896641 | 0.002414337 | 0.0088233 | GRIN2B |
| chr5 | 1.8E+08 | 180100000 | 0.005187 | 0.047479 | 0.000372 | 0.000281 | 0.000281 | 0.006860823 | NONE | 0.020525798 | 0.005176273 | 0.002805443 | 0.0008423 | FLT4,OR2Y1,SCGB3A1,CNOT6 |
| chr10 | 1.04E+08 | 104300000 | 0.011916 | 0.043405 | NONE | 0.022727 | 0.001685 | 0.000140017 | 0.011963406 | 0.013355827 | 0.023642977 | 0.015991023 | 0.0056156 | MIR146B,ELOVL3,NFKB2,C10orf95... |
| chr6 | 1.51E+08 | 151300000 | 0.019807 | 0.007593 | 0.009441 | 5.74E-03 | 0.001111 | 0.000185048 | 0.000185357 | 0.007967389 | 0.048085815 | 0.007404665 | 0.0029624 | MTHFD1L,PLEKHG1 |
| chr1 | 2.12E+08 | 211800000 | 0.021452 | 0.01339 | 0.029097 | 0.0013 | 0.00158 | 1.86E-04 | 0.000279096 | 0.01515856 | 0.00463908 | 0.010493082 | 0.0064085 | |
| chr5 | 1.11E+08 | 110700000 | 0.01328 | 0.002697 | 0.006321 | 0.051987 | 0.036892 | 2.79E-04 | 0.012094148 | NONE | 0.099461867 | 0.041508032 | 0.0874896 | CAMK4 |
| chr2 | 5400000 | 5700000 | 0.005015 | 0.010973 | 0.018128 | 0.000928 | 0.01998 | 0.008077244 | 9.30E-05 | 0.023900307 | 0.001484505 | 0.004085802 | 0.0098449 | |
| chr3 | 1.94E+08 | 194300000 | 0.028325 | 0.062581 | NONE | 0.003342 | NONE | 0.056726395 | 0.000186064 | 0.005765833 | 0.02857673 | 0.000185718 | 0.0042723 | |
| chr21 | 43100000 | 44100000 | 0.021103 | 0.039427 | 0.015804 | 0.00687 | 0.024719 | 0.028409618 | 1.01E-02 | 9.30E-05 | 0.007051401 | 0.001578605 | 0.0108665 | CRYAA,LOC284837,RRP1,PKNOX1... |
| chr7 | 1.55E+08 | 155000000 | 0.026472 | 0.025556 | 0.001124 | 0.012065 | 0.062482 | 0.009801176 | 1.74E-02 | 0.000140588 | 0.064353665 | 0.01781456 | 0.0463288 | EN2,CNPY1 |
| chr13 | 58600000 | 58700000 | 0.015695 | 0.013111 | 0.026959 | 0.0259 | 0.001487 | 0.035465602 | 3.72E-04 | 0.000185995 | 0.001206161 | 0.001485746 | 0.0059441 | |
| chr16 | 72900000 | 73300000 | 0.015788 | 0.008183 | 0.005764 | 0.008912 | NONE | 0.055705134 | 0.016838776 | 0.000278992 | 0.00176285 | 0.005571548 | 0.0062227 | MLKL,LOC283922,CLEC18B,RFWD3... |
| chr17 | 21100000 | 21300000 | NONE | NONE | 0.030482 | 0.001403 | 0.014311 | 0.009941193 | 0.010978184 | 0.007310558 | 0.004196978 | 0.000280544 | 0.0116524 | MAP2K3,KCNJ12 |
| chr2 | 59000000 | 59200000 | 0.012166 | 0.011159 | NONE | 0.004734 | 0.013884 | NONE | 0.000930319 | 0.001766949 | 0.022824272 | 0.001207169 | 0.0001858 | |
| chr22 | 48800000 | 49100000 | 2.01E-02 | 0.029477 | 0.022869 | 0.028407 | 0.021342 | 0.022746263 | 0.004000372 | 3.67E-02 | 3.18E-02 | 6.50E-03 | 0.0002786 | MAPK12,MLC1,TUBGCP6,PLXNB2... |
| chr1 | 7200000 | 7400000 | 0.001486 | NONE | 0.018221 | 0.005941 | 0.002695 | 0.038119911 | 0.002232766 | 0.052450479 | 2.56E-02 | 0.011607392 | 0.0005573 | CAMTA1 |

Figure C.6: P-values for the 5 most extreme 100 kb genomic windows according to the XP-EHH CEU statistic in each population. Red indicates p < 0.01, orange indicates p < 0.05, yellow indicates p < 0.10, and white indicates p > 0.10. “Genes” column lists the genes located within the indicated windows.

chrm begin end Biaka Mbuti Hadza Sandawe Namib. San Khomani SA Bantu Ken. Bantu Maasai Mandenka Genes chr9 2700000 2900000 9.25E-05 0.026732 NONE NONE NONE 0.019752631 NONE 0.098278734 NONE 1.85E-02 KCNV2,KIAA0020 chr16 31300000 31500000 1.39E-04 0.003053 NONE NONE 0.079429 0.053863196 NONE NONE NONE NONE C16orf58,COX6A2,ARMC5,ZNF843... chr5 1.47E+08 147200000 0.000185 NONE NONE NONE 0.027132 0.001292228 NONE 0.008677991 NONE NONE SPINK1,JAKMIP2 chr2 81800000 82100000 0.000185 NONE NONE NONE NONE NONE 0.085870269 NONE NONE NONE chr6 1.43E+08 143600000 0.000277 NONE 0.036626 NONE NONE NONE NONE NONE NONE NONE AIG1,HIVEP2 chr13 40800000 41100000 NONE 9.25E-05 NONE NONE 0.032853 0.024923844 NONE 0.007642073 NONE NONE OR7E37P,KIAA0564,NAA16,C13orf15 chr11 71600000 72000000 0.011663 1.39E-04 0.083241 0.050457 NONE NONE 0.099335303 0.031263026 0.03986711 NONE CLPB,PHOX2A,FOLR2,INPPL1... chr2 6300000 6500000 0.0048 0.000185 NONE 0.087872 NONE 0.089532952 NONE 0.005539143 NONE NONE chr4 66700000 66900000 0.091271 0.000185 NONE 0.097678 NONE 0.023659889 NONE NONE 0.071230342 0.038411699 chr12 16400000 16700000 0.002682 0.000277 NONE NONE 0.003976 0.012107209 NONE 4.00E-02 NONE 0.050721955 MGST1,LMO3 chr15 35300000 35600000 NONE NONE 9.25E-05 NONE NONE NONE NONE NONE NONE NONE chr4 63500000 64900000 NONE NONE 1.39E-04 0.011505 NONE NONE 0.047671994 NONE 0.001386194 0.06417558 TECRL chr14 64800000 67100000 0.00358 NONE 0.000185 0.027842 0.056109 NONE 0.012508687 0.080311241 0.047685057 0.084210526 PLEK2,MPP5,FAM71D,ATP6V1D... 
chr8 71000000 71200000 0.047346 NONE 0.000185 0.007585 NONE 0.081977819 0.013139632 NONE 0.010638298 NONE NCOA2,PRDM14 chr4 58500000 58700000 NONE NONE 0.000278 NONE 0.03088 NONE 0.08235403 0.002035906 0.016281221 NONE chr4 36400000 36600000 NONE NONE 0.082439 9.25E-05 NONE NONE NONE NONE NONE NONE chr5 24200000 24500000 NONE NONE NONE 1.39E-04 NONE NONE 0.032800556 NONE NONE NONE chr11 1.34E+08 134100000 NONE NONE NONE 0.000185 NONE NONE NONE 0.05465288 NONE NONE chr22 21900000 22000000 0.032643 NONE 0.038212 0.000185 0.097078 0.000369686 NONE 0.031464001 0.046623497 NONE FBXW4P1,BCR chr21 30900000 31200000 NONE 0.033854 NONE 0.000277 NONE NONE NONE NONE NONE NONE KRTAP7-1,KRTAP6-1,KRTAP21-3... chr1 9900000 10300000 0.016245 0.028171 NONE 0.060715 9.25E-05 0.098172251 NONE NONE NONE NONE LZIC,KIF1B,NMNAT1,UBE4B,RBP7 chr1 40300000 40700000 0.09664 0.046628 NONE 0.051151 1.39E-04 0.008122577 0.042842602 NONE NONE NONE ZNF643,PPT1,ZMPSTE24,TMCO2... chr4 67100000 67300000 0.001805 0.012212 NONE NONE 0.000185 0.005545287 2.21E-02 NONE NONE NONE chr8 75400000 75900000 NONE 0.019706 NONE 0.016187 0.000277 0.06155268 NONE NONE NONE NONE PI15,MIR2052,GDAP1 chr18 54200000 54500000 0.090825 NONE NONE 8.66E-02 0.000369 0.00276906 NONE NONE NONE 0.011106997 ALPK2,NEDD4L,MALT1,MIR122 chr22 40900000 41100000 0.063806 0.025437 0.0854 0.045139 0.02034 9.24E-05 0.047394024 NONE NONE NONE TCF20 chr4 85700000 86100000 NONE 0.05093 NONE NONE 0.00499 1.38E-04 0.025434329 NONE NONE NONE CDS1,WDFY3 chr9 3000000 3200000 0.070334 0.053535 NONE NONE 0.065338 0.000184604 0.009231905 NONE NONE 0.048384118 chr12 26000000 26100000 NONE NONE NONE NONE 0.006287 0.000184843 0.057092625 NONE NONE 0.089689004 RASSF8 chr6 56500000 57400000 NONE 0.026228 NONE NONE 0.020932 0.000276932 6.54E-02 0.067832686 NONE NONE RAB23,ZNF451,BEND6,PRIM2,DST... 
chr17 11500000 12400000 NONE NONE 0.046632 0.014971 NONE NONE 9.25E-05 NONE 0.014338575 0.030914476 MAP2K4,DNAH9,MIR744,ZNF18 chr18 18100000 18300000 0.032306 0.050397 0.095485 NONE 0.039129 NONE 1.85E-04 NONE NONE NONE CTAGE1 chr22 15800000 15900000 0.015813 0.014152 NONE NONE 0.033931 0.004528651 0.000185065 NONE NONE NONE CECR7,GAB4 chr17 64400000 64500000 NONE 0.057072 NONE NONE NONE NONE 0.000277598 NONE NONE NONE ABCA8,ABCA9 chr2 2.27E+08 227700000 NONE NONE NONE NONE NONE NONE 0.000277971 NONE NONE 0.043200445 RHBDD1,COL4A4,IRS1 chr22 16700000 16900000 NONE NONE NONE 0.045417 NONE 0.028059812 NONE 9.25E-05 NONE NONE FLJ41941,MICAL3 chr14 70800000 71300000 NONE NONE NONE NONE NONE NONE NONE 0.000138947 NONE NONE SIPA1L1,SNORD56B,LOC145474 chr5 39100000 39300000 NONE 0.062621 NONE 0.015138 NONE 0.09805915 NONE 0.000184638 0.004429679 NONE FYB,RICTOR chr1 1.86E+08 187000000 NONE NONE 0.030255 0.051289 NONE NONE NONE 0.000185082 0.003515264 0.071676622 chr7 1.56E+08 156700000 NONE NONE NONE 0.082879 NONE 0.055914972 NONE 0.000277624 0.012580944 0.01375191 UBE3C,MNX1,NOM1 chr2 1.35E+08 137700000 0.011525 0.009437 5.55E-04 0.000416 0.011095 0.004344 0.068937 1.25E-02 9.25E-05 NONE ZRANB3,ACMSD,LCT,RAB3GAP1... chr5 1.16E+08 116000000 0.053351 0.039535 NONE 0.022337 NONE NONE NONE NONE 0.000185 NONE SEMA6A chr2 38600000 39100000 NONE NONE NONE 0.023495 NONE NONE 0.000555 NONE 0.000185 0.048223 HNRPLL,DHX57,SOS1,SFRS7... chr2 1.97E+08 197000000 NONE NONE 0.001203 4.28E-02 0.036243 0.000277 0.066346 NONE 0.000463 NONE STK17B,HECW2 chr2 1.34E+08 134900000 NONE 0.057596 0.026588 0.004061 NONE NONE 0.034158 NONE 0.000554 NONE MGAT5 chr6 32300000 33100000 1.40E-02 NONE NONE NONE 0.002326 0.072833 NONE 0.011255 NONE 9.26E-05 HLA-DRB5,PSMB8,HLA-DQB1... chr2 1.55E+08 156200000 NONE NONE NONE NONE 0.03715 0.009415 0.062223 NONE NONE 1.39E-04 KCNJ3 chr7 1.51E+08 151000000 NONE NONE NONE NONE NONE NONE NONE NONE NONE 0.000185 NUB1,CRYGN,WDR86,RHEB... 
chr7 49200000 49600000 NONE NONE NONE NONE NONE NONE NONE 0.005367 NONE 0.000278 chr15 42800000 43100000 0.089883 0.065501 0.064027 NONE NONE 0.066081 0.082261 0.065444 0.067785 0.000278 TRIM69,C15orf43

Figure C.7: P-values for the 5 most extreme 100 kb genomic windows according to the XP-EHH YRI statistic in each population. Red indicates p < 0.01, orange indicates p < 0.05, yellow indicates p < 0.10, and white indicates p > 0.10. “Genes” column lists the genes located within the indicated windows. 166 APPENDIX C. SUPPLEMENT: SELECTIVE SWEEPS AND AFRICAN POPULATIONS

San- Namib. SA Kenyan Man- chrm begin end Biaka Mbuti Hadza dawe San. Bantu Bantu Maasai denka Yoruba Genes chr9 7E+07 7.3E+07 ###### NONE NONE NONE NONE NONE NONE NONE NONE NONE chr5 1E+08 1.3E+08 ###### 0.0057 NONE NONE NONE 0.0872 0.0128 0.0411 0.0572 NONE SEC24A,SAR1B,PITX1... chr3 3E+07 3E+07 0.0002 NONE NONE 0.0253 NONE 0.0002 0.0057 0.0004 0.0028 0.0096 RBMS3 chr14 6E+07 5.6E+07 0.0002 0.0364 0.027 NONE NONE 0.0422 NONE 0.0233 NONE 0.0882 PELI2 chr9 3E+07 3.1E+07 0.0003 0.0243 0.0127 0.0443 NONE NONE 0.0098 0.0413 0.0163 0.007 chr20 5E+07 4.9E+07 0.0039 ###### NONE NONE NONE NONE NONE NONE 0.0584 0.0435 PTPN1,MIR645,BCAS4... chr2 2E+07 2.5E+07 0.0632 ###### 0.0049 0.0401 NONE 0.0006 0.0086 0.0173 0.0101 0.0064 C2orf84,UBXN2A... chr13 4E+07 4.1E+07 0.0349 0.0002 NONE NONE NONE 0.0617 0.0013 0.0328 0.0047 0.089 OR7E37P,KIAA0564... chr2 2E+07 2.1E+07 NONE 0.0004 0.0343 0.0592 0.0681 NONE NONE NONE NONE NONE RHOB,GDF7,HS1BP3... chr4 7E+07 6.7E+07 NONE 0.0004 NONE 0.0672 NONE NONE NONE NONE NONE 0.0597 chr15 4E+07 3.6E+07 NONE NONE ###### 0.048 NONE NONE NONE NONE 0.0934 NONE chr14 6E+07 6.7E+07 0.0233 NONE ###### 0.0207 0.0939 0.0019 0.0284 0.011 0.052 0.0354 PLEK2,MPP5,FAM71D... chr4 6E+07 6.5E+07 NONE 0.0939 0.0003 0.0137 0.0185 NONE NONE 0.0028 NONE NONE TECRL chr3 2E+08 1.5E+08 0.002 NONE 0.0004 0.045 0.07 0.0323 NONE NONE NONE 0.0284 C3orf79 chr5 2E+07 2.1E+07 0.0675 NONE 0.0004 ###### NONE 0.0617 NONE 0.0592 0.0595 NONE chr3 1E+08 1.2E+08 0.0045 NONE 0.0195 ###### 0.0601 NONE 0.0535 0.0399 NONE NONE ZBTB20 chr7 1E+08 1.3E+08 NONE NONE 0.0092 ###### NONE NONE 0.005 0.0007 0.0017 0.0015 CPA4,CPA2,CPA5 chr19 4E+07 4.4E+07 NONE NONE 0.0605 0.0002 NONE 0.0679 0.0005 0.0014 0.0002 0.0003 PPP1R14A,SPRED3... chr1 2E+06 2200000 0.0026 ###### 0.0097 0.0003 NONE 0.0046 0.0909 0.0041 0.0129 0.0522 SKI,C1orf86,PRKCZ... 
chr8 6E+07 5.8E+07 NONE NONE NONE 0.0003 NONE NONE 0.0006 ###### 0.0395 0.0405 chr2 2E+08 1.7E+08 NONE 0.0055 NONE 0.083 ###### NONE NONE NONE 0.0818 0.056 SP5,GAD1,LOC440925... chr2 2E+08 1.9E+08 NONE NONE NONE 0.0874 ###### NONE NONE 0.0581 NONE 0.0733 ZC3H15 chr10 1E+08 1.2E+08 NONE NONE 0.0749 NONE 0.0002 NONE NONE NONE 0.0658 NONE chr12 2E+07 2E+07 0.0399 NONE NONE NONE 0.0004 NONE NONE NONE NONE NONE chr2 1E+07 1.5E+07 NONE NONE NONE ###### 0.0004 NONE NONE NONE NONE NONE chr2 2E+07 2.5E+07 0.0673 NONE 0.018 0.0543 NONE ###### 0.0722 0.0382 0.0165 0.0008 NCOA1,CENPO,C2orf79... chr3 1E+08 1.4E+08 NONE NONE 0.0843 NONE NONE ###### 0.0426 NONE NONE 0.041 IL20RB chr1 1E+07 1.5E+07 0.0784 NONE NONE NONE NONE 0.0004 NONE 0.0801 NONE NONE chr17 1E+07 1.2E+07 NONE NONE NONE 0.0338 NONE 0.0004 NONE 0.0245 0.0816 NONE MAP2K4,DNAH9,MIR744... chr11 1E+08 1.1E+08 NONE 0.0068 0.0365 0.0236 0.0894 0.0714 ###### 0.0005 0.0005 0.0074 C11orf92,POU2AF1... chr7 1E+08 1.1E+08 0.0145 0.0517 NONE 0.0522 0.0444 0.0657 ###### 0.0407 0.0811 0.0097 SLC26A3,GPR22,COG5... chr14 7E+07 7.1E+07 NONE NONE NONE NONE 0.0354 0.0901 ###### NONE NONE NONE SIPA1L1,SNORD56B... chr4 3E+07 2.6E+07 0.0485 0.0697 0.0116 0.0009 NONE NONE 0.0003 0.0003 0.0063 0.0073 chr1 1E+08 1.2E+08 0.0557 NONE 0.0227 NONE NONE NONE 0.0004 0.0011 NONE 0.0061 FAM46C chr2 1E+08 1.4E+08 NONE 0.0509 0.0158 0.0432 0.003 NONE NONE ###### NONE NONE ZRANB3,ACMSD,LCT,DARS... chr5 1E+08 1.2E+08 ###### 0.0804 NONE 0.01 NONE NONE 0.0428 ###### ###### NONE SEMA6A chr1 2E+08 2.2E+08 0.0005 0.02 0.0029 0.0295 NONE 0.0063 0.0008 0.0002 ###### 0.0108 chr16 2E+07 2.3E+07 0.0607 NONE 0.01 0.0315 NONE 0.0009 0.0144 0.0814 ###### 0.0004 HS3ST2,USP31 chr19 5E+07 4.8E+07 0.0131 NONE NONE 0.0384 NONE 0.0164 0.023 0.0696 ###### 0.0037 GSK3A,ERF,CEACAM5,CIC... chr4 0 300000 0.0247 0.0967 NONE 0.005 NONE NONE 0.0011 0.0123 ###### 0.0019 ZNF876P,ZNF732,ZNF718... 
chr11 1E+07 1E+07 NONE 0.092 ###### 0.0481 NONE 0.0024 0.0216 ###### ###### 0.0001 SBF2,ADM chr17 3E+06 3700000 0.011 0.0202 0.0597 0.0012 NONE 0.0016 0.0004 ###### 0.0007 ###### TRPV1,TAX1BP3,CTNS... chr12 8E+07 7.9E+07 0.0081 NONE 0.0656 0.0075 0.0817 0.0438 0.0076 0.0015 0.0072 ###### PPP1R12A,MIR1252,SYT1... chr19 4E+07 4.3E+07 0.0054 0.0992 NONE ###### NONE 0.0526 0.0032 0.0058 0.0754 0.0026 ZNF420,ZNF570,ZFP30... chr1 2E+08 1.6E+08 ###### NONE 0.0699 0.011 NONE 0.0083 0.0149 0.008 ###### ###### OR10J5,DARC,FCER1A...

Figure C.8: P-values for the 5 most extreme 100 kb genomic windows according to the XP-EHH KHB statistic in each population. Red indicates p < 0.01, orange indicates p < 0.05, yellow indicates p < 0.10, and white indicates p > 0.10. “Genes” column lists the genes located within the indicated windows.

[Figure C.9: five scatterplot panels, one per haplotype statistic (iHS, XP-EHH CEU, XP-EHH YRI, XP-EHH MKK, XP-EHH KHB). x-axis: Mean Fst; y-axis: # Overlapping Windows (0.1%). Key to population labels: BPY, Biaka Pygmy; MPY, Mbuti Pygmy; HDZ, Hadza; SWE, Sandawe; SNB, Namibian San; KHB, Khomani Bushmen; BSA, Bantu South Africa; BKN, Bantu Kenya; MKK, Maasai; MDK, Mandenka; YOR, Yoruba.]

Figure C.9: Number of shared 100 kb genomic windows in the top empirical 0.1% vs. mean FST for pairs of populations. Each point is a different population pair (some points are labeled, see key). Titles indicate the haplotype statistics for which the shared windows occur. Line is a best-fit lowess curve through the points. Significance of the p-value for the Mantel correlation between the x and y variables is indicated in the upper right corner: “NS”: not significant, “*”: < 0.05, “**”: < 0.01, “***”: < 0.001.
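
As an illustration of the quantity on the y-axis of Figure C.9, the following minimal Python sketch counts the windows shared between the top empirical tails of two populations. The function names, the dictionary layout keyed by (chromosome, start), and the convention that larger values are more extreme are assumptions made for this example; they are not taken from the analysis code used in this thesis.

```python
def top_tail_windows(scores, tail=0.001):
    """Return the set of window IDs in the top `tail` fraction of `scores`.

    `scores` maps a window ID, here (chromosome, start), to a window statistic.
    Assumption: a larger statistic means a more extreme window.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_top = max(1, int(len(ranked) * tail))  # keep at least one window
    return set(ranked[:n_top])


def shared_windows(scores_a, scores_b, tail=0.001):
    """Windows that fall in the top empirical tail of both populations."""
    return top_tail_windows(scores_a, tail) & top_tail_windows(scores_b, tail)


# Toy data: four windows per population (hypothetical values).
pop_a = {("chr1", 0): 3.2, ("chr1", 100000): 0.1,
         ("chr2", 0): 2.9, ("chr2", 100000): 0.4}
pop_b = {("chr1", 0): 2.7, ("chr1", 100000): 0.3,
         ("chr2", 0): 0.2, ("chr2", 100000): 3.1}

# A tail of 0.5 is used only because the toy data set is tiny.
shared = shared_windows(pop_a, pop_b, tail=0.5)
print(len(shared), sorted(shared))
```

With genome-wide data the same call would use tail=0.001 for the top 0.1% or tail=0.01 for the top 1%.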

[Figure C.10: five scatterplot panels, one per haplotype statistic (iHS, XP-EHH CEU, XP-EHH YRI, XP-EHH MKK, XP-EHH KHB). x-axis: Correlation of Window Statistics; y-axis: # Overlapping Windows (1%). Key to population labels: BPY, Biaka Pygmy; MPY, Mbuti Pygmy; HDZ, Hadza; SWE, Sandawe; SNB, Namibian San; KHB, Khomani Bushmen; BSA, Bantu South Africa; BKN, Bantu Kenya; MKK, Maasai; MDK, Mandenka; YOR, Yoruba.]

Figure C.10: Number of shared 100 kb genomic windows in the top empirical 1% vs. correlations of window statistics over all genomic windows for pairs of populations. Each point is a different population pair. Titles indicate the haplotype statistics for which the shared windows occur. Line is a best-fit lowess curve through the points. Significance of the p-value for the Mantel correlation between the x and y variables is indicated in the upper left corner: “NS”: not significant, “*”: < 0.05, “**”: < 0.01, “***”: < 0.001.

[Figure C.11: five scatterplot panels, one per haplotype statistic (iHS, XP-EHH CEU, XP-EHH YRI, XP-EHH MKK, XP-EHH KHB). x-axis: Average Bantu Anc; y-axis: Correlation of Window Statistics. Key to population labels: BPY, Biaka Pygmy; MPY, Mbuti Pygmy; HDZ, Hadza; SWE, Sandawe; KHB, Khomani Bushmen; BSA, Bantu South Africa; BKN, Bantu Kenya; MKK, Maasai; MDK, Mandenka; YOR, Yoruba.]

Figure C.11: Correlations of window statistics over all 100 kb genomic windows vs. average Bantu ancestry for pairs of populations. Average proportion of Bantu ancestry is inferred from Henn et al. [103], see Methods (Chapter 4). Each point is a different population pair. Titles indicate the haplotype statistics for which the correlations are calculated. Line is a best-fit lowess curve through the points. Significance of the p-value for the Mantel correlation between the x and y variables is indicated in the upper right or left corner: “NS”: not significant, “*”: < 0.05, “**”: < 0.01, “***”: < 0.001. Note that the Namibian San are not included, as Bantu ancestral proportions for all individuals are not available from Henn et al. [103].

[Figure C.12: diagram of the demographic model. Labels in the diagram: FST = 0.15; FST = 0.01, 0.05, or 0.10; Reference Population; Population 1; Population 2.]

Figure C.12: Demographic model simulated using msms. FST values determine the divergence times in the model; see Chapter 4.

[Figure C.13: four panels, one per empirical tail (0.05, 0.01, 0.005, 0.001). x-axis: Fst; y-axis: # Overlapping 100kb Windows.]

Figure C.13: Number of overlapping 100 kb windows in the empirical tails of XP-EHH for pairs of populations simulated under neutrality. Means +/- standard errors over 100 simulations are shown for each value of FST simulated between Populations 1 and 2 (see Figure C.12). The empirical tail examined in Populations 1 and 2 is shown above each plot.
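
The summary plotted in Figure C.13 (a mean with its standard error across replicate simulations) can be sketched as follows. This is a minimal illustration only: the function name and the toy counts are hypothetical, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of replicates.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Mean and standard error of per-simulation counts.

    Assumption: standard error = sample standard deviation / sqrt(n).
    """
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se


# Hypothetical overlap counts from four replicate simulations.
overlap_counts = [2, 4, 4, 6]
mean, se = mean_and_standard_error(overlap_counts)
print(mean, round(se, 4))
```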

[Figure C.14: four panels, one per empirical tail (0.05, 0.01, 0.005, 0.001). x-axis: Fst; y-axis: # Overlapping 100kb Windows.]

Figure C.14: Number of overlapping 100 kb windows in the empirical tails of iHS for pairs of populations simulated under neutrality. Means +/- standard errors over 100 simulations are shown for each value of FST simulated between Populations 1 and 2 (see Figure C.12). The empirical tail examined in Populations 1 and 2 is shown above each plot.

[Figure C.15: four panels in two columns, one column per empirical tail (0.01, 0.001). x-axis: Fst. Top-row y-axis: # Overlapping 100kb Windows; bottom-row y-axis: Proportion in Tails Selection Windows.]

Figure C.15: Number of overlapping 100 kb windows and true positive rate in the empirical tails of XP-EHH for pairs of populations simulated under recent selection. Strength of selection s = 20%, time of selection equal to 62 generations ago. The empirical tail examined is shown above each column, and the x-axes indicate the value of FST simulated between Populations 1 and 2 (see Figure C.12). Top row y-axis: number of windows overlapping between the empirical tails of Populations 1 and 2 (means +/- standard errors over 100 simulations). Bottom row y-axis: as an estimate of true positive rate, the proportion of 100 kb windows in the empirical tails of Population 1 that fell within one of the 5 Mb regions in which a selective sweep was simulated (+/- standard deviation over 100 simulations).
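
The true positive rate estimate described in this caption can be illustrated with a short Python sketch. The function name, the coordinate tuples, and the rule that a window counts only when it lies entirely inside a simulated sweep region are assumptions made for the example; the caption itself only states that windows "fell within" a 5 Mb region.

```python
def proportion_in_sweep_regions(tail_windows, sweep_regions, window_size=100_000):
    """Proportion of tail windows lying inside any simulated sweep region.

    tail_windows:  list of (chromosome, start) for windows in the empirical tail.
    sweep_regions: list of (chromosome, start, end) for regions with a sweep.
    Assumption: a window counts only if it lies entirely within a region.
    """
    if not tail_windows:
        return float("nan")

    def inside(chrom, start):
        end = start + window_size
        return any(c == chrom and s <= start and end <= e
                   for c, s, e in sweep_regions)

    hits = sum(inside(chrom, start) for chrom, start in tail_windows)
    return hits / len(tail_windows)


# Hypothetical example: one 5 Mb sweep region and four tail windows.
sweeps = [("chr1", 10_000_000, 15_000_000)]
tail = [("chr1", 12_000_000), ("chr1", 14_900_000),
        ("chr1", 20_000_000), ("chr2", 12_000_000)]
rate = proportion_in_sweep_regions(tail, sweeps)
print(rate)
```

Averaging this proportion over replicate simulations, together with its standard deviation, would give the kind of quantity shown in the bottom row of the figure.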

[Figure C.16: four panels in two columns, one column per empirical tail (0.01, 0.001). x-axis: Fst. Top-row y-axis: # Overlapping 100kb Windows; bottom-row y-axis: Proportion in Tails Selection Windows.]

Figure C.16: Number of overlapping 100 kb windows and true positive rate in the empirical tails of iHS for pairs of populations simulated under recent selection. Strength of selection s = 20%, time of selection equal to 62 generations ago. The empirical tail examined is shown above each column, and the x-axes indicate the value of FST simulated between Populations 1 and 2 (see Figure C.12). Top row y-axis: number of overlapping 100 kb windows; bottom row y-axis: true positive rate. See caption for Figure C.15.

[Figure C.17: four panels in two columns, one column per empirical tail (0.01, 0.001). x-axis: Fst. Top-row y-axis: # Overlapping 100kb Windows; bottom-row y-axis: Proportion in Tails Selection Windows.]

Figure C.17: Number of overlapping 100 kb windows and true positive rate in the empirical tails of XP-EHH for pairs of populations simulated under less recent selection. Strength of selection s = 20%, time of selection equal to 144 generations ago. The empirical tail examined is shown above each column, and the x-axes indicate the value of FST simulated between Populations 1 and 2 (see Figure C.12). Top row y-axis: number of overlapping 100 kb windows; bottom row y-axis: true positive rate. See caption for Figure C.15.

[Figure C.18: four panels in two columns, one column per empirical tail (0.01, 0.001). x-axis: Fst. Top-row y-axis: # Overlapping 100kb Windows; bottom-row y-axis: Proportion in Tails Selection Windows.]

Figure C.18: Number of overlapping 100 kb windows and true positive rate in the empirical tails of iHS for pairs of populations simulated under less recent selection. Strength of selection s = 20%, time of selection equal to 144 generations ago. The empirical tail examined is shown above each column, and the x-axes indicate the value of FST simulated between Populations 1 and 2 (see Figure C.12). Top row y-axis: number of overlapping 100 kb windows; bottom row y-axis: true positive rate. See caption for Figure C.15.

Table C.1: Populations analyzed.

Population        Source                        Phased?  FST  iHS  XP-EHH CEU  XP-EHH YRI  XP-EHH MKK  XP-EHH KHB
Kenyan Bantu      HGDP                          Yes      X    X    X           X           X           X
S. African Bantu  HGDP                          Yes      X    X    X           X           X           X
Biaka             HGDP                          Yes      X    X    X           X           X           X
Hadza             Henn et al. [103]             Yes      X    X    X           X           X           X
Maasai            HapMap                        Yes      X    X    X           X           REF         X
Mandenka          HGDP                          Yes      X    X    X           X           X           X
Mbuti             HGDP                          Yes      X    X    X           X           X           X
Namibian Bushmen  HGDP & Schuster et al. [242]  Yes      X    X    X           X           X           X
≠Khomani Bushmen  Henn et al. [103]             Yes      X    X    X           X           X           REF
Sandawe           Henn et al. [103]             Yes      X    X    X           X           X           X
Yoruba            HGDP                          Yes      X    X    X           REF (test)  X           X
YRI               HapMap                        Yes      -    X    X           REF         X           X
CEU               HapMap                        Yes      -    -    REF         -           -           -
Tuscans           HapMap                        No       X    -    -           -           -           -
All Europeans     HGDP                          Yes      -    -    REF (test)  -           -           -

“Phased?” column indicates whether phased data were used (for haplotype analyses). An “X” indicates whether the population was used in the analyses of FST (and allele frequency differentiation), iHS, XP-EHH CEU, XP-EHH YRI, XP-EHH MKK, or XP-EHH KHB. “REF” indicates that population was used as a reference population. HGDP Europeans and HGDP Yorubans were only used as test reference populations for XP-EHH (see Supplemental Methods (Section C.1)).

Table C.2: Mean FST values between pairs of populations.

                  Kenyan Bantu  S. African Bantu  Biaka Pygmy  Hadza  Maasai  Mandenka  Mbuti Pygmy  Namibian San  ≠Khomani Bushmen  Sandawe  Yoruba
S. African Bantu  0.008
Biaka             0.039         0.032
Hadza             0.100         0.105             0.120
Maasai            0.016         0.025             0.053        0.085
Mandenka          0.015         0.015             0.046        0.108  0.028
Mbuti             0.073         0.067             0.056        0.148  0.078   0.081
Namibian Bushmen  0.091         0.073             0.074        0.153  0.090   0.097     0.092
≠Khomani Bushmen  0.073         0.055             0.064        0.133  0.071   0.080     0.081        0.018
Sandawe           0.021         0.025             0.050        0.085  0.015   0.033     0.075        0.082         0.065
Yoruba            0.009         0.008             0.040        0.106  0.026   0.009     0.076        0.093         0.077             0.030
Tuscan            0.135         0.147             0.172        0.180  0.095   0.146     0.193        0.198         0.172             0.108    0.148

Calculated over all autosomal SNPs. All populations are included except for the HapMap YRI, since the HGDP Yorubans are examined instead.

Table C.3: Mantel correlations between matrices of the indicated variables for each haplotype statistic.

Variables correlated                                         iHS              XP-EHH CEU       XP-EHH YRI       XP-EHH MKK       XP-EHH KHB
FST and Overlap Top 1%                                       -0.626 (0.0002)  -0.885 (0)       -0.200 (0.137)   -0.500 (0)       -0.500 (0)
FST and Overlap Top 0.1%                                     -0.116 (0.212)   -0.664 (0)       -0.096 (0.346)   -0.341 (0.001)   -0.647 (0)
Correlation of Window Statistics and Overlap Top 1%          0.833 (0.002)    0.894 (0.0002)   0.834 (0.0002)   0.866 (0.0002)   0.869 (0.0002)
Correlation of Window Statistics and Overlap Top 0.1%        0.223 (0.057)    0.637 (0.0004)   0.310 (0.066)    0.514 (0.007)    0.655 (0.0002)
Correlation of Window Statistics and FST                     -0.669 (0.003)   -0.913 (0)       -0.243 (0.164)   -0.747 (0)       -0.784 (0.003)
Average Bantu Ancestry and Correlation of Window Statistics  0.300 (0.175)    0.400 (0.123)    -0.643 (0.008)   0.455 (0.043)    0.495 (0.063)
Average Bantu Ancestry and Overlap Top 1%                    0.468 (0.029)    0.490 (0.043)    -0.531 (0.0004)  0.443 (0.023)    0.527 (0.021)

FST: mean FST over all SNPs between pairs of populations; Correlations of Window Statistics: correlations of window statistics across all genomic windows between pairs of populations; Overlap Top 1%: number of shared windows in the empirical top 1% between pairs of populations; Overlap Top 0.1%: number of shared windows in the empirical top 0.1% between pairs of populations. The indicated variables are calculated between all pairs of populations (creating matrices for which Mantel correlations can be calculated). P-values for the Mantel correlations, assessed by 5,000 randomizations, are indicated in parentheses; correlations are significant when the p-value is < 0.05.
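
A Mantel correlation of the kind reported in Table C.3 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the procedure used in this thesis: it uses a Pearson correlation of the upper-triangle entries, permutes the rows and columns of the second matrix jointly, and reports a two-sided p-value as the fraction of permutations at least as extreme as the observed value. The function names and the toy matrices are hypothetical.

```python
import math
import random


def upper_triangle(matrix, order):
    """Upper-triangle entries of a symmetric matrix, with rows/columns reordered."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(a, b, n_perm=5000, seed=0):
    """Mantel correlation between symmetric matrices a and b.

    Returns (observed r, permutation p-value). Assumption: two-sided p-value.
    """
    identity = list(range(len(a)))
    x = upper_triangle(a, identity)
    observed = pearson(x, upper_triangle(b, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # permute populations, keeping rows and columns paired
        if abs(pearson(x, upper_triangle(b, perm))) >= abs(observed) - 1e-12:
            hits += 1
    return observed, hits / n_perm


# Toy pairwise matrices for four populations (hypothetical values):
# `fst` holds pairwise FST, `overlap` decreases linearly with FST.
fst = [[0.00, 0.01, 0.05, 0.10],
       [0.01, 0.00, 0.04, 0.09],
       [0.05, 0.04, 0.00, 0.03],
       [0.10, 0.09, 0.03, 0.00]]
overlap = [[0.0 if i == j else 10.0 - 50.0 * fst[i][j] for j in range(4)]
           for i in range(4)]

r_obs, p_value = mantel(fst, overlap, n_perm=200)
print(round(r_obs, 3), p_value)
```

Permuting whole populations rather than individual matrix entries is the design choice that distinguishes a Mantel test from an ordinary correlation test, since the pairwise entries that share a population are not independent.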

Table C.4: Significantly enriched Gene Ontology (GO) terms in the top empirical 1% and 5% of windows in each population.

Population        Statistic          GO term*             Name of GO term                                                                    p value  Windows  Genes
Biaka Pygmy       XP-EHH CEU (0.05)  GO:0016571           histone methylation                                                                0        6        MEN1, EED, PRMT8, SUV420H2, NSD1, EHMT1
                  XP-EHH CEU (0.05)  GO:0007089           traversing start control point of mitotic cell cycle                               5x10^-5  5        CDK2, MDM2, CDC6, CDC25C, MTBP
Namibian San      XP-EHH CEU (0.01)  GO:0006898 (P)       receptor-mediated endocytosis                                                      0.00045  5        PICALM, ASGR1, LRP1B, DAB2, IGF2R
≠Khomani Bushmen  XP-EHH CEU (0.05)  GO:0006355           regulation of transcription, DNA-dependent                                         0.0      180
Bantu SA          XP-EHH YRI (0.01)  GO:0006977           DNA damage response, signal transduction by p53 class mediator, cell cycle arrest  5x10^-5  3        PML, GTSE1, GML
Bantu Kenya       XP-EHH CEU (0.05)  GO:0030178           negative regulation of Wnt signaling pathway                                       5x10^-5  7        DKK1, LRP1, GSC, TAX1BP3, FRZB, TLE1, BARX1
Mandenka          iHS (0.01)         GO:0002504 (also P)  antigen processing and presentation of peptide or polysaccharide via MHC Class II  0        3        HLA-DRB5, HLA-DQB2, HLA-DPA1
                  iHS (0.01)         GO:0019882 (also P)  antigen processing                                                                 5x10^-5  5        CD1A, HFE, HLA-DRB5, HLA-DQB2, HLA-DPA1
                  iHS (0.01)         GO:0015031 (P)       protein transport                                                                  0.00145  17
                  XP-EHH CEU (0.01)  GO:0006807 (P)       nitrogen compound metabolic process                                                0.00015  4        NIT1, GLUL, NIT2, VNN1, VNN3
                  XP-EHH CEU (0.01)  GO:0006366 (P)       transcription from RNA polymerase II promoter                                      0.0009   8        USF1, TRIM29, YBX2, HLF, SNAPC2, BAPX1, CREB5, POU6F2
Yoruba            XP-EHH CEU (0.01)  GO:0007417           central nervous system development                                                 5x10^-5  9        ACCN1, NPTX1, NPAS1, ADAM23, DNER, CHRD, PARK2, POU6F2, GLI3
HapMap YRI        XP-EHH CEU (0.01)  GO:0045786           negative regulation of cell cycle                                                  0.0186   2        HRASLS3, MOV10L1
                  XP-EHH CEU (0.01)  GO:0006366 (P)       transcription from RNA polymerase II promoter                                      0.00015  9        PRDM4, MAF, MNT, YBX2, NFIX, BAPX1, CREB5, POU6F2, GATA4
                  XP-EHH MKK (0.01)  GO:0055088           lipid homeostasis                                                                  5x10^-5  3        USF1, USF2, PPARG
                  XP-EHH MKK (0.01)  GO:0000432           positive regulation of transcription from RNA polymerase II promoter by glucose    0.0001   2        USF1, USF2
                  XP-EHH MKK (0.01)  GO:0009991           response to extracellular stimulus                                                 0.00025  2        RASGRP4, RPS19
                  XP-EHH KHB (0.01)  GO:0009991           response to extracellular stimulus                                                 0.0      2        RASGRP4, RPS19

Significance determined as described in Methods, using an FDR cutoff of 0.05, within each population. The “Statistic” column indicates the statistic resulting in significance; the numbers in parentheses after the statistic indicate the empirical tail for which an enrichment was tested (top 0.001 are shown in Chapter 4). Significant GO terms are indicated in the “GO term” column. *Next to a GO term, “(P)” indicates that the term is only significant when examining the PANTHER subset of terms; “(also P)” indicates that the term is significant both when examining PANTHER terms and when examining all GO terms; if nothing is next to the term, it indicates significance only when all GO terms are examined. We count the number of windows in the indicated empirical tail to which each term was associated (“Windows” column); rather than reporting those windows, we report the genes within those windows that were associated with the given GO term (in the “Genes” column).

Bibliography

[1] 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler, A. Auton, L. D. Brooks, R. M. Durbin, R. A. Gibbs, M. E. Hurles, and G. A. McVean. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061–73, Oct 2010.

[2] M. Achtman. Evolution, population structure, and phylogeography of genetically monomorphic bacterial pathogens. Annu Rev Microbiol, 62:53–70, 2008.

[3] J. M. Akey. Constructing genomic maps of positive selection in humans: where do we go from here? Genome Res, 19(5):711–722, May 2009.

[4] J. M. Akey, G. Zhang, K. Zhang, L. Jin, and M. D. Shriver. Interrogating a high-density SNP map for signatures of natural selection. Genome Res, 12(12):1805–1814, Dec 2002.

[5] D. H. Alexander, J. Novembre, and K. Lange. Fast model-based estimation of ancestry in unrelated individuals. Genome Res, 19(9):1655–64, Sep 2009.

[6] C. Allix-Béguec, D. Harmsen, T. Weniger, P. Supply, and S. Niemann. Evaluation and strategy for use of MIRU-VNTRplus, a multifunctional database for online analysis of genotyping data and phylogenetic identification of Mycobacterium tuberculosis complex isolates. J Clin Microbiol, 46(8):2692–2699, Aug 2008.

[7] J. P. Aparicio and C. Castillo-Chavez. Mathematical modelling of tuberculosis epidemics. Math Biosci Eng, 6(2):209–37, Apr 2009.

[8] L. Avery and S. Wasserman. Ordering gene function: the interpretation of epistasis in regulatory hierarchies. Trends in Genetics, 8(9):312–6, 1992.

[9] D. Aylor and Z. Zeng. From classical genetics to quantitative genetics to systems biology: Modeling epistasis. PLoS Genetics, 4(3), 2008.

[10] S.-A. Bacanu, B. Devlin, and K. Roeder. Association studies for quantitative traits in structured populations. Genet Epidemiol, 22(1):78–93, Jan 2002.

[11] D. J. Balding. Likelihood-based inference for genetic correlation coefficients. Theor Popul Biol, 63(3):221–30, May 2003.

[12] M. Bansal, V. Belcastro, A. Ambesi-Impiombato, and D. di Bernardo. How to infer gene networks from expression profiles. Mol Syst Biol, 3:78, 2007.

[13] L. B. Barreiro, G. Laval, H. Quach, E. Patin, and L. Quintana-Murci. Natural selection has driven population differentiation in modern humans. Nat Genet, 40(3):340–5, Mar 2008.

[14] G. S. Barsh. What controls variation in human skin color? PLoS Biol, 1(1):E27, Oct 2003.

[15] N. H. Barton and M. Slatkin. A quasi-equilibrium theory of the distribution of rare alleles in a subdivided population. Heredity (Edinb), 56(Pt 3):409–15, Jun 1986.

[16] W. Bateson. Mendel’s Principles of Heredity. Cambridge University Press, 1909.

[17] M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate Bayesian computation in population genetics. Genetics, 162(4):2025–2035, Dec 2002.

[18] N. S. A. Becker, P. Verdu, A. Froment, S. Le Bomin, H. Pagezy, S. Bahuchet, and E. Heyer. Indirect evidence for the genetic determination of short stature in African Pygmies. Am J Phys Anthropol, 145(3):390–401, Jul 2011.

[19] N. S. A. Becker, P. Verdu, B. Hewlett, and S. Pavard. Can life history trade-offs explain the evolution of short stature in human pygmies? A response to Migliano et al. (2007). Hum Biol, 82(1):17–27, Feb 2010.

[20] N. S. Becker, P. Verdu, M. Georges, P. Duquesnoy, A. Froment, S. Amselem, Y. Le Bouc, and E. Heyer. The role of GHR and IGF1 genes in the genetic determination of African pygmies’ short stature. Eur J Hum Genet, Oct 2012.

[21] P. Beerli and J. Felsenstein. Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proc Natl Acad Sci U S A, 98(8):4563–4568, Apr 2001.

[22] D. J. Begun, A. K. Holloway, K. Stevens, L. W. Hillier, Y.-P. Poh, M. W. Hahn, P. M. Nista, C. D. Jones, A. D. Kern, C. N. Dewey, L. Pachter, E. Myers, and C. H. Langley. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol, 5(11):e310, Nov 2007.

[23] D. M. Behar, M. G. Thomas, K. Skorecki, M. F. Hammer, E. Bulygina, D. Rosengarten, A. L. Jones, K. Held, V. Moses, D. Goldstein, N. Bradman, and M. E. Weale. Multiple origins of Ashkenazi Levites: Y chromosome evidence for both Near Eastern and European ancestries. Am J Hum Genet, 73(4):768–779, Oct 2003.

[24] S. Beleza, L. Gusmão, A. Amorim, A. Carracedo, and A. Salas. The genetic legacy of western Bantu migrations. Hum Genet, 117(4):366–75, Aug 2005.

[25] S. Beleza, A. M. Santos, B. McEvoy, I. Alves, C. Martinho, E. Cameron, M. D. Shriver, E. J. Parra, and J. Rocha. The timing of pigmentation lightening in Europeans. Mol Biol Evol, Oct 2012.

[26] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57(1):289–300, 1995.

[27] S. N. Bennett, A. J. Drummond, D. D. Kapan, M. A. Suchard, J. L. Muñoz-Jordán, O. G. Pybus, E. C. Holmes, and D. J. Gubler. Epidemic dynamics revealed in dengue evolution. Mol Biol Evol, 27(4):811–8, Apr 2010.

[28] G. Berniell-Lee, F. Calafell, E. Bosch, E. Heyer, L. Sica, P. Mouguiama-Daouda, L. van der Veen, J.-M. Hombert, L. Quintana-Murci, and D. Comas. Genetic and demographic implications of the Bantu expansion: insights from human paternal lineages. Mol Biol Evol, 26(7):1581–1589, Jul 2009.

[29] G. Bhatia, N. Patterson, B. Pasaniuc, N. Zaitlen, G. Genovese, S. Pollack, S. Mallick, S. Myers, A. Tandon, C. Spencer, C. D. Palmer, A. A. Adeyemo, E. L. Akylbekova, L. A. Cupples, J. Divers, M. Fornage, W. H. L. Kao, L. Lange, M. Li, S. Musani, J. C. Mychaleckyj, A. Ogunniyi, G. Papanicolaou, C. N. Rotimi, J. I. Rotter, I. Ruczinski, B. Salako, D. S. Siscovick, B. O. Tayo, Q. Yang, et al. Genome-wide comparison of African-ancestry populations from CARe and other cohorts reveals signals of natural selection. Am J Hum Genet, 89(3):368–81, Sep 2011.

[30] N. Bhatti, M. R. Law, J. K. Morris, R. Halliday, and J. Moore-Gillon. Increasing incidence of tuberculosis in England and Wales: a study of the likely causes. BMJ, 310(6985):967–9, Apr 1995.

[31] R. Biek, J. C. Henderson, L. A. Waller, C. E. Rupprecht, and L. A. Real. A high-resolution genetic signature of demographic and spatial expansion in epizootic rabies virus. Proc Natl Acad Sci U S A, 104(19):7993–8, May 2007.

[32] K. S. Blackwood, A. Al-Azem, L. J. Elliott, E. S. Hershfield, and A. M. Kabani. Conventional and molecular epidemiology of tuberculosis in Manitoba. BMC Infect Dis, 3:18, Aug 2003.

[33] M. J. Blaser and D. Kirschner. The equilibria that allow bacterial persistence in human hosts. Nature, 449(7164):843–9, Oct 2007.

[34] B. R. Bloom, editor. Tuberculosis: Pathogenesis, Protection and Control. ASM Press, 1994.

[35] B. L. Browning and S. R. Browning. A fast, powerful method for detecting identity by descent. Am J Hum Genet, 88(2):173–82, Feb 2011.

[36] S. R. Browning and B. L. Browning. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet, 81(5):1084–1097, Nov 2007.

[37] S. R. Browning and B. L. Browning. Haplotype phasing: existing methods and new developments. Nat Rev Genet, 12(10):703–14, 2011.

[38] C. Burch and L. Chao. Epistasis and its relationship to canalization in the RNA virus phi6. Genetics, 167:559–567, 2004.

[39] C. Burch, P. Turner, and K. Hanley. Patterns of epistasis in RNA viruses: a review of the evidence from vaccine design. J. Evol. Biol., 16:1223–1235, 2003.

[40] M. C. Campbell and S. A. Tishkoff. The evolution of human genetic and phenotypic variation in Africa. Curr Biol, 20(4):R166–R173, Feb 2010.

[41] Statistics Canada. 2006 Census: Aboriginal Peoples in Canada. Technical report, Statistics Canada, 2006.

[42] S. I. Candille, D. M. Absher, S. Beleza, M. Bauchet, B. McEvoy, N. A. Garrison, J. Z. Li, R. M. Myers, G. S. Barsh, H. Tang, and M. D. Shriver. Genome-wide association studies of quantitatively measured skin, hair, and eye pigmentation in four European populations. PLoS One, 7(10):e48294, 2012.

[43] G. Canetti. The tubercle bacillus in the pulmonary lesion of man. The histobacteriogenesis of tuberculosis lesions: experimental studies. Springer Publishing Company, Inc., New York, NY, pages 87–90, 1955.

[44] M. F. Cantwell, M. T. McKenna, E. McCray, and I. M. Onorato. Tuberculosis and race/ethnicity in the United States: impact of socioeconomic status. Am J Respir Crit Care Med, 157(4 Pt 1):1016–20, Apr 1998.

[45] J. P. Cegielski and D. N. McMurray. The relationship between malnutrition and tuberculosis: evidence from studies in humans and experimental animals. Int J Tuberc Lung Dis, 8(3):286–98, Mar 2004.

[46] H. Charbonneau. The First French Canadians: Pioneers in the St. Lawrence Valley. Newark, NJ: University of Delaware Press; Toronto: Associated University Presses, 1993.

[47] J. M. Cheverud and E. J. Routman. Epistasis and its contribution to genetic variance components. Genetics, 139:1455–1461, 1995.

[48] K. C. Chipman and A. K. Singh. Predicting genetic interactions with random walks on biological networks. BMC Bioinformatics, 10(17), 2009.

[49] S. R. Collins, M. Schuldiner, N. J. Krogan, and J. S. Weissman. A strategy for extracting and analyzing large-scale quantitative epistatic interaction data. Genome Biology, 7:R63, 2006.

[50] S. Collins, K. M. Miller, N. L. Maas, A. Roguev, J. Fillingham, C. Chu, M. Schuldiner, M. Gebbia, J. Recht, M. Shales, H. Ding, H. Xu, J. Han, K. Ingvarsdottir, B. Cheng, B. Andrews, C. Boone, S. Berger, P. Hieter, Z. Zhang, G. Brown, C. Ingles, A. Emili, C. Allis, D. Toczyski, J. Weissman, J. F. Greenblatt, and N. J. Krogan. Functional dissection of protein complexes involved in yeast chromosome biology using a genetic interaction map. Nature, 446:806–810, 2007.

[51] D. F. Conrad, M. Jakobsson, G. Coop, X. Wen, J. D. Wall, N. A. Rosenberg, and J. K. Pritchard. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nat Genet, 38(11):1251–1260, Nov 2006.

[52] G. Coop, J. K. Pickrell, J. Novembre, S. Kudaravalli, J. Li, D. Absher, R. M. Myers, L. L. Cavalli-Sforza, M. W. Feldman, and J. K. Pritchard. The role of geography in human adaptation. PLoS Genet, 5(6):e1000500, Jun 2009.

[53] G. Coop, D. Witonsky, A. Di Rienzo, and J. K. Pritchard. Using environmental correlations to identify loci underlying local adaptation. Genetics, 185(4):1411–23, Aug 2010.

[54] H. J. Cordell. Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans. Human Molecular Genetics, 11(20):2463–2468, 2002.

[55] H. J. Cordell. Detecting gene-gene interactions that underlie human diseases. Nature Reviews Genetics, 10:392–404, 2009.

[56] R. Culverhouse, B. Suarez, J. Lin, and T. Reich. A perspective on epistasis: limits of models displaying no main effect. Am. J. Hum. Genet., 70:461–471, 2002.

[57] R. Dart. Recent discoveries bearing on human history in southern Africa. Journal of the Anthropological Institute of Great Britain and Ireland, pages 13–27, 1940.

[58] J. W. Daschuk, P. Hackett, and S. MacNeil. Treaties and tuberculosis: First Nations people in late 19th-century western Canada, a political and economic transformation. Can Bull Med Hist, 23(2):307–30, 2006.

[59] A. Davierwala, J. Haynes, Z. Li, R. Brost, et al. The synthetic genetic interaction spectrum of essential genes. Nature Genetics, 37:1147–1152, 2005.

[60] L. Decourty, C. Saveanu, K. Zemam, F. Hantraye, E. Frachon, J. Rousselle, M. Fromont-Racine, and A. Jacquier. Linking functionally related genes by sensitive and quantitative characterization of genetic interaction profiles. Proc. Natl. Acad. Sci., 105(15):5821–5826, 2008.

[61] P. B. deMenocal. African climate change and faunal evolution during the Pliocene-Pleistocene. Earth and Planetary Science Letters, 220(1-2):3–24, 2004. ISSN 0012-821X.

[62] P. B. deMenocal. Anthropology. Climate and human evolution. Science, 331(6017):540–2, Feb 2011.

[63] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society series B, 39:1–38, 1977.

[64] M. Desai, D. Weissman, and M. Feldman. Evolution can favor antagonistic epistasis. Genetics, 177:1001–1010, 2007.

[65] B. L. Drees, V. Thorsson, G. W. Carter, A. W. Rives, M. Z. Raymond, I. Avila-Campillo, P. Shannon, and T. Galitski. Derivation of genetic interaction networks from quantitative phenotype data. Genome Biology, 6:R38, 2005.

[66] H. Duckworth. The English River Book. A North West Company Journal and Account Book of 1786, 1990.

[67] E. E. Tuberculosis in Canada. Technical report, Public Health Agency of Canada, Ottawa (Canada PHAo), 2004.

[68] M. A. Eberle, P. C. Ng, K. Kuhn, L. Zhou, D. A. Peiffer, L. Galver, K. A. Viaud-Martinez, C. T. Lawley, K. L. Gunderson, R. Shen, and S. S. Murray. Power to detect risk alleles using genome-wide tag SNP panels. PLoS Genet, 3(10):1827–37, Oct 2007.

[69] S. Elena and R. Lenski. Test of synergistic interactions among deleterious mutations in bacteria. Nature, 390:395–398, 1997.

[70] A. Estoup, I. J. Wilson, C. Sullivan, J. M. Cornuet, and C. Moritz. Inferring population history from microsatellite and enzyme data in serially introduced cane toads, Bufo marinus. Genetics, 159(4):1671–1687, Dec 2001.

[71] A. Estoup, M. Beaumont, F. Sennedot, C. Moritz, and J.-M. Cornuet. Genetic analysis of complex demographic scenarios: spatially expanding populations of the cane toad, Bufo marinus. Evolution, 58(9):2021–2036, Sep 2004.

[72] G. Ewing and J. Hermisson. MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics, 26(16):2064–5, Aug 2010.

[73] L. Excoffier, J. Novembre, and S. Schneider. SIMCOAL: a general coalescent program for the simulation of molecular data in interconnected populations with arbitrary demography. J Hered, 91(6):506–509, 2000.

[74] L. Excoffier and H. E. L. Lischer. Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol Ecol Resour, 10(3):564–7, May 2010.

[75] D. Falconer and T. Mackay. Introduction to Quantitative Genetics. Longman, Harlow, UK, 1995.

[76] M. Feldman, F. Christiansen, and L. Brooks. Evolution of recombination in a constant environment. Proc. Natl. Acad. Sci., 77:4838–4841, 1980.

[77] R. Ferguson. Tuberculosis among the Indians of the Great Canadian Plains. London: Adlard and Sons Ltd., 1928.

[78] R. Fisher. The correlations between relatives on the supposition of Mendelian inheritance. Trans. R. Soc. Edinb., 52:399–433, 1918.

[79] J. Flint and T. F. C. Mackay. Genetic architecture of quantitative traits in mice, flies, and humans. Genome Res, 19(5):723–33, May 2009.

[80] A. Fournier-Level, A. Korte, M. D. Cooper, M. Nordborg, J. Schmitt, and A. M. Wilczek. A map of local adaptation in Arabidopsis thaliana. Science, 334(6052):86–9, Oct 2011.

[81] C. S. Fox, N. Heard-Costa, L. A. Cupples, J. Dupuis, R. S. Vasan, and L. D. Atwood. Genome-wide association to body mass index and waist circumference: the Framingham Heart Study 100K project. BMC Med Genet, 8 Suppl 1:S18, 2007.

[82] D. Francis. Traders and Indians. The Prairie West Historical Readings, pages 58–70, 1985.

[83] W. Frankel and N. Schork. Who’s afraid of epistasis? Nat. Genet., 14:371–373, 1996.

[84] G. Friesen. The Canadian prairies: A history. University of Toronto Press, 1987.

[85] A. R. Frisancho, R. Wainwright, and A. Way. Heritability and components of phenotypic expression in skin reflectance of Mestizos from the Peruvian lowlands. Am J Phys Anthropol, 55(2):203–8, Jun 1981.

[86] S. D. W. Frost and E. M. Volz. Viral phylodynamics and the search for an "effective number of infections". Philos Trans R Soc Lond B Biol Sci, 365(1548):1879–90, Jun 2010.

[87] M. Fumagalli, M. Sironi, U. Pozzoli, A. Ferrer-Admettla, L. Pattini, and R. Nielsen. Signatures of environmental genetic adaptation pinpoint pathogens as the main selective pressure through human evolution. PLoS Genet, 7(11):e1002355, Nov 2011.

[88] S. Gagneux, K. DeRiemer, T. Van, M. Kato-Maeda, B. C. de Jong, S. Narayanan, M. Nicol, S. Niemann, K. Kremer, M. C. Gutierrez, M. Hilty, P. C. Hopewell, and P. M. Small. Variable host-pathogen compatibility in Mycobacterium tuberculosis. Proc Natl Acad Sci U S A, 103(8):2869–2873, Feb 2006.

[89] C. R. Gignoux, B. M. Henn, and J. L. Mountain. Rapid, global demographic expansions after the origins of agriculture. Proc Natl Acad Sci U S A, 108(15):6044–9, Apr 2011.

[90] D. B. Goldstein, A. R. Linares, L. L. Cavalli-Sforza, and M. W. Feldman. An evaluation of genetic distances for use with microsatellite loci. Genetics, 139(1):463–471, Jan 1995.

[91] D. B. Goldstein, A. R. Linares, L. L. Cavalli-Sforza, and M. W. Feldman. Genetic absolute dating based on microsatellites and the origin of modern humans. Proc Natl Acad Sci U S A, 92(15):6723–6727, Jul 1995.

[92] A. Green, M. MacKinnon, and C. Minns. Conspicuous by their absence: French Canadians and the settlement of the Canadian West. The Journal of Economic History, 65(03):822–849, 2005.

[93] J. W. Gregersen, K. R. Kranc, X. Ke, P. Svendsen, L. S. Madsen, A. R. Thomsen, L. R. Cardon, J. I. Bell, and L. Fugger. Functional epistasis on a common MHC haplotype associated with multiple sclerosis. Nature, 443(7111):574–7, Oct 2006.

[94] E. Grigg. The arcana of tuberculosis with a brief epidemiologic history of the disease in the U.S.A. Am Rev Tuberc, 78(2):151–72 contd, Aug 1958.

[95] A. M. Hancock, G. Alkorta-Aranburu, D. B. Witonsky, and A. Di Rienzo. Adaptations to new environments in humans: the role of subtle allele frequency shifts. Philos Trans R Soc Lond B Biol Sci, 365(1552):2459–68, Aug 2010.

[96] A. M. Hancock, D. B. Witonsky, G. Alkorta-Aranburu, C. M. Beall, A. Gebremedhin, R. Sukernik, G. Utermann, J. K. Pritchard, G. Coop, and A. Di Rienzo. Adaptations to climate-mediated selective pressures in humans. PLoS Genet, 7(4):e1001375, Apr 2011.

[97] A. M. Hancock, D. B. Witonsky, E. Ehler, G. Alkorta-Aranburu, C. Beall, A. Gebremedhin, R. Sukernik, G. Utermann, J. Pritchard, G. Coop, and A. Di Rienzo. Colloquium paper: human adaptations to diet, subsistence, and ecoregion are due to subtle shifts in allele frequency. Proc Natl Acad Sci U S A, 107 Suppl 2:8924–30, May 2010.

[98] A. Handel and M. R. Bennett. Surviving the bottleneck: Transmission mutants and the evolution of microbial populations. Genetics, 180:2193–2200, December 2008.

[99] J. Hartman and N. Tippery. Systematic quantification of gene interactions by phenotypic array analysis. Genome Biology, 5:R49, 2004.

[100] J. I. Hawker, S. S. Bakhshi, S. Ali, and C. P. Farrington. Ecological analysis of ethnic differences in relation between tuberculosis and poverty. BMJ, 319(7216):1031–4, Oct 1999.

[101] B. M. Henn, C. Gignoux, A. A. Lin, P. J. Oefner, P. Shen, R. Scozzari, F. Cruciani, S. A. Tishkoff, J. L. Mountain, and P. A. Underhill. Y-chromosomal evidence of a pastoralist migration through Tanzania to southern Africa. Proc Natl Acad Sci U S A, 105(31):10693–10698, Aug 2008.

[102] B. M. Henn, C. R. Gignoux, M. W. Feldman, and J. L. Mountain. Characterizing the time dependency of human mitochondrial DNA mutation rate estimates. Mol Biol Evol, 26(1):217–30, Jan 2009.

[103] B. M. Henn, C. R. Gignoux, M. Jobin, J. M. Granka, J. M. Macpherson, J. M. Kidd, L. Rodríguez-Botigué, S. Ramachandran, L. Hon, A. Brisbin, A. A. Lin, P. A. Underhill, D. Comas, K. K. Kidd, P. J. Norman, P. Parham, C. D. Bustamante, J. L. Mountain, and M. W. Feldman. Hunter-gatherer genomic diversity suggests a southern African origin for modern humans. Proc Natl Acad Sci U S A, 108(13):5154–62, Mar 2011.

[104] B. M. Henn, L. Hon, J. M. Macpherson, N. Eriksson, S. Saxonov, I. Pe’er, and J. L. Mountain. Cryptic distant relatives are common in both isolated and cosmopolitan genetic samples. PLoS One, 7(4):e34267, 2012.

[105] J. Hermisson. Who believes in whole-genome scans for selection? Heredity (Edinb), 103(4):283–4, Oct 2009.

[106] R. D. Hernandez, J. L. Kelley, E. Elyashiv, S. C. Melton, A. Auton, G. McVean, 1000 Genomes Project, G. Sella, and M. Przeworski. Classic selective sweeps were rare in recent human evolution. Science, 331(6019):920–4, Feb 2011.

[107] R. Hershberg, M. Lipatov, P. M. Small, H. Sheffer, S. Niemann, S. Homolka, J. C. Roach, K. Kremer, D. A. Petrov, M. W. Feldman, and S. Gagneux. High functional diversity in Mycobacterium tuberculosis driven by genetic drift and human demography. PLoS Biol, 6(12):e311, Dec 2008.

[108] J. Hey. On the number of New World founders: a population genetic portrait of the peopling of the Americas. PLoS Biol, 3(6):e193, Jun 2005.

[109] J. Hey, Y.-J. Won, A. Sivasundar, R. Nielsen, and J. A. Markert. Using nuclear haplotypes with microsatellites to study gene flow between recently separated cichlid species. Mol Ecol, 13(4):909–919, Apr 2004.

[110] D. A. Hinds, L. L. Stuve, G. B. Nilsen, E. Halperin, E. Eskin, D. G. Ballinger, K. A. Frazer, and D. R. Cox. Whole-genome patterns of common DNA variation in three human populations. Science, 307(5712):1072–9, Feb 2005.

[111] A. E. Hirsh, A. G. Tsolaki, K. DeRiemer, M. W. Feldman, and P. M. Small. Stable association between strains of Mycobacterium tuberculosis and their human host populations. Proc Natl Acad Sci U S A, 101(14):4871–4876, Apr 2004.

[112] S. Y. W. Ho, B. Shapiro, M. J. Phillips, A. Cooper, and A. J. Drummond. Evidence for time dependency of molecular rate estimates. Syst Biol, 56(3):515–22, Jun 2007.

[113] T. Hofer, N. Ray, D. Wegmann, and L. Excoffier. Large allele frequency differences between human continental groups are more likely to have occurred by drift during range expansions than by selection. Ann Hum Genet, 73(1):95–108, Jan 2009.

[114] P. Holmans, E. K. Green, J. S. Pahwa, M. A. R. Ferreira, S. M. Purcell, P. Sklar, Wellcome Trust Case-Control Consortium, M. J. Owen, M. C. O’Donovan, and N. Craddock. Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am J Hum Genet, 85(1):13–24, Jul 2009.

[115] K. E. Holsinger and B. S. Weir. Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nat Rev Genet, 10(9):639–50, Sep 2009.

[116] S. Homolka, E. Post, B. Oberhauser, A. G. George, L. Westman, F. Dafae, S. Rüsch-Gerdes, and S. Niemann. High genetic diversity among Mycobacterium tuberculosis complex strains from Sierra Leone. BMC Microbiol, 8:103, 2008.

[117] D. W. Huang, B. T. Sherman, and R. A. Lempicki. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res, 37(1):1–13, Jan 2009.

[118] D. W. Huang, B. T. Sherman, and R. A. Lempicki. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc, 4(1):44–57, 2009.

[119] J. G. Im, H. Itoh, Y. S. Shim, J. H. Lee, J. Ahn, M. C. Han, and S. Noma. Pulmonary tuberculosis: CT findings–early active disease and sequential change with antituberculous therapy. Radiology, 186(3):653–60, Mar 1993.

[120] H. Innis. The fur trade in Canada: An introduction to Canadian economic history. University of Toronto Press, 1999.

[121] International HapMap 3 Consortium, D. M. Altshuler, R. A. Gibbs, L. Peltonen, E. Dermitzakis, S. F. Schaffner, F. Yu, P. E. Bonnen, P. I. W. de Bakker, P. Deloukas, S. B. Gabriel, R. Gwilliam, S. Hunt, M. Inouye, X. Jia, A. Palotie, M. Parkin, P. Whittaker, K. Chang, A. Hawes, L. R. Lewis, Y. Ren, D. Wheeler, D. M. Muzny, C. Barnes, K. Darvishi, M. Hurles, J. M. Korn, K. Kristiansson, C. Lee, McCarrol, et al. Integrating common and rare genetic variation in diverse human populations. Nature, 467(7311):52–8, Sep 2010.

[122] International HapMap Consortium. A haplotype map of the human genome. Nature, 437(7063):1299–1320, Oct 2005.

[123] International HapMap Consortium, K. A. Frazer, D. G. Ballinger, D. R. Cox, D. A. Hinds, L. L. Stuve, R. A. Gibbs, J. W. Belmont, A. Boudreau, P. Hardenbol, S. M. Leal, S. Pasternak, D. A. Wheeler, T. D. Willis, F. Yu, H. Yang, C. Zeng, Y. Gao, H. Hu, W. Hu, C. Li, W. Lin, S. Liu, H. Pan, X. Tang, J. Wang, W. Wang, J. Yu, B. Zhang, Q. Zhang, H. Zhao, et al. A second generation human haplotype map of over 3.1 million SNPs. Nature, 449(7164):851–61, Oct 2007.

[124] M. D. Iseman. A Clinician’s Guide to Tuberculosis. Lippincott Williams and Wilkins, Philadelphia, 2000.

[125] N. Izagirre, I. García, C. Junquera, C. de la Rúa, and S. Alonso. A scan for signatures of positive selection in candidate loci for skin pigmentation in humans. Mol Biol Evol, 23(9):1697–706, Sep 2006.

[126] N. G. Jablonski and G. Chaplin. The evolution of human skin coloration. J Hum Evol, 39(1):57–106, Jul 2000.

[127] N. Jablonski. Living Color: The Biological and Social Meaning of Skin Color. University of California Press, 2012.

[128] M. Jakobsson and N. A. Rosenberg. CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics, 23(14):1801–6, Jul 2007.

[129] M. Jakobsson, S. W. Scholz, P. Scheet, J. R. Gibbs, J. M. VanLiere, H.-C. Fung, Z. A. Szpiech, J. H. Degnan, K. Wang, R. Guerreiro, J. M. Bras, J. C. Schymick, D. G. Hernandez, B. J. Traynor, J. Simon-Sanchez, M. Matarin, A. Britton, J. van de Leemput, I. Rafferty, M. Bucan, H. M. Cann, J. A. Hardy, N. A. Rosenberg, and A. B. Singleton. Genotype, haplotype and copy-number variation in worldwide human populations. Nature, 451(7181):998–1003, Feb 2008.

[130] J. P. Jarvis, L. B. Scheinfeldt, S. Soi, C. Lambert, L. Omberg, B. Ferwerda, A. Froment, J.-M. Bodo, W. Beggs, G. Hoffman, J. Mezey, and S. A. Tishkoff. Patterns of ancestry, signatures of natural selection, and genetic association with stature in Western African pygmies. PLoS Genet, 8(4):e1002641, Apr 2012.

[131] L. Jasnos and R. Korona. Epistatic buffering of fitness loss in yeast double deletion strains. Nat. Genet., 39(4):550–554, 2007.

[132] H. Joachim, U. N. Riede, and C. Mittermayer. The weight of human lungs as a diagnostic criterium (distinction of normal lungs from shock lungs by histologic, morphometric and biochemical investigations). Pathol Res Pract, 162(1):24–40, May 1978.

[133] B. I. Joffe, W. P. Jackson, M. E. Thomas, M. G. Toyer, P. Keller, and B. L. Pimstone. Metabolic responses to oral glucose in the Kalahari Bushmen. Br Med J, 4(5781):206–8, Oct 1971.

[134] A. Johansson and U. Gyllensten. Identification of local selective sweeps in human populations since the exodus from Africa. Hereditas, 145(3):126–37, Jun 2008.

[135] N. A. Johnson, M. A. Coram, M. D. Shriver, I. Romieu, G. S. Barsh, S. J. London, and H. Tang. Ancestral components of admixed genomes in a Mexican cohort. PLoS Genet, 7(12):e1002410, Dec 2011.

[136] S. T. Kalinowski. Counting alleles with rarefaction: Private alleles and hierarchical sampling designs. Conservation Genetics, 5(4):539–543, August 2004.

[137] J. Kamerbeek, L. Schouls, A. Kolk, M. van Agterveld, D. van Soolingen, S. Kuijper, A. Bunschoten, H. Molhuizen, R. Shaw, M. Goyal, and J. van Embden. Simultaneous detection and strain differentiation of Mycobacterium tuberculosis for diagnosis and epidemiology. J Clin Microbiol, 35(4):907–14, Apr 1997.

[138] H. M. Kang, J. H. Sul, S. K. Service, N. A. Zaitlen, S.-Y. Kong, N. B. Freimer, C. Sabatti, and E. Eskin. Variance component model to account for sample structure in genome-wide association studies. Nat Genet, 42(4):348–54, Apr 2010.

[139] S. Kathiresan, A. K. Manning, S. Demissie, R. B. D’Agostino, A. Surti, C. Guiducci, L. Gianniny, N. P. Burtt, O. Melander, M. Orho-Melander, D. K. Arnett, G. M. Peloso, J. M. Ordovas, and L. A. Cupples. A genome-wide association study for blood lipid phenotypes in the Framingham Heart Study. BMC Med Genet, 8 Suppl 1:S17, 2007.

[140] M. Kayser, S. Brauer, and M. Stoneking. A genome scan to detect candidate regions influenced by local natural selection in human populations. Mol Biol Evol, 20(6):893–900, Jun 2003.

[141] P. D. Keightley and S. P. Otto. Interference among deleterious mutations favours sex and recombination in finite populations. Nature, 443:89–92, 2006.

[142] E. E. Kenny, N. J. Timpson, M. Sikora, M.-C. Yee, A. Moreno-Estrada, C. Eng, S. Huntsman, E. G. Burchard, M. Stoneking, C. D. Bustamante, and S. Myles. Melanesian blond hair is caused by an amino acid change in TYRP1. Science, 336(6081):554, May 2012.

[143] A. Knight, P. A. Underhill, H. M. Mortensen, L. A. Zhivotovsky, A. A. Lin, B. M. Henn, D. Louis, M. Ruhlen, and J. L. Mountain. African Y chromosome and mtDNA divergence provides insight into the history of click languages. Curr Biol, 13(6):464–473, Mar 2003.

[144] A. Kondrashov. Muller’s ratchet under epistatic selection. Genetics, 136:1469–1473, 1994.

[145] D. Kunimoto, P. Chedore, R. Allen, and S. Kasatiya. Investigation of tuberculosis transmission in Canadian Arctic Inuit communities using DNA fingerprinting. Int J Tuberc Lung Dis, 5(7):642–7, Jul 2001.

[146] L. Kupper and M. Hogan. Interaction in epidemiologic studies. American Journal of Epidemiology, 108(6):447–453, 1978.

[147] J. Lachance, B. Vernot, C. C. Elbers, B. Ferwerda, A. Froment, J.-M. Bodo, G. Lema, W. Fu, T. B. Nyambo, T. R. Rebbeck, K. Zhang, J. M. Akey, and S. A. Tishkoff. Evolutionary history and adaptation from high-coverage whole-genome sequences of diverse African hunter-gatherers. Cell, 150(3):457–69, Aug 2012.

[148] C. A. Lambert and S. A. Tishkoff. Genetic structure in African populations: implications for human demographic history. Cold Spring Harb Symp Quant Biol, 74:395–402, 2009.

[149] H. Lango Allen, K. Estrada, G. Lettre, S. I. Berndt, M. N. Weedon, F. Rivadeneira, C. J. Willer, A. U. Jackson, S. Vedantam, S. Raychaudhuri, T. Ferreira, A. R. Wood, R. J. Weyant, A. V. Segrè, E. K. Speliotes, E. Wheeler, N. Soranzo, J.-H. Park, J. Yang, D. Gudbjartsson, N. L. Heard-Costa, J. C. Randall, L. Qi, A. Vernon Smith, R. Mägi, T. Pastinen, L. Liang, I. M. Heid, J. Luan, G. Thorleifsson, et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature, 467(7317):832–8, Oct 2010.

[150] M. G. Larson, L. D. Atwood, E. J. Benjamin, L. A. Cupples, R. B. D’Agostino, Sr, C. S. Fox, D. R. Govindaraju, C.-Y. Guo, N. L. Heard-Costa, S.-J. Hwang, J. M. Murabito, C. Newton-Cheh, C. J. O’Donnell, S. Seshadri, R. S. Vasan, T. J. Wang, P. A. Wolf, and D. Levy. Framingham Heart study 100K project: genome-wide associations for cardiovascular disease outcomes. BMC Med Genet, 8 Suppl 1:S5, 2007.

[151] G. Laval and L. Excoffier. SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics, 20(15):2485–2487, Oct 2004.

[152] I. Lee et al. A probabilistic functional network of yeast genes. Science, 306(1555), 2004.

[153] B. Lehne, C. M. Lewis, and T. Schlitt. From SNPs to genes: disease association at the gene level. PLoS One, 6(6):e20133, 2011.

[154] G. Lettre, A. U. Jackson, C. Gieger, F. R. Schumacher, S. I. Berndt, S. Sanna, S. Eyheramendy, B. F. Voight, J. L. Butler, C. Guiducci, T. Illig, R. Hackett, I. M. Heid, K. B. Jacobs, V. Lyssenko, M. Uda, Diabetes Genetics Initiative, FUSION, KORA, Prostate, Lung Colorectal and Ovarian Cancer Screening Trial, Nurses’ Health Study, SardiNIA, M. Boehnke, S. J. Chanock, L. C. Groop, F. B. Hu, B. Isomaa, P. Kraft, L. Peltonen, V. Salomaa, et al. Identification of ten loci associated with height highlights new biological pathways in human growth. Nat Genet, 40(5):584–91, May 2008.

[155] C. Leuenberger and D. Wegmann. Bayesian computation and model selection without likelihoods. Genetics, 184(1):243–252, Jan 2010.

[156] H. Li, G. Gao, J. Li, G. P. Page, and K. Zhang. Detecting epistatic interactions contributing to human gene expression using the CEPH family data. BMC Proc, 1 Suppl 1:S67, 2007.

[157] J. Z. Li, D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto, S. Ramachandran, H. M. Cann, G. S. Barsh, M. Feldman, L. L. Cavalli-Sforza, and R. M. Myers. Worldwide human relationships inferred from genome-wide patterns of variation. Science, 319(5866):1100–1104, Feb 2008.

[158] U. Liberman and M. W. Feldman. Evolutionary theory for modifiers of epistasis using a general symmetric model. Proc. Natl. Acad. Sci., 103:19402–19406, 2006.

[159] K. E. Lohmueller, A. Albrechtsen, Y. Li, S. Y. Kim, T. Korneliussen, N. Vinckenbosch, G. Tian, E. Huerta-Sanchez, A. F. Feder, N. Grarup, T. Jørgensen, T. Jiang, D. R. Witte, A. Sandbæk, I. Hellmann, T. Lauritzen, T. Hansen, O. Pedersen, J. Wang, and R. Nielsen. Natural selection affects multiple aspects of genetic variation at putatively neutral sites across the human genome. PLoS Genet, 7(10):e1002326, Oct 2011.

[160] K. Lönnroth, B. G. Williams, P. Cegielski, and C. Dye. A consistent log-linear relationship between tuberculosis incidence and body mass index. Int J Epidemiol, 39(1):149–55, Feb 2010.

[161] M. Lux. Medicine that walks: Disease, medicine and Canadian Plains Native people, 1880-1940. University of Toronto Press, 2001.

[162] M. Lynch, B. Walsh, et al. Genetics and analysis of quantitative traits. Sinauer Sunderland, MA, 1998.

[163] T. MacCarthy and A. Bergman. Coevolution of robustness, epistasis, and recombination favors asexual reproduction. Proc. Natl. Acad. Sci., 104(31):12801–12806, 2007.

[164] P. Mangtani, D. J. Jolley, J. M. Watson, and L. C. Rodrigues. Socioeconomic deprivation and notification rates for tuberculosis in London during 1982-91. BMJ, 310(6985):963–6, Apr 1995.

[165] R. Mani, R. P. St Onge, J. L. I. Hartman, G. Giaever, and F. P. Roth. Defining genetic interaction. Proc. Natl. Acad. Sci., 105(9):3461–66, 2008.

[166] J. Marchini and B. Howie. Genotype imputation for genome-wide association studies. Nat Rev Genet, 11(7):499–511, Jul 2010.

[167] P. Marjoram and S. Tavare. Modern computational approaches for analysing molecular genetic variation data. Nat Rev Genet, 7(10):759–770, Oct 2006.

[168] B. McEvoy, S. Beleza, and M. D. Shriver. The genetic architecture of normal variation in human pigmentation: an evolutionary perspective and model. Hum Mol Genet, 15 Spec No 2:R176–81, Oct 2006.

[169] C. Y. McLean, D. Bristor, M. Hiller, S. L. Clarke, B. T. Schaar, C. B. Lowe, A. M. Wenger, and G. Bejerano. GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol, 28(5):495–501, May 2010.

[170] M. Metspalu, I. G. Romero, B. Yunusbayev, G. Chaubey, C. B. Mallick, G. Hudjashov, M. Nelis, R. Mägi, E. Metspalu, M. Remm, R. Pitchappan, L. Singh, K. Thangaraj, R. Villems, and T. Kivisild. Shared and unique components of human population structure and genome-wide signals of positive selection in South Asia. Am J Hum Genet, 89(6):731–44, Dec 2011.

[171] A. B. Migliano, L. Vinicius, and M. M. Lahr. Life history trade-offs explain the evolution of human pygmies. Proc Natl Acad Sci U S A, 104(51):20216–9, Dec 2007.

[172] J. Milloy. A national crime: The Canadian government and the residential school system, 1879 to 1986, volume 11. University of Manitoba Press, 1999.

[173] Y. Moodley, B. Linz, Y. Yamaoka, H. M. Windsor, S. Breurec, J.-Y. Wu, A. Maady, S. Bernhöft, J.-M. Thiberge, S. Phuanukoonnon, G. Jobb, P. Siba, D. Y. Graham, B. J. Marshall, and M. Achtman. The peopling of the Pacific from a bacterial perspective. Science, 323(5913):527–530, Jan 2009.

[174] J. H. Moore and S. M. Williams. Epistasis and its implications for personal genetics. Am J Hum Genet, 85(3):309–20, Sep 2009.

[175] J. Moore. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum. Hered., 56:73–82, 2003.

[176] J. Moore and S. Williams. Traversing the conceptual divide between biological and statistical epistasis: systems biology and a more modern synthesis. BioEssays, 27(6):637–646, 2005.

[177] P. A. Moran. Wandering distributions and the electrophoretic profile. Theor Popul Biol, 8(3):318–30, Dec 1975.

[178] A. Morris. Isolation and the Origin of the Khoisan: Late Pleistocene and early Holocene human evolution at the southern end of Africa. Human Evolution, 17(3-4):231–240, 2002.

[179] S. Myles, K. Tang, M. Somel, R. E. Green, J. Kelso, and M. Stoneking. Identification and analysis of genomic regions with large between-population differentiation in humans. Ann Hum Genet, 72(Pt 1):99–110, Jan 2008.

[180] S. Myles, M. Somel, K. Tang, J. Kelso, and M. Stoneking. Identifying genes underlying skin pigmentation differences among human populations. Hum Genet, 120(5):613–621, Jan 2007.

[181] Natural Resources Canada. Posts of the Canadian fur trade. Available at http://atlas.nrcan.gc.ca/site/english/maps/archives/4thedition/historical/079_80. Accessed September 29, 2010., 1974.

[182] Natural Resources Canada. Lakes, Rivers and Names of Canada. Available at http://atlas.nrcan.gc.ca/site/english/maps/reference/outlinecanada/canada06. Accessed September 29, 2010., 2007.

[183] A. N’Diaye, G. K. Chen, C. D. Palmer, B. Ge, B. Tayo, R. A. Mathias, J. Ding, M. A. Nalls, A. Adeyemo, V. Adoue, C. B. Ambrosone, L. Atwood, E. V. Bandera, L. C. Becker, S. I. Berndt, L. Bernstein, W. J. Blot, E. Boerwinkle, A. Britton, G. Casey, S. J. Chanock, E. Demerath, S. L. Deming, W. R. Diver, C. Fox, T. B. Harris, D. G. Hernandez, J. J. Hu, S. A. Ingles, E. M. John, et al. Identification, replication, and fine-mapping of loci associated with adult height in individuals of African ancestry. PLoS Genet, 7(10):e1002298, Oct 2011.

[184] A. C. Need and D. B. Goldstein. Next generation disparities in human genomics: concerns and remedies. Trends Genet, 25(11):489–94, Nov 2009.

[185] D. Nguyen, P. Brassard, D. Menzies, L. Thibert, R. Warren, S. Mostowy, and M. Behr. Genomic characterization of an endemic Mycobacterium tuberculosis strain: evolutionary and epidemiologic implications. J Clin Microbiol, 42(6):2573–2580, Jun 2004.

[186] D. Nguyen, J.-F. Proulx, J. Westley, L. Thibert, S. Dery, and M. A. Behr. Tuberculosis in the Inuit community of Quebec, Canada. Am J Respir Crit Care Med, 168(11):1353–7, Dec 2003.

[187] R. Nielsen and J. Wakeley. Distinguishing migration from isolation: a Markov chain Monte Carlo approach. Genetics, 158(2):885–896, Jun 2001.

[188] R. Nielsen, I. Hellmann, M. Hubisz, C. Bustamante, and A. G. Clark. Recent and ongoing selection in the human genome. Nat Rev Genet, 8(11):857–868, Nov 2007.

[189] H. L. Norton, J. S. Friedlaender, D. A. Merriwether, G. Koki, C. S. Mgone, and M. D. Shriver. Skin and hair pigmentation variation in Island Melanesia. Am J Phys Anthropol, 130(2):254–68, Jun 2006.

[190] J. Novembre and E. Han. Human population structure and the adaptive response to pathogen-induced selection pressures. Philos Trans R Soc Lond B Biol Sci, 367(1590):878–86, Mar 2012.

[191] U. Nübel, J. Dordel, K. Kurt, B. Strommenger, H. Westh, S. K. Shukla, H. Zemlicková, R. Leblois, T. Wirth, T. Jombart, F. Balloux, and W. Witte. A timescale for evolution, population expansion, and spatial spread of an emerging clone of methicillin-resistant Staphylococcus aureus. PLoS Pathog, 6(4):e1000855, Apr 2010.

[192] Y. Okada, Y. Kamatani, A. Takahashi, K. Matsuda, N. Hosono, H. Ohmiya, Y. Daigo, K. Yamamoto, M. Kubo, Y. Nakamura, and N. Kamatani. A genome-wide association study in 19,633 Japanese subjects identified LHX3-QSOX2 and IGF1 as adult height loci. Hum Mol Genet, 19(11):2303–12, Jun 2010.

[193] S. O’Keefe and R. Lavender. The plight of modern Bushmen. The Lancet, 334(8657):255–258, 1989.

[194] S. O’Keefe, J. Rund, N. Marot, K. Symmonds, and G. Berger. Nutritional status, dietary intake and disease patterns in rural Hereros, Kavangos and Bushmen in South West Africa/Namibia. S Afr Med J, 73:643–648, 1988.

[195] T. Ota and M. Kimura. A model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population. Genet Res, 22(2):201–4, Oct 1973.

[196] S. P. Otto and M. Feldman. Deleterious mutations, variable epistatic interactions, and the evolution of recombination. Theoretical Population Biology, 51(2):134–147, 1997.

[197] X. Pan, P. Ye, D. Yuan, X. Wang, J. Bader, and J. Boeke. A DNA integrity network in the yeast Saccharomyces cerevisiae. Cell, 124:1069–1081, 2006.

[198] X. Pan, D. Yuan, D. Xiang, X. Wang, S. Sookhai-Mahadeo, J. Bader, P. Hieter, F. Spencer, and J. Boeke. A robust toolkit for functional profiling of the yeast genome. Mol. Cell, 16:487–496, 2004.

[199] E. J. Parra. Human pigmentation variation: evolution, genetic basis, and implications for public health. Am J Phys Anthropol, Suppl 45:85–105, 2007.

[200] M. Pascual, M. P. Chapuis, F. Mestres, J. Balanyà, R. B. Huey, G. W. Gilchrist, L. Serra, and A. Estoup. Introduction history of Drosophila subobscura in the New World: a microsatellite-based survey using ABC methods. Mol Ecol, 16(15):3069–3083, Aug 2007.

[201] T. J. Pemberton, C. Wang, J. Z. Li, and N. A. Rosenberg. Inference of unexpected genetic relatedness among individuals in HapMap Phase III. Am J Hum Genet, 87(4):457–464, Oct 2010.

[202] N. Penn. The forgotten frontier: colonist and Khoisan on the Cape’s northern frontier in the 18th century. Juta and Company Ltd, 2005.

[203] C. Pepperell, V. H. Hoeppner, M. Lipatov, W. Wobeser, G. K. Schoolnik, and M. W. Feldman. Bacterial genetic signatures of human social phenomena among M. tuberculosis from an Aboriginal Canadian population. Mol Biol Evol, 27(2):427–440, Feb 2010.

[204] G. H. Perry and N. J. Dominy. Evolution of the human pygmy phenotype. Trends Ecol Evol, 24(4):218–25, Apr 2009.

[205] P. Phillips. The language of gene interaction. Genetics, 149:1167–1171, 1998.

[206] P. Phillips. Epistasis - the essential role of gene interactions in the structure and evolution of genetic systems. Nature Reviews Genetics, 9:855–867, 2008.

[207] J. K. Pickrell, G. Coop, J. Novembre, S. Kudaravalli, J. Z. Li, D. Absher, B. S. Srinivasan, G. S. Barsh, R. M. Myers, M. W. Feldman, and J. K. Pritchard. Signals of recent positive selection in a worldwide sample of human populations. Genome Res, 19(5):826–837, May 2009.

[208] A. L. Price, N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick, and D. Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet, 38(8):904–9, Aug 2006.

[209] A. L. Price, N. A. Zaitlen, D. Reich, and N. Patterson. New approaches to population stratification in genome-wide association studies. Nat Rev Genet, 11(7):459–63, Jul 2010.

[210] J. K. Pritchard, M. T. Seielstad, A. Perez-Lezaun, and M. W. Feldman. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol Biol Evol, 16(12):1791–1798, Dec 1999.

[211] J. K. Pritchard and A. Di Rienzo. Adaptation - not by sweeps alone. Nat Rev Genet, 11(10):665–7, Oct 2010.

[212] J. K. Pritchard, J. K. Pickrell, and G. Coop. The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Curr Biol, 20(4):R208–R215, Feb 2010.

[213] R. J. Pruim, R. P. Welch, S. Sanna, T. M. Teslovich, P. S. Chines, T. P. Gliedt, M. Boehnke, G. R. Abecasis, and C. J. Willer. LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics, 26(18):2336–7, Sep 2010.

[214] S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M. A. R. Ferreira, D. Bender, J. Maller, P. Sklar, P. I. W. de Bakker, M. J. Daly, and P. C. Sham. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet, 81(3):559–575, Sep 2007.

[215] A. R. Quinlan and I. M. Hall. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6):841–842, March 2010.

[216] S. Ramachandran, O. Deshpande, C. C. Roseman, N. A. Rosenberg, M. W. Feldman, and L. L. Cavalli-Sforza. Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc Natl Acad Sci U S A, 102(44):15942–15947, Nov 2005.

[217] A. Ray. The Canadian fur trade in the industrial age. University of Toronto Press, 1990.

[218] A. Ray, J. Miller, and F. Tough. Bounty and Benevolence: A Documentary History of Saskatchewan Treaties, volume 23. McGill-Queen’s University Press, 2002.

[219] N. Ray, D. Wegmann, N. J. R. Fagundes, S. Wang, A. Ruiz-Linares, and L. Excoffier. A statistical evaluation of models for the initial settlement of the American continent emphasizes the importance of gene flow with Asia. Mol Biol Evol, 27(2):337–345, Feb 2010.

[220] M. Raymond and F. Rousset. GENEPOP (version 1.2): Population genetics software for exact tests and ecumenicism. Journal of Heredity, 86(3):248–249, 1995.

[221] J. L. Rees. Genetics of hair and skin color. Annu Rev Genet, 37:67–90, 2003.

[222] J. H. Relethford. Hemispheric difference in human skin color. Am J Phys Anthropol, 104(4):449–57, Dec 1997.

[223] J. F. Reyes and M. M. Tanaka. Mutation rates of spoligotypes and variable numbers of tandem repeat loci in Mycobacterium tuberculosis. Infect Genet Evol, 10(7):1046–51, Oct 2010.

[224] R. L. Riley, C. C. Mills, F. O’Grady, L. U. Sultan, F. Wittstadt, and D. N. Shivipuri. Infectiousness of air from a tuberculosis ward. Ultraviolet irradiation of infected air: comparative infectiousness of different patients. Am Rev Respir Dis, 85:511–25, Apr 1962.

[225] A. Robins. Biological perspectives on human pigmentation, volume 7. Cambridge University Press, 2005.

[226] A. Roguev, S. Bandyopadhyay, M. Zofall, K. Zhang, T. Fischer, S. R. Collins, H. Qu, M. Shales, H.-O. Park, J. Hayles, K.-L. Hoe, D.-U. Kim, T. Ideker, S. I. Grewal, J. S. Weissman, and N. J. Krogan. Conservation and rewiring of functional modules revealed by an epistasis map in fission yeast. Science, 322(5900):405–10, Oct 2008.

[227] N. A. Rosenberg, J. K. Pritchard, J. L. Weber, H. M. Cann, K. K. Kidd, L. A. Zhivotovsky, and M. W. Feldman. Genetic structure of human populations. Science, 298(5602):2381–5, Dec 2002.

[228] N. A. Rosenberg, A. G. Tsolaki, and M. M. Tanaka. Estimating change rates of genetic markers using serial samples: applications to the transposon IS6110 in Mycobacterium tuberculosis. Theoretical Population Biology, 63:347–363, 2003.

[229] K. Rothman, S. Greenland, and A. Walker. Concepts of interaction. American Journal of Epidemiology, 112(4):467–470, 1980.

[230] F. Rousset. GENEPOP’007: A complete re-implementation of the GENEPOP software for Windows and Linux. Molecular Ecology Resources, 8(1):103–106, 2008. ISSN 1755-0998.

[231] P. C. Sabeti, S. F. Schaffner, B. Fry, J. Lohmueller, P. Varilly, O. Shamovsky, A. Palma, T. S. Mikkelsen, D. Altshuler, and E. S. Lander. Positive natural selection in the human lineage. Science, 312(5780):1614–1620, Jun 2006.

[232] P. C. Sabeti, D. E. Reich, J. M. Higgins, H. Z. P. Levine, D. J. Richter, S. F. Schaffner, S. B. Gabriel, J. V. Platko, N. J. Patterson, G. J. McDonald, H. C. Ackerman, S. J. Campbell, D. Altshuler, R. Cooper, D. Kwiatkowski, R. Ward, and E. S. Lander. Detecting recent positive selection in the human genome from haplotype structure. Nature, 419(6909):832–837, Oct 2002.

[233] P. C. Sabeti, P. Varilly, B. Fry, J. Lohmueller, E. Hostetter, C. Cotsapas, X. Xie, E. H. Byrne, S. A. McCarroll, R. Gaudet, S. F. Schaffner, E. S. Lander, International HapMap Consortium, K. A. Frazer, D. G. Ballinger, D. R. Cox, D. A. Hinds, L. L. Stuve, R. A. Gibbs, J. W. Belmont, A. Boudreau, P. Hardenbol, S. M. Leal, S. Pasternak, D. A. Wheeler, T. D. Willis, F. Yu, H. Yang, C. Zeng, Y. Gao, et al. Genome-wide detection and characterization of positive selection in human populations. Nature, 449(7164):913–8, Oct 2007.

[234] P. C. Sabeti, E. Walsh, S. F. Schaffner, P. Varilly, B. Fry, H. B. Hutcheson, M. Cullen, T. S. Mikkelsen, J. Roy, N. Patterson, R. Cooper, D. Reich, D. Altshuler, S. O’Brien, and E. S. Lander. The case for selection at CCR5-Delta32. PLoS Biol, 3(11):e378, Nov 2005.

[235] R. Sanjuán, J. Cuevas, A. Moya, and S. Elena. Epistasis and the adaptability of an RNA virus. Genetics, 170:1001–1008, 2005.

[236] R. Sanjuán and S. Elena. Epistasis correlates to genomic complexity. Proc. Natl. Acad. Sci., 103:14402–14405, 2006.

[237] S. Sanna, A. U. Jackson, R. Nagaraja, C. J. Willer, W.-M. Chen, L. L. Bonnycastle, H. Shen, N. Timpson, G. Lettre, G. Usala, P. S. Chines, H. M. Stringham, L. J. Scott, M. Dei, S. Lai, G. Albai, L. Crisponi, S. Naitza, K. F. Doheny, E. W. Pugh, Y. Ben-Shlomo, S. Ebrahim, D. A. Lawlor, R. N. Bergman, R. M. Watanabe, M. Uda, J. Tuomilehto, J. Coresh, J. N. Hirschhorn, A. R. Shuldiner, et al. Common variants in the GDF5-UQCC region are associated with variation in human height. Nat Genet, 40(2):198–203, Feb 2008.

[238] S. F. Schaffner, C. Foo, S. Gabriel, D. Reich, M. J. Daly, and D. Altshuler. Calibrating a coalescent simulation of human genome sequence variation. Genome Res, 15(11):1576–83, Nov 2005.

[239] I. Schapera. The Khoisan Peoples of South Africa: Bushmen and Hottentots. Taylor & Francis, 1965.

[240] M. Schuldiner, S. Collins, N. Thompson, V. Denic, A. Bhamidipati, T. Punna, J. Ihmels, B. Andrews, C. Boone, and J. Greenblatt. Exploration of the function and organization of the yeast early secretory pathway through an epistatic miniarray profile. Cell, 123(3):507–519, 2005.

[241] M. Schuldiner, S. Collins, J. Weissman, and N. Krogan. Quantitative genetic analysis in Saccharomyces cerevisiae using epistatic miniarray profiles (E-MAPs) and its application to chromatin functions. Methods, 40(4):344–352, 2006.

[242] S. C. Schuster, W. Miller, A. Ratan, L. P. Tomsho, B. Giardine, L. R. Kasson, R. S. Harris, D. C. Petersen, F. Zhao, J. Qi, C. Alkan, J. M. Kidd, Y. Sun, D. I. Drautz, P. Bouffard, D. M. Muzny, J. G. Reid, L. V. Nazareth, Q. Wang, R. Burhans, C. Riemer, N. E. Wittekindt, P. Moorjani, E. A. Tindall, C. G. Danko, W. S. Teo, A. M. Buboltz, Z. Zhang, Q. Ma, A. Oosthuysen, et al. Complete Khoisan and Bantu genomes from southern Africa. Nature, 463(7283):943–947, Feb 2010.

[243] Schwartz et al. Cost-effective strategies for completing the interactome. Nature Methods, 6(1), 2009.

[244] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.

[245] D. Segre, A. DeLuna, G. Church, and R. Kishony. Modular epistasis in yeast metabolism. Nat. Genet., 37:77–83, 2005.

[246] M. F. Seldin, B. Pasaniuc, and A. L. Price. New approaches to disease mapping in admixed populations. Nat Rev Genet, 12(8):523–8, Aug 2011.

[247] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656, July, October 1948.

[248] B. T. Shea and R. C. Bailey. Allometry and adaptation of body proportions and stature in African pygmies. Am J Phys Anthropol, 100(3):311–40, Jul 1996.

[249] D. Shriner, A. Adeyemo, E. Ramos, G. Chen, and C. N. Rotimi. Mapping of disease-associated variants in admixed populations. Genome Biol, 12(5):223, 2011.

[250] M. D. Shriver and E. J. Parra. Comparison of narrow-band reflectance spectroscopy and tristimulus colorimetry for measurements of skin and hair color in persons of different biological ancestry. Am J Phys Anthropol, 112(1):17–27, May 2000.

[251] M. Slatkin. Gene flow in natural populations. Annual review of ecology and systematics, pages 393–430, 1985.

[252] M. Slatkin. Gene flow and the geographic structure of natural populations. Science, 236(4803):787–92, May 1987.

[253] M. Slatkin and N. Barton. A comparison of three indirect methods for estimating average levels of gene flow. Evolution, pages 1349–1368, 1989.

[254] N. H. Smith, R. G. Hewinson, K. Kremer, R. Brosch, and S. V. Gordon. Myths and misconceptions: the origin and evolution of Mycobacterium tuberculosis. Nat Rev Microbiol, 7(7):537–544, Jul 2009.

[255] R. P. St Onge, R. Mani, J. Oh, M. Proctor, E. Fung, R. W. Davis, C. Nislow, F. P. Roth, and G. Giaever. Systematic pathway analysis using high-resolution fitness profiling of combinatorial gene deletions. Nat. Genet., 39:199–206, 2007.

[256] S. M. Stanley, T. L. Bailey, and J. S. Mattick. GONOME: measuring correlations between GO terms and genomic positions. BMC Bioinformatics, 7:94, 2006.

[257] C. Stark, B. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers. BioGRID: A general repository for interaction datasets. Nucleic Acids Res., 34:D535–D539, 2006.

[258] J. Storey. A direct approach to false discovery rates. J. R. Statist. Soc. B, 64, Part 3:479–98, 2002.

[259] J. F. Storz, B. A. Payseur, and M. W. Nachman. Genome scans of DNA variability in humans reveal evidence for selective sweeps outside of Africa. Mol Biol Evol, 21(9):1800–11, Sep 2004.

[260] R. A. Sturm and D. L. Duffy. Human pigmentation genes under environmental selection. Genome Biol, 13(9):248, 2012.

[261] K. Styblo. The relationship between the risk of tuberculous infection and the risk of developing infectious tuberculosis. Bull Int Union Tuberc Lung Dis, 60(3-4):117–119, 1985.

[262] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA, 102(43):15545–50, Oct 2005.

[263] P. Supply, R. M. Warren, A.-L. Bañuls, S. Lesjean, G. D. V. D. Spuy, L.-A. Lewis, M. Tibayrenc, P. D. V. Helden, and C. Locht. Linkage disequilibrium between minisatellite loci supports clonal evolution of Mycobacterium tuberculosis in a high tuberculosis incidence area. Mol Microbiol, 47(2):529–538, Jan 2003.

[264] M. M. Tanaka, A. R. Francis, F. Luciani, and S. A. Sisson. Using approximate Bayesian computation to estimate tuberculosis transmission parameters from genotype data. Genetics, 173(3):1511–1520, Jul 2006.

[265] M. M. Tanaka and N. A. Rosenberg. Optimal estimation of transposition rates of insertion sequences for molecular epidemiology. Statistics in Medicine, 20:2409–2420, 2001.

[266] A. Templeton. Population genetics and microevolutionary theory. Wiley-Liss, 2006.

[267] A. Tenesa, P. Navarro, B. J. Hayes, D. L. Duffy, G. M. Clarke, M. E. Goddard, and P. M. Visscher. Recent human effective population size estimated from linkage disequilibrium. Genome Res, 17(4):520–6, Apr 2007.

[268] J. A. Tennessen and J. M. Akey. Parallel adaptive divergence among geographically diverse human populations. PLoS Genet, 7(6):e1002127, Jun 2011.

[269] K. M. Teshima, G. Coop, and M. Przeworski. How reliable are empirical genomic scans for selective sweeps? Genome Res, 16(6):702–712, Jun 2006.

[270] The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nat. Genet., 25(1):25–29, 2000.

[271] E. E. Thompson, H. Kuttab-Boulos, D. Witonsky, L. Yang, B. A. Roe, and A. Di Rienzo. CYP3A variation and the evolution of salt-sensitivity variants. Am J Hum Genet, 75(6):1059–69, Dec 2004.

[272] K. R. Thornton, J. D. Jensen, C. Becquet, and P. Andolfatto. Progress and prospects in mapping recent selection in the genome. Heredity, 98(6):340–8, Jun 2007.

[273] K. R. Thornton and J. D. Jensen. Controlling the false-positive rate in multilocus genome scans for selection. Genetics, 175(2):737–50, Feb 2007.

[274] S. A. Tishkoff, M. K. Gonder, B. M. Henn, H. Mortensen, A. Knight, C. Gignoux, N. Fernandopulle, G. Lema, T. B. Nyambo, U. Ramakrishnan, F. A. Reed, and J. L. Mountain. History of click-speaking populations of Africa inferred from mtDNA and Y chromosome genetic variation. Mol Biol Evol, 24(10):2180–2195, Oct 2007.

[275] S. A. Tishkoff, F. A. Reed, F. R. Friedlaender, C. Ehret, A. Ranciaro, A. Froment, J. B. Hirbo, A. A. Awomoyi, J.-M. Bodo, O. Doumbo, M. Ibrahim, A. T. Juma, M. J. Kotze, G. Lema, J. H. Moore, H. Mortensen, T. B. Nyambo, S. A. Omar, K. Powell, G. S. Pretorius, M. W. Smith, M. A. Thera, C. Wambebe, J. L. Weber, and S. M. Williams. The genetic structure and history of Africans and African Americans. Science, 324(5930):1035–1044, May 2009.

[276] S. A. Tishkoff, F. A. Reed, A. Ranciaro, B. F. Voight, C. C. Babbitt, J. S. Silverman, K. Powell, H. M. Mortensen, J. B. Hirbo, M. Osman, M. Ibrahim, S. A. Omar, G. Lema, T. B. Nyambo, J. Ghori, S. Bumpstead, J. K. Pritchard, G. A. Wray, and P. Deloukas. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet, 39(1):31–40, Jan 2007.

[277] P. Tobias. On the increasing stature of the Bushmen. Anthropos, 57(3/6):801–810, 1962.

[278] A. Tong, M. Evangelista, A. Parsons, H. Xu, G. Bader, N. Page, M. Robinson, et al. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science, 294:2364–2368, 2001.

[279] A. Tong, G. Lesage, G. Bader, H. Ding, H. Xu, X. Xin, J. Young, G. Berriz, R. Brost, M. Chang, Y. Chen, X. Cheng, G. Chua, H. Friesen, D. Goldberg, J. Haynes, C. Humphries, G. He, S. Hussein, L. Ke, N. Krogan, Z. Li, J. Levinson, H. Lu, P. Menard, C. Munyana, A. Parsons, O. Ryan, R. Tonikian, T. Roberts, et al. Global mapping of the yeast genetic interaction network. Science, 303(5659):808–813, 2004.

[280] A. G. Tsolaki, A. E. Hirsh, K. DeRiemer, J. A. Enciso, M. Z. Wong, M. Hannan, Y.-O. L. Goguet de la Salmoniere, K. Aman, M. Kato-Maeda, and P. M. Small. Functional and evolutionary genomics of Mycobacterium tuberculosis: insights from genomic deletions in 100 strains. Proc Natl Acad Sci U S A, 101(14):4865–70, Apr 2004.

[281] M. C. Turchin, C. W. K. Chiang, C. D. Palmer, S. Sankararaman, D. Reich, Genetic Investigation of ANthropometric Traits (GIANT) Consortium, and J. N. Hirschhorn. Evidence of widespread selection on standing variation in Europe at height-associated SNPs. Nat Genet, 44(9):1015–9, Sep 2012.

[282] A. Typas, R. Nichols, D. Siegele, M. Shales, S. R. Collins, B. Lim, H. Braberg, N. Yamamoto, R. Takeuchi, B. Wanner, H. Mori, J. S. Weissman, N. J. Krogan, and C. Gross. High-throughput quantitative analyses of genetic interactions in E. coli. Nature Methods, 5(9):781–7, 2008.

[283] M. Urquhart and K. Buckley. Historical statistics of Canada (Toronto). B270, 1965.

[284] J. D. van Embden, M. D. Cave, J. T. Crawford, J. W. Dale, K. D. Eisenach, B. Gicquel, P. Hermans, C. Martin, R. McAdam, and T. M. Shinnick. Strain identification of Mycobacterium tuberculosis by DNA fingerprinting: recommendations for a standardized methodology. J Clin Microbiol, 31(2):406–9, Feb 1993.

[285] P. Verdu, F. Austerlitz, A. Estoup, R. Vitalis, M. Georges, S. Théry, A. Froment, S. Le Bomin, A. Gessain, J.-M. Hombert, L. Van der Veen, L. Quintana-Murci, S. Bahuchet, and E. Heyer. Origins and genetic diversity of pygmy hunter-gatherers from Western Central Africa. Curr Biol, 19(4):312–8, Feb 2009.

[286] C. Viboud, O. N. Bjørnstad, D. L. Smith, L. Simonsen, M. A. Miller, and B. T. Grenfell. Synchrony, waves, and spatial hierarchies in the spread of influenza. Science, 312(5772):447–51, Apr 2006.

[287] P. M. Visscher. Sizing up human height variation. Nat Genet, 40(5):489–90, May 2008.

[288] P. M. Visscher, W. G. Hill, and N. R. Wray. Heritability in the genomics era–concepts and misconceptions. Nat Rev Genet, 9(4):255–66, Apr 2008.

[289] A. J. Vogler, C. Keys, Y. Nemoto, R. E. Colman, Z. Jay, and P. Keim. Effect of repeat copy number on variable-number tandem repeat mutations in Escherichia coli O157:H7. J Bacteriol, 188(12):4253–4263, Jun 2006.

[290] A. J. Vogler, C. E. Keys, C. Allender, I. Bailey, J. Girard, T. Pearson, K. L. Smith, D. M. Wagner, and P. Keim. Mutations, mutation rates, and evolution at the hypervariable VNTR loci of Yersinia pestis. Mutat Res, 616(1-2):145–158, Mar 2007.

[291] B. F. Voight, S. Kudaravalli, X. Wen, and J. K. Pritchard. A map of recent positive selection in the human genome. PLoS Biol, 4(3):e72, Mar 2006.

[292] M. Wade, R. Winther, A. Agrawal, and C. Goodnight. Alternative definitions of epistasis: independence and interaction. Trends in Ecology and Evolution, 16(9):498–504, 2001.

[293] G. P. Wagner, M. D. Laubichler, and H. Bagheri-Chaichian. Genetic measurement theory of epistatic effects. Genetica, 102(103):569–580, 1998.

[294] B. Waiser, W. Waiser, J. Perret, and J. Perret. Saskatchewan: a new history. Fifth House Publishers, 2005.

[295] J. Waldram, A. Herring, and T. Young. Aboriginal health in Canada: Historical, cultural, and epidemiological perspectives. University of Toronto Press, 2006.

[296] K. Wang, M. Li, and M. Bucan. Pathway-based approaches for analysis of genomewide association studies. Am J Hum Genet, 81(6):1278–83, Dec 2007.

[297] M. N. Weedon, G. Lettre, R. M. Freathy, C. M. Lindgren, B. F. Voight, J. R. B. Perry, K. S. Elliott, R. Hackett, C. Guiducci, B. Shields, E. Zeggini, H. Lango, V. Lyssenko, N. J. Timpson, N. P. Burtt, N. W. Rayner, R. Saxena, K. Ardlie, J. H. Tobias, A. R. Ness, S. M. Ring, C. N. A. Palmer, A. D. Morris, L. Peltonen, V. Salomaa, Diabetes Genetics Initiative, Wellcome Trust Case Control Consortium, G. Davey Smith, L. C. Groop, A. T. Hattersley, et al. A common variant of HMGA2 is associated with adult and childhood height in the general population. Nat Genet, 39(10):1245–50, Oct 2007.

[298] D. Wegmann, C. Leuenberger, S. Neuenschwander, and L. Excoffier. ABCtoolbox: a versatile toolkit for approximate Bayesian computations. BMC Bioinformatics, 11:116, 2010.

[299] B. S. Weir and C. C. Cockerham. Estimating F-statistics for the analysis of population structure. Evolution, 38(6):1358, 1984. ISSN 0014-3820.

[300] L. Wells. Physical measurements of Northern Bushmen. Man, pages 53–56, 1952.

[301] WHO. Global tuberculosis control: A short update to the 2009 report. Technical report, World Health Organization, Geneva, Global Tuberculosis Control, 2009.

[302] C. O. Wilke and C. Adami. Interaction between directional epistasis and average mutational effects. Proc. R. Soc. Lond. Ser. B Biol. Sci., 268:1469–1474, 2001.

[303] S. H. Williamson, M. J. Hubisz, A. G. Clark, B. A. Payseur, C. D. Bustamante, and R. Nielsen. Localizing recent adaptive evolution in the human genome. PLoS Genet, 3(6):e90, Jun 2007.

[304] C. A. Winkler, G. W. Nelson, and M. W. Smith. Admixture mapping comes of age. Annu Rev Genomics Hum Genet, 11:65–89, Sep 2010.

[305] T. Wirth, F. Hildebrand, C. Allix-Béguec, F. Wölbeling, T. Kubica, K. Kremer, D. van Soolingen, S. Rüsch-Gerdes, C. Locht, S. Brisse, A. Meyer, P. Supply, and S. Niemann. Origin, spread and demography of the Mycobacterium tuberculosis complex. PLoS Pathog, 4(9):e1000160, 2008.

[306] S. Wong, L. Zhang, and F. P. F. Roth. Discovering functional relationships: biochemistry versus genetics. Trends in Genetics, 21(8), 2005.

[307] S. Wright. Evolution in Mendelian populations. 1931. Bull Math Biol, 52(1-2):241–95; discussion 201–7, 1990.

[308] C. H. Wyndham, N. B. Strydom, J. S. Ward, J. F. Morrison, C. G. Williams, G. A. Bredell, M. J. Vonrahden, L. D. Holdsworth, C. H. Vangraan, A. J. Vanrensburg, and A. Munro. Physiological reactions to heat of Bushmen and of unacclimatized and acclimatized Bantu. J Appl Physiol, 19:885–8, Sep 1964.

[309] G. Yan, G. Zhang, X. Fang, Y. Zhang, C. Li, F. Ling, D. N. Cooper, Q. Li, Y. Li, A. J. van Gool, H. Du, J. Chen, R. Chen, P. Zhang, Z. Huang, J. R. Thompson, Y. Meng, Y. Bai, J. Wang, M. Zhuo, T. Wang, Y. Huang, L. Wei, J. Li, Z. Wang, H. Hu, P. Yang, L. Le, P. D. Stenson, B. Li, et al. Genome sequencing and comparison of two nonhuman primate animal models, the cynomolgus and Chinese rhesus macaques. Nat Biotechnol, 29(11):1019–23, Nov 2011.

[310] J. Yang, B. Benyamin, B. P. McEvoy, S. Gordon, A. K. Henders, D. R. Nyholt, P. A. Madden, A. C. Heath, N. G. Martin, G. W. Montgomery, M. E. Goddard, and P. M. Visscher. Common SNPs explain a large proportion of the heritability for human height. Nat Genet, 42(7):565–9, Jul 2010.

[311] X. Yi, Y. Liang, E. Huerta-Sanchez, X. Jin, Z. X. P. Cuo, J. E. Pool, X. Xu, H. Jiang, N. Vinckenbosch, T. S. Korneliussen, H. Zheng, T. Liu, W. He, K. Li, R. Luo, X. Nie, H. Wu, M. Zhao, H. Cao, J. Zou, Y. Shan, S. Li, Q. Yang, Asan, P. Ni, G. Tian, J. Xu, X. Liu, T. Jiang, R. Wu, G. Zhou, et al. Sequencing of 50 human exomes reveals adaptation to high altitude. Science, 329(5987):75–8, Jul 2010.

[312] L. You and J. Yin. Dependence of epistasis on environment and mutation severity as revealed by in silico mutagenesis of Phage T7. Genetics, 160:1273–1281, 2002.

[313] N. Zaitlen and P. Kraft. Heritability in the genome-wide association era. Hum Genet, 131(10):1655–64, Oct 2012.

[314] L. A. Zhivotovsky. Estimating divergence time with the use of microsatellite genetic distances: impacts of population growth and gene flow. Mol Biol Evol, 18(5):700–709, May 2001.

[315] L. A. Zhivotovsky and M. W. Feldman. Microsatellite variability and genetic distances. Proc Natl Acad Sci U S A, 92(25):11549–11552, Dec 1995.

[316] W. Zhong and P. Sternberg. Genome-wide prediction of C. elegans genetic interactions. Science, 311(5766):1481–1484, 2006.