Bárbara Domingues Bitarello

"Seleção balanceadora no genoma humano: relevância biológica e consequências deletérias"

“Balancing selection in the human genome: biological relevance and deleterious consequences”

Advisor: Diogo Meyer

São Paulo 2016

Thesis presented to the Instituto de Biociências of the Universidade de São Paulo in fulfillment of the requirements for the degree of Doctor of Science in Biology (Genetics).

Cataloging Record

Domingues Bitarello, Bárbara
"Seleção balanceadora no genoma humano: relevância biológica e consequências deletérias". 299 pages. Thesis (Doctorate) - Instituto de Biociências da Universidade de São Paulo. Departamento de Genética e Biologia Evolutiva.

1. Molecular Evolution;
2. Human Evolution;
3. Balancing Selection;
4. Adaptive Evolution;
5. Population Genetics;
6. Population Genomics;
7. Genetic Load

I. Universidade de São Paulo. Instituto de Biociências. Departamento de Genética e Biologia Evolutiva.

Examination Committee:

Prof. Dr. Prof. Dr.

Prof. Dr. Prof. Dr.

Prof. Dr. Diogo Meyer

In memory of Maria Gabriela Duarte Macêdo (Kikita).

“We speak not only to tell other people what we think, but to tell ourselves what we think. Speech is a part of thought.” – Oliver Sacks

Acknowledgments

To my timeless friends, whom I hardly ever see but who have cheered for me at every step of my academic life so far: Laura Prado, Marcela Combat, Poliana Cardoso, Denise Nogueira, Luciana Matta, Ramon Vitral. I want to thank Laura Prado for listening to me and for giving me many useful tips during the final months of the doctorate.

I have had the privilege of working in a very pleasant environment (the Porão). I thank Pato (Guilherme Garcia) for giving me many tips on formatting the thesis and for sharing his LaTeX template (and Débora Brandt as well): thanks to you I was able to produce exactly what I wanted without wasting time. I thank Daniela Rossoni, Bárbara Tafinha, Ana Paula Assis, and Anna Penna for their friendship. To my colleague Gustavo Franca, for the many tips he gave me near the end of the doctorate, as well as the positive reinforcement. I thank Diog(R)o Melo for helping me with Linux issues. Finally, I am grateful for the opportunity to work with everyone in the groups of Professor Gabriel Marroig and Professor Tatiana Torres, as well as the other groups that take part in the Evolução no Porão meetings.

To my colleagues in the Evolutionary Genetics group: I am very grateful to all of you. It was a pleasure to work with a group as collaborative as ours and to see how much we were able to grow together. To Vitor Aguiar, for listening to me a lot (really a lot) and for always being so kind. To Jônatas, for his generosity with his scripts and for introducing me to Tia das Massas. I am grateful for how much he helped me become a better programmer and for his enormous contribution to the analyses in Chapter 2. I thank Limão for encouraging my musical hobbies and for helping to find errors in the data used in Chapter 2. To Maria Helena Maia, who was like a sister during my master's and the beginning of my doctorate. I especially thank Kelly Nunes and Débora Brandt. To Kelly, for being my wise on-call postdoc and for always conveying calm and enthusiasm when I needed it. Thank you especially for reading many parts of the thesis with care and for helping me greatly to improve it. To Débora Brandt, thank you for your friendship, your generosity, and how much you helped me with the quality of the data I analyzed, as well as for your help correcting several passages of the thesis. Some other people I would like to thank: Caroline Lima, Carolina Malcher, Rodrigo dos Santos Francisco.

To the friends/colleagues/collaborators I made in Leipzig: Cesare de Filippo, João Teixeira, Michael Dannemann, André Strauss, Sandra Oliveira, Diana LeDuc, Fabrizio Manfezzoni, Felix Key, Petra Korlevic. Not only did I learn from you, you also made my stay in Leipzig better. To Stéphane Peyregné, who revealed to me that working while listening to Disney soundtracks (thank you too, Disney) increases productivity. To Annalisa Schmidt, for being a very present friend during my year in Leipzig. It would have been much less fun without you there.

I would especially like to thank the researcher Aida Andrés. I spent a year with her group at the Max Planck Institute for Evolutionary Anthropology, where I learned more than I could have foreseen. I am especially grateful for the trust she placed in me from the start, and also for her calm. During this stage of the work I effectively had two advisors. It is a privilege that not every doctoral student has, and I am very grateful.

I would like to thank some professors and/or members of my qualifying examination committee who I believe contributed greatly to my training throughout graduate school: Tatiana T. Torres, Walter A. Neves, Gabriel Marroig, Paulo Otto. I am grateful to Professor Eduardo Tarazona, who introduced me to Diogo Meyer.

To my advisor, Diogo Meyer, thank you for helping me learn as much as possible over these seven (!) years of graduate school at USP and for the opportunities you gave me. I arrived here without knowing much, except that I wanted to study human population genetics, and you made it possible for me to study exactly what I wanted. Completing this stage of my training is a "dream" I have nurtured since I was very young, and you were a very important person along this path.

To Ale Chris, on whom I could count absolutely every time I needed to. I am eternally grateful to have her as a friend, and for her enormous generosity. I also thank Gisele Melo for her friendship.

To Klervia Jaouen, thank you for your unshakable confidence in me, for your patience and understanding, and for your willingness to help me, always, whether by listening to presentation rehearsals or by reading what I wrote (and all of this in Portuguese, your third, and still incipient, language). Thank you for making me happy and for making me want to be a better person, always.

To my grandmother, Tê, for all the support she has always given me. To my parents, Bia and Flávio, and to my second parents, Beth and Joe: thank you for being great examples for me, all of you, and for always encouraging me to pursue this career. I thank my mother for reading and commenting on the introduction (I know it was not easy) and for being understanding and always helping me deal with academic life. Finally, I thank my sister, who was very understanding of my need for seclusion in the last few months and who always kept a comforting interest in scientific and nerdy things.

To everyone I may have forgotten to thank, thank you.

Finally, I thank the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) for funding my doctorate, including the period I spent in Leipzig.


Abstract

Balancing selection is an evolutionary process that encompasses several mechanisms: heterozygote advantage, negative frequency-dependent selection, selective pressure that fluctuates in time or in space, and some cases of pleiotropy. The study of these mechanisms per se has been, and still is, a topic of great interest for evolutionary biologists, and has shaped the study of evolution throughout the last century. Before the neutral theory of molecular evolution was proposed, balancing selection was believed to be pervasive. The realization that much of the observed genetic diversity could be explained by neutral evolution thus motivated a better understanding of balancing selection as a selective regime capable of maintaining adaptive variants in populations.

The study of balancing selection, in its early stages, was restricted to organisms that could be manipulated in the laboratory. With the advent of methods that allowed quantification of genetic variation (such as protein electrophoresis, small-scale sequencing, and genome-wide re-sequencing of thousands of individuals), human variation started to be actively studied and interpreted. Several studies have looked for signatures of natural selection, i.e., patterns of genomic variation that selective regimes leave in the genome, and evaluated their significance by comparing them to what would be expected under a strictly neutral scenario. Most of these efforts focused on the study of positive selection, thought of as the prime mechanism responsible for adaptive evolution.

Only a few studies have looked for signatures of balancing selection in the human genome. This is partly due to the paucity of powerful methods to detect its signatures. Moreover, previous studies either did not analyze data on a genomic scale or focused primarily on protein-coding regions. Here, we describe a powerful and simple method to detect signatures of balancing selection. In humans, it outperforms other methods commonly used to detect such signatures and could in theory be used for other species, provided that its power is evaluated for each species through neutral simulations. Our method ("Non-Central Deviation", NCD) has two versions: NCD2, which requires polymorphism information on the ingroup species, as well as divergence information between the ingroup and an outgroup species, and NCD1, which only requires the ingroup information. Although NCD2 is more powerful for humans, NCD1 can be used for species that lack information from an outgroup.

When applying NCD2 to human data, using chimpanzee as the outgroup, we found more than 200 protein-coding regions with strong signatures of balancing selection, only 1/3 of which had prior evidence for balancing selection. There was also an enrichment for several gene ontology categories, approximately half of which are related to immunity. We also found that among genes with evidence for balancing selection there was an excess of cases of preferential expression in specific tissues, such as "adrenal" and "lung", and an excess of genes with mono-allelic expression. Overall, we found that selected regions of the human genome include both coding and regulatory sites. We failed to find a marked excess of balancing selection in regulatory regions, as reported in previous studies. Finally, we found an excess of nonsynonymous versus synonymous polymorphisms within the selected genes.

Having documented the occurrence of balancing selection in the human genome and identified genes that were potential targets of this selective regime, we next investigated the evolutionary consequences of this process. We hypothesized that balancing selection acting on a site reduces the efficiency with which purifying selection purges deleterious variants at nearby sites. This process is a consequence of how the dynamics of selection at one locus, mediated by linkage, can interfere with the frequencies of adjacent non-neutral sites. We tested this hypothesis by examining whether the genes under balancing selection show an excess of deleterious variants with respect to expectations derived from the remainder of the genome. Using three different metrics to determine deleteriousness, we identified a significant excess of deleterious variants within balanced genes, and we show that this pattern cannot be attributed to confounding factors. This finding shows that, together with the benefits associated with adaptive variation, balancing selection is increasing the burden of deleterious mutations in the human genome.

Overall, our findings suggest that balancing selection likely maintains variation in a myriad of biological processes other than immunity and that it has been more common in the human genome than previously thought, affecting 1-8% of human protein-coding genes, as well as a number of non-protein-coding regions. Moreover, balancing selection appears to be important to human evolution not only because of its influence on fitness, but also because it has been an important force shaping current human genetic diversity and susceptibility to disease.

Keywords: molecular evolution, human evolution, balancing selection, adaptive evolution, population genetics, population genomics, genetic load

Contents

Prologue

General Introduction
  Balancing selection: concept, mechanisms, and importance
    Why study the mechanisms that maintain genetic variability?
    The neutral theory of molecular evolution
    Mechanisms for maintaining adaptive diversity
  Adaptive evolution in the human genome
    Signatures of balancing selection
    Balancing selection in the human genome
  Genetic load induced by balancing selection
    Genetic load
  Relevance, Questions & Hypotheses
    Relevance
    Questions & Hypotheses
  Bibliography

1 Searching for targets of balancing selection in the human genome
  Preliminary Remarks
  Introduction
  Results
    NCD Method
    Power of the NCD statistics to detect LTBS
    Identifying signatures of LTBS in the human genome
    Reliability of significant and outlier windows
    Non-random distribution across the genome
    The biological pathways influenced by LTBS
    Overlap of significant windows across populations
    The putative function of balanced SNPs
    The top candidate genes
  Discussion
    NCD Method
    Pervasiveness of LTBS in the human genome
    Protein-coding and intergenic targets
    The frequency of the balanced allele(s)
    The candidate genes
  Conclusions
  Materials and Methods
    Simulations
    Power analyses
    Human population genetic data and filtering
    Identifying signatures of LTBS
    Enrichment Analyses
  References
  Supplementary Text
    S1 Text: Additional analyses for significant and outlier windows and genes
    S2 Text: A set of significant genes
    S3 Text: Manual verification of reliability of SNPs contained in four of the outlier genes
  Supplementary Tables
  Supplementary Figures

2 Accumulation of deleterious mutations in genes that were targets of long-term balancing selection in humans
  Preliminary Remarks
  Introduction
  Methods
    Population datasets
    Targets of balancing selection
    Annotation
    Quantifying genetic load
    Re-sampling control SNPs
  Results
    The site frequency spectrum of balanced genes
    Measures of deleteriousness correlate negatively with allelic frequency
    Extreme values for HLA SNPs
    Increased nonsynonymous to synonymous SNPs in balanced genes
    Increased proportion of damaging to synonymous SNPs in balanced genes
    Increased C-scores in balanced genes
  Discussion and Conclusions
    Increased genetic load in balanced genes
    The challenges of quantifying genetic load
    Sheltered load and hitch-hiking
  References

Final Remarks and Perspectives
  Balancing selection in the human genome
    Development and evaluation of a new method for detecting signatures of balancing selection
    Prevalence of LTBS in the human genome
    Sharing across continents
    Characteristics of the candidate regions
  Deleterious variation in regions and genes with signatures of LTBS
  Perspectives
    Reconciling signatures of selection and phenotypes
    Potential of the NCD statistics in future studies
  Bibliography

Appendices
  Appendix A.1
  Appendix A.2
  Appendix A.3
  Appendix A.4

Prologue

There are several sources of evidence for the action of natural selection in the human genome. Natural selection can be directional, increasing or decreasing the frequency of advantageous or deleterious variants (positive or negative selection, respectively), or balancing. Positive selection has been widely investigated for at least two decades in the form of "genomic scans" and is regarded as the main mechanism of adaptive evolution. It is estimated that 2-14% of the human genome has been targeted by this selective regime, on various timescales.

Unlike positive selection, which reduces genetic diversity, balancing selection maintains genetic diversity in populations. Despite its relevance, few studies to date have explored its targets in the human genome.

Here, I set out to explore the importance of balancing selection throughout human evolution: its targets in the genome, the properties of those targets, and the possible deleterious consequences in the vicinity of balanced polymorphisms.

First, I sought to survey the targets of balancing selection in the human genome as completely as possible. To that end, we developed a new statistical tool (NCD, Non-Central Deviation), optimized for detecting signatures of balancing selection in humans. This tool was applied to genomic data from four populations in order to map the action of this selective regime in humans: its targets, their biological and genomic properties, and the differences among populations and continents (Chapter 1).

Second, we sought to test the hypothesis that balancing selection, by maintaining polymorphisms over long timescales (millions of years), has a deleterious effect on sites close to the selected site(s). This hypothesis stems from an extensive literature on the accumulation of deleterious mutations in regions neighboring targets of positive selection and on genetic load in humans, and also from the fact that many genes under balancing selection appear to be associated with complex diseases. In addition, there is evidence that such accumulation occurs around the HLA genes that were targets of balancing selection. To explore these questions, the first chapter of the thesis deals with the detection of targets of balancing selection in the human genome, and the second with the effects of balancing selection on neighboring regions of the genome.

I also present a general introduction to the themes of the two chapters and a final discussion of the implications of the findings of the two studies. The thesis is thus divided into:

• General Introduction: an introduction to (a) the selective regime known as balancing selection (definition, historical importance for population genetics, mechanisms through which it acts, properties of the selective regime, and evolutionary importance) and (b) the concept of genetic load and the deleterious effects that natural selection leaves in regions genetically linked to the targets of selection.

• Chapter 1: in which I present a new statistical method ("NCD"), focused specifically on detecting the genomic signatures left by balancing selection regimes that persist for millions of years, as well as the results of a genomic scan performed with this method on real human population genetic data.

• Chapter 2: in which I present an investigation of the accumulation of deleterious mutations in regions that were targets of balancing selection in humans, as detected in the genomic scan presented in Chapter 1.

• Final Remarks and Perspectives: in which I discuss the main findings of the two main chapters of the thesis, as well as the perspectives for future work that may follow from our contributions.

The appendices include publications that resulted from collaborations carried out during the doctorate. Three of them (Appendices A.1, A.2, A.4) have as their main theme the HLA genes, classic examples of targets of balancing selection. Two of these focus on the signatures of balancing selection in these genes (A.2 and A.4), and the other on methodological problems with next-generation sequencing for this region of the genome (A.1). Finally, the work in Appendix A.3 deals with adaptive evolution more broadly, with a focus on tests of positive selection.

General Introduction

Balancing selection: concept, mechanisms, and importance

Understanding how the phenotypic variation observed in natural populations arose and has been maintained, as well as its possible functional implications, is one of the central goals of population genetics (Kimura, 1983; Dobzhansky, 1937). It is therefore not surprising that elucidating the mechanisms by which genetic variability at the molecular level is maintained in populations is a "big question": perhaps the most important problem in evolutionary biology and population genetics (Kimura, 1983), marked by heated debates (Figure 1).

Why study the mechanisms that maintain genetic variability?

"Hybrid vigor", or heterosis, has been observed for centuries. Mendel noticed that the hybrid peas in his studies had a greater mean height than those of the parental lines (Crow, 1987), and Darwin wrote an entire book on hybrid vigor in plants (Darwin, 1876)[1]. Although frequently observed by plant and animal breeders, this phenomenon could only be correctly interpreted after the rediscovery of Mendel's laws at the beginning of the 20th century (reviewed in Crow, 1987), when it was definitively established that genetic variation is one of the determinants of phenotypic variation.

Figure 1: Timeline of the study of balancing selection. Adapted from Gloss and Whiteman (2016), summarizing the main theoretical and empirical contributions to understanding the importance of balancing selection for the maintenance of genetic variation. LTBS, long-term balancing selection (see text).

In 1922, Fisher noted that, although hybrid vigor was a fact, the biological reason why any given heterozygote would be fitter than its corresponding homozygote was unclear. Fisher was the first to demonstrate how an equilibrium of frequencies between two alleles can be maintained at a locus under heterozygote advantage (see page 13). He further proposed that this phenomenon should be common in nature and could explain both hybrid vigor and the deleterious effects sometimes observed in domesticated animals subjected to inbreeding.
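The equilibrium under heterozygote advantage can be illustrated with the standard one-locus, two-allele model of viability selection. The sketch below is only an illustration under assumed notation that does not appear in the original text: genotype fitnesses of 1 - s (AA), 1 (Aa), and 1 - t (aa), with s and t between 0 and 1, random mating, and an effectively infinite population. Under these assumptions the frequency of allele A is expected to approach t / (s + t) whether it starts rare or common, which the simulation checks numerically.

```python
def next_frequency(p, s, t):
    """Return the frequency of allele A after one generation of selection.

    Assumed fitnesses (illustrative notation, not taken from the thesis):
    AA = 1 - s, Aa = 1, aa = 1 - t.
    """
    q = 1.0 - p
    mean_fitness = p * p * (1.0 - s) + 2.0 * p * q + q * q * (1.0 - t)
    # A alleles come from all surviving AA and half of the surviving Aa
    return (p * p * (1.0 - s) + p * q) / mean_fitness


def iterate_selection(p0, s, t, generations=2000):
    """Apply deterministic selection repeatedly, starting from frequency p0."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, s, t)
    return p


def equilibrium_frequency(s, t):
    """Analytical equilibrium frequency of allele A under this model."""
    return t / (s + t)


if __name__ == "__main__":
    s, t = 0.1, 0.3  # hypothetical selection coefficients
    print("analytical equilibrium:", equilibrium_frequency(s, t))
    for p0 in (0.01, 0.5, 0.99):
        print("start", p0, "-> after selection", iterate_selection(p0, s, t))
```

Because both homozygotes are less fit than the heterozygote, whichever allele is rare occurs mostly in heterozygotes and therefore increases in frequency. This is why the polymorphism persists instead of one allele going to fixation.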

From the first half of the 20th century onward, methods capable of directly assessing genetic variation began to emerge, and such variation was increasingly recognized as apparently ubiquitous in natural populations. With them came a growing interest in explaining the patterns of genetic variation (reviewed in Bamshad and Wooding, 2003; Gloss and Whiteman, 2016). Even with methods that seem quite limited to the contemporary population geneticist (because they always underestimated the true levels of genetic variability in populations), genetic variation was already being observed by the mid-20th century, and its persistence was attributed to the action of balancing selection (Dobzhansky, 1937).

The term "balancing selection" encompassed any evolutionary process that adaptively maintained genetic variation in populations. For example, Dobzhansky (1937) observed polymorphisms in the orientation of long stretches of DNA in Drosophila chromosomes, the so-called "chromosomal inversions", using chromosome staining techniques long before DNA sequencing was possible, and attributed the maintenance of such polymorphisms in nature to balancing selection (Figure 1).

[1] In that book, Darwin carried out self-pollination and cross-pollination experiments in more than 60 plant species and concluded that (in most cases) the performance of offspring resulting from self-pollination is inferior for several traits.

At that time, in the early 20th century, interest in balancing selection had multiple origins. Animal and plant breeders wanted to maximize productivity/performance, and hybrid vigor was observed but not fully understood. Dobzhansky, like Muller, was concerned with the potential of the species to continue evolving. Muller, however, had another, more "urgent" concern: the impact on future generations of the increased mutation rate caused by radiation (reviewed in Crow, 1987). These three concerns revolved around how much the genetic variability in a population depends on overdominance[2], i.e., on loci at which the heterozygote phenotype lies beyond the range of the phenotypes of the two corresponding homozygotes[3].

[2] This term, often used as a synonym for heterozygote advantage, was originally used to explain heterosis in plants (Hedrick, 2012).
[3] This is an older definition of heterozygote advantage (Crow, 1987). Others will be mentioned throughout the text.

In the 1960s, Lewontin and Hubby (1966) revealed to the scientific community that levels of genetic variation in Drosophila allozymes were much higher than the genetic variability that had been estimated in the populations analyzed until then (Figure 1). Their findings were based on a new method for detecting polymorphisms, protein electrophoresis, which made it possible to quantify even more variation than the chromosome staining techniques used by Dobzhansky in earlier decades. Balancing selection, and particularly selection that varies over time, then became established as a very popular explanation for the maintenance of polymorphisms in nature (reviewed in Bamshad and Wooding, 2003; Gloss and Whiteman, 2016).

É importante mencionar que o mecanismo principal de seleção balancea- dora invocado para explicar os níveis de variação em Drosophila era a hetero- geneidade espacial nas pressões seletivas (Levene, 1953), decorrente do fato de tais espécies ocuparem hábitats diversos: uma determinada variante não seria a mais adaptativa em todos os hábitats, assim levando à manutenção de po- limorfismos na população (Figura1). Mais tarde, Dempster (1955) mostrou que pressões seletivas que variam ao longo do tempo também podem manter variações genéticas (revisado em Gloss e Whiteman, 2016).

In the following decades, as mathematical models were developed to describe such selective regimes, empirical evidence of their occurrence in nature began to emerge (e.g., Allison, 1954). During this period there was widespread belief in a strongly “selectionist” neo-Darwinian theory, i.e., one focused on natural selection as the main mechanism capable of changing allelic and phenotypic frequencies: the great majority of variants would be deleterious, and a smaller proportion would be advantageous (Figure 2).

Neutral theory of molecular evolution

With the ever-increasing sensitivity of molecular methods in detecting levels of variability (e.g., Lewontin and Hubby, 1966), the abundance of polymorphisms⁴ in natural populations became evident, which led to extensive development of balancing selection models (Gloss and Whiteman, 2016). Figure 1 summarizes many of these contributions.

However, this enthusiasm was counterbalanced by the then recent “neutral theory of molecular evolution”. In his book, a watershed for the field of evolutionary biology⁵, Kimura presents to the scientific community the idea that the main cause of evolutionary change at the molecular level, i.e., changes in the genetic material per se, is not Darwinian positive selection but the random fixation of neutral or nearly neutral variants (Figure 2).

⁴ The presence of discrete phenotypic variants in a population is called polymorphism. “Visible” polymorphisms are, however, an underestimate of the underlying genetic diversity (Charlesworth and Charlesworth, 2010). Throughout the text, polymorphisms are different alleles existing in the population at a given locus.
⁵ “The Neutral Theory of Molecular Evolution” (1983) is the version consulted here and widely used in population genetics, although the theory was first proposed earlier (Kimura, 1968).

Figure 2: Selectionist, neutral, and nearly neutral models of molecular evolution. Figure adapted from Bromham and Penny (2003) and Bernardi (2007). In 1859, Darwin published “On the Origin of Species”, in which he described his ideas on natural selection. Darwin believed that neutral mutations might exist, but that most mutations are deleterious and few are advantageous. In the first half of the 20th century, natural selection was reconciled with the basis of the molecular mechanism of inheritance (Neo-Darwinism). In this period, the selectionist focus addressed only deleterious and advantageous mutations. In 1968 Kimura published the first version of the neutral model of molecular evolution (and an important update in 1983), and in 1973 Ohta proposed the nearly neutral model of molecular evolution. These last two authors showed that genetic drift can maintain neutral or nearly neutral variants in populations. See also Figure 1.

Since first proposing the neutral theory in 1968, Kimura faced much criticism, largely because evolutionary biology had been dominated for more than half a century by the Darwinian view that organisms become progressively adapted to their environments through the accumulation of beneficial variants (Figure 2). In addition, the theory received technical criticism, such as the difficulty of reconciling its predictions with some important aspects of the data being generated: elevated variance in rates of molecular evolution, codon usage bias, and evidence of selection in studies of specific genes, among others. These criticisms forced changes in the theory, which was revised over time (Kimura, 1968; Kimura, 1983; Kimura, 1991). Another interesting development was the proposal of the “nearly neutral” theory. Unlike the original neutral theory, which dealt only with neutral and strongly deleterious polymorphisms, it also incorporated weakly deleterious (Ohta, 1973) and beneficial polymorphisms (Ohta, 1995; Ohta and Gillespie, 1996; Figure 2).

According to the neutral theory, neutral variants, introduced by mutation, could rise in frequency stochastically. Kimura (1968, 1983) pointed out that, to contribute to adaptation, mutations need to be more than beneficial: they need to escape loss by drift, especially when rare. Taking both factors into account, Kimura concluded that mutations of intermediate effect are the most likely to contribute to adaptation (reviewed in Orr, 2005). Additionally, he concluded that most intraspecific molecular variability, such as that manifested in protein polymorphisms, is essentially neutral, so that most polymorphic alleles are maintained in the species by a balance between mutation and random extinction of alleles (Kimura, 1983).

Throughout the 1960s and 1970s, under the influence of Kimura's ideas, evolutionary geneticists became increasingly convinced that much, if not most, of molecular evolution reflects the fixation of neutral or nearly neutral mutations rather than beneficial ones. The theory seemed able to explain, through stochastic processes, most of the variation observed within populations. During this period, the theoretical study of the mechanisms of adaptive evolution (positive and balancing selection) declined considerably (Orr, 2005) (Figures 1 and 2).

It is interesting to note, however, that the neutral theory does not oppose the notion that the evolution of form and function may be guided by Darwinian selection; rather, it highlights another aspect of the evolutionary process by emphasizing the crucial role that mutational pressures and genetic drift play at the molecular level. Kimura (1983) defines the neutral theory as:

“(...) the theory that at the molecular level evolutionary changes and polymorphisms are mainly due to mutations that are nearly enough neutral with respect to natural selection that their behavior and fate are mainly determined by mutation and random drift.” (Kimura, 1983; first chapter).

The theory regards polymorphisms at the protein and DNA levels as transient phases of molecular evolution and rejects the notion that most of these polymorphisms are adaptive and maintained in the species by some form of balancing selection (Figure 2). The neutral theory therefore predicts that most genetic variants are neutral; it does not, on the other hand, reject the possibility that functional genetic variants (those that affect phenotypes and that, within its model, represent a minority of the variation existing in nature) could be maintained by balancing selection (Kimura, 1983; Gloss and Whiteman, 2016).

With this (non-exhaustive) history of the topic of diversity maintenance in populations, I have sought to present the concepts of, and the distinction between, two kinds of polymorphism: neutral polymorphisms, which are maintained with a certain probability simply by the combined effects of mutation, drift, “hitch-hiking”⁶, and migration; and adaptive polymorphisms, which are maintained by one or more mechanisms of balancing selection⁷. Next, I detail each of these mechanisms.

Mechanisms for the maintenance of adaptive diversity

Classical evolutionary thinking did not anticipate a level of variation that would imply the existence of many cases of balanced polymorphisms (Figure 1), i.e., those maintained at intermediate frequencies. This view was challenged by the discovery of balanced polymorphisms in nature. That discovery was an important contribution made in the early days of population genetics, and it greatly improved the understanding of evolution (Charlesworth and Charlesworth, 2010).

⁶ Genetic hitch-hiking is the process through which neutral (or, in some cases, deleterious) mutations change in frequency in a population because of genetic linkage to a selected mutation (reviewed in Cutter and Payseur, 2013). This topic is addressed in greater detail in the section “Genetic load induced by balancing selection” and also in Chapter 2.
⁷ Although positive selection also acts on advantageous or adaptive mutations, it tends to fix such advantageous variants in the population and therefore reduces, rather than maintains, genetic variability.

Levels of genetic variation, it is now known, are influenced by demographic factors such as fluctuations in population size, population structure, admixture, and migration (Tishkoff and Williams, 2002)⁸. On the other hand, patterns of diversity vary along the genome for several reasons, including mutation rates, recombination rates, gene conversion⁹, and selective processes (Tishkoff and Williams, 2002). The class of selective processes that lead to an increase in advantageous genetic diversity is known as “balancing selection” (Andrés, 2011; Key et al., 2014). Such polymorphisms are referred to as “balanced polymorphisms” (Charlesworth and Charlesworth, 2010).

Balancing selection is a term that encompasses several mechanisms, which are discussed below.

Constant fitnesses

Balancing selection regimes are often modeled using genotype fitnesses without specifying the underlying causes of the phenotypic differences¹⁰. Although this is a useful strategy, and in practice many of the genomic signatures left by the different mechanisms of balancing selection are the same, it is important to seek to understand which biological aspects of an organism are able to determine its fitness (Charlesworth and Charlesworth, 2010).

⁸ However, although there is stochasticity, on average demography affects the genome as a whole, and not only specific regions.
⁹ A specific type of recombination that results in a non-reciprocal exchange of genetic material, in which one DNA strand is used to modify the sequence of another (Tishkoff and Williams, 2002).
¹⁰ For example, see the methods section of Chapter 1.

The most basic model of natural selection assumes constant genotype fitnesses (Figure 3A). For a diploid, bi-allelic system, the fitnesses of genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ are given by $w_{11}$, $w_{12}$, and $w_{22}$, respectively. In diploid systems, it is common to use the concept of marginal fitnesses of alleles $A_1$ and $A_2$ to refer to the average fitness of each allele over all genotypes in which it appears (i.e., homozygote and heterozygote).

In the scenario considered here, although $w_{11}$, $w_{12}$, and $w_{22}$ do not change over time, the marginal fitnesses still depend on the allele frequencies. It is therefore possible for the two marginal fitnesses to become equal and for the allele frequencies to stop changing, reaching a stable equilibrium of frequencies. The frequency of each allele at equilibrium is called the equilibrium frequency¹¹. Interestingly, the absolute values of the genotype fitnesses are irrelevant: only the relative values enter the selection equations. One can therefore define one genotype (e.g., the heterozygote) as the “standard” and express the fitnesses of the other genotypes relative to it, as presented below (Charlesworth and Charlesworth, 2010).

Heterozygote advantage. Following the previous model and using relative genotype fitnesses, a scenario of heterozygote advantage with two alleles $A_1$ and $A_2$ at frequencies p and q can be modeled as follows¹²:

\[
\begin{array}{ccc}
A_1A_1 & A_1A_2 & A_2A_2 \\
1 - s & 1 & 1 - t
\end{array}
\]

where s and t are the selection coefficients against the homozygotes $A_1A_1$ and $A_2A_2$, respectively.

The marginal fitnesses of $A_1$ and $A_2$ are $1 - ps$ and $1 - qt$. When they are equal ($qt = ps$), there is a polymorphic equilibrium, and the equilibrium frequency is given by:

\[
p_{eq} = \frac{t}{t + s}; \qquad q_{eq} = 1 - p_{eq}
\]

¹¹ For bi-allelic loci, the equilibrium frequency can be defined as the frequency of the less frequent allele.
¹² The first demonstration and discussion of how a polymorphism can be maintained by selection, in a form quite similar to the one presented here, was given in the paper entitled “On the dominance ratio” by Fisher (1922). See Figure 1.


Figure 3: One possible strategy for identifying instances of balancing selection is to measure fitness differences among genotypic classes. In (A), we have a scenario of overdominance, or heterozygote advantage, in which the genotype fitnesses are constant but that of the heterozygote is always higher than that of both homozygotes. In this example, the homozygotes have distinct fitnesses (asymmetric overdominance). In (B), we have a scenario of fitnesses that vary across space (for example, different habitats occupied by a species) or over time (when a homozygous genotype has higher fitness at a given moment and lower fitness in subsequent generations). Figure adapted from Key et al. (2014).

Through heterozygote advantage (also commonly called overdominance), polymorphisms can be maintained in a population because of the higher fitness of the heterozygous genotype relative to the two homozygous genotypes, which leads to a balance of frequencies between the two variants (Andrés, 2011; Fijarczyk and Babik, 2015; Key et al., 2014). The equilibrium frequency will be 0.5 when s = t, an unlikely scenario known as symmetric overdominance (Figure 4A), and will differ from 0.5 when s ≠ t (Figure 4B). The frequency at which qt = ps is reached regardless of the initial allele frequency (Figure 4), i.e., the equilibrium is stable. In Chapter 1, we propose a summary statistic that uses many of these properties to detect signatures of balancing selection in genomic data (see, for example, Figures 1 and 6 in Chapter 1).
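The stability of this equilibrium can be illustrated with a minimal deterministic sketch. It assumes an infinite, randomly mating population and relative fitnesses 1 − s, 1, and 1 − t for A1A1, A1A2, and A2A2; the function names and the values of s and t are illustrative choices, not taken from the thesis.

```python
def next_freq(p, s, t):
    """One generation of selection on the frequency p of allele A1.

    Assumed relative fitnesses: A1A1 = 1 - s, A1A2 = 1, A2A2 = 1 - t.
    """
    q = 1.0 - p
    w11, w12, w22 = 1.0 - s, 1.0, 1.0 - t
    # Mean fitness under Hardy-Weinberg genotype proportions.
    w_bar = p * p * w11 + 2.0 * p * q * w12 + q * q * w22
    # Marginal fitness of A1 times its frequency, normalized by the mean.
    return p * (p * w11 + q * w12) / w_bar


def iterate(p0, s, t, generations=2000):
    """Iterate the recursion from initial frequency p0 (no drift)."""
    p = p0
    for _ in range(generations):
        p = next_freq(p, s, t)
    return p


if __name__ == "__main__":
    s, t = 0.35, 0.15  # illustrative values; predicted equilibrium t / (s + t)
    for p0 in (0.01, 0.5, 0.99):
        print(p0, iterate(p0, s, t), t / (s + t))
```

Under these assumptions, all three starting frequencies end at the same value, t/(s + t). Drift, which the caption of Figure 4 mentions as a caveat, is deliberately left out of this sketch.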

An interesting consequence of the model above is that, if an initially rare allele $A_2$ enters a population initially fixed for $A_1$ (by migration or by mutation), the proportion of $A_2A_2$ genotypes produced by random mating ($q^2$) is very low compared with that of heterozygotes¹³. Because $A_2A_2$ genotypes are initially very rare, the only condition compatible with an increase in the frequency of $A_2$ is that $A_1A_2$ has higher fitness than $A_1A_1$, even if $A_2A_2$ individuals have greatly reduced fitness. In that case, $A_2$ will increase in frequency only up to a certain point, since $A_2A_2$ is deleterious.

With constant genotype fitnesses (Figure 3A), balancing selection acting on a single locus always maximizes the mean fitness of a randomly mating population, even though, in the case of heterozygote advantage, homozygotes with lower fitness are generated by Mendelian segregation every generation (Charlesworth and Charlesworth, 2010).

¹³ Because p ≈ 1 and 2pq ≈ 2q.

Because of this property, Sellis et al. (2011) demonstrated, theoretically, that this is likely to be the trajectory of most adaptive mutations in diploids. Such a “balanced” state would be, by definition, very frequent and short-lived, and would leave no signatures detectable by methods focused on signatures of long-term balancing selection. On the other hand, it would maintain adaptive diversity in populations, even if only in the short term, which would compensate for the loss of fitness caused by the homozygotes. This diversity could in turn be readily used by natural selection in cases of sudden changes in selective pressure¹⁴. Although this is a very interesting proposition, empirical evidence to support or reject it is lacking.

[Figure 4, panels A and B: allele frequency (y-axis, from 0 to 1) plotted against time (x-axis); the equilibrium frequency is 0.5 in panel A and 0.3 in panel B.]

Figure 4: Regardless of the initial frequency of each allele in a bi-allelic system with heterozygote advantage, an equilibrium can be reached (provided the allele survives loss by drift). If the relative fitnesses of the two homozygous genotypes with respect to the heterozygote are identical, equilibrium is reached at a frequency of 0.5 for each allele (A). If one of the homozygous genotypes has a higher relative fitness than the other, the equilibrium frequency reached will differ from 0.5 (B, where the equilibrium frequency is 0.3 for one allele and, consequently, 0.7 for the other). In Chapter 1, a real example of asymmetric overdominance is presented (page 68).

¹⁴ This phenomenon, which has been increasingly studied, is called selection on standing variation.

Mean fitness. It would seem logical to think that, in a scenario of heterozygote advantage, the inevitable production of homozygotes every generation by Mendelian segregation would lead to a reduction in the mean fitness of the population. However, this is not the case: the mean fitness of a randomly mating population with heterozygote advantage (i.e., considering all genotypes and their frequencies) reaches its maximum at equilibrium. For this reason, the equilibrium frequency in a scenario of heterozygote advantage is said to be the one that maximizes the mean fitness of the population (Charlesworth and Charlesworth, 2010; Andrés, 2011). Therefore, although the presence of low-fitness homozygotes in the case of sickle-cell anemia, for example, is very harmful to the individual, the fitness of the population as a whole is higher when malaria-resistant individuals are maintained in the population (Charlesworth and Charlesworth, 2010; Wright, 1937).
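The claim that mean fitness peaks at the equilibrium frequency can be checked numerically. The sketch below assumes relative fitnesses 1 − s, 1, and 1 − t for A1A1, A1A2, and A2A2 under random mating, so that mean fitness is 1 − sp² − tq². It scans a grid of allele frequencies for the maximum; the function names, grid resolution, and parameter values are illustrative assumptions.

```python
def mean_fitness(p, s, t):
    """Mean fitness of a randomly mating population at A1 frequency p.

    Assumed relative fitnesses: A1A1 = 1 - s, A1A2 = 1, A2A2 = 1 - t.
    """
    q = 1.0 - p
    return p * p * (1.0 - s) + 2.0 * p * q + q * q * (1.0 - t)


def argmax_mean_fitness(s, t, steps=1000):
    """Allele frequency on a regular grid that gives the highest mean fitness."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: mean_fitness(p, s, t))


if __name__ == "__main__":
    s, t = 0.35, 0.15  # illustrative values
    print("grid maximum:", argmax_mean_fitness(s, t))
    print("t / (s + t): ", t / (s + t))
```

Analytically, setting the derivative of 1 − sp² − t(1 − p)² with respect to p to zero gives p = t/(s + t), the same value as the equilibrium frequency, and the second derivative, −2(s + t), is negative, so it is a maximum.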

Antagonistic selection. Opposing selective pressures across different contexts (environments, sex of the individual, fitness components, and developmental stages) can generate antagonistic selection at the population-genetic level (Connallon and Clark, 2013; Prout, 2000). Antagonistically selected alleles are those that increase fitness in one context but decrease it in another. When the contexts are individual fitness components, the scenario is known as “antagonistic pleiotropy”, in which the same locus affects more than one trait, and an allele is adaptive for one trait and deleterious for the other (Connallon and Clark, 2013).

Consider two fitness components: fertility and survival. If the effects of the homozygote for a given mutation act in opposite directions on the two traits, and the allele that is favorable for each trait is dominant over the deleterious allele, the result is a scenario essentially indistinguishable from heterozygote advantage, in which the locus can evolve to intermediate frequencies through balancing selection. Such a scenario was first described in a study on the evolution of aging (Williams, 1957), which proposed that if an allele causes high fertility in youth and early aging, the second effect would be compensated by the first (reviewed in Charlesworth, 2000).

However, the approach to equilibrium would be slow, requiring tens of thousands of generations even when selection coefficients are not very small (Connallon and Clark, 2013), which implies that: (1) alleles close to equilibrium must be relatively old (see the discussion of time scales below); and (2) populations must in general be far from equilibrium for loci subject to this regime (Connallon and Clark, 2013).

Examples of heterozygote advantage. Because in practice it is very difficult to establish unequivocally which mechanism (or mechanisms) of balancing selection is responsible for maintaining a balanced polymorphism over long time scales, there are few uncontroversial examples of heterozygote advantage on that time scale (Charlesworth and Charlesworth, 2010).

The MHC genes (HLA in humans¹⁵) are a possible example of heterozygote advantage. Heterozygotes for alleles of genes encoding antigen-presenting molecules would be able to respond to a larger repertoire of antigens and, therefore, respond better to infections (Doherty and Zinkernagel, 1975). There is, however, a vast literature showing how difficult it is to distinguish among the mechanisms acting on MHC genes and their relative contributions¹⁶. The most likely scenario is that several mechanisms contributed, at different points in space and time, and possibly concomitantly, to the evolution of this system. Moreover, it is unlikely that heterozygote advantage alone can explain the number of alleles found at MHC genes (De Boer et al., 2004).

¹⁵ Major histocompatibility complex and human leukocyte antigen, respectively.
¹⁶ See the Introduction and Discussion of the article in Appendix A.4 for a detailed discussion of this topic.

On more recent time scales (thousands of years), the most classic example is that of the defective β-hemoglobin that causes sickle-cell anemia in homozygous carriers of the mutation, and its relationship with protection against malaria¹⁷. Many other examples of heterozygote advantage are known from fitness measurements of different genotypes, for example in domesticated animals such as pigs, dogs, horses, cats, and sheep. In many cases, traits favored by artificial selection that result in desirable phenotypes produce, as a consequence, undesirable phenotypes in the recessive homozygotes (Hedrick, 2012). An extreme example is the absence of a tail in the Manx cat breed, which is the (selected) heterozygous phenotype; in homozygosity, the mutation is lethal (Hedrick, 2012).

Non-constant fitnesses

In practice, the assumption of fitnesses that are constant over time is unrealistic, and genetic variability can be actively maintained even in the absence of any form of heterozygote advantage (Figure 3B).

¹⁷ See also page 68.

Negative frequency dependence. We again assume the genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$, and that $A_1$ is dominant, so that $A_1/-$ genotypes have a fitness different from that of $A_2A_2$. If $A_1/-$ has higher fitness than $A_2A_2$ when $A_1$ is rare, but that fitness decreases as $A_1$ increases in frequency, we have a scenario of negative frequency-dependent selection (also called rare-allele advantage). That is, the marginal fitness of an allele is negatively correlated with its frequency in the population, which leads to a stable balance of the frequencies of each variant, in which neither of the two is eliminated¹⁸ (Clarke, 1962; Charlesworth and Charlesworth, 2010). Here, the maintenance of the polymorphism does not require heterozygote advantage.

Because in this case the fitnesses depend on the genotype frequencies, the argument above that the equilibrium frequency is the one that maximizes the mean fitness of the population does not hold: here, the stable equilibrium need not, by definition, coincide with the maximum mean fitness (Charlesworth and Charlesworth, 2010).
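Both points (a polymorphism maintained without heterozygote advantage, and an equilibrium that does not maximize mean fitness) can be illustrated with a toy model. The linear fitness function below is an assumption made only for this sketch: carriers of the dominant allele A1 have fitness 1 + a(c − f), where f is the frequency of the dominant phenotype, and A2A2 has fitness 1. The parameters a and c and all function names are hypothetical.

```python
import math


def dominant_freq(p):
    """Frequency of the dominant phenotype (A1A1 plus A1A2) under random mating."""
    q = 1.0 - p
    return 1.0 - q * q


def mean_fitness(p, a, c):
    """Mean fitness when A1/- has fitness 1 + a*(c - f) and A2A2 has fitness 1."""
    f = dominant_freq(p)
    return f * (1.0 + a * (c - f)) + (1.0 - f) * 1.0


def next_freq(p, a, c):
    """One generation of selection on the frequency p of the dominant allele A1."""
    w_dom = 1.0 + a * (c - dominant_freq(p))
    # Every copy of A1 sits in an A1/- genotype, so its marginal fitness is w_dom.
    return p * w_dom / mean_fitness(p, a, c)


def iterate(p0, a, c, generations=5000):
    p = p0
    for _ in range(generations):
        p = next_freq(p, a, c)
    return p


if __name__ == "__main__":
    a, c = 0.5, 0.5  # illustrative values
    p_eq = 1.0 - math.sqrt(1.0 - c)  # frequency at which both phenotypes have fitness 1
    for p0 in (0.01, 0.99):
        print(p0, iterate(p0, a, c), p_eq)
    # Mean fitness is higher where the dominant phenotype has frequency c / 2.
    p_best = 1.0 - math.sqrt(1.0 - c / 2.0)
    print(mean_fitness(p_eq, a, c), mean_fitness(p_best, a, c))
```

In this particular model the frequencies settle where the two phenotypes have equal fitness, whereas mean fitness would be higher at a lower frequency of A1. Other fitness functions can behave differently, so this is an illustration rather than a general result.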

Examples of frequency dependence. Scenarios such as those described above are likely to be quite common in natural systems. One example is apostatic selection, in which predators exert selective pressure on their prey by “learning” as they hunt: this scenario favors prey with rare phenotypes (Clarke, 1962; Charlesworth and Charlesworth, 2010). Another example is Batesian mimicry, in which a dangerous or unpalatable model species is mimicked by another (harmless) species that is subject to predation. Rare variants of the mimic species that happen to resemble the model species benefit from being avoided by predators and increase in frequency; this increase in frequency raises the chance that a predator eats the mimic species and associates its pattern with a pleasant taste (Charlesworth and Charlesworth, 2010), thereby reducing the mimic's advantage.

¹⁸ In general, throughout the text, a locus is assumed to allow two variants, i.e., to be bi-allelic.

It is important to stress that these examples do not inevitably lead to a balanced polymorphism: that depends on the relationship between fitness and genotype frequencies. Moreover, unlike what occurs under heterozygote advantage, the equilibrium can be unstable: theoretical analyses indicate that under this type of selective regime the population may oscillate permanently around the equilibrium frequency (Charlesworth and Charlesworth, 2010).

Another important example is the self-incompatibility system: plants capable of self-pollination often have so-called “S” loci, which, together with the MHC genes in vertebrates, make up one of the most polymorphic systems in terms of number of alleles (Charlesworth and Charlesworth, 2010). In the simplest examples, a plant with S1 pollen grains cannot fertilize ovules that also carry S1 alleles; thus, all plants are heterozygous by definition, and a newly arising allele has an enormous selective advantage, since pollen carrying it can fertilize any plant in the population. In this scenario, an equilibrium could be reached if all heterozygotes had the same frequency (Charlesworth and Charlesworth, 2010).

Finally, another example is the “arms race” between hosts and parasites. In the context of acquired immunity, a common type of parasite will probably have infected a large proportion of hosts, which become immune to new infections. Thus, parasites with new types of antigens have a frequency-dependent advantage, leading to high levels of polymorphism in the parasite population (Trachtenberg et al., 2003; Borghans et al., 2004; Slade and McCallum, 1992; Spurgin and Richardson, 2010).

It is likely that part of the diversity observed at MHC genes is maintained in this way, since different variants of the genes encoding antigen-presenting molecules are able to present a particular repertoire of epitopes, and new variants arising in the host may have a frequency-dependent advantage (Trachtenberg et al., 2003; Borghans et al., 2004). This hypothesis is corroborated by the particularly high diversity in the antigen-presenting groove of MHC molecules (Garrigan and Hedrick, 2003) and by a correlation between pathogen diversity and variability in the human MHC (Prugnolle et al., 2005) but, as mentioned earlier, the debate about the mechanisms responsible for the diversity of MHC genes remains heated.

Fitnesses that vary in time. We can imagine scenarios in which an allele performs poorly in one generation and well in the next (Figure 3B). A good example is the seasonal fluctuations observed by Dobzhansky (1937) in chromosomal inversions of D. pseudoobscura across the seasons of the year. More recently, Bergland et al. (2014) detected hundreds of loci (SNPs) in D. melanogaster whose frequencies vary dramatically between seasons¹⁹, and they argue that such polymorphisms are subject to strong selective pressure that varies over time, given that they are associated with phenotypes that vary between seasons and that they respond to climatic variation.

Indeed, theory predicts that “protected polymorphisms” (Charlesworth and Charlesworth, 2010; Prout, 1968) can persist in a population with fitnesses that vary over time, provided that the geometric mean fitness of the heterozygote is higher than that of both homozygotes over the time span considered (Bergland et al., 2014; Charlesworth and Charlesworth, 2010; Gillespie and Langley, 1974)²⁰.

¹⁹ About 10 generations per summer and two per winter (Bergland et al., 2014).
²⁰ However, the study by Bergland et al. (2014) considered more realistic models, which take into account the possibility of overlapping generations, multiple linked loci, and a combination of spatial and temporal variation in selective pressures (all compatible with D. melanogaster), a discussion beyond the scope of this introduction.
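The geometric-mean condition for a protected polymorphism can be sketched with two alternating environments. The fitness values below are invented for illustration: in neither environment is the heterozygote the fittest genotype, yet its geometric mean fitness across environments exceeds that of both homozygotes. The model assumes discrete, non-overlapping generations, random mating, and no drift; all names are hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric mean of a sequence of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def next_freq(p, w11, w12, w22):
    """One generation of selection on the frequency p of allele A1."""
    q = 1.0 - p
    w_bar = p * p * w11 + 2.0 * p * q * w12 + q * q * w22
    return p * (p * w11 + q * w12) / w_bar


def simulate(p0, environments, cycles=1000):
    """Apply the environments in order, repeating the whole cycle `cycles` times."""
    p = p0
    for _ in range(cycles):
        for w11, w12, w22 in environments:
            p = next_freq(p, w11, w12, w22)
    return p


# (w11, w12, w22) in each of two alternating environments; illustrative values.
ENVIRONMENTS = [(1.0, 0.8, 0.5), (0.5, 0.8, 1.0)]

if __name__ == "__main__":
    gm11, gm12, gm22 = (geometric_mean(ws) for ws in zip(*ENVIRONMENTS))
    print("geometric means:", gm11, gm12, gm22)
    for p0 in (0.01, 0.99):
        print(p0, simulate(p0, ENVIRONMENTS))
```

With these values, either allele increases when rare, so neither is lost regardless of the starting frequency. As footnote 20 notes, the models considered for D. melanogaster are more realistic than this two-environment sketch.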


Fitnesses that vary in space. A protected polymorphism can also be maintained in environments that vary across space (Figure 3B), under the conditions laid out in the model of Levene (1953) (see Figure 1), if: (a) the relative fitness of the genotypes varies among the different niches occupied by the species; (b) all selection occurs within each niche; and (c) mating occurs at random among the individuals of each niche.

With some abstraction, we can also include here cases in which a given genotype has different fitnesses in the two sexes: if alleles $A_1$ and $A_2$ have opposite effects on the fitness of males and females (sexual antagonism), polymorphisms can be maintained in the absence of heterozygote advantage²¹. In D. melanogaster, an estimated 8% of genes show patterns compatible with this type of selection (Innocenti and Morrow, 2010).

In the case of selective pressures that vary across the space occupied by a species, whether or not an equilibrium frequency exists depends not only on the relationship among the fitnesses of the various alleles across environments, but also on the proportion of individuals allocated to each environment and on the amount of gene flow between environments (Gloss and Whiteman, 2016; Charlesworth and Charlesworth, 2010). Thus, under this mechanism the long-term maintenance of polymorphisms may or may not occur.

²¹ See also page 18, where the scenario of antagonistic pleiotropy was defined together with that of heterozygote advantage, since both mechanisms result in indistinguishable signatures.

In many of the documented cases of short- and long-term balancing selection (i.e., those not detected on the basis of studies of the current generation), it is difficult or impossible to determine which of the mechanisms defined above is responsible for the observed pattern. That is, detecting genomic signatures compatible with balancing selection does not make it possible to tell which mechanism is responsible for the observed pattern (see the discussion in the next item). In addition, multiple mechanisms may have acted simultaneously or at different times over evolutionary history (Hedrick, 2012), as is the case for the MHC.

Adaptive evolution in the human genome

“It is easy to invent a selectionist explanation for almost any specific observation; proving it is another story. Such facile explanatory excesses can be avoided by being more quantitative.” (Kimura, 1983)

For decades, the effort to detect cases of positive selection through the individual examination of candidate genes resulted in a few well-documented cases, such as the genes associated with the persistence of the lactase enzyme in adults (e.g., Bersaglieri et al., 2004) and with skin pigmentation in humans (Jablonski and Chaplin, 2010). Until recently, this approach was the only practical way to find targets of positive selection in humans (Sabeti et al., 2006).

Over the last 15 years, genomic approaches have become popular, and there has been an explosion of genomic scans²² in search of signatures of positive selection in the human genome (Bamshad and Wooding, 2003; Bustamante et al., 2005; Enard et al., 2010; Fay et al., 2001; Nielsen, 2005; Sabeti et al., 2006; Sabeti et al., 2007). By 2009, 21 scans for positive selection in humans had already been carried out (reviewed in Akey, 2009).

²² In a genomic scan, a statistic of interest is computed in windows along the genome, and the windows highlighted are those that fall in the empirical extremes of the statistic and/or beyond some threshold defined on the basis of neutral simulations.

Genomic approaches have the advantage of making it possible to understand the impact of natural selection along the entire genome and to infer the functional categories most subject to positive selection in humans (Sabeti et al., 2006). Thanks to the large number of genomic scans in search of signatures of positive selection, it is now known that many of the genes with such signatures are related to certain categories: “immunity and defense”, “sensory perception”, “T-cell-mediated immunity”, “gametogenesis”, and “spermatogenesis and motility” (Harris and Meyer, 2006; Nielsen et al., 2005). For non-coding regions, the biological processes that stand out include “neurogenesis”, “other neuronal activities”, and “muscle development” (Haygood et al., 2010).
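The outlier logic behind the genomic scans mentioned above can be sketched as follows. This is only an illustration of the empirical-tail approach: the per-window values are invented, the function name and the `top_percent` parameter are hypothetical, and thresholds derived from neutral simulations are not covered.

```python
def flag_outlier_windows(window_stats, top_percent=5):
    """Return the indices of windows in the upper empirical tail of a statistic.

    window_stats: one value of the statistic per genomic window.
    top_percent: integer percentage of windows to flag (at least one window).
    """
    n_flagged = max(1, len(window_stats) * top_percent // 100)
    # The cutoff is the value of the n_flagged-th largest window.
    cutoff = sorted(window_stats, reverse=True)[n_flagged - 1]
    # Ties with the cutoff are also flagged.
    return [i for i, value in enumerate(window_stats) if value >= cutoff]


if __name__ == "__main__":
    # Invented per-window values; window 6 is a clear outlier.
    stats = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.45, 0.11, 0.12, 0.10,
             0.08, 0.14, 0.11, 0.10, 0.12, 0.09, 0.13, 0.11, 0.10, 0.12]
    print(flag_outlier_windows(stats, top_percent=5))
```

An empirical cutoff always flags some windows, whether or not selection has acted. That is one reason the footnote also mentions thresholds defined from neutral simulations.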

Signatures of balancing selection

Since the neutral theory was proposed, many tests for balancing selection in the recent past[23] have been developed. Because they use the neutral model as the null model, tests used to detect signatures of natural selection are also called “neutrality tests”[24].

Current selection

Classically, the detection of instances of heterozygote advantage was restricted to studies of the current generation, which focus on deviations from the genotype proportions expected under certain assumptions (Hardy-Weinberg, panmixia) or on relationships between genotype/allele and fitness (Figure 5). Such studies are extremely important for assessing the fitness of phenotypes and for establishing the link between the selected phenotype and the underlying genotype, but we will not focus on them here.

[23] In the literature, the expression “short-term balancing selection” (last few thousand years) is often used.
[24] See page 8.

Figure 5: HW, Hardy-Weinberg, refers to the proportions p² + 2pq + q² = 1 for the genotypes. Figure adapted from Hedrick (2012). The signatures and the corresponding neutrality tests are covered in the section “Signatures of balancing selection”.

Short-term selection

Regarding selection events that occurred in the “recent” past (up to about 1 million years ago; Fu and Akey, 2013), one of the first tests of the neutral model was the Ewens-Watterson test (e.g. Watterson, 1978), or sample homozygosity test (Figure 5). This test assumes the infinite alleles model (IAM)[25] and that the population is at mutation-drift equilibrium. It compares the homozygosity expected in a population at equilibrium with the homozygosity actually observed. A deficit of homozygosity can be interpreted as evidence of balancing selection (i.e. alleles segregate at intermediate frequencies, lowering the observed homozygosity). This test was very influential in studies predating the genomic era, but nowadays it has been superseded by others with greater power.

[25] A model which posits that each new allele arising in a population is “new” or “unique”, i.e. different from all those that arose before. This model was proposed by Kimura and Crow (1964) in an attempt to estimate the proportion of homozygous loci in a finite diploid population.
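As a rough illustration of the quantity behind the homozygosity test, the sketch below computes the observed sample homozygosity as the sum of squared allele frequencies. The function name `observed_homozygosity` and the toy allele counts are assumptions for illustration only; the neutral expectation that the test compares against, which depends on the sample size and the number of alleles, is not computed here.

```python
def observed_homozygosity(allele_counts):
    """Sum of squared allele frequencies in a sample (hypothetical helper)."""
    n = sum(allele_counts)
    return sum((count / n) ** 2 for count in allele_counts)


# Same sample size (20 chromosomes) and same number of alleles (4):
even = observed_homozygosity([5, 5, 5, 5])      # intermediate frequencies
skewed = observed_homozygosity([17, 1, 1, 1])   # one common allele, three rare
```

With the number of alleles held fixed, the sample with alleles at intermediate frequencies has the lower homozygosity, which is the direction of the deviation the text associates with balancing selection.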

Other tests with power to detect signatures of balancing selection events compatible with this time scale compare: (a) the observed allele frequency distribution with the one expected under the neutral model, (b) genetic variation and linkage disequilibrium[26] in certain genomic regions with those observed in neutrally evolving regions, (c) the geographic differentiation observed at certain loci with that found for neutral markers (Hedrick, 2006; Hedrick, 2012; Mitchell-Olds et al., 2007) (Figure 5).

The drawback of tests focused on linkage disequilibrium (Figure 5) is that this signature is essentially indistinguishable from the one left by incomplete selective sweeps (positive selection). Another signature that can be observed at the population level is that different subpopulations will show little differentiation from one another at loci targeted by balancing selection, since it maintains variants at intermediate frequencies in both populations. This signature depends on the selective regimes being congruent across the environments occupied by the two populations and on the absence of local adaptation in one or more subpopulations (Figure 5).

[26] A measure that reflects whether two alleles at two different loci co-occur non-randomly in a population. Alleles in linkage disequilibrium are found on the same haplotype more often than would be expected if recombination occurred freely between them (reviewed in Cutter and Payseur, 2013).

On the other hand, with the availability of sequence data, it became possible to investigate the cumulative effect of selection over many generations. The signal of selection in the recent past is generated or lost over tens to thousands of generations, depending on the influence of genetic drift, gene flow and recombination (Hedrick, 2012).

Long-term selection

The signal of selection that began in the distant past and lasted a long time[27] is determined mainly by mutation and selection (Figure 5), and generally takes thousands to millions of generations to build up. When it persists for thousands to millions of generations, balancing selection results not only in the maintenance of a larger number of alleles in populations (Andrés et al., 2009), but also in a longer persistence of allelic diversity over time relative to neutral variation (Richman, 2000). In other words, under long-term balancing selection, the alleles segregating at a given locus have a longer TMRCA[28], which implies certain genomic signatures: an excess of polymorphisms in the region around the selected site (because there is more time for mutations to occur on the haplotype) and the maintenance of such polymorphisms for longer than expected for a neutral mutation, which, when it reaches fixation, does so after 4N generations on average (where N is the population size) and is otherwise lost sooner.

[27] We will refer to it as long-term balancing selection (LTBS); in humans it corresponds to events that occurred more than one million years ago, even though they may have persisted until recently (Fu and Akey, 2013).
[28] Time to most recent common ancestor: the coalescence time back to the most recent common ancestor.

Ratio of non-synonymous to synonymous substitution rates. According to the neutral theory (Kimura, 1968; Kimura, 1983), mutations capable of altering the function of a protein (non-synonymous mutations) are generally deleterious and therefore targeted by purifying selection. Comparative analyses indicate that at least 38% of non-synonymous mutations are deleterious (Eyre-Walker and Keightley, 1999). Synonymous mutations, which do not change the amino acid sequence of the protein, would in turn evolve neutrally[29]. Adaptive evolution can raise the non-synonymous substitution rate (dN), making it higher than the synonymous substitution rate (dS).

In this sense, a ratio dN/dS > 1 (or ω > 1) is a genetic signature of positive selection (Gillespie, 1991; Nielsen, 2005), but also of balancing selection (e.g. Bitarello et al., 2015[30]) (Figure 5). However, the ω > 1 criterion for considering genes to be under adaptive evolution is very conservative. Given that most non-synonymous mutations are deleterious (Kimura and Crow, 1963; Kimura, 1968; Eyre-Walker and Keightley, 1999), the criterion is often not met when whole genes are analyzed. This happens because usually only a few codons are under positive or balancing selection, while most non-synonymous mutations are deleterious and therefore under purifying selection[31]. For this reason, it has long been common practice to look for selection in subsets of codons (e.g. Hughes and Nei, 1988; Hughes and Nei, 1989; Bitarello et al., 2015) or through models that estimate different dN/dS values for groups of codons (Yang and Swanson, 2002; Bitarello et al., 2015), making it possible to infer which of them evolved adaptively.

[29] Although some synonymous mutations may be targeted by selection because of codon usage bias, and a fraction of non-synonymous mutations are neutral, the assumption holds given the proportions involved (e.g. Comeron et al., 2008).
[30] This reference is provided in Appendix A.4.
[31] This idea is indirectly explored in Chapter 2.

Site frequency spectrum. Evidence of selection in the distant past can also be based on the frequency distribution of the alleles in a sample, the site frequency spectrum (SFS). The SFS is a count of the number of mutations present at a frequency x_i = i/n, for i = 1, 2, ..., n − 1, in a sample of size n, where x_i is the relative frequency of each count. In other words, the SFS summarizes the allele frequencies of the various mutations present in a sample (Nielsen, 2005). Many statistical tests in population genetics use information on the proportion of SNPs that are common or rare, and on the frequency of derived and ancestral alleles, in order to make inferences about demographic history and possible selective regimes. One test in this category is Tajima's D (Tajima, 1989; Mitchell-Olds et al., 2007) (Figure 5). In Chapter 1, we propose two new tests that exploit this signature[32].
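The SFS defined above, and a statistic computed from it, can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in this thesis: the function names are hypothetical, the input is assumed to be a list of derived-allele counts per site, and Tajima's D is written in its standard textbook form (difference between nucleotide diversity and Watterson's estimator, scaled by its estimated standard deviation).

```python
from math import sqrt


def site_frequency_spectrum(derived_counts, n):
    """sfs[i - 1] = number of sites whose derived allele is carried by
    i of the n sampled chromosomes, for i = 1..n-1 (hypothetical helper)."""
    sfs = [0] * (n - 1)
    for count in derived_counts:
        if 0 < count < n:          # skip sites monomorphic in the sample
            sfs[count - 1] += 1
    return sfs


def tajimas_d(sfs):
    """Tajima's D computed from an unfolded SFS of length n - 1."""
    n = len(sfs) + 1
    s = sum(sfs)                   # number of segregating sites
    if s == 0:
        raise ValueError("no segregating sites in the sample")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    # Mean number of pairwise differences, obtained from the SFS.
    pi = sum(x * 2.0 * i * (n - i) for i, x in enumerate(sfs, start=1)) / (n * (n - 1))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy example with n = 10 chromosomes (made-up counts).
sfs = site_frequency_spectrum([1, 1, 2, 5, 10, 0], n=10)
d_intermediate = tajimas_d([0, 0, 0, 0, 10, 0, 0, 0, 0])  # all sites at 50%
d_rare = tajimas_d([10, 0, 0, 0, 0, 0, 0, 0, 0])          # all singletons
```

An excess of intermediate-frequency variants, the pattern expected under balancing selection, gives a positive D, whereas an excess of rare variants gives a negative D.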

Relationship between levels of polymorphism and divergence. Other tests for balancing selection on this time scale are based on comparisons between levels of polymorphism (observed within a species) and levels of divergence (between two species)[33]. These include the Hudson-Kreitman-Aguadé test, or HKA (Hudson et al., 1987), used to test predictions of the neutral model of molecular evolution. An excess of polymorphism relative to divergence can be interpreted as evidence of balancing selection (Figure 5).

[32] Figures 1 and 5 of Chapter 1 show a schematic site frequency spectrum and one from real data, respectively.
[33] Throughout the text, and particularly in Chapter 1, I refer to substitutions between humans and chimpanzees, although in principle these could be substitutions between any two species.

A modified version of the HKA test, the McDonald-Kreitman (MK) test, compares polymorphism and divergence between different classes of mutation, such as synonymous and non-synonymous substitutions (McDonald and Kreitman, 1991). When there is an excess of non-synonymous divergence, the pattern is consistent with positive selection (which fixes differences between species and removes polymorphisms, explaining the pattern described). An excess of non-synonymous polymorphisms, in turn, can be interpreted as a signature of balancing selection (Figure 5). However, as discussed by Eyre-Walker (2006), this statistic is very sensitive to changes in population size, and even modest increases in Ne can create spurious evidence of adaptive evolution. Sella et al. (2009), for their part, argue that one of the most problematic assumptions of the MK test is that the fraction of new mutations that is neutral, which is estimated from the polymorphism data of one of the species, has remained constant throughout the evolutionary history of the two species being compared. This assumption is problematic because a demographic history that leaves the present-day population out of equilibrium renders it false when selection is weak, resulting in incorrect estimates of the rate of adaptive evolution (e.g. Eyre-Walker, 2006; Fay et al., 2001; Nielsen, 2005; Sella et al., 2009).
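The comparison made by the MK test can be summarized with a small sketch. The counts below are invented for illustration and do not come from any study cited here; the function names are hypothetical. The sketch computes the neutrality index, the ratio of non-synonymous to synonymous polymorphism divided by the same ratio for divergence, together with the commonly used quantity α = 1 − NI; a significance test on the 2×2 table is left out.

```python
def neutrality_index(pn, ps, dn, ds):
    """(Pn / Ps) / (Dn / Ds) from the four MK counts (hypothetical helper).

    pn, ps: non-synonymous and synonymous polymorphisms within a species
    dn, ds: non-synonymous and synonymous fixed differences between species
    """
    return (pn / ps) / (dn / ds)


def alpha(pn, ps, dn, ds):
    """1 - NI; negative values indicate excess non-synonymous polymorphism."""
    return 1.0 - neutrality_index(pn, ps, dn, ds)


# Made-up counts with an excess of non-synonymous polymorphism.
ni_excess_poly = neutrality_index(pn=20, ps=10, dn=5, ds=10)
# Made-up counts with an excess of non-synonymous divergence.
ni_excess_div = neutrality_index(pn=5, ps=20, dn=20, ds=10)
```

Values of the index above 1 correspond to the excess of non-synonymous polymorphism discussed in the text, and values below 1 to the excess of non-synonymous divergence; the caveats about demography raised above apply to either reading.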

Sharing of polymorphisms between species. Finally, if a polymorphism is maintained by balancing selection for long enough, the same polymorphism can be found in two sister species: a trans-species polymorphism (Hedrick, 2012; Klein et al., 1998) (Figure 5).

If the possibility that identical mutations occurred independently in the two species can be ruled out, for example by requiring that more than one polymorphism be shared within a short haplotype (e.g. Leffler et al., 2013; Teixeira et al., 2015), the other possible explanation is that the polymorphism is neutral and was maintained “by chance” in both species. If the divergence time between the species is much greater than the mean intraspecific coalescence time, this alternative has a very low probability.

Teixeira et al. (2015), for example, showed that the probability that a polymorphism in humans is also a polymorphism in chimpanzees and bonobos is on the order of 10⁻¹⁰. Even assuming independence among all SNPs, and considering samples of 20 individuals from each species, the probability of there being a shared polymorphism is ∼0.00005, i.e. practically zero. Therefore, although this signature is extremely convincing as evidence of long-term balancing selection, it is quite rare.

Balancing selection in the human genome

“Balancing selection is not unique to the human lineage, nor is it the dominant force in shaping the human genome. It is, however, there and its effects are without dispute.” (Vallender and Johnson, 2008)

Although several genome scans have been carried out to locate targets of positive selection (reviewed in Akey, 2009), comparatively few studies have sought targets of balancing selection. This is partly due to the difficulty of detecting this type of selection on a genomic scale (Andrés et al., 2009). Figure 6 summarizes the studies that searched for signatures of balancing selection in humans.

Until very recently, no more than a handful of cases of balancing selection had been established in humans. Even with the advent of sequence data for many genes, few targets were proposed beyond the HLA genes, the ABO gene and the hemoglobin S gene (Allison, 1954; Asthana et al., 2005; Bubb et al., 2006; Hughes and Nei, 1988; Ségurel et al., 2013).

Figure 6: Summary of the studies that searched for signatures of long-term balancing selection in humans. SFS, site frequency spectrum; dbSNP, SNP database, indicates that the study did not use SNP frequency information estimated for individual populations, but from a common database; trSNPs, trans-species SNPs between human and chimpanzee (H-C) or human, chimpanzee and bonobo (H-C-B). a, excess of SNPs relative to sites that are divergent between humans and chimpanzees, a signature of balancing selection. ** although they find 16 regions with high genetic diversity in addition to the HLA genes and the ABO gene, the authors dismiss them as true candidates because they lack evidence of trans-species polymorphism, which is a very stringent criterion (see “Signatures of balancing selection”).

The first genome-scale study to identify new signatures of balancing selection in humans was that of Andrés et al. (2009), which scanned the human genome for genes under long-term balancing selection (LTBS). The scan used an exome data set consisting of 13,400 genes from two human populations (19 North Americans of European ancestry and 20 North Americans of African ancestry). The method consisted of contrasting patterns of polymorphism in each gene with the rest of the genome and with neutral expectations, obtained through simulations parameterized by the sequencing data themselves. The method looked for genes with two signatures of LTBS: an excess of variants at intermediate frequencies and an excess of polymorphic sites in humans (relative to substitutions between humans and chimpanzees)[34].

Only genes that were extreme for both signatures, i.e. highly significant for two independent tests, were considered candidates. Despite the low statistical power of this type of approach (Fijarczyk and Babik, 2015), it yields few false positives. Sixty genes with signatures of LTBS were found, many of them involved in immunity, but not restricted to the classic examples described until then (Key et al., 2014).

It was observed that most loci under balancing selection were shared between the two populations, with few exceptions: four genes with evidence of balancing selection only in Americans of African ancestry and nine only in Americans of European ancestry.

The absence of non-coding, intergenic and intronic flanking sequences from the data sets of that study has two consequences: first, the effects of balancing selection on regions linked to loci under selection could not be quantified; second, targets of balancing selection in non-coding regions could not be detected.

More recently, DeGiorgio et al. (2014) developed two new tests (T1 and T2) to identify local patterns of diversity expected at positions linked to a balanced polymorphism, and applied these methods to genomic data from two populations (YRI, Yoruba, an African population, and CEU, a population of North Americans from Utah with northern European ancestry). They identified 200 candidate genes (Figure 6), some of them already known (e.g. Andrés et al., 2009) and others new (Key et al., 2014).

[34] Both signatures underlie tests discussed in the previous section.

Three important limitations of this work are: (1) although the study shows that the T1 and T2 methods have high power under simple models, the authors do not explore human demographic models (e.g. Gravel et al., 2011); (2) the 200 reported genes come from a list of the “100 most extreme genes” for each test and population, but no criterion was given for reporting 100 genes; therefore, although this work unquestionably adds a great deal to the accumulated knowledge of regions of the human genome bearing signatures of LTBS, it does not provide an approximate estimate of how frequent LTBS may be in the human genome; (3) although the authors used whole-genome data, they report only protein-coding genes as targets, and do not explore or report candidate genomic regions lying outside gene boundaries.

Finally, two genome-scale studies searched for polymorphisms shared between humans and other species: this is the most extreme signature of LTBS, since it requires polymorphisms to be maintained for millions of years. Leffler et al. (2013) examined whole genomes of humans and chimpanzees and searched for shared polymorphisms, finding six genes with strong evidence of shared balanced polymorphisms (Figure 6), and Teixeira et al. (2015) examined the exons of humans, chimpanzees and bonobos, finding strong evidence of a balanced polymorphism shared among the three species in the LAD1 gene, in addition to some HLA genes (Figure 6).

Short- and long-term balancing selection events leave genomic signatures that do not allow one to distinguish among the mechanisms of balancing selection described earlier. Moreover, a signal of long-term selection does not mean that selection has persisted into the recent past or up to the current generation. Defining the time scales of selective regimes is therefore important both for determining which tools are appropriate and for the degree of resolution that can be achieved.

By themselves, the population genetic analyses described above can identify genes that evolved non-neutrally over thousands or even millions of generations. This is the kind of investigation proposed in Chapter 1, in order to better understand the impact of long-term balancing selection on the human genome. Such an investigation does not, however, provide conclusive information about the phenotypic trait that was the target of selection (Mitchell-Olds et al., 2007). This is an ongoing challenge in population genomics, and will be discussed in the section Discussion and Final Remarks (page 214).

Genetic load induced by balancing selection

“Hence there must be a far greater number of different kinds of ailments whose characteristics are traceable to genetic changes of natural origin than there are different kinds of infectious diseases. This general reasoning does not in itself give us much idea, however, of the actual frequencies with which these mutational disorders occur in populations. For this it is necessary to turn to quantitative studies.” (Muller, 1950)

As we have seen, sequence polymorphisms may be evolving neutrally (probably the great majority of them), but they may also represent transient states of genetic variants on their way to elimination because they are deleterious, or on their way to fixation because they are advantageous (Hudson et al., 1987; Mitchell-Olds et al., 2007; Sellis et al., 2011). Finally, a certain number of these polymorphisms are maintained within individual populations by balancing selection, or across the whole range of the species by local adaptation[35] (reviewed in Mitchell-Olds et al., 2007).

Genomic studies have documented large amounts of polymorphism, and much progress has been made in understanding which evolutionary processes shaped this variation (reviewed in Mitchell-Olds et al., 2007). For an evolutionary geneticist, detecting possible targets of natural selection in the genome is fundamental. First, patterns compatible with a scenario of adaptive evolution (positive or balancing selection) must be sought. One route is through the signatures of selection and the neutrality tests seen earlier. Once the regions of interest have been detected, it is necessary to investigate whether there is a biological basis for claiming that such patterns do result from selection. Finally, one can investigate whether selection on a given locus in the genome has deleterious consequences for neighboring regions.

The existence of signatures of positive or balancing selection indicates that there may have been adaptation for some trait (the one related to the selective pressure), but the consequences of that selective pressure may also include changes that cause maladaptation, through changes occurring in other regions of the same gene that is being selected, or even in adjacent non-coding, yet functional, regions.

[35] A situation in which genotypes from different populations have higher fitness in their home environments, owing to historical natural selection in the region.

Genetic load

Understanding the genetic basis of adaptation requires defining “genetic load”. The concept was first discussed by Haldane (1937) and later elaborated by Muller (1950). The central idea is that a population falls short of its maximum fitness for two reasons: (1) the occurrence of recurrent deleterious mutations (mutational load); and (2) the production of less fit homozygotes in situations where heterozygotes are the fittest genotype (segregational load). This decrease relative to the maximum fitness of a population is the genetic load.
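The segregational load can be made concrete with a minimal sketch under the simplest textbook model of heterozygote advantage, which is an assumption of this example and is not spelled out in the text: one locus with two alleles, genotype fitnesses 1 − s (AA), 1 (Aa) and 1 − t (aa), random mating, and the population at its selection equilibrium. The function name `segregational_load` is hypothetical.

```python
def segregational_load(s, t):
    """Load at equilibrium for fitnesses AA = 1 - s, Aa = 1, aa = 1 - t.

    Assumes random mating and 0 < s, t <= 1 (illustrative model only).
    """
    p = t / (s + t)                 # equilibrium frequency of allele A
    q = 1.0 - p
    mean_fitness = p * p * (1 - s) + 2 * p * q + q * q * (1 - t)
    w_max = 1.0                     # fitness of the heterozygote
    return (w_max - mean_fitness) / w_max


load_symmetric = segregational_load(0.1, 0.1)
load_asymmetric = segregational_load(0.2, 0.05)
```

Under this model the result equals st / (s + t): the load is paid every generation because segregation keeps regenerating the less fit homozygotes, even though the population is at equilibrium.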

Although deleterious mutations carry a fitness cost, they are not always removed from the population efficiently. It is even harder to remove deleterious mutations from small populations (low values of Ne[36]), and their accumulation can lead to reductions in population size and, ultimately, to extinction (Chun and Fay, 2011).

In the context of molecular evolution, most mutations are not adaptive: they either worsen (maladaptive mutations) or do not affect (neutral mutations) the degree of adaptation of traits to the environment (Orr, 1998). How effectively deleterious variants are removed from a population depends on several factors: mutation (which constantly creates new deleterious variants), dominance (which influences how “visible” the mutation is to selection) (e.g. Sellis et al., 2011), demography and linkage (Gravel, 2016).

[36] Effective population size: reflects the size of an idealized population that would experience drift to the same extent as the actual population. Ne may be smaller than the census population size owing to several factors, including variance in reproductive success, a demographic history with bottlenecks (extreme reductions in population size, followed by an expansion from a sample of the original population) and inbreeding (reviewed in Cutter and Payseur, 2013).

In addition, a population may be maladapted because it lacks segregating genotypic variation with which to respond to selection. Genetic drift and inbreeding, for example, move populations away from their adaptive peaks and can lead to phenotypic maladaptation (Crespi, 2000). Pleiotropy[37] can result in maladapted populations because the joint optimization of many traits is not feasible (Charlesworth and Charlesworth, 2010; Crespi, 2000). Finally, migration between populations of individuals that adapted in different subpopulations can also lead to maladaptation (Charlesworth and Charlesworth, 2010; Crespi, 2000).

Maladaptation can also result from selective pressures for adaptation at linked sites, which I discuss in more detail below. In this sense, whether in phenotypic or in genetic terms (accumulation of deleterious mutations), the term maladaptation alludes to an “adaptive cost” that populations must bear because they are evolving under particular selective pressures.

Effects of linkage on neutral polymorphisms

The last decade witnessed an explosion of studies on the prevalence and effect of natural selection, particularly in the genome of our species. One of the most striking findings was that a relatively large proportion of the changes between humans and chimpanzees result from natural selection. For example, some studies showed that up to 10% of the substitutions we carry result from positive selection (Bustamante et al., 2005; Fay et al., 2001), a much larger fraction than a neutralist would expect (reviewed in Eyre-Walker, 2006). These results were obtained using an approach that contrasts the degree of polymorphism with the divergence between humans and other primates[38], and revealed that there are more fixed non-synonymous differences between the two species than would be expected under neutrality; these differences can be explained if we assume that selection fixed different advantageous mutations in the human and chimpanzee lineages (reviewed in Eyre-Walker, 2006).

[37] A phenomenon in which one gene affects multiple traits, considered the nearly universal mode of gene action.

Natural selection is expected to leave specific signatures on the patterns of neutral variation closely linked to the site carrying the advantageous mutation. This idea underlies the molecular population genetics methods that search for adaptations in the human genome (Kreitman and Di Rienzo, 2004). This property is crucial for neutrality tests to have power to detect regions with signatures of selection, even if only a single site is selected (reviewed in Charlesworth, 2006). Genome-scale genetic information also makes it possible to draw inferences about the consequences of natural selection for regions of the genome physically close to the selected genes or genomic regions.

There is also a well-documented positive correlation between recombination rates and levels of polymorphism in Drosophila (Zhang and Parsch, 2005) and in humans (Hellmann et al., 2003). This correlation is consistent with the idea that positive selection occurs at many places in the genome and, when it affects a gene in a region of low recombination, it “drags” part of the chromosome along with it (that is, it causes a genetic hitchhiking event), which results in the loss of variation in that region of the genome. In this process, neutral variation at linked sites is reduced, and the lower the recombination rate in the region, the more pronounced the loss of diversity.

[38] As discussed on page 31 and in Figure 5.

These selective sweeps[39] lead to a drop in diversity around the selected site; diversity then recovers progressively as the distance between the neutral site and the target site increases. The size of the neighboring area that loses neutral polymorphisms depends on the strength of selection, the mutation rate and the recombination rate (reviewed in Bamshad and Wooding, 2003). It is now known that positive selection in humans substantially affects neutral sites near the selected site, reducing their level of polymorphism by 6% across the whole genome and by 11% in the protein-coding portion of the genome (Cai et al., 2009).

Similarly, long-term balancing selection also affects neighboring neutral sites (Charlesworth, 2006). By increasing coalescence time, balancing selection leads to an increase in diversity around the selected site, and also changes the shape of the local site frequency spectrum (SFS), which comes to show an excess of alleles segregating at frequencies close to that of the balanced polymorphism (Andrés et al., 2009; Andrés, 2011; Bamshad and Wooding, 2003; Charlesworth, 2006).

However, unlike selective sweeps, which involve high selection coefficients and reduce neutral diversity within relatively few generations, generating signatures that extend over long stretches of the chromosome (Bamshad and Wooding, 2003), long-term balancing selection, because it involves time scales of thousands to millions of years, generates signatures in short segments around the balanced polymorphism[40]. This happens because, over many generations, recombination has the opportunity to “break up” the linkage between the balanced polymorphism and neighboring neutral sites (Andrés, 2011; Charlesworth, 2006), thereby confining the linkage effect to short segments of the chromosome.

[39] A phenomenon in which a newly arisen, highly adaptive mutation rises rapidly in frequency in the population.
[40] A property that is exploited when we assess the power of our statistics in Chapter 1.
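The role of time in narrowing the signature can be illustrated with the standard deterministic expectation for the decay of linkage disequilibrium between two loci, D_t = D_0 (1 − r)^t, where r is the recombination fraction per generation. This is a simplification chosen for illustration (it ignores selection and drift), and the function name and numerical values below are hypothetical.

```python
def ld_after(d0, r, generations):
    """Expected linkage disequilibrium after a number of generations,
    assuming only recombination acts: D_t = D_0 * (1 - r) ** t."""
    return d0 * (1.0 - r) ** generations


# A neutral site at recombination fraction r = 1e-4 from the selected site.
recent = ld_after(0.2, 1e-4, 100)        # shortly after the association arose
ancient = ld_after(0.2, 1e-4, 100_000)   # after a long-term time scale
```

With these illustrative values the association is almost intact after 100 generations and essentially gone after 100,000, so on long time scales only sites with very small r, that is, very close to the balanced polymorphism, are expected to retain the signal.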

Effects of linkage on non-neutral polymorphisms

There are two ways in which selection on one trait interferes with selection on other traits. The first arises when a gene has pleiotropic functions. Positive or balancing selection presupposes adaptation for some trait (the one related to the selective pressure), but the consequence of the selective pressure in terms of the fixation of mutations may not be an adaptation for all the functions the gene performs. In that case, the “by-product” would be a maladaptation. An example is the HLA genes, which are involved in the immune response in humans and show strong evidence of positive and balancing selection. On the other hand, many inflammatory and autoimmune diseases are also associated with HLA genes (Becker et al., 1998).

The second arises when the selected site interferes with the fate of non-neutral mutations at genetically linked sites[41]. Natural selection does not act independently on linked loci and, consequently, its efficiency is directly related to the recombination rate (Hill and Robertson, 1966; Comeron et al., 2008). Linkage between sites reduces the efficacy of natural selection in finite populations (Hill and Robertson, 1966; Comeron et al., 2008). All else being equal, the degree of interference between selected alleles is expected to vary among regions with different recombination rates, potentially leading to changes in rates of evolution[42]. Regions of low recombination would be subject to more pronounced Hill-Robertson effects, since low recombination reduces the independence between sites, increasing the relative effect of drift on the region, which is equivalent to a reduction of the local Ne (Maynard-Smith and Haigh, 1974; Comeron et al., 2008).

[41] See Figure 1 of Chapter 2 on page 170.
[42] This interference is known as the Hill-Robertson effect (Hill and Robertson, 1966; Comeron et al., 2008).

Given that adaptive substitutions (fixed by positive selection) are relatively common, it became important to investigate how these events influence the action of natural selection in adjacent regions of the genome. The fixation probability of neutral variants linked to advantageous variants remains unchanged, since the probability that an advantageous mutation carries a neutral mutation along with it is directly proportional to the frequency of the neutral allele in the population (Birky and Walsh, 1988; Chun and Fay, 2011; Comeron et al., 2008; Harris, 2010; Charlesworth, 2012). However, slightly deleterious mutations, when genetically linked to advantageous variants, will have a higher probability of fixation than expected under the neutral model (Lynch, 2007).

The effect of positive selection on linked regions in Drosophila was investigated by Betancourt and Presgraves (2002), who found that genes with slow adaptive evolution (subject mainly to purifying selection) are concentrated in regions of low recombination. They also found a significant negative correlation between the rate of evolution (measured as dN) and the proportion of optimal codon usage[43]. The authors ruled out relaxed selective constraint; they conclude that the cause of this correlation is interference from the coding site with the weak selection coefficient for optimal codon usage at tightly linked neighboring sites. The general conclusion is that, at least in Drosophila, there seems to be a "hierarchy of selective pressures": selection against deleterious mutations is stronger than selection on advantageous nonsynonymous mutations, which in turn is stronger than selection for optimal codons. This does not mean, as the authors themselves stress, that the cumulative effect of many suboptimal codons is negligible.

[43] Different synonymous sites are not selectively equivalent, since certain codons (called "optimal") are used more often than others, possibly because of translational efficiency and accuracy (Betancourt and Presgraves, 2002).

In addition, strong directional selection (positive or negative) has been shown to increase the proportion of deleterious variants segregating in regions adjacent to the targets of directional selection in humans (Chun and Fay, 2011). Just as genome scans for balancing selection are far less abundant than those for positive selection, the same is true for the accumulation of deleterious variants: to date, no study has specifically investigated the impact of balancing selection on the accumulation of deleterious variants at sites in the vicinity of the balanced polymorphism.

There is evidence that genes in the vicinity of the HLA genes have an excess of potentially deleterious variants (Mendes, 2013; Lenz et al., 2016). However, given the many peculiarities of this genomic region (Meyer et al., 2006), the general effect of balancing selection on linked non-neutral variants remains an open question. I address this question in Chapter 2.

Relevance, Questions & Hypotheses

Relevance

Over the past decade, with genome scans being applied ever more frequently, many lists have been produced of regions and genes, in the human genome and in other species, that carry signatures of positive selection. Access to genome sequences from many species, together with bioinformatic tools, stimulated efforts to quantify the adaptive evolution that underlies the polymorphism patterns observed in humans. The fact that groups of genes defined by function, in categories such as "spermatogenesis", "olfaction", "sensory perception" and "immune response", recur in lists of candidate genes for positive selection (e.g. Nielsen, 2005; Sabeti et al., 2006) makes intrinsic biological sense: many genes in these categories are directly involved in interactions with the environment. Such observations increase our confidence in genome scans while helping us understand the specific adaptations of our species. Because it is widely regarded as the main mechanism responsible for adaptive evolution, positive selection has been, and still is, intensely studied.

Genetic variability is the population geneticist's object of study. The selective regime known as balancing selection encompasses a range of mechanisms capable of keeping adaptive polymorphisms segregating in populations for short or long periods of time. Although we now have a comprehensive map of genes that underwent positive selection in humans, information is lacking on the targets and the extent of balancing selection in human evolutionary history.

One reason is historical, as discussed in the introduction: although it was once a selective regime that excited giants of evolutionary biology such as Dobzhansky and Fisher, balancing selection ceased to be a focus of research for evolutionary biologists for more than two decades after the advent of Kimura's neutral theory (Kimura, 1968; Kimura, 1983). In a sense, the demonstration by Hughes and Nei (1988) that high levels of diversity are maintained specifically in the antigen-presenting groove of MHC genes, a pattern compatible with the action of balancing selection, marks a revival of interest in the topic.

The MHC remains the best and least controversial example of balancing selection in humans, as well as an example of how multiple mechanisms can act on the same system: there is independent evidence of heterozygote advantage, frequency-dependent selection and selective pressure that varies over time and/or space acting on these genes, and a heated debate about how much each mechanism contributes to the patterns of high MHC diversity.

Even so, there is an enormous gap between what we know about the targets of balancing selection in the human genome and what we know about those of positive selection, and about how the properties of these selective regimes compare. The few studies that searched for signatures of balancing selection have several limitations, including a lack of power to detect the signatures and insufficient data (few populations, or restricted to , for example). Of those that searched for signatures at the genomic scale, one study discusses only protein-coding targets, even though the whole genome was analyzed (DeGiorgio et al., 2014), and the other (Leffler et al., 2013) reports that most targets are non-genic, i.e., possibly regulatory or related to gene expression. However, the latter used a very stringent criterion to define targets of balancing selection (the presence of polymorphisms shared with chimpanzees), so studies that support or challenge this observation are lacking.

The detection of signatures of natural selection is possible largely because of the signal produced at neutral sites adjacent to the selected sites. The method developed in Chapter 1 (NCD, Non-Central Deviation) is an example of the use of this property: even if only one site is actually maintained polymorphic by balancing selection[44], it will affect adjacent levels of neutral polymorphism through linkage, and the extent of this effect on the neighborhood depends on the timescale, duration and intensity of selection.

On the other hand, little is known about the effect that balanced polymorphisms have on adjacent non-neutral variants. This knowledge is scarce even for targets of positive selection, and practically nonexistent for targets of balancing selection, except for a few studies of HLA genes (e.g. Oosterhout, 2009; Lenz et al., 2016). Better understanding the impact of balancing selection on human evolution is important not only for understanding how we became what we are, but also for better understanding complex diseases that occur at relatively high frequency in humans (Vallender and Johnson, 2008).

[44] A possibility, but not a certainty. For the HLA genes, many sites are known to be actively maintained polymorphic.

Questions & Hypotheses

In this context, the main questions addressed in this thesis were: (1) is it possible to develop more powerful methods to find regions of the genome that evolve under balancing selection? (2) what are the targets of long-term balancing selection in humans? (3) what are the biological properties of these targets: are they mostly (protein-coding) genes, regulatory regions, or regions that affect gene expression? (4) which functional categories are most abundant among genes targeted by balancing selection: apart from the HLA genes, which are involved in the immune response, what can we say about the targets in terms of function? (5) what do we know about the biological importance of some of these candidate genes, based on independent studies? (6) are targets of balancing selection shared among populations or continents? (7) how prevalent are signatures of long-term balancing selection in the human genome? Can we quantify the proportion of the human genome that has been shaped by mechanisms that maintain diversity? (8) does balancing selection on one or more sites interfere with the efficacy of purifying selection on adjacent non-neutral sites?

The hypotheses explored in this context were:

• (a) balancing selection is not very frequent in the genome, but is probably more frequent than estimated so far, because the methods and/or data used did not allow a less conservative estimate of its frequency in the genome. To test this hypothesis, we proposed a new statistic (Chapter 1), with increased power relative to commonly used neutrality tests and optimized for scanning the human genome.

• (b) balancing selection affects both genic regions and regulatory regions that control expression.

• (c) balancing selection mostly affects genes related to the defense of the organism (e.g. genes in immune system pathways, membrane proteins, extracellular matrix proteins, etc.), or related to interactions with the extracellular environment or with other cells, and to reproduction[45].

• (d) targets of balancing selection will be shared more among populations from the same continent than among populations from different continents, and the overall level of sharing between any populations will be high; few genes/regions would show opposite signals of selection in different populations/continents.

• (e) there is an excess of nonsynonymous SNPs (or of deleterious SNPs) in regions close to targets of balancing selection. A possible explanation would be that, through linkage, weakly deleterious mutations could increase in frequency together with the sites targeted by balancing selection. Conversely, a deficit of deleterious mutations could be related to the frequency of the variants (an artifact) or to the fact that balancing selection increases the local effective size.
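One simple way to express the "excess" in hypothesis (e) is an odds ratio comparing nonsynonymous and synonymous SNP counts near candidate targets with the same counts elsewhere. This is only a sketch under stated assumptions: the function name and the counts in the usage example are hypothetical, and it is not presented as the analysis carried out in Chapter 2.

```python
def nonsynonymous_odds_ratio(nonsyn_near, syn_near, nonsyn_far, syn_far):
    """Odds ratio of nonsynonymous vs. synonymous SNPs, near targets vs. elsewhere.

    A value above 1 means proportionally more nonsynonymous SNPs near the
    targets; a value below 1 means proportionally fewer.
    All four arguments are SNP counts and must be positive.
    """
    counts = (nonsyn_near, syn_near, nonsyn_far, syn_far)
    if any(count <= 0 for count in counts):
        raise ValueError("all counts must be positive")
    return (nonsyn_near * syn_far) / (syn_near * nonsyn_far)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    ratio = nonsynonymous_odds_ratio(nonsyn_near=30, syn_near=70,
                                     nonsyn_far=20, syn_far=80)
    print(f"odds ratio = {ratio:.2f}")
```

An odds ratio on its own says nothing about statistical support; in practice it would need a significance test and a control for allele frequency, which the hypothesis itself flags as a possible artifact.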

Hypotheses (a-d) were tested with the targets obtained using a method that I developed with collaborators, which scanned the whole genome for signatures of balancing selection in four populations from two continents. These hypotheses are explored in Chapter 1.

The premise of hypothesis (e) is that balancing selection is sufficiently stronger than purifying selection against deleterious variants that the latter cannot counterbalance the former. This last hypothesis was explored in Chapter 2.

[45] Although relatively few cases have been described so far and the mechanisms are not fully understood, they seem promising. Examples include genes involved in spermatogenesis, sperm-egg recognition, and follicle-stimulating hormone (reviewed in Vallender and Johnson, 2008).

Bibliography

Akey, J. M. (2009). “Constructing genomic maps of positive selection in humans: where do we go from here?” In: Genome Research 19 (5), pp. 711–722.
Allison, A. C. (1954). “Protection Afforded by Sickle-cell Trait Against Subtertian Malarial Infection.” In: British Medical Journal 1 (4857), pp. 290–294.
Andrés, A. M. (2011). “Balancing Selection in the Human Genome”. In: eLS, pp. 1–8.
Andrés, A. M. et al. (2009). “Targets of balancing selection in the human genome.” In: Molecular Biology and Evolution 26 (12), pp. 2755–64.
Asthana, S., S. Schmidt and S. R. Sunyaev (2005). “A limited role for balancing selection”. In: Trends in Genetics 21 (1), pp. 30–32.
Bamshad, M. and S. P. Wooding (2003). “Signatures of natural selection in the human genome”. In: Genetics 4 (February), pp. 99–111.
Becker, K. G., R. M. Simon, J. E. Bailey-Wilson, B. Freidlin, W. E. Biddison, H. F. McFarland and J. M. Trent (1998). “Clustering of non-major histocompatibility complex susceptibility candidate loci in human autoimmune diseases”. In: Proceedings of the National Academy of Sciences of the United States of America 95 (17), pp. 9979–9984.
Bergland, A. O., E. L. Behrman, K. R. O’Brien, P. S. Schmidt and D. A. Petrov (2014). “Genomic Evidence of Rapid and Stable Adaptive Oscillations over Seasonal Time Scales in Drosophila”. In: PLoS Genetics 10 (11), e1004775.
Bernardi, G. (2007). “The neoselectionist theory of genome evolution.” In: Proceedings of the National Academy of Sciences of the United States of America 104 (20), pp. 8385–90.
Bersaglieri, T., P. C. Sabeti, N. Patterson, T. Vanderploeg, S. F. Schaffner, J. A. Drake, M. Rhodes, D. E. Reich and J. N. Hirschhorn (2004). “Genetic signatures of strong recent positive selection at the lactase gene.” In: American Journal of Human Genetics 74 (6), pp. 1111–20.

Betancourt, A. J. and D. C. Presgraves (2002). “Linkage limits the power of natural selection in Drosophila.” In: Proceedings of the National Academy of Sciences of the United States of America 99 (21), pp. 13616–20.
Birky, W. and J. B. Walsh (1988). “Effects of linkage on rates of molecular evolution.” In: Proceedings of the National Academy of Sciences of the United States of America 85, pp. 6414–6418.
Bitarello, B. D., R. D. S. Francisco and D. Meyer (2015). “Heterogeneity of dN/dS Ratios at the Classical HLA Class I Genes over Divergence Time and Across the Allelic Phylogeny”. In: Journal of Molecular Evolution 82 (1), pp. 38–50.
Borghans, J., J. Beltman and R. Boer (2004). “MHC polymorphism under host-pathogen coevolution”. In: Immunogenetics 55 (11), pp. 732–739.
Bromham, L. and D. Penny (2003). “The modern molecular clock.” In: Nature Reviews Genetics 4 (3), pp. 216–224.
Bubb, K. L. et al. (2006). “Scan of human genome reveals no new Loci under ancient balancing selection.” In: Genetics 173 (4), pp. 2165–77.
Bustamante, C. D. et al. (2005). “Natural selection on protein-coding genes in the human genome”. In: Nature 437 (7062), pp. 1153–1157.
Cai, J. J., J. M. Macpherson, G. Sella and D. A. Petrov (2009). “Pervasive Hitchhiking at Coding and Regulatory Sites in Humans”. In: PLoS Genetics 5 (1), pp. 1–13.
Chakraborty, M. and J. D. Fry (2015). “Evidence that Environmental Heterogeneity Maintains a Detoxifying Enzyme Polymorphism in Drosophila melanogaster”. In: Current Biology 26 (2), pp. 1–5.
Charlesworth, B. (2000). “Fisher, Medawar, Hamilton and the Evolution of Aging”. In: Genetics 156 (3), pp. 927–931.
Charlesworth, B. (2012). “The effects of deleterious mutations on evolution at linked sites”. In: Genetics 190 (1), pp. 5–22.
Charlesworth, B. and D. Charlesworth (2010). Elements of Evolutionary Genetics. 1st ed. Roberts and Company Publishers, p. 768. ISBN: 0981519423.
Charlesworth, D. (2006). “Balancing selection and its effects on sequences in nearby genome regions.” In: PLoS Genetics 2 (4), pp. 379–384.
Chun, S. and J. C. Fay (2011). “Evidence for hitchhiking of deleterious mutations within the human genome.” In: PLoS Genetics 7 (8), e1002240.

Clarke, B. (1962). “Balanced polymorphism and the diversity of sympatric species”. In: Taxonomy and Geography. Ed. by D. Nichols. Oxford: Systematics Association.
Comeron, J. M., A. Williford and R. M. Kliman (2008). “The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations.” In: Heredity 100 (1), pp. 19–31.
Connallon, T. and A. G. Clark (2013). “Antagonistic versus nonantagonistic models of balancing selection: characterizing the relative timescales and hitchhiking effects of partial selective sweeps.” In: Evolution 67 (3), pp. 908–17.
Crespi, B. J. (2000). “The evolution of maladaptation”. In: Heredity 84 (March 1999), pp. 623–629.
Crow, J. F. (1987). “Muller, Dobzhansky, and overdominance”. In: Journal of the History of Biology 20 (3), pp. 351–380.
Cutter, A. D. and B. A. Payseur (2013). “Genomic signatures of selection at linked sites: unifying the disparity among species.” In: Nature Reviews Genetics 14 (4), pp. 262–74.
Darwin, C. (1859). The origin of species: complete and fully illustrated. 1979 ed. New York: Gramercy Books. ISBN: 9780517123201.
Darwin, C. (1876). The effects of cross and self fertilisation in the vegetable kingdom.
De Boer, R. J., J. A. M. Borghans, M. van Boven, C. Kesmir and F. J. Weissing (2004). “Heterozygote advantage fails to explain the high degree of polymorphism of the MHC.” In: Immunogenetics 55 (11), pp. 725–731.
DeGiorgio, M., K. E. Lohmueller and R. Nielsen (2014). “A model-based approach for identifying signatures of ancient balancing selection in genetic data.” In: PLoS Genetics 10 (8), e1004561.
Dempster, E. R. (1955). “Maintenance of genetic heterogeneity.” In: Cold Spring Harbor Symposia on Quantitative Biology. Cold Spring Harbor Laboratory Press, pp. 25–32.
Dobzhansky, T. (1937). Genetics and the Origin of Species. 2nd ed. New York: Columbia University Press.
Doherty, P. C. and R. M. Zinkernagel (1975). “Enhanced immunological surveillance in mice heterozygous at the H-2 gene complex”. In: Nature 256 (5512), pp. 50–52.
Enard, D., F. Depaulis and H. Roest Crollius (2010). “Human and Non-Human Genomes Share Hotspots of Positive Selection”. In: PLoS Genetics 6 (2), pp. 1–13.

Eyre-Walker, A. (2006). “The genomic rate of adaptive evolution.” In: Trends in Ecology & Evolution 21 (10), pp. 569–75.
Eyre-Walker, A. and P. D. Keightley (1999). “High genomic deleterious mutation rates in hominids”. In: Nature 397 (6717), pp. 344–347.
Fay, J. C., G. J. Wyckoff and C.-I. I. Wu (2001). “Positive and negative selection on the human genome.” In: Genetics 158 (3), pp. 1227–34.
Fijarczyk, A. and W. Babik (2015). “Detecting balancing selection in genomes: Limits and prospects”. In: Molecular Ecology, n/a–n/a.
Fisher, R. A. (1922). “On the Dominance Ratio.” In: Proc. R. Soc. 42, pp. 321–341.
Fu, W. and J. M. Akey (2013). “Selection and Adaptation in the Human Genome”. In: Annual Review of Genomics and Human Genetics 14 (1), pp. 467–489.
Garrigan, D. and P. W. Hedrick (2003). “Detecting adaptive molecular polymorphism: Lessons from the MHC”. In: Evolution 57 (8), pp. 1707–1722.
Gillespie, J. H. (1991). The causes of molecular evolution. Oxford: Oxford University Press. ISBN: 0-19-509271-6.
Gillespie, J. H. and C. Langley (1974). “A general model to account for enzyme variation in natural populations”. In: Genetics 76 (4), pp. 837–48.
Gloss, A. D. and N. K. Whiteman (2016). “Balancing Selection: Walking a Tightrope”. In: Current Biology 26 (2), R73–R76.
Gravel, S. (2016). “When Is Selection Effective?” In: Genetics 203 (1), pp. 451–462.
Gravel, S., B. M. Henn, R. N. Gutenkunst, A. R. Indap, G. T. Marth, A. G. Clark, F. Yu, R. A. Gibbs and C. D. Bustamante (2011). “Demographic history and rare allele sharing among human populations.” In: Proceedings of the National Academy of Sciences of the United States of America 108 (29), pp. 11983–8.
Haldane, J. (1937). “The Effect of Variation on Fitness”. In: The American Naturalist 71 (735), pp. 337–349.
Harris, E. E. (2010). “Nonadaptive processes in primate and human evolution.” In: American Journal of Physical Anthropology 143 Suppl, pp. 13–45.
Harris, E. E. and D. Meyer (2006). “The Molecular Signature of Selection Underlying Human Adaptations”. In: Yearbook of Physical Anthropology 130, pp. 89–130.

Haygood, R., C. C. Babbitt, O. Fedrigo and G. A. Wray (2010). “Contrasts between adaptive coding and noncoding changes during human evolution”. In: Proceedings of the National Academy of Sciences of the United States of America 107 (17), pp. 7853–7857.
Hedrick, P. W. (2006). “Genetic Polymorphism in Heterogeneous Environments: The Age of Genomics”. In: Annual Review of Ecology, Evolution, and Systematics 37, pp. 67–93.
Hedrick, P. W. (2012). “What is the evidence for heterozygote advantage selection?” In: Trends in Ecology & Evolution 27 (12), pp. 698–704.
Hellmann, I., I. Ebersberger, S. E. Ptak, S. Pääbo and M. Przeworski (2003). “A neutral explanation for the correlation of diversity with recombination rates in humans.” In: American Journal of Human Genetics 72 (6), pp. 1527–35.
Hill, W. G. and A. Robertson (1966). “The effect of linkage on limits to artificial selection”. In: Genetical Research 8 (03), p. 269.
Hudson, R. R., M. Kreitman and M. Aguade (1987). “A Test of Neutral Molecular Evolution Based on Nucleotide Data”. In: Genetics 116 (1), pp. 153–159.
Hughes, A. L. and M. Nei (1989). “Nucleotide substitution at major histocompatibility complex class II loci: evidence for overdominant selection”. In: Proceedings of the National Academy of Sciences of the United States of America 86 (3), pp. 958–962.
Hughes, A. L. and M. Nei (1988). “Pattern of nucleotide substitution at major histocompatibility class I loci reveals overdominant selection”. In: Letters to Nature 335 (8), pp. 167–170.
Innocenti, P. and E. H. Morrow (2010). “The sexually antagonistic genes of Drosophila melanogaster”. In: PLoS Biology 8 (3), e1000335.
Jablonski, N. G. and G. Chaplin (2010). “Human skin pigmentation as an adaptation to UV radiation”. In: Proceedings of the National Academy of Sciences 107 (Supplement_2), pp. 8962–8968.
Key, F. M., J. C. Teixeira, C. de Filippo and A. M. Andrés (2014). “Advantageous diversity maintained by balancing selection in humans”. In: Current Opinion in Genetics & Development 29, pp. 45–51.
Kimura, M. (1991). “The neutral theory of molecular evolution: a review of recent evidence”. In: Japanese Journal of Genetics 66 (4), pp. 367–386.
Kimura, M. (1968). “Evolutionary rate at the molecular level”. In: Nature 217, pp. 624–626.

Kimura, M. (1983). The Neutral Theory of Molecular Evolution. Cambridge: Cambridge University Press. ISBN: 9780511623486. URL: http://ebooks.cambridge.org/ref/id/CBO9780511623486.
Kimura, M. and J. F. Crow (1963). “The Measurement of Effective Population Number”. In: Evolution 17 (3), pp. 279–288.
Kimura, M. and J. F. Crow (1964). “The Number of Alleles that Can Be Maintained in a Finite Population”. In: Genetics 49, pp. 725–738.
Klein, J., A. Sato, S. Nagl and C. O’hUigin (1998). “Molecular trans-species polymorphism”. In: Annual Review of Ecology and Systematics 29, pp. 1–21.
Kreitman, M. and A. Di Rienzo (2004). “Balancing claims for balancing selection”. In: Trends in Genetics 20 (7), pp. 300–304.
Lande, R. (1975). “The maintenance of genetic variability by mutation in a polygenic character with linked loci”. In: Genetical Research 26 (3), pp. 221–35.
Leffler, E. M. et al. (2013). “Multiple Instances of Ancient Balancing Selection Shared Between Humans and Chimpanzees”. In: Science 339 (6127), pp. 1578–1582.
Lenz, T. L., V. Spirin, D. M. Jordan and S. R. Sunyaev (2016). “Excess of Deleterious Mutations around HLA Genes Reveals Evolutionary Cost of Balancing Selection”. In: bioRxiv, pp. 1–30.
Levene, H. (1953). “Genetic Equilibrium When More Than One Ecological Niche is Available”. In: The American Naturalist 87 (836), pp. 331–333.
Lewontin, R. C. and J. L. Hubby (1966). “A Molecular Approach to the Study of Genic Heterozygosity in Natural Populations. II. Amount of Variation and Degree of Heterozygosity in Natural Populations of Drosophila pseudoobscura”. In: Genetics 54 (2), pp. 595–609.
Lynch, M. (2007). “The evolution of genetic networks by non-adaptive processes.” In: Nature Reviews Genetics 8 (10), pp. 803–13.
Maynard-Smith, J. and J. Haigh (1974). “The hitch-hiking effect of a favorable gene.” In: Genetical Research (23), pp. 23–35.
McDonald, J. H. and M. Kreitman (1991). “Adaptive protein evolution at the Adh locus in Drosophila.” In: Nature 351 (6328), pp. 652–4.
Mendes, F. (2013). Natural selection on HLA and its effects on adjacent regions of the genome. Tech. rep. Universidade de São Paulo. URL: http://www.teses.usp.br/teses/disponiveis/41/41131/tde-02082013-161104/pt-br.php.

Meyer, D., R. M. Single, S. J. Mack, H. A. Erlich and G. Thomson (2006). “Signatures of demographic history and natural selection in the human major histocompatibility complex Loci.” In: Genetics 173 (4), pp. 2121–2142.
Mitchell-Olds, T., J. H. Willis and D. B. Goldstein (2007). “Which evolutionary processes influence natural genetic variation for phenotypic traits?” In: Nature Reviews Genetics 8 (11), pp. 845–856.
Muller, H. J. (1950). “Our Load of Mutations”. In: The American Journal of Human Genetics 2 (2), pp. 111–176.
Nielsen, R. (2005). “Molecular Signatures of Natural Selection”. In: Annual Review of Genetics 39 (1), pp. 197–218.
Nielsen, R. et al. (2005). “A Scan for Positively Selected Genes in the Genomes of Humans and Chimpanzees”. In: PLoS Biology 3 (6), e170.
Ohta, T. (1973). “Slightly Deleterious Mutant Substitutions in Evolution”. In: Nature 246 (5428), pp. 96–98.
Ohta, T. (1995). “Synonymous and nonsynonymous substitutions in mammalian genes and the nearly neutral theory”. In: Journal of Molecular Evolution 40, pp. 56–63.
Ohta, T. and J. H. Gillespie (1996). “Development of Neutral and Nearly Neutral Theories”. In: Theoretical Population Biology 49 (2), pp. 128–142.
Oosterhout, C. van (2009). “A new theory of MHC evolution: beyond selection on the immune genes.” In: Proceedings of the Royal Society of London. Series B, Biological Sciences 276 (1657), pp. 657–65.
Orr, H. A. (1998). “The population genetics of adaptation: the distribution of factors fixed during adaptive evolution.” In: Evolution 52 (4), pp. 935–949.
Orr, H. A. (2005). “The genetic theory of adaptation: a brief history.” In: Nature Reviews Genetics 6 (2), pp. 119–27.
Prout, T. (1968). “Sufficient Conditions for Multiple Niche Polymorphism”. In: The American Naturalist 102 (928), pp. 493–496.
Prout, T. (2000). “How well does opposing selection maintain variation?” In: Evolutionary genetics: from molecules to morphology. Cambridge: Cambridge University Press, pp. 157–181.

Prugnolle, F., A. Manica, M. Charpentier, J. F. Guégan, V. Guernier and F. Balloux (2005). “Pathogen-driven selection and worldwide HLA class I diversity.” In: Current Biology 15 (11), pp. 1022–7.
Richman, A. (2000). “Evolution of balanced genetic polymorphism”. In: Molecular Ecology 9 (12), pp. 1953–1963.
Sabeti, P. C., S. F. Schaffner, B. Fry, J. Lohmueller, P. Varilly, O. Shamovsky, A. Palma, T. S. Mikkelsen, D. Altshuler and E. S. Lander (2006). “Positive natural selection in the human lineage.” In: Science 312 (5780), pp. 1614–20.
Sabeti, P. C. et al. (2007). “Genome-wide detection and characterization of positive selection in human populations.” In: Nature 449 (7164), pp. 913–8.
Ségurel, L., Z. Gao and M. Przeworski (2013). “Ancestry runs deeper than blood: The evolutionary history of ABO points to cryptic variation of functional importance”. In: BioEssays 35 (10), pp. 862–867.
Sella, G., D. A. Petrov, M. Przeworski and P. Andolfatto (2009). “Pervasive Natural Selection in the Drosophila Genome?” In: PLoS Genetics 5 (6).
Sellis, D., B. J. Callahan, D. A. Petrov and P. W. Messer (2011). “Heterozygote advantage as a natural consequence of adaptation in diploids”. In: Proceedings of the National Academy of Sciences 108 (51), pp. 20666–20671.
Slade, R. and H. McCallum (1992). “Overdominant vs. frequency-dependent selection at MHC loci.” In: Genetics 132, pp. 861–864.
Spurgin, L. G. and D. S. Richardson (2010). “How pathogens drive genetic diversity: MHC, mechanisms and misunderstandings.” In: Proceedings. Biological sciences / The Royal Society 277 (1684), pp. 979–88.
Tajima, F. (1989). “Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.” In: Genetics 123 (3), pp. 585–595.
Teixeira, J. C. et al. (2015). “Long-Term Balancing Selection in LAD1 Maintains a Missense Trans-Species Polymorphism in Humans, Chimpanzees, and Bonobos”. In: Molecular Biology and Evolution 32 (5), pp. 1186–1196.
Tishkoff, S. A. and S. M. Williams (2002). “Genetic analysis of African populations: human evolution and complex disease.” In: Nature Reviews Genetics 3 (8), pp. 611–621.

Trachtenberg, E. et al. (2003). “Advantage of rare HLA supertype in HIV disease progression”. In: Nature Medicine 9, pp. 928–935.
Vallender, E. J. and W. E. Johnson (2008). “Balancing Selection in Human Evolution”. In: eLS.
Watterson, G. A. (1978). “The homozygosity test of neutrality.” In: Genetics 88 (2), pp. 405–17.
Williams, G. C. (1957). “Pleiotropy, Natural Selection, and the Evolution of Senescence”. In: Evolution 11 (4), p. 398.
Wright, S. (1937). “The Distribution of Gene Frequencies in Populations.” In: Proceedings of the National Academy of Sciences 23 (6), pp. 307–320.
Yang, Z. and W. J. Swanson (2002). “Codon-Substitution Models to Detect Adaptive Evolution that Account for Heterogeneous Selective Pressures Among Site Classes”. In: Molecular Biology and Evolution 19 (1), pp. 49–57.
Zhang, Z. and J. Parsch (2005). “Positive correlation between evolutionary rate and recombination rate in Drosophila genes with male-biased expression.” In: Molecular Biology and Evolution 22 (10), pp. 1945–7.

Chapter 1

Searching for targets of balancing selection in the human genome

Preliminary Remarks

In this chapter I present a manuscript, currently under final review by the co-authors, in which we develop a new statistic for detecting instances of balancing selection in the human genome. It directly quantifies the two main signatures of balancing selection regimes acting over long timescales: an excess of alleles segregating at intermediate frequencies and an excess of polymorphic sites relative to expectations under a null model.
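The idea of a statistic that measures how far allele frequencies in a window sit from an intermediate value can be sketched as a root-mean-square deviation from a target frequency. This is only an illustration suggested by the name Non-Central Deviation: the formula, the function name `ncd_sketch` and the `target_freq` parameter are assumptions made here, and the definition actually used is the one given in the manuscript.

```python
import math


def ncd_sketch(minor_allele_freqs, target_freq=0.5):
    """Root-mean-square deviation of site frequencies from a target frequency.

    minor_allele_freqs -- minor allele frequencies (0 to 0.5) of the sites in
                          one genomic window
    target_freq        -- hypothetical equilibrium frequency expected under
                          balancing selection

    Lower values mean that frequencies in the window cluster around the target.
    """
    if not minor_allele_freqs:
        raise ValueError("the window contains no informative sites")
    squared_deviations = [(freq - target_freq) ** 2 for freq in minor_allele_freqs]
    return math.sqrt(sum(squared_deviations) / len(squared_deviations))


if __name__ == "__main__":
    # Made-up frequencies, for illustration only.
    intermediate_window = [0.45, 0.50, 0.40]
    rare_variant_window = [0.01, 0.05, 0.02]
    print(ncd_sketch(intermediate_window), ncd_sketch(rare_variant_window))
```

One hypothetical way to make such a measure also reflect polymorphism density would be to enter sites that are fixed differences relative to an outgroup with frequency 0, so that windows with many polymorphisms per fixed difference score lower; whether and how this is done is a matter for the manuscript itself.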

About one third of the genes we detected with this new statistic have prior evidence of balancing selection, based on methods and data quite different from ours. However, we also describe more than 150 new candidate genes, as well as candidate non-coding regions and the properties of these regions.

Our method has greater power than others described in the literature, is extremely simple to implement and interpret, and runs quickly. Combined with careful quality control of the data used and verification of the candidate regions obtained, we believe we have provided an extremely reliable map of the extent of signatures of balancing selection in the human genome. With this work, we contribute to the (not very extensive) literature on balancing selection in humans and propose a method with high statistical power that, in principle, can be used in similar approaches for other species.

This work was done in collaboration with the researcher Aida M. Andrés, of the Max Planck Institute for Evolutionary Anthropology (MPI-EVA, Leipzig), who conceived the idea for the new method. The work began in 2013, during my research stay abroad (sandwich doctorate), and was co-supervised by Diogo Meyer and A.M.A. I also had the collaboration of Cesare de Filippo (postdoctoral researcher, MPI-EVA) and João C. Teixeira (doctoral student, MPI-EVA). J.C.T. performed part of the enrichment analyses for the candidate regions, and C.F. collaborated on the simulations used to evaluate the statistic and on the implementation of the scan itself.

The manuscript was written by me together with A.M.A. and D.M., and all authors contributed comments on the text. It will be submitted to the journal PLoS Genetics. All supplementary material cited in the text is provided at the end of the chapter.


Uncovering targets of balancing selection in the human genome

Bárbara Domingues Bitarello1, Cesare de Filippo2, João Teixeira2, Diogo Meyer1*, and Aida M. Andrés2*

* co-supervised the study

1 Universidade de São Paulo, São Paulo, Brazil

2 Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany

Introduction

Balancing selection refers to a class of selective mechanisms that maintain advantageous genetic diversity in populations. Although perhaps not a pervasive form of natural selection, balancing selection maintains genetic diversity with phenotypic relevance. For example, decades of research have established HLA genes as a prime example of balancing selection (Meyer and Thomson, 2001; Spurgin and Richardson, 2010) with thousands of alleles segregating in humans, extensive support for the functional relevance of polymorphism (e.g., Hedrick et al., 1991; Prugnolle et al., 2005) and various well-documented cases of association between selected alleles and disease resistance and susceptibility (e.g., Raychaudhuri et al., 2012; Howell, 2014).

The catalog of well-understood non-HLA targets of balancing selection remains small, but the genes identified are associated with phenotypes such as auto-immune diseases (Raychaudhuri et al., 2012), malaria resistance (Malaria Genomic Epidemiology Network, 2015), resistance to HIV infection (Biasin et al., 2007) and polycystic ovary syndrome (Day et al., 2015). Thus, the relevance of balanced polymorphisms is not restricted to their historical influence on individual fitness: they also shape, today, human phenotypic diversity and susceptibility to disease.

Balancing selection encompasses several mechanisms (Charlesworth and Charlesworth, 2010; Clarke, 1962; Fijarczyk and Babik, 2015; reviewed in Andrés, 2011; Key et al., 2014b). These include heterozygote advantage (or overdominance) (Andrés, 2011; Key et al., 2014b; Fijarczyk and Babik, 2015), frequency-dependent selection, including rare allele advantage (Clarke, 1962; Charlesworth and Charlesworth, 2010), selective pressures that fluctuate in time (Andrés, 2011; Bergland et al., 2014; Fijarczyk and Babik, 2015) or in space in panmictic populations (Andrés, 2011; Charlesworth et al., 1997; Charlesworth, 2006; Fijarczyk and Babik, 2015; Key et al., 2014b) and cases of pleiotropy (Johnston et al., 2013). For some mechanisms, including overdominance, pleiotropy, and some instances of selection that varies in space, a stable equilibrium can be reached (Charlesworth and Charlesworth, 2010). For other mechanisms the frequency of the selected allele can change in time with no theoretical equilibrium frequency, although the frequency of the balanced polymorphism will be strongly affected by the selective process.

Regardless of the mechanism, balancing selection can increase genetic variation with respect to neutral expectations and has the potential to leave identifiable signatures in genomic data. These include local site-frequency spectra with an excess of alleles close to the frequency of the balanced allele and, when selection is old enough, an excess of polymorphisms relative to divergence (reviewed in Key et al., 2014b). In some cases, very ancient balancing selection can maintain trans-species polymorphisms in sister species (Leffler et al., 2013; Teixeira et al., 2015), while recent balancing selection or selection that is transient (e.g., that predicted in the model of Sellis et al., 2011) will result in signatures that are probably difficult to distinguish from incomplete, recent positive selection sweeps (Key et al., 2014b).

While balancing selection has been extensively explored from a theoretical perspective, an empirical understanding of its prevalence in the human genome lags behind our knowledge of positive selection. This stems from technical difficulties in detecting balancing selection, as well as the perception that balancing selection may be rare (Hedrick, 2012). In fact, few methods have been developed to identify its targets, and only a handful of studies have sought to uncover targets of balancing selection genome-wide (Andrés et al., 2009; Alonso et al., 2008; Asthana et al., 2005; Bubb et al., 2006; Leffler et al., 2013; DeGiorgio et al., 2014; Rasmussen et al., 2014; Teixeira et al., 2015), with different methods and datasets. Andrés et al. (2009) and DeGiorgio et al. (2014) identified, with different approaches, genes (Andrés et al., 2009) or genomic regions (DeGiorgio et al., 2014) with an excess of polymorphism and with site-frequency spectra showing an excess of intermediate frequency alleles.

Leffler et al. (2013) and Teixeira et al. (2015) identified trans-species polymorphisms between humans and other primates. Overall, these studies suggested that balancing selection may act on a relatively small portion of the genome, although the limited extent of the data available (e.g., exome data in Andrés et al., 2009 and small sample size in DeGiorgio et al., 2014) and the stringency of the criteria (e.g., balanced polymorphisms that pre-date human-chimpanzee divergence in Leffler et al., 2013 and Teixeira et al., 2015) may underlie the paucity of targets detected.

Here, we developed two new test statistics that summarize, directly and in a simple way, the degree to which allele frequencies in a genomic region deviate from the frequencies expected under balancing selection. Through extensive simulations, we showed that one of our methods outperforms existing methods for realistic demographic scenarios for human populations. We applied our statistic to the genome-wide 1000 Genomes Project (Abecasis et al., 2012) data in four human populations and used both outlier and simulation-based cut-offs to identify both known and new genomic regions that have evolved under long-term balancing selection.


Results

NCD Method

Background. Owing to linkage, the signature of long-term balancing selection (LTBS) on a site extends to the genetic neighborhood of the selected variant(s), so the patterns of polymorphism and divergence in a genomic region can be used to infer whether it evolved under balancing selection (Charlesworth, 2006; Andrés, 2011). LTBS leaves two distinctive signatures in linked variation, when compared with neutral expectations. The first is an increase in the ratio of polymorphic to divergent sites. This occurs because, by reducing the probability of fixation, balancing selection increases the local TMRCA (Hudson and Kaplan, 1988). One commonly used test to detect this signature is the HKA test (Hudson et al., 1987).

The second signature is an excess of alleles segregating at intermediate frequencies. In humans, the folded site frequency spectrum (SFS), which is the distribution of the frequency of the minor alleles (MAF) regardless of whether they are ancestral or derived, is typically L-shaped, showing an excess of low-frequency alleles when compared to expectations under neutrality and demographic equilibrium. This is a consequence of recent population expansions (e.g., Coventry et al., 2010), with the abundance of rare alleles further increased by purifying selection and recent selective sweeps. On the other hand, regions under LTBS are expected to show a markedly different SFS, with proportionally more alleles at intermediate frequency (Fig 1A-B). Such a deviation in the SFS is the signature identified by classical neutrality tests, such as Tajima's D (Tajima, 1989), and newer statistics such as MWU-high (Nielsen et al., 2009).
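The folding step described above can be sketched in a few lines of Python. This is only an illustration: the function names `fold_frequency` and `folded_sfs` are hypothetical and are not part of the original analysis pipeline. The sketch assumes bi-allelic sites given as derived allele counts in a sample of `n_chromosomes` chromosomes.

```python
from collections import Counter


def fold_frequency(daf):
    """Return the minor allele frequency (MAF) for a derived allele frequency."""
    if not 0.0 <= daf <= 1.0:
        raise ValueError("frequency must lie in [0, 1]")
    return min(daf, 1.0 - daf)


def folded_sfs(derived_counts, n_chromosomes):
    """Count SNPs per minor-allele count, ignoring sites that are not polymorphic."""
    sfs = Counter()
    for count in derived_counts:
        if 0 < count < n_chromosomes:  # skip monomorphic and fixed sites
            sfs[min(count, n_chromosomes - count)] += 1
    return dict(sfs)
```

Folding discards the ancestral/derived distinction, so a derived allele at frequency 0.8 and one at 0.2 fall in the same class.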

Figure 1. Schema for NCD statistics definition. (A) Schematic representation of distributions of derived allele frequencies (unfolded SFS) expected for loci under neutrality (grey) and for loci containing one site under balancing selection with frequency equilibrium of 0.5 (blue), 0.4 (orange) and 0.3 (pink). DAF is the derived allele frequency, ranging from 0 to 1. (B) Schematic representations of distributions of minor allele frequencies (folded SFS), ranging from 0 to 0.5. Colors as in A. (C) Schematic representation of density plots of the distribution of NCD expected under neutrality (grey) and under selection (following the feq values given in (A)).

The signatures of LTBS on the SFS will depend on the selective regime and the intensity of selection on each genotype. For example, under overdominance the frequency equilibrium depends on the relative fitness of each genotype (Charlesworth, 2006; Charlesworth and Charlesworth, 2010; Fijarczyk and Babik, 2015). Given selection coefficients s and t against the AA and BB homozygotes, respectively, the deterministic frequency equilibrium (feq) is given by:

\[
f_{eqA} = \frac{s}{s + t} \tag{1}
\]

With symmetric overdominance (s = t), feq = 0.5. With asymmetric overdominance (t ≠ s), which might be more prevalent in natural systems (Hedrick, 2012), it follows that feq ≠ 0.5. A classic example of asymmetric overdominance is the case of β-globin and sickle cell anemia, where in regions of endemic malaria the fitness of the HbA homozygote for the β-globin locus is approximately 9 times higher than that of the HbS homozygote, with the resulting equilibrium frequency of the HbS allele being 0.13 (Allison and Clyde, 1961). Under frequency-dependent selection, feq will depend on the frequency of the favored allele. Under fluctuating selection the frequency of the selected allele will depend on the temporal and spatial scales of selection (Andrés, 2011; Clarke, 1964; Pasvol et al., 1978) and although no stable, long-term frequency equilibrium may be reached, the balanced polymorphism may be actively maintained (as long as the heterozygote fitness exceeds that of homozygotes in their harmonic and geometric means, for spatial and temporal models, respectively) (reviewed in Hedrick, 2006). In these cases, feq can be thought of as the frequency, at the time of sampling, of the balanced polymorphism.

Non-Central Deviation (NCD). In the tradition of neutrality tests that analyze directly the SFS (e.g., Nielsen et al., 2005; Nielsen et al., 2009; Williamson et al., 2007), we propose two related test statistics that explore the abundance and frequency of polymorphisms in a given locus. Both tests measure a “Non-Central Deviation” (NCD), which we define as the degree to which the local SFS deviates from a pre-specified allele frequency (the target frequency, tf). Under a model of balancing selection, tf can be thought of as the deterministic frequency that would be attained given the selection parameters, with the NCD statistic querying how far SNP frequencies are from it. We propose two implementations for this statistic: NCD1 and NCD2. The NCD1 statistic is based solely on the SFS, using information on the allelic frequency, p_i, of each site in a locus:

\[
\mathrm{NCD1}_{\mathit{tf}} = \sqrt{\frac{\sum_{i=1}^{n} (p_i - \mathit{tf})^2}{n}} \tag{2}
\]

where i = 1, 2, 3, ..., n indexes the polymorphisms, p_i is the MAF of the i-th polymorphism, and tf is the target frequency with respect to which the deviations of the observed allele frequencies are computed. Thus, NCD1 is a non-central standard deviation that quantifies the dispersion of allelic frequencies from tf, rather than from the mean of the distribution. Because the frequencies of alleles at bi-allelic loci are complementary, and under balancing selection there is no prior expectation on the ancestral or the derived allele being maintained at higher frequency, we use the folded SFS (Fig 1). The minimum amount of data required for calculating NCD1 is one polymorphism, and for simplicity we consider only bi-allelic SNPs.
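As a minimal sketch of Equation 2 (not the implementation used for the scan), NCD1 can be computed directly from the list of minor allele frequencies in a window. The function name `ncd1` and its argument names are illustrative assumptions.

```python
import math


def ncd1(mafs, tf):
    """NCD1 for one window: root mean squared deviation of SNP MAFs from tf.

    mafs: minor allele frequencies of the bi-allelic SNPs in the window.
    tf:   target frequency (e.g. 0.5, 0.4 or 0.3).
    """
    if not mafs:
        raise ValueError("NCD1 requires at least one polymorphism")
    return math.sqrt(sum((p - tf) ** 2 for p in mafs) / len(mafs))
```

For example, a window whose SNPs all segregate at the target frequency gives `ncd1([0.5, 0.5], 0.5) == 0.0`, whereas a window dominated by rare alleles gives a value close to tf.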

The NCD2 statistic is an extension of NCD1 that includes information not only on the frequency of polymorphisms, but also on the number of fixed differences (FDs):

\[
\mathrm{NCD2}_{\mathit{tf}} = \sqrt{\frac{n_{fd}\,(0 - \mathit{tf})^2 + \sum_{i=1}^{n} (p_i - \mathit{tf})^2}{n_{fd} + n}} \tag{3}
\]

where nfd is the number of FDs in the locus. In NCD2, all informative sites (IS = SNPs + FDs) are taken into account. FDs can be considered informative sites with a minor allele frequency (MAF) of 0, and as such they contribute to the deviation from tf: the greater the number of fixed differences, the larger the NCD2 value and hence the weaker the support for LTBS. The minimum amount of data required for calculating NCD2 is one informative site, and for simplicity only bi-allelic SNPs and single nucleotide FDs are considered.
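Equation 3 can be sketched in the same way. This is an illustrative function only (the name `ncd2` and its arguments are assumptions): each fixed difference enters the sum as a site with MAF 0.

```python
import math


def ncd2(mafs, n_fd, tf):
    """NCD2 for one window: deviation from tf over SNPs and fixed differences.

    mafs: minor allele frequencies of the SNPs in the window.
    n_fd: number of fixed differences with the outgroup in the window.
    tf:   target frequency.
    """
    n_informative = len(mafs) + n_fd
    if n_informative == 0:
        raise ValueError("NCD2 requires at least one informative site")
    squared_dev = sum((p - tf) ** 2 for p in mafs) + n_fd * (0.0 - tf) ** 2
    return math.sqrt(squared_dev / n_informative)
```

With the same SNPs, adding fixed differences can only keep or increase the value, which is how NCD2 penalizes windows with high divergence relative to polymorphism.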

From equations 2 and 3 it follows that the maximum value of NCD2_tf for a given tf is the target frequency itself (i.e., no SNPs and one or more FDs in the locus, as in S1 Fig), and that for NCD1_tf the maximum value approaches, but never reaches, tf when all SNPs are singletons. The minimum value for both NCD1_tf and NCD2_tf is 0, when all SNPs segregate at tf and, in the case of NCD2_tf, there are no FDs (S2 Fig). Thus, low NCD1 and NCD2 values reflect a low deviation of the SFS from the pre-defined target frequency, which is expected in windows containing sites under LTBS (Fig 1C).
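These bounds can be checked by substituting the extreme cases into the two definitions. The short derivation below is a sketch; the sample size of 2N chromosomes is notation introduced here for illustration and does not appear in the original text.

\[
n = 0,\; n_{fd} \geq 1: \quad \mathrm{NCD2}_{\mathit{tf}} = \sqrt{\frac{n_{fd}\,(0 - \mathit{tf})^2}{n_{fd}}} = \mathit{tf}
\]

\[
p_i = \tfrac{1}{2N} \text{ for all } i: \quad \mathrm{NCD1}_{\mathit{tf}} = \sqrt{\frac{n\,(\tfrac{1}{2N} - \mathit{tf})^2}{n}} = \mathit{tf} - \tfrac{1}{2N} < \mathit{tf} \quad (\text{for } \mathit{tf} \geq \tfrac{1}{2N})
\]

\[
p_i = \mathit{tf} \text{ for all } i,\; n_{fd} = 0: \quad \mathrm{NCD1}_{\mathit{tf}} = \mathrm{NCD2}_{\mathit{tf}} = 0
\]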

Power of the NCD statistics to detect LTBS

We evaluated the specificity and sensitivity of NCD1 and NCD2 by benchmarking their performance using simulations. Specifically, we considered demographic scenarios inferred for African, European, and Asian human populations (Fig 2), and simulated both neutrality and balancing selection using a model of heterozygote advantage (see Methods). We then explored the influence of the parameters that may affect the power of the NCD statistics: time since the onset of balancing selection (Tbs), the deterministic frequency equilibrium defined by the selection coefficients (feq), the demographic history of the sampled population, the chosen target frequency in NCD calculation (both for cases in which feq does and does not match tf), the length of the genomic region analyzed (L), and the implementation of NCD (NCD1 or NCD2). Box 1 summarizes the nomenclature used throughout the text.


BOX 1. List of Abbreviations

LTBS, long-term balancing selection.

MAF, minor allele frequency.

SFS, site-frequency spectrum.

FD, fixed difference (between two species).

IS, informative sites (number of polymorphic sites in the ingroup species plus the number of fixed differences between ingroup and outgroup species).

feq, deterministic equilibrium frequency achieved by site(s) under balancing selection as defined by the selection coefficients.

tf, target frequency, i.e., the frequency used in the NCD statistics as the value to which queried allele frequencies are compared.

NCD statistics, non-central deviation statistics, with two implementations.

NCD1, NCD statistic that measures the average distance between polymorphic allele frequencies and a pre-determined frequency (tf). NCD1_tf is NCD1 for that given tf.

NCD2, NCD statistic that measures the average distance between allelic frequencies and a pre-determined frequency (tf) considering both polymorphisms and fixed differences with an outgroup. NCD2_tf is NCD2 for that given tf.

NCD_tf refers to the average of NCD1_tf and NCD2_tf.

Figure 2. Human demographic model and parameters used in simulations. For all simulations, the human demographic model is the one inferred in Gravel et al. (2011), including migration rates, population split times, and effective population sizes (Ne). Divergence with chimpanzees was added to this model. g, generations; T, time in generations since different events: the split between human and chimpanzee lineages (Tdiv), the population growth in Africa (Tgaf), the out-of-Africa migration (Tooa), and the European-Asian split (Teuas); N refers to Ne of different populations: the ancestral population (Nanc), the chimpanzee population (Nc), the ancestral human population (Nh), the African population after Tgaf population growth (Nafr), the Eurasian ancestral population (Nooa), the European population (Neur) and the Asian population; r are the growth rates of Asian and European populations. Tbs is the time (in millions of years) since onset of balancing selection, and feq the frequency equilibrium of the balanced polymorphism.

For simplicity we present power values (always at a false positive rate of 5%) averaged across NCD implementations (NCD_tf being the average of NCD1_tf and NCD2_tf), demographic models and sequence lengths. These averages are helpful in that they reflect the general changes in power when changing individual parameters. Nevertheless, because they often include conditions for which power is low, the averages underestimate the power that the test can reach under a given parameter. The full matrix of power results for each condition is presented in S1 Table, and some key points are discussed below.
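One common way to obtain power at a fixed false positive rate from simulations is sketched below. The exact procedure used for S1 Table is described in the Methods and is not reproduced here, so the cut-off rule in this sketch (the k-th lowest neutral value, with k = 5% of the neutral simulations) is an assumption, as is the function name `power_at_fpr`. Low NCD values are taken as evidence for LTBS.

```python
import math


def power_at_fpr(neutral_values, selected_values, fpr=0.05):
    """Fraction of selection simulations at or below the neutral cut-off.

    The cut-off is the k-th lowest neutral value, where k = floor(fpr * n),
    so that about a fraction `fpr` of neutral simulations would be called
    significant.
    """
    ranked = sorted(neutral_values)
    k = math.floor(fpr * len(ranked) + 1e-9)  # guard against float round-off
    if k == 0:
        raise ValueError("too few neutral simulations for this false positive rate")
    cutoff = ranked[k - 1]
    hits = sum(value <= cutoff for value in selected_values)
    return hits / len(selected_values)
```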

Time since the onset of balancing selection and sequence length. The signatures of LTBS are expected to be stronger the longer the time since the onset of balancing selection, because there will have been more time for linked mutations to accumulate and reach intermediate frequencies. We simulated sequences with a balanced polymorphism with Tbs of 1, 3, and 5 million years (myr) (Fig 2). For simplicity, in this section we consider only cases where tf = feq, although this condition is relaxed in later sections.

For both NCD1_0.5 and NCD2_0.5 (feq = 0.5), power to detect balancing selection with Tbs = 1 myr is low across all scenarios and for all tf (always lower than 0.43, S1 Table). Nevertheless, power to identify older balanced polymorphisms is high, for all tf, for both 3 myr (e.g., average NCD_0.5 is 0.70) and 5 myr (average NCD_0.5 is 0.77) (S3-S8 Figs, S1 Table). We thus focus exclusively on long-term balancing selection: 3 and 5 myr.

Tbs also affects the length of the region bearing the signature of balancing selection, as in the absence of epistasis the long-term effects of recombination result in narrower signatures with older Tbs (Leffler et al., 2013; Teixeira et al., 2015). This is indeed the case for all tf (S3-S8 Figs, S1 Table). For example, NCD_0.5 at Tbs = 5 myr resulted in average power of 0.78, 0.76, and 0.67 for 3, 6, and 12 Kb, respectively (S3-S8 Figs, S1 Table). A similar pattern emerges for NCD_0.4 (21% reduction in power for 12 Kb compared to 3 Kb) and NCD_0.3 (25% reduction in power for 12 Kb compared to 3 Kb) (S1 Table; S3-S8 Figs). For NCD1 the power increment for shorter regions was less pronounced than for NCD2 (S1 Table), perhaps due to the lower number of informative sites.

In summary, the power of the NCD statistics grows with the age of the balanced polymorphism and the narrowness of the analyzed window. These analyses suggest that the NCD statistics are well powered to detect balancing selection that started at least 3 myr ago in windows of 3 Kb centered on the selected site (S1 Table), and we therefore do not include 1 myr results in the remainder of the discussion.

Demography. Power is similar for samples simulated under the African (average NCD_0.5 of 0.86) and European (average NCD_0.5 of 0.87) demographic scenarios for both NCD1_0.5 and NCD2_0.5, and drastically lower for a population under the demographic model for Asians (average NCD_0.5 of 0.48; S3-S8 Figs, and S1 Table). Similar trends are observed for NCD_0.4 (75% average reduction in Asia when compared with Africa) and NCD_0.3 (92% reduction). One explanation for the lower power under the Asian demographic model is the stronger effect of random genetic drift in this population due to its lower Ne (Gutenkunst et al., 2009; Gravel et al., 2011), which affects both the SFS of neutral loci (putatively increasing the proportion of alleles at intermediate frequency) and those under balancing selection (reducing the efficacy of selection and putatively increasing the dispersion from the balanced frequency equilibrium). We thus focused our subsequent analyses on the African and European populations, for which power was high and comparable (thus allowing fair comparisons between these geographic regions).

Simulated and target frequencies. So far we discussed only cases where tf = feq, which is expected to favor the performance of the method. In this case the NCD statistics have high power: on average 0.86, 0.79, and 0.70 for feq = 0.5, 0.4, and 0.3, respectively (S1 Table). Selection with feq = 0.2 resulted in low power across all parameters and tf values (S3-S8 Figs), so we do not further explore this target frequency.

In practice, though, one does not have a priori knowledge about the equilibrium frequency of balanced polymorphisms. We thus explored the power of NCD when the simulated equilibrium and the target frequencies differ. The power to detect LTBS is very high for NCD2_0.5 and NCD1_0.5, even when selection is simulated with other feq values (average NCD_0.5 of 0.79, Table 1, S3-S8 Figs, and S1 Table) and similarly high for NCD_0.4 (average 0.78) and NCD_0.3 (average 0.70) (S1 Table).

Table 1. Power for simulations under the African and European demographic models. Tbs, time since onset of balancing selection (in millions of years); feq, frequency equilibrium in the simulations. Power values are for a false positive rate of 0.05, for simulations of the African and European demographic scenarios, L = 3 Kb. Column headings give the target frequency (tf).

                Africa                              Europe
            NCD2 (tf)        NCD1 (tf)        NCD2 (tf)        NCD1 (tf)
Tbs  feq    0.5  0.4  0.3    0.5  0.4  0.3    0.5  0.4  0.3    0.5  0.4  0.3
5    0.5    0.96 0.94 0.84   0.93 0.91 0.39   0.97 0.95 0.83   0.92 0.85 0.20
5    0.4    0.94 0.94 0.89   0.89 0.89 0.67   0.95 0.94 0.91   0.85 0.82 0.59
5    0.3    0.90 0.91 0.93   0.72 0.80 0.84   0.84 0.85 0.89   0.47 0.57 0.74
3    0.5    0.91 0.88 0.68   0.86 0.80 0.24   0.93 0.89 0.68   0.81 0.69 0.14
3    0.4    0.88 0.86 0.76   0.78 0.78 0.56   0.89 0.87 0.79   0.74 0.71 0.46
3    0.3    0.75 0.77 0.81   0.56 0.64 0.71   0.73 0.76 0.79   0.39 0.48 0.63

Conversely, power to detect LTBS with feq = 0.4 is similar with NCD_0.5 or NCD_0.4 (Table 1 and S1 Table), but for feq = 0.3 power is 10% higher for NCD_0.3 than for NCD_0.5 (Table 1 and S1 Table). Therefore, NCD statistics can be well powered both when the frequency of the balanced polymorphism is the same as the target frequency, and when it is not (as expected given correlations among these statistics; S9 Fig). Nevertheless, the closer tf is to feq, the higher the power to identify targets of LTBS (Table 1). Thus, information is gained by calculating NCD with different target frequencies.

NCD implementations. The power for NCD2 is greater than for NCD1, for all tf: feq = 0.5 (average power of 0.94 for NCD2_0.5 and 0.88 for NCD1_0.5), feq = 0.4 (0.93 for NCD2_0.4 and 0.80 for NCD1_0.4) and feq = 0.3 (0.85 for NCD2_0.3 and 0.73 for NCD1_0.3) (Table 1, Fig 3, S1 Table). The gain in power that occurs when using information on FDs was also explored by jointly considering NCD1 with HKA (see below).

NCD statistics compared to existing methods. We compared the power of NCD2_0.5 to other statistics commonly used to detect balancing selection. We focused on Tajima's D (TajD) and HKA (Hudson et al., 1987; Tajima, 1989), a pair of composite likelihood-based measures recently developed by DeGiorgio et al. (2014) termed T1 and T2 (T1 only looks at the SFS, T2 includes information on FDs), and NCD1 and NCD2. We additionally explored the power of a composite statistic, where the p-value was jointly computed as a function of NCD1 and HKA statistics (NCD1+HKA), with the goal of quantifying the contribution of FDs to NCD power (see Methods). For simplicity we considered Tbs = 5 myr and 3 Kb for all comparisons. The only exceptions are T1 and T2, for which a larger window size (100 informative sites) was used, following DeGiorgio et al. (2014), to compare the methods using their optimal window size.

Figure 3. ROC curves for comparison between NCD2 and other tests. Power to detect LTBS for simulations where the balanced polymorphism was modeled to achieve frequency equilibrium (feq) of A) 0.3, B) 0.4, and C) 0.5. Plotted values are for African demography, Tbs = 5 myr, L = 3 kb, except for T1 and T2 (DeGiorgio et al., 2014), which were evaluated based on 100 SNPs in 15 Kb simulated windows following the original publication (see Methods). Target frequency for NCD1 and NCD2 matches the simulated feq. Similar results are observed for European demography (S7 Fig).

When feq = 0.5, NCD2_0.5 has the highest power: 0.96 (0.94 for T2, 0.93 for TajD, 0.91 for NCD1_0.5+HKA, 0.78 for HKA, and 0.5 for T1; Fig 3). The gain in power provided by NCD2_0.5 is much higher when feq departs from 0.5, where NCD2 clearly outperforms all other tests if tf = feq (Fig 3). For feq = 0.4, the power of NCD2_0.4 is 0.94 (0.9 for TajD, T2, and NCD1_0.5+HKA; 0.76 for HKA, and 0.58 for T1; Fig 3 and Table 1) and for feq = 0.3 NCD2_0.3 power is 0.91 (0.89 for T2, 0.85 for NCD1_0.5+HKA, 0.75 for TajD, 0.73 for HKA, 0.59 for T1; Fig 3). These patterns are consistent in both African and European simulations (Fig 3, Table 1, S10 Fig). Thus, NCD2 has greater or comparable power to detect LTBS than TajD, HKA, T1 and T2, and a combined test of NCD1+HKA for African and European scenarios (Fig 3, Table 1, S7 Fig). Notably, as the simulated frequency equilibrium moves away from 0.5, its advantage over TajD increases (Fig 3).

Recommendations based on power analyses. Overall, NCD1 and NCD2 perform very well in regions of 3 Kb (Table 1, Fig 3). In fact, NCD2 outperforms all other methods tested (Table 1, Fig 3) and it reaches very high power when tf = feq (higher than 0.9 for 5 myr alleles and higher than 0.79 for 3 myr alleles). While the feq of a putatively balanced allele is unknown, the simplicity of the statistic makes it trivial to run it for several tf values. Importantly, power was very similar under the African and European models (Table 1, Fig 3, S10 Fig). Because NCD2 outperforms NCD1 we recommend using NCD2 in humans, although NCD1 is a good choice when outgroup data are lacking.

Identifying signatures of LTBS in the human genome

We aimed to identify regions of the genome under LTBS. Based on the power analyses, we used NCD2_0.5, NCD2_0.4 and NCD2_0.3, which are well powered to detect LTBS and do not provide fully overlapping sets of candidate windows. We calculated these statistics for 3 Kb windows (1.5 Kb step size) and tested for significance using two complementary approaches: one based on neutral expectations, and one based on the empirical data. We analyzed genome-wide data from two African (YRI: Yoruba in Ibadan, Nigeria; LWK: Luhya in Webuye, Kenya) and two European populations (GBR: British from England and Scotland; TSI: Toscani in Italy) (Abecasis et al., 2012). We filtered for mappability, segmental duplications, and orthology with the outgroup genome (chimpanzee, see Methods and S13 Fig).
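The windowing scheme (3 Kb windows with a 1.5 Kb step, so consecutive windows overlap by half) can be sketched as follows. The function name `sliding_windows`, the half-open coordinate convention, and the choice to drop a trailing partial window are assumptions made for illustration, not details taken from the original scan.

```python
def sliding_windows(start, end, size=3000, step=1500):
    """Yield half-open (window_start, window_end) pairs covering [start, end).

    Windows are `size` bp long and start every `step` bp; a trailing window
    shorter than `size` is not emitted.
    """
    window_start = start
    while window_start + size <= end:
        yield (window_start, window_start + size)
        window_start += step
```

For a 6 Kb segment this yields three windows, `(0, 3000)`, `(1500, 4500)` and `(3000, 6000)`, and NCD2 would then be computed for the informative sites falling in each one.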

In addition, because windows with a low number of IS have high NCD2 variance due to noisy SFS (S18 Fig), a pattern also observed in neutral simulations (S11 Fig), we excluded windows with fewer than 19 and 15 IS in African and European populations, respectively. This filter removed only 4% of the windows while keeping a set of windows for which NCD2 values remain quite stable regardless of the number of IS (S11-S18 Figs). After all filters, the genomic coordinates defining the windows were identical in all populations, allowing comparisons among them. We analyzed 1,631,372 windows throughout the genome (Table 2, S13 Fig). These windows overlapped 18,308 protein-coding genes (95% of all human autosomal genes). For each window we calculated a p-value that reflects the quantile of its NCD2 value, when compared with the NCD2 distribution of 10,000 neutral simulations under the inferred demographic history of each population, and conditioned on the same number of IS (to account for the higher variance in sets of windows with a low number of IS, see Methods).
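A simulation-based p-value conditioned on the number of informative sites can be sketched as below. The data layout (a dictionary mapping each number of IS to the NCD2 values of its neutral simulations) and the function name `simulation_p_value` are hypothetical. The p-value is taken here as the fraction of neutral simulations with an NCD2 value at or below the observed one, which is one plausible reading of "quantile" in the text.

```python
import bisect


def simulation_p_value(observed_ncd2, neutral_by_is, n_is):
    """Fraction of neutral simulations (same number of IS) with NCD2 <= observed.

    neutral_by_is: dict mapping a number of informative sites to the list of
                   NCD2 values obtained in neutral simulations with that count.
    A result of 0.0 means the window is lower than every simulation, i.e.
    p < 1 / (number of simulations).
    """
    simulated = sorted(neutral_by_is[n_is])
    rank = bisect.bisect_right(simulated, observed_ncd2)
    return rank / len(simulated)
```

With 10,000 simulations per bin, a window that is lower than all of them would be reported as p < 0.0001.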

Over all populations, between 4,826 and 5,910 (0.30-0.36%) of the genomic windows have a lower NCD2_0.5 value than any of the 10,000 neutral simulations (p-value < 0.0001, Table 2). The proportions were very similar for NCD2_0.4 and NCD2_0.3: between 0.34-0.39% and 0.33-0.38%, respectively (Table 2). We refer to these simulation-based sets, whose patterns we cannot explain under neutrality, as the significant windows.

Due to our criterion for defining significance, all significant windows had an identical p-value (p < 0.0001). To quantify the degree of departure from neutral expectations, NCD2 was compared to the mean of NCD2 values for the 10,000 simulations with the same number of IS. We defined, for each genomic window, Z_tf (Equation 4) as the number of standard deviations by which its NCD2 value departs from the neutral expectation, conditioned on the number of informative sites of that window. To identify the most extreme signatures of LTBS, we selected the windows with the 0.05% most extreme Z_tf values for each population and tf value (resulting in 816 outlier windows), which we refer to as the outlier windows (Table 2). The empirical outlier windows, which represent a smaller and more conservative set, are almost entirely a subset of the significant windows (Methods). Below, we discuss properties of the union of all significant (or outlier) windows (Table 2) taken over all of the target frequencies under which they reached significance (“U” set, Table 2).
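The standardization behind Z_tf can be sketched as follows. Equation 4 is given in the Methods and is not reproduced in this section, so this sketch follows only the verbal description above; the function name `z_tf` and the use of the sample standard deviation are assumptions.

```python
import statistics


def z_tf(observed_ncd2, neutral_values):
    """Standard deviations separating an observed NCD2 from the neutral mean.

    neutral_values: NCD2 values from neutral simulations with the same number
                    of informative sites as the observed window.
    Strongly negative values indicate NCD2 far below the neutral expectation.
    """
    mean = statistics.mean(neutral_values)
    sd = statistics.stdev(neutral_values)  # sample SD; an assumption here
    return (observed_ncd2 - mean) / sd
```

Ranking windows by this value and keeping the lowest 0.05% would give an outlier set in the spirit of the one described above.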

Reliability of significant and outlier windows

The significant windows are extremely rich both in polymorphic sites (Fig 4) and in number of intermediate-frequency alleles (Fig 5), with the shape of the SFS depending on the tf at which they reach significance. These patterns are not unexpected, since they were used to identify these windows. Nevertheless, they show that neither SNP density nor the SFS dominates the selection process, as significant windows are unusual in both aspects. Also, the striking differences with respect to the background empirical distribution, combined with the fact that no neutral simulation had a lower NCD value than any significant window, rule out relaxation of selective constraint as a plausible explanation (Andrés et al., 2009).

Figure 4. Polymorphism to divergence. A) LWK population. B) GBR population. P/(FD+1) measures the proportion of polymorphisms with respect to all informative sites. Background (grey) are all non-significant windows. Significant windows are the union of significant windows for all tf values.

To avoid technical artifacts among significant windows we carefully considered mapping errors due to genomic duplicates (e.g., we removed positions with poor mappability, and those that fall within tandem repeats and segmental duplications; S13 Fig and Methods). Also, we found that the significant windows have extremely similar coverage to the rest of the genome (S14 Fig), showing that they are not enriched in unannotated, polymorphic duplications.

We also examined whether evidence of selection could be driven by two biological mechanisms other than balancing selection: introgression and gene conversion. The outlier windows are significantly depleted of SNPs annotated as introgressed from Neanderthals (S17 Fig, S1 Text), and significant windows do not show a different proportion of introgressed SNPs from controls, showing that introgression is not a confounding mechanism leading to significant or outlier regions (S7 Fig, S1 Text). Finally, the genes overlapped by significant windows are not predicted to be particularly affected by non-homologous gene conversion with neighboring paralogs, with the exception of olfactory receptors (S16 and S19 Figs, S1 Text). Thus, the significant and outlier windows represent a catalog of strong candidate targets of balancing selection in human populations that are not likely to be driven by introgression or gene conversion (S16, S17, S19 Figs, S1 Text).

Non-random distribution across the genome

Significant and outlier windows were not randomly distributed across the genome. Chromosome 6 is the most enriched for signatures of LTBS, contributing 11.2% of significant windows genome-wide (24.5% of outlier windows) while having only 6.4% of analyzed windows (S12 Fig). This is due to the presence of the MHC region, rich in genes with well-supported evidence for balancing selection. In fact, several HLA genes known to be targets of LTBS appear among our outlier windows, i.e., the strongest candidates. For the outlier windows, 10 HLA genes are found in all four populations, most of which have prior evidence for balancing selection (Table 3): HLA-B, HLA-C, HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DQB1, HLA-DQB2, HLA-DRB1, HLA-DRB5, HLA-G (DeGiorgio et al., 2014; Liu et al., 2006; Meyer et al., 2006; Sanchez-Mazas, 2007; Solberg et al., 2008; Tan et al., 2005).

The biological pathways influenced by LTBS

Although the union of significant windows considering all tf values spans on average only 0.51% of the genome (Table 2), 37.8% of those windows overlap protein-coding genes. To gain insight into the biological pathways influenced by balancing selection, we focused on protein-coding genes that contain at least one significant or outlier window (“U” set, Table 2), and investigated the functional categories they belong to.

We found enrichment for 30 GO (Gene Ontology) categories for the significant genes (S2 Table), 22 of which are shared by at least two populations. Three significant categories are driven by olfactory receptor (OR) genes, which we could not rule out as artifacts (S1 Text), although they do not appear in the more conservative outlier set of genes (S3 Table). Among the remaining categories, at least half are directly related to immune response (e.g. “type I interferon signaling pathway”, “MHC class I protein complex”, “positive regulation of T cell mediated cytotoxicity”) and 11 are involved in antigen presentation by MHC molecules (e.g. “MHC class I protein complex”, “MHC class II protein complex”, “peptide antigen binding”, among others). For the outlier genes, 27 enriched categories were found, at least 18 of which are immune-related, and 10 of which are directly related to antigen presentation by MHC molecules (S3 Table).

When classical HLA genes are removed from the sets, no categories remain enriched for the outlier genes (S3 Table; but note that this resulted in a small set of 162-192 genes per population, with lower power to detect GO category enrichment), as in DeGiorgio et al. (2014). For the larger set of significant genes, the immune-related category “peptide antigen binding” remains significant in LWK, driven by TAP1, TAP2, and HLA-G, all previously reported candidate targets of balancing selection (Cagliani et al., 2011; Tan et al., 2005). These results show the strong contribution of the classical HLA genes to signatures of LTBS. However, “extracellular region” and “keratin filament” are enriched in the set of significant genes, in several populations, even after the removal of HLA genes, in agreement with previous findings indicating that balancing selection targets genes related to extracellular and cell-surface proteins (Key et al., 2014b).

Nevertheless, for the significant genes only about half of the immune-related enriched categories are directly linked to peptide presentation by MHC molecules. Other categories (e.g. “type I interferon signaling pathway”, “cytokine mediated pathway”, “T cell co-stimulation”, “immune response”), even if they cease to be significantly enriched after the removal of HLA genes (S2 Table), are not strictly composed of HLA genes.


To gain more insight into the importance of non-HLA immune-related genes to the outlier set of genes, we verified that the GO categories of 62 outlier genes shared by at least two populations (Table 3) are immune-related, although only 10 HLA genes compose that set (S8 Table). This shows that not only HLA-related categories are enriched among the significant genes, indicating that immune response, in a broader sense, is enriched for LTBS (reviewed in Key et al., 2014b).

Regarding tissue of expression, among the genes overlapped by significant windows, we found enrichment for genes preferentially expressed in “adrenal” (TSI, p-value=0.003, S5 Table) and “lung” (GBR, p-value=0.004, S5 Table, S1 Text).

Overlap of significant windows across populations

Most windows were found to be significant (S20A Fig) or outliers (S20B Fig) in multiple populations. On average 81% of significant windows in any one population are shared between any two populations, and 69% of the windows are shared between two populations within the same continent (66% between African and 71% between European populations, see S20A Fig). For the more restrictive set of outlier windows, the sharing increased to 87% between any two populations, and 78% within continent (75% of African windows were shared, and 80% of European windows; S20B Fig). There was also similar sharing when considering tf values separately (S21-S22 Figs).

The putative function of balanced SNPs

Functional protein-coding sites To further investigate the differences between outlier and non-outlier (background) windows, we examined the degree to which they overlap exons. On average, 31.2% of the windows that overlap protein-coding genes overlap their exons, very similar to the 30.8% for the background distribution (S15 Fig). In fact, significant windows contain a higher (but not significantly so) proportion of protein-coding SNPs than background windows (Fig 6A,C).


When these sites are divided into synonymous (putatively neutral) and non-synonymous, significant windows are enriched for non-synonymous SNPs when compared with controls sampled from the background distribution (Fig 6A,C). This is also true when only intermediate-frequency alleles are considered (MAF>0.20, Fig 6B,D). Taken together, our results indicate that balancing selection is associated with regions of increased non-synonymous polymorphism.

Regulatory function It has been suggested that balancing selection may have a particularly important role in maintaining genetic diversity that affects gene expression (Leffler et al., 2013; Savova et al., 2016). Because the identification of significant and outlier windows is independent of functional annotation, we are in a position to test the hypothesis that LTBS has preferentially targeted regulatory regions. Significant windows were enriched in SNPs that have regulatory functions (Fig 7A, p<0.001), annotated as eQTLs (RegulomeDB category 1).

Nevertheless, power to annotate a SNP as an eQTL increases with its frequency, so allele frequency must be accounted for. When only SNPs with intermediate-frequency alleles are considered, significant windows no longer show a statistical enrichment in eQTLs (Fig 7D); rather, in most populations there is a significant depletion of eQTLs (Fig 7D). Accordingly, we observed a depletion of SNPs overlapping putatively regulatory regions when considering a more inclusive category that depends exclusively on genomic context (rather than on eQTL annotation, RegulomeDB categories 1 and 2, see Methods; Fig 7B-E). Regardless of allele frequency, SNPs in significant windows are enriched in sites with no evidence of a role in gene regulation (RegulomeDB category 7, Fig 7C-F). Although the annotation of each of these RegulomeDB categories is not perfect, these results suggest that balancing selection does not preferentially target, in human populations, sites with a role in gene expression regulation.

Finally, in agreement with Savova et al. (2016) we find a modest yet significant enrichment for genes with mono-allelic expression (MAE) among the outlier genes shared by at least two populations (Table 3): 26% of them are MAE genes, while only 22% of the non-outliers are MAE (p=0.03, Fisher Exact Test, one-sided).
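As an illustration of the one-sided test used for the MAE comparison, the sketch below computes the upper-tail p-value of a 2x2 table directly from the hypergeometric distribution. The function name and the gene counts in the usage example are hypothetical and are not the counts behind the reported p-value; they only show how such a table would be laid out.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """Upper-tail Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: outlier / non-outlier genes. Columns: MAE / non-MAE genes.
    Returns P(X >= a) when the margins are fixed (hypergeometric null).
    """
    row1 = a + b            # number of outlier genes
    col1 = a + c            # number of MAE genes
    total = a + b + c + d   # all genes considered
    denom = comb(total, row1)
    upper = min(row1, col1)
    # math.comb returns 0 when k > n, so impossible tables contribute nothing.
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / denom


# Hypothetical counts, for illustration only (not taken from Table 3).
p_value = fisher_one_sided(30, 70, 200, 800)
print(f"one-sided p-value: {p_value:.4f}")
```

A one-sided alternative matches the directional hypothesis being tested here (more MAE genes among outliers than among non-outliers).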

The top candidate genes

The signatures of long-term balancing selection may not be shared between populations due to changes in selective pressure, which may be important during fast, local adaptation (Filippo et al., 2016). Still, loci with signatures across human populations are more likely to represent old, stable events of balancing selection in human populations.

We considered as “African” those outlier genes resulting from the union of outlier windows for all tf values (Table 2) that are shared between YRI and LWK (but neither or only one of the European populations), and as “European” those that are shared between GBR and TSI (but neither or only one of the African populations). Those shared by all four populations were considered as “African and European” (Table 3). Importantly, these designations do not imply that the genes referred to as “African” or “European” in Table 3 are putative targets of LTBS for only one of the continents, as there are power differences between Africa and Europe, particularly for tf = 0.3 (Fig 3, Table 1, S10 Fig), but rather serve the purpose of quantifying the extent of sharing across populations.

The combined set of “African” (69 genes), “European” (71 genes) and “African and European” (75 genes) contains 213 genes (1.1% of all queried genes) (Table 3). When applying the same criteria for the significant windows, the set contains 1,470 genes (8% of all queried genes, see S2 Text and S8 Table). We focus the following discussion on the set of 213 outlier genes, since they constitute the most restricted set. Of these, 61 (29%) have evidence of balancing selection in at least one previous genome-wide analysis (Andrés et al., 2009; DeGiorgio et al., 2014; Leffler et al., 2013), and others were detected in individual gene studies (Table 3 and Discussion). Overall, about 70% of the outlier genes reported here have not been reported as having signatures of LTBS in previous studies.


A given window can be significant for more than one tf value. Because our simulations suggest that the tf is informative about the frequency of the balanced allele, we use the lowest Ztf to assign a tf value to each window (for a given population), providing information on the nature of the SFS skew (S7 Table). For 50% of the outlier windows, the assigned tf is 0.3, and 36% have 0.5 as assigned tf; only 14% have assigned tf = 0.4 (S6 Table).
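The assignment rule described above can be sketched in a few lines. This is only an illustration: the window names and Ztf values are hypothetical, and the sketch assumes that each window already has one Ztf score per target frequency.

```python
from collections import Counter


def assign_tf(z_by_tf):
    """Return the target frequency with the lowest (most extreme) Ztf."""
    return min(z_by_tf, key=z_by_tf.get)


# Hypothetical Ztf scores for three windows in one population.
windows = {
    "window_1": {0.3: -4.1, 0.4: -3.2, 0.5: -2.7},
    "window_2": {0.3: -2.9, 0.4: -3.0, 0.5: -3.8},
    "window_3": {0.3: -3.5, 0.4: -3.6, 0.5: -3.1},
}

assigned = {name: assign_tf(scores) for name, scores in windows.items()}
tally = Counter(assigned.values())  # number of windows per assigned tf
print(assigned, tally)
```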

Based on the p-values of the most extreme window for each of the outlier genes, we were able to rank them. The top 10 genes are highlighted in Table 3. Among the top ten candidates, two (HLA-DRB5 and HLA-DQA1) are related to adaptive immunity, two (PCDH15 and NDUFA10) are related to sensory perception of mechanical stimulus, including sound, and two (PROKR2 and CPE) are related to the neuropeptide signaling pathway. Six of the top 10 genes (PROKR2, HLA-DQA1, CPE, HLA-DRB5, LUZP2, and MYO3A) have been previously described as having signatures of LTBS in humans (Table 3). The four among them that are novel (B4GALNT2, C1orf101, NDUFA10, and PCDH15) are discussed in more detail in the Discussion.

Table 3. Outlier genes All reported genes are overlapped by at least one outlier window for at least one tf value (Table 2 and Methods). Outliers for both African populations (“African”), for both European populations (“European”) or for all four populations (“African and European”). A version of this table with p-values and assigned tf values is provided in S7 Table. When a gene has been previously reported as having signatures of LTBS, the reference is provided: [A], Andrés et al. (2009); [D], DeGiorgio et al. (2014); [L], Leffler et al. (2013); [S], reported as being under balancing selection in Savova et al. (2016); [T], Tan et al. (2005). * top 10 most highly ranked genes (for YRI).

African European African and European

ABO[SG][S] AC121757.1 APBB1IP [D] ADAM29 ADAM12 B4GALNT2** AIM1 ADRA1D BICD1 ALDH1L1 AL590867.1 CAMK1D ALK ALG8[D] CDSN[D][A][S] ARHGAP8 ATP10D CHST11 ATXN3 B3GNTL1 CPE[D]


BCAR3 BICC1[D] CTNNA3 C1orf54 C1orf222 DMBT1 [D] C15orf48 CCDC169[D] EDARADD C1orf101* CCDC169-SOHLH2[D] EGLN3 C22orf34 CDH5 ERO1LB CCHCR1 CEP112[D] FAM101A CELSR1 CNR2 FAM19A5 CLDN16 CNTNAP2 FCER2 COG6 COL4A3[D] FMN2 COL25A1 CPNE4[D] GPR137B COMMD10 CRHR1 GPRIN3 CUBN[D] CSMD1[D] HBE1 ** DIRC3 DOK6 HBG2 ** DTNA FRAS1[D] HIP1 EPHA6 FXN[D] HLA-B[A][D][S] EXTL3 GABBR2 HLA-C [D] FRMD4B[D] GNA15 HLA-DPA1[D] GPR114 GRAMD4 HLA-DPB1 [D] GTF2IRD1 GRID1 HLA-DQA1[D]* HLA-DQA2 GRIP1 HLA-DQB1[D] IGFBP7[D,L] HLA-DRA[D] HLA-DQB2 IL37[S] HUS1[L] HLA-DRB1 [D] LGALS8[D,A,S] IDO2 HLA-DRB5 [D] LGI2 IL18R1[S] HLA-G LUZP222 IL1RL1[S] LRP1B [D] MLPH ITGA1[D] MANBA MYOM2[D] KALRN MMP26 NEDD4L KANSL1 BICD1 NFATC1 KDM4C[D] MROH2A NR3C1 KL[D] MYO3A[D] OR52A1 KRT83[D] MYRIP[D] OR6J1 LAMC2 NCAM2 PACRG[D] LDLRAD4[D] NDUFA10 PADI2 MYO7A OLAH PARP15 NRXN3 PARK2 PDE10A NSUN4 PCDH15* PGLYRP4[D] NTN4[D] PDSS1 PRR5-ARHGAP8 OAS1[S] PHACTR2 PTPRB[D] ORC5 PLCB4[D]


PTS OVCH1-AS1 PRDM15 RNF39 PGBD5[D] PREX2[A] RP11-96O20.4 PKHD1 PROKR2[L]* SFTPD POLR1E[D] PSORS1C1[D] SGCZ[D] PPAP2B RIMKLA SLC17A5 PSMC1 RP11-257K9.8 SLC22A16 RASSF6 SH3RF3 [D] SLC27A6 RBFOX1[D] SIRPA SLC35F2 REG4 SLA SPRR3 RUNX2 SNX19[D] SPTLC3 SEPT11 SPAG16 SQRDL SGCG[D] SPATA16 STAU2 SKAP2 SUCLG2 STK32A[D] SLC1A6 SV2C STXBP6 SLC2A9[D] [A][S] TG SUMF1 SOHLH2 THSD7B[D] TGM6 SVEP1 TMPRSS2 TMCO3 TCP11L1 TMTC2 TMEM132D TEC UGT2B4[S][T] TMEM135 TNN WNT7A WSCD1 TNS1[A] ZC3H12D WWOX TRIM5[S] ZNF385D ZNF331 TRPM3 ZNF83 ZNF670 VPS8 ZNF695 WDYHV1 ZZEF1 ZNF254

Figure 5. Site frequency spectra A) LWK population and B) GBR population. SFS of background windows (all windows in chromosome 1, in grey) and of significant windows for NCD2(0.3), NCD2(0.4) and NCD2(0.5), which are shown in separate colors (blue, orange, pink).

Table 2. Significant and outlier windows and protein-coding genes across populations. Significant and outliers, see main text. U, union of all windows found with all target frequencies (tf).

Population  Queried windows  tf   Significant windows  Outlier windows  Significant genes  Outlier genes
LWK         1,631,372        0.3  5,620                  816            1,037              128
                             0.4  5,516                  816            1,003              130
                             0.5  4,826                  816              878              147
                             U    7,770                1,139            1,321              202
YRI         1,631,372        0.3  6,137                  816            1,129              124
                             0.4  5,919                  816            1,044              120
                             0.5  5,213                  816              928              131
                             U    8,436                1,142            1,400              187
GBR         1,631,372        0.3  5,465                  816              967              107
                             0.4  6,312                  816            1,025              114
                             0.5  5,904                  816              971              123
                             U    8,526                1,131            1,321              172
TSI         1,631,372        0.3  5,464                  816              983              116
                             0.4  6,183                  816            1,047              121
                             0.5  5,801                  816            1,009              137
                             U    8,395                1,163            1,378              189


Figure 6. Enrichment in protein-coding and non-synonymous SNPs Proportion of SNPs that are protein-coding (A,C) and proportion of protein-coding SNPs that are non-synonymous (B,D) for all SNPs (A,B) or SNPs with intermediate MAF in the significant and background windows (C,D). NS, non-synonymous; S, synonymous; C, coding; I, intergenic. In gray, distribution obtained from 1,000 samplings of a set of windows from the background (see Methods). In orange, significant windows.

Figure 7. RegulomeDB enrichment analysis for scores 1 and 7 Proportion of SNPs in (A,D) RegulomeDB category 1 (eQTLs), (B,E) RegulomeDB categories 1 and 2 (overlapping a putatively regulatory site) or (C,F) RegulomeDB category 7 (no evidence for regulatory role) for (A,B,C) all SNPs or (D,E,F) SNPs with intermediate MAF in both the significant and background windows. In gray, distribution obtained from 1,000 samplings of a set of windows from the background (see Methods). In orange, significant windows.


Discussion

NCD Method

We present two new summary statistics, which are simple and fast to compute and to run, and which allow, unlike classical approaches such as Tajima’s D (Tajima, 1989) or the Mann-Whitney U for comparing local and global SFS (Andrés et al., 2009; Nielsen et al., 2009), explicit exploration of different target frequencies - a property also shared by the T1 and T2 tests (DeGiorgio et al., 2014), albeit in a likelihood framework. We show that the NCD statistics are well powered to detect balancing selection for a complex demographic scenario, such as that of human populations.

The NCD statistics can be used to detect selected regions using null distributions based on neutral simulations (identifying significant windows whose signatures are not expected under neutrality) or by an empirical outlier approach, which allows the investigation of balancing selection when there is limited knowledge on demographic history. Furthermore, NCD1 can be used in the absence of a close outgroup species, which extends further the set of possible species. This allows exploring the genomes of species where balancing selection remains completely unexplored.

Many previous and well-supported targets of balancing selection are present in our list of selected genes, but approximately 70% of the protein-coding genes we identify are novel candidate targets of balancing selection (Table 3).

Pervasiveness of LTBS in the human genome

On average 0.51% of the windows show, per population, signatures of LTBS that are significant under our simulation-based criteria: we never observed comparable signatures of LTBS with neutral simulations. We showed that these windows are unlikely to be affected by technical or biological artifacts.


Although the total proportion of the genome under balancing selection may be small, our results show that many genes contain putatively selected regions. For example, under a restrictive criterion of being significant in at least two populations from a continent, 8% of the protein-coding genes contain a significant window, and 1% contain an outlier window. Because our statistic is powerful for detecting selection in relatively narrow genomic regions (3kb), it is possible that we are identifying signatures that would not be found when analyzing properties of entire genes or larger genomic regions.

Protein-coding and intergenic targets

Long-term balancing selection is known to maintain both coding – e.g., HLA-B, HLA-C, ABO (Hughes and Nei, 1988; Hughes and Nei, 1989; Ségurel et al., 2012; Ségurel et al., 2013) – and regulatory diversity – e.g. HLA-G, UGT2B (Tan et al., 2005; Sun et al., 2011) (we confirm these targets in Table 3). We are in a particularly good position to quantify the relative proportion of cases of selection acting on coding or regulatory regions in humans. We found no excess of eQTLs within selected windows once the frequency of the alleles is accounted for, and also no evidence for enrichment of regulatory function.

A recent study suggested that there is enrichment for genes with mono-allelic expression (MAE) among those with signatures of balancing selection (Savova et al., 2016). In agreement with this observation, we found a small but significant enrichment for MAE genes among the outlier genes reported in Table 3 (p = 0.03, Fisher Exact Test). We note that this overlap would be even greater if HLA genes had not been excluded from the MAE gene list provided in Savova et al. (2016). This result is consistent with the claim of a biological link between balancing selection and MAE (Savova et al., 2016). Nevertheless, it remains unclear whether the detection of MAE genes is correlated with allele frequency, as is the case for eQTLs, which could explain this enrichment.

Although significant windows are depleted of overlap with protein-coding genes relative to chance expectation, the proportion of the windows overlapping genes that overlap exons is the same for significant and background windows, showing that there is a depletion of introns in the significant windows. Finally, significant and outlier windows show an enrichment for non-synonymous SNPs. This result is compatible with two scenarios: (a) direct selection on multiple coding sites or (b) an accumulation of slightly deleterious variants as a by-product of selection (e.g. Chun and Fay, 2011).

The frequency of the balanced allele(s)

For both new and previously known targets, an advantage of our method is that it provides an assigned target frequency for each window, and consequently information on the shape of its SFS (Table 3, S7 Table). In some candidate genes – the HLA genes – we know that LTBS has targeted not one, but several sites (Hughes and Nei, 1988; Hughes and Nei, 1989). In this case, the theoretical expectation of the shape of the resulting local SFS is unclear. Nevertheless, in loci with a single balanced polymorphism, which we assume may be common outside of the MHC, our simulations suggest that the assigned tf can be informative about the frequency of the balanced allele. Our results indicate that a large proportion of significant windows (50%) have minor allele frequencies that lie closer to the target allele frequency of 0.3 than to 0.5, as would be expected, for instance, under asymmetric overdominance. This highlights the importance of considering balancing selection regimes with different frequencies of the balanced polymorphism.

The candidate genes

Whereas studies of positive selection show a remarkably low overlap with respect to the genes they identify, with Akey (2009) reporting that only 14% of protein-coding loci appear in more than a single study, we identified 61 of the outlier genes (29%) with evidence of balancing selection in at least one previous genomic analysis (Andrés et al., 2009; DeGiorgio et al., 2014; Leffler et al., 2013) (Table 3), and a few other genes detected in individual gene studies (Tables 3 and S7). This is a reasonable overlap, as these studies used both different approaches and datasets.

Many candidates for balancing selection from previous studies are also identified here. For example, Leffler et al. (2013) identified 6 genes with particularly strong evidence of trans-species polymorphisms, 3 of which are outliers in our study (HUS1, PROKR2, IGFBP7; Table 3). Of the 5 genes identified by both Andrés et al. (2009) (an -based approach) and DeGiorgio et al. (2014) (a genome-wide study), 4 are among our outlier genes (HLA-B, CDSN, LGALS8, SLC2A9; Table 3), and one (RCBTB1) among our significant genes (S8 Table). We find 2 additional genes from Andrés et al. (2009) (PREX2 and TNS1; Table 3) and 53 genes from DeGiorgio et al. (2014) (Table 3).

Other outlier genes have prior evidence for balancing selection in candidate gene studies. Among the oldest known cases of genetic polymorphisms in humans are the blood-group genes (Ségurel et al., 2012; Ségurel et al., 2013), including the ABO gene, which we also identify (Table 3). TRIM5 has prior evidence of balancing selection in humans and Old World monkeys (Cagliani et al., 2010), and for OAS1 the evidence extends back to the split between humans, chimpanzees and gorillas (reviewed in Fijarczyk and Babik, 2015); both are involved in innate immune defense. Additional examples include UGT2B4 – which encodes an enzyme that metabolizes steroid hormones and bile acids and is associated with predisposition to breast cancer (Sun et al., 2011) – and HLA-G – a non-classical HLA gene that has tightly regulated expression patterns between fetal and adult life (Tan et al., 2005).

Among the top 10 ranked genes (which we manually checked for undetected duplications and non-homologous gene conversion, S3 Text), we find B4GALNT2, C1orf101, NDUFA10, and PCDH15. NDUFA10 encodes a subunit of the enzyme NADH dehydrogenase (complex I), the largest among the complexes of the electron-transport chain, and is associated with neurodegenerative diseases such as Leigh's syndrome, Huntington's and Parkinson's.


C1orf101 encodes a protein of unknown function which is highly expressed in human testicular tissues (Petit et al., 2015). PCDH15 encodes a protocadherin that is essential for normal retinal and cochlear function. Interestingly, this gene shows strong signatures of positive selection in East Asian populations (Sabeti et al., 2007). Moreover, two other outlier genes (not among the top 10 ranked genes) are members of the beta-globin cluster and have evidence for recent positive selection in Andean (HBE1, HBG2) and Tibetan populations (HBG2) (Bigham et al., 2010; Rottgardt et al., 2010; Yi et al., 2010). It is plausible that these genes have been under LTBS in Africa and Europe, and have recently been subjected to strong positive selection in non-African populations, a pattern of shift in selective regime recently detected for other loci (e.g. Filippo et al., 2016).

Finally, B4GALNT2 encodes a blood-group enzyme that has evidence for trans-species polymorphism maintaining two classes of alleles with high divergence, which are responsible for alternative tissue-specific expression patterns (Linnenbrink et al., 2011). Moreover, variation in this gene in mice seems to be associated with the presence of Helicobacter species in the gut (Staubach et al., 2012; Ségurel et al., 2013). In addition, a deletion encompassing the first exon of this gene has been described, and it is possible that it became fixed in chimpanzees by positive selection (Perry et al., 2008). To date, our study is the first to confirm evidence of LTBS on B4GALNT2 in humans.

Conclusions

We have developed a tool to identify genomic regions under long-term balancing selection that is simple, fast, and has a high degree of sensitivity for different frequencies of the balanced polymorphism. The NCD statistics can be applied to single loci or to the whole genome, in species with sufficient demographic information and those without it, and both in the presence and in the absence of an appropriate outgroup.

Our analyses indicate that, in humans, balancing selection may be shaping variation in about 0.5% of the genome, including at least 1% of the human protein-coding genes. Because these genes are numerous and, although they mostly affect immunity, also affect other pathways and phenotypes, we provide evidence that balanced polymorphisms appear to be relevant to many biological processes.

In addition, we provide a catalog of candidate targets of long-term balancing selection, including many completely novel targets. These should be further investigated, for example, to infer the selective force maintaining the balanced polymorphisms and to determine their phenotypic consequences in present-day human populations. Although about 80% of windows are shared across populations, the remaining ones show signatures in individual populations; these will be particularly interesting for investigating their putative influence on subsequent local adaptations through shifts in selective pressure (as in Filippo et al., 2016).

Materials and Methods

Simulations

Performance of NCD2 and NCD1 was evaluated by extensive simulations with MSMS (Ewing and Hermisson, 2010). The simulations followed a realistic demographic model for African, European and East Asian human populations described in Gravel et al. (2011), including the effective population sizes (Ne) and migration rates. A generation time of 25 years, a mutation rate of $2.5 \times 10^{-8}$ mutations per site (Nachman and Crowell, 2000) and a recombination rate of $1 \times 10^{-8}$ were used. The human-chimpanzee split at 6.5 million years ago was added to the model. This was our null demographic model (Fig 2), used to obtain the neutral distributions of NCD.

For simulations with selection, a balanced polymorphism was added to the center of the simulated sequence. The frequency equilibrium (feq) achieved by the balanced polymorphism was modeled following an overdominant model, as follows. Under the overdominance model, for a bi-allelic locus with alleles A and B, the relative fitnesses of the three genotypes are: wAA = 1 − s1, wAB = 1, and wBB = 1 − s2, where s1 and s2 are the selection coefficients in the two homozygous genotypes, and the frequency equilibrium (feq) is equal to s1/(s1 + s2), as in Equation 4.
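The equilibrium expression above can be checked numerically with the standard deterministic recursion for viability selection at one bi-allelic locus. The sketch below is an illustration only and is separate from the MSMS simulations; the function names and the selection coefficients are hypothetical. With the fitnesses as defined above, s1/(s1 + s2) is the equilibrium frequency of allele B.

```python
def equilibrium_frequency(s1, s2):
    """Equilibrium frequency of allele B when wAA = 1 - s1, wAB = 1, wBB = 1 - s2."""
    return s1 / (s1 + s2)


def iterate_b_frequency(q, s1, s2, generations):
    """Iterate the frequency of allele B under deterministic overdominant selection."""
    w_aa, w_ab, w_bb = 1.0 - s1, 1.0, 1.0 - s2
    for _ in range(generations):
        p = 1.0 - q
        mean_w = p * p * w_aa + 2.0 * p * q * w_ab + q * q * w_bb
        # Frequency of B among gametes after one round of selection.
        q = (q * q * w_bb + p * q * w_ab) / mean_w
    return q


# Hypothetical coefficients chosen so that feq = 0.3.
s1, s2 = 0.03, 0.07
feq = equilibrium_frequency(s1, s2)
q_final = iterate_b_frequency(0.01, s1, s2, generations=5000)
print(feq, q_final)
```

Starting from a rare allele B, the recursion approaches feq, which is the sense in which overdominance maintains the polymorphism at an intermediate frequency.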

In MSMS, in order to achieve the feq we are interested in, we parameterized selection in the following way: $w_{AB} = 1 + (2N_e s)$, $w_{BB} = 1 + \left[2w_{AB} - \frac{w_{AB}}{1 - f_{eq}}\right]$, and $w_{AA} = 1$, where Ne is the effective population size used to scale the coalescent simulations and s is the selection coefficient for the mutant allele (B). The selection coefficient (s) was set to 0.01 (the influence of s is modest once the frequency equilibrium is reached, as in the case of LTBS). We considered four frequency equilibria: feq = 0.2, 0.3, 0.4, 0.5. Simulations with and without selection were run for different sequence lengths (L), such that L = 3, 6, 12 kb, and times of onset of balancing selection (Tbs), such that Tbs = 1, 3, 5 myr (Fig 2).

Power analyses

For each set of parameters, 1,000 neutral simulations were compared to 1,000 matching simulations with balancing selection for evaluation of the performance of the NCD statistics. The relationship between the true positive rate (TPR, the power of the statistic) and the false positive rate (FPR) is represented through receiver operating characteristic (ROC) curves. For comparisons between statistics and across demographic scenarios, NCD implementations (NCD1 and NCD2) and other parameters, the power at the FPR = 0.05 threshold was considered. When comparing performance under a given condition (e.g. L values), power values were averaged across implementations (NCD1 and NCD2), demographic scenarios (Africa, Europe, Asia), and the other parameters, unless explicitly stated otherwise.
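The power calculation at a fixed FPR can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name is hypothetical, the input values are made up, and the sketch assumes that lower NCD values are the more extreme ones, consistent with significant windows having lower NCD than neutral simulations.

```python
def power_at_fpr(neutral, selected, fpr=0.05):
    """Fraction of selected simulations at or below the neutral quantile set by fpr.

    neutral, selected: NCD values from simulations without and with selection.
    Lower values are treated as stronger evidence of balancing selection.
    """
    ordered = sorted(neutral)
    k = round(fpr * len(ordered))  # neutral simulations allowed as false positives
    if k < 1:
        raise ValueError("too few neutral simulations for this FPR")
    threshold = ordered[k - 1]
    return sum(value <= threshold for value in selected) / len(selected)


# Made-up values: 100 neutral statistics and 4 statistics under selection.
neutral_values = [i / 100 for i in range(100)]
selected_values = [0.01, 0.02, 0.5, 0.03]
print(power_at_fpr(neutral_values, selected_values))
```

Repeating the calculation over a grid of FPR values gives the points of a ROC curve.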

The same simulations and procedures were used to evaluate the comparative performance of the different methods. NCD2 and NCD1 were run using 3 kb windows, with L = 3 kb and Tbs = 5 myr. They were compared with Tajima’s D (TajD), HKA (Tajima, 1989; Hudson et al., 1987), and the combined NCD1+HKA test (a joint distribution of the two summary statistics), also in 3 kb windows. DeGiorgio et al. (2014) report the performance of T1/T2 based on windows of 100 informative sites upstream and downstream of the target site (on average 13.7 Kb in YRI and 14.7 Kb in CEU). Therefore, we divided 15 kb simulations into windows of 100 informative sites and calculated T1 and T2 using BALLET (DeGiorgio et al., 2014). We selected the highest T1 or T2 value from each simulation to perform the power evaluation.

Human population genetic data and filtering

Data We analyzed genome-wide data from the 1000 Genomes Project phase I (Abecasis et al., 2012). SNPs that were only detected in the high-coverage exome sequencing of the 1000G were not considered: because of the difference in coverage between the low- and high-coverage data, including them would inflate SNP density in coding regions and potentially bias our results.

The genomes of individuals from African and European populations were queried (excluding the recently admixed ASW population), but not those from Asian populations due to lower performance in these populations (see “Demography” in the Results section). We considered two African populations (YRI and LWK), and two European populations (GBR and TSI). For comparisons between continents only two European populations were considered (GBR and TSI).

To equalize sample size, we randomly sampled 50 unrelated individuals from each population (Key et al., 2014a). We used the minor allele frequency (MAF) in the NCD statistics calculations to analyze the folded SFS (Fig 1). This allows us to retain SNPs where the ancestry could not be determined by the 1000G.


Filtering Genome analyses require extensive filtering in order to avoid the inclusion of errors that may bias the results. We dedicated extensive efforts to obtain a filtered dataset (see Fig S13). We disregarded positions not present in the 50mer CRG Alignability track (Derrien et al., 2012), which requires that 50 bp segments map uniquely (to only one region of the genome, allowing up to two mismatches). We filtered out all regions annotated as segmental duplications (Alkan et al., 2009; Cheng et al., 2005) and positions that are simple repeat units detected by Tandem Repeats Finder (Benson, 1999). We also required that all scanned positions be orthologous to the PanTro2 chimpanzee reference sequence, because NCD2 includes FDs. After this filtering, NCD2 was calculated for the remaining windows (1,705,970 windows per population).

Identifying signatures of LTBS

Because L = 3 Kb yielded the best performance for NCD2 in both African and European simulations (see Results, Figs 3, S1, S2), we queried the human population genetic data with sliding windows of L = 3 Kb and a step size of 1.5 Kb. Windows are defined in physical distance, since the presence of balancing selection may affect population-based estimates of recombination rate. Variable positions were categorized as a SNP (if polymorphic in the sample) or a FD (if all humans differ from the chimpanzee). The only exception is polymorphic sites at which both alleles differ from the chimpanzee reference state; such positions were counted as both a SNP and a FD. Each population was queried separately, and NCD2 was calculated for three target frequencies (tf = 0.3, 0.4, 0.5). For each queried window, we computed the number of SNPs, FDs and IS, the ratio SNP/(FD+1), and NCD2 (tf = 0.3, 0.4, 0.5).
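The window scan can be sketched in Python as follows. This is an illustration only, not the code used in this study. It assumes that NCD2 is the root-mean-square distance between the MAF of each informative site and tf, with FDs entering as sites with MAF = 0, as defined earlier in this chapter. The function names `sliding_windows`, `folded` and `ncd2` are hypothetical.

```python
import math


def sliding_windows(start, end, size=3000, step=1500):
    """Yield half-open (window_start, window_end) intervals in bp."""
    pos = start
    while pos + size <= end:
        yield (pos, pos + size)
        pos += step


def folded(freq):
    """Fold an allele frequency into a minor allele frequency (MAF)."""
    return min(freq, 1.0 - freq)


def ncd2(snp_freqs, n_fd, tf):
    """NCD2 for one window (assumed definition, see the lead-in).

    snp_freqs: allele frequencies of the SNPs in the window
    n_fd: number of fixed differences (each contributes MAF = 0)
    tf: target frequency (0.3, 0.4 or 0.5 in the text)
    Returns None when the window has no informative sites.
    """
    mafs = [folded(f) for f in snp_freqs] + [0.0] * n_fd
    if not mafs:
        return None
    return math.sqrt(sum((m - tf) ** 2 for m in mafs) / len(mafs))
```

With the default arguments, a 6 Kb region yields three overlapping windows, and a window whose SNPs all sit exactly at tf and that has no FDs returns an NCD2 of zero.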

Filtering and correction for number of informative sites (IS) Neutrality tests typically place a threshold on the minimum number of informative sites required, e.g. at least 10 IS in Andrés et al. (2009) and 100 IS in DeGiorgio et al. (2014). We observe considerable variance in the number of IS per 3 Kb window in the real human genomic data, and find that NCD2 has high variance when the number of IS is low (S18 Fig). We therefore required that each window have at least 19 (African populations) or 15 (European populations) IS, and the same sets of windows were queried in all four populations (Figs S9 and S10). These values were chosen because NCD2 stabilizes beyond them (Figs S18 and S19). This final filter resulted in 1,631,372 considered windows (4% of the queried windows were excluded) (Fig S13). Furthermore, neutral simulations with different mutation rates were performed in order to obtain 10,000 simulations for each bin of IS, ranging from 4 to 229 (Africa) and from 4 to 199 (Europe); this range is compatible with the range seen in the actual data. Next, NCD2 (tf = 0.3, 0.4, 0.5) was calculated for all simulations. These simulations per bin of IS allowed both the assignment of significant windows and the calculation of Ztf (Equation 4, see below).
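The minimum-IS filter can be illustrated with a short Python sketch. It assumes that "the same sets of windows were queried in all four populations" means that a window is retained only if it passes the threshold in every population; that reading, and the names `MIN_IS` and `windows_passing_all`, are assumptions made for illustration.

```python
# Minimum number of informative sites per window, from the text.
MIN_IS = {"YRI": 19, "LWK": 19, "GBR": 15, "TSI": 15}


def windows_passing_all(is_counts, min_is=MIN_IS):
    """Return the window ids that pass the IS threshold in every population.

    is_counts: {population: {window_id: number_of_IS}}
    """
    passing = None
    for pop, counts in is_counts.items():
        ok = {w for w, n in counts.items() if n >= min_is[pop]}
        passing = ok if passing is None else passing & ok
    return passing if passing is not None else set()
```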

Significant windows We defined two sets of windows with signatures of LTBS: the significant windows (obtained based on the simulations) and the outlier windows (obtained based on the empirical distribution). The significant windows were defined as those for which the observed NCD2tf value is lower than all of the 10,000 values obtained from simulations with the same number of IS. Based on this criterion, all significant windows have the same p-value (p < 0.0001).
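The significance criterion reduces to a comparison against the simulated values for the matching IS bin. A minimal Python sketch follows; the data layout (`sims_by_is`, a mapping from number of IS to a list of simulated NCD2 values) and the function names are hypothetical.

```python
def is_significant(observed_ncd2, n_is, sims_by_is):
    """True if the observed NCD2 is lower than every simulated value
    obtained for the same number of informative sites (IS)."""
    return all(observed_ncd2 < s for s in sims_by_is[n_is])


def simulation_p_bound(n_sims=10_000):
    """Upper bound on the p-value of a window that beats all simulations."""
    return 1.0 / n_sims
```

With 10,000 simulations per bin, a window that is lower than all of them has p < 1/10,000, which is the bound quoted in the text.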

Outlier windows In order to rank the queried windows and apply an outlier approach, we developed a standardized distance measure between the observed NCD2tf (for the queried window) and the mean of the NCD2tf values for the 10,000 simulations with the matching number of IS. This distance (Ztf) is given by:

$$Z_{tf} = \frac{\mathrm{NCD2}_{tf} - \mathrm{NCD2}_{IS}}{sd_{IS}} \qquad (4)$$

where Ztf is the NCD2tf distance corrected for the number of IS, NCD2tf is the NCD2 value for the n-th empirical window, NCD2IS is the mean NCD2 for 10,000 neutral simulations with the corresponding value of IS, and sdIS is the standard deviation of NCD2 for the 10,000 simulations with the matching number of IS.
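Equation 4 can be computed as in the following Python sketch. The text does not state whether sdIS is the sample or the population standard deviation; the sketch uses the sample standard deviation (`statistics.stdev`) as an assumption, and the function name `z_tf` is hypothetical.

```python
import statistics


def z_tf(observed_ncd2, simulated_ncd2):
    """Standardized distance of an observed NCD2 from neutral simulations
    with the same number of IS (Equation 4).

    simulated_ncd2 must contain at least two values that are not all equal.
    """
    mean_is = statistics.mean(simulated_ncd2)
    sd_is = statistics.stdev(simulated_ncd2)  # sample sd: an assumption
    return (observed_ncd2 - mean_is) / sd_is
```

Strongly negative values indicate windows whose NCD2 is far below the neutral expectation for their number of IS.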

This standardized distance measure takes into account the range of possible values within each IS value, and also the different ranges of values across different target frequencies. Therefore, Ztf allows not only the ranking of all windows for a given tf, but also takes into account the residual effect that the number of IS has on NCD2tf (even after filtering for a minimum number of IS, see S11 and S18 Figs) and, finally, allows a comparison between the rankings of a window considering different target frequencies.

Once the Ztf scores were calculated, the windows were ranked according to Z0.5, Z0.4, and Z0.3. An empirical p-value was attributed to each window based on its Ztf value, and the windows in the lower 0.05% tail (816 windows) of the genomic distribution of Ztf values were defined as the “outlier windows”. All outlier windows are contained within the significant windows, except four windows in LWK and one window in TSI.
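The ranking step can be sketched in Python as follows. The empirical p-value is taken here to be the rank of a window divided by the total number of windows, and the size of the outlier set is obtained by rounding; both choices, and the function names, are assumptions made for illustration.

```python
def empirical_p_values(z_scores):
    """Map each window to rank / N, where rank 1 is the lowest Z score.

    z_scores: {window_id: Z value}
    """
    ordered = sorted(z_scores, key=z_scores.get)
    n = len(ordered)
    return {w: (rank + 1) / n for rank, w in enumerate(ordered)}


def outlier_windows(z_scores, tail=0.0005):
    """Return the windows in the lower tail of the Z distribution."""
    ordered = sorted(z_scores, key=z_scores.get)
    k = round(tail * len(ordered))
    return set(ordered[:k])
```

Under this rounding assumption, a tail of 0.05% applied to the 1,631,372 considered windows gives 816 windows, which matches the count reported above.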

Assigned tf values As mentioned above, the p-values obtained from Ztf can be directly compared across tf values. When a window is an outlier for several tf values, this property allows the window to be assigned the tf value for which its signature is strongest. For the significant and outlier windows, we assigned the tf that yields the lowest empirical p-value for the window (S6 Table). For the outlier genes in Table 3, a tf value was assigned to a gene by asking: (1) which window overlapping the gene has the lowest p-value; and (2) which tf value is associated with the p-value in (1). Thus, the assigned tf value for a gene is the assigned tf of the overlapping window with the lowest empirical p-value. This was done for each population separately, as shown in S7 Table.
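The two-step assignment of tf to a gene can be written compactly. The following Python sketch is an illustration under the assumption that the empirical p-values are available as nested dictionaries; the names `assigned_tf` and `assigned_tf_for_gene` are hypothetical.

```python
def assigned_tf(p_by_tf):
    """Return the tf with the lowest empirical p-value for one window.

    p_by_tf: {tf: empirical p-value}
    """
    return min(p_by_tf, key=p_by_tf.get)


def assigned_tf_for_gene(windows):
    """Return the assigned tf of the overlapping window with the lowest p-value.

    windows: {window_id: {tf: empirical p-value}} for all windows
             overlapping the gene
    """
    best_window = min(windows, key=lambda w: min(windows[w].values()))
    return assigned_tf(windows[best_window])
```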

Coverage To test whether the signatures of LTBS are driven by undetected duplications, which can produce mapping errors and false SNPs, we analyzed modern human shotgun genome-wide data sequenced to an average coverage per individual between 20x and 30x (Meyer et al., 2012; Prüfer et al., 2013). We used an independent dataset because read coverage data is low and cryptic in the 1000G, and because putative duplications that affect the SFS must be at appreciable frequency and should therefore be present in other datasets. We considered 12 genomes: two genomes per population and two populations per continent, namely Yoruba and San (Africa), French and Sardinian (Europe), and Dai and Han Chinese (Asia).

For each sample, we retrieved the positions with coverage higher than the 97.5th percentile of the coverage distribution specific to that sample (termed “high coverage” positions). For each window in our analysis for signatures of LTBS, we calculated the proportion of positions having high coverage in at least two samples (pHC), and plotted its distribution for different NCD2 empirical p-values, i.e., those based on the Ztf scores (S14 Fig). Our significant and outlier windows are not enriched in positions with high coverage in the samples considered here. On the contrary, the significant windows show a significant reduction in the proportion of positions with high coverage when compared with non-significant windows (Mann-Whitney U test, all two-tailed p-values < 0.001).
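The pHC calculation can be sketched in Python. The sketch uses a nearest-rank percentile to set the per-sample threshold, which is an assumption (the text does not specify how the 97.5% cut-off was computed), and the function names and data layout are hypothetical.

```python
import math
from collections import Counter


def high_coverage_positions(coverage, quantile=0.975):
    """Return positions whose depth exceeds the sample-specific quantile.

    coverage: {position: read depth} for one sample
    """
    depths = sorted(coverage.values())
    idx = max(0, math.ceil(quantile * len(depths)) - 1)  # nearest rank
    threshold = depths[idx]
    return {pos for pos, depth in coverage.items() if depth > threshold}


def proportion_high_coverage(window_positions, high_cov_by_sample, min_samples=2):
    """Proportion of window positions with high coverage in >= min_samples samples.

    high_cov_by_sample: one set of high-coverage positions per sample
    Returns None for an empty window.
    """
    n_samples_high = Counter()
    for high_cov in high_cov_by_sample:
        n_samples_high.update(high_cov)
    positions = list(window_positions)
    if not positions:
        return None
    n_high = sum(1 for pos in positions if n_samples_high[pos] >= min_samples)
    return n_high / len(positions)
```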

Enrichment Analyses

Gene and Phenotype Ontology Whenever a candidate window overlaps a protein-coding gene to any extent, this gene is considered a “candidate gene”. This includes windows that fall within intronic regions. GO (gene ontology) and PO (phenotype ontology) enrichment analyses were performed using the software GOWINDA (Kofler and Schlötterer, 2012), which avoids common biases that result from gene length (longer genes with more windows have by chance a higher probability of containing a candidate window) and/or gene clustering. We ran the analysis in mode: gene and performed 100,000 simulations for FDR estimation. Significant categories were defined as those with FDR < 0.05.

GOWINDA was designed for SNP-based analysis, so we considered the middle position of every scanned window as the target site. To correct for this, we extended gene coordinates by 1500 bp upstream and downstream by using the option updownstream1500 in SNP-to-gene mapping, so that the correct coordinates of each window are considered. We used the annotation file (.gtf) and the gene set file for Gene Ontology from Ensembl (http://www.ensembl.org/index.html), and the Phenotype Ontology file from the Human Phenotype Ontology database (http://human-phenotype-ontology.github.io/). Separate analyses were performed for each population, considering combinations of different sets of genes: 1) different types of candidate windows (outlier vs significant windows); 2) different tf (0.5, 0.4 and 0.3); 3) the union of candidate windows for all tf; 4) excluding the classical HLA genes with previous evidence of balancing selection (HLA-B, HLA-C, HLA-DRB1, HLA-DRB5, HLA-DPA1, HLA-DPA2, HLA-DPB1, HLA-DPB2, HLA-DQB1, HLA-DQB2, HLA-DQA1, HLA-DQA2).

Enrichment for coding and non-synonymous SNPs We used annotation from the 1000 Genomes (Abecasis et al., 2012) to define coding, intergenic, synonymous and non-synonymous SNPs. Every SNP used in the NCD2 calculation and overlapping NCD2-analyzed windows was considered in this analysis. A GOWINDA re-sampling approach, as described above, was used to perform the enrichment analysis. To control for possible effects of allele frequency on the enrichment for specific features such as eQTLs, a separate analysis included only SNPs at intermediate frequencies (MAF >= 20%) in each of the four populations.

RegulomeDB To test for enrichment of putatively regulatory sites among targets of balancing selection, we used RegulomeDB, a SNP-based annotation of known and predicted regulatory elements (Boyle et al., 2012). Specifically, we considered RegulomeDB score 1 (eQTL + TF binding + matched TF motif + matched DNase footprint + DNase peak), score 2 (TF binding + matched TF motif + matched DNase footprint + DNase peak), scores 1 and 2 together (sites that are annotated as eQTLs and those that are not), and score 7 (no regulatory annotation). Scores 1 and 2 represent the SNPs with the highest evidence for regulatory function, and score 7 those with the lowest.

For each candidate window, we summed the number of SNPs with each score that overlap the window. The expectation in the absence of LTBS was obtained by randomly sampling from the genome the same number of windows as there are with evidence for LTBS (Table 2). This enabled the calculation of an empirical p-value for the enrichment of RegulomeDB scores in candidate windows relative to the empirical background distribution, while accounting for the size of each set of candidate windows (significance when p < 0.05). Because we considered the sum of scores across all windows, counting each SNP only once even if it overlapped more than one window, our strategy is insensitive to window length. We conducted similar analyses considering only alleles found at intermediate frequencies (MAF >= 20%), as described above.
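The resampling scheme can be illustrated with the Python sketch below. The number of resamples is not stated in the text, so `n_resamples` is an arbitrary placeholder; the one-sided p-value with a +1 correction is likewise an assumption, and all names are hypothetical.

```python
import random


def count_unique_snps(windows, snps_in_window):
    """Count SNPs with the score of interest across windows, each SNP once."""
    seen = set()
    for w in windows:
        seen |= snps_in_window.get(w, set())
    return len(seen)


def enrichment_p_value(candidates, all_windows, snps_in_window,
                       n_resamples=1000, seed=0):
    """Empirical p-value for enrichment of annotated SNPs in candidate windows.

    Random sets of the same size as the candidate set are drawn from
    all_windows; p is the fraction of sets with a count at least as high as
    the observed one (with a +1 correction, an assumption of this sketch).
    """
    rng = random.Random(seed)
    pool = list(all_windows)
    k = len(candidates)
    observed = count_unique_snps(candidates, snps_in_window)
    hits = sum(
        count_unique_snps(rng.sample(pool, k), snps_in_window) >= observed
        for _ in range(n_resamples)
    )
    return (hits + 1) / (n_resamples + 1)
```

Counting each SNP once through a set union is what makes the statistic independent of how many overlapping windows cover the same SNP.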

Immune-related genes To test specifically for enrichment of immunity-related genes, we used a list of 386 immune-related keywords from the Comprehensive List of Immune-Related Genes from ImmPort (https://immport.niaid.nih.gov/) to query the GO categories of the outlier genes. In total, 200 of our 212 outlier genes have at least one associated GO category; of these, 62 have at least one GO category that matches at least one of the keywords on the list and were thus considered to be “immune-related”.
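The keyword query can be sketched in Python. How a GO category "matches" a keyword is not specified in the text; the sketch assumes a case-insensitive substring match, and the keywords in the usage example are placeholders rather than entries from the ImmPort list.

```python
def is_immune_related(go_terms, keywords):
    """True if any GO term contains any keyword (case-insensitive substring)."""
    terms = [t.lower() for t in go_terms]
    return any(k.lower() in t for k in keywords for t in terms)


def immune_related_genes(go_by_gene, keywords):
    """Return the genes with at least one GO term matching a keyword.

    go_by_gene: {gene: list of GO term names}
    """
    return {g for g, terms in go_by_gene.items() if is_immune_related(terms, keywords)}
```

For example, with the placeholder keywords `["immune", "antigen"]`, a gene annotated with "antigen processing and presentation" would be flagged, whereas one annotated only with "lipid metabolic process" would not.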

References

Abecasis, G. R., A. Auton, L. D. Brooks, M. A. DePristo, R. M. Durbin, R. E. Handsaker, H. M. Kang, G. T. Marth, and G. A. McVean (2012). “An integrated map of genetic variation from 1,092 human genomes”. In: Nature 491 (7422), pp. 56–65.
Akey, J. M. (2009). “Constructing genomic maps of positive selection in humans: where do we go from here?” In: Genome Research 19 (5), pp. 711–722.
Alkan, C. et al. (2009). “Personalized copy number and segmental duplication maps using next-generation sequencing”. In: Nature Genetics 41 (10), pp. 1061–1067.
Allison, A. C. and D. F. Clyde (1961). “Malaria in African Children with Deficient Erythrocyte Glucose-6-phosphate Dehydrogenase”. In: British Medical Journal 1 (5236), pp. 1346–1349.
Alonso, S., S. Lopez, N. Izagirre, and C. de la Rua (2008). “Overdominance in the Human Genome and Olfactory Receptor Activity”. In: Molecular Biology and Evolution 25 (5), pp. 997–1001.
Anders, S. and W. Huber (2010). “Differential expression analysis for sequence count data”. In: Genome Biology 11 (10), R106.
Andrés, A. M. (2011). “Balancing Selection in the Human Genome”. In: eLS, pp. 1–8.
Andrés, A. M. et al. (2009). “Targets of balancing selection in the human genome”. In: Molecular Biology and Evolution 26 (12), pp. 2755–64.
Asthana, S., S. Schmidt, and S. R. Sunyaev (2005). “A limited role for balancing selection”. In: Trends in Genetics 21 (1), pp. 30–32.
Benson, G. (1999). “Tandem repeats finder: a program to analyze DNA sequences”. In: Nucleic Acids Research 27 (2), pp. 573–580.
Bergland, A. O., E. L. Behrman, K. R. O’Brien, P. S. Schmidt, and D. A. Petrov (2014). “Genomic Evidence of Rapid and Stable Adaptive Oscillations over Seasonal Time Scales in Drosophila”. In: PLoS Genetics 10 (11), e1004775.
Biasin, M. et al. (2007). “Apolipoprotein B mRNA-Editing Enzyme, Catalytic Polypeptide-Like 3G: A Possible Role in the Resistance to HIV of HIV-Exposed Seronegative Individuals”. In: Journal of Infectious Diseases 195 (7), pp. 960–964.
Bigham, A. et al. (2010). “Identifying Signatures of Natural Selection in Tibetan and Andean Populations Using Dense Genome Scan Data”. In: PLoS Genetics 6 (9). Ed. by D. J. Begun, e1001116.
Boyle, A. P. et al. (2012). “Annotation of functional variation in personal genomes using RegulomeDB”. In: Genome Research 22 (9), pp. 1790–1797.
Bubb, K. L. et al. (2006). “Scan of human genome reveals no new Loci under ancient balancing selection”. In: Genetics 173 (4), pp. 2165–77.
Cagliani, R., M. Fumagalli, M. Biasin, L. Piacentini, S. Riva, U. Pozzoli, M. C. Bonaglia, N. Bresolin, M. Clerici, and M. Sironi (2010). “Long-term balancing selection maintains trans-specific polymorphisms in the human TRIM5 gene”. In: Human Genetics 128 (6), pp. 577–88.
Cagliani, R., S. Riva, U. Pozzoli, M. Fumagalli, G. P. Comi, N. Bresolin, M. Clerici, and M. Sironi (2011). “Balancing selection is common in the extended MHC region but most alleles with opposite risk profile for autoimmune diseases are neutrally evolving”. In: BMC Evolutionary Biology 11 (1), p. 171.
Charlesworth, B. and D. Charlesworth (2010). Elements of Evolutionary Genetics. 1st ed. Roberts and Company Publishers, p. 768. ISBN: 0981519423.
Charlesworth, B., M. Nordborg, and D. Charlesworth (1997). “The effects of local selection, balanced polymorphism and background selection on equilibrium patterns of genetic diversity in subdivided populations”. In: Genetical Research 70, pp. 155–174.
Charlesworth, D. (2006). “Balancing selection and its effects on sequences in nearby genome regions”. In: PLoS Genetics 2 (4), pp. 379–384.
Cheng, Z. et al. (2005). “A genome-wide comparison of recent chimpanzee and human segmental duplications”. In: Nature 437 (7055), pp. 88–93.

Chun, S. and J. C. Fay (2011). “Evidence for hitchhiking of deleterious mutations within the human genome”. In: PLoS Genetics 7 (8), e1002240.
Clarke, B. (1962). “Balanced polymorphism and the diversity of sympatric species”. In: Taxonomy and Geography. Ed. by D. Nichols. Oxford: Systematics Association.
Clarke, B. (1964). “Frequency-Dependent Selection for the Dominance of Rare Polymorphic Genes”. In: Evolution 18 (3), pp. 364–369.
Coventry, A. et al. (2010). “Deep resequencing reveals excess rare recent variants consistent with explosive population growth”. In: Nature Communications 1 (8), p. 131.
Day, F. R. et al. (2015). “Causal mechanisms and balancing selection inferred from genetic associations with polycystic ovary syndrome”. In: Nature Communications 6, p. 8464.
DeGiorgio, M., K. E. Lohmueller, and R. Nielsen (2014). “A model-based approach for identifying signatures of ancient balancing selection in genetic data”. In: PLoS Genetics 10 (8), e1004561.
Derrien, T., J. Estellé, S. Marco Sola, D. G. Knowles, E. Raineri, R. Guigó, and P. Ribeca (2012). “Fast Computation and Applications of Genome Mappability”. In: PLoS ONE 7 (1). Ed. by C. A. Ouzounis, e30377.
Ewing, G. and J. Hermisson (2010). “MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus”. In: Bioinformatics 26 (16), pp. 2064–2065.
Fijarczyk, A. and W. Babik (2015). “Detecting balancing selection in genomes: Limits and prospects”. In: Molecular Ecology, n/a–n/a.
Filippo, C. de, F. M. Key, S. Ghirotto, A. Benazzo, J. R. Meneu, A. Weihmann, G. Parra, E. D. Green, and A. M. Andrés (2016). “Recent Selection Changes in Human Genes under Long-Term Balancing Selection”. In: Molecular Biology and Evolution, msw023.
Gravel, S., B. M. Henn, R. N. Gutenkunst, A. R. Indap, G. T. Marth, A. G. Clark, F. Yu, R. A. Gibbs, and C. D. Bustamante (2011). “Demographic history and rare allele sharing among human populations”. In: Proceedings of the National Academy of Sciences of the United States of America 108 (29), pp. 11983–8.
Gutenkunst, R. N., R. D. Hernandez, S. H. Williamson, and C. D. Bustamante (2009). “Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data”. In: PLoS Genetics 5 (10). Ed. by G. McVean, e1000695.

Hedrick, P. W. (2006). “Genetic Polymorphism in Heterogeneous Environments: The Age of Genomics”. In: Annual Review of Ecology, Evolution, and Systematics 37, pp. 67–93.
Hedrick, P. W. (2012). “What is the evidence for heterozygote advantage selection?” In: Trends in Ecology & Evolution 27 (12), pp. 698–704.
Hedrick, P. W., T. S. Whittam, and P. Parham (1991). “Heterozygosity at individual sites: extremely high levels for HLA-A and -B genes”. In: Proceedings of the National Academy of Sciences 88 (13), pp. 5897–5901.
Howell, W. M. (2014). “HLA and disease: Guilt by association”. In: International Journal of Immunogenetics 41 (1), pp. 1–12.
Hudson, R. R. and N. L. Kaplan (1988). “The coalescent process in models with selection and recombination”. In: Genetics 120 (3), pp. 831–840.
Hudson, R. R., M. Kreitman, and M. Aguade (1987). “A Test of Neutral Molecular Evolution Based on Nucleotide Data”. In: Genetics 116 (1), pp. 153–159.
Hughes, A. L. and M. Nei (1989). “Evolution of the major histocompatibility complex: independent origin of nonclassical class I genes in different groups of mammals”. In: Molecular Biology and Evolution 6 (6), pp. 559–79.
Hughes, A. L. and M. Nei (1988). “Pattern of nucleotide substitution at major histocompatibility class I loci reveals overdominant selection”. In: Letters to Nature 335 (8), pp. 167–170.
Johnston, S. E., J. Gratten, C. Berenos, J. G. Pilkington, T. H. Clutton-Brock, J. M. Pemberton, and J. Slate (2013). “Life history trade-offs at a single locus maintain sexually selected genetic variation”. In: Nature 502 (7469), pp. 93–95.
Key, F. M., B. Peter, M. Y. Dennis, E. Huerta-Sánchez, W. Tang, L. Prokunina-Olsson, R. Nielsen, and A. M. Andrés (2014a). “Selection on a Variant Associated with Improved Viral Clearance Drives Local, Adaptive Pseudogenization of Interferon Lambda 4 (IFNL4)”. In: PLoS Genetics 10 (10), e1004681.
Key, F. M., J. C. Teixeira, C. de Filippo, and A. M. Andrés (2014b). “Advantageous diversity maintained by balancing selection in humans”. In: Current Opinion in Genetics & Development 29, pp. 45–51.
Kofler, R. and C. Schlötterer (2012). “Gowinda: Unbiased analysis of gene set enrichment for genome-wide association studies”. In: Bioinformatics 28 (15), pp. 2084–2085.

Leffler, E. M. et al. (2013). “Multiple Instances of Ancient Balancing Selection Shared Between Humans and Chimpanzees”. In: Science 339 (6127), pp. 1578–1582.
Linnenbrink, M., J. M. Johnsen, I. Montero, C. R. Brzezinski, B. Harr, and J. F. Baines (2011). “Long-Term Balancing Selection at the Blood Group-Related Gene B4galnt2 in the Genus Mus (Rodentia; Muridae)”. In: Molecular Biology and Evolution 28 (11), pp. 2999–3003.
Liu, X. et al. (2006). “An ancient balanced polymorphism in a regulatory region of human major histocompatibility complex is retained in Chinese minorities but lost worldwide”. In: American Journal of Human Genetics 78 (3), pp. 393–400.
Malaria Genomic Epidemiology Network (2015). “A novel locus of resistance to severe malaria in a region of ancient balancing selection”. In: Nature 526 (7572), pp. 253–257.
Meyer, D., R. M. Single, S. J. Mack, H. A. Erlich, and G. Thomson (2006). “Signatures of demographic history and natural selection in the human major histocompatibility complex Loci”. In: Genetics 173 (4), pp. 2121–2142.
Meyer, D. and G. Thomson (2001). “How selection shapes variation of the human major histocompatibility complex: a review”. In: Annals of Human Genetics 65 (1), pp. 1–26.
Meyer, M. et al. (2012). “A High-Coverage Genome Sequence from an Archaic Denisovan Individual”. In: Science 338 (6104), pp. 222–226.
Nachman, M. W. and S. L. Crowell (2000). “Estimate of the Mutation Rate per Nucleotide in Humans”. In: Genetics 156 (1), pp. 297–304.
Nielsen, R. et al. (2005). “A Scan for Positively Selected Genes in the Genomes of Humans and Chimpanzees”. In: PLoS Biology 3 (6), e170.
Nielsen, R. et al. (2009). “Darwinian and demographic forces affecting human protein coding genes”. In: Genome Research 19 (5), pp. 838–49.
Pasvol, G., D. J. Weatherall, and R. J. M. Wilson (1978). “Cellular mechanism for the protective effect of haemoglobin S against P. falciparum malaria”. In: Nature 274 (5672), pp. 701–703.
Perry, G. H. et al. (2008). “Copy number variation and evolution in humans and chimpanzees”. In: Genome Research 18 (11), pp. 1698–1710.
Petit, F. G., C. Kervarrec, S. P. Jamin, F. Smagulova, C. Hao, E. Becker, B. Jegou, F. Chalmel, and M. Primig (2015). “Combining RNA and Protein Profiling Data with Network Interactions Identifies Genes Associated with Spermatogenesis in Mouse and Human”. In: Biology of Reproduction 92 (3), pp. 71–71.

Prüfer, K. et al. (2013). “The complete genome sequence of a Neanderthal from the Altai Mountains”. In: Nature 505 (7481), pp. 43–49.
Prugnolle, F., A. Manica, M. Charpentier, J. F. Guégan, V. Guernier, and F. Balloux (2005). “Pathogen-driven selection and worldwide HLA class I diversity”. In: Current Biology 15 (11), pp. 1022–7.
Rasmussen, M. D., M. J. Hubisz, I. Gronau, and A. Siepel (2014). “Genome-Wide Inference of Ancestral Recombination Graphs”. In: PLoS Genetics 10 (5). Ed. by G. Coop, e1004342.
Raychaudhuri, S. et al. (2012). “Five amino acids in three HLA proteins explain most of the association between MHC and seropositive rheumatoid arthritis”. In: Nature Genetics 44 (3), pp. 291–296.
Rottgardt, I., F. Rothhammer, and M. Dittmar (2010). “Native highland and lowland populations differ in γ-globin gene promoter polymorphisms related to altered fetal hemoglobin levels and delayed fetal to adult globin switch after birth”. In: Anthropological Science 118 (1), pp. 41–48.
Sabeti, P. C. et al. (2007). “Genome-wide detection and characterization of positive selection in human populations”. In: Nature 449 (7164), pp. 913–8.
Sanchez-Mazas, A. (2007). “An apportionment of human HLA diversity”. In: Tissue Antigens 69, pp. 198–202.
Savova, V., S. Chun, M. Sohail, R. B. McCole, R. Witwicki, L. Gai, T. L. Lenz, C.-t. Wu, S. R. Sunyaev, and A. A. Gimelbrant (2016). “Genes with monoallelic expression contribute disproportionately to genetic diversity in humans”. In: Nature Genetics 48 (3), pp. 231–237.
Ségurel, L., Z. Gao, and M. Przeworski (2013). “Ancestry runs deeper than blood: The evolutionary history of ABO points to cryptic variation of functional importance”. In: BioEssays 35 (10), pp. 862–867.
Segurel, L. et al. (2012). “The ABO blood group is a trans-species polymorphism in primates”. In: Proceedings of the National Academy of Sciences 109 (45), pp. 18493–18498.
Sellis, D., B. J. Callahan, D. A. Petrov, and P. W. Messer (2011). “Heterozygote advantage as a natural consequence of adaptation in diploids”. In: Proceedings of the National Academy of Sciences 108 (51), pp. 20666–20671.
Solberg, O. D., S. J. Mack, A. K. Lancaster, R. M. Single, Y. Tsai, A. Sanchez-Mazas, and G. Thomson (2008). “Balancing selection and heterogeneity across the classical human leukocyte antigen loci: A meta-analytic review of 497 population studies”. In: Human Immunology 69 (7), pp. 443–464.
Spurgin, L. G. and D. S. Richardson (2010). “How pathogens drive genetic diversity: MHC, mechanisms and misunderstandings”. In: Proceedings. Biological sciences / The Royal Society 277 (1684), pp. 979–88.
Staubach, F., S. Künzel, A. C. Baines, A. Yee, B. M. McGee, F. Bäckhed, J. F. Baines, and J. M. Johnsen (2012). “Expression of the blood-group-related glycosyltransferase B4galnt2 influences the intestinal microbiota in mice”. In: The ISME Journal 6 (7), pp. 1345–1355.
Sun, C., D. Huo, C. Southard, B. Nemesure, A. Hennis, M. Cristina Leske, S.-Y. Wu, D. B. Witonsky, O. I. Olopade, and A. Di Rienzo (2011). “A signature of balancing selection in the region upstream to the human UGT2B4 gene and implications for breast cancer risk”. In: Human Genetics 130 (6), pp. 767–775.
Tajima, F. (1989). “Statistical method for testing the neutral mutation hypothesis by DNA polymorphism”. In: Genetics 123 (3), pp. 585–595.
Tan, Z., A. M. Shon, and C. Ober (2005). “Evidence of balancing selection at the HLA-G promoter region”. In: Human Molecular Genetics 14 (23), pp. 3619–3628.
Teixeira, J. C. et al. (2015). “Long-Term Balancing Selection in LAD1 Maintains a Missense Trans-Species Polymorphism in Humans, Chimpanzees, and Bonobos”. In: Molecular Biology and Evolution 32 (5), pp. 1186–1196.
Vernot, B. and J. M. Akey (2014). “Resurrecting Surviving Neandertal Lineages from Modern Human Genomes”. In: Science 343 (6174), pp. 1017–1021.
Williamson, S. H., M. J. Hubisz, A. G. Clark, B. A. Payseur, C. D. Bustamante, and R. Nielsen (2007). “Localizing recent adaptive evolution in the human genome”. In: PLoS Genetics 3 (6), e90.
Yi, X. et al. (2010). “Sequencing of 50 Human Exomes Reveals Adaptation to High Altitude”. In: Science 329 (5987), pp. 75–78.


Supplementary Text

S1 Text: Additional analyses for significant and outlier windows and genes

Ruling out possible biological confounding factors

In all the analyses below, the set of significant or outlier windows (or genes) consists of the union of the windows (or of the genes overlapped by them) across all tf values.

Neanderthal introgression

Background Genomic segments that contain introgressed haplotypes from archaic humans (Meyer et al., 2012; Prüfer et al., 2013) have, on average, an older TMRCA and higher diversity than the rest of the genome. In the absence of positive (or balancing) selection, though, introgressed segments are not expected to reach intermediate frequencies and thus to contribute to the significant and outlier windows defined in the main paper.

Results Accordingly, the significant and outlier windows in European populations are not enriched in putatively introgressed SNPs, defined as those with an allele that is absent in Africans, shared between Europeans and Neanderthals, and located in previously identified introgressed regions (Vernot and Akey, 2014) (S17 Fig and Methods in main paper). In fact, the outlier windows are significantly depleted of introgressed SNPs (S17 Fig).

Methods We tested the enrichment of Neanderthal introgression among candidate windows in TSI and GBR using the resampling approach described for the RegulomeDB functional enrichment analysis. Putative Neanderthal-introgressed SNPs were ascertained as SNPs that overlap the Neanderthal introgressed haplotypes annotated by Vernot and Akey (2014) and that, in the 1000 Genomes data (Abecasis et al., 2012), show a derived allele shared between TSI/GBR and Neanderthals and absent in YRI and LWK. The remaining SNPs overlapping scanned windows were considered non-introgressed.


Non-homologous gene conversion

Results and Methods We also investigated the possibility of non-homologous gene conversion, another biological phenomenon that may increase diversity. To do so, for each significant or outlier gene (see Table 2 in main text) we analyzed the distribution of the number of paralogs that reside on the same chromosome. Significant genes show no tendency towards having more paralogs on the same chromosome than all autosomal genes (see S16 Fig). In both cases, more than 60% of the genes have no paralogs on the same chromosome (S16 Fig). We nevertheless singled out olfactory receptor (OR) genes (see below), which often appear in tandem and may undergo gene conversion. Unlike the other significant and background genes, more than 80% of the OR genes that are significant in all populations for at least one tf value have at least one paralog on the same chromosome (S19 Fig). Thus, non-homologous gene conversion does not appear to be a general issue among significant genes, with the exception of the OR genes.

Olfactory receptor genes

Among the windows believed to contain fewer false-positive candidates for LTBS, only two OR genes are present: OR52A1 and OR6J1 (Table 3). Although patterns compatible with overdominance have been reported for human olfactory receptor activity genes (Alonso et al., 2008), we cannot rule out that the enrichment signature detected in genes of the olfactory receptor (OR) family is due to paralogous gene conversion (S19 Fig). Moreover, OR52A1 has 10 paralogs on the same chromosome, and OR6J1 has one (Table 3). We therefore recommend that the results concerning OR genes be interpreted with caution.

Phenotype Ontology Analyses

Results A phenotype ontology analysis uncovered “abnormality of the sclera” as the only significant category in YRI, and no significant categories appear in the other three populations analyzed (S4 Table).

Methods See Methods section in main paper (“Gene and phenotype ontology”).


Tissue-specific expression

Results Interestingly, though, when we perform an enrichment analysis for tissue-specific expression among significant windows, we find that targets of LTBS are significantly enriched in genes that are highly expressed in the adrenal gland in TSI and in the lung in GBR (S5 Table). These results are mirrored in African populations when considering outlier windows, although there they are not significant.

Methods We performed an enrichment analysis for genes showing tissue-specific expression, also using GOWINDA, with the same parameters and strategy described above, except that here we considered only two criteria for defining different sets in each population: 1) different ascertainment of candidate windows (outlier vs significant); 2) the union of windows for different target frequencies. We used Illumina BodyMap 2.0 (Derrien et al., 2012) expression data for 16 different tissues, and developed a tissue-specific expression metric that considers genes expressed at significantly higher levels in a particular tissue than in the remaining 15 tissues, using the DESeq package (Anders and Huber, 2010).

S2 Text: A set of significant genes

We considered the union of significant genes (between 1,321 and 1,400 genes for each population, Table 2) and defined as “African” those shared by YRI and LWK and by neither or only one of the European populations, as “European” those shared by GBR and TSI and by neither or only one of the African populations, and as “African and European” those shared by all four populations. This resulted in 1,051 genes shared between LWK and YRI and 1,089 shared between GBR and TSI. In total, this amounts to 1,470 genes (∼8% of all queried genes): 670 are considered “African and European”, 381 “African”, and 419 “European”. These genes are presented in S8 Table.
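The grouping rule above can be written with set operations. The following is a minimal Python sketch, not the code used in the analysis: the function name `classify_shared_genes` and the toy gene labels are hypothetical, and only the population labels come from the text.

```python
def classify_shared_genes(sig):
    """Group genes by continental sharing.

    `sig` maps a population label (YRI, LWK, GBR, TSI) to the set of
    genes that are significant in that population.
    """
    african = sig["YRI"] & sig["LWK"]    # shared by both African populations
    european = sig["GBR"] & sig["TSI"]   # shared by both European populations
    both = african & european            # shared by all four populations
    return {
        "African and European": both,
        "African": african - both,       # at most one European population
        "European": european - both,     # at most one African population
    }


# Toy input (hypothetical gene labels, for illustration only).
toy = {
    "YRI": {"g1", "g2", "g3", "g5"},
    "LWK": {"g1", "g2", "g3"},
    "GBR": {"g1", "g2", "g4"},
    "TSI": {"g1", "g4", "g5"},
}
groups = classify_shared_genes(toy)
```

The reported counts are consistent with this rule: the two continental intersections (1,051 and 1,089 genes) overlap in 670 genes, so 1,051 − 670 = 381 genes are “African” and 1,089 − 670 = 419 are “European”, for a total of 670 + 381 + 419 = 1,470.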

S3 Text: Manual verification of reliability of SNPs contained in four of the outlier genes

Description

In the main text, we mention that among the top 10 outlier genes from Table 3 (considering the p-values for YRI), 6 have been reported previously as having signatures of LTBS: PROKR2 (Leffler et al., 2013), HLA-DQA1 (DeGiorgio et al., 2014), CPE (DeGiorgio et al., 2014), HLA-DRB5 (DeGiorgio et al., 2014), LUZP2 (DeGiorgio et al., 2014), and MYO3A (Asthana et al., 2005; DeGiorgio et al., 2014) (Tables 3 and S7).

Thus, the novel top candidates are: B4GALNT2 (Beta-1,4-N-Acetyl-Galactosaminyl Transferase 2), C1orf101 (Chromosome 1 Open Reading Frame 101), NDUFA10 (NADH-Ubiquinone Oxidoreductase 42 KDa Subunit), and PCDH15 (Protocadherin-Related 15). In the main text, we discuss these genes in more detail, given that they have extreme signatures of LTBS shared across populations.

In order to verify that the extreme signatures of these genes are genuinely due to balancing selection, and not to (a) bad SNP calls caused by collapsed reads from duplicates or (b) non-homologous gene conversion between close paralogs, we performed a manual verification using BLAT (http://www.ensembl.org/Multi/Tools/Blast?db=core).

Methods

For each gene, the corresponding FASTA sequence was taken from the hg19 reference genome and queried in BLAT. We only considered the top 100 hits for each gene. For each of the 100 hits, positions that coincide with a SNP position for the gene in the Phase 1 1000 Genomes data set (Abecasis et al., 2012) were manually verified. If the position is a match between the query and the hit, i.e., both have the same variant at the SNP position, the SNP is considered a match. If the position is a mismatch between the query and the hit, i.e., query and hit have different variants at the SNP position, the SNP is considered a mismatch.

A mismatched SNP could either have the alternate allele in the hit (alternate mismatch*) or an allele which is not the alternate allele in the 1000 Genomes data set (simple mismatch). Further, we considered as more relevant and likely problematic those SNPs that are not only classified as alternate mismatch, but also have somewhat intermediate frequencies (> 0.10). Results are provided for each gene separately, below. The location of each gene is provided for reference.
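The verification itself was done by hand, but the classification rule described above is simple enough to state as code. The following Python sketch is only an illustration of that rule: the function names and arguments are hypothetical, and we assume that the frequency compared against 0.10 is the frequency of the SNP in the population under consideration.

```python
def classify_hit_position(query_base, hit_base, alt_allele):
    """Classify a SNP position in one BLAT hit.

    query_base: base in the query (reference) sequence at the SNP position
    hit_base:   base in the aligned hit at the same position
    alt_allele: alternate allele of the SNP in the 1000 Genomes data set
    """
    query_base, hit_base, alt_allele = (
        b.upper() for b in (query_base, hit_base, alt_allele)
    )
    if hit_base == query_base:
        return "match"
    if hit_base == alt_allele:
        return "alternate mismatch"
    return "simple mismatch"


def is_likely_problematic(label, frequency, min_frequency=0.10):
    """Flag alternate mismatches at somewhat intermediate frequency.

    The 0.10 cutoff comes from the text; treating `frequency` as the
    SNP frequency in one population is an assumption of this sketch.
    """
    return label == "alternate mismatch" and frequency > min_frequency
```

For example, a hit carrying the alternate allele of a SNP with frequency 0.35 would be labeled "alternate mismatch" and flagged, whereas a singleton with the same label would not be flagged.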

Results

B4GALNT2

Description. Position: chr17:47203660-47211160 (7,500 bp in total). In total, 96 SNPs in our filtered data set (see Methods in main text). Roughly 36 of them have somewhat intermediate frequencies.

BLAT. After looking at all hits, we found three alternate mismatch SNPs. Only two of them have intermediate frequency (rs78050610, rs11654406), while the other is a singleton (rs140853454).

Conclusion. Roughly 34 intermediate-frequency SNPs remain for this gene (the roughly 36 above minus the two problematic ones), making it unlikely that its signatures depend on problematic SNPs.

NDUFA10

Description. Position: chr2:240850012-240854512 (4,500 bp in total). In total, 74 SNPs in our filtered data set (see Methods in main text). Roughly 41 are at intermediate frequency in most African and European populations.

BLAT. After looking at all hits, we found two alternate mismatch SNPs, one of which has intermediate frequency (rs6759128), while the other has low to intermediate frequency (rs28429725).

Conclusion. Roughly 29 intermediate SNPs remain for this gene, making it unlikely that its signatures depend on problematic SNPs.

PCDH15

Description. Position: chr10:56902047-56911047 (9,000 bp in total). In total, 145 SNPs in our filtered data set (see Methods in main text), roughly 63 at intermediate frequency.

BLAT. After looking at all hits, we found 1 alternate mismatch SNP (rs188658080), which has very low frequency in all populations or is absent from some of them (considering only African and European populations).

Conclusion. Although this SNP is likely unreliable, it has low frequency. Given also that it is the only position that clearly appears to be problematic, we concluded that the signature observed for this gene is reliable.

C1orf101

Description. Position: chr1:244,617,679-244,804,479 (10,500 bp in total), 143 SNPs, roughly 76 at intermediate frequency.

BLAT. After looking at all hits, we found 19 alternate mismatch SNPs. This seemed to be a potentially problematic candidate for LTBS, so we looked at it in further detail. Of the 19 alternate mismatch SNPs, between 7 and 12 have intermediate frequency. They are listed below.

Alternate mismatch SNPs for C1orf101 (those marked with * have intermediate frequency and are thus more likely to be sources of bias for the observed NCD2 values): rs3003250, rs3005972, rs3005973, rs138545291 (singleton), rs114536159, rs142620294, rs189591539 (singleton), rs3003251*, rs3005938*, rs3005945*, rs3005947*, rs3005957*, rs3005958*, rs3005968*, rs3005969*, rs3005971*, rs3005975*, rs7538776*, rs9429008*

Given that several SNPs for this gene are likely problematic (alternate mismatches with intermediate frequencies), we recalculated NCD2 for the windows that overlap this gene after removing those SNPs. We removed the 19 SNPs (including the low/high frequency ones and singletons) from our data set (Methods in main text) and recalculated NCD2(0.5) for the six outlier windows (Table 2) that overlap this gene in YRI. All six of these windows have zero FDs. For the first of the six windows, removal of the problematic SNPs reduced the number of SNPs from 19 to 17 (IS = 17), so it no longer fulfills our criterion of having at least 19 IS in Africa (see Methods in main text). The second window goes from 22 to 19 IS. It passes the "significance criterion" (simulation-based p < 0.0001) and has a Ztf value within the observed range for outlier windows.
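The recalculation described above can be sketched as follows. This is a hedged illustration rather than the pipeline used for the analyses. It assumes the NCD2 definition from the main text as we understand it: the square root of the mean squared distance between the minor allele frequency of each informative site and the target frequency, with fixed differences (FDs) entering with frequency 0. It also assumes that IS counts SNPs plus FDs. The function names and the data layout (a dictionary from rsID to minor allele frequency) are hypothetical.

```python
import math


def ncd2(snp_mafs, n_fds, tf=0.5):
    """NCD2 for one window (assumed definition, see lead-in).

    snp_mafs: minor allele frequencies of the SNPs in the window
    n_fds:    number of fixed differences in the window (frequency 0)
    tf:       target frequency
    """
    n_is = len(snp_mafs) + n_fds
    if n_is == 0:
        raise ValueError("window has no informative sites")
    squared = sum((p - tf) ** 2 for p in snp_mafs) + n_fds * tf ** 2
    return math.sqrt(squared / n_is)


def recompute_without(snps, excluded, n_fds, tf=0.5, min_is=19):
    """Drop the excluded rsIDs and recompute NCD2.

    Returns (number of informative sites, NCD2 or None). None means the
    window falls below the minimum number of informative sites
    (19 in Africa, per the text).
    """
    kept = [maf for rsid, maf in snps.items() if rsid not in excluded]
    n_is = len(kept) + n_fds
    if n_is < min_is:
        return n_is, None
    return n_is, ncd2(kept, n_fds, tf)
```

Under these assumptions, a window that drops from 19 to 17 SNPs with zero FDs returns `(17, None)` and is discarded, which corresponds to the first window described above.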

Conclusion. 19 SNPs are alternate mismatches, and 7-12 of them also have intermediate frequencies in most African and European populations. Even if all 19 SNPs are removed and NCD2(0.5) is recalculated, at least one window remains significant and is probably still an outlier, indicating that this gene is likely a true positive. Moreover, there is an entire range of the query (between positions 8,000 and 10,500) for which there are no hits, and this range contains 37 SNPs (around 12 at intermediate frequency), further supporting the signatures observed for this gene.

Supplementary Tables

S1-A Table. Power analyses based on simulations (Africa). Reported values are for simulations following African demographic scenarios (see Methods). feq, equilibrium frequency used in the simulations; target frequency is the tf value used in NCD1 and NCD2; Tbs, time since onset of balancing selection; L, length of the simulated sequence. Reported values are always for a false positive rate of 0.05.

              NCD2                   NCD1
Tbs  feq  L   0.5    0.4    0.3      0.5    0.4    0.3    (target frequency)
5    0.5  3   0.959  0.944  0.835    0.929  0.911  0.393
5    0.5  6   0.917  0.885  0.728    0.903  0.847  0.392
5    0.5  12  0.829  0.789  0.548    0.846  0.772  0.325
5    0.4  3   0.939  0.935  0.886    0.886  0.894  0.674
5    0.4  6   0.871  0.860  0.790    0.838  0.819  0.651
5    0.4  12  0.742  0.726  0.612    0.745  0.717  0.534
5    0.3  3   0.895  0.908  0.929    0.717  0.801  0.836
5    0.3  6   0.776  0.796  0.833    0.659  0.709  0.794
5    0.3  12  0.572  0.597  0.638    0.509  0.570  0.663
3    0.5  3   0.911  0.882  0.681    0.855  0.797  0.236
3    0.5  6   0.856  0.809  0.574    0.854  0.770  0.266
3    0.5  12  0.727  0.666  0.410    0.768  0.678  0.232
3    0.4  3   0.878  0.864  0.759    0.781  0.783  0.557
3    0.4  6   0.803  0.785  0.678    0.770  0.753  0.527
3    0.4  12  0.659  0.621  0.500    0.654  0.629  0.441
3    0.3  3   0.749  0.774  0.811    0.561  0.640  0.706
3    0.3  6   0.628  0.648  0.700    0.526  0.570  0.658
3    0.3  12  0.425  0.456  0.509    0.388  0.443  0.536
1    0.5  3   0.419  0.336  0.159    0.389  0.321  0.087
1    0.5  6   0.356  0.283  0.107    0.395  0.294  0.086
1    0.5  12  0.273  0.217  0.100    0.359  0.268  0.096
1    0.4  3   0.372  0.338  0.247    0.348  0.340  0.185
1    0.4  6   0.305  0.278  0.206    0.355  0.324  0.195
1    0.4  12  0.252  0.232  0.156    0.304  0.274  0.174
1    0.3  3   0.229  0.239  0.280    0.194  0.249  0.279
1    0.3  6   0.190  0.203  0.233    0.210  0.232  0.277
1    0.3  12  0.142  0.150  0.169    0.159  0.182  0.219
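As an illustration of what "power at a false positive rate of 0.05" means in the S1 Tables, the sketch below derives a threshold from neutral simulations and then measures the fraction of simulations under balancing selection that pass it. It is not the code used to produce the tables. The function name is hypothetical, and we assume that lower NCD values are the more extreme ones, so the threshold is taken from the lower tail of the neutral distribution.

```python
def power_at_fpr(neutral_values, selected_values, fpr=0.05):
    """Fraction of selected simulations at or below the neutral threshold.

    neutral_values:  NCD values from neutral simulations
    selected_values: NCD values from simulations with balancing selection
    fpr:             tolerated false positive rate
    """
    ordered = sorted(neutral_values)
    # Number of neutral simulations allowed at or below the threshold.
    k = int(fpr * len(ordered))
    if k == 0:
        raise ValueError("too few neutral simulations for this FPR")
    threshold = ordered[k - 1]
    hits = sum(1 for value in selected_values if value <= threshold)
    return hits / len(selected_values)
```

With ties absent among the neutral values, exactly a fraction `fpr` of neutral simulations falls at or below the threshold, so the returned value is directly comparable across the target frequencies and statistics in the tables.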

S1-B Table. Power analyses based on simulations (Europe). Reported values are for simulations following European demographic scenarios (see Methods). feq, equilibrium frequency used in the simulations; target frequency is the tf value used in NCD1 and NCD2; Tbs, time since onset of balancing selection; L, length of the simulated sequence. Reported values are always for a false positive rate of 0.05.

              NCD2                   NCD1
Tbs  feq  L   0.5    0.4    0.3      0.5    0.4    0.3    (target frequency)
5    0.5  3   0.968  0.951  0.835    0.921  0.846  0.197
5    0.5  6   0.941  0.907  0.747    0.916  0.871  0.234
5    0.5  12  0.849  0.80   0.573    0.847  0.767  0.201
5    0.4  3   0.948  0.944  0.907    0.849  0.826  0.596
5    0.4  6   0.901  0.892  0.832    0.849  0.850  0.633
5    0.4  12  0.779  0.757  0.687    0.750  0.744  0.536
5    0.3  3   0.836  0.855  0.892    0.471  0.569  0.740
5    0.3  6   0.726  0.758  0.810    0.497  0.606  0.722
5    0.3  12  0.551  0.595  0.670    0.387  0.493  0.644
3    0.5  3   0.928  0.892  0.678    0.814  0.693  0.145
3    0.5  6   0.875  0.833  0.607    0.841  0.755  0.195
3    0.5  12  0.761  0.704  0.451    0.759  0.670  0.187
3    0.4  3   0.888  0.875  0.794    0.738  0.709  0.461
3    0.4  6   0.828  0.809  0.723    0.782  0.765  0.517
3    0.4  12  0.678  0.662  0.567    0.682  0.676  0.462
3    0.3  3   0.733  0.757  0.795    0.389  0.480  0.634
3    0.3  6   0.609  0.643  0.703    0.425  0.512  0.632
3    0.3  12  0.433  0.473  0.558    0.332  0.402  0.548
1    0.5  3   0.472  0.388  0.161    0.430  0.305  0.054
1    0.5  6   0.404  0.325  0.139    0.467  0.347  0.064
1    0.5  12  0.337  0.264  0.120    0.430  0.306  0.058
1    0.4  3   0.425  0.386  0.286    0.371  0.332  0.218
1    0.4  6   0.362  0.330  0.260    0.398  0.398  0.249
1    0.4  12  0.296  0.269  0.216    0.380  0.392  0.208
1    0.3  3   0.229  0.249  0.287    0.145  0.170  0.259
1    0.3  6   0.190  0.206  0.263    0.172  0.233  0.299
1    0.3  12  0.151  0.161  0.220    0.140  0.164  0.256

S1-C Table. Power analyses based on simulations (Asia). Reported values are for simulations following Asian demographic scenarios (see Methods). feq, equilibrium frequency used in the simulations; target frequency is the tf value used in NCD1 and NCD2; Tbs, time since onset of balancing selection; L, length of the simulated sequence. Reported values are always for a false positive rate of 0.05.

              NCD2                   NCD1
Tbs  feq  L   0.5    0.4    0.3      0.5    0.4    0.3    (target frequency)
5    0.5  3   0.666  0.687  0.705    0.448  0.476  0.365
5    0.5  6   0.584  0.605  0.614    0.438  0.465  0.378
5    0.5  12  0.469  0.476  0.450    0.398  0.401  0.332
5    0.4  3   0.343  0.372  0.430    0.136  0.167  0.225
5    0.4  6   0.262  0.291  0.356    0.135  0.167  0.224
5    0.4  12  0.187  0.206  0.241    0.116  0.133  0.189
5    0.3  3   0.113  0.135  0.186    0.015  0.022  0.055
5    0.3  6   0.062  0.071  0.113    0.012  0.024  0.046
5    0.3  12  0.030  0.041  0.068    0.011  0.015  0.037
3    0.5  3   0.611  0.627  0.616    0.393  0.404  0.344
3    0.5  6   0.532  0.545  0.529    0.411  0.422  0.374
3    0.5  12  0.412  0.418  0.389    0.371  0.370  0.298
3    0.4  3   0.245  0.269  0.332    0.111  0.141  0.173
3    0.4  6   0.189  0.208  0.252    0.101  0.128  0.166
3    0.4  12  0.128  0.144  0.178    0.085  0.111  0.142
3    0.3  3   0.073  0.087  0.126    0.011  0.022  0.050
3    0.3  6   0.036  0.052  0.072    0.012  0.017  0.046
3    0.3  12  0.019  0.029  0.047    0.010  0.018  0.037
1    0.5  3   0.287  0.274  0.222    0.245  0.225  0.145
1    0.5  6   0.222  0.212  0.159    0.235  0.215  0.167
1    0.5  12  0.181  0.173  0.132    0.205  0.188  0.135
1    0.4  3   0.092  0.098  0.116    0.048  0.061  0.092
1    0.4  6   0.068  0.074  0.098    0.044  0.049  0.084
1    0.4  12  0.055  0.065  0.076    0.041  0.051  0.075
1    0.3  3   0.028  0.028  0.042    0.016  0.018  0.028
1    0.3  6   0.015  0.020  0.030    0.008  0.014  0.026
1    0.3  12  0.016  0.018  0.026    0.010  0.012  0.018

S2 Table. Gene ontology enrichment analyses for significant windows. The union of significant windows for at least one of the tf values is used. tf, target frequency used in NCD equation. FDR, false discovery rate. genes (sims), expected number of genes in this category (see Methods). genes (data), actual number of genes in the category in the analyzed set.

GO term  # genes (sims)  # genes (data)  p-value  FDR  Category description

LWK
GO:0007186  35.17  61  0.00001  0.00104  G-protein_coupled_receptor_signaling_pathway
GO:0042612  0.13  4  0.00001  0.00104  MHC_class_I_protein_complex
GO:0042613  0.291  10  0.00001  0.00104  MHC_class_II_protein_complex
GO:0045095  0.961  9  0.00001  0.00104  keratin_filament
GO:0042605  0.564  8  0.00001  0.00104  peptide_antigen_binding
GO:0071556  0.689  12  0.00001  0.00104  integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0032395  0.124  6  0.00001  0.00104  MHC_class_II_receptor_activity
GO:0001916  0.659  7  0.00001  0.00104  positive_regulation_of_T_cell_mediated_cytotoxicity
GO:0016021  314.878  394  0.00001  0.00104  integral_to_membrane
GO:0002504  0.216  7  0.00001  0.00104  antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0006955  14.317  33  0.00001  0.00104  immune_response
GO:0060333  3.3  13  0.00001  0.00104  interferon-gamma-mediated_signaling_pathway
GO:0002480  0.223  4  0.00001  0.00104  antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-independent
GO:0019882  1.306  15  0.00001  0.00104  antigen_processing_and_presentation
GO:0030658  2.507  11  0.00001  0.00104  transport_vesicle_membrane
GO:0030669  0.972  8  0.00001  0.00104  clathrin-coated_endocytic_vesicle_membrane
GO:0004984  2.255  26  0.00001  0.00104  olfactory_receptor_activity
GO:0012507  1.324  12  0.00001  0.00104  ER_to_Golgi_transport_vesicle_membrane
GO:0004930  24.693  55  0.00001  0.00104  G-protein_coupled_receptor_activity
GO:0005887  89.892  126  0.00002  0.00194  integral_to_plasma_membrane
GO:0032588  3.136  11  0.00002  0.00194  trans-Golgi_network_membrane
GO:0007608  4.93  14  0.00004  0.00387  sensory_perception_of_smell
GO:0002479  2.565  10  0.00014  0.01231  antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-dependent
GO:0019885  0.167  3  0.00014  0.01231  antigen_processing_and_presentation_of_endogenous_peptide_antigen_via_MHC_class_I
GO:0042590  2.749  10  0.00024  0.02308  antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I
GO:0046967  0.039  2  0.00027  0.02518  cytosol_to_ER_transport

LWK (without HLA)
GO:0007186  34.905  61  0.00001  0.00393  G-protein_coupled_receptor_signaling_pathway
GO:0045095  0.956  9  0.00001  0.00393  keratin_filament
GO:0016021  312.414  387  0.00001  0.00393  integral_to_membrane
GO:0004984  2.239  26  0.00001  0.00393  olfactory_receptor_activity
GO:0004930  24.507  55  0.00001  0.00393  G-protein_coupled_receptor_activity
GO:0005887  89.171  122  0.00005  0.01537  integral_to_plasma_membrane
GO:0007608  4.891  14  0.00005  0.01537  sensory_perception_of_smell
GO:0042605  0.531  5  0.00013  0.0347  peptide_antigen_binding

YRI
GO:0042612  0.139  4  0.00001  0.00114  MHC_class_I_protein_complex
GO:0042613  0.311  11  0.00001  0.00114  MHC_class_II_protein_complex
GO:0045095  1.034  9  0.00001  0.00114  keratin_filament
GO:0042605  0.605  7  0.00001  0.00114  peptide_antigen_binding
GO:0071556  0.743  12  0.00001  0.00114  integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0032395  0.132  7  0.00001  0.00114  MHC_class_II_receptor_activity
GO:0016021  334.989  401  0.00001  0.00114  integral_to_membrane
GO:0002504  0.231  7  0.00001  0.00114  antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0006955  15.313  36  0.00001  0.00114  immune_response
GO:0060333  3.529  13  0.00001  0.00114  interferon-gamma-mediated_signaling_pathway
GO:0002480  0.24  4  0.00001  0.00114  antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-independent
GO:0019882  1.401  16  0.00001  0.00114  antigen_processing_and_presentation
GO:0030658  2.646  10  0.00001  0.00114  transport_vesicle_membrane
GO:0030669  1.043  8  0.00001  0.00114  clathrin-coated_endocytic_vesicle_membrane
GO:0004984  2.416  21  0.00001  0.00114  olfactory_receptor_activity
GO:0012507  1.424  12  0.00001  0.00114  ER_to_Golgi_transport_vesicle_membrane
GO:0004930  26.334  51  0.00001  0.00114  G-protein_coupled_receptor_activity
GO:0007186  37.515  62  0.00003  0.00336  G-protein_coupled_receptor_signaling_pathway
GO:0032588  3.336  11  0.00003  0.00336  trans-Golgi_network_membrane
GO:0050911  0.244  4  0.00009  0.00993  detection_of_chemical_stimulus_involved_in_sensory_perception_of_smell
GO:0005576  92.15  123  0.00023  0.02592  extracellular_region
GO:0001916  0.707  5  0.00033  0.03707  positive_regulation_of_T_cell_mediated_cytotoxicity

YRI (without HLA)
GO:0007186  37.187  62  0.00001  0.00492  G-protein_coupled_receptor_signaling_pathway
GO:0045095  1.023  9  0.00001  0.00492  keratin_filament
GO:0004984  2.397  21  0.00001  0.00492  olfactory_receptor_activity
GO:0004930  26.103  51  0.00001  0.00492  G-protein_coupled_receptor_activity
GO:0016021  332.421  394  0.00002  0.00802  integral_to_membrane
GO:0005576  91.502  123  0.0004  0.03721  extracellular_region
GO:0050911  0.242  4  0.00013  0.03885  detection_of_chemical_stimulus_involved_in_sensory_perception_of_smell

GBR
GO:0042612  0.128  4  0.00001  0.00129  MHC_class_I_protein_complex
GO:0042613  0.285  9  0.00001  0.00129  MHC_class_II_protein_complex
GO:0042605  0.562  6  0.00001  0.00129  peptide_antigen_binding
GO:0071556  0.683  12  0.00001  0.00129  integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0032395  0.12  6  0.00001  0.00129  MHC_class_II_receptor_activity
GO:0016021  312.004  411  0.00001  0.00129  integral_to_membrane
GO:0005576  85.721  124  0.00001  0.00129  extracellular_region
GO:0002504  0.211  6  0.00001  0.00129  antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0060333  3.266  16  0.00001  0.00129  interferon-gamma-mediated_signaling_pathway
GO:0002480  0.221  4  0.00001  0.00129  antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-independent
GO:0019882  1.294  15  0.00001  0.00129  antigen_processing_and_presentation
GO:0030669  0.96  9  0.00001  0.00129  clathrin-coated_endocytic_vesicle_membrane
GO:0004984  2.241  21  0.00001  0.00129  olfactory_receptor_activity
GO:0012507  1.314  12  0.00001  0.00129  ER_to_Golgi_transport_vesicle_membrane
GO:0004930  24.487  50  0.00001  0.00129  G-protein_coupled_receptor_activity
GO:0007186  34.875  58  0.00002  0.00235  G-protein_coupled_receptor_signaling_pathway
GO:0006955  14.165  31  0.00002  0.00235  immune_response
GO:0030658  2.469  10  0.00003  0.00346  transport_vesicle_membrane
GO:0045095  0.956  7  0.00005  0.00569  keratin_filament
GO:0060337  1.679  8  0.00009  0.01008  type_I_interferon_signaling_pathway
GO:0032588  3.114  10  0.00014  0.01412  trans-Golgi_network_membrane
GO:0007608  4.894  13  0.00019  0.02054  sensory_perception_of_smell
GO:0001916  0.654  5  0.0002  0.02087  positive_regulation_of_T_cell_mediated_cytotoxicity

GBR (without HLA)
GO:0007186  34.566  58  0.00001  0.0039  G-protein_coupled_receptor_signaling_pathway
GO:0016021  309.57  404  0.00001  0.0039  integral_to_membrane
GO:0005576  85.005  124  0.00001  0.0039  extracellular_region
GO:0004984  2.217  21  0.00001  0.0039  olfactory_receptor_activity
GO:0004930  24.276  50  0.00001  0.0039  G-protein_coupled_receptor_activity
GO:0045095  0.948  7  0.00003  0.01045  keratin_filament

TSI
GO:0042613  0.304  8  0.00001  0.00163  MHC_class_II_protein_complex
GO:0042605  0.588  6  0.00001  0.00163  peptide_antigen_binding
GO:0071556  0.718  10  0.00001  0.00163  integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0032395  0.129  5  0.00001  0.00163  MHC_class_II_receptor_activity
GO:0016021  325.611  414  0.00001  0.00163  integral_to_membrane
GO:0005576  89.537  140  0.00001  0.00163  extracellular_region
GO:0002504  0.225  6  0.00001  0.00163  antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0060333  3.413  13  0.00001  0.00163  interferon-gamma-mediated_signaling_pathway
GO:0019882  1.357  13  0.00001  0.00163  antigen_processing_and_presentation
GO:0004984  2.347  23  0.00001  0.00163  olfactory_receptor_activity
GO:0012507  1.371  10  0.00001  0.00163  ER_to_Golgi_transport_vesicle_membrane
GO:0004930  25.566  48  0.00001  0.00163  G-protein_coupled_receptor_activity
GO:0007608  5.077  15  0.00002  0.00313  sensory_perception_of_smell
GO:0006955  14.835  31  0.00005  0.00762  immune_response
GO:0030669  1.007  7  0.00006  0.00861  clathrin-coated_endocytic_vesicle_membrane
GO:0060402  2.707  8  0.00017  0.02497  calcium_ion_transport_into_cytosol
GO:0030658  2.578  9  0.00018  0.02503  transport_vesicle_membrane
GO:0032588  3.244  10  0.00022  0.02925  trans-Golgi_network_membrane
GO:0042612  0.135  3  0.00023  0.02925  MHC_class_I_protein_complex

TSI (without HLA)
GO:0016021  323.377  407  0.00001  0.003894  integral_to_membrane
GO:0005576  88.921  140  0.00001  0.003894  extracellular_region
GO:0007608  5.05  15  0.00001  0.003894  sensory_perception_of_smell
GO:0004984  2.326  23  0.00001  0.003894  olfactory_receptor_activity
GO:0004930  25.39  48  0.00001  0.003894  G-protein_coupled_receptor_activity

S3 Table. Gene ontology enrichment analyses for outlier windows. The union of significant windows for at least one of the tf values is used. tf, target frequency used in NCD equation. FDR, false discovery rate. genes (sims), expected number of genes in this category (see Methods). genes (data), actual number of genes in the category in the analyzed set. Because no categories remained significant after removal of HLA genes, these sets are not reported.

GO term  # genes (sims)  # genes (data)  p-value  FDR  Category description

YRI
GO:0019882  0.157  5  0.00005  0.00402  antigen_processing_and_presentation
GO:0030669  0.119  4  0.00005  0.00402  clathrin-coated_endocytic_vesicle_membrane
GO:0032395  0.015  3  0.00005  0.00402  MHC_class_II_receptor_activity
GO:0042613  0.034  5  0.00005  0.00402  MHC_class_II_protein_complex
GO:0071556  0.081  4  0.00005  0.00402  integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0030658  0.354  5  0.00002  0.00674  transport_vesicle_membrane
GO:0012507  0.157  4  0.00004  0.01191  ER_to_Golgi_transport_vesicle_membrane
GO:0002504  0.025  3  0.00005  0.01318  antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0031295  0.444  5  0.00011  0.02661  T_cell_costimulation

LWK
GO:0002504  0.029  5  0.00005  0.00129  antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0019882  0.177  10  0.00005  0.00129  antigen_processing_and_presentation
GO:0019221  1.556  10  0.00005  0.00129  cytokine-mediated_signaling_pathway
GO:0030658  0.395  7  0.00005  0.00129  transport_vesicle_membrane
GO:0030669  0.133  6  0.00005  0.00129  clathrin-coated_endocytic_vesicle_membrane
GO:0032588  0.491  6  0.00005  0.00129  trans-Golgi_network_membrane
GO:0031295  0.495  7  0.00005  0.00129  T_cell_costimulation
GO:0012507  0.177  9  0.00005  0.00129  ER_to_Golgi_transport_vesicle_membrane
GO:0032395  0.016  5  0.00005  0.00129  MHC_class_II_receptor_activity
GO:0042612  0.017  3  0.00005  0.00129  MHC_class_I_protein_complex
GO:0042613  0.039  7  0.00005  0.00129  MHC_class_II_protein_complex
GO:0006955  1.986  16  0.00005  0.00129  immune_response
GO:0042605  0.074  4  0.00005  0.00129  peptide_antigen_binding
GO:0060333  0.455  9  0.00005  0.00129  interferon-gamma-mediated_signaling_pathway
GO:0071556  0.092  9  0.00005  0.00129  integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0002480  0.029  3  0.00005  0.00129  antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-independent
GO:0001916  0.089  3  0.00008  0.01018  positive_regulation_of_T_cell_mediated_cytotoxicity
GO:0019886  0.775  6  0.0001  0.01099  antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_II
GO:0030666  0.752  6  0.0001  0.01099  endocytic_vesicle_membrane
GO:0005765  2.206  10  0.0001  0.01099  lysosomal_membrane
GO:0032689  0.099  3  0.00012  0.01257  negative_regulation_of_interferon-gamma_production
GO:0030670  0.305  4  0.00035  0.03549  phagocytic_vesicle_membrane

TSI
GO:0002504  0.027  5  0.00005  0.00131  antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0019882  0.17  9  0.00005  0.00131  antigen_processing_and_presentation
GO:0019221  1.477  9  0.00005  0.00131  cytokine-mediated_signaling_pathway
GO:0030658  0.375  6  0.00005  0.00131  transport_vesicle_membrane
GO:0030669  0.126  5  0.00005  0.00131  clathrin-coated_endocytic_vesicle_membrane
GO:0031295  0.467  6  0.00005  0.00131  T_cell_costimulation
GO:0012507  0.168  8  0.00005  0.00131  ER_to_Golgi_transport_vesicle_membrane
GO:0032395  0.016  4  0.00005  0.00131  MHC_class_II_receptor_activity
GO:0042612  0.016  3  0.00005  0.00131  MHC_class_I_protein_complex
GO:0042613  0.036  6  0.00005  0.00131  MHC_class_II_protein_complex
GO:0006955  1.898  15  0.00005  0.00131  immune_response
GO:0042605  0.072  4  0.00005  0.00131  peptide_antigen_binding
GO:0060333  0.434  9  0.00005  0.00131  interferon-gamma-mediated_signaling_pathway
GO:0071556  0.088  8  0.00005  0.00131  integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0002480  0.029  3  0.00005  0.00131  antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-independent
GO:0001916  0.085  3  0.00005  0.00628  positive_regulation_of_T_cell_mediated_cytotoxicity
GO:0060337  0.214  4  0.00005  0.00628  type_I_interferon_signaling_pathway
GO:0032588  0.466  5  0.00006  0.0072  trans-Golgi_network_membrane
GO:0030670  0.291  4  0.00028  0.03158  phagocytic_vesicle_membrane

GBR
GO:0002504  0.024  6  0.00005  0.00112  antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0019882  0.151  11  0.00005  0.00112  antigen_processing_and_presentation
GO:0019221  1.332  10  0.00005  0.00112  cytokine-mediated_signaling_pathway
GO:0030658  0.339  7  0.00005  0.00112  transport_vesicle_membrane
GO:0030669  0.112  6  0.00005  0.00112  clathrin-coated_endocytic_vesicle_membrane
GO:0032588  0.421  6  0.00005  0.00112  trans-Golgi_network_membrane
GO:0031295  0.421  6  0.00005  0.00112  T_cell_costimulation
GO:0012507  0.151  9  0.00005  0.00112  ER_to_Golgi_transport_vesicle_membrane
GO:0032395  0.014  5  0.00005  0.00112  MHC_class_II_receptor_activity
GO:0042612  0.014  3  0.00005  0.00112  MHC_class_I_protein_complex
GO:0042613  0.033  7  0.00005  0.00112  MHC_class_II_protein_complex
GO:0006955  1.7  16  0.00005  0.00112  immune_response
GO:0042605  0.064  4  0.00005  0.00112  peptide_antigen_binding
GO:0060333  0.391  10  0.00005  0.00112  interferon-gamma-mediated_signaling_pathway
GO:0005765  1.871  10  0.00005  0.00112  lysosomal_membrane
GO:0071556  0.079  9  0.00005  0.00112  integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0002480  0.025  3  0.00005  0.00112  antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-independent
GO:0030666  0.644  6  0.00002  0.00222  endocytic_vesicle_membrane
GO:0001916  0.074  3  0.00003  0.00311  positive_regulation_of_T_cell_mediated_cytotoxicity
GO:0060337  0.195  4  0.00003  0.00311  type_I_interferon_signaling_pathway
GO:0019886  0.662  6  0.00005  0.00497  antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_II
GO:0050852  0.963  6  0.00017  0.01696  T_cell_receptor_signaling_pathway
GO:0030670  0.26  4  0.0002  0.01922  phagocytic_vesicle_membrane
GO:0002479  0.298  4  0.00029  0.02645  antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-dependent
GO:0042590  0.32  4  0.00032  0.02834  antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I
GO:0050776  0.362  4  0.00053  0.04594  regulation_of_immune_response

S4 Table. Phenotype ontology (PO) enrichment analysis. FDR, false discovery rate (see Methods). Gene sets without any significant enrichment are not shown. genes (sims), expected number of genes in this category (see Methods). genes (data), actual number of genes in the category in the analyzed set. tf, the target frequency for which the set of significant windows showed significant enrichment.

tf  PO term  # genes (sims)  # genes (data)  p-value  FDR  Description

YRI
0.5  HP:0000591  0.023  3  0.00007  0.04093  Abnormality_of_the_sclera

YRI (without HLA)
0.5  HP:0000591  0.023  3  0.00006  0.04216  Abnormality_of_the_sclera

S5 Table. Tissue-specific (TG) expression enrichment analysis. The union of significant windows for at least one tf is used. FDR, false discovery rate (see Methods). Gene sets without any significant enrichment are not shown. genes (sims), expected number of genes in this category (see Methods). genes (data), actual number of genes in the category in the analyzed set.

tf  TG term  # genes (sims)  # genes (data)  p-value  FDR  Description

TSI
Union  TG:02  5.417  12  0.00261  0.03046  adrenal

TSI (without HLA)
Union  TG:02  5.406  12  0.00266  0.02877  adrenal

GBR
Union  TG:10  14.868  25  0.00419  0.04943  lung

S6 Table. Assigned tf values. Reported values are numbers of significant (top) and outlier (bottom) windows (see Methods). The union of windows that are significant or outlier for at least one of the tf values was used and is shown in the last column. Percentages refer to the proportion of windows with a given assigned tf value (the one that minimizes NCD2, see Methods). "|" denotes "or", i.e., when a window is assigned to more than one tf.

Significant (columns are assigned target frequencies)
POP  0.3  0.4  0.5  0.4|0.3  0.5|0.4  0.3|0.4|0.5  Union
LWK  4049(52%)  1002(13%)  2705(35%)  2  10  2  7770
YRI  4481(53%)  1083(13%)  2863(34%)  3  4  2  8436
GBR  4217(49%)  1062(13%)  3238(38%)  3  6  0  8526
TSI  4080(49%)  1172(14%)  3133(37%)  4  6  0  8395

Outlier (columns are assigned target frequencies)
POP  0.3  0.4  0.5  0.4|0.3  0.5|0.4  0.3|0.4|0.5  Union
LWK  565(50%)  142  424(37%)  1  5  2  1139
YRI  587(51%)  144  404(36%)  2  3  0  1142
GBR  584(52%)  129  417(37%)  0  1  0  1131
TSI  571(49%)  148  440(38%)  3  1  0  1163
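The assignment rule in the S6 Table caption (each window is assigned the tf that minimizes NCD2, with ties reported using "|") can be sketched as below. The function name and the input layout are hypothetical, and the ascending order used for tied labels is a choice of this sketch, not something the table prescribes.

```python
def assign_tf(ncd2_by_tf):
    """Return the assigned target frequency label for one window.

    ncd2_by_tf maps each target frequency (e.g. 0.3, 0.4, 0.5) to the
    NCD2 value of the window computed with that target frequency.
    Ties are joined with "|", in ascending order of target frequency.
    """
    best = min(ncd2_by_tf.values())
    winners = sorted(tf for tf, value in ncd2_by_tf.items() if value == best)
    return "|".join(str(tf) for tf in winners)


single = assign_tf({0.3: 0.20, 0.4: 0.25, 0.5: 0.31})
tied = assign_tf({0.3: 0.20, 0.4: 0.20, 0.5: 0.31})
```

Here `single` is the label of a window with a unique minimum, and `tied` is the label of a window that would be counted in one of the "|" columns.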

S7 Table. List of outlier genes. This is the same list reported in Table 3, but with additional information. In purple, "African" genes; in orange, "European" genes; and in green, "African and European" genes (see main text). P, p-value of the most extreme window overlapping the gene; tf, assigned target frequency of the window with the lowest p-value. When a gene is "African" or "European" but one of the populations from the other continent also has an extreme window for the gene, it is highlighted with the same color code.

YRI LWK GBR TSI

131 Capítulo 1

Chr Gene Acronym tf P tf P tf P tf P
9 ABO 0.3 0.000422957 0.3 0.000131178 0.3 0.000674892 0.3 0.000782164
4 ADAM29 0.5 0.000491611 0.5 0.000233546 0.3 0.001198991 0.3 0.001785001
6 AIM1 0.5 0.000268486 0.5 0.000469543 0.3 0.002997477 0.4 0.000785842
2 ALK 0.3 0.0000883 0.3 0.000119531 0.3 0.00053881 0.3 0.000380661
22 ARHGAP8 0.3 0.000361046 0.4 0.000334075 0.3 0.001790517 0.3 0.000108498
14 ATXN3 0.5 0.000470769 0.5 0.000245805 0.3 0.0006847 0.3 0.000549231
1 BCAR3 0.3 0.000410697 0.5 0.000390469 0.3 0.009124835 0.3 0.004408559
12 C12orf54 0.3 0.00038863 0.4 0.000418666 0.3 0.000673666 0.4 0.000638726
1 C1orf101 0.4 0.00000368 0.3 0.0000153 0.3 0.014656988 0.3 0.000584171
22 C22orf34 0.5 0.0000791 0.5 0.000095 0.3 0.000426635 0.3 0.001254159
13 COG6 0.5 0.000468317 0.5 0.000399051 0.5 0.000622789 0.5 0.000578654
4 COL25A1 0.5 0.000366563 0.5 0.000457284 0.4 0.000483029 0.3 0.000744159
10 CUBN 0.4 0.000409471 0.4 0.000257452 0.3 0.000359207 0.3 0.000757644
2 DIRC3 0.3 0.000182055 0.3 0.000141599 0.3 0.004005218 0.3 0.005316384
18 DTNA 0.3 0.000357981 0.3 0.000205349 0.3 0.007393164 0.3 0.001966443
3 EPHA6 0.3 0.000223738 0.5 0.0000895 0.3 0.006470627 0.3 0.000641791
3 FRMD4B 0.5 0.000164892 0.3 0.000186959 0.3 0.000654051 0.3 0.003204665
16 GPR114 0.5 0.000487933 0.5 0.000225577 0.3 0.017186761 0.3 0.025924191
7 GTF2IRD1 0.3 0.000126887 0.3 0.000216382 0.3 0.082588766 0.5 0.084745846
6 HLA-DQA2 0.3 0.000209639 0.3 0.00029607 0.3 0.002912273 0.3 0.002816648
4 IGFBP7 0.3 0.0000938 0.3 0.000172861 0.3 0.000399664 0.3 0.000703702
1 LGALS8 0.5 0.000361659 0.5 0.000196767 0.3 0.000831815 0.5 0.000490998
4 LGI2 0.5 0.000375144 0.5 0.000463414 0.5 0.000827524 0.5 0.000653438
11 LUZP2 0.3 0.0000276 0.3 0.0000227 0.3 0.002680566 0.3 0.002470926
8 MYOM2 0.5 0.000084 0.5 0.0000821 0.3 0.000918858 0.3 0.000866755
18 NFATC1 0.5 0.000205349 0.5 0.000217608 0.3 0.003760638 0.3 0.002065746
11 OR52A1 0.4 0.0000398 0.3 0.0000184 0.3 0.000447476 0.3 0.001662404
6 PACRG 0.5 0.000130565 0.5 0.000123209 0.5 0.000765 0.5 0.000776034
1 PADI2 0.5 0.000345721 0.4 0.000433378 0.5 0.000478738 0.4 0.000561491
3 PARP15 0.4 0.00021393 0.5 0.000123822 0.5 0.000364724 0.4 0.000834267
6 PDE10A 0.5 0.00023232 0.5 0.000338365 0.3 0.002663402 0.3 0.002801936
22 PRR5-ARHGAP8 0.3 0.000361046 0.4 0.000334075 0.3 0.001790517 0.3 0.000108498
12 PTPRB 0.5 0.000307103 0.5 0.000177764 0.3 0.002825229 0.3 0.00122351
11 PTS 0.5 0.0000828 0.5 0.000101755 0.3 0.000357368 0.3 0.001069039
6 RNF39 0.4 0.000328558 0.3 0.0000944 0.3 0.003772898 0.3 0.000408858

15 RP11-96O20.4 0.5 0.000389243 0.5 0.000419892 0.3 0.001878787 0.3 0.001026743
10 SFTPD 0.3 0.000194315 0.3 0.0000147 0.3 0.023205621 0.3 0.004575903
8 SGCZ 0.5 0.000256226 0.5 0.00048119 0.3 0.000722705 0.4 0.000495289
6 SLC17A5 0.5 0.000250709 0.5 0.00033101 0.5 0.008776049 0.3 0.004632297
11 SLC35F2 0.5 0.000266034 0.5 0.00031875 0.5 0.009484655 0.5 0.011214487
1 SPRR3 0.5 0.000357368 0.5 0.000496515 0.5 0.000538197 0.5 0.000263582
20 SPTLC3 0.5 0.000460962 0.4 0.000395373 0.4 0.000560878 0.5 0.000449315
15 SQRDL 0.5 0.000389243 0.5 0.000419892 0.3 0.001878787 0.3 0.001026743
5 STK32A 0.5 0.000274615 0.5 0.000300361 0.5 0.000370241 0.3 0.000521034
14 STXBP6 0.5 0.000163053 0.5 0.000146502 0.3 0.001623787 0.3 0.00021393
20 TGM6 0.3 0.0000638 0.3 0.00013363 0.3 0.000148341 0.3 0.000667536
13 TMCO3 0.5 0.000337753 0.5 0.00046464 0.3 0.001407404 0.3 0.001530613
16 WWOX 0.3 0.000407019 0.5 0.000427248 0.3 0.000239676 0.3 0.00133875
19 ZNF331 0.5 0.000378209 0.5 0.000473834 0.3 0.00091089 0.3 0.000663245
3 ALDH1L1 0.3 0.000354303 0.3 0.000328558 0.4 0.001094171 0.4 0.001080072
22 CELSR1 0.3 0.000129952 0.3 0.000253774 0.3 0.007483272 0.3 0.010092732
5 COMMD10 0.3 0.000278906 0.3 0.00023232 0.3 0.000710445 0.3 0.001828522
2 MLPH 0.3 0.000350625 0.4 0.000460962 0.3 0.00040518 0.3 0.002850975
18 NEDD4L 0.5 0.000389856 0.3 0.000369015 0.3 0.000840397 0.5 0.001739027
14 OR6J1 0.3 0.000134856 0.3 0.000158762 0.3 0.000568233 0.3 0.000383726
6 SLC22A16 0.3 0.000498354 0.3 0.000435829 0.3 0.010716746 0.3 0.00173351
3 SUMF1 0.3 0.000151406 0.3 0.000326719 0.3 0.00174577 0.3 0.000546166
17 ZZEF1 0.3 0.000253161 0.3 0.000253161 0.3 0.000133017 0.3 0.000502644
15 C15orf48 0.5 0.000165505 0.3 0.00036595 0.3 0.517718828 0.3 0.471580976
6 CCHCR1 0.3 0.000426022 0.3 0.000300361 0.3 0.000782164 0.3 0.000416827
3 CLDN16 0.3 0.000385565 0.3 0.000255613 0.3 0.014823106 0.3 0.010374703
8 EXTL3 0.3 0.000269712 0.3 0.000457897 0.5 0.064383231 0.3 0.00184446
2 IL37 0.3 0.0000368 0.3 0.000430313 0.3 0.007752983 0.3 0.001856106
5 NR3C1 0.3 0.000401503 0.3 0.000410084 0.3 0.000630757 0.3 0.000290553
1 PGLYRP4 0.4 0.0000975 0.3 0.000274615 0.3 0.005533992 0.3 0.006012117
5 SLC27A6 0.3 0.000466479 0.3 0.000497741 0.5 0.00057988 0.4 0.000313233
8 STAU2 0.3 0.000329171 0.3 0.000444411 0.3 0.00096238 0.3 0.000500805
12 TMEM132D 0.5 0.000429087 0.3 0.000454832 0.3 0.00082875 0.5 0.001126659
11 TMEM135 0.3 0.000266647 0.3 0.000291166 0.3 0.000403954 0.3 0.000597043
17 WSCD1 0.5 0.000492837 0.3 0.000438894 0.3 0.001060457 0.3 0.000837945
1 ZNF670 0.3 0.00026726 0.3 0.000476899 0.3 0.000495289 0.3 0.001072717
1 ZNF695 0.3 0.00026726 0.3 0.000476899 0.3 0.000495289 0.3 0.001072717
12 AC121757.1 0.5 0.000594592 0.5 0.000751515 0.5 0.000419279 0.5 0.000359207

10 ADAM12 0.3 0.000508161 0.3 0.000504483 0.5 0.000144664 0.3 0.000123822
20 ADRA1D 0.3 0.000798714 0.3 0.000883919 0.3 0.000235998 0.3 0.000304652
6 AL590867.1 0.3 0.000575589 0.3 0.001535517 0.5 0.000351238 0.5 0.00036595
17 B3GNTL1 0.3 0.002595974 0.4 0.001552068 0.3 0.000171635 0.3 0.00041744
10 BICC1 0.3 0.001851816 0.3 0.002502801 0.5 0.000407632 0.5 0.000411923
1 C1orf222 0.3 0.00446434 0.3 0.000803618 0.3 0.00000797 0.3 0.000116466
13 CCDC169 0.3 0.000228642 0.3 0.000780325 0.5 0.00015631 0.5 0.000118918
13 CCDC169-SOHLH2 0.3 0.000228642 0.3 0.000780325 0.5 0.00015631 0.5 0.000118918
17 CEP112 0.3 0.000579267 0.3 0.000451154 0.5 0.000258678 0.4 0.000275841
1 CNR2 0.3 0.000635661 0.5 0.000484868 0.5 0.000120144 0.5 0.000181442
7 CNTNAP2 0.3 0.000609303 0.3 0.000894339 0.4 0.000355529 0.3 0.000498967
3 CPNE4 0.5 0.000777873 0.3 0.000324267 0.5 0.000391082 0.5 0.000399051
8 CSMD1 0.3 0.000632596 0.3 0.000731899 0.3 0.0000374 0.3 2.57452E-05
4 FRAS1 0.5 0.000617885 0.4 0.000409471 0.5 0.000406406 0.3 0.000336527
9 FXN 0.5 0.000689604 0.5 0.000559652 0.5 0.000375144 0.5 0.000418053
9 GABBR2 0.3 0.000416827 0.3 0.000905986 0.3 0.000141599 0.3 0.000375144
22 GRAMD4 0.5 0.006669846 0.5 0.0068188 0.5 0.000457284 0.5 0.000486707
10 GRID1 0.3 0.013559139 0.3 0.00317708 0.5 0.0000215 0.3 4.78125E-05
12 GRIP1 0.4 0.002874268 0.3 0.000784003 0.3 0.000318137 0.4 0.000397212
7 HUS1 0.5 0.000526551 0.5 0.000349399 0.5 0.000489159 0.5 0.000335914
8 IDO2 0.5 0.000663245 0.5 0.000345721 0.4 0.000384339 0.5 0.000200445
2 IL18R1 0.3 0.001869592 0.3 0.000394147 0.5 0.000265421 0.4 0.000315685
2 IL1RL1 0.3 0.001869592 0.3 0.000394147 0.5 0.000265421 0.4 0.000315685
13 KL 0.5 0.000530842 0.5 0.000665084 0.3 0.000302813 0.3 0.000478738
12 KRT83 0.5 0.000594592 0.5 0.000751515 0.5 0.000419279 0.5 0.000359207
1 LAMC2 0.3 0.000373305 0.5 0.001037164 0.5 0.000218221 0.5 0.000234159
18 LDLRAD4 0.4 0.001146274 0.5 0.001184279 0.5 0.000386791 0.5 0.000328558
11 MYO7A 0.5 0.000855108 0.5 0.00084714 0.5 0.000494063 0.5 0.000434604
14 NRXN3 0.3 0.001166503 0.3 0.001046972 0.3 0.000264808 0.5 0.000215156
1 NSUN4 0.3 0.000539423 0.3 0.001890433 0.3 0.000136082 0.3 0.000272164
12 NTN4 0.3 0.000913955 0.5 0.002137465 0.5 0.000389856 0.4 0.000362272
12 OVCH1-AS1 0.3 0.000649147 0.3 0.000624628 0.5 0.000339591 0.5 0.000262356
6 PKHD1 0.5 0.001134628 0.5 0.001141371 0.5 0.000257452 0.5 0.000239676
1 PPAP2B 0.3 0.000578041 0.3 0.000499579 0.5 0.000363498 0.5 0.000346334
14 PSMC1 0.5 0.000457897 0.5 0.000562717 0.5 0.00048119 0.5 0.000406406
16 RBFOX1 0.5 0.000643017 0.3 0.000335914 0.5 0.000492837 0.4 0.00045851
1 REG4 0.3 0.000530842 0.3 0.000849592 0.5 0.000409471 0.5 0.000314459

6 RUNX2 0.5 0.000683474 0.5 0.000625854 0.5 0.00011524 0.3 0.000229255
4 SEPT11 0.3 0.000912116 0.4 0.000619724 0.5 0.000256226 0.3 0.000395986
13 SGCG 0.3 0.00030649 0.3 0.00086369 0.5 0.0000944 0.4 0.000102981
19 SLC1A6 0.5 0.000507548 0.5 0.000115853 0.5 0.000169183 0.5 0.000129952
4 SLC2A9 0.3 0.000534519 0.3 0.000558426 0.5 0.00034143 0.5 0.000240289
13 SOHLH2 0.3 0.000228642 0.3 0.000780325 0.5 0.00015631 0.5 0.000118918
9 SVEP1 0.3 0.000692669 0.3 0.000551683 0.3 0.000147115 0.3 0.000450541
11 TCP11L1 0.5 0.000770517 0.4 0.000732512 0.5 0.000476899 0.4 0.000497128
4 TEC 0.3 0.000610529 0.3 0.000298522 0.3 0.000319363 0.5 0.00022006
1 TNN 0.3 0.000291166 0.3 0.000523486 0.5 0.000209027 0.5 0.000171635
2 TNS1 0.4 0.000453606 0.5 0.000617885 0.5 0.000299135 0.5 0.000346947
11 TRIM5 0.3 0.000370853 0.5 0.000619111 0.5 0.000204123 0.5 0.000217608
9 TRPM3 0.5 0.000587236 0.5 0.000676731 0.5 0.000410697 0.4 0.000348786
3 VPS8 0.5 0.000796875 0.5 0.000364111 0.5 0.0003825 0.5 0.000335301
8 WDYHV1 0.3 0.000932344 0.3 0.001286034 0.3 0.000329784 0.3 0.000297909
16 CDH5 0.4 0.001592525 0.3 0.000656503 0.3 0.000305878 0.3 0.000150793
5 ITGA1 0.3 0.002080457 0.3 0.001984832 0.3 0.000424183 0.3 0.000488546
9 KDM4C 0.5 0.004607165 0.3 0.002454989 0.3 0.000401503 0.3 0.000391082
11 ALG8 0.4 0.001277452 0.5 0.00042357 0.3 0.000361659 0.3 0.000312007
4 ATP10D 0.3 0.003962309 0.3 0.001068426 0.3 0.000413762 0.3 0.000486707
2 COL4A3 0.3 0.000563942 0.3 0.001126659 0.3 0.000434604 0.3 0.00045238
17 CRHR1 0.5 0.017642818 0.5 0.019007314 0.3 0.000390469 0.5 0.000369628
18 DOK6 0.3 0.001796647 0.3 0.001343654 0.3 0.00045851 0.3 0.000374531
19 GNA15 0.3 0.000422344 0.3 0.000527777 0.3 0.000274615 0.3 0.000152632
6 HLA-DRA 0.3 0.001896563 0.4 0.000440733 0.3 0.000456671 0.3 0.000398438
3 KALRN 0.5 0.001564327 0.3 0.000631983 0.3 0.000498354 0.5 0.000235385
17 KANSL1 0.3 0.0643912 0.3 0.095343674 0.3 0.000445637 0.4 0.000327945
12 OAS1 0.3 0.122187337 0.3 0.122196532 0.3 0.000478738 0.5 0.000413149
7 ORC5 0.3 0.000741707 0.3 0.000407019 0.3 0.000367176 0.3 0.000295457
1 PGBD5 0.3 0.000567007 0.3 0.000665084 0.3 0.000422957 0.3 0.000425409
9 POLR1E 0.3 0.000843462 0.3 0.000389856 0.3 0.000304039 0.3 0.000355529
4 RASSF6 0.3 0.000520421 0.3 0.000393534 0.3 0.000259904 0.5 0.000121983
7 SKAP2 0.3 0.001242512 0.5 0.002187729 0.3 0.000438894 0.3 0.000449315
19 ZNF254 0.3 0.015283455 0.3 0.010679968 0.3 0.000269099 0.3 0.000334688
10 APBB1IP 0.5 0.00024274 0.5 0.000327945 0.5 0.000223125 0.5 0.000201671
17 B4GALNT2 0.5 0.0000343 0.5 0.0000343 0.5 0.000118305 0.5 0.000148341
12 BICD1 0.3 0.0000778 0.3 0.000112788 0.3 0.000377596 0.3 0.000362885
1 C1orf68 0.5 0.000326719 0.5 0.000231707 0.5 0.000437055 0.5 0.000378209
10 CAMK1D 0.5 0.0000533 0.5 0.000038 0.5 0.00000306 0.5 4.90385E-06

4 CPE 0.3 0.0000227 0.3 0.0000288 0.3 0.0000386 0.3 3.00361E-05
10 DMBT1 0.5 0.000106046 0.5 0.000145889 0.5 0.000224964 0.3 0.000498354
1 EDARADD 0.5 0.000117079 0.5 0.000102368 0.3 0.000348173 0.3 0.000222512
14 EGLN3 0.3 0.0000944 0.3 0.0000729 0.5 0.000136082 0.3 0.000152019
1 ERO1LB 0.5 0.00031262 0.5 0.000270938 0.5 0.000335914 0.5 0.000264808
22 FAM19A5 0.3 0.000288714 0.3 0.000169796 0.3 0.000432152 0.3 0.000275228
19 FCER2 0.5 0.000155697 0.5 0.000147115 0.5 0.000274002 0.3 0.000250709
1 GPR137B 0.5 0.00031262 0.5 0.000270938 0.5 0.000335914 0.5 0.000264808
4 GPRIN3 0.5 0.000144664 0.5 0.000175313 0.5 0.000263582 0.5 0.000310168
11 HBE1 0.5 0.000262969 0.5 0.00022619 0.5 0.000220673 0.5 0.000459123
11 HBG2 0.5 0.000262969 0.5 0.00022619 0.5 0.000220673 0.5 0.000459123
7 HIP1 0.5 0.000225577 0.5 0.0000313 0.5 0.00003 0.5 3.67789E-06
6 HLA-B 0.3 0.00012137 0.3 0.0000846 0.3 0.000106659 0.3 0.000117079
6 HLA-C 0.3 0.0000405 0.3 0.0000417 0.3 0.0001275 0.3 0.000138534
6 HLA-DPA1 0.4 0.0000797 0.5 0.0000251 0.3 0.000215156 0.3 0.000221899
6 HLA-DPB1 0.5 0.0000932 0.5 0.0000509 0.4 0.000142825 0.3 8.82693E-05
6 HLA-DQA1 0.3 0.00000245 0.5 0.00000184 0.3 0.0000552 0.3 1.53245E-05
6 HLA-DQB1 0.3 0.000135469 0.3 0.00018512 0.3 0.000108498 0.3 0.00012137
6 HLA-DQB2 0.3 0.0000717 0.4 0.0000564 0.5 0.000365337 0.3 0.000346334
6 HLA-DRB1 0.3 0.000109111 0.3 0.000191863 0.3 0.000087 0.3 0.000137921
6 HLA-DRB5 0.3 0.0000251 0.5 0.0000264 0.4 0.0000233 0.3 3.49399E-05
6 HLA-G 0.4 0.000186959 0.5 0.00023845 0.5 0.00041744 0.5 0.000258678
2 LRP1B 0.5 0.000197993 0.5 0.000200445 0.5 0.000291779 0.3 0.000266034
4 MANBA 0.3 0.000459123 0.5 0.000216382 0.5 0.00033101 0.5 0.000383113
11 MMP26 0.3 0.0000901 0.3 0.0000828 0.3 0.00000368 0.3 0.000161214
2 MROH2A 0.4 0.000180829 0.4 0.000182668 0.5 0.000412536 0.5 0.000192476
10 MYO3A 0.3 0.0000343 0.3 0.0000362 0.5 0.0000153 0.5 1.40986E-05
3 MYRIP 0.5 0.000134856 0.4 0.000175926 0.3 0.000321815 0.3 0.000300974
21 NCAM2 0.5 0.000151406 0.5 0.000110337 0.4 0.0000221 0.3 6.74279E-05
2 NDUFA10 0.5 0.0000282 0.3 0.0000558 0.3 0.000226803 0.3 0.000253774
10 OLAH 0.3 0.000451154 0.3 0.000200445 0.3 0.0000797 0.5 4.41346E-05
6 PARK2 0.5 0.000416214 0.5 0.00036595 0.5 0.000272164 0.5 0.000375144
10 PCDH15 0.3 0.0000184 0.4 0.0000196 0.5 0.000129952 0.3 0.000128113
10 PDSS1 0.3 0.000353077 0.4 0.000221286 0.4 0.000465253 0.5 0.000452993
6 PHACTR2 0.5 0.000231707 0.5 0.000369015 0.3 0.000451767 0.4 0.000219447
20 PLCB4 0.5 0.000153858 0.5 0.000408858 0.5 0.000183894 0.5 0.000188798
21 PRDM15 0.3 0.000149567 0.5 0.000145276 0.3 0.0000429 0.4 7.66226E-05
8 PREX2 0.4 0.000124435 0.5 0.000196154 0.5 0.000308329 0.5 0.000190637
20 PROKR2 0.5 0.00000184 0.4 0.00000306 0.5 0.000000613 0.5 1.22596E-06

6 RP11-257K9.8 0.5 0.0000172 0.5 0.0000319 0.5 0.000152019 0.5 8.58173E-05
2 SH3RF3 0.3 0.000102368 0.5 0.000076 0.3 0.000270938 0.3 0.000134856
20 SIRPA 0.5 0.000207188 0.5 0.000202897 0.3 0.00046893 0.3 0.000419892
8 SLA 0.3 0.000304652 0.3 0.000122596 0.3 0.000258065 0.3 8.03005E-05
11 SNX19 0.5 0.000120757 0.5 0.000125661 0.5 0.000164892 0.5 0.000166731
2 SPAG16 0.3 0.000342043 0.3 0.000318137 0.5 0.000189411 0.3 0.000199832
3 SUCLG2 0.3 0.000270325 0.4 0.000270938 0.5 0.000122596 0.5 0.000326106
5 SV2C 0.5 0.000125048 0.5 0.0000932 0.4 0.0000276 0.3 1.47115E-05
8 TG 0.3 0.000304652 0.3 0.000122596 0.3 0.000258065 0.3 8.03005E-05
21 TMPRSS2 0.4 0.000100529 0.5 0.000136695 0.5 0.0000484 0.5 2.51322E-05
12 TMTC2 0.5 0.000132404 0.5 0.0000423 0.3 0.000374531 0.3 0.000255
4 UGT2B4 0.3 0.000212091 0.5 0.000314459 0.3 0.000131178 0.3 0.000142212
3 WNT7A 0.3 0.00022619 0.3 0.000118918 0.5 0.000241515 0.5 0.000165505
6 ZC3H12D 0.5 0.000185733 0.5 0.000275841 0.5 0.0000552 0.3 9.8077E-05
3 ZNF385D 0.3 0.000438281 0.3 0.000299748 0.5 0.000304039 0.5 0.00026726
19 ZNF83 0.5 0.0000766 0.5 0.000174087 0.5 0.000135469 0.5 0.000117692
6 CDSN 0.3 0.000153858 0.3 0.000184507 0.4 0.000143438 0.4 0.00011524
12 CHST11 0.5 0.000250096 0.3 0.000182055 0.3 0.000292392 0.3 0.000245805
1 FMN2 0.3 0.00020351 0.3 0.000103594 0.3 0.000228029 0.3 2.69712E-05
6 PSORS1C1 0.3 0.000153858 0.3 0.000184507 0.4 0.000143438 0.4 0.00011524
10 CTNNA3 0.5 0.000488546 0.3 0.000342043 0.4 0.0000405 0.3 4.96515E-05
12 FAM101A 0.4 0.000371466 0.3 0.000301587 0.5 0.000249483 0.5 0.000126887
1 RIMKLA 0.3 0.000084 0.3 0.000321815 0.3 0.000114014 0.3 0.000163666
3 SPATA16 0.5 0.000394147 0.3 0.000436442 0.3 0.0000778 0.3 0.000181442
2 THSD7B 0.3 0.000376983 0.3 0.000467704 0.5 0.000251935 0.5 0.000281358

S8 Table. Significant genes. Significant genes pass the significance criteria in at least two populations from the same continent. See main text and Supplementary Text 2.

ABAT CCDC38 ETFB KCNH5 PRR5L SORCS2 WDR27
ABCA12 CCDC50 ETV1 KCNIP1 PRRC1 SORCS3 WDR64
ABCA13 CCDC57 EVC2 KCNIP4 PRSS38 SP100 WDR72
ABCA4 CCDC85C EXO1 KCNJ6 PRSS45 SP110 WDR75
ABCB11 CCDC91 EXOC2 KCNK2 PRSS50 SP140L WDR93
ABCC2 CCHCR1 EXOC3L2 KCNMA1 PSD3 SPACA3 WDYHV1
ABCC4 CCNG2 EXOC4 KCNMB2 PSMC1 SPAG16 WFDC3
ABCC8 CCSER1 EXOC7 KCNQ1 PSMG4 SPARCL1 WFDC8

ABCD4 CD70 EXTL3 KCNQ3 PSORS1C1 SPATA13 WFIKKN2
ABCG1 CD84 EYA4 KCNQ5 PSTPIP2 SPATA16 WNT7A
ABI1 CD96 EYS KCNS3 PTCHD3 SPATA22 WNT9B
ABTB2 CDA F13A1 KCTD8 PTCHD4 SPATA3 WSCD1
AC004466.1 CDC42BPA F5 KDM4C PTGFRN SPATC1L WSCD2
AC004824.2 CDH12 FABP2 KHNYN PTH2R SPEF2 WWC1
AC023469.1 CDH13 FAHD1 KIAA0196 PTP4A3 SPHKAP WWOX
AC073528.1 CDH18 FAM101A KIAA0319 PTPRB SPINK2 WWTR1
AC087645.1 CDH22 FAM114A1 KIAA1024 PTPRD SPINK5 XKR3
AC091801.1 CDH23 FAM129B KIAA1199 PTPRK SPINT4 XRCC4
AC092687.4 CDH4 FAM135B KIAA1211 PTPRM SPNS3 XXYLT1
AC121757.1 CDH5 FAM13A KIAA1217 PTPRN2 SPRR1B YIPF1
ACBD5 CDH7 FAM154A KIAA1324 PTPRT SPRR3 YIPF7
ACPP CDH9 FAM155A KIAA1324L PTS SPTLC2 ZBTB16
ACTA2 CDKAL1 FAM173B KIF13A PUS7 SPTLC3 ZC3H12C
ACTL8 CDSN FAM179A KIF16B PXDN SQRDL ZC3H12D
ADAM12 CDYL2 FAM184B KIF26A PXDNL SRD5A1 ZDHHC14
ADAM29 CEACAM7 FAM189A1 KIF26B PYROXD1 SRL ZFAT
ADAMTS1 CELA3B FAM19A2 KIRREL3 PYROXD2 SSR1 ZFP57
ADAMTS12 CELF5 FAM19A5 KL RAB36 ST6GAL1 ZFYVE28
ADAMTS16 CELSR1 FAM43A KLF12 RAB3C ST6GALNAC3 ZNF114
ADAMTS17 CEP112 FAM47E KLHDC8A RAB8A ST8SIA1 ZNF155
ADAMTS18 CEP128 FAM47E-STBD1 KLHL1 RABEP1 ST8SIA6 ZNF254
ADAMTS2 CERS6 FAM65B KLHL14 RAMP3 STAP1 ZNF28
ADAMTS3 CFHR1 FAM81B KLHL23 RASGEF1B STAU2 ZNF280A
ADAMTS5 CFHR2 FANK1 KLHL24 RASSF2 STIM1 ZNF283
ADAMTSL1 CHD5 FARS2 KLHL5 RASSF6 STK32A ZNF331
ADAMTSL2 CHKB FAS KLHL7 RBFOX1 STK32B ZNF345
ADAP1 CHL1 FAT2 KLK13 RBFOX3 STK32C ZNF354A
ADARB2 CHN2 FBXL17 KLRB1 RBM11 STK39 ZNF354B
ADAT2 CHRM2 FCER2 KNG1 RBMS3 STON1-GTF2A1L ZNF365
ADCY2 CHST11 FCGBP KRT14 RBP4 STON2 ZNF366
ADCY3 CLCA2 FER1L6 KRT40 RCAN1 STPG1 ZNF385D
ADCY5 CLCNKB FFAR4 KRT6A RCAN2 STPG2 ZNF391
ADD2 CLDN10 FGF12 KRT8 RCBTB1 STT3A ZNF423
ADH4 CLDN11 FGF14 KRT83 RECK STX2 ZNF44
ADHFE1 CLDN16 FHAD1 KRT84 REG4 STX8 ZNF441

ADRA1A CLEC1A FHIT KRTAP10-7 RELN STXBP5L ZNF443
ADRA1D CLEC1B FIP1L1 KRTAP12-2 RERGL STXBP6 ZNF468
AEBP2 CLEC3A FKRP KRTAP3-2 RFTN1 SUCLG2 ZNF568
AGAP1 CLEC4C FMN1 KRTAP5-5 RFX2 SULT2B1 ZNF577
AGBL1 CLEC6A FMN2 KSR2 RFX8 SUMF1 ZNF670
AGMAT CLIC5 FMO2 KY RGL1 SUN3 ZNF677
AGPAT9 CLMP FNDC1 L3MBTL2 RGS1 SV2C ZNF695
AHRR CLOCK FNDC3B L3MBTL4 RGS6 SVEP1 ZNF697
AIM1 CLSTN2 FOXK2 LAMA1 RGSL1 SVIL ZNF738
AIPL1 CMBL FRAS1 LAMA2 RHBDL3 SWAP70 ZNF74
AKNAD1 CMC2 FRMD4A LAMA4 RIMBP2 SYCE3 ZNF773
AL160286.1 CMKLR1 FRMD4B LAMB4 RIMKLA SYCP2L ZNF804B
AL355531.2 CNBD1 FSIP1 LAMC1 RIMS1 SYK ZNF83
AL590867.1 CNN2 FTO LAMC2 RIN2 SYN3 ZNF85
ALDH18A1 CNR2 FUT9 LAMC3 RNASE11 SYNE1 ZNF879
ALDH1L1 CNTLN FXN LAPTM4B RNF144B SYNE3 ZNF98
ALDH4A1 CNTN4 FYB LBH RNF150 SYNJ2 ZPLD1
ALG8 CNTN5 GABBR2 LDB3 RNF175 SYNPR ZSCAN18
ALK CNTNAP2 GABRG3 LDLRAD3 RNF19A SYT16 ZSWIM2
ALPL CNTNAP4 GABRR1 LDLRAD4 RNF212 SYT9 ZZEF1
AMTN CNTNAP5 GADL1 LGALS8 RNF39 TAB2
ANK1 COBLL1 GALC LGI2 RNPEP TACC3
ANK2 COG6 GALNT10 LGR5 ROBO2 TANC1
ANK3 COL21A1 GALNT13 LHFPL2 ROR1 TAP2
ANKH COL24A1 GALNT14 LHPP ROR2 TAS2R14
ANKRD24 COL25A1 GALNT18 LIMCH1 RORA TAS2R20
ANKS1B COL28A1 GALNT8 LINC00908 RP1-139D8.6 TAS2R42
ANO2 COL4A2 GALNT9 LINC00923 RP11-156P1.2 TBC1D22A
ANO3 COL4A3 GALNTL6 LINGO2 RP11-192H23.4 TBC1D7
ANXA4 COL4A4 GANC LIPC RP11-210M15.2 TBX20
AOAH COMMD10 GAS2 LITAF RP11-215A19.2 TCP11L1
AOC1 CORIN GATM LMX1B RP11-257K9.8 TCTN2

AP3B1 COX19 GCNT3 LNX1 RP11-295P9.3 TDRD10
APBB1IP CPA3 GFRA2 LOXL2 RP11-297M9.1 TEAD2
APBB2 CPA5 GIPC2 LPHN2 RP11-302M6.4 TEC
APIP CPB2 GLDN LPHN3 RP11-307N16.6 TEK
APPBP2 CPE GLIPR1L2 LPIN1 RP11-321F6.1 TEKT1
ARHGAP22 CPLX1 GLIS1 LPIN2 RP11-383H13.1 TENM2
ARHGAP24 CPNE4 GMNC LPPR1 RP11-389E17.1 TENM3
ARHGAP28 CPNE8 GNA15 LRCH1 RP11-433C9.2 TENM4
ARHGAP44 CPXM2 GNG2 LRP1B RP11-45H22.3 TES
ARHGAP8 CREB5 GNLY LRRC16A RP11-463C8.4 TESC
ARHGEF10L CRELD2 GOLM1 LRRC7 RP11-697E2.6 TESPA1
ARHGEF18 CRHR1 GOPC LRRFIP1 RP11-96O20.4 TEX2
ARHGEF37 CRTAC1 GOSR2 LRRK2 RP13-279N23.2 TFB2M
ARL14EPL CRTC3 GPC5 LRRTM4 RP5-1052I5.2 TG
ARL15 CRX GPC6 LSAMP RP5-966M1.6 TGM6
ARSB CRYL1 GPD1L LTBP1 RPA3-AS1 THBS2
ARSJ CSGALNACT1 GPLD1 LUZP2 RPAIN THBS4
ART1 CSMD1 GPR111 LYAR RPGRIP1 THSD4
ART3 CSMD2 GPR114 LYPD6B RPS6KA2 THSD7A
ASAH2 CSMD3 GPR115 MACC1 RPSA THSD7B
ASAP1 CSN3 GPR133 MACROD2 RPTOR TIAM1
ASB18 CSRP1 GPR137B MAGI1 RRM1 TIAM2
ASGR2 CTB-129P6.11 GPR158 MAGI2 RRP12 TIFA

ASIC2 CTD-2207O23.3 GPR78 MAMDC2 RUNX1 TJP2
ASPA CTD-2260A17.2 GPRIN3 MAML3 RUNX2 TLDC1
ASTN2 CTD-2287O16.3 GRAMD3 MANBA RXFP1 TLK1
ATF7IP2 CTD-2616J11.11 GRAMD4 MAP3K13 RYR1 TLR10
ATP10A CTD-3088G3.8 GRB10 MAPT RYR2 TMCC3
ATP10D CTD-3105H18.16 GREB1 MARCH1 RYR3 TMCO3
ATP2C2 CTD-3105H18.18 GRHL2 MARCH4 SAMD12 TMED3
ATP6V0A4 CTIF GRID1 MARCH7 SAMD3 TMEM104
ATP6V0E2 CTNNA2 GRID2 MARK4 SAMD5 TMEM106B
ATP6V1E1 CTNNA3 GRIK1 MAST4 SBF2 TMEM117
ATP8A1 CTNND2 GRIK2 MATN1 SCARB2 TMEM128
ATP8A2 CUBN GRIK4 MB21D2 SCD5 TMEM129
ATP9A CUX1 GRIN2A MCF2L SCLY TMEM132B
ATRNL1 CWF19L2 GRIN3A MCF2L2 SCML4 TMEM132C
ATXN3 CXCL11 GRIN3B MCM9 SCN1A TMEM132D
AVEN CYB5A GRIP1 MDGA2 SCN3A TMEM135
AXDND1 CYB5R2 GRM4 MECOM SCNN1G TMEM156
B3GNTL1 CYBRD1 GRM7 MEGF11 SCP2 TMEM179
B4GALNT2 CYP24A1 GRM8 MEIOB SCUBE1 TMEM220
BAI3 CYP4F12 GSTO1 MEOX2 SDC2 TMEM229B
BARD1 CYP4F3 GTF2H4 MFAP3 SDK2 TMEM232
BBS9 DAAM1 GTF2IRD1 MFSD6L SDR39U1 TMEM244
BCAR3 DAAM2 GTF3C6 MGAT5 SEMA3A TMEM259
BCAS1 DAB1 GUCA1A MGAT5B SEMA3E TMEM44
BCAS3 DAB2 HAAO MGMT SEMA6D TMEM51
BCKDHB DAD1 HABP2 MGST2 SEPT11 TMEM63C
BCL2L14 DAPK1 HAGH MGST3 SEPT9 TMEM71
BCR DCBLD1 HBE1 MICAL3 SERINC5 TMEM88B
BDH1 DCC HBG2 MICALCL SERPINA5 TMPRSS11E
BEST3 DCDC2C HDAC4 MICB SERPINB5 TMPRSS2
BFSP2 DCHS2 HDAC7 MIS12 SFTPD TMTC1
BICC1 DCTD HEATR1 MITF SGCG TMTC2

BICD1 DEFB128 HECW1 MLF1IP SGCZ TMTC4
BIN2 DEPDC7 HECW2 MLK4 SH2D4B TNFAIP8
BIRC5 DEPTOR HEG1 MLPH SH3RF2 TNFSF10
BLNK DGKH HHAT MMP2 SH3RF3 TNIK
BLOC1S5 DHRS4 HHLA1 MMP20 SHISA6 TNN
BLOC1S5-TXNDC5 DHX37 HHLA2 MMP26 SHROOM3 TNS1
BMPR1B DIEXF HIP1 MOB3B SIRPA TNS3
BNC2 DIP2A HIST1H2AA MOBP SIRT3 TONSL
BNIP2 DIRC3 HIST1H2BA MORF4L1 SKA1 TOP1MT
BRE DKKL1 HIVEP3 MOV10L1 SKAP2 TPCN2
BSPRY DLC1 HJURP MPHOSPH6 SLA TPD52
BTNL2 DLG2 HLA-B MPND SLC12A3 TPK1
C10orf112 DLGAP1 HLA-C MPP7 SLC15A2 TPO
C12orf36 DMBT1 HLA-DPA1 MROH2A SLC16A7 TPRX2P
C12orf54 DMGDH HLA-DPB1 MROH2B SLC17A5 TRAT1
C12orf55 DMRT1 HLA-DQA1 MRPS22 SLC1A2 TRDN
C13orf45 DNAAF1 HLA-DQA2 MRS2 SLC1A6 TRIB3
C15orf41 DNAH11 HLA-DQB1 MS4A12 SLC22A16 TRIM22
C15orf48 DNAH8 HLA-DQB2 MSH3 SLC22A9 TRIM5
C16orf95 DNAJC16 HLA-DRA MSMO1 SLC24A4 TRIM9
C1orf101 DNER HLA-DRB1 MSR1 SLC25A21 TRPA1
C1orf177 DNHD1 HLA-DRB5 MSRA SLC25A24 TRPC6
C1orf198 DNM1L HLA-F MTHFD1L SLC25A37 TRPM3
C1orf222 DOCK1 HLA-G MTSS1 SLC26A3 TRPM5
C1orf68 DOCK5 HLCS MTUS1 SLC27A6 TRPS1
C1orf94 DOK6 HMCN1 MTUS2 SLC2A8 TSHR
C20orf166 DOK7 HMGCLL1 MUC16 SLC2A9 TSHZ2
C20orf196 DPF3 HNF4A MUC22 SLC30A10 TSNARE1
C22orf34 DPP10 HPCAL4 MUC4 SLC35F2 TSNAX-DISC1
C2orf54 DPP6 HPSE MYLK4 SLC35F3 TSPAN15
C2orf83 DPY19L1 HPSE2 MYO15B SLC35F4 TSPAN18
C4orf19 DPY19L4 HS3ST4 MYO16 SLC37A1 TSPAN5
C4orf50 DRAM1 HS6ST3 MYO1B SLC38A8 TSPAN8
C5orf17 DSC1 HSPA12A MYO1D SLC38A9 TSPAN9
C6orf10 DSCAM HSPB3 MYO1H SLC39A11 TSPEAR
C6orf15 DSG1 HTRA4 MYO3A SLC39A14 TSSC1
C6orf165 DTNA HUS1 MYO3B SLC48A1 TTC37

C6ORF165 DYNC2H1 IDO2 MYO5B SLC5A12 TTC6
C6orf58 DYTN IGFBP7 MYO7A SLC6A1 TTC9
C7orf31 EDARADD IGSF5 MYOF SLC6A5 TTLL11
C8orf34 EEF1E1-BLOC1S5 IL18R1 MYOM1 SLC7A11 TTLL13
C8orf46 EFCAB11 IL1RL1 MYOM2 SLC7A5 TULP3
C9orf91 EGFR IL36RN MYOZ3 SLC8A3 TXN2
CABP5 EGLN1 IL37 MYPN SLC9A4 TXNDC5
CACNA1A EGLN3 IL7R MYRFL SLC9C1 TYW1
CACNA1C EIF2B5 IMPA2 MYRIP SLCO2A1 UBASH3B
CACNA2D3 EIF4E2 INPP1 MYSM1 SLCO2B1 UBE2F-SCLY
CACNG2 ELAC2 INPP4B NAAA SLCO6A1 UGT2A1
CADM2 ELFN2 INPP5D NAALADL2 SMC6 UGT2A2
CADPS ELSPBP1 IP6K3 NADSYN1 SMCO2 UGT2B4
CALD1 EMCN IQGAP2 NANOG SMG7 ULK4
CALN1 EMILIN2 IQSEC1 NAT1 SMIM12 UNC13C
CAMK1D EMR1 IRF1 NAT2 SMOC2 UNC5B
CAND2 ENPP1 ITGA1 NAV2 SMOX UPP2
CAPN14 ENTPD4 ITGA2 NAV3 SMR3B USH2A
CAPN9 EPHA10 ITGAE NBAS SMYD3 USP20
CASQ2 EPHA6 ITIH4 NBEA SNTB1 UTS2B
CCBE1 EPHA7 ITPR3 NCALD SNTG1 VARS2
CCDC102B EPHB1 IZUMO1 NCAM2 SNTG2 VAV2
CCDC113 EPRS JAKMIP3 NCF2 SNX18 VAV3
CCDC129 ERAP1 KALRN NCF4 SNX19 VEGFC
CCDC146 ERAP2 KANK1 NCK2 SNX29 VNN1
CCDC149 ERBB4 KANK4 NCKAP5 SNX31 VPS8
CCDC158 ERC1 KANSL1 NCMAP SNX7 VRK3
CCDC169 ERG KAT2B NDFIP1 SOAT1 VSTM5
CCDC169-SOHLH2 ERICH1 KCNA6 NDST4 SOHLH2 VWA5B1
CCDC171 ERO1LB KCNAB1 NDUFA10 SORBS1 VWF
ESRRG ESPNL KCNB2 NDUFAF6 SORBS2 WBSCR17
ESYT2 ESR1 KCNC2 NEBL SORCS1 WDR17

Supplementary Figures

S1 Fig. NCD analytical properties as a function of number of SNPs. x-axis, number of SNPs in the window for which NCD2(0.5) is being calculated; y-axis, NCD2(0.5) values. Each color corresponds to one (non-variable) number of FDs per window (20, 40, 100). Top, x-axis reaches 3,000 SNPs. After ∼1,500 SNPs (for any number of FDs), NCD2(0.5) stabilizes and asymptotically approaches 0. Bottom, a zoom-in of the upper plot, with x-axis reaching only 100 SNPs. In this representation, all SNPs have a frequency of 0.5.

S2 Fig. Analytical properties of NCD2 as a function of number of FDs. x-axis, number of FDs in the window for which NCD2(0.5) is being calculated; y-axis, NCD2(0.5) values. Each color corresponds to the frequency of the 20 SNPs in the window (0.5, 0.4, 0.3). Top, x-axis reaches 3,000 FDs. After ∼500 FDs, NCD2(0.5) stabilizes and asymptotically approaches 0.5. Bottom, a zoom-in of the upper plot, with x-axis reaching only 100 FDs. In this representation, all 20 SNPs have the same frequency (0.5 in blue, 0.4 in red, 0.3 in gray). Note that the minimum NCD2(0.5) value is different for the different colors, since they represent different SNP frequencies.
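
The limiting behaviour described in S1 Fig and S2 Fig can be checked numerically. The sketch below assumes the usual definition of the statistic, which this section does not restate: NCD2 with target frequency tf is taken as the square root of the mean squared difference between each informative site's minor allele frequency and tf, with FDs counted as sites of frequency 0. The function name `ncd2` and its arguments are illustrative only.

```python
import math

def ncd2(snp_mafs, n_fds, tf=0.5):
    """NCD2 for one window (assumed definition, see lead-in).

    snp_mafs: minor allele frequencies of the SNPs in the window.
    n_fds:    number of fixed differences (FDs), each entering with frequency 0.
    tf:       target frequency.
    """
    n_sites = len(snp_mafs) + n_fds
    if n_sites == 0:
        raise ValueError("window has no informative sites")
    squared = sum((p - tf) ** 2 for p in snp_mafs) + n_fds * (0.0 - tf) ** 2
    return math.sqrt(squared / n_sites)

# S1 Fig setting: 20 FDs, all SNPs at frequency 0.5, growing number of SNPs.
for n_snps in (20, 100, 1500, 3000):
    print("SNPs:", n_snps, "NCD2(0.5):", round(ncd2([0.5] * n_snps, n_fds=20), 4))

# S2 Fig setting: 20 SNPs at frequency 0.5, growing number of FDs.
for n_fds in (20, 100, 500, 3000):
    print("FDs:", n_fds, "NCD2(0.5):", round(ncd2([0.5] * 20, n_fds=n_fds), 4))
```

Under this definition, when all SNPs sit at 0.5 only the FDs contribute to the sum, so adding SNPs pushes the value toward 0 while adding FDs pushes it toward 0.5.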

S3 Fig. Effect of sequence length on NCD2(0.5) power (Africa). ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4 (orange), 0.3 (pink), or 0.2 (green), based on simulations under the African demographic scenario and Tbs = 5 myr. FPR, false positive rate (100-Specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges from 0 to 0.05, while the y-axis ranges from 0 to 1.

S4 Fig. Effect of sequence length on NCD2(0.5) power (Africa). ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4 (orange), 0.3 (pink), or 0.2 (green), based on simulations under the African demographic scenario and Tbs = 3 myr. FPR, false positive rate (100-Specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges from 0 to 0.05, while the y-axis ranges from 0 to 1.

S5 Fig. Effect of sequence length on NCD2(0.5) power (Europe). ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4 (orange), 0.3 (pink), or 0.2 (green), based on simulations under the European demographic scenario and Tbs = 5 myr. FPR, false positive rate (100-Specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges from 0 to 0.05, while the y-axis ranges from 0 to 1.

S6 Fig. Effect of sequence length on NCD2(0.5) power (Europe). ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4 (orange), 0.3 (pink), or 0.2 (green), based on simulations under the European demographic scenario and Tbs = 3 myr. FPR, false positive rate (100-Specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges from 0 to 0.05, while the y-axis ranges from 0 to 1.

S7 Fig. Effect of sequence length on NCD2(0.5) power (Asia). ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4 (orange), 0.3 (pink), or 0.2 (green), based on simulations under the Asian demographic scenario and Tbs = 5 myr. FPR, false positive rate (100-Specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges from 0 to 0.05, while the y-axis ranges from 0 to 1.

S8 Fig. Effect of sequence length on NCD2(0.5) power (Asia). ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4 (orange), 0.3 (pink), or 0.2 (green), based on simulations under the Asian demographic scenario and Tbs = 3 myr. FPR, false positive rate (100-Specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges from 0 to 0.05, while the y-axis ranges from 0 to 1.
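
The ROC curves in S3 to S8 Fig plot TPR against FPR over the range FPR = 0 to 0.05. A minimal sketch of how such points could be obtained from two sets of simulated NCD2 values is given below. It assumes that low NCD2 values are taken as evidence of balancing selection and that thresholds are drawn from the neutral distribution; the function name `roc_points` and the toy scores are hypothetical, not the simulation output used for the figures.

```python
def roc_points(neutral, selected, max_fpr=0.05):
    """Return (FPR, TPR) pairs for thresholds taken from the neutral scores.

    A window is called positive when its score is <= the threshold
    (assumption: lower NCD2 means stronger evidence of selection).
    """
    points = []
    for threshold in sorted(set(neutral)):
        fpr = sum(x <= threshold for x in neutral) / len(neutral)
        if fpr > max_fpr:
            break
        tpr = sum(x <= threshold for x in selected) / len(selected)
        points.append((fpr, tpr))
    return points

# Toy scores only: 100 evenly spaced "neutral" values and 4 "selected" values.
neutral_scores = [i / 100 for i in range(1, 101)]
selected_scores = [0.005, 0.015, 0.5, 0.9]
for fpr, tpr in roc_points(neutral_scores, selected_scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Restricting the curve to small FPR values, as the figures do, focuses the comparison on the thresholds that would realistically be used in a genome scan.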

Fig S9. Correlations for NCD2(tf) calculated with different tf values. In each plot, NCD2 values calculated with two different target frequencies are plotted against each other. NCD2 was calculated for 1,000 neutral simulations following demographic parameters for the African continent. L = 3 Kb.

Fig S10. Power comparison between NCD2(0.5) and other tests (Europe). ROC curves for power to detect LTBS for simulations where the balanced polymorphism was modeled to achieve frequency equilibrium (feq) of (left) 0.3, (center) 0.4, and (right) 0.5. Plotted values are for European demography (Tbs = 5 myr, L = 3 kb). Target frequency for NCD1 and NCD2 matches the simulated feq.

Fig S11. Relationship between NCD2(tf) and the number of informative sites. NCD2(tf) was calculated for neutral simulations (10,000 for each bin of IS) for the African demographic scenario, and the 0.01 quantile value for each bin is plotted. Blue (tf = 0.5), orange (tf = 0.4), pink (tf = 0.3), green (tf = 0.2).
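
The per-bin 0.01 quantile plotted in Fig S11 can be illustrated with a short sketch. It assumes that neutral simulations are available as (number of informative sites, NCD2 value) pairs and uses a simple order-statistic quantile; the helper names and the toy values are hypothetical and do not reproduce the simulations behind the figure.

```python
import random
from collections import defaultdict

def lower_quantile(values, q=0.01):
    """Empirical q-quantile taken as the floor(q * n)-th order statistic."""
    ordered = sorted(values)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]

def critical_values_by_is(simulated, q=0.01):
    """Group simulated (n_informative_sites, ncd2_value) pairs by IS and
    return {IS: q-quantile of NCD2 among simulations with that IS}."""
    by_bin = defaultdict(list)
    for n_is, value in simulated:
        by_bin[n_is].append(value)
    return {n_is: lower_quantile(vals, q) for n_is, vals in sorted(by_bin.items())}

# Toy input: uniform random values standing in for neutral NCD2 values.
rng = random.Random(0)
toy = [(n_is, rng.uniform(0.3, 0.5)) for n_is in (10, 20, 50) for _ in range(1000)]
for n_is, cutoff in critical_values_by_is(toy).items():
    print(n_is, round(cutoff, 4))
```

Computing the cutoff separately for each IS bin keeps windows with few informative sites from being compared against a threshold derived from much denser windows.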

Fig S12. Proportion of windows per chromosome. Sets of significant and outlier windows are derived from the union of three target frequencies (0.3, 0.4, 0.5). Grey, all genomic windows; green, significant windows; blue, outlier windows.

Fig S13. Proportion of positions in the genome retained after each filter. Proportion of the hg19 human reference genome (total base-pairs = 2,684,573,005) retained for each individual filtering criterion described in the Methods, and for all filters applied jointly. Proportion of sequences retained: Map50=0.843; TRF=0.976; SD=0.961; pantro2=0.961; all=0.819. Map50: mappability 50-mer (see Methods); TRF: tandem repeats; SD: segmental duplications; pantro2: reference chimp genome.

Fig S14. Distribution of proportion of high coverage (pHC) positions per bin of empirical NCD2 p-value. pHC in percentage (y-axis) binned by the NCD2 Ztf empirical p-values represented in –log10 scale on the x-axis. pHC is the proportion of the sequence of a given window in this study having high coverage values in at least two samples of modern human shotgun data (see Methods).

Fig S15. Proportion of sequences pertaining to each functional category. y-axis, the proportion of sequences over the total that belong to each category. x-axis, sets of significant windows for YRI, LWK, GBR and TSI (see Methods). all, all queried windows. darkblue=exon, lightblue=intron, lightgreen=3’UTR, darkgreen=5’UTR.

Fig S16. Number of paralog genes per gene on the same chromosome. For each protein-coding gene from human autosomes (19,349), the number of paralogs present in the same chromosome is plotted (left, gray). All autosomal genes come from Ensembl for hg19, regardless of whether they were queried for NCD2. Significant genes (middle, blue) come from the union of significant genes for YRI considering all tf values (see Table 2 in main paper). Significant genes without olfactory receptor genes (21 ORs in total) are shown on the right (green). y-axis, relative frequency of the genes that contain a given number of paralogs on the same chromosome. Note that the distributions are very similar for the background and significant genes.

Fig S17. Proportion of Neanderthal SNPs in the candidate windows. Left, significant windows; right, outlier windows. In gray, distribution obtained from 1,000 samplings from the background. In orange, % of Neanderthal SNPs within all significant (or outlier) windows. TSI, Toscani; GBR, Great Britain.
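
The background distribution in Fig S17 comes from repeated sampling of background windows. The sketch below shows one way such a resampling could be set up, under assumptions not stated in the caption: each window is represented by a (Neanderthal SNPs, total SNPs) pair, each sampling draws as many background windows as there are candidate windows, and windows are drawn without replacement. All names and toy numbers are hypothetical.

```python
import random

def percent_neanderthal(windows):
    """windows: list of (n_neanderthal_snps, n_total_snps), one pair per window."""
    total = sum(t for _, t in windows)
    return 100.0 * sum(n for n, _ in windows) / total if total else 0.0

def background_distribution(background, n_windows, n_samplings=1000, seed=0):
    """Percent of Neanderthal SNPs in n_samplings random sets of n_windows
    background windows (sampling without replacement within each set)."""
    rng = random.Random(seed)
    return [percent_neanderthal(rng.sample(background, n_windows))
            for _ in range(n_samplings)]

# Toy data: 500 background windows and 20 candidate windows.
rng = random.Random(1)
background = [(rng.randint(0, 3), rng.randint(10, 40)) for _ in range(500)]
candidates = [(rng.randint(0, 3), rng.randint(10, 40)) for _ in range(20)]
dist = background_distribution(background, n_windows=len(candidates))
observed = percent_neanderthal(candidates)
print("observed %:", round(observed, 2))
print("background range %:", round(min(dist), 2), "to", round(max(dist), 2))
```

Comparing the observed percentage with the spread of the sampled values gives a direct visual check of whether candidate windows differ from the background, which is what the gray and orange elements of the figure convey.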

Fig S18. NCD2(0.5) empirical values for each bin of informative sites, IS. A) NCD2(0.5) for windows with IS between 1 and 100 for YRI (>99% of all windows); B) NCD2(0.5) for all windows with IS between 1 and 100 for GBR (>99% of all windows). In blue, median value for all the windows within a given bin. Note that the medians stabilize around 20 IS for YRI and around 15 IS for GBR.

Fig S19. Number of paralog genes per gene on the same chromosome. For each OR gene contained within any of the significant sets of windows (all populations and tf values, 53 ORs in total), the number of paralogs present in the same chromosome is plotted. y-axis, relative frequency of the genes that contain a given number of paralogs on the same chromosome. Compare with distributions in S16 Fig.


Fig S20. Venn diagrams of candidate windows for four populations. A, left, significant windows; B, right, outlier windows; YRI, Yoruba; LWK, Luhya; GBR, Great Britain; TSI, Toscani. The set of significant windows for each population comes from the union of significant and outlier windows for tf=0.3, 0.4, 0.5 (see Results and Methods). African populations are shown in tones of purple, and European in tones of green.

Fig S21. Venn diagrams of significant windows for four populations, for each tf value. A, left, significant windows; B, right, outlier windows; YRI, Yoruba; LWK, Luhya; GBR, Great Britain; TSI, Toscani. The set of significant windows for each population comes from those detected with each tf value. African populations are shown in tones of purple, and European in tones of green.

Fig S22. Venn diagrams of outlier windows for four populations, for each tf value. A, left, significant windows; B, right, outlier windows; YRI, Yoruba; LWK, Luhya; GBR, Great Britain; TSI, Toscani. The set of significant windows for each population comes from those detected with each tf value. African populations are shown in tones of purple, and European in tones of green.

Chapter 2

Accumulation of deleterious mutations in genes that were targets of long-term balancing selection in humans

Preliminary Remarks

Over the last decade, and particularly in the last 5 years, several studies have documented the existence of a relatively large fraction of deleterious mutations in human populations. Some studies found a difference between the genetic load of African and non-African populations (Hodgkinson et al., 2013; Lohmueller et al., 2008; Lohmueller, 2014; Henn et al., 2016), whereas others have contested these findings (Do et al., 2015; Simons et al., 2014). In parallel, a series of studies evaluated the influence that “selective sweeps” have on nearby neutral variants (reducing diversity) and, less frequently, on non-neutral variants (limiting the efficacy of selection at linked sites) (Betancourt and Presgraves, 2002; Chun and Fay, 2011). To date, no study has sought to evaluate the impact that selection for the maintenance of a balanced polymorphism has on linked deleterious variants (except for HLA, Lenz et al., 2016). We therefore set out to test the hypothesis that balancing selection on a site increases the abundance of linked deleterious alleles which, in the absence of balancing selection, could have been eliminated by purifying selection.

To address this question, we made use of the genes with signatures of balancing selection identified in Chapter 1¹. In this work I had the collaboration of Débora Y.C. Brandt (PhD student, University of California, Berkeley) and Jônatas E. César (postdoctoral researcher, Universidade de São Paulo, IB), as well as Diogo Meyer, who supervised the work. D.Y.C.B. organized the 1000 Genomes Project data for our analyses, calculated allele frequencies per population and contributed functional annotations for the SNPs. J.E.C. developed efficient scripts for the re-sampling approaches described in the manuscript, and also pre-processed the data for our analyses. I took part in all of the steps described, planned the analyses to be carried out (together with D.M. and J.E.C.) and wrote the manuscript together with D.M., with the collaboration and approval of the other co-authors. All authors contributed to the discussion of the results. We intend to submit it to the journal Genetics.

¹ In this Chapter we refer to the (unpublished) manuscript of Chapter 1 as Bitarello et al. (n.d.).


Balancing selection drives the accumulation of linked deleterious variation in humans

Bárbara Domingues Bitarello1, Jônatas Eduardo César1, Débora Yoshihara Caldeira Brandt2, Diogo Meyer1

1, Departamento de Genética e Biologia Evolutiva, Universidade de São Paulo, São Paulo, Brazil 2, University of California, Berkeley, USA

Introduction

Understanding the dynamics and factors that interfere with the efficacy of natural selection is crucial to understanding phenotypes, complex diseases and genome structure in humans (Brandvain and Wright, 2016). Mutations per se can be neutral, advantageous, slightly deleterious or strongly deleterious, and genomic studies have shown that human populations harbour a large number of deleterious mutations (e.g. Eyre-Walker and Keightley, 1999; Kiezun et al., 2013, among several others).

Using comparative methods, Eyre-Walker and Keightley (1999) estimated that about 1.6 new deleterious mutations arise per human individual per generation. Estimates of the load carried per individual vary from three to five (Morton et al., 1956) to as many as 100 lethal equivalents (Kondrashov, 1995), a lethal equivalent being an allele or combination of alleles that would be lethal if made homozygous (Lohmueller et al., 2008). More recent estimates vary from 300 to 1,200 deleterious mutations per diploid (human) genome (Fay, 2011; Lohmueller et al., 2008; Sunyaev et al., 2001). All of these estimates mostly reflect the assumed mutation rate, but also rely on effective population size, dominance, and the assumption that populations are in equilibrium (Brandvain and Wright, 2016). Moreover, the methods used to determine what makes a mutation “deleterious” vary considerably (reviewed in Henn et al., 2015). Therefore, it is possible that many analyses of the load of deleterious mutations carry inaccuracies (reviewed in Brandvain and Wright, 2016).

Three processes have a central role in accounting for the abundance and distribution of deleterious mutations in the genome: mutation, drift, and selection (Brandvain and Wright, 2016). Firstly, the balance between influx via mutation and removal via purifying selection results in a dynamic process, where a large number of weakly selected variants can be maintained at low frequencies. Exome and genome-wide studies reporting an enrichment of recent deleterious mutations are strong evidence for this process (Casals et al., 2013; Fu et al., 2012; Tennessen et al., 2012; Kiezun et al., 2013).

Secondly, features of a population’s demographic history can influence the load of deleterious mutations that it carries. The last decade has seen an explosion of studies comparing the genetic load between human populations (Tishkoff and Williams, 2002; Lohmueller et al., 2008; Lohmueller, 2014; Simons et al., 2014; Henn et al., 2015; Henn et al., 2016). Lohmueller et al. (2008) quantified the number of deleterious mutations per diploid genome in African American (AA) and European American (EA) individuals, finding that EA individuals have lower levels of nucleotide heterozygosity for all functional categories analysed, and a higher number of homozygous genotypes for derived alleles in synonymous and nonsynonymous sites and for "possibly damaging" (Adzhubei et al., 2010) SNPs.

Although the former result is compatible with a rich body of literature documenting decreasing levels of heterozygosity with increasing distance from Africa (e.g. Tishkoff and Williams, 2002; Henn et al., 2015), the second observation is not, a priori, expected. Moreover, among the SNPs segregating in only one of the two populations, the proportion of nonsynonymous SNPs was found to be significantly higher in EA (Lohmueller et al., 2008).

This excess of deleterious mutations in European populations was interpreted as a consequence of a recent out-of-Africa bottleneck (∼50,000 years ago) followed by explosive population growth until the present (Lohmueller et al., 2008). Although the African population also experienced growth, it happened further in the past and thus this population would have had enough time to move closer to equilibrium conditions (Lohmueller et al., 2008). Later findings supported this hypothesis (Alkan et al., 2009; Subramanian, 2012; Subramanian, 2016; Hodgkinson et al., 2013; Peischl et al., 2013; Peischl and Excoffier, 2015), while others disputed it (Do et al., 2015; Simons et al., 2014).

A third factor that can account for the load in our genomes is pleiotropy, which is widespread in the human genome. For example, several studies show that disease alleles are often also positively selected, indicating that a deleterious variant has been pushed to a high frequency due to some other contribution to fitness it displays (e.g. Corona et al., 2010).

Here, we explore an additional process that can play a role in shaping the load of mutations: the effect of selection on closely linked loci. It is plausible that at least part of the mutational load in humans is due not to demographic factors, but to indirect consequences of selection in adjacent loci (Figure 1).

In this study we take up the task of understanding how balancing selection in humans has shaped the level of load in adjacent loci. It is well understood how selection, both directional and balancing, has shaped neutral variation in regions that lie close to sites under either directional or balancing selection (e.g. Charlesworth, 2006; Charlesworth, 2012; Cutter and Payseur, 2013; Nielsen, 2005, to name a few). Recombination rates and neutral genetic diversity are correlated in several organisms (reviewed in Charlesworth, 2012; Cutter and Payseur, 2013) and, when selective sweeps occur, a depletion of neutral diversity is observed around the selected site (Charlesworth, 2012; Cutter and Payseur, 2013; Nielsen, 2005). The extent of this effect is a consequence of both recombination rates and the intensity of selection (Roux et al., 2013; Schierup et al., 2000; Charlesworth et al., 1997).

Figure 1: Effect of balanced polymorphism on neighboring sites. When an advantageous variant appears in a given haplotype, but the site itself is under long-term balancing selection, the advantageous variant increases the frequency of neutral and deleterious variants in linkage. Because two or more haplotypes are maintained, the linked variants are also kept polymorphic. Adapted from Charlesworth (2006).


The effects of directional selection on the accumulation of deleterious mutations are less understood, but have also been addressed (e.g. Betancourt and Presgraves, 2002; Chun and Fay, 2011). Interestingly, studies in Drosophila have highlighted that strong purifying (background) selection hampers the effectiveness of natural selection targeting neighboring sites: near strongly selected sites, there is increased accumulation of deleterious mutations and the effectiveness of selection targeting optimal codon usage is lower (Betancourt and Presgraves, 2002). There is also evidence that directional selection limits the efficacy of purifying selection in neighboring sites in humans (Chun and Fay, 2011).

All these findings suggest that there is a complex interaction of different selective forces targeting linked sites and possibly that linkage limits the efficiency of purifying selection in purging deleterious mutations from the genome. In this context, we propose to examine the following question: does balancing selection targeting certain sites in the human genome interfere with the effectiveness of natural selection in nearby sites, as has been observed for strong directional selection?

From a theoretical point of view, a first expectation is that balancing selection would increase the rate at which deleterious mutations are purged from the genome. This expectation arises because balancing selection increases the effective population size (Ne) of the genomic region under selection (Charlesworth et al., 1997; Roux et al., 2013; Schierup et al., 2000), and increased Ne leads to an increase in the efficacy of natural selection. Balancing selection may nonetheless increase the load at linked sites. The key to understanding this apparent paradox is to consider that balancing selection often involves the occurrence of partial selective sweeps (i.e., there is an increase in frequency of the favored allele, but not to the point of fixation, followed by other such events, favoring other variants) (Connallon and Clark, 2013; Albrechtsen et al., 2010).


Such a process enhances diversity, but variation is structured among haplotypes. This is analogous to the increase in Ne for a structured population, where each deme has a small Ne, but the overall meta-population has a large Ne (Charlesworth et al., 1997; Roux et al., 2013; Schierup et al., 2000). Purifying selection acts within each haplotype class, where Ne is small and selection is therefore less effective. Consequently, balancing selection should increase the genetic load in the vicinity of the balanced polymorphism.

To our knowledge, an increase in genetic load in the vicinity of targets of balancing selection has only very recently been reported for HLA genes (Mendes, 2013; Lenz et al., 2016) and has also been suggested for the “S” loci in Arabidopsis and Solanum. In “S” loci it was interpreted in the context of “sheltered load”, which relies on the assumption that deleterious variants are recessive and are less “seen” by purifying selection in regions of high heterozygosity (Stone, 2004; Roux et al., 2013).

Therefore, whether an increased proportion of deleterious variants occurs for targets of balancing selection throughout the entire genome has not yet been examined, except in the context of HLA genes. We address this question by investigating the levels of accumulation of deleterious mutations in regions surrounding sites previously detected as targets of balancing selection in a powerful genome-wide approach (Bitarello et al., n.d.). Given the broad range of methods that can be used to identify deleterious variants, and the fact that they are frequently not in agreement (reviewed in Henn et al., 2015), we opt to use three complementary approaches. Moreover, most deleteriousness measures are negatively correlated with allele frequencies, so we explicitly consider the effects of allele frequencies in our analyses.

Our expectation, given the theoretical background outlined above, was that regions with evidence for balancing selection would show an enrichment of deleterious variants. In accord with this expectation we found strong evidence for an increased proportion of nonsynonymous variants within genes with signatures of long-term balancing selection (LTBS), as well as evidence for an increased proportion of deleterious variants.

Methods

Population datasets

In order to test the hypothesis that sites within genes with evidence for long-term balancing selection (LTBS) show an excess of deleterious variants, we considered all protein-coding SNP positions (nonsynonymous, N, and synonymous, S) from the 1000 Genomes Phase 3 data (Auton et al., 2015). We selected SNPs that fall within the coordinates of genes with signatures of LTBS (“balanced genes”, see below) or within the target windows per se (“balanced windows”) (Tables 1 and 2).

We used the integrated call sets in VCF format for each chromosome, and calculated reference and alternative allele frequencies per population using VCFtools (Danecek et al., 2011). We only considered populations from Africa and Europe, and excluded the admixed ones, thus resulting in 10 populations: [Africa: Yoruba in Ibadan, Nigeria (YRI), Luhya in Webuye, Kenya (LWK), Mende in Sierra Leone (MSL), Gambian in Western Division, The Gambia (GWD), Esan in Nigeria (ESN)]; [Europe: Toscani in Italy (TSI), British in England and Scotland (GBR), Iberian populations in Spain (IBS), Finnish in Finland (FIN), Utah residents with Northern and Western European ancestry (CEU)]. We did not include Asian populations because the targets of balancing selection defined by Bitarello et al. (n.d.) were only documented for African and European populations.

Targets of balancing selection

The "balanced genes" are those reported by Bitarello et al. (n.d.) (see their Table 3) as having the strongest statistically significant signatures of LTBS in humans (213 genes in total). The list of balanced genes was generated by intersecting 3 Kb windows with strong signatures of LTBS with the protein-coding gene annotation from Encode/Ensembl (Bitarello et al., n.d.). Here we test the hypothesis of an enrichment in the proportion of deleterious variants in the balanced genes per se (Table 1).

A more specific definition of the regions under balancing selection would involve the analysis of "balanced windows" (i.e., the queried sub-region of a gene with evidence for balancing selection, according to the method of Bitarello et al., n.d.). However, this approach generates a dataset which is too restrictive (Table 2), with a number of SNPs that is too small to provide reliable contrasts between regions under balancing selection and the rest of the genome (on average 209 protein-coding SNPs per population, after the HLA genes are removed). We therefore chose to restrict our analyses to the balanced genes, for which we have a larger number of SNPs documenting the influence of selection on nearby sites.

Our approach consists of a comparison of the number of deleterious SNPs within the genes under balancing selection and the remainder of the genome (which we refer to as providing a set of "control SNPs"). We define control SNPs as those protein-coding SNPs outside all balanced genes. We considered neither the SNPs on the sex chromosomes nor those in the mitochondrial DNA, since there is no information regarding balancing selection signatures for those genes (Bitarello et al., n.d.). More details about the controls are provided below in the "Re-sampling control SNPs" section.

Annotation

One of the summary statistics used to quantify the genetic load was the CADD score (Combined Annotation Dependent Depletion, or simply “C score”) described in Kircher et al. (2014). We therefore used the annotation provided by Kircher et al. (2014) (available at: http://cadd.gs.washington.edu/download, accessed in March 2016).

From this annotation file we retrieved the following information: "CHR" (chromosome where the SNP is situated), "POS" (position of SNP in chromosome), "Consequence" (nonsynonymous, synonymous, 3-prime-UTR, 5-prime-UTR, intronic, non-coding change, canonical splice, stop-gained, stop-lost), "GeneName" (gene name associated with the SNP position), "AnnoType" (NonCodingTranscript, Transcript, CodingTranscript), "PolyPhenCat" (benign, possibly damaging, probably damaging), and the scaled and raw C scores (Kircher et al., 2014).

Because two of our measurements of genetic load (see below) are only applicable to protein-coding sites, we restricted our analyses to these categories. Thus, all quantification of deleterious load was restricted to sites which are protein-coding (i.e., N or S) (Tables 1 and 2). For the step where we identified the specific SNPs with highest heterozygosity for each gene (and which is/are the putative target(s) of balancing selection, see below) we used the complete set of sites within the gene, since it is plausible that sites which are not protein-coding are under balancing selection.


Given that the effect of balancing selection on the load of nearby regions is also expected to affect sites which are functional but not protein-coding, our approach could theoretically be extended to this class of sites (for example, using a measure of deleteriousness such as the C score, which is applicable to these sites as well). However, in the present study we opted to restrict our analyses to variants that affect protein-coding sequences. We justify this based on the fact that assignment of deleteriousness at these sites can be performed by estimating the ratio of nonsynonymous to synonymous polymorphisms, a measure which is based exclusively on the nature of the variants, with no direct influence of allele frequencies and phylogenetic conservation.

Quantifying genetic load

All protein-coding SNPs from the set of balanced SNPs (Table 1) were jointly considered when calculating the statistics below, i.e., a single estimate of load was made for the entire set of genes with evidence for balancing selection. This avoids the difficulty in obtaining reliable estimates when computing load for individual genes, since these often have a small number of SNPs. For controls, the same approach was adopted for each re-sampled set of SNPs (details below).

Ratio of nonsynonymous to synonymous polymorphisms

We calculated the ratio of the number of nonsynonymous (PN) to synonymous polymorphisms (PS) for each set of SNPs (from balanced genes and controls):

$$P_N/P_S = \frac{P_N}{P_S} \tag{1}$$

This ratio provides a measure of the proportion of deleterious mutations,

Balanced genes

| POP | # sites | PN | PS | Pdel1 | Pdel2 | Benign | Cscore |
|---|---|---|---|---|---|---|---|
| YRI | 3,423 (2,961) | 1,871 (1,564) | 1,552 (1,397) | 197 (182) | 262 (238) | 1,300 (1,034) | 10.72 (11.36) |
| LWK | 3,587 (3,126) | 1,959 (1,654) | 1,628 (1,472) | 213 (194) | 281 (260) | 1,357 (1,093) | 10.85 (11.52) |
| MSL | 3,387 (2,937) | 1,837 (1,543) | 1,550 (1,394) | 198 (184) | 241 (222) | 1,273 (1,013) | 10.63 (11.26) |
| GWD | 3,550 (3,098) | 1,967 (1,668) | 1,583 (1,430) | 198 (181) | 292 (272) | 1,353 (1,091) | 10.75 (11.35) |
| ESN | 3,261 (2,821) | 1,804 (1,517) | 1,457 (1,304) | 202 (187) | 265 (248) | 1,238 (984) | 10.91 (11.63) |
| TSI | 2,596 (2,146) | 1,479 (1,181) | 1,117 (965) | 175 (161) | 230 (211) | 997 (733) | 10.88 (11.82) |
| GBR | 2,299 (1,866) | 1,328 (1,043) | 971 (823) | 155 (140) | 188 (173) | 919 (664) | 10.69 (11.70) |
| FIN | 2,149 (1,715) | 1,230 (948) | 910 (767) | 138 (124) | 173 (159) | 864 (610) | 10.33 (11.36) |
| CEU | 2,353 (1,925) | 1,334 (1,054) | 1,019 (871) | 145 (132) | 198 (183) | 924 (672) | 10.66 (11.66) |
| IBS | 2,612 (2,153) | 1,507 (1,200) | 1,105 (953) | 171 (155) | 237 (216) | 1,034 (764) | 10.82 (11.71) |

Table 1: Statistics for protein-coding SNPs within the balanced genes. Genes under balancing selection were defined by Bitarello et al. (n.d.). Numbers in parentheses refer to the datasets after removal of HLA genes (see Methods). Cscore, average scaled C score for all SNPs in the set (see Methods). PN and PS, numbers of nonsynonymous and synonymous sites, respectively. Pdel1, Pdel2, and Benign, numbers of possibly damaging, probably damaging and benign variants (Adzhubei et al., 2010), respectively.


Balanced windows

| POP | # sites | PN | PS | Pdel1 | Pdel2 | Benign | Cscore |
|---|---|---|---|---|---|---|---|
| YRI | 635 (225) | 401 (124) | 234 (102) | 23 (9) | 30 (9) | 333 (93) | 7.50 (9.26) |
| LWK | 651 (236) | 400 (122) | 251 (114) | 29 (12) | 28 (8) | 332 (9) | 7.34 (9.07) |
| MSL | 636 (234) | 391 (124) | 245 (110) | 20 (7) | 29 (10) | 327 (93) | 7.52 (9.29) |
| GWD | 652 (248) | 406 (136) | 246 (112) | 29 (13) | 35 (16) | 328 (93) | 7.88 (9.91) |
| ESN | 630 (235) | 387 (125) | 243 (110) | 28 (13) | 24 (8) | 321 (91) | 7.46 (9.43) |
| TSI | 612 (205) | 388 (113) | 224 (92) | 27 (13) | 33 (15) | 317 (75) | 7.62 (9.93) |
| GBR | 561 (171) | 361 (100) | 200 (71) | 24 (9) | 28 (14) | 302 (70) | 7.25 (9.31) |
| FIN | 553 (166) | 351 (91) | 202 (75) | 22 (8) | 24 (10) | 297 (65) | 7.02 (8.80) |
| CEU | 559 (172) | 353 (96) | 206 (76) | 19 (6) | 28 (14) | 297 (67) | 7.17 (9.29) |
| IBS | 607 (196) | 384 (107) | 223 (89) | 26 (11) | 33 (15) | 314 (70) | 7.46 (9.58) |

Table 2: Statistics for protein-coding SNPs within balanced windows. Windows are defined in Bitarello et al. (n.d.). Cscore, average scaled C score for all SNPs in the set (see Methods). PN and PS, numbers of nonsynonymous and synonymous sites, respectively. Pdel1, Pdel2, and Benign, numbers of possibly damaging, probably damaging and benign variants, respectively (Adzhubei et al., 2010).

assuming that a large proportion (Eyre-Walker and Keightley, 1999; Subramanian, 2012) of nonsynonymous mutations are either strongly or mildly deleterious. However, because nonsynonymous variants include those that are adaptive or neutral, we also considered alternative statistics which quantify genetic load.

Ratio of damaging to synonymous polymorphisms

To estimate the number of damaging alleles in the “balanced” and control sets of SNPs, we used the PolyPhen-2 (Adzhubei et al., 2010) annotation provided in Kircher et al. (2014). PolyPhen-2 classifies nonsynonymous variants as benign, possibly damaging (Pdel1) or probably damaging (Pdel2). We thus defined the ratio of damaging to synonymous SNPs (Lohmueller et al., 2008) as:

$$P_{del}/P_S = \frac{P_{del1} + P_{del2}}{P_S} \tag{2}$$

This estimate quantifies the proportion of SNPs most likely to be deleterious. PolyPhen-2 is a protein-level metric which is, by definition, restricted to nonsynonymous sites. Moreover, several nonsynonymous SNPs (∼20% in the 1000 Genomes dataset) lack PolyPhen-2 annotation, so the Pdel/PS statistic was calculated based on a smaller set of SNPs than the PN/PS (Tables 1 and 2) and has higher variance.
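The two ratios in Equations 1 and 2 can be illustrated with a minimal Python sketch. The function name, the class labels and the `N_unannotated` category (nonsynonymous SNPs lacking a PolyPhen-2 category) are hypothetical names chosen for this illustration and are not taken from the authors' scripts; the counts in the worked example are the YRI values from Table 1, with HLA genes included.

```python
from collections import Counter


def load_ratios(snp_classes):
    """Return (PN/PS, Pdel/PS) for an iterable of per-SNP class labels.

    Labels are hypothetical names used only in this sketch:
      "S"                          synonymous
      "benign", "pdel1", "pdel2"   nonsynonymous, with a PolyPhen-2 category
      "N_unannotated"              nonsynonymous, no PolyPhen-2 annotation
    """
    counts = Counter(snp_classes)
    p_s = counts["S"]
    if p_s == 0:
        raise ValueError("no synonymous SNPs: the ratios are undefined")
    # PN counts every nonsynonymous SNP, annotated by PolyPhen-2 or not.
    p_n = sum(counts[k] for k in ("benign", "pdel1", "pdel2", "N_unannotated"))
    # Pdel counts only possibly and probably damaging SNPs.
    p_del = counts["pdel1"] + counts["pdel2"]
    return p_n / p_s, p_del / p_s


# Worked example with the YRI counts from Table 1 (HLA genes included):
# PN = 1,871, PS = 1,552, Pdel1 = 197, Pdel2 = 262, Benign = 1,300.
yri = (["S"] * 1552 + ["pdel1"] * 197 + ["pdel2"] * 262 + ["benign"] * 1300
       + ["N_unannotated"] * (1871 - 197 - 262 - 1300))
pn_ps, pdel_ps = load_ratios(yri)
print(round(pn_ps, 3), round(pdel_ps, 3))
```

Because Pdel only counts annotated variants while PS is unchanged, missing PolyPhen-2 annotations lower Pdel/PS in both the balanced and the control sets, which is one reason the comparison is made between sets rather than against an absolute threshold.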

CADD (C score)

The C score was provided by the CADD tool (Kircher et al., 2014). The C score is a composite measure that uses information from more than 60 different annotations to quantify the effects of a mutation, and has been shown to differentiate levels of deleteriousness among groups of SNPs (Kircher et al., 2014). As argued by Kircher et al. (2014), protein-level metrics such as PolyPhen-2 (Adzhubei et al., 2010) are the best performing individual annotations, but are restricted to nonsynonymous variants, whereas conservation scores such as GERP++ (Davydov et al., 2010) cannot distinguish between nonsynonymous and stop-loss variants at a given position.

We used both the scaled and the raw C scores provided for the 1000G phase 3 SNPs. The scaled C scores range from 1 to 99 (higher values indicating higher deleteriousness potential). Although counting the number of SNPs above a certain threshold could be used as a strategy, we used the approach of comparing distributions of C scores between groups in order to increase power (Kircher et al., 2014). Throughout the discussion, when C scores are presented they refer to the average C score of all N + S SNPs contained in a given set of SNPs (balanced or control). We restricted the analyses to these sites so as to make the results comparable to those of the two other metrics for measuring deleteriousness.

$$C_{score} = \frac{\sum_{i=1}^{n} C_{N_i} + \sum_{i=1}^{n} C_{S_i}}{N + S} \tag{3}$$

where n is the total number of SNPs in the set, and $C_{N_i}$ and $C_{S_i}$ are the C scores for the N and S SNPs contained in the set. The overall C score used in our analyses is thus an average of the C scores of all N + S SNPs contained within the “balanced” and control sets of SNPs.
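As a small illustration of Equation 3, the sketch below averages C scores over the N and S SNPs of a set and ignores any other class. The `(consequence, c_score)` tuple layout and the labels are assumptions made for this example, and the scores are toy values rather than real CADD output.

```python
def mean_c_score(snps):
    """Average C score over the N and S SNPs of a set (Equation 3).

    `snps` is an iterable of (consequence, c_score) pairs; the tuple layout
    and the labels "N" and "S" are assumptions of this sketch.
    """
    scores = [score for consequence, score in snps if consequence in ("N", "S")]
    if not scores:
        raise ValueError("no protein-coding SNPs in the set")
    return sum(scores) / len(scores)


# Toy values, not real CADD scores; the intronic SNP is ignored.
example = [("N", 20.0), ("N", 10.0), ("S", 4.0), ("S", 6.0), ("intronic", 1.0)]
print(mean_c_score(example))
```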

Scaled C scores are very useful for identifying a top ranked SNP and easier to interpret, but raw C scores offer superior resolution for comparison of distributions of scores between groups of variants (Kircher et al., 2014). Thus, we also compared the distribution of raw C scores for SNPs within balanced genes to those of the re-sampled sets of controls, and performed a one-tailed Mann-Whitney U-test (the alternative hypothesis being that SNPs from balanced genes have higher raw C scores) to compare the balanced SNPs’ distribution to each control replicate (significance threshold 5%).
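For readers unfamiliar with the test, a one-tailed Mann-Whitney U-test can be sketched in plain Python as follows. This is an illustrative implementation that uses the normal approximation with tie and continuity corrections; the text does not state which software or p-value method the authors used, so the function name and these choices are assumptions.

```python
import math


def mann_whitney_greater(x, y):
    """One-sided Mann-Whitney U test of H1: values in x tend to exceed those in y.

    Returns (U, p), where U is the statistic for x and p is obtained from the
    normal approximation with tie and continuity corrections.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2
    # Tag each value with its sample of origin (0 = x, 1 = y) and sort.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                     # size of this group of tied values
        avg_rank = (i + 1 + j) / 2.0  # mean of the ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values identical: no evidence either way
    z = (u - mean_u - 0.5) / math.sqrt(var_u)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return u, p


# Toy raw scores: every "balanced" value exceeds every control value.
u_stat, p_value = mann_whitney_greater([5, 6, 7, 8], [1, 2, 3, 4])
print(u_stat, p_value < 0.05)
```

In the setting described above, `x` would hold the raw C scores of the balanced SNPs and `y` those of a single control replicate, with the test repeated for each replicate.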

Re-sampling control SNPs

To test the hypothesis that balanced genes are enriched for deleterious SNPs, we compared the three statistics that measure deleteriousness between the SNPs contained within the genes under balancing selection and a random sample with the same number of SNPs, but chosen from genes with no evidence for balancing selection (controls) (Table 1). We use the distributions of control SNPs to obtain an empirical p-value for the SNPs from balanced genes, defined as the fraction of re-sampled sets whose deleteriousness statistics are more extreme (i.e., higher) than those of the SNPs from the balanced genes.
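The empirical p-value defined above reduces to a simple proportion. The sketch below follows the wording of the text (replicates strictly higher than the observed value); the function name and the numbers in the example are hypothetical.

```python
def empirical_p_value(observed, control_values):
    """Fraction of control replicates whose statistic exceeds the observed one."""
    if not control_values:
        raise ValueError("need at least one control replicate")
    higher = sum(1 for value in control_values if value > observed)
    return higher / len(control_values)


# Toy PN/PS values for four control replicates (not real results).
p_emp = empirical_p_value(1.21, [1.00, 1.10, 1.25, 0.90])
print(p_emp)
```

With R replicates, the smallest non-zero p-value that can be reported is 1/R, so the number of replicates bounds the resolution of the test.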

Previous studies have shown, and we confirm here (see Results), that there is a strong correlation between allele frequency and the probability of a variant being annotated as deleterious. Because genes/regions under balancing selection are enriched for SNPs at intermediate frequencies (i.e. higher heterozygosities), this effect will itself result in a marked difference between measures of load for balanced genes and the genomic background. In order to control for this effect and guarantee that differences in load are attributable specifically to the effects of linked selection, we compared the proportion of deleterious variants in balanced genes/windows to those of the control sets of SNPs after matching the control SNPs to the frequencies of those in the “balanced” set. Next we describe the procedure used to re-sample a set of SNPs while controlling for frequency.

Once the protein-coding SNPs from balanced genes had been selected, we followed an approach similar to the one adopted by Subramanian (2016): (1) we took the MAF (minor allele frequency) of each protein-coding SNP; (2) we calculated the log (base 10) of the MAF (logMAF), because in humans the MAF follows an exponential distribution, i.e., a huge proportion of alleles have very low MAF (e.g. Abecasis et al., 2012; Subramanian, 2016); (3) we divided the SNPs into bins according to the logMAF (in our case, we used 9 bins of logMAF with a 0.25 interval, starting at logMAF = -2.4 and encompassing a MAF range of 0.00398-0.5). Given that we did not expect balancing selection to favour derived or ancestral variants preferentially (Bitarello et al., n.d.), using the MAF is appropriate and does not require further filtering of data in order to infer ancestral and derived states. Importantly, once the set of SNPs from balanced genes was divided into bins of logMAF, we were able to quantify the relative contribution of SNPs to each bin, thus allowing the re-sampled sets of SNPs to match the site-frequency spectrum (SFS) of the SNPs observed in balanced genes. We re-sampled from the control SNPs a set following the proportions of each logMAF bin and the total number of SNPs within the set of target genes (Table 1).
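The binning and frequency-matched re-sampling steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' scripts: the lower bin edge of -2.4 is inferred from the stated lower MAF bound (log10 of 0.00398 is approximately -2.4), SNPs are sampled without replacement within a replicate, and all function and variable names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

BIN_START = -2.4   # assumed lower edge, inferred from the stated MAF bound 0.00398
BIN_WIDTH = 0.25   # bin interval given in the text
N_BINS = 9         # number of bins given in the text


def logmaf_bin(maf):
    """Index (0..N_BINS-1) of the logMAF bin for a MAF, or None if out of range."""
    if maf <= 0:
        return None
    idx = math.floor((math.log10(maf) - BIN_START) / BIN_WIDTH)
    return idx if 0 <= idx < N_BINS else None


def matched_resample(target_mafs, control_snps, rng=random):
    """Draw control SNPs so each logMAF bin holds as many SNPs as the target set.

    `control_snps` is a list of (snp_id, maf) pairs. Sampling is without
    replacement within one replicate (an assumption of this sketch).
    """
    wanted = Counter(b for b in map(logmaf_bin, target_mafs) if b is not None)
    pool = defaultdict(list)
    for snp_id, maf in control_snps:
        b = logmaf_bin(maf)
        if b is not None:
            pool[b].append((snp_id, maf))
    sample = []
    for b, k in wanted.items():
        if len(pool[b]) < k:
            raise ValueError(f"bin {b}: need {k} control SNPs, have {len(pool[b])}")
        sample.extend(rng.sample(pool[b], k))
    return sample


# Toy data: four target SNPs and seven candidate control SNPs.
rng = random.Random(1)
target = [0.45, 0.30, 0.02, 0.01]
controls = [(f"ctrl{i}", m)
            for i, m in enumerate([0.5, 0.4, 0.35, 0.012, 0.015, 0.02, 0.2])]
replicate = matched_resample(target, controls, rng)
print(sorted(snp_id for snp_id, _ in replicate))
```

Calling `matched_resample` repeatedly with the same inputs yields the replicate control sets on which the load statistics are then computed.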

This re-sampling scheme was designed to account for the fact that all of the genetic load measurements adopted here correlate negatively with allelic frequency (see Results) and that there is an enrichment of intermediate-frequency alleles among the balanced genes (Bitarello et al., n.d.). Each SNP was sampled independently of its location, provided that it was protein-coding, autosomal and matched the logMAF proportions calculated based on the balanced genes. This means that each control set had the same number of protein-coding SNPs as the balanced genes’ set and a similar SFS (Table 1), but those SNPs were not necessarily attributed to the same number of genes as those for the balanced genes.

Excluding adaptively maintained SNPs from load estimates

Our goal is to test the hypothesis that SNPs within genes under balancing selection have a higher proportion of deleterious variants than expected for a set of control SNPs. However, an excess of deleterious or functional variation could be an outcome of the direct effects of balancing selection. For example, the unusually high proportion of nonsynonymous polymorphism in HLA genes is a consequence of balancing selection directly on functional sites, and not of deleterious variants accumulating as a byproduct of selection on a specific site (e.g. Hughes and Nei, 1988; Bitarello et al., 2015).

In order to separate the direct effects of balancing selection from those due to hitch-hiking, we also calculated the genetic load measurements after excluding the sites which are the strongest candidates for balancing selection (thus justifying the assumption that the remainder of the highly polymorphic variants are present due to linkage with this selected variant). This approach relies on the assumption that one or at most a few sites are the targets of balancing selection within each gene.

For each balanced gene, the putatively selected SNPs were identified by locating, within the outlier window with evidence for balancing selection (as reported in Bitarello et al., n.d.), the site with the highest heterozygosity. This SNP was then excluded from the set of SNPs for the balanced genes. When a balanced gene had more than one balanced window, we chose the one with the most extreme signature of LTBS (Bitarello et al., n.d.).


Heterozygosity was calculated as follows:

$$H_i = 2 \cdot [\mathrm{MAF} \cdot (1 - \mathrm{MAF})] \tag{4}$$

where i is each SNP position and MAF is the minor allele frequency for that position. For this exclusion step, all coding SNPs were considered, not only N and S, and most of the excluded SNPs were intronic (∼ 90% per population, Table 3). When the most extreme heterozygosity in a gene was shared by multiple SNPs, all were removed. The average number of SNPs removed per gene, across all populations, was 3.8. Also, few genes had one or more N or S SNPs removed by this filter (average 14 genes out of 213, per population). Overall, 71 unique SNPs (29 N and 42 S) were removed across all populations (average 11.5 N and 14.6 S per population, Table 3).

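| POP | All | Intron | N | S | 3'UTR | 5'UTR | Splice |
|-----|-----|--------|----|----|-------|-------|--------|
| YRI | 762 | 681 | 18 | 19 | 39 | 4 | 1 |
| LWK | 663 | 599 | 8 | 9 | 41 | 5 | 1 |
| MSL | 664 | 599 | 15 | 13 | 33 | 4 | 0 |
| GWD | 693 | 633 | 12 | 14 | 28 | 5 | 1 |
| ESN | 692 | 621 | 16 | 19 | 32 | 3 | 1 |
| TSI | 763 | 704 | 8 | 16 | 33 | 2 | 0 |
| GBR | 841 | 761 | 13 | 15 | 46 | 4 | 2 |
| FIN | 740 | 669 | 8 | 14 | 43 | 6 | 0 |
| CEU | 754 | 700 | 5 | 15 | 33 | 1 | 0 |
| IBS | 715 | 670 | 12 | 12 | 14 | 7 | 0 |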

Table 3: Classes of SNPs with the highest heterozygosity(ies) per gene. For each gene, the SNP(s) with the highest heterozygosity(ies) were removed (All). Only SNPs contained within the outlier windows (Bitarello et al., n.d.) of those genes were considered. Splice, splice-site position; 3' and 5' UTR, 3 and 5 prime UTR regions; N, S, Intron, nonsynonymous, synonymous and intronic sites.
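The heterozygosity filter described above (Equation 4, with all SNPs tied for the maximum being removed) can be illustrated with a minimal Python sketch. It is not the R code used for the analyses; the dictionary layout and the function names are hypothetical.

```python
def heterozygosity(maf):
    """Expected heterozygosity of a biallelic SNP: H = 2 * MAF * (1 - MAF)."""
    return 2.0 * maf * (1.0 - maf)

def snps_to_exclude(window_snps):
    """Return the SNP(s) with the highest heterozygosity in an outlier window.

    `window_snps` maps a SNP identifier to its MAF (hypothetical layout).
    All SNPs sharing the maximum value are returned, mirroring the rule
    that ties are removed together.
    """
    if not window_snps:
        return []
    h = {snp: heterozygosity(maf) for snp, maf in window_snps.items()}
    h_max = max(h.values())
    return sorted(snp for snp, value in h.items() if value == h_max)
```

Because H increases monotonically with MAF up to MAF = 0.5, the excluded SNPs are simply those with the highest MAF in the window.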

Given that the assumption that a balanced gene/window has one or a few sites which is/are the actual target(s) of selection is not reasonable for the HLA genes, where several sites are targets of balancing selection (Hughes and Nei, 1988; Bitarello et al., 2015), we performed our analyses under two scenarios: either keeping the HLA genes or removing them. For these analyses we removed the following HLA genes, which have prior strong evidence for long-term balancing selection and are included among the outlier genes in Bitarello et al. (n.d.): HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DRB5, HLA-DPA1, HLA-DPA2, HLA-DPB1, HLA-DPB2, HLA-DQB1, HLA-DQB2, HLA-DQA1, HLA-DQA2. Their removal changes the proportion of target SNPs that fall into each bin of logMAF, with the SFS becoming less enriched for intermediate frequency variants.

All analyses and figures were generated in R (Development Core Team, 2009) and scripts are available: calculation of allele frequencies per population for 1000 Genomes Phase 3 data (https://github.com/deboraycb/1000Gstats_inR/); load analyses, re-sampling and all figures (https://github.com/bbitarello/deleterious_mutations, access can be provided upon request).

Results

The site frequency spectrum of balanced genes

Here, we consider as "balanced genes" the set described as having the strongest signatures of LTBS in Bitarello et al. (n.d.). Balancing selection shifts the SFS towards intermediate frequencies (Andrés et al., 2009; Bitarello et al., n.d.). Although the selected sites may only comprise a subset of the entire locus, balancing selection changes levels of polymorphism at adjacent sites (neutral and non-neutral), thus generating a signature that allows selected genes to be detected (Bitarello et al., n.d.). It is entirely plausible, and likely, that only portions of those genes are the targets of balancing selection, and this provided us with an appropriate dataset that contains the putatively selected site(s) and their immediate vicinities, which show signatures of LTBS (as seen in Figure 5 of Bitarello et al., n.d.).

The SFS of the balanced genes is shifted towards intermediate frequencies when compared to the genomic background distribution (Figure 2). Specifically, balanced genes have about 10% fewer variants in the lower bins of frequency (MAF ≤ 0.0025). In other words, the balanced genes have a different SFS from that of the background, and an appropriate re-sampling of control SNPs needs to account for this property of the balanced genes.

The SNPs in the balanced genes were binned according to their MAFs (Table 4), and their distribution into the bins was used for the re-sampling procedure for the controls. Because signatures of LTBS are expected to be restricted to narrow windows (Andrés et al., 2009; Andrés, 2011; Bitarello et al., n.d.; Charlesworth, 2006) and here we consider the entire gene, this shift towards intermediate frequencies is modest.

Measures of deleteriousness correlate negatively with allelic frequency

Previously, Lohmueller et al. (2008) reported that SNPs classified as "damaging" according to PolyPhen had significantly lower mean derived allele frequencies (DAF) than "benign" SNPs, with the "probably damaging" category having the lowest mean DAF.

More generally, nonsynonymous variants are expected to have lower frequencies (Brandvain and Wright, 2016), because purifying selection will have had enough time to purge deleterious variants from the population (assuming most nonsynonymous variants are deleterious). This is in fact the pattern seen for human populations, where there is a vast excess of low frequency nonsynonymous variants (Casals et al., 2013; Fu et al., 2012; Tennessen et al., 2012). Moreover, C scores also tend to be higher for lower frequency variants (Kircher et al., 2014), although it has been shown that C score distributions have power to differentiate lead-SNPs and tag-SNPs from GWAS, which by definition have similar frequencies (Kircher et al., 2014).

| Pop | Bin 1 | Bin 2 | Bin 3 | Bin 4 | Bin 5 | Bin 6 | Bin 7 | Bin 8 | Bin 9 |
|-----|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| YRI | 905 (0.5-0.5) | 277 (0.9-0.9) | 317 (1.4-1.8) | 348 (2.3-3.7) | 292 (4.2-6.9) | 320 (7.4-12.5) | 361 (13-22.2) | 416 (22.7-39.3) | 187 (39.8-50) |
| LWK | 977 (0.5-0.5) | 386 (1-1) | 355 (1.5-2) | 283 (2.5-3.5) | 335 (4-7) | 282 (7.6-12.1) | 390 (12.6-22.2) | 383 (22.7-39.4) | 196 (40-50) |
| MSL | 894 (0.6-0.6) | 304 (1.12-1.2) | 208 (1.8-1.8) | 333 (2.3-3.5) | 364 (4.1-7) | 309 (7.6-12.3) | 390 (12.9-22.3) | 390 (22.9-39.4) | 195 (40-50) |
| GWD | 974 (0.4-0.4) | 304 (0.9-0.9) | 395 (1.3-2.2) | 269 (2.6-3.5) | 322 (4-6.6) | 311 (7.1-12.4) | 368 (12.8-22.1) | 404 (22.6-39.4) | 203 (39.8-50) |
| ESN | 790 (0.5-0.5) | 303 (1-1) | 306 (1.5-2) | 268 (2.5-3.5) | 334 (4-7.1) | 300 (7.6-12.1) | 372 (12.6-22.2) | 388 (22.7-39.4) | 200 (39.9-50) |
| TSI | 820 (0.5-0.5) | 173 (0.9-0.9) | 167 (1.4-1.9) | 175 (2.3-3.7) | 171 (4.2-7) | 177 (7.5-12.1) | 329 (12.6-22) | 390 (22.4-39.7) | 194 (40.2-50) |
| GBR | 586 (0.55-0.55) | 162 (1.1-1.1) | 150 (1.6-2.2) | 145 (2.7-3.8) | 157 (4.4-6.6) | 194 (7.1-12.1) | 313 (12.6-22) | 363 (22.5-39.6) | 229 (40.1-50) |
| FIN | 409 (0.5-0.5) | 143 (1-1) | 171 (1.5-2) | 153 (2.5-3.5) | 196 (4-7) | 203 (7.6-12.1) | 304 (12.6-22.2) | 370 (22.7-39.4) | 304 (39.9-50) |
| CEU | 644 (0.5-0.5) | 147 (1-1) | 138 (1.5-2) | 160 (2.5-3.5) | 194 (4-7.1) | 165 (7.6-12.1) | 325 (12.6-22.2) | 376 (22.7-39.4) | 204 (39.9-50) |
| IBS | 825 (0.5-0.5) | 173 (0.9-0.9) | 176 (1.4-1.8) | 158 (2.3-3.7) | 191 (4.2-7) | 176 (7.5-12.1) | 311 (12.6-22) | 392 (22.4-39.7) | 210 (40.2-50) |

Table 4: Binning of SNPs according to their minor allelic frequencies. SNPs in balanced genes were binned according to their logMAF values (see Methods). Values correspond to the set of balanced genes with HLA SNPs included. When HLA SNPs were excluded, the bin proportions changed (not shown). Each cell gives n, the number of SNPs (N + S) in the bin, followed in parentheses by MAF, the minimum and maximum MAF values observed within the bin. MAFs are given in %.

Figure 2: Site frequency spectrum for protein-coding (N + S) SNPs. SNPs come from balanced (blue) and control (gray) genes. Rectangles identify the sections that are zoomed in on in (B) and (C).

We confirmed these patterns with the 1000 Genomes data we analyzed (Figure 3). When dividing all protein-coding SNPs (whether they fall into balanced genes or not) into bins of minor allele frequencies (logMAF, see Methods), a clear negative correlation is observed between MAF and the three statistics: PN/PS, Pdel/PS and Cscore (Figure 3).

All of the aforementioned observations indicate the importance of controlling for allele frequencies when analyzing the load of deleterious mutations among balanced genes. Without such a control, the load would appear higher among control SNPs than among the SNPs from balanced genes, as a consequence of an enrichment in intermediate frequency variants in balanced genes (Bitarello et al., n.d.) and the fact that deleterious variants are more abundant in the lower bins of MAF (Adzhubei et al., 2010; Kircher et al., 2014; Lohmueller et al., 2008; Subramanian, 2016).


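| Pop | PN/PS-Cscore | Pdel/PS-Cscore | PN/PS-Pdel/PS |
|-----|--------------|----------------|---------------|
| YRI | 0.23 | 0.55 | 0.61 |
| LWK | 0.27 | 0.61 | 0.62 |
| MSL | 0.28 | 0.59 | 0.65 |
| GWD | 0.28 | 0.61 | 0.65 |
| ESN | 0.26 | 0.58 | 0.63 |
| TSI | 0.28 | 0.60 | 0.67 |
| GBR | 0.27 | 0.60 | 0.65 |
| FIN | 0.27 | 0.56 | 0.65 |
| CEU | 0.28 | 0.59 | 0.68 |
| IBS | 0.36 | 0.67 | 0.70 |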

Table 5: Pearson's correlations between load statistics. Each value corresponds to 1,000 re-samplings (controls) for the balanced SNPs (gene-based). All correlations are highly significant (p-value < 2.6e-13).

Although our three measures of deleteriousness differ in the criteria used to define/quantify how damaging variants are, we find an overall high correlation among these measures (Figure 4). PN/PS and Pdel/PS are highly correlated in all populations (cor ≥ 0.61 and p < 2.6e-13, Table 5), as are Cscore and Pdel/PS (cor ≥ 0.55, Table 5). PN/PS and Cscore have a weaker correlation, albeit also highly significant (Figure 4A and Table 5).
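For reference, Pearson's correlation between two load statistics measured on the same re-sampled control sets can be computed as in the minimal Python sketch below. It is an illustration rather than the R code used here, and the function name is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

Here `x` and `y` would each hold one value per control set, for example the 1,000 PN/PS and 1,000 Pdel/PS estimates of a given population.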

These results indicate the importance of using a re-sampling approach that controls for differences in frequencies of SNPs in balanced genes and genome-wide (Table 4). Re-sampling sets of genes which are not under balancing selection, without controlling for the SFS, would lead the control to be relatively enriched for low frequency variants, a factor which would obscure the identification of possible differences between the balanced and control SNPs.

Figure 3: Boxplot of load statistics by each bin of MAF. All autosomal N + S SNPs were included here. Bins were defined based on the log (base 10) of the MAF of each variant (see Methods). y-axis: (A) PN/PS, (B) Pdel/PS, (C) Cscore.

Figure 4: Correlations between load summary statistics. (A) C score and PN/PS (cor=0.80, for all populations combined, Pearson, p-value < 2.2e-16), (B) Pdel/PS and Cscore (cor=0.91, p < 2.2e-16) and (C) Pdel/PS and PN/PS (cor=0.91, p < 2.2e-16). Each color corresponds to one population and each point refers to the metric estimated for a re-sampled set of SNPs. Lines represent linear regressions for each population. For correlation values per population, see Table 5.


Extreme values for HLA SNPs

In the scan for balancing selection of Bitarello et al. (n.d.), HLA genes were over-represented among the category of selected genes, and showed extremely strong evidence for selection, with 12 classical HLA genes present among the 213 genes with strongest signatures of balancing selection. This observation, combined with the fact that HLA genes are likely to carry several sites under the direct effects of balancing selection, led us to single them out for an exploratory analysis.

We initially compared the load statistics for the set of SNPs from balanced genes to a group comprising all control SNPs in the genome. Note that here there was no re-sampling involved; we simply compared the statistics between the different groups. We evaluated the influence of SNPs from HLA genes on the load summary statistics by excluding all HLA SNPs from the set of SNPs contained in balanced genes.

The set of SNPs from the balanced genes has different values for the three statistics when compared to the control set of SNPs: PN/PS values are higher (Figure 5), while Pdel/PS and C score values are lower for the SNPs from all balanced genes (Figure 5). Interestingly, the HLA set of SNPs follows the same pattern as the SNPs from the set of all balanced genes (which include the HLA SNPs), although much more strongly. When we examine HLA genes alone, we find that the average PN/PS for these loci is almost 2-fold greater than that of control SNPs (Figure 5). Similarly, the reduction in Pdel/PS and C score in HLA compared to control SNPs is also about two-fold (Figure 5). Moreover, when HLA SNPs are removed from the set of SNPs from balanced genes, the remaining set tends to have values closer to controls, albeit still different (Figure 5).

The extreme patterns of HLA SNPs for these three statistics could be driving the patterns seen in the SNPs from balanced genes, of which they are a part.

The PN/PS result is conservative in the sense that, although one could expect lower values for HLA genes (fewer low frequency and thus fewer nonsynonymous variants), it is actually almost two-fold higher (Figure 5). This observation likely results from the high number of sites that are actively maintained by balancing selection in these genes. It is well established that balancing selection has targeted several sites in HLA genes (e.g. Hughes and Nei, 1988; Yang and Swanson, 2002; Bitarello et al., 2015), which could at least partially explain the patterns observed for PN/PS. The mechanisms driving diversity in the other balanced genes, however, remain largely unknown, and it is reasonable to assume that only one (or a few) site(s) has been targeted by balancing selection in a given gene (e.g. in Leffler et al., 2013), i.e., that the HLA represents the exception, rather than the rule.

However, there is no obvious biological explanation as to why Pdel/PS and C scores should be reduced in HLA compared to control SNPs. This suggests that the reason might be related to allelic frequencies. Moreover, the HLA genes are enriched not only in intermediate frequency alleles (which are less likely to be deleterious) but also in number of polymorphic sites (Robinson et al., 2013). Thus, although HLA genes represent only 12 out of 213 balanced genes considered here, given the high SNP density of the MHC region as a whole, they account for a considerable proportion of the SNPs from balanced genes (Table 1). For this reason, in the remaining analyses we estimated load for the set of SNPs from balanced genes with and without the inclusion of HLA SNPs. Because the HLA SNPs change the shape of the SFS, we also re-sampled the control SNPs accordingly.

Increased nonsynonymous to synonymous SNPs in balanced genes

Firstly, we note that PN/PS values are on average higher for European than for African populations (Figure 6). This confirms the finding that European populations have a higher proportion of nonsynonymous variants than African populations (Lohmueller et al., 2008). Since our re-sampling was done by population, we intrinsically take this into account as seen in the control distributions (Figure 6).

The PN/PS values of SNPs from balanced genes are significantly higher than controls (p < 0.01; Figure 6A). These results are not explained by the HLA genes: although their removal reduces the balanced PN/PS (as expected), while slightly increasing the control values (because the target SFS changes, thus resulting in fewer intermediate-frequency variants in the controls), the increase in PN/PS of balanced genes with respect to the controls remains significant, albeit less extreme (Figure 6B, P < 0.01 for all African populations and GBR and FIN). One European population has marginally significant values (P = 0.06, TSI) and for two others (CEU and IBS) PN/PS falls within the control distribution after the removal of HLA SNPs (P > 0.24, Figure 6B).
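A comparison of this kind, between one observed estimate and a distribution of re-sampled control values, can be expressed as a one-sided empirical p-value. The Python sketch below assumes the p-value is the fraction of control sets whose statistic is at least as large as the observed one; this definition is an assumption made for illustration (the procedure actually used is the one given in Methods), and the function name is hypothetical.

```python
def empirical_p_value(observed, control_values):
    """One-sided empirical p-value: fraction of control values >= observed.

    Assumed definition, for illustration only. With 1,000 control sets the
    smallest non-zero p-value that can be reported is 0.001.
    """
    if not control_values:
        raise ValueError("at least one control value is required")
    n_extreme = sum(1 for v in control_values if v >= observed)
    return n_extreme / len(control_values)
```

Under this definition, an observed PN/PS exceeded or matched by fewer than 10 of the 1,000 control sets would give p < 0.01.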

Figure 5: Boxplot of load statistics of sets of SNPs. Balanced, SNPs contained in the balanced genes; balanced.no.HLA, same as the previous category, but excluding 12 HLA genes (see Methods); HLA, a set of 12 HLA genes (see Methods); control, all SNPs not contained in balanced genes (see Methods). Each boxplot is composed of 10 data points, each one corresponding to a population. y-axis: (A) PN/PS, (B) Pdel/PS, (C) Cscore.

The increased PN/PS for balanced genes is also not likely driven by the putative target(s) of balancing selection in these genes: when PN/PS for the balanced genes was estimated after removing the SNP(s) with the highest heterozygosity(ies) for each gene (see Methods), results remain qualitatively similar (Figure 6). In fact, most of the SNPs excluded from the balanced genes were intronic (∼ 90% for all populations, Table 3), and on average 11.5 N and 14.6 S SNPs were removed from each population (Table 3), which makes the PN/PS estimates increase slightly in most cases (Figure 6).

These results suggest there is an excess of nonsynonymous variants within the set of balanced genes, and that this excess, at least for African populations, is not entirely explained by the presence of HLA genes (Figure 6B) nor by the presence of one or a few SNPs per gene that have very high heterozygosities (and are presumably the actual targets of selection). For the European populations the removal of HLA SNPs decreased the estimates in relation to controls in a more pronounced way.

Because nonsynonymous variants are not necessarily deleterious (they might also be neutral), we also investigated two other measures of load that directly quantify deleteriousness.

Increased proportion of damaging to synonymous SNPs in balanced genes

Again, we note that European populations have higher balanced and control values than African populations, as seen previously (Lohmueller et al., 2008).

When comparing Pdel/PS estimates for SNPs from balanced genes and control sets of SNPs, a similar pattern emerges, although less extreme than the one seen for PN/PS: balanced genes tend to have higher load compared to controls (p < 0.05) for all populations, except CEU and IBS (p > 0.14; Figure 7A).

The removal of HLA SNPs only slightly changes the Pdel/PS, and the qualitative relationship between them does not change, with all populations except CEU and IBS having p < 0.05 (Figure 7). This differs from what was observed for PN/PS, where the removal of HLA SNPs made the estimates of load for balanced genes less different from controls, although still highly significant.


Moreover, removing the SNPs with the highest heterozygosity per gene only slightly increases the Pdel/PS estimates, which is compatible with the observation that few of the SNPs removed by this filter are nonsynonymous, and always fewer than the number of synonymous SNPs (Table 3).

The results for Pdel/PS are in agreement with what was observed for PN/PS, suggesting that the patterns observed for PN/PS are driven by deleterious, and not adaptive or neutral nonsynonymous variants.

Increased C-scores in balanced genes

Average scaled C scores yield qualitatively different results with respect to analyses based on PN/PS and Pdel/PS. Firstly, for African populations and for TSI the load estimates for balanced genes are very elevated (p < 0.01) compared to controls, similarly to what was seen for PN/PS (Figure 8A). However, this pattern is not observed for the other European populations, with p-values approaching one for CEU and IBS (Figure 8A). Interestingly, in this case, the removal of HLA SNPs enhances the signal: African values become even more extreme and all the European populations acquire extreme values when compared to controls as well (p < 0.01 for all populations, Figure 8B).

Figure 6: PN/PS for balanced genes. A) Including HLA SNPs; B) removing HLA SNPs. Blue circle, estimate for all protein-coding SNPs in the set; pink triangle, estimate after removal of SNP(s) with highest heterozygosity in each gene (see Methods). **, p-value < 0.01; *, p < 0.05. Reported p-values are for the estimates with all SNPs.

Figure 7: Pdel/PS for balanced genes. A) Including HLA SNPs; B) removing HLA SNPs. Blue circle, estimate for all protein-coding SNPs in the set; pink triangle, estimate after removal of SNP(s) with highest heterozygosity in each gene (see Methods). **, p < 0.01; *, p < 0.05. Reported p-values are for the estimates with all SNPs.

Figure 8: Average scaled Cscore for balanced genes. A) Including HLA SNPs; B) removing HLA SNPs. Blue circle, estimate for all protein-coding SNPs in the set; pink triangle, estimate after removal of SNP(s) with highest heterozygosity in each gene (see Methods). **, p < 0.01; *, p < 0.05. Reported p-values are for the estimates with all SNPs.

Given that the most appropriate set of SNPs for testing our hypothesis of load is the set without HLA genes and without the SNPs with the highest heterozygosities per gene (Figure 8B, pink triangle), it is plausible that the reduction or loss of significance of the load in the set of all balanced genes (in Africa and Europe, respectively) is due to the excess of adaptive variants (from HLA or other genes) present in the complete set, which tend to have lower C scores. Note that the control distributions in Figures 8A and 8B are very similar, and what changes dramatically is the load estimate for the balanced genes. Also, the scaled C scores are PHRED-scaled, ranging from 1 to 99, with the top 10% most deleterious variants having scores above 10, the top 1% above 20, and so on (Kircher et al., 2014). Thus this difference is likely to be even greater than what is conveyed by this analysis.
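The PHRED-like scaling mentioned above follows from ranking variants by raw C score and taking -10·log10(rank/total), so that equal differences in scaled score correspond to order-of-magnitude differences in rank. A minimal Python sketch, assuming this standard form (the function name and the toy numbers are hypothetical):

```python
import math

def phred_scaled(rank, total):
    """PHRED-like scaled score for a variant ranked `rank` (1 = most deleterious)
    among `total` ranked variants: -10 * log10(rank / total)."""
    if not 1 <= rank <= total:
        raise ValueError("rank must lie between 1 and total")
    return -10.0 * math.log10(rank / total)
```

For instance, a variant ranked at the boundary of the top 10% gets a scaled score of 10 and one at the boundary of the top 1% gets 20, which is why a modest difference in mean scaled score can reflect a large difference in rank.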

We also looked at the raw C scores, which provide more power in tests comparing sets of SNPs (Kircher et al., 2014). For African populations, balanced genes have raw C score distributions with significantly higher values than the controls (Mann-Whitney U test, one-tailed, P ≤ 0.05) for more than 70% of the control re-samplings (Table 6), except for GWD, for which only 251 out of 1,000 controls have significantly lower C scores than the balanced genes. For Europe, only TSI has 13% of the controls with significantly lower C scores than the balanced genes, and all other populations have at most 5 such cases (Table 6).

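| Pop | HLA included (P < 0.05) | HLA excluded (P < 0.05) |
|-----|-------------------------|-------------------------|
| YRI | 959 | 1,000 |
| LWK | 805 | 1,000 |
| MSL | 727 | 1,000 |
| GWD | 251 | 1,000 |
| ESN | 995 | 1,000 |
| TSI | 130 | 999 |
| GBR | 5 | 995 |
| FIN | 1 | 999 |
| CEU | 0 | 844 |
| IBS | 0 | 1,000 |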

Table 6: Raw C score comparison between balanced genes and controls. For each comparison, the alternative hypothesis was that balanced genes had higher raw C score values than the control distribution (Mann-Whitney U test, one-tailed). Values refer to the number of comparisons (out of 1,000 control distributions) for which the null hypothesis (distributions are not different) was rejected (P < 0.05).


However, when we perform the same analyses for the balanced genes after the removal of HLA SNPs, balanced genes have higher raw C scores for all comparisons in African populations, and for at least 995 comparisons for all European populations, except CEU, for which 844 comparisons are significant (Table 6).
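The counts in Table 6 come from repeating a one-tailed Mann-Whitney U test against each of the 1,000 control sets. The Python sketch below illustrates that procedure; it is not the R code used here, it relies on the normal approximation with tie correction and no continuity correction (an assumption of this example), and the function names are hypothetical.

```python
import math

def mann_whitney_greater(x, y):
    """One-tailed Mann-Whitney U test (H1: values in x tend to exceed those in y).

    Returns (U, p) using the normal approximation with tie correction.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = r1 - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(var_u)
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability
    return u, p

def count_rejections(balanced_scores, control_sets, alpha=0.05):
    """Number of control sets for which the balanced scores are significantly higher."""
    return sum(1 for c in control_sets
               if mann_whitney_greater(balanced_scores, c)[1] < alpha)
```

With `control_sets` holding the raw C scores of the 1,000 re-sampled controls for one population, `count_rejections` returns a number directly comparable to one cell of Table 6.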

Discussion and Conclusions

Increased genetic load in balanced genes

The study of slightly deleterious mutations is one of the pillars of population genetics (Kondrashov, 1995). The fate of mutations is highly dependent on the effective population size (Ne) and its relationship to the selection coefficients. As a consequence, weakly deleterious mutations might reach moderate frequencies in small, but not in large populations, where selection is more effective. Moreover, linkage to selected variants is also a major determinant of the fate of a deleterious mutation (Hill and Robertson, 1966; Cutter and Payseur, 2013).

The fates of strongly deleterious mutations are mostly deterministic in terms of mutation rates and selection coefficients, i.e., when s ≫ 1/(2Ne) (where s is the selection coefficient). The fate of very slightly deleterious mutations, i.e., almost neutral ones, is, however, mostly stochastic, driven by genetic drift (Kondrashov, 1995). But what happens when considerably strong selection (positive, negative, balancing) on a site impacts the sites in its vicinity? Here, we examined how balancing selection shapes the accumulation of deleterious mutations in the vicinity of its targets.
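To make the s versus 1/(2Ne) comparison concrete, the Python sketch below classifies a mutation by how its selection coefficient compares with that threshold. The ten-fold margin used to stand for "much greater/smaller than" and the value Ne = 10,000 in the usage note are arbitrary assumptions chosen for illustration, and the function name is hypothetical.

```python
def selection_regime(s, ne, factor=10.0):
    """Compare a selection coefficient s with the drift threshold 1 / (2 * Ne).

    `factor` is an arbitrary margin standing in for "much greater than".
    """
    if s < 0 or ne <= 0:
        raise ValueError("s must be non-negative and Ne positive")
    threshold = 1.0 / (2.0 * ne)
    if s >= factor * threshold:
        return "selection-dominated"
    if s <= threshold / factor:
        return "drift-dominated"
    return "intermediate"
```

With Ne = 10,000 the threshold is 5e-5, so a mutation with s = 0.01 falls in the selection-dominated regime, whereas one with s = 1e-6 behaves as effectively neutral.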

We showed that genes with strong signatures of long-term balancing selection have increased levels of nonsynonymous to synonymous polymorphisms, damaging to synonymous polymorphisms, and also elevated deleteriousness scores (Kircher et al., 2014), when compared to controls. We took special care in controlling for the fact that balanced genes have a site-frequency spectrum which is different from the genomic background, with proportionally more intermediate frequency variants, and we also accounted for the fact that within the balanced genes there are sites directly under balancing selection, which could be incorrectly classified as deleterious variants by some classification methods.

Because HLA genes are known as an example of multi-locus balancing selection, i.e., several positions within the HLA genes have been targets of selection (Hughes and Nei, 1988; Yang and Swanson, 2002; Bitarello et al., 2015), it seemed plausible that their considerable contribution to the set of balanced genes could be responsible for the overall patterns we observed. Therefore, in all analyses we compared results for balanced genes including and excluding HLA genes. This approach is conservative, given that not all SNPs in HLA genes are expected to be direct targets of selection. A less drastic solution would be to single out the exons of HLA genes which harbour most, if not all, of the balanced polymorphisms in those genes and exclude only the SNPs contained in those exons (Klein, 1986).

Additionally, we removed from each balanced gene the SNP(s) likely to be the targets of balancing selection, in order to filter from our datasets the potentially conflicting patterns generated by advantageous and deleterious variants within the balanced genes. Importantly, in this approach we only excluded the SNP(s) with the highest heterozygosity among those contained in a window with a very strong signature of LTBS as reported in Bitarello et al. (n.d.), thus increasing the chance that the actual selected site was filtered out.

The challenges of quantifying genetic load

Establishing the damaging potential of a variant is a formidable task in itself (Grimm et al., 2015). Quantifying the genetic load and comparing it between groups (populations, SNPs, genes, etc.) is also challenging, as demonstrated by the great number of published contrasting results regarding genetic load in humans (reviewed in Henn et al., 2015). Therefore, it is important to justify the methodology used here.

We chose to use statistics based on the counts of deleterious variants (PN/PS and Pdel/PS) and deleteriousness scores (C score). With PN/PS and Pdel/PS we quantified the proportions of nonsynonymous and potentially damaging variants, respectively, for balanced genes and control groups. PN simply documents whether the polymorphism changes the encoded amino acid, and thus is unbiased with respect to knowledge of the frequency at which the polymorphism is segregating. Nevertheless, PN counts are composed of neutral, deleterious and advantageous variants and thus are not straightforward to interpret. Pdel is more accurate than PN as a measure of deleteriousness, but it is restricted to nonsynonymous sites and is only available for a subset of the nonsynonymous variants (∼ 80% for the genome, but ∼ 70% for balanced genes), thus reducing its power, particularly in small sets of SNPs. Moreover, PolyPhen-2 has been shown to overfit its training data and not to generalize well to other datasets (Grimm et al., 2015). Neither of these approaches incorporates the frequency of the deleterious mutations when classifying SNPs.
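The three statistics can be summarized for any set of SNPs as in the minimal Python sketch below. It assumes that PN/PS and Pdel/PS are simple ratios of SNP counts and that the C score summary is the mean over all N + S SNPs; the dictionary layout and the function name are hypothetical, and this is not the R code used for the analyses.

```python
def load_statistics(snps):
    """Compute PN/PS, Pdel/PS and the mean C score for a set of SNPs.

    Each SNP is a dict with keys 'class' ('N' or 'S'), 'damaging' (bool,
    e.g. a PolyPhen-2 call, only meaningful for N) and 'cscore' (float).
    """
    n_nonsyn = sum(1 for s in snps if s["class"] == "N")
    n_syn = sum(1 for s in snps if s["class"] == "S")
    n_del = sum(1 for s in snps if s["class"] == "N" and s["damaging"])
    if n_syn == 0:
        raise ValueError("no synonymous SNPs: the ratios are undefined")
    mean_c = sum(s["cscore"] for s in snps) / len(snps)
    return {"PN/PS": n_nonsyn / n_syn, "Pdel/PS": n_del / n_syn, "Cscore": mean_c}
```

Applying the same function to the balanced set and to each re-sampled control set gives directly comparable values, since both contain the same number of SNPs per frequency bin.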

One possible frequency-based measure would be the ratio of heterozygosities at nonsynonymous and synonymous sites (πN/πS), but this ratio is particularly sensitive to recent bottlenecks (reviewed in Brandvain and Wright, 2016). This is because, after a bottleneck, nonsynonymous variation recovers more quickly than synonymous variation (because there are more nonsynonymous sites), and so an elevated πN/πS following a bottleneck could be (wrongly) interpreted as relaxed selection. Do et al. (2015) and Simons et al. (2014) use and recommend a direct estimate of the number of deleterious (or nonsynonymous) mutations (e.g. PN/PS and Pdel/PS), which is robust to violations of demographic equilibrium (reviewed in Brandvain and Wright, 2016; Henn et al., 2015).

Our approach here is thus conservative. Previous studies have shown that for comparisons between African and Out-of-Africa populations, small or no difference is observed when the number of putative deleterious mutations is counted (Henn et al., 2016; Tennessen et al., 2012; Do et al., 2015; Lohmueller, 2014), whereas on average the Out-of-Africa populations are more homozygous for the putative deleterious mutations (Lohmueller et al., 2008), a difference not detectable by these two methods.

In addition to this "SNP counting" approach, we also compared the distribution of deleteriousness among all SNPs within the balanced genes and controls via the C score (Kircher et al., 2014) analyses. The C score is defined for all N + S sites and combines desirable features from other annotations, but is also negatively correlated with allelic frequency (as are the other two statistics used here). The C score does not suffer from poor generalization properties like PolyPhen-2, because the support vector machine was trained on an independent dataset (Grimm et al., 2015; Kircher et al., 2014). However, CADD (C score) was trained on high frequency variants, and although the C scores are available for all 1000 Genomes Phase 3 variants (Kircher et al., 2014), its accuracy in differentiating the deleteriousness of low MAF variants is likely to be lower.

Sheltered load and hitch-hiking

Our observations of increased load in genes with signatures of long-term balancing selection can be explained by two possible mechanisms: 1) as a manifestation of a "sheltered load" (Oosterhout, 2009) and 2) as an effect of linkage of deleterious variants to the balanced polymorphisms, i.e., a hitch-hiking effect (Mendes, 2013; Lenz et al., 2016).

According to the sheltered load model, regions with an excess of heterozygosity would "protect" rare recessive variants from being "seen" by purifying selection, thus contributing to their permanence in the population and at higher frequencies than expected if they were not linked to balanced polymorphisms. This model has been invoked to explain the dynamics of deleterious mutations near the S loci of Arabidopsis and Solanum (Stone, 2004; Roux et al., 2013) and the excess of disease associations in the MHC region (e.g. Oosterhout, 2009).

On the other hand, recent work (Lenz et al., 2016) showed through simulations that deleterious mutations are expected to accumulate in the vicinity of a locus under balancing selection. The simulation framework assumed that several sites were under balancing selection in an HLA-like gene, as is the case for classic HLA genes (e.g. Hughes and Nei, 1988; Yang and Swanson, 2002; Bitarello et al., 2015).

Moreover, the simulations assumed symmetrical overdominance and used realistic parameters from the actual HLA genes and/or human demography, such as effective population size, mutation and recombination rates, and even average selection coefficients for these loci. Finally, loci around the selected HLA-like locus were modelled to be either evolving neutrally or under purifying selection. With these simulations, the authors demonstrated that such a scenario leads to an overall reduction of diversity around the HLA-like locus, but the variants that "survive" tend to segregate at higher frequencies, demonstrating the potential for balancing selection in HLA genes to increase the frequency of deleterious variants around the HLA loci. Lenz et al. (2016) confirm this prediction with empirical data, showing an excess of damaging (Adzhubei et al., 2010) variants in non-HLA loci of the MHC region.

Importantly, the simulations of Lenz et al. (2016) assume an additive model, not a recessive one. Thus, their observations suggest that some mechanism other than the "sheltered load" is responsible for the increased load in the vicinity of HLA genes, and this is likely to be the hitch-hiking effect mentioned above.

References

Abecasis, G. R., A. Auton, L. D. Brooks, M. A. DePristo, R. M. Durbin, R. E. Handsaker, H. M. Kang, G. T. Marth, and G. A. McVean (2012). “An integrated map of genetic variation from 1,092 human genomes”. In: Nature 491 (7422), pp. 56–65.
Adzhubei, I. A., S. Schmidt, L. Peshkin, V. E. Ramensky, A. Gerasimova, P. Bork, A. S. Kondrashov, and S. R. Sunyaev (2010). “A method and server for predicting damaging missense mutations”. In: Nature Methods 7 (4), pp. 248–9.
Albrechtsen, A., I. Moltke, and R. Nielsen (2010). “Natural selection and the distribution of identity-by-descent in the human genome”. In: Genetics 186 (1), pp. 295–308.
Alkan, C. et al. (2009). “Personalized copy number and segmental duplication maps using next-generation sequencing”. In: Nature Genetics 41 (10), pp. 1061–1067.
Andrés, A. M. (2011). “Balancing Selection in the Human Genome”. In: eLS, pp. 1–8.
Andrés, A. M. et al. (2009). “Targets of balancing selection in the human genome”. In: Molecular Biology and Evolution 26 (12), pp. 2755–64.
Auton, A. et al. (2015). “A global reference for human genetic variation”. In: Nature 526 (7571), pp. 68–74.
Betancourt, A. J. and D. C. Presgraves (2002). “Linkage limits the power of natural selection in Drosophila”. In: Proceedings of the National Academy of Sciences of the United States of America 99 (21), pp. 13616–20.
Bitarello, B. D., C. de Filippo, J. C. Teixeira, D. Meyer, and A. M. Andrés. “Uncovering targets of balancing selection in the human genome”. In: in prep.
Bitarello, B. D., R. D. S. Francisco, and D. Meyer (2015). “Heterogeneity of dN/dS Ratios at the Classical HLA Class I Genes over Divergence Time and Across the Allelic Phylogeny”. In: Journal of Molecular Evolution 82 (1), pp. 38–50.


Brandvain, Y. and S. I. Wright (2016). “The Limits of Natural Selection in a Nonequilibrium World”. In: Trends in Genetics 32 (4), pp. 1–10.
Casals, F. et al. (2013). “Whole-Exome Sequencing Reveals a Rapid Change in the Frequency of Rare Functional Variants in a Founding Population of Humans”. In: PLoS Genetics 9 (9). Ed. by S. M. Williams, e1003815.
Charlesworth, B. (2012). “The effects of deleterious mutations on evolution at linked sites”. In: Genetics 190 (1), pp. 5–22.
Charlesworth, B., M. Nordborg, and D. Charlesworth (1997). “The effects of local selection, balanced polymorphism and background selection on equilibrium patterns of genetic diversity in subdivided populations”. In: Genetical Research 70, pp. 155–174.
Charlesworth, D. (2006). “Balancing selection and its effects on sequences in nearby genome regions”. In: PLoS Genetics 2 (4), pp. 379–384.
Chun, S. and J. C. Fay (2011). “Evidence for hitchhiking of deleterious mutations within the human genome”. In: PLoS Genetics 7 (8), e1002240.
Connallon, T. and A. G. Clark (2013). “Antagonistic versus nonantagonistic models of balancing selection: characterizing the relative timescales and hitchhiking effects of partial selective sweeps”. In: Evolution 67 (3), pp. 908–17.
Corona, E., J. T. Dudley, and A. J. Butte (2010). “Extreme evolutionary disparities seen in positive selection across seven complex diseases”. In: PLoS ONE 5 (8), pp. 1–10.
Cutter, A. D. and B. A. Payseur (2013). “Genomic signatures of selection at linked sites: unifying the disparity among species”. In: Nature Reviews Genetics 14 (4), pp. 262–74.
Danecek, P. et al. (2011). “The variant call format and VCFtools”. In: Bioinformatics 27 (15), pp. 2156–2158.
Davydov, E. V., D. L. Goode, M. Sirota, G. M. Cooper, A. Sidow, and S. Batzoglou (2010). “Identifying a High Fraction of the Human Genome to be under Selective Constraint Using GERP++”. In: PLoS Computational Biology 6 (12). Ed. by W. W. Wasserman, e1001025.
R Development Core Team (2009). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN: 3-900051-07-0. URL: http://www.r-project.org.


Do, R., D. Balick, H. Li, I. Adzhubei, S. Sunyaev, and D. Reich (2015). “No evidence that selection has been less effective at removing deleterious mutations in Europeans than in Africans”. In: Nature Genetics 47 (2), pp. 126–131.
Eyre-Walker, A. and P. D. Keightley (1999). “High genomic deleterious mutation rates in hominids”. In: Nature 397 (6717), pp. 344–347.
Fay, J. C. (2011). “Weighing the evidence for adaptation at the molecular level”. In: Trends in Genetics 27 (9), pp. 343–9.
Fu, W. et al. (2012). “Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants”. In: Nature 493 (7431), pp. 216–220.
Grimm, D. G. et al. (2015). “The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity”. In: Human Mutation 36 (5), pp. 513–523.
Henn, B. M., L. R. Botigué, C. D. Bustamante, A. G. Clark, and S. Gravel (2015). “Estimating the mutation load in human genomes”. In: Nature Reviews Genetics 16 (6), pp. 333–343.
Henn, B. M. et al. (2016). “Distance from sub-Saharan Africa predicts mutational load in diverse human genomes”. In: Proceedings of the National Academy of Sciences 113 (4), E440–E449.
Hill, W. G. and A. Robertson (1966). “The effect of linkage on limits to artificial selection”. In: Genetics Research 8 (2), pp. 269–294.
Hodgkinson, A., F. Casals, Y. Idaghdour, J.-C. Grenier, R. D. Hernandez, and P. Awadalla (2013). “Selective constraint, background selection, and mutation accumulation variability within and between human populations”. In: BMC Genomics 14, p. 495.
Hughes, A. L. and M. Nei (1988). “Pattern of nucleotide substitution at major histocompatibility class I loci reveals overdominant selection”. In: Letters to Nature 335 (8), pp. 167–170.
Kiezun, A. et al. (2013). “Deleterious Alleles in the Human Genome Are on Average Younger Than Neutral Alleles of the Same Frequency”. In: PLoS Genetics 9 (2), pp. 1–12.
Kircher, M., D. M. Witten, P. Jain, B. J. O’Roak, G. M. Cooper, and J. Shendure (2014). “A general framework for estimating the relative pathogenicity of human genetic variants”. In: Nature Genetics 46 (3), pp. 310–315.
Klein, J. (1986). Natural History of the Major Histocompatibility Complex. New York: John Wiley & Sons, Ltd.
Kondrashov, A. S. (1995). “Contamination of the genome by very slightly deleterious mutations: why have we not died 100 times over?” In: Journal of Theoretical Biology 175 (4), pp. 583–594.


Leffler, E. M. et al. (2013). “Multiple Instances of Ancient Balancing Selection Shared Between Humans and Chimpanzees”. In: Science 339 (6127), pp. 1578–1582.
Lenz, T. L., V. Spirin, D. M. Jordan, and S. R. Sunyaev (2016). “Excess of Deleterious Mutations around HLA Genes Reveals Evolutionary Cost of Balancing Selection”. In: bioRxiv, pp. 1–30.
Lohmueller, K. E. (2014). “The distribution of deleterious genetic variation in human populations”. In: Current Opinion in Genetics & Development 29C, pp. 139–146.
Lohmueller, K. E. et al. (2008). “Proportionally more deleterious genetic variation in European than in African populations”. In: Nature 451 (7181), pp. 994–997.
Mendes, F. (2013). Natural selection on HLA and its effects on adjacent regions of the genome. Tech. rep. Universidade de São Paulo. URL: http://www.teses.usp.br/teses/disponiveis/41/41131/tde-02082013-161104/pt-br.php.
Morton, N. E., J. F. Crow, and H. J. Muller (1956). “An Estimate of the Mutational Damage in Man From Data on Consanguineous Marriages”. In: Proceedings of the National Academy of Sciences of the United States of America 42 (11), pp. 855–863.
Nielsen, R. (2005). “Molecular Signatures of Natural Selection”. In: Annual Review of Genetics 39 (1), pp. 197–218.
Oosterhout, C. van (2009). “A new theory of MHC evolution: beyond selection on the immune genes”. In: Proceedings of the Royal Society of London. Series B, Biological Sciences 276 (1657), pp. 657–65.
Peischl, S., I. Dupanloup, M. Kirkpatrick, and L. Excoffier (2013). “On the accumulation of deleterious mutations during range expansions”. In: Molecular Ecology 22 (24), pp. 5972–82.
Peischl, S. and L. Excoffier (2015). “Expansion load: recessive mutations and the role of standing genetic variation”. In: Molecular Ecology 24 (9), pp. 2084–2094.
Robinson, J., J. A. Halliwell, H. McWilliam, R. Lopez, P. Parham, and S. G. E. Marsh (2013). “The IMGT/HLA database”. In: Nucleic Acids Research 41, pp. 1222–7.
Roux, C., M. Pauwels, M. V. Ruggiero, D. Charlesworth, V. Castric, and X. Vekemans (2013). “Recent and ancient signature of balancing selection around the S-Locus in Arabidopsis halleri and A. lyrata”. In: Molecular Biology and Evolution 30 (2), pp. 435–447.
Schierup, M. H., D. Charlesworth, and X. Vekemans (2000). “The effect of hitch-hiking on genes linked to a balanced polymorphism in a subdivided population”. In: Genetical Research 76 (01), pp. 63–73.


Simons, Y. B., M. C. Turchin, J. K. Pritchard, and G. Sella (2014). “The deleterious mutation load is insensitive to recent population history”. In: Nature Genetics 46 (3), pp. 220–224.
Stone, J. L. (2004). “Sheltered load associated with S-alleles in Solanum carolinense”. In: Heredity 92 (4), pp. 335–42.
Subramanian, S. (2012). “The abundance of deleterious polymorphisms in humans”. In: Genetics 190 (4), pp. 1579–83.
Subramanian, S. (2016). “Europeans have a higher proportion of high-frequency deleterious variants than Africans”. In: Human Genetics 135 (1), pp. 1–7.
Sunyaev, S., V. Ramensky, I. Koch, W. Lathe 3rd, A. S. Kondrashov, and P. Bork (2001). “Prediction of deleterious human alleles”. In: Human Molecular Genetics 10 (6), pp. 591–597.
Tennessen, J. A. et al. (2012). “Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes”. In: Science 337 (6090), pp. 64–69.
Tishkoff, S. A. and S. M. Williams (2002). “Genetic analysis of African populations: human evolution and complex disease”. In: Nature Reviews Genetics 3 (8), pp. 611–621.
Yang, Z. and W. J. Swanson (2002). “Codon-Substitution Models to Detect Adaptive Evolution that Account for Heterogeneous Selective Pressures Among Site Classes”. In: Molecular Biology and Evolution 19 (1), pp. 49–57.

Final Considerations and Perspectives

Here, I recapitulate the questions I set out to address in the Introduction (page 49), summarize the conclusions reached through the investigations of Chapters 1 and 2, and discuss the perspectives arising from this work.

Balancing selection in the human genome

Development and evaluation of a new method for detecting signatures of balancing selection

In Chapter 1, we describe a new method for detecting signatures of long-term balancing selection (LTBS) in humans: Non-Central Deviation (NCD). The method comprises two statistics: NCD1 uses only the allele frequency spectrum, whereas NCD2 also uses the information contained in the divergence between humans and chimpanzees. Combining two signatures of LTBS gives NCD2 greater power than NCD1. Even so, the performance of NCD1 is comparable to that of other methods commonly used to detect genes or regions under natural selection, and we recommend it for species for which divergence data from closely related species are not available.

Using neutral simulations and simulations with selection, based on a detailed model of human demography, we show that both statistics have high power to detect signatures of long-term balancing selection (LTBS) in African and European populations for sequences that are not very long (<= 6,000 base pairs). We evaluated which combination of possible implementations maximizes the power of the statistics and found that, for selection that arose at least 3 million years ago, NCD2 has the highest power for sequences of 3,000 base pairs. For more recent selection (i.e., starting less than 1 million years ago), NCD1 has greater power than NCD2, but because power is generally low in this case, on the order of 30-40% (at a false positive rate of 5%), we emphasize that both statistics are best suited to detecting balancing selection that has persisted for at least 3 million years (in humans). In addition, we show that NCD2 outperforms existing methods: Tajima's D (Tajima, 1989), the HKA test (Hudson et al., 1987), the T1 and T2 tests (DeGiorgio et al., 2014), and a combination of NCD1+HKA.
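As an illustration of how power at a fixed false positive rate can be read off two sets of simulations, the following minimal sketch may help. It assumes that low values of the statistic indicate balancing selection and that the critical value is taken from the lower tail of the neutral simulations; the function name and inputs are hypothetical and are not taken from the Chapter 1 pipeline.

```python
def power_at_fpr(neutral_stats, selected_stats, fpr=0.05):
    """Fraction of selected simulations detected at a given false positive rate.

    Assumption: low values of the statistic are evidence of balancing
    selection, so the critical value is the lower `fpr` quantile of the
    values obtained from neutral simulations.
    """
    ordered = sorted(neutral_stats)
    # Number of neutral simulations allowed to fall at or below the threshold
    # (small epsilon guards against floating-point truncation).
    n_allowed = int(fpr * len(ordered) + 1e-9)
    if n_allowed == 0:
        raise ValueError("too few neutral simulations for this false positive rate")
    threshold = ordered[n_allowed - 1]
    detected = sum(1 for value in selected_stats if value <= threshold)
    return detected / len(selected_stats)
```

With 10,000 neutral values and `fpr=0.05`, the threshold in this sketch is the 500th smallest neutral value, and power is the share of simulations with selection that fall at or below it.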

A distinguishing feature of our statistics relative to existing ones is that a "target frequency" can be defined, from which the deviation of allele frequencies is calculated. That is, the statistic can be computed under the assumption that the balanced polymorphisms are segregating at frequencies other than 0.5. In the power evaluation we considered the frequencies 0.3, 0.4 and 0.5. The simulations showed that the gain in power of NCD1 and NCD2 over the other statistics is larger when equilibrium frequencies other than 0.5 are simulated. We thus show that NCD2 has greater power than the other statistics used to detect signatures of balancing selection.
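To make the role of the target frequency concrete, the sketch below computes a deviation of allele frequencies from a chosen target. The functional form is an assumption made for illustration: a root-mean-square deviation of minor allele frequencies, with fixed differences to the outgroup entering the second statistic at frequency 0. Chapter 1 gives the actual definition, and the function names here are hypothetical.

```python
from math import sqrt


def ncd1(minor_allele_freqs, target_freq=0.5):
    """Root-mean-square deviation of SNP minor allele frequencies from a target.

    Assumed form, for illustration only: sqrt(sum((p_i - tf)^2) / n).
    """
    if not minor_allele_freqs:
        raise ValueError("no informative sites in the window")
    squared = [(p - target_freq) ** 2 for p in minor_allele_freqs]
    return sqrt(sum(squared) / len(squared))


def ncd2(minor_allele_freqs, n_fixed_differences, target_freq=0.5):
    """As ncd1, but fixed differences to the outgroup count as sites with frequency 0."""
    freqs = list(minor_allele_freqs) + [0.0] * n_fixed_differences
    return ncd1(freqs, target_freq)
```

Under this form, a window whose SNPs all segregate near the target frequency yields a value close to zero, and every fixed difference pushes the second statistic upwards, which is how divergence information can sharpen the signal.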

It is important to note that power was evaluated under a demographic model for human populations that is quite complex and realistic (Gravel et al., 2011). This suggests that the observation that our method outperforms the others is not restricted to an unrealistic scenario, but is based on the polymorphism patterns expected in humans.

Prevalence of long-term balancing selection in the human genome

There is still considerable controversy about the importance of balancing selection as a microevolutionary process shaping human genetic diversity (Andrés et al., 2009; Bubb et al., 2006; Leffler et al., 2013). Is it rare, involving few genomic regions? Does it act more often on protein-coding or on regulatory regions? What functions do the genes under balancing selection perform? Is the selective regime shared among distinct populations? The method we developed was motivated by the aim of helping to answer these questions.

The analyses in Chapter 1 show that, for humans, NCD2 performs better than NCD1. We therefore calculated NCD2 for windows of 3,000 base pairs along the entire genome, using genomic data from four populations (two African, two European) from the 1000 Genomes Project. Since in practice the frequency at which balanced polymorphisms are segregating is unknown, we calculated NCD2 for three target frequencies (0.3, 0.4, 0.5) and combined the results. We took particular care to filter the data beforehand, removing regions that could show signatures similar to those expected under balancing selection but arising from other causes. These include regions with tandem repeats, large chromosomal duplications, regions lacking orthology with chimpanzee, and regions that are not unique.

In order to determine the likely targets of balancing selection, we combined two strategies: (1) a simulation-based significance criterion, in which a window is considered significant if its NCD2 value for a given target frequency is lower than those of 10,000 neutral simulations with the same number of informative sites (yielding about 0.50% of the windows per population, considering the union of all target frequencies); and (2) a criterion based on ranking within the genomic distribution, after applying a correction that accounts for the number of informative sites in the window. With the second criterion, we define as outliers the windows in the tail of the empirical distribution (0.05%), which are essentially a subset of the windows obtained with the first criterion.
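The two criteria can be sketched as follows. This is a minimal illustration under stated assumptions: the neutral values passed to each function are taken to come from simulations matched to the window's number of informative sites, and the standardization shown is only one possible form of the correction (the text does not specify it here). All function names are hypothetical.

```python
from statistics import mean, stdev


def is_significant(observed, neutral_values):
    """Criterion 1: the observed value is lower than every neutral simulation."""
    return observed < min(neutral_values)


def standardized_score(observed, neutral_values):
    """Hypothetical correction: distance from the neutral mean in units of neutral SD.

    Using neutral simulations matched for the number of informative sites
    makes scores comparable across windows with different amounts of data.
    """
    return (observed - mean(neutral_values)) / stdev(neutral_values)


def empirical_outliers(scores_by_window, tail=0.0005):
    """Criterion 2: windows in the lowest `tail` fraction of the genome-wide ranking."""
    ranked = sorted(scores_by_window, key=scores_by_window.get)
    # Always return at least one window, even for very small inputs.
    n_outliers = max(1, int(len(ranked) * tail))
    return set(ranked[:n_outliers])
```

The default `tail=0.0005` corresponds to the 0.05% empirical cut-off mentioned above. Because the outliers are the lowest-ranked windows, they would largely be a subset of the windows passing the first criterion.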

Finally, we report as outlier genes those that have at least one outlier window (regardless of target frequency) in at least two populations of the same continent. In this way we expect to reduce false positives that could arise from some property of the data of a particular population, given that, on the timescale we investigate, we expect populations from the same continent to have shared selective pressures as well as demographic history. Our results show that at least 1% of the genes in the genome have extreme (outlier) signatures of balancing selection, and possibly more, up to 8% (Table S8, Chapter 1). Even the most conservative estimate of 1% is considerably higher than what had been observed to date. For example, only 0.4% of the 13,500 genes analysed by Andrés et al. (2009) showed strong signatures of balancing selection. Our higher estimate probably results from multiple factors: the high power of NCD2, the genomic data used, the fact that even with all our filters we retained more than 18,000 autosomal genes in the analyses, and the fact that the windows analysed are small, which increases the probability of detecting a signature of long-term balancing selection (Andrés, 2011; Charlesworth, 2009).
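The rule used to call outlier genes can be written down compactly. The sketch below is a hypothetical illustration: the population labels are placeholders, and with two populations per continent "at least two populations of the same continent" is implemented as "every population of some continent".

```python
def outlier_genes(outlier_windows_by_pop, continent_of):
    """Genes with >= 1 outlier window in every population of at least one continent.

    outlier_windows_by_pop: {population: {gene: number of outlier windows}}
    continent_of: {population: continent}
    """
    pops_by_continent = {}
    for pop, continent in continent_of.items():
        pops_by_continent.setdefault(continent, []).append(pop)

    genes = set()
    for pops in pops_by_continent.values():
        # Genes with at least one outlier window, separately for each population.
        per_pop = [
            {g for g, n in outlier_windows_by_pop.get(pop, {}).items() if n >= 1}
            for pop in pops
        ]
        # Keep only genes supported by all populations of this continent.
        genes |= set.intersection(*per_pop)
    return genes


# Placeholder labels: two African and two European populations.
continent_of = {"AFR1": "Africa", "AFR2": "Africa", "EUR1": "Europe", "EUR2": "Europe"}
```

Requiring agreement within a continent is what filters out signals that arise from a property of a single population's data.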

Our study identified genes with extreme signatures and indicated how prevalent balancing selection may have been in human evolutionary history. Of the 213 genes with extreme signatures of long-term balancing selection (LTBS), 30% had already been detected in a previous scan, and others (at least four) in candidate-gene studies. In other words, about 70% of the genes we present are new to the balancing selection literature. Additionally, our most inclusive (i.e., least conservative) list of 1,470 genes with less extreme signatures indicates that LTBS may have been even more common.

Sharing between continents

A large proportion of the candidate windows is shared between at least two of the populations analysed (87%), particularly between populations of the same continent (78%). Even where a gene does not meet the criterion of belonging to both continents, the great majority have signatures on both continents (that is, in at least three of the four populations analysed), with rare exceptions. Finally, about 32% of the outlier genes (69 genes, Table 3, Chapter 1) are shared among all four populations.

Our findings confirm that the degree of sharing between populations of the same continent is greater than between populations of different continents. This observation can be interpreted as reflecting shared historical selective pressures, as well as shared demographic factors, which influence the genetic variability that is available for balancing selection to act upon.

The fact that many of the targets (genes and windows) are shared between continents is compatible with the timescale of the selective regime we investigate (>= 3 million years). Even though Africa and Europe have diverged in several respects in recent human history, in terms of both demographic history and selective pressures, it is plausible that many targets of long-term balancing selection (LTBS) have been maintained in both, and/or that they ceased to be selected in one of the continents only recently, thereby preserving the signatures of LTBS to the present.

Characteristics of the candidate regions

Immune response

We observed an enrichment for certain functional categories among the significant and outlier genes. About half of the enriched categories are related to the immune response, broadly defined, and of these, about half are directly linked to antigen presentation by HLA molecules.

Evidence of balancing selection on several classical HLA class I and II genes is abundant in the literature. Indeed, these genes lie within the significant windows and also within the outlier windows, which have the most extreme signatures. We therefore investigated whether the HLA genes were driving the enrichment of categories related to the immune system. After removing these genes, no category remained enriched among the outlier genes. This shows, first, the strong influence of the HLA genes on the most restricted set of candidate genes and, second, that the remaining data set is small (on average 177 genes per population), which can reduce the power of tests that aim to detect enrichment of a functional class among the selected genes (even categories that are not composed exclusively of HLA genes cease to be significant once these genes are removed).
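The loss of power with a small candidate set can be illustrated with a simple over-representation test. The sketch below uses a one-sided hypergeometric test, which is an assumption made for illustration; the text does not state which enrichment procedure was used, and the function name and arguments are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_in_category, n_candidates, n_overlap):
    """Probability of seeing at least `n_overlap` category genes among the candidates.

    Upper tail of the hypergeometric distribution: candidates are treated as
    a random draw of `n_candidates` genes from `n_background` genes, of which
    `n_in_category` belong to the functional category.
    """
    upper = min(n_in_category, n_candidates)
    total = comb(n_background, n_candidates)
    tail = sum(
        comb(n_in_category, k) * comb(n_background - n_in_category, n_candidates - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

For a fixed proportion of category genes among the candidates, the p-value from such a test grows as the candidate set shrinks, which is one way to see why removing genes from an already small outlier set can erase enrichments that would survive in a larger set.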

On the other hand, it is interesting to note that even after removal of the classical HLA genes, some functional categories remained enriched among the significant genes, some of them related to the immune system but involving other genes, including non-classical HLA genes. In fact, one third of the significant genes are related to immune functions, even if they do not make up enriched categories. Among the other categories is, for example, "extracellular region", which confirms the observation that genes related to the extracellular matrix tend to be in excess among the targets of long-term balancing selection in humans (reviewed in Key et al., 2014).

We thus corroborate that the immune response is an important selective pressure responsible for instances of balancing selection, and we also detect strong signatures in some candidate genes related to reproduction. Five significant genes are related to spermatogenesis, although the category is not enriched, and one of the 10 most extreme genes (C1orf101) is highly expressed in testis; although its function is still unknown, there are indications that it may be related to the CATSPER complex of Ca2+ channels, which are crucial for the cell-surface signalling that leads to fertilization. In sum, although these two selective pressures of undeniable importance (defence of the organism and reproduction) do not appear to underlie the majority of targets of long-term balancing selection (LTBS), they are involved in more than one third of the genes with the strongest signatures of LTBS.

Reliability of the targets of long-term balancing selection

Another gene category enriched among the candidates is the olfactory receptors. This gene family is difficult to analyse because it is the result of multiple duplications. Our analyses do not allow us to exclude the hypotheses that: (a) the signatures of long-term balancing selection in these genes are caused by gene conversion between paralogs located close to one another (a biological phenomenon, but one distinct from balancing selection and capable of generating a similar signature); or (b) the excess of SNPs at intermediate frequencies in these genes results from reads from distinct but highly similar genes having been mapped to a single gene in the reference genome, thus artificially inflating the proportion of alleles at intermediate frequencies.

It would be plausible to suppose that both confounders (one caused by a biological phenomenon, the other by bioinformatic problems with sequencing data) could be occurring more generally among our candidate genes. To assess the credibility of the candidate regions with respect to gene conversion between paralogs located close to one another, we took, for each candidate gene, the number of paralogs located on the same chromosome (which makes non-allelic conversion possible) and compared its distribution with that for all other genes. The distributions are essentially identical, so our candidate genes do not, in general, tend to have more paralogs located on the same chromosome. Because this kind of non-allelic gene conversion occurs between homologous (paralogous) genes, physical proximity is required.
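The comparison described above amounts to tabulating, for each gene, how many of its paralogs sit on the same chromosome, and then comparing the resulting distributions between candidate genes and all other genes. A minimal sketch follows; the input dictionaries are hypothetical stand-ins for whatever paralog and position annotation is available.

```python
from collections import Counter


def same_chromosome_paralog_counts(genes, chromosome_of, paralogs_of):
    """For each gene, count its paralogs that lie on the same chromosome.

    chromosome_of: {gene: chromosome}; paralogs_of: {gene: list of paralogs}.
    """
    counts = {}
    for gene in genes:
        chrom = chromosome_of[gene]
        counts[gene] = sum(
            1 for p in paralogs_of.get(gene, ()) if chromosome_of.get(p) == chrom
        )
    return counts


def count_distribution(counts):
    """Proportion of genes having 0, 1, 2, ... same-chromosome paralogs."""
    tally = Counter(counts.values())
    total = sum(tally.values())
    return {k: tally[k] / total for k in sorted(tally)}
```

Running `count_distribution` once on the candidate genes and once on the remaining genes gives two distributions that can be inspected side by side or passed to a two-sample test of one's choice.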

Regarding undetected duplications, we took four of the 10 genes with the most extreme signatures (among those that had never appeared in other studies of balancing selection) and verified that few SNPs within these genes could be artefacts generated by undetected duplications, and that even when such SNPs are excluded, the genes in question still have extreme signatures of long-term balancing selection. Finally, of the 213 genes with the most extreme signatures, only two are olfactory receptors (Table 3, Chapter 1), which implies that: (1) it is plausible that they are not false positives, given all the precautions we took, although we cannot rule out this possibility; and (2) our checks make us confident that biases of this kind are not a feature of the candidate genes in general.

Regulatory regions versus protein-coding regions

In a scan for balanced polymorphisms shared between humans and chimpanzees, Leffler et al. (2013) reported that, of 125 haplotypes shared between humans and chimpanzees (interpreted as a signature of long-term balancing selection, LTBS), 123 occur in non-genic genomic regions. Combining this observation with the fact that few cases of genes targeted by LTBS have been described, it would be plausible to suppose that most sites targeted by balancing selection are regulatory. In our study, we found that although the significant windows represent only about 0.5% of the windows analysed, they correspond to about 8% of protein-coding genes. On the other hand, we did not detect proportionally more windows that include genes among the significant ones when compared with the non-significant ones.

To explore whether long-term balancing selection tends to act on regulatory sites, we investigated whether there was an excess of SNPs with regulatory function in the significant windows. At first we found that such an excess, which is highly significant (Figure 7, Chapter 1), exists for SNPs with several regulatory functions, including that of eQTL. However, SNPs without eQTL annotation but with other regulatory functions show no enrichment. Lastly, we determined that, when only SNPs at intermediate frequency are considered, there is no enrichment for eQTLs, showing that a positive eQTL annotation is positively correlated with the frequency of the SNP. Our finding shows that, because there is an excess of variants segregating at intermediate frequency in regions under balancing selection, enrichment for genomic features whose detection is sensitive to allele frequency (as is the case for eQTLs) will be biased. Finally, we detected an excess of SNPs without any annotation of regulatory function in the candidate windows.

Despite this lack of evidence for an excess of SNPs with regulatory functions among the most extreme windows, we detected a subtle but significant enrichment for monoallelic expression (MAE) among the 213 genes with the most extreme signatures of long-term balancing selection (LTBS) (Savova et al., 2016). A recent study (Savova et al., 2016) reported that a considerable proportion of human genes (~25%) show monoallelic expression [2]. They further report that, among genes with signatures of LTBS, there is an enrichment of genes with an MAE signature. We confirm this relationship with our finding of an excess of MAE genes among the most extreme genes.

[2] For most genes in diploid organisms, gene expression is believed to occur simultaneously from both alleles. For others, only one of the alleles, the maternal or the paternal, is expressed, while the other is inactivated. This pattern is achieved through epigenetic modifications, leading to monoallelic expression that is maintained across mitotic divisions.

As argued by Savova et al. (2016), this finding may indicate a possible evolutionary link between MAE and heterozygote advantage: many MAE genes encode proteins expressed on the cell surface and modulate interactions between the cell and its surroundings, including other cells. Heterozygosity at an MAE site could lead to different alleles being inactivated in cells of the same tissue, reducing the chance of a "monoculture" and thus lowering the susceptibility of the tissue as a whole to infectious agents (Savova et al., 2016). On the other hand, Savova et al. (2016) discuss that it is entirely possible that monoallelic expression and the maintenance of diversity through selection are independent phenomena that target the same molecular components.

Finally, for some targets of long-term balancing selection, cases have been reported in which a variant changes the tissue in which the gene is expressed. One example is the gene B4galnt2: in mice, a variant shifts expression from its usual site (intestinal epithelium) to another (vascular endothelium). The human ortholog of this gene (B4GALNT2) is one of our candidate genes, discussed in Chapter 1. Another example is HLA-G, also among our candidates, for which an extensive literature describes complex expression patterns (e.g. Tan et al., 2005). We therefore tested whether such patterns are recurrent among our genes and detected a significant excess of genes expressed in only one human tissue: 12 expressed in the adrenal gland and 25 expressed in the lung.

In sum, many of the targets of long-term balancing selection (LTBS) are protein-coding genes, most of which had never been reported in a study of balancing selection, and we found no evidence of an excess of regulatory functions among the windows that do not include genes. On the other hand, we found enrichment for genes with monoallelic expression and with tissue-specific expression, suggesting that there may indeed be an excess of LTBS targets with regulatory functions.


Deleterious variation in regions and genes with signatures of long-term balancing selection

In addition to balancing selection and positive selection, selection against deleterious mutations is a fundamental evolutionary process, capable of influencing quantitative variation in traits of ecological and medical importance. With the constant influx of new deleterious mutations arising in populations, some will segregate transiently within populations, resulting in a balance between mutation and selection that is influenced by the mutation rate, the effective population size and the intensity of selection on the mutation. However, the contribution of such deleterious variants to traits shaped by quantitative genetic variation remains poorly understood (Mitchell-Olds et al., 2007). Several association studies in humans have identified polymorphisms segregating at intermediate frequencies that influence variation in complex traits (Mitchell-Olds et al., 2007) [3].

[3] Here I refer to traits that are believed to result from genetic variation in multiple genes and their interactions with environmental and behavioural factors (Mitchell-Olds et al., 2007).

In Chapter 2, we show that genes with extreme signatures of balancing selection have a higher genetic load than regions presumably evolving neutrally. The controls took into account the fact that the allele frequency spectrum of the balanced genes has proportionally fewer rare variants than the genomic control. We used three different metrics to quantify this excess: two of them directly count the number of potentially deleterious variants divided by the number of neutral variants, and the other assigns to each variant a measure that quantifies how deleterious it is. The distributions of these measures for balanced genes and controls could thus be compared.
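The two kinds of metric can be sketched in a few lines. This is an illustration under stated assumptions: the class labels ("deleterious", "neutral") and the per-variant score are hypothetical stand-ins for whichever predictor is used in Chapter 2, and the comparison against controls is reduced to a simple proportion.

```python
from statistics import mean


def deleterious_to_neutral_ratio(variant_classes):
    """Number of putatively deleterious variants divided by the number of neutral ones."""
    n_deleterious = sum(1 for c in variant_classes if c == "deleterious")
    n_neutral = sum(1 for c in variant_classes if c == "neutral")
    if n_neutral == 0:
        raise ValueError("no neutral variants to normalise by")
    return n_deleterious / n_neutral


def mean_deleteriousness(scores):
    """Average of a per-variant score (assumed: higher means more deleterious)."""
    return mean(scores)


def fraction_of_controls_exceeded(candidate_value, control_values):
    """Share of control genes whose load estimate is below the candidate's."""
    below = sum(1 for v in control_values if v < candidate_value)
    return below / len(control_values)
```

Because balanced genes carry proportionally fewer rare variants, a comparison of this kind is only meaningful when the control values come from genes or regions matched for allele frequency spectrum, as described above.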

All three estimates are higher for the balanced genes than for the controls, with few exceptions. Moreover, when we removed the HLA genes, which have many adaptively maintained sites and could confound the interpretation of the estimates, the results were qualitatively similar. Lastly, we evaluated the impact that the potentially selected sites in the balanced genes have on these estimates, and found that the observations hold even when they are removed.

In short, three different metrics provide evidence of an excess of genetic load in the vicinity of regions with signatures of balancing selection. This result can be interpreted in two ways: (1) as evidence of sheltered load⁴; or (2) as evidence of hitchhiking of deleterious variants with the balanced polymorphisms, as explained in Figure 1 of Chapter 2.

Our results do not allow us to choose between these two explanations. However, Lenz et al. (2013) showed, through simulations of HLA genes with multiple selected sites and their adjacent regions, that even under an additive (non-recessive) model an increase in genetic load is expected in regions adjacent to the HLA genes, and that this effect decreases with distance from the genes. Extrapolating from these observations, it is plausible that the same occurs in the vicinity of other targets of balancing selection.

⁴ The idea that recessive deleterious variants will rarely be homozygous when they are located in the HLA genes, because the region has high heterozygosity. Such deleterious variants would thus be sheltered from purifying selection (Oosterhout, 2009).

To distinguish between these two scenarios, one could: (1) use simulations to verify whether the same patterns are observed under a model of balancing selection with a single selected site rather than multiple sites; (2) test whether there is an excess of disease associations in the genomic regions of genes under balancing selection; and (3) test whether the excess of genetic load is smaller (but still significant) for genes neighboring the balanced genes, and/or define genomic windows around the genes and verify whether the genetic load decreases with distance from the target gene. Although some questions remain open, our work contributes to two stimulating fields of evolutionary biology: the study of the accumulation of deleterious mutations in the human genome and the study of the evolutionary importance of balancing selection in human evolution.

Perspectives

Reconciling signatures of selection and phenotypes

“(...) genome-wide scans are a hatchet, whereas what we need now is a scalpel. In-depth follow-up studies of individual outlier loci can be one such scalpel, more precisely defining important population genetic parameters such as the timing and magnitude of selection, the geographic distribution of selected variation, the interaction of population demographic history, recombination, and selection in shaping patterns of variation, and the functional form of selection acting on individual outlier loci” (Akey, 2009)

Strictly speaking, evidence of adaptive evolution does not demonstrate that a given substitution or polymorphism is adaptive at the phenotypic level; it indicates the region where such a variant is likely to be found. Population-genetic studies can identify genes targeted by selection, i.e., genes that evolved non-neutrally over human evolutionary history (Chapter 1), but cannot by themselves provide information about the phenotypic traits that represent the true targets of selection (Mitchell-Olds et al., 2007). So far, the causal relationship between a polymorphism and a phenotype of interest has been traced in very few cases because, in both research and clinical practice, the ability to detect genetic variants far exceeds the ability to systematically evaluate their potential effects (Kircher et al., 2014). Despite this large gap, functional assays have become more common with the publication of new catalogs of genes and genomic regions that are candidate targets of balancing selection.

For example, in an elegant study, Chakraborty and Fry (2015) showed how a polymorphism in a pleiotropic gene, which encodes the enzyme aldehyde dehydrogenase, is actively maintained owing to differences in the alcohol concentration of fruits in the different environments occupied by Drosophila. The enzyme has two functions: the metabolism of ethanol and the metabolism of other aldehydes arising from oxidative phosphorylation; the latter is the probable ancestral function and the former the derived one. The two variants have different fitnesses in different habitats, depending on the fly's diet. The authors identified an amino acid substitution responsible for the two variants of the enzyme and measured the efficiency of both variants on different substrates, thereby revealing the fitness of each variant in two types of environment.

Another example is the gene ERAP2, which encodes a protein involved in the antigen presentation pathway of MHC class I molecules. This gene shows signatures of long-term balancing selection (SBLP) in our study (Table S8, Chapter 1) and had already been reported as a candidate by Andrés et al. (2009). A later study (Andrés et al., 2010) demonstrated that balancing selection maintains two haplotypes, A and B, segregating at intermediate frequencies, and that one of them results in a truncated protein. That study further shows that homozygotes for this haplotype have reduced expression of MHC class I molecules on the surface of T lymphocytes. Although the selective pressure that maintains this variant is still unknown, the study presented bioinformatic, molecular, cellular, and immunological evidence showing that the gene may have undergone balancing selection, the impact of the likely selected site on the protein, and a downstream consequence of this variation for antigen presentation.

Although elucidating the causal relationship between genotype and phenotype, as in the examples above, is beyond the scope of the present work, we took important steps in that direction by exploring properties of the candidate regions. In Chapter 1, within these limitations, we explored the biological basis of the targets of balancing selection by examining the functional categories to which they belong, the proportion of coding sites and, among these, the proportion of nonsynonymous sites. In Chapter 2, we analyzed in greater detail the properties of the sites contained in the target regions of balancing selection. We were thus able to test hypotheses about the accumulation of deleterious mutations in regions under balancing selection, and we deepened our understanding of the potential targets of balancing selection in the human genome.

We believe that, with the renewed interest in targets of balancing selection in humans in the literature, many of the candidate genes raised in our work will be investigated in more detail, with respect to both genomic patterns and, in functional studies, possible phenotypic effects and causal mutations.

Potential of the NCD statistics in future studies

In Chapter 1, I showed that the two new statistics we proposed, NCD1 and NCD2, have high power relative to other statistics commonly used to detect signatures of balancing selection.
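For reference, the sketch below computes an NCD-style statistic for a single genomic window, following the definition as we understand it from Chapter 1: the root mean squared distance between the minor allele frequencies of the informative sites and a target frequency tf. In NCD1 only SNPs are informative; in NCD2 fixed differences relative to an outgroup are also counted, each with a frequency of 0. The function name, the input layout, and the example values are assumptions made for illustration, and window definition and filtering are not covered.

```python
import math

def ncd(snp_frequencies, target_freq=0.5, n_fixed_differences=0):
    """Root mean squared deviation of minor allele frequencies from target_freq.

    snp_frequencies: allele frequencies of the SNPs in the window
    n_fixed_differences: 0 gives an NCD1-style value; > 0 adds fixed
        differences (frequency 0), giving an NCD2-style value
    """
    # Fold each frequency so that it lies in [0, 0.5] (minor allele frequency).
    mafs = [min(p, 1.0 - p) for p in snp_frequencies]
    mafs += [0.0] * n_fixed_differences
    if not mafs:
        raise ValueError("window has no informative sites")
    return math.sqrt(sum((p - target_freq) ** 2 for p in mafs) / len(mafs))

window = [0.5, 0.4, 0.6, 0.1]               # hypothetical SNP frequencies
print(ncd(window))                           # NCD1-style, tf = 0.5
print(ncd(window, n_fixed_differences=2))    # NCD2-style
```

Lower values mean that site frequencies lie closer to the target frequency, which is the pattern expected around a balanced polymorphism.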

One limitation in extrapolating our observations about the power of the NCD statistics to other species is that power analyses require simulations, both neutral and with selection, whose parameters can vary widely among species. On the other hand, the work in Chapter 1 leaves open the possibility of using NCD1 and NCD2 in other species, given how easy they are to implement and how easily their results can be interpreted. Simulations for other species are needed to determine the power of the statistic for the scenario in question, and also to define appropriate filters, such as those we proposed in the extensive methods section of that work.

As an example, Teixeira and collaborators (in prep)⁵ have been working on a study that discusses the potential biological implications of targets of balancing selection in the “great apes”⁶. That study uses the NCD statistics, relying on specific and detailed demographic models for the species in question to assess power in those scenarios and to define appropriate filters. This application to other species shows the potential of our statistics to be used by evolutionary geneticists interested in signatures of selection that affect the allele frequency spectrum.
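The way simulations are used to estimate power can be sketched as follows, under stated assumptions: the statistic is computed on neutral and on selected simulations, a critical value is taken from the neutral distribution at the chosen false positive rate, and power is the fraction of selected simulations at or below that value (low values of the statistic being the signal). The values below are random placeholders, not output from any demographic model, and the function name is hypothetical.

```python
import random

def power_at_fpr(neutral_values, selected_values, fpr=0.05):
    """Fraction of selected simulations at or below the neutral critical value.

    The critical value is the empirical fpr-quantile of the neutral
    distribution; low values of the statistic are treated as the signal.
    """
    ordered = sorted(neutral_values)
    # Index of the largest neutral value still inside the lower fpr tail.
    k = max(int(fpr * len(ordered)) - 1, 0)
    critical = ordered[k]
    hits = sum(1 for v in selected_values if v <= critical)
    return hits / len(selected_values)

random.seed(1)
# Placeholder distributions standing in for simulation output.
neutral = [random.gauss(0.40, 0.03) for _ in range(1000)]
selected = [random.gauss(0.30, 0.03) for _ in range(1000)]
print(power_at_fpr(neutral, selected, fpr=0.05))
```

Because the critical value comes from the neutral simulations, a different demographic model changes the threshold and therefore the power, which is why the simulations have to be repeated for each species.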

⁵ I am a co-author of this work.
⁶ Great apes, including chimpanzee, bonobo, gorilla, and orangutan.

Bibliography

Akey, J. M. (2009). “Constructing genomic maps of positive selection in humans: where do we go from here?” In: Genome Research 19 (5), pp. 711–722.

Andrés, A. M. (2011). “Balancing Selection in the Human Genome”. In: eLS, pp. 1–8.

Andrés, A. M. et al. (2009). “Targets of balancing selection in the human genome”. In: Molecular Biology and Evolution 26 (12), pp. 2755–64.

Andrés, A. M. et al. (2010). “Balancing Selection Maintains a Form of ERAP2 that Undergoes Nonsense-Mediated Decay and Affects Antigen Presentation”. In: PLoS Genetics 6 (10), e1001157.

Bubb, K. L. et al. (2006). “Scan of human genome reveals no new Loci under ancient balancing selection”. In: Genetics 173 (4), pp. 2165–77.

Chakraborty, M. and J. D. Fry (2015). “Evidence that Environmental Heterogeneity Maintains a Detoxifying Enzyme Polymorphism in Drosophila melanogaster”. In: Current Biology 26 (2), pp. 1–5.

Charlesworth, B. (2009). “Effective population size and patterns of molecular evolution and variation”. In: Nature Reviews Genetics 10 (3), pp. 195–205.

DeGiorgio, M., K. E. Lohmueller and R. Nielsen (2014). “A model-based approach for identifying signatures of ancient balancing selection in genetic data”. In: PLoS Genetics 10 (8), e1004561.

Gravel, S., B. M. Henn, R. N. Gutenkunst, A. R. Indap, G. T. Marth, A. G. Clark, F. Yu, R. A. Gibbs and C. D. Bustamante (2011). “Demographic history and rare allele sharing among human populations”. In: Proceedings of the National Academy of Sciences of the United States of America 108 (29), pp. 11983–8.

Hudson, R. R., M. Kreitman and M. Aguade (1987). “A Test of Neutral Molecular Evolution Based on Nucleotide Data”. In: Genetics 116 (1), pp. 153–159.

Key, F. M., J. C. Teixeira, C. de Filippo and A. M. Andrés (2014). “Advantageous diversity maintained by balancing selection in humans”. In: Current Opinion in Genetics & Development 29, pp. 45–51.

Kircher, M., D. M. Witten, P. Jain, B. J. O’Roak, G. M. Cooper and J. Shendure (2014). “A general framework for estimating the relative pathogenicity of human genetic variants”. In: Nature Genetics 46 (3), pp. 310–315.

Leffler, E. M. et al. (2013). “Multiple Instances of Ancient Balancing Selection Shared Between Humans and Chimpanzees”. In: Science 339 (6127), pp. 1578–1582.

Lenz, T. L., B. Mueller, F. Trillmich and J. B. W. Wolf (2013). “Divergent allele advantage at MHC-DRB through direct and maternal genotypic effects and its consequences for allele pool composition and mating”. In: Proceedings of the Royal Society B: Biological Sciences 280 (1762), p. 20130714.

Mitchell-Olds, T., J. H. Willis and D. B. Goldstein (2007). “Which evolutionary processes influence natural genetic variation for phenotypic traits?” In: Nature Reviews Genetics 8 (11), pp. 845–856.

Oosterhout, C. van (2009). “A new theory of MHC evolution: beyond selection on the immune genes”. In: Proceedings of the Royal Society of London. Series B, Biological Sciences 276 (1657), pp. 657–65.

Savova, V., S. Chun, M. Sohail, R. B. McCole, R. Witwicki, L. Gai, T. L. Lenz, C.-t. Wu, S. R. Sunyaev and A. A. Gimelbrant (2016). “Genes with monoallelic expression contribute disproportionately to genetic diversity in humans”. In: Nature Genetics 48 (3), pp. 231–237.

Tajima, F. (1989). “Statistical method for testing the neutral mutation hypothesis by DNA polymorphism”. In: Genetics 123 (3), pp. 585–595.

Tan, Z., A. M. Shon and C. Ober (2005). “Evidence of balancing selection at the HLA-G promoter region”. In: Human Molecular Genetics 14 (23), pp. 3619–3628.

Appendices

Appendix A.1.

Copy of the article “Mapping bias overestimates reference allele frequencies at the HLA genes in the 1000 Genomes Project phase I data”: G3: Genes|Genomes|Genetics (2015), 5(3): 931-941. For this article, I contributed to planning the analyses and to understanding how the 1000 Genomes Project data are organized. In addition, I performed some of the statistical tests and proposed the use of measures of frequency deviation. Finally, I contributed comments on the writing of the text.

INVESTIGATION

Mapping Bias Overestimates Reference Allele Frequencies at the HLA Genes in the 1000 Genomes Project Phase I Data

Débora Y. C. Brandt,* Vitor R. C. Aguiar,* Bárbara D. Bitarello,* Kelly Nunes,* Jérôme Goudet,† and Diogo Meyer*,1 *Department of Genetics and Evolutionary Biology, University of São Paulo, 05508-090 São Paulo, SP, Brazil, and † Department of Ecology and Evolution, Biophore, University of Lausanne, CH-1015 Lausanne, Switzerland ORCID IDs: 0000-0001-7676-9367 (B.D.B.); 0000-0002-7155-5674 (D.M.)

ABSTRACT Next-generation sequencing (NGS) technologies have become the standard for data generation in studies of population genomics, as the 1000 Genomes Project (1000G). However, these techniques are known to be problematic when applied to highly polymorphic genomic regions, such as the human leukocyte antigen (HLA) genes. Because accurate genotype calls and allele frequency estimations are crucial to population genomics analyses, it is important to assess the reliability of NGS data. Here, we evaluate the reliability of genotype calls and allele frequency estimates of the single-nucleotide polymorphisms (SNPs) reported by 1000G (phase I) at five HLA genes (HLA-A, -B, -C, -DRB1, and -DQB1). We take advantage of the availability of HLA Sanger sequencing of 930 of the 1092 1000G samples and use this as a gold standard to benchmark the 1000G data. We document that 18.6% of SNP genotype calls in HLA genes are incorrect and that allele frequencies are estimated with an error greater than ±0.1 at approximately 25% of the SNPs in HLA genes. We found a bias toward overestimation of reference allele frequency for the 1000G data, indicating mapping bias is an important cause of error in frequency estimation in this dataset. We provide a list of sites that have poor allele frequency estimates and discuss the outcomes of including those sites in different kinds of analyses. Because the HLA region is the most polymorphic in the human genome, our results provide insights into the challenges of using NGS data at other genomic regions of high diversity.

KEYWORDS NGS; mapping bias; 1000 Genomes; HLA

Whole-genome resequencing data for large numbers of human indi- variable position, which constitute the data for downstream analyses viduals, as generated by the 1000 Genomes Project (www.1000genomes. and hypothesis testing. org), provide unprecedented amounts of information about micro- The calling of single-nucleotide polymorphisms (SNPs) and evolutionary processes and demographic histories. Such inferences genotypes and the estimation of allele frequencies from next- rely on either genotypic or allelic frequency information for each generation sequencing (NGS) has undergone rapid development, along with likelihood-based and Bayesian methods created to deal with challenges associated to heterogeneity in read quality and coverage

et al. (Nielsen et al. 2011). In Phase I of the 1000 Genomes Project, geno- Copyright © 2015 Brandt fi doi: 10.1534/g3.114.015784 types were called using a combination of different approaches: rst, Manuscript received December 22, 2014; accepted for publication March 13, 2015; primary call sets were independently generated by different centers with published Early Online March 17, 2015. different sequencing platforms, alignment, and variant calling methods; This is an open-access article distributed under the terms of the Creative Commons Attribution Unported License (http://creativecommons.org/licenses/ then, a consensus SNP call set was generated and made publicly avail- by/3.0/), which permits unrestricted use, distribution, and reproduction in any able (The 1000 Genomes Project Consortium 2012). medium, provided the original work is properly cited. The data generated by the 1000 Genomes Project frequently have Supporting information is available online at http://www.g3journal.org/lookup/ been used to make inferences about evolutionary processes affecting suppl/doi:10.1534/g3.114.015784/-/DC1 our species, including the detection of targets of natural selection Data available in public repositories: https://github.com/deboraycb/reliability_hla_1000g 1Corresponding author: Departamento de Genética e Biologia Evolutiva, Rua do (Hernandez et al. 2011; Ward and Kellis 2012; Andersen et al. 2012) Matão, 277, São Paulo, SP 05508-090, Brazil. E-mail: [email protected] and understanding the genetic basis of complex phenotypes

Volume 5 | May 2015 | 931 236 (Lappalainen et al. 2013). In addition, the detailed catalog of testing for selection, many studies have found strong evidence asso- genetic variation it provides across multiple human populations ciated to the HLA region, using the 1000 Genomes as a source of has been used to understand the processes affecting specific polymorphism data (e.g.,Leffler et al. 2013). genes. All the aforementioned applications of the 1000 Genomes Project Among the well-documented targets of selection is the major his- SNP data in HLA genes are dependent on the reliability of genotype tocompatibility complex region of the human genome, which harbors calls at each SNP. However, no study to date has provided a detailed the highly polymorphic classical human leukocyte antigen (HLA)class survey of the reliability of individual genotype calls and allele fre- I and II loci. The interest in these loci stems from their strong asso- quency estimates at the SNPs in HLA genes, despite their frequent ciation to various autoimmune disorders (Sollid et al. 2014), suscep- usage. We address this issue, discuss likely causes for cases of incorrect tibility and resistance to infection (Chapman and Hill 2012), and genotype calls, and provide a list of reliable sites for the HLA loci in striking signatures of genetic variation indicating strong balancing the 1000 Genomes data. As in previous studies (Erlich et al. 2011; selection (Meyer and Thomson 2001). Such types of investigations Major et al. 2013), we used a dataset in which individuals had their can naturally be extended to the analysis of the 1000 Genomes data, HLA genes genotyped using Sanger sequencing as a gold standard to which provide a rich resource of population genetic variation within benchmark the genotypes called at the 1000 Genomes Project. How- and around HLA genes. 
ever, differently from these other studies, which were interested in Despite this interest, the use of NGS data for HLA loci is hampered reconstructing the HLA haplotypes using NGS, here we have decon- by a major technical hurdle, which is the mapping of short sequence structed the haplotypes determined from Sanger sequencing data into reads to genes that are both highly polymorphic and which constitute SNPs, and compared genotypes at the SNP level to the 1000 Genomes a multigene family. The high polymorphism may decrease the prob- data. We took advantage of the recent availability of a dataset of ability that short reads will be successfully mapped to the reference Sanger sequencing based HLA genotyping of HLA-A, -B, -C, -DQB1, genome, in the event that the sequenced individual carries a variant and -DRB1 for 930 of the samples from the 1000 Genomes Project that is highly diverged from that used in the index (Nielsen et al. (Gourraud et al. 2014). Our results have implications for other studies 2011). In addition, the fact that many HLA genes have close that use SNP data from the 1000 Genomes in order to estimate allele paralogues increases the chance that a read will map to two or more frequencies. Because HLA loci are the most polymorphic in the human genomic regions, leading it to be discarded from most sequencing genome, they most likely represent the worst case scenario for map- analyses pipelines, and thus decreasing the amount of usable infor- ping bias and, consequently, allele frequency estimation error. mation for genotype calling (Treangen and Salzberg 2012). In previous studies authors explored the applicability of NGS to METHODS genotype the HLA alleles of an individual, where an allele typically is In this study, we compare NGS genotype calls and allele frequency fi de ned as the haplotype determined by a combination of SNPs within estimates reported by the 1000 Genomes Project with those obtained agivenHLA gene [e.g.,Erlichet al. 
(2011); Major et al. (2013)]. To in a study which used Sanger sequencing to genotype HLA genes. For this end, Erlich et al. (2011) proposed NGS methodologies in which the purpose of our analysis we assembled a dataset comprising the — different steps from sample preparation to haplotype level allele intersection of the 1000 Genomes and Sanger sequencing samples, — calling were adapted to deal with the issues of high polymorphism resulting in 930 individuals from 12 populations. Supporting Infor- and paralogy of HLA genes. In this way, they were able to successfully mation, Figure S1 summarizes the preprocessing of both datasets, validate their methodology in a study of 270 samples that had been which preceded genotype and allele frequency comparisons. typed previously by sequence-specific oligonucleotide hybridization, which they treated as a gold standard dataset. The same gold standard 1000 Genomes dataset (1000G) dataset was used by Major et al. (2013), who also examined the re- SNP genotypes were acquired from the chromosome 6 integrated liability of calling HLA alleles using NGS, but using the 1000 Genomes Variant Call Format (VCF) file from version 3 of the 1000 Genomes alignment data, and showed that this publicly available dataset can Project Phase I data, which is available at ftp://ftp.1000genomes.ebi.ac. be used for this purpose, after appropriate filters (e.g., coverage) are uk/vol1/ftp/release/20110521/ (The 1000 Genomes Project Consor- applied. tium 2012). We selected only SNPs in exons encoding the antigen Both Erlich et al. (2011) and Major et al. (2013) were interested in recognition sites (ARS), which are exons 2 and 3 for HLA-A,-B,and using NGS data to determine HLA alleles. Information regarding HLA -C (Bjorkman et al. 1987) and exon 2 for HLA-DQB1 and -DRB1 alleles is of biomedical relevance because HLA genotypes often are an (Brown et al. 1993). 
Sites were selected based on the most inclusive important covariate to account for in association studies, and HLA coordinates of the RefSeq database in July 22, 2014 (see File S1). Both typing is critical to hematopoietic transplantation. In this study, how- SNP and sample selection were carried out using VCFtools v0.1.12b ever, we evaluate the quality of SNP level genotype calls from the 1000 (Danecek et al. 2011). Genomes at the HLA genes. The analysis of genotype and allele frequencies for SNPs contained HLA reference panel by Gourraud et al. within HLA genes has proven of great value in biomedical and evo- (2014) (PAG2014) lutionary studies, and the 1000 Genomes dataset is a resource used Gourraud et al. (2014) typed class I HLA-A, -B and -C, and class II recurrently in this context. Examples of the use of HLA SNP data from HLA-DRB1 and -DQB1 genes of 1266 individuals from 14 different the 1000 Genomes Project include: (1) In genome-wide association populations in Africa, Europe, Asia, and America. The HLA sequence- studies (GWAS), SNPs in HLA genes often are associated with phe- based typing was performed with specificpolymerasechainreaction notypes of interest, and it is useful to understand the prevalence of amplification of ARS exons followed by Sanger sequencing. Data are these variants in additional populations; (2) GWAS studies benefit available at the dbMHC Web site (http://www.ncbi.nlm.nih.gov/gv/ from knowledge of the haplotype structure surrounding HLA genes, mhc/xslcgi.fcgi?cmd=cellsearch; Helmberg et al. 2014). which can be inferred from the dense SNP data of the 1000 Genomes Data from Gourraud et al. (2014) are available in the form of HLA for multiple populations (e.g., Hill-Burns et al. 2011); and (3) when allele names per individual. Allele naming for HLA genes follows

932 | D. Y. C. Brandt et al. 237 specificrules(Marshet al. 2010). To summarize, allele names are sequences for ARS encoding exons. Sequences were acquired from composed of a letter indicating the locus, followed by 224numeric the IMGT (i.e., international ImMunoGeneTics information system) fields separated by colons. Each numeric field indicates specificforms database (Robinson et al. 2013), which keeps a well-curated repository of variation: the 1st field distinguishes groups of alleles by serological of all known HLA allele sequences. type, and the following fields distinguish nonsynonymous polymor- Our analysis was restricted to ARS exons because the HLA typing phisms, synonymous polymorphisms, and noncoding differences, re- method used by Gourraud et al. (2014) only probed genetic variation spectively. To obtain SNP genotypes and frequencies from the Sanger in these specific exons. As a consequence, multiple HLA alleles are sequencing data, we converted all allele names to their associated compatible with the sequencing results, because the sites that

Figure 1 Genotype mismatches between the 1000G and PAG2014 datasets. Results per polymorphic site (“Position”) and per individual (930 in total). Individuals are ordered by number of mismatches (individuals with less mismatches on top). Sites are numbered according to their position in ARS exons coding sequence. Dark squares indicate mismatches between genotypes in the two datasets. ARS, antigen recognition sites; HLA, human leukocyte antigen.

Volume 5 May 2015 | Mapping Bias at HLA in 1000 Genomes | 933 238 934 | D. Y. C. Brandt et al. 239 Figure 3 (A) Distribution of coverage (x-axis) at matched and mismatched genotypes; y-axis is the square root of the relative frequency (Mann- Whitney U one-tailed test, P , 10216); (B) Relationship between mean coverage (x-axis) and absolute frequency difference (jFEj, y-axis) between 1000G and PAG2014 (r = 20.11, P = 0.09). All polymorphic sites from HLA-A, -B, -C, -DRB1, and -DQB1 genes are included in both a and b. HLA, human leukocyte antigen. differentiate them are in other exons. This results in what we refer to this article, sites are numbered according to their position in the ARS as an “ambiguous allele call” for an HLA allele (e.g., the allele is iden- exons coding sequences (12546 at the class I loci and 12270 at the tified as BÃ35:03, but we cannot establish whether it is BÃ35:03:01 or class II loci). BÃ35:03:02, or a group of alleles is attributed to an individual, such as BÃ35:02/BÃ35:03/BÃ35:04). Ambiguous allele calls also may happen Allele frequency comparisons when sequencing has low quality at bases that differentiate two alleles. After correcting all possible ambiguities in PAG2014 (as described In addition, there are also genotypic ambiguities, which occur when previously), we calculated allele frequencies for SNPs in both datasets. different pairs of alleles are compatible with the sequencing results. For By comparing the frequency of the reference allele in 1000G to its individuals that bear ambiguous alleles, we created a consensus se- value in PAG2014, we assessed the accuracy of allele frequency quence in which ambiguous sites were reported with both possible estimation. The reference allele was defined as the allele present in the alleles (e.g.,A/T,seeFigure S1). In this way, we incorporate the un- hg19 build of the reference sequence of the human genome. 
RefSeq certainty associated to the sequence-based typing into downstream IDs of the reference sequences used for each HLA gene are reported analyses. on File S1. Although we cannot rule out technical errors in the Sanger We computed the error in 1000G frequency estimates per site sequencing that generated the PAG2014 data (Gourraud et al. 2014), i (FE ) as follows: we assume that this method provides the most reliable estimate of i HLA alleles (and hence SNP genotypes), and will serve as a standard FEi ¼ fi;1000G 2 fi;PAG2014 to estimate the reliability of genotype calls and allele frequencies for the 1000 Genomes data (De Santis et al. 2013). where fi;1000G and fi;PAG2014 are the frequency of the reference allele at site i in 1000G and PAG2014, respectively. We also computed the Genotype comparisons mean absolute error in frequency estimates per gene as a mean of We initially quantified how well the 1000G and PAG2014 data agreed absolute FEi for all sites within a gene (MAE): with respect to genotype calls. Genotypes at each site in each individual n were compared between the 1000G data and the PAG2014 data, here 1 MAE ¼ f ; 2 f ; n X i 1000G i PAG2014 considered as a gold standard. In the case of sites with ambiguity (e.g., i¼1 T/A) in the PAG2014 data, if one of the two possible alleles matched an allele present in the 1000G, we considered this an allele match and where n is the number of SNPs in the gene. PAG2014 was corrected, by attributing the allele present in the 1000G data to the ambiguous site. After correcting the ambiguous sites in Coverage in 1000G PAG2014, we only considered genotypes to be a match if both alleles Sequencing coverage per individual per site was calculated from the in 1000G were present in the PAG2014 data, at that site. Throughout 1000 Genomes Project phase I BAM files for the low coverage

Figure 2 REF allele frequency per site in each HLA gene in the 1000 Genomes (1000G) and Sanger sequencing (PAG2014) datasets. Continuous line indicates the expected relationship (i.e., no difference) between 1000G and PAG2014. Dashed lines indicate a 60.1 deviation from the expected frequency (as estimated from PAG2014 dataset). MAE (mean absolute error) defined in the section Materials and Methods. Numbers indicate site position in ARS exons sequence. REF, reference; ARS, antigen recognition sites; HLA, human leukocyte antigen.

Volume 5 May 2015 | Mapping Bias at HLA in 1000 Genomes | 935 240 experiments using the genomeCoverageBed program from BED- Testing for mapping bias Tools (Quinlan and Hall 2010). BAM files are available on ftp:// After demonstrating that there is an overestimation of reference allele ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/[sampleID]/alignment/. frequency in the 1000G SNPs (see the section Results), we hypothe- Only low-coverage BAM fileswereusedtoestimatecoveragebe- sized that mapping bias was the underlying cause. To test this hy- cause genotype likelihoods for the data we analyzed (1000 Genomes pothesis, we examined whether reads carrying the alternative allele at Project Phase I integrated VCF files) were estimated from this source. a SNP are less likely to map to the reference genome than reads Genotype likelihoods were estimated from high coverage exome BAM carrying the reference allele. First, for each HLA allele present in the files only for a minority of sites that were exclusively discovered on the PAG2014 dataset, we defined windows of 51 base pairs that were exome experiments, and were not used in the coverage analysis (See centered on each SNP (25 base pairs upstream and 25 base pairs Table S1). downstream of the SNP, including non-polymorphic sites). The set

Figure 4 Difference in reference allele frequency between 1000G and PAG2014, measured by FE (see the section Materials and Methods), at each polymorphic site, in each population. Shades of red indicate overestimation of reference allele frequency and shades of blue indicate underestimation of reference allele frequency in 1000G. Full population names are given in Table S3.

936 | D. Y. C. Brandt et al. 241

of windows centered on a specific SNP was then separated in two groups: (i) those that carry the reference allele at the central site and (ii) those that carry the alternative allele at the central site. Next, all windows were compared with the reference genome (hg19) sequence (the same sequence that was used as an index in the 1000 Genomes Project), and the number of mismatches was counted, excluding the mismatch at the central SNP. If mapping bias was influencing allele frequency estimates, we expected that, for SNP positions with overestimation of the reference allele frequency in the 1000G, the alternative alleles would be flanked by additional alternative alleles (and thus have a greater mismatch count against the reference sequence).

RESULTS

Genotypic mismatch frequency

We found that, on average, 18.6% of genotypes were mismatched between 1000G and PAG2014 when individual genotypes for each site in the five classical HLA genes were compared, and exons with greater nucleotide diversity tend to have a greater proportion of genotype mismatches (Figure S2). We also observed that mismatches are especially concentrated on a few sites (Figure 1), with 18.7% of sites concentrating 50% of the mismatches over the five loci we analyzed.

Reference allele frequency accuracy

Accuracy of estimation of allele frequencies in 1000G was assessed by comparing the observed frequency of the reference allele in the 1000G data with that of PAG2014, for both the global dataset (consisting of a pooled set of all individuals) and for each population separately (see Figure S3, Figure S4, Figure S5, Figure S6, and Figure S7). We chose a difference of 0.1 between the frequencies on both datasets as a threshold that determines a “large” frequency difference.

For the global dataset (Figure 2) we found that for HLA-A and -C most SNPs have similar frequency estimates for 1000G and PAG2014, with few large deviations (only 9/66 and 8/44 SNPs with absolute difference in frequencies ($|FE|$) larger than 0.1, respectively). The HLA-DQB1 locus shows an intermediate proportion of SNPs with large deviations (10/42 SNPs with $|FE| > 0.1$), and HLA-B and HLA-DRB1 show the greatest proportion of sites with large frequency differences between 1000G and PAG2014 (23/64 and 15/35 sites with $|FE| > 0.1$). Overall, the mean absolute difference in frequency between SNPs in the 1000G and PAG2014 data is 0.08, and it is greater at the HLA genes with the greatest levels of nucleotide diversity (HLA-B, -DQB1 and -DRB1 all deviate by ±0.1).

The proportion of genotype mismatches and allele frequency deviations per site are highly correlated (Pearson correlation = 0.86, $P < 10^{-16}$; Figure S8). However, some SNPs with a high proportion of genotype mismatches have well-estimated allele frequencies. One example is site 465 at HLA-B, in which 44% of genotypes are mismatched, but $|FE|$ is only 0.007. Overall, 15 sites have more than 25% mismatched genotypes while showing $|FE| < 0.1$ (see Figure S8). This is possible when the frequency of genotype errors in which the reference allele is overrepresented is similar to the frequency of errors in which the alternative allele is overrepresented.

Allele frequency at the Axiom exome genotyping array – Affymetrix: Because genotyping arrays constitute an additional frequently used resource to genotype SNPs within HLA genes, playing an important role in GWAS studies, we also investigated the accuracy of allele frequency estimation from this genotyping technology. We estimated allele frequencies from Axiom Exome data, and we found that those allele frequency estimates are as reliable as the ones from the 1000 Genomes NGS data, at the same SNPs (see Figure S9).

Relationship between sequencing coverage and genotypic mismatches

To investigate whether low sequencing coverage could explain genotype mismatches and deviations from expected allele frequencies, we compared sequencing coverage between mismatched and matched genotypes (Figure 3A) and assessed the relationship between coverage and frequency deviation (Figure 3B).

Sites with mismatched genotypes have on average lower sequencing coverage than sites with matched genotypes (Figure 3A; Mann-Whitney U one-tailed test $P < 10^{-16}$). This is the expected relationship if low sequencing coverage explains genotype mismatches between datasets. However, the difference in sequencing coverage between sites with matched and mismatched genotypes is small (mean coverage in matching genotypes is 1.95, and 1.75 in nonmatching genotypes, a difference of 6.2%) and has likely achieved very high significance only due to the large number of observations. Similarly, correlation between allele frequency deviation and sequencing coverage is weak and not significant (Figure 3B; $r = -0.11$, $P = 0.09$), although the direction of correlation is in agreement with what would be expected if lower coverage explained larger deviations in frequency estimation. We have also investigated the possible effect of the position of the SNPs relative to exon edges on the allele frequency deviations and found no correlation between those factors (Figure S10). We therefore investigated other factors that may account for errors in genotype calling.

Direction of frequency deviation

We found that most of the genotype mismatches are caused by miscalling an alternative allele as a reference allele (Table S2). Furthermore, most deviations in allele frequency estimates are in the direction of an overestimation of reference allele frequencies in the 1000 Genomes data (Figure 2). This information is summarized in Figure 4, which shows the location and magnitude of frequency deviations between the 1000G and PAG2014 data.

The overall shift in the direction of overestimating reference alleles is summarized in Table 1, which shows the number of SNPs with more than 0.1 frequency difference in at least two populations, for each locus. For HLA-A, -B, and -DQB1 most sites with large frequency differences between 1000G and PAG2014 are skewed in the direction of overestimating the reference allele [$P = 0.057$ for HLA-A and $P < 10^{-4}$ for HLA-B and -DQB1, binomial test for null hypothesis of equal numbers of deviations in direction of reference (REF) or alternative (ALT)], whereas HLA-C and HLA-DRB1 show no evidence for an excess of large deviations in the direction of reference alleles.

Table 1 Number of sites with overestimation of REF or ALT allele frequency in each HLA locus ($|FE| > 0.1$ in 2 or more populations)

       A    B    C    DQB1  DRB1
REF    11   30   6    22    11
ALT    3    2    3    2     11

Genomic coordinates of those sites are given in Table S4. HLA, human leukocyte antigen; REF, reference; ALT, alternative.
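
The binomial test reported for Table 1 can be reproduced from the counts alone. The sketch below is a minimal illustration in plain Python, assuming an exact two-sided test with success probability 0.5 (the text does not state the sidedness or the software used). The function name and the dictionary layout are hypothetical.

```python
from math import comb


def binomial_two_sided_p(ref: int, alt: int) -> float:
    """Exact two-sided binomial P value for H0: deviations toward REF and
    toward ALT are equally likely (p = 0.5).

    Under p = 0.5 the distribution is symmetric, so the two-sided P value
    is twice the smaller tail, capped at 1. Sidedness is an assumption.
    """
    n = ref + alt
    k = min(ref, alt)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Counts of sites with |FE| > 0.1 in two or more populations (Table 1).
table1 = {
    "HLA-A": (11, 3),
    "HLA-B": (30, 2),
    "HLA-C": (6, 3),
    "HLA-DQB1": (22, 2),
    "HLA-DRB1": (11, 11),
}

for locus, (ref, alt) in table1.items():
    p_value = binomial_two_sided_p(ref, alt)
    print(f"{locus}: REF={ref}, ALT={alt}, P={p_value:.2g}")
```

Under these assumptions the HLA-A value comes out at about 0.057 and the HLA-B and HLA-DQB1 values fall below $10^{-4}$, in line with the values quoted in the text, while HLA-C and HLA-DRB1 remain far from significance.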

Figure 5 Number of differences to the reference genome at 1860 51-bp windows centered at sites HLA-B 132 and HLA-DQB1 244 with reference (REF) or alternative (ALT) allele at those sites. Windows were defined from all HLA alleles present in the 930 samples from the PAG2014 dataset. HLA, human leukocyte antigen.
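
The window comparison summarized in Figure 5 can be sketched in a few lines. The example below is a hypothetical illustration, not the authors' pipeline: it assumes that allele and reference sequences are already aligned to the same coordinates as equal-length strings, and the toy sequences, function names, and shortened flank are invented for the example (the study uses 25 bp on each side of the SNP).

```python
def window_mismatches(allele_seq: str, ref_seq: str, snp_pos: int, flank: int = 25) -> int:
    """Count differences to the reference in a window centered on snp_pos,
    excluding the central SNP itself. Windows are truncated at sequence ends."""
    if len(allele_seq) != len(ref_seq):
        raise ValueError("sequences must be aligned to the same length")
    start = max(0, snp_pos - flank)
    end = min(len(ref_seq), snp_pos + flank + 1)
    return sum(
        1
        for i in range(start, end)
        if i != snp_pos and allele_seq[i] != ref_seq[i]
    )


def split_by_central_allele(alleles, ref_seq, snp_pos, flank=25):
    """Group window mismatch counts by whether the central site carries
    the reference (REF) or an alternative (ALT) allele."""
    counts = {"REF": [], "ALT": []}
    for seq in alleles:
        group = "REF" if seq[snp_pos] == ref_seq[snp_pos] else "ALT"
        counts[group].append(window_mismatches(seq, ref_seq, snp_pos, flank))
    return counts


# Toy data (invented): one reference and three allele sequences, SNP at index 5.
reference = "ACGTACGTACG"
alleles = [
    "ACGTACGTACG",  # REF at the SNP, identical to the reference
    "ACGTATGTACG",  # ALT at the SNP, no flanking differences
    "ACCTATGTTCG",  # ALT at the SNP, two flanking differences
]

print(split_by_central_allele(alleles, reference, snp_pos=5, flank=3))
# → {'REF': [0], 'ALT': [0, 2]}
```

Under the mapping bias hypothesis, sites with overestimated REF frequency would show larger counts in the ALT group than in the REF group.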

Testing for mapping bias

We hypothesized that the observed reference allele bias was caused by a lower efficiency in the mapping of reads containing the alternative allele. This is expected under the assumption that the reads carrying the alternative allele on average have more differences with respect to the reference genome (used by the 1000 Genomes Consortium as the index to align NGS reads) than reads carrying the reference allele. In this scenario, some sites would have a stronger bias than others if the alternative alleles in those sites are flanked by additional alternative alleles.

To test this hypothesis, we aligned sequences of all alleles present in PAG2014 to the HLA sequences present in the hg19 build of the reference human genome (the same sequences used for the alignment of reads in the 1000 Genomes Project) and defined windows of 51 base pairs around each SNP. We then quantified the number of differences with respect to the reference genome for windows surrounding

Figure 6 Number of differences to the reference genome at 51-bp windows centered at each SNP in the HLA-A, -B, and -DQB1 genes. Windows around each SNP were defined from the set of 1860 alleles present in the 930 samples from the PAG2014 dataset. Next, the set of windows was divided into three groups: those centered on SNPs with overestimated, well estimated and underestimated reference allele frequencies (red, yellow and blue boxplots, respectively). Then, each group was divided in two: windows in which the central site contains the reference allele (REF, dark boxplots) and windows centered on an alternative allele (ALT, light colored boxplots). Upper and lower hinges correspond to the 25th and 75th percentiles, horizontal lines represent the median, whiskers are 1.5 times the interquartile range, and outliers are represented by dots. HLA, human leukocyte antigen; SNP, single-nucleotide polymorphism.

(i) REF and (ii) ALT alleles. If REF allele mapping bias is driving errors in frequency estimation, it is expected that sites with an overestimation of REF allele frequency would present the following pattern: windows carrying the REF with fewer differences to the reference genome than sequences centered on the ALT alleles. For sites with well-estimated frequencies, on the other hand, we did not expect such a difference between REF and ALT windows.

To illustrate this effect, Figure 5 shows the results for the two most extreme cases of frequency deviation shown in Figure 4: site 244 of HLA-DQB1 and site 132 of HLA-B (0.56 and 0.52 absolute increase in REF allele frequency in the 1000 Genomes data with respect to PAG2014). In both cases, ALT windows bear more differences to the reference sequence than REF windows. These results support the hypothesis that these sites with poorly estimated allele frequencies have their ALT alleles residing in haplotypes with substantially more differences with respect to the reference genome than haplotypes centered on the REF allele, thus accounting for the observed bias.

To gain a broader perspective of this issue, we classified SNPs from the HLA loci with REF allele bias (HLA-A, -B, and -DQB1) into three categories: (i) sites at which the REF allele frequency was overestimated, i.e., $FE > 0.1$ (“overestimated”); (ii) sites where the REF allele frequency was underestimated, i.e., $FE < -0.1$ (“underestimated”); and (iii) sites at which allele frequencies were well estimated ($|FE| < 0.01$, here referred to as “well estimated”). We compared these three categories of sites with respect to the number of differences relative to the reference genome in REF and ALT windows (Figure 6). We found that the overestimated group has a significant excess of differences at alternative allele bearing haplotypes. In this group of SNPs, ALT windows have on average 4.4 other differences relative to the reference genome, whereas those centered on the REF allele have 1.9 differences (excess of differences on windows centered on the ALT allele was tested with a one tailed Mann-Whitney U test; $P < 10^{-16}$). Sites with well estimated or underestimated REF allele frequency, on the other hand, do not show a similar excess of differences in the haplotypes bearing the ALT allele, although the difference between REF and ALT windows is statistically significant because of the large sample size (well estimated: ALT mean = 1.7; REF mean = 1.8; one tailed Mann-Whitney U test $P < 10^{-16}$; underestimated: ALT mean = 1.9; REF mean = 1.2; one tailed Mann-Whitney U, $P < 10^{-16}$).

Impact of biases in frequency estimation to population genetic statistics

Our analysis was able to identify a subset of SNPs in the HLA genes for which genotype calls and allele frequency estimates from the 1000G showed a high error rate with respect to the PAG2014 dataset. To evaluate the impact of the errors introduced by including these sites in population genetic analyses, we compared the distribution of sample heterozygosity between the sites with low and high error rates. Heterozygosity is defined as $H = 2p(1 - p)$ for biallelic loci, as is the

Figure 7 Heterozygosity of SNPs at HLA genes estimated from the PAG2014 dataset. Orange bars show distribution of heterozygosity at sites with a high error rate in frequency estimation ($|FE| > 0.1$ in two or more populations). Blue bars show the distribution of heterozygosity after exclusion of SNPs with high error rate. SNP, single-nucleotide polymorphism; HLA, human leukocyte antigen.

Figure 8 Relationship between SNP heterozygosity (H) and (A) absolute value of deviation ($|FE|$; Pearson’s correlation = 0.32; $P = 1.938 \times 10^{-7}$) or (B) magnitude and direction of deviation (FE; Pearson’s correlation = 0.59; $P < 10^{-16}$). SNP, single-nucleotide polymorphism.
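
The quantities plotted in Figure 8 are simple to compute per site. The sketch below assumes that FE is the 1000G reference allele frequency minus the PAG2014 one (consistent with positive FE meaning overestimation; the formal definition is given in Materials and Methods) and uses $H = 2p(1 - p)$ with $p$ taken from PAG2014. The frequencies are invented for illustration, the function names are hypothetical, and the "intermediate" label is added here only to cover sites outside the three categories named in the text.

```python
from math import sqrt


def heterozygosity(p: float) -> float:
    """Expected heterozygosity of a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def frequency_error(freq_1000g: float, freq_pag2014: float) -> float:
    """FE, assumed here to be the 1000G estimate minus the PAG2014 estimate."""
    return freq_1000g - freq_pag2014


def classify_site(fe: float) -> str:
    """Categories from the text; 'intermediate' is a label invented for the rest."""
    if fe > 0.1:
        return "overestimated"
    if fe < -0.1:
        return "underestimated"
    if abs(fe) < 0.01:
        return "well estimated"
    return "intermediate"


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented REF allele frequencies per site: (1000G, PAG2014).
sites = [(0.95, 0.945), (0.80, 0.60), (0.50, 0.65), (0.99, 0.99)]

fe_values = [frequency_error(a, b) for a, b in sites]
h_values = [heterozygosity(b) for _, b in sites]  # H from the gold standard
labels = [classify_site(fe) for fe in fe_values]

print(labels)
print("correlation between H and |FE|:", pearson(h_values, [abs(fe) for fe in fe_values]))
```

Computing H from the PAG2014 frequencies, as the text does for Figure 7, keeps the heterozygosity values independent of the NGS errors being evaluated.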

case for the 1000 Genomes Phase I SNPs, because tri- or quad-allelic SNPs were not reported on Phase I.

The removal of sites with poor frequency estimates ($|FE| > 0.1$ in at least two populations) results in a marked change in the distribution of H, with a significant drop in the frequency of sites with large H and a shift in the distribution toward lower values (Figure 7). Note that the H values in Figure 7 are estimated from the PAG2014 data, implying that the high values of H among “excluded” sites are not due to the deviations in allele frequencies generated by NGS errors, but are the true heterozygosities at those sites. These results therefore document that because sites with high heterozygosity tend to have greater deviations from the “true” frequency (i.e., based on the PAG2014 dataset), the removal of poorly estimated sites results in a reduction in H values.

The effect of heterozygosity on allele frequency estimation bias

We found an overall positive correlation between SNP heterozygosity and the magnitude of error in allele frequency estimates (Figure 8A; Pearson’s correlation = 0.32; $P = 1.938 \times 10^{-7}$). This result provides further evidence that sites with greater heterozygosity tend to have poorer estimates for allele frequencies in the 1000G. Also, heterozygosity is even more strongly correlated to the deviation in frequency, considering the direction of the deviation (Figure 8B; Pearson’s correlation = 0.59; $P < 10^{-16}$). Together, these results show that HLA SNPs with greater heterozygosities not only have more errors in frequency estimation but also a stronger bias toward overestimation of REF allele frequency.

DISCUSSION

The 1000 Genomes Project data were generated by various sequencing centers, which relied on different sequencing platforms, read lengths, aligners and variant and genotype calling algorithms (The 1000 Genomes Project Consortium 2012), creating challenges to an overall assessment of data reliability. In this study, we specifically examine the performance of NGS-based genotype calls and allele frequency estimates for the highly polymorphic and intensely studied classical HLA genes. We took advantage of the possibility of comparing downstream genotype calls from the 1000 Genomes and HLA typing based on Sanger sequencing for the same set of samples to assess data quality and test hypotheses about possible biases.

We show that the 1000 Genomes SNPs called in the HLA genes have many differences at the genotype level, when compared to results obtained using Sanger sequencing. However, considerably high genotype mismatching is possible with only modest deviations in allele frequencies, and we conclude that for the 1000 Genomes data allele frequency estimates for SNPs at HLA genes are considerably more reliable than the individual genotype calls.

Low coverage did not explain the errors in genotypes and allele frequencies in the 1000 Genomes dataset. Instead, we found evidence that read mapping bias was responsible for those errors. Mapping bias is well known for NGS, and highly polymorphic regions such as HLA genes are especially susceptible to its effects (Nielsen et al. 2011), particularly when a single reference genome is used as an index for the alignment of NGS reads. In this situation, many true variants fail to be identified because they are present in haplotypes that differ from the genome used as index, and thus reads generated from these regions are not aligned and are lost. Together, these results suggest that increasing coverage would not improve allele frequency estimates at those sites if a single reference sequence is still used as index. By mapping to multiple genomes [e.g., using strategies similar to Boegel et al. (2012) or Dilthey et al. (2014)], it would be possible to improve genotype calling and allele frequency estimates.

In our study, HLA-A, -B, and -DQB1 show evidence of REF allele mapping bias. The HLA-DRB1 locus, on the other hand, did not present REF allele frequency overestimation, a finding that can be explained by the existence of multiple copies of this gene (both pseudogenes and functional copies), which may result in biases/errors that make REF allele bias comparatively less visible (Degner et al. 2009). The HLA-C locus also shows a weaker REF allele bias, a pattern that may be explained by its lower degree of polymorphism, which leads to a decrease in the number of mismatches of reads with respect to the reference genome, thus decreasing the mapping bias.

We provide a list of unreliable SNPs within the HLA genes, defined by us as those with an absolute difference in frequency larger than 0.1 ($|FE| > 0.1$) in two or more populations (Table S4). We show that these unreliable SNPs on average have greater heterozygosities in our gold standard dataset. As a consequence, although filtering out those unreliable sites improves the overall accuracy in allele frequency estimation, it leads to an underestimation of the mean heterozygosity of SNPs in HLA genes, a bias that should be taken into account in downstream analyses. Analyses that require genotype calls at the individual level, including haplotype-based analyses, should be performed with caution when using the data from the 1000 Genomes at HLA genes.

Our results have implications for studies that use SNP data from the 1000 Genomes in other genomic regions with high variability, such as KIR and olfactory receptors. Because HLA loci are the most polymorphic in the human genome, they represent a worst case scenario for mapping bias and subsequent allele frequency estimation errors. We found a significant correlation between SNP heterozygosity and the absolute difference in frequency between 1000 Genomes data and our gold standard. This suggests that in genome-wide studies, SNPs with high heterozygosities, and contained within regions with additional SNPs, have an increased chance of presenting poor frequency estimates.

ACKNOWLEDGMENTS

This research was financially supported by grants from São Paulo Research Foundation (FAPESP) and The Brazilian National Council for Scientific and Technological Development (CNPq). D.Y.C.B. was funded by FAPESP scholarships #2012/22796-9 and #2013/12162-5; V.R.C.A. has a FAPESP grant #2014/12123-2; B.D.B. was funded by #2011/12500-2 (FAPESP) and #152676/2011-2 (CNPq); K.N. has a FAPESP grant #2012/09950-9; and D.M. has a FAPESP research grant #12/18010-0 and a CNPq productivity grant #308167/2012-0.

LITERATURE CITED

Andersen, K. G., I. Shylakhter, S. Tabrizi, S. R. Grossman, C. T. Happi et al., 2012 Genome-wide scans provide evidence for positive selection of genes implicated in Lassa fever. Philos. Trans. R. Soc. Lond. B Biol. Sci. 367: 868–877.
Bjorkman, P. J., M. A. Saper, B. Samraoui, W. S. Bennett, J. L. Strominger et al., 1987 Structure of the human class I histocompatibility antigen, HLA-A2. Nature 329: 506–512.
Boegel, S., M. Löwer, M. Schäfer, T. Bukur, J. de Graaf et al., 2012 HLA typing from RNA-Seq sequence reads. Genome Med. 4: 102.
Brown, J. H., T. S. Jardetzky, J. C. Gorga, L. J. Stern, R. G. Urban et al., 1993 Three-dimensional structure of the human class II histocompatibility antigen HLA-DR1. Nature 364: 33–39.
Chapman, S. J., and A. V. S. Hill, 2012 Human genetic susceptibility to infectious disease. Nat. Rev. Genet. 13: 175–188.
Danecek, P., A. Auton, G. R. Abecasis, C. A. Albers, E. Banks et al., 2011 The variant call format and VCFtools. Bioinformatics 27: 2156–2158.

De Santis, D., D. Dinauer, J. Duke, H. A. Erlich, C. L. Holcomb et al., 2013 16th IHIW: review of HLA typing by NGS. Int. J. Immunogenet. 40: 72–76.
Degner, J. F., J. C. Marioni, A. A. Pai, J. K. Pickrell, E. Nkadori et al., 2009 Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics 25: 3207–3212.
Dilthey, A., C. Cox, Z. Iqbal, M. R. Nelson, and G. McVean, 2014 Improved genome inference in the MHC using a population reference graph. bioRxiv. Available from: http://biorxiv.org/content/early/2014/07/08/006973. Accessed March 20, 2015.
Erlich, R. L., X. Jia, S. Anderson, E. Banks, X. Gao et al., 2011 Next-generation sequencing for HLA typing of class I loci. BMC Genomics 12: 42.
Gourraud, P.-A., P. Khankhanian, N. Cereb, S. Y. Yang, M. Feolo et al., 2014 HLA Diversity in the 1000 Genomes Dataset. PLoS One 9: e97282.
Helmberg, W., M. Feolo, R. Dunivin, and D. Hoffman, 2014 dbMHC.
Hernandez, R. D., J. L. Kelley, E. Elyashiv, S. C. Melton, A. Auton et al., 2011 Classic selective sweeps were rare in recent human evolution. Science 331: 920–924.
Hill-Burns, E. M., S. A. Factor, C. P. Zabetian, G. Thomson, and H. Payami, 2011 Evidence for more than one Parkinson’s disease-associated variant within the HLA region. PLoS One 6: e27109.
Kitts, A., M. Feolo, and W. Helmberg, 2003 The major histocompatibility complex database, dbMHC. In: National Center for Biotechnology Information NIH, ed. The NCBI Handbook. Bethesda: National Center for Biotechnology Information NIH, p. 1–29.
Lappalainen, T., M. Sammeth, M. R. Friedländer, P. C. ’t Hoen, J. Monlong et al., 2013 Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501: 506–511.
Leffler, E. M., Z. Gao, S. Pfeifer, L. Ségurel, A. Auton et al., 2013 Multiple instances of ancient balancing selection shared between humans and chimpanzees. Science 339: 1578–1582.
Major, E., K. Rigo, T. Hague, A. Bérces, and S. Juhos, 2013 HLA typing from 1000 genomes whole genome and whole exome Illumina data. PLoS One 8: e78410.
Marsh, S. G. E., E. D. Albert, W. F. Bodmer, R. E. Bontrop, B. Dupont et al., 2010 Nomenclature for factors of the HLA system, 2010. Tissue Antigens 75: 291–455.
Meyer, D., and G. Thomson, 2001 How selection shapes variation of the human major histocompatibility complex: a review. Ann. Hum. Genet. 65: 1–26.
Nielsen, R., J. S. Paul, A. Albrechtsen, and Y. S. Song, 2011 Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet. 12: 443–451.
Quinlan, A. R., and I. M. Hall, 2010 BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26: 841–842.
Robinson, J., J. A. Halliwell, H. McWilliam, R. Lopez, P. Parham et al., 2013 The IMGT/HLA database. Nucleic Acids Res. 41: D1222–D1227.
Sollid, L. M., W. Pos, and K. W. Wucherpfennig, 2014 Molecular mechanisms for contribution of MHC molecules to autoimmune diseases. Curr. Opin. Immunol. 31C: 24–30.
The 1000 Genomes Project Consortium, 2012 An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65.
Treangen, T. J., and S. L. Salzberg, 2012 Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 13: 36–46.
Ward, L. D., and M. Kellis, 2012 Evidence of abundant purifying selection in humans for recently acquired regulatory functions. Science 337: 1675–1678.

Communicating editor: C. R. Marshall

Appendix A.2.

Copy of the article “HLA supertype variation across populations: new insights into the role of natural selection in the evolution of HLA-A and HLA-B polymorphisms”: Immunogenetics (2015), 67(11):651-663. In this work, I contributed Perl scripts for carrying out the permutations described in the article. In addition, this work has much in common with the manuscript presented in Appendix A.4: in both, we investigate the units of selection in the HLA genes, albeit with quite distinct approaches. Here, we work with the HLA-A and HLA-B genes in a population context and investigate the role of supertypes as units of selection, whereas in the other (A.4) we use phylogenetic comparative approaches to investigate the role of HLA allelic lineages as units of selection in the HLA class I genes.

Immunogenetics DOI 10.1007/s00251-015-0875-9

ORIGINAL PAPER

HLA supertype variation across populations: new insights into the role of natural selection in the evolution of HLA-A and HLA-B polymorphisms

Rodrigo dos Santos Francisco1,2,3 & Stéphane Buhler2,4 & José Manuel Nunes2,5 & Bárbara Domingues Bitarello1 & Gustavo Starvaggi França6,7 & Diogo Meyer 1 & Alicia Sanchez-Mazas2,5

Received: 28 June 2015 / Accepted: 29 September 2015 © The Author(s) 2015. This article is published with open access at Springerlink.com

Abstract Supertypes are groups of human leukocyte antigen (HLA) alleles which bind overlapping sets of peptides due to sharing specific residues at the anchor positions (the B and F pockets) of the peptide-binding region (PBR). HLA alleles within the same supertype are expected to be functionally similar, while those from different supertypes are expected to be functionally distinct, presenting different sets of peptides. In this study, we applied the supertype classification to the HLA-A and HLA-B data of 55 worldwide populations in order to investigate the effect of natural selection on supertype rather than allelic variation at these loci. We compared the nucleotide diversity of the B and F pockets with that of the other PBR regions through a resampling procedure and compared the patterns of within-population heterozygosity (He) and between-population differentiation (GST) observed when using the supertype definition to those estimated when using randomized groups of alleles. At HLA-A, low levels of variation are observed at B and F pockets and randomized He and GST do not differ from the observed data. By contrast, HLA-B concentrates most of the differences between supertypes, the B pocket showing a particularly high level of variation. Moreover, at HLA-B, the reassignment of alleles into random groups does not reproduce the patterns of population differentiation observed with supertypes. We thus conclude that differently from HLA-A, for which supertype and allelic variation show similar patterns of nucleotide diversity within and between populations, HLA-B has likely evolved through specific adaptations of its B pocket to local pathogens.

Keywords HLA . Supertypes . Human populations . Natural selection . Pathogens . Adaptation
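
The comparison between supertype groupings and randomized groupings can be illustrated with a toy calculation. The sketch below is not the authors' procedure: it assumes that expected heterozygosity is computed as $1 - \sum_i p_i^2$ without sample-size correction, it shuffles group labels among alleles while keeping group sizes fixed, and it uses invented allele frequencies and an invented allele-to-group mapping. All names are hypothetical.

```python
import random


def expected_heterozygosity(freqs) -> float:
    """He = 1 - sum of squared frequencies (no sample-size correction; an assumption)."""
    return 1.0 - sum(f * f for f in freqs)


def collapse(allele_freqs: dict, grouping: dict) -> dict:
    """Sum allele frequencies within each group (e.g., each supertype)."""
    grouped = {}
    for allele, freq in allele_freqs.items():
        group = grouping[allele]
        grouped[group] = grouped.get(group, 0.0) + freq
    return grouped


def randomize_grouping(grouping: dict, rng: random.Random) -> dict:
    """Reassign alleles to groups at random, keeping the number of alleles per group."""
    alleles = list(grouping)
    labels = [grouping[a] for a in alleles]
    rng.shuffle(labels)
    return dict(zip(alleles, labels))


# Invented frequencies in one population and an invented grouping.
allele_freqs = {"a1": 0.30, "a2": 0.25, "a3": 0.20, "a4": 0.15, "a5": 0.10}
grouping = {"a1": "S1", "a2": "S1", "a3": "S2", "a4": "S2", "a5": "S3"}

allelic_he = expected_heterozygosity(allele_freqs.values())
observed_he = expected_heterozygosity(collapse(allele_freqs, grouping).values())

rng = random.Random(1)  # fixed seed so the run is repeatable
randomized_he = [
    expected_heterozygosity(
        collapse(allele_freqs, randomize_grouping(grouping, rng)).values()
    )
    for _ in range(1000)
]

# Fraction of random groupings with He at least as high as the observed grouping.
fraction = sum(h >= observed_he for h in randomized_he) / len(randomized_he)
print(f"allelic He = {allelic_he:.3f}, grouped He = {observed_he:.3f}")
print(f"fraction of randomizations with He >= observed: {fraction:.3f}")
```

Because merging alleles can only increase the sum of squared frequencies, any grouping gives an He no larger than the allelic He, so the informative comparison is between the observed grouping and the randomized ones.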

Diogo Meyer and Alicia Sanchez-Mazas co-supervised the study. Electronic supplementary material The online version of this article (doi:10.1007/s00251-015-0875-9) contains supplementary material, which is available to authorized users.

* Rodrigo dos Santos Francisco [email protected]
Diogo Meyer [email protected]
Alicia Sanchez-Mazas [email protected]

1 Department of Genetics and Evolutionary Biology, University of São Paulo, São Paulo, Brazil
2 Laboratory of Anthropology, Genetics and Peopling History, Department of Genetics and Evolution–Anthropology Unit, University of Geneva, Geneva, Switzerland
3 Hospital Israelita Albert Einstein, São Paulo, Brazil
4 Transplantation Immunology Unit and National Reference Laboratory for Histocompatibility, Department of Genetic and Laboratory Medicine, Geneva University Hospital, Geneva, Switzerland
5 Institute of Genetics and Genomics in Geneva (IGE3), Geneva, Switzerland
6 Department of Biochemistry, Chemistry Institute, University of São Paulo, São Paulo, Brazil
7 Molecular Oncology Center, Sírio-Libanês Hospital, São Paulo, Brazil


Introduction

The three classical human leukocyte antigen (HLA) class I genes, HLA-A, HLA-B, and HLA-C, are extremely polymorphic and exhibit thousands of alleles, most of them coding for different proteins (2112, 2789, and 1799 HLA-A, HLA-B, and HLA-C proteins currently defined, respectively) (Robinson et al. 2015). These molecules play a central role in the immune response by presenting processed peptides derived from proteins of the intracellular environment (including foreign ones derived from intracellular parasites such as viruses and some bacteria) to cytotoxic T lymphocytes and also functioning as ligands for the killer immunoglobulin-like receptor (KIR) of natural killer cells (Parham 2005).

Almost all of the HLA class I polymorphisms are clustered in exons 2 and 3, which code for the α1 and α2 extracellular domains of the HLA molecule. These domains form a groove-like structure known as the peptide-binding region (PBR) which engages the peptides (Saper et al. 1991). At the DNA level, the PBR codons exhibit striking features regarding their diversity, including a high heterozygosity (Parham et al. 1989; Lawlor et al. 1990; Hedrick et al. 1991) and high rates of non-synonymous substitutions (Hughes and Nei 1988; Takahata et al. 1992). These characteristics contrast with neutral expectations and support the hypothesis that balancing selection has maintained variation at these codons. The high levels of variation observed at the sites involved in peptide binding support a model of host-pathogen coevolution (Apanius et al. 1997), which states that the pathogenic microorganisms are the main evolutionary force shaping HLA variation (Borghans et al. 2004; Slade and McCallum 1992; Takahata and Nei 1990). Further supporting this hypothesis, several studies have demonstrated a positive correlation between the diversity level of some HLA genes and the richness of environmental pathogens (Prugnolle et al. 2005; Qutob et al. 2011; Sanchez-Mazas et al. 2012). These results corroborate the idea that codons making up the PBR constitute the main targets of balancing selection within HLA genes. However, the analyses performed to date generally treat the PBR region as a homogeneous block, whereas it is in fact composed of six different pocket-like structures (A, B, C, D, E, and F). Each pocket accommodates one of the nine amino acid residues of the bound peptide (the first, second, third, sixth, seventh, and ninth, respectively) (Saper et al. 1991). Moreover, the binding affinity between a given HLA molecule and a specific peptide depends on the chemical properties of each PBR pocket (Saper et al. 1991).

The strongest interaction between HLA molecules and the bound peptides is accounted for by the B and F pockets, which accommodate the second and ninth amino acid residues of the peptide, respectively (Saper et al. 1991). As the amino acids composing the B and F pockets play a central role in peptide recognition by the HLA molecules, Sidney et al. (1996) classified HLA alleles into supertypes, defined as groups of alleles sharing chemical properties at the B and F pockets. The logic behind the classification is that alleles within supertypes are expected to exhibit widely overlapping peptide repertoires, whereas alleles from different supertypes would more frequently bind non-overlapping sets of peptides. Supertypes were originally defined by sequencing endogenously bound ligands and searching for motifs shared by alleles that bind similar peptides and by analyzing the three-dimensional structure of the HLA molecules (Sette and Sidney 1999; Sidney et al. 1996, 2008). As a result, four supertypes were described for HLA-A (A1, A2, A3, and A24) and five for HLA-B (B7, B27, B44, B58, and B62), and they were originally assigned, respectively, to 31 HLA-A and 57 HLA-B alleles whose peptide-binding specificities were experimentally defined. These alleles were used to construct a reference panel for the B and F amino acid sequences. A set of 945 HLA-A and HLA-B alleles with unknown binding specificities were then checked for matches to the sequences of this panel (Sidney et al. 2008). Among these 945 previously unclassified alleles, 57 % presented a full match in both B and F pockets to alleles with known supertype status. Another 23.8 % presented partial matches with residues found in these pockets.

In line with the expectation that supertypes constitute a functionally relevant definition of HLA variation, several researchers have found that grouping alleles into supertypes is useful in disease association studies involving HLA loci (Alencar et al. 2013; Chakraborty et al. 2013; Cordery et al. 2012; Gilchuk et al. 2013; Karlsson et al. 2012, 2013; Kuniholm et al. 2013; Trachtenberg et al. 2003), allowing large numbers of rare alleles to be grouped according to a functional criterion, thus increasing the power of the studies. From an evolutionary point of view, natural selection is expected to leave a detectable signature on B and F pockets and, consequently, on the genotypes defined by examining HLA variation from the perspective of the supertypes. For example, under the assumption that pathogen-driven selection shapes supertype frequencies, we expect genetic variation defined at the supertype level to show patterns of polymorphism and differentiation indicative of balancing selection to a greater degree than variation that is not related to supertype definition. The prediction that balancing selection on supertype variation would result in detectable genetic signatures was raised by Sette and Sidney (1999), who found that “supertype frequencies were high and fairly conserved among different ethnicities.” In addition, Naugler and Liwski (2008) argued that “natural selection should favor maximization of the heterozygosity of allele supertypes instead of the heterozygosity of individual alleles,” making explicit the hypothesis that supertypes, as defined by B and F pocket variations, constitute the level of variation that is the primary target of natural selection in HLA genes.

249 Immunogenetics

Both conservation of supertype frequencies between populations and increased heterozygosity at the supertype level are expected to generate a pattern of low population differentiation when compared with those observed at the allelic level. Balancing selection at the supertype level would also enhance genetic variation at the B and F pockets compared with other regions of the PBR, increasing the chances of antigen recognition by the immune system. However, testing these hypotheses, i.e., comparing population differentiation and variability defined at the levels of HLA alleles and supertypes, respectively, represents a methodological challenge due to the difficulties in comparing measures of differentiation and heterozygosity for genetic variants that are defined by different attributes (alleles being defined by all variation in the coding region, by contrast with supertypes which are defined by a subset of codons). Indeed, because supertypes are sets of alleles, the genetic variation defined at the allele level is nested within that defined at the supertype level. Therefore, heterozygosity at the supertype level is constrained to be lower than or equal to that estimated at the allele level. Furthermore, because population genetic differentiation measured by statistics related to Wright’s FST is strongly determined by intrapopulation variability (Jost 2008), we expect higher levels of population differentiation at the supertype level simply because of the decreased number of supertype variants in comparison to alleles.

In the present study, our aim is to investigate whether the use of supertype instead of allele definitions at HLA-A and HLA-B loci reduces population differentiation and increases heterozygosity, as expected under a model of balancing selection acting on supertypes. For the reasons explained above, we control our analyses for the inherent differences in polymorphism between these two kinds of classification. Our approach consists in producing null distributions for population differentiation and heterozygosity by generating randomized sets of alleles (herein referred to as “random supertypes”) that match true supertype sampling properties (i.e., number of supertypes and number of alleles per supertype) without any biological criteria for pooling them together. We also analyze supertype variation at the nucleotide level by partitioning DNA sequences into segments corresponding to the different pockets within the PBR. Our hypothesis is that the B and F pockets, which are the major determinants of the peptide-binding specificities and used to define supertypes, constitute the main targets of balancing selection and thus retain higher levels of diversity compared to other PBR pockets.

Materials and methods

Population data

We used a database generated for the 13th International Histocompatibility Workshop (IHWS) (Mack et al. 2006) from which we excluded populations presenting (a) an allelic resolution lower than the first two sets of digits (now referred to as second field level of resolution), so as to only keep alleles differing at the protein level; (b) genotypic ambiguities; and (c) deviation from Hardy-Weinberg expectations. This filtering resulted in a dataset of 6435 and 6409 individuals typed for HLA-A and HLA-B, respectively, belonging to 55 different populations: seven sub-Saharan African (SSA), two North African (NAF), eight Southwest Asian (SWA), four European (EUR), 22 Southeast Asian (SEA), four Pacific islanders (PAC), four Australian aborigine (AUS), two North Asian (NEA), and two Native American (AME) populations (Supplementary Material Table 1-S). Almost half of these populations (24 out of 55) had demographic histories indicating that they were likely to have experienced severe founder effects (these populations were from Oceania, Taiwan, and the Americas). Because such reductions in diversity due to demographic effects can potentially mask signals of balancing selection, we carried out all the analyses with both the complete set of 55 populations and a reduced set of 30 populations (obtained by excluding those from Oceania, Taiwan, and the Americas).

Supertype definition

We assigned all HLA-A and HLA-B alleles to their specific supertype as defined by the classification given in figures 1 (http://www.biomedcentral.com/1471-2172/9/1/figure/F1) and 2 (http://www.biomedcentral.com/1471-2172/9/1/figure/F2) from Sidney et al. (2008). The alleles not assigned to any supertype were treated in our analyses of population differentiation and molecular variation in two ways: (a) their allele-level definition was used and (b) they were pooled into groups of “non-classified alleles” (named NCA and NCB for HLA-A and HLA-B, respectively). We included A*29:01, A*29:02, A*29:03, A*30:01, A*30:08, and A*68:06 in the NCA group because of their ambiguous supertype allocation (Sidney et al. 2008), and all B*08 alleles were assigned to the NCB group because of their unique PBR structures, which make the peptide-binding profile unpredictable (Sidney et al. 2008).

Population genetic analyses

We tested the population samples for deviation from Hardy-Weinberg (HW) equilibrium using the Gene[rate] program which tests the null hypothesis of equilibrium on the basis of a log-likelihood ratio test on frequency estimates (both under HW and under a generalized non-HW model) (Nunes et al. 2014; Nunes 2014).

We wrote R scripts to estimate supertype frequencies by direct counting of alleles, generate summary statistics (number of alleles (k) and expected sample heterozygosity (He)), and

estimate genetic differentiation between pairs of populations by using GST (Nei and Chesser 1983). Mantel tests (Mantel 1967) for assessing Pearson’s correlations between genetic distances obtained either from supertype or from allelic data were carried out using the ade4 R package (Dray and Dufour 2007), and all graphs and other statistical tests (e.g., Wilcoxon rank sum test) were also generated using R version 3.0.2 (Development Core Team 2011). In box plots, the boxes correspond to the interquartile range, the median is the thick line inside the box, and whiskers extend up to observations that are outside the box for less than 1.5 times the interquartile range. Dots are outliers to these limits. By using Arlequin 3.5 program (Excoffier and Lischer 2010), we performed a hierarchical analysis of molecular variance (AMOVA) for each supertype taken individually by pooling all others into a unique group of “non-classified alleles” for the calculations. In this way, we estimated the diversity among populations (FST), among populations within geographic regions (FSC), and among geographic regions (FCT) for each supertype.

Testing the molecular variation of the PBR pockets

We analyzed the molecular variation at each PBR pocket using the coding sequences of the six pockets which make up the HLA class I peptide-binding region (A to F). The definition of these codons (Table 1) was taken from Saper et al. (1991). The residues retained for the analysis of pocket B variability are the ones surrounding the rim and constituting the inner wall of the pocket. As the main-chain atoms of pocket B residues 24, 25, and 34 are part of the protein backbone, and their side chains are not turned to the pocket area, they are not expected to contribute to the chemical properties of the pocket, and were not included in the analysis (Saper et al. 1991; see also Table 1).

The B and F pockets were analyzed individually because of their central role in engaging peptides and in defining supertypes. As the C, D, and E pockets jointly make up the central region of the PBR and are shorter compared to other pockets, we pooled them for the present analysis. The A pocket was analyzed individually because of its position at one end of the PBR.

We estimated the nucleotide diversity (π) (Nei 1987) per pocket (i.e., A, B, pooled CDE, and F) for each population (referred to as πtotal). For these four pockets, we also computed within- and between-supertype nucleotide diversity (referred to as πwithin and πst, respectively), and thus estimated a measure of among-supertype variation for each pocket, obtained using the following formula:

$$\pi_{st} = \frac{\pi_{total} - \pi_{within}}{\pi_{total}} \tag{1}$$

Total, within- and between-supertype π values were calculated in two ways: (a) by excluding the non-classified alleles and (b) by including the non-classified alleles as a single group. As the dataset is limited to alleles defined at second field level of resolution, no information about synonymous polymorphism is available. We addressed this problem by applying the same strategy as described by Buhler and Sanchez-Mazas (2011), which consisted in treating as missing data the nucleotide positions which were described as synonymous (Robinson et al. 2015). We excluded sites having more than 5 % missing data.

Testing genetic differentiation between populations based on supertypes

To test whether the levels of genetic differentiation between populations differed from those expected under the null hypothesis that supertypes are equivalent to random sets of alleles, we randomized the assignment of alleles into supertypes and calculated corresponding He and GST values. The randomized assignment of alleles to supertypes was performed using two different approaches (for both the complete and the reduced datasets):

1. By fixing the number of alleles per supertype to that observed in the original dataset
2. Without any constraint on the number of alleles associated to a specific supertype

The randomizations were repeated 10,000 times, and p-values were estimated empirically by determining the number

Table 1 Codon composition of the PBR pockets

Pockets       Codons                                                        Total size in base pairs (bp)
A             5, 7, 59, 63, 66, 99, 159, 163, 167, and 171                  30
B             7, 9, 24, 25, 34, 45, 63, 66, 67, 70, and 99                  33
C, D, and E   9, 70, 73, 74, 97, 99, 114, 147, 152, 155, 156, 159, and 160  39
F             77, 80, 81, 84, 116, 123, 143, 146, and 147                   27

From: Saper et al. (1991)


of randomized datasets with GST values lower or He values higher than those observed for the true data.

Results and discussion

HLA-A and HLA-B supertype frequencies and their geographic distributions

In a previous study (the only one, to our knowledge, except

variation being found among populations of different geographic regions (FCT>FSC; Table 2). The A1 supertype is represented by a small number of alleles, with one or two alleles in more than half of the populations (Fig. 1b) and only one in 14 of them (Fig. 1c). The A2 and A3 supertypes exhibit more even distributions, half of the populations having frequencies ranging from 14 to 29 % for A2 and 14 to 32 % for A3 (Figs. 1a and 2). As a consequence, among the HLA-A supertypes, A2 and A3 present either the lowest or no geographic structure at all (FCT

Fig. 1 Supertype variation. (a) Boxes represent the frequency distributions of the four HLA-A and the five HLA-B supertypes and the “non-classified alleles” NCA and NCB, respectively; (b) each box represents the distribution of the number of distinct alleles of each supertype per population; and (c) the dark gray section of the bars represents the number of populations showing only one allele for the referred supertype (referred to as “monomorphic populations”). The light gray section of the bars represents the number of populations where the referred supertype was not detected


Fig. 2 HLA-A supertype frequencies. Heat map summarizing the frequencies of the four HLA-A supertypes and the non-classified alleles (NCAs). Population names are shown on the right

The frequencies of the HLA-A non-classified alleles (NCAs) vary greatly between populations, ranging from 2 to 14 % in half of them (Fig. 1a). The NCA group presents a strong geographic structure (FCT being twice as much as FSC) and a very high FST value (almost 16 %) (Table 2). The highest

32 % for B44, respectively (Figs. 1a and 3). Both B7 and B44 are observed in all populations (except B7 in the Yami; Figs. 1c and 3), with large numbers of alleles per population (Fig. 1b, c). By contrast, B58 and B62 exhibit very low frequencies, ranging from 0 to 5.8 % and from 2.9 to 18 % in half of the populations, respectively (Fig. 1a). Among the five HLA-B supertypes, B62 presents the highest level of population differentiation (FST=11.38 %, p<0.0001; Table 2), although with no clear geographic structure (FCT

level of population differentiation than B7 and B44 (FST=7.5 %, p<0.0001; Table 2) but no geographic structure (FCT very close to zero; Table 2). Contrasting with what is observed for the NCA, the non-classified alleles for HLA-B (NCB) are


Fig. 3 HLA-B supertype frequencies. Heat map summarizing the frequencies of the five HLA-B supertypes and the non-classified alleles (NCBs). Population names are shown on the right

quite frequent, with frequencies ranging from 10 to 17 % in half of the populations (Fig. 1a). More than 75 % of populations present at least two different NCBs (Fig. 1b), and only two populations lack one of these alleles (Fig. 1c). The NCBs also exhibit a significant geographic structure, although not as strong as for NCA (Table 2).

In summary, based on the observed data, supertypes can be allocated into two main categories: on the one hand, A2, A3, B7, B27, and B44 fit the classical view that supertypes are evenly distributed (Figs. 1a, 2, and 3), poorly structured geographically (Table 2), and represented by a large number of alleles (Fig. 1b, c). On the other hand, A1, A24, B58, and B62 present a greater frequency variation among populations (Figs. 2 and 3 and Table 2), and in some cases significant geographic structure (i.e., for A1 and B58, both being very common in Africa), and are represented by a smaller number of alleles. Although the unclassified alleles have brought noise to the analysis, they should not be ignored. They are a consequence of the functional supertype classification, and they were kept to understand exactly how they influence the variations in HLA-A and HLA-B. As discussed above, the NCA consists of a small group of alleles, which reach high frequencies in island populations. On the other hand, NCB is a more heterogeneous group appearing in almost all populations.

Heterozygosity and interpopulation differentiation

Using both complete and reduced datasets (see “Materials and methods” section), the heterozygosity estimated for the data treated at the allelic level is always larger than that estimated for the data treated at the supertype level (Table 3). This result is expected because alleles are nested within supertypes, and the heterozygosity of the latter is thus constrained to be equal to or smaller than that of the former.
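The nesting constraint described above can be illustrated with a minimal Python sketch (the authors used R scripts that are not shown in the text). It estimates frequencies by direct counting and computes He = 1 − Σp² at both levels. The sample of ten gene copies and the allele-to-supertype map are hypothetical and purely illustrative, and the exact He estimator used in the study (for instance, whether a sample-size correction was applied) is not specified, so the uncorrected form is an assumption. Pooling two alleles with frequencies p1 and p2 replaces p1² + p2² with (p1 + p2)², which can only increase the sum of squares and therefore can only lower He.

```python
from collections import Counter

def frequencies(labels):
    """Relative frequency of each label, estimated by direct counting."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def expected_heterozygosity(freqs):
    """He = 1 - sum(p_i^2), without sample-size correction (an assumption)."""
    return 1.0 - sum(p * p for p in freqs.values())

# Hypothetical sample of 10 gene copies from one population.
sample = ["A*01:01", "A*01:01", "A*02:01", "A*02:01", "A*02:05",
          "A*03:01", "A*11:01", "A*24:02", "A*24:02", "A*26:01"]

# Illustrative allele-to-supertype map (not the full classification).
supertype_of = {
    "A*01:01": "A1", "A*26:01": "A1",
    "A*02:01": "A2", "A*02:05": "A2",
    "A*03:01": "A3", "A*11:01": "A3",
    "A*24:02": "A24",
}

allele_freqs = frequencies(sample)
supertype_freqs = frequencies([supertype_of[a] for a in sample])

he_allele = expected_heterozygosity(allele_freqs)
he_supertype = expected_heterozygosity(supertype_freqs)

print("allele-level He:   ", round(he_allele, 4))
print("supertype-level He:", round(he_supertype, 4))
```

Because the inequality holds for any grouping of alleles, a lower supertype-level He is not in itself evidence for or against selection, which is why the comparison against randomized groupings is needed.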


Table 2 Supertype differentiation indexes among populations (FST), among populations within geographic regions (FSC), and among geographic regions (FCT)

Supertypes   FST           FSC          FCT (a)
A1           9.95 %***     2.67 %***    7.48 %***
A2           4.85 %***     3.40 %***    1.51 %*
A3           6.48 %***     6.48 %***    0.000 (b)
A24          11.14 %***    6.66 %***    4.80 %***
NCA          15.90 %***    4.90 %***    11.56 %***
B7           5.11 %***     3.21 %***    1.97 %*
B27          7.54 %***     7.10 %***    0.47 % (b)
B44          3.21 %***     1.72 %***    1.51 %**
B58          8.34 %***     2.91 %***    5.59 %**
B62          11.38 %***    7.35 %***    4.35 %*
NCB          7.02 %***     2.90 %***    4.24 %***

*p<0.01; **p<0.001; ***p<0.0001, where p values refer to the probability of observing a statistic as extreme under the null hypothesis of no structure
(a) In italics: Values of FCT>FSC, an indication that most of the variation was found among populations of different geographic regions
(b) Not significant value

In order to define the degree to which genetic differentiation, measured by GST between populations, was concordant at the supertype and allelic levels, we estimated the correlation between these measures and tested their significance using Mantel tests. The results suggest that when using the complete population dataset, the patterns of population differentiation observed at the supertype and allelic levels are very similar, especially for HLA-A (r=0.956, p<0.0005; Fig. 4a) but also for HLA-B (r=0.75, p<0.0005; Fig. 4b). The removal of the Pacific, Australian, Taiwanese, and Native American populations provokes an overall drop of both the GST values and their correlations. Despite this decrease, a high correlation coefficient is still observed for HLA-A (r=0.62, p<0.0005; Fig. 4c), whereas the value is much lower for HLA-B (r=0.3, p<0.0005; Fig. 4d). Because Pacific, Australian, Taiwanese, and Native American populations contribute to large differentiation values, lower correlation coefficients were expected after removing them. Furthermore, these populations also exhibit a reduced set of alleles per supertype, which may explain the higher correlations between alleles and supertypes when they are taken into account. The difference between alleles and supertypes is less pronounced for HLA-A which presents a smaller number of alleles per supertype in all populations (Fig. 1b, c).

Table 3 Expected heterozygosity (He) of alleles and supertypes

Loci     Dataset (a)   Average allelic He   Average supertype He
HLA-A    Complete      0.7761               0.6774
HLA-A    Reduced       0.8974               0.7504
HLA-B    Complete      0.8948               0.7577
HLA-B    Reduced       0.9429               0.7766

(a) Complete dataset, all populations; reduced dataset, excluding Pacific, Australian, Taiwanese, and Native American populations

Patterns of molecular variability for different PBR pockets of HLA-A and HLA-B

Our goal in this part of the study was to test the prediction that the B and F pockets of the PBR exhibit the highest levels of variation as a consequence of their crucial role in peptide binding, which is expected to result in a stronger effect of balancing selection.

We first estimated the global levels of variation at the PBR and observed significantly higher levels of nucleotide diversity (πtotal) at HLA-B, compared to HLA-A (p<0.0000005; Wilcoxon rank sum test). Moreover, these two genes differ in the way molecular variation is distributed among the A, B, CDE, and F pockets within the PBR (Fig. 5). The rank order of πtotal is pCDE≫pB≫pA>pF, at HLA-A, and pB≫pF>pCDE≫pA, at HLA-B (where p is an abbreviation for “pocket” and ≫ and > indicate greater than and significant, at the 0.00001 level, and greater than but non-significant differences, respectively, according to a Wilcoxon rank sum test; Fig. 5). Among the HLA-A pockets, most of the variation is found in the CDE pockets, which make up the central region of the PBR, and significantly less in pB (πtotal values ranging from 0.14 to 0.15 and from 0.11 to 0.12 in half of the populations, respectively; Fig. 5). The pA and pF pockets exhibit the smallest levels of variation (πtotal values ranging from 0.07 to 0.09 in half of the populations; Fig. 5). Among the HLA-B pockets, pB exhibits by far the highest variation, with πtotal values ranging from 0.18 to 0.21 in half of the populations, whereas the other pockets exhibit a relatively narrow πtotal distribution (ranging from 0.10 to 0.12 in half of the populations; Fig. 5).

The hypothesis that the pockets B and F are the main targets of balancing selection is thus partially supported for HLA-B, since pB presents by far the highest level of nucleotide diversity. Interestingly, van Deutekom and Kesmir (2015) recently showed that changes involving several of the B pocket’s amino acids had a profound impact on peptide-binding properties, which corroborates our interpretation. On the other hand, pF, which is not significantly different from pA at HLA-A, and from pCDE at HLA-B, does not present an increased value of πtotal which would be evidence against balancing selection. It is important to note that these results were obtained independently from the classification of alleles into supertypes, since the determination of the pockets’ codons was taken from the classical study of Saper et al. (1991).

We also analyzed how the nucleotide diversity was distributed between supertypes. Since the supertype categorization is


Fig. 4 Plots of GST values between populations based on allele (Y axis) and supertype (X axis) frequencies. The correlation (Rxy) and significance were obtained using a Mantel test. Complete dataset, all populations and reduced dataset, excluding Pacific, Australian, Taiwanese, and Native American populations
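The Mantel test behind the correlations in Fig. 4 can be sketched as follows. The study used the ade4 R package; this standard-library Python version is only an illustration of the general permutation logic (relabel the populations of one matrix, recompute Pearson's r, and count how often the permuted r reaches the observed one). The two 4 × 4 GST matrices are hypothetical values, and the number of permutations and the one-sided p value convention are assumptions rather than the settings used by the authors.

```python
import random
from math import sqrt

def upper_triangle(matrix):
    """Values above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def mantel(d1, d2, permutations=999, seed=0):
    """Return (r, p): observed correlation and one-sided permutation p value."""
    rng = random.Random(seed)
    n = len(d1)
    x = upper_triangle(d1)
    r_obs = pearson(x, upper_triangle(d2))
    hits = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        # Permute rows and columns of d2 together (relabel populations).
        permuted = [[d2[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(x, upper_triangle(permuted)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)

# Hypothetical pairwise GST matrices for four populations.
gst_allele = [[0.00, 0.02, 0.08, 0.10],
              [0.02, 0.00, 0.07, 0.09],
              [0.08, 0.07, 0.00, 0.03],
              [0.10, 0.09, 0.03, 0.00]]
gst_supertype = [[0.00, 0.01, 0.05, 0.07],
                 [0.01, 0.00, 0.04, 0.06],
                 [0.05, 0.04, 0.00, 0.02],
                 [0.07, 0.06, 0.02, 0.00]]

r, p = mantel(gst_allele, gst_supertype)
print("Mantel r =", round(r, 3), "p =", round(p, 3))
```

Rows and columns are permuted together because the entries of a distance matrix are not independent observations; shuffling individual cells would overstate the significance.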

based on variations of pB and pF, these pockets were expected to present more differences between supertypes than the others. This prediction was confirmed for pF at HLA-A and pB at HLA-B (Fig. 6).

As pB presents the highest levels of variation at HLA-B and also accounts for most of the differences between HLA-B supertypes, we conclude that the variation between HLA-B supertypes accounts for most of the differences observed between HLA-B alleles. In other words, alleles classified within a same HLA-B supertype share more similarities than alleles assigned to different HLA-B supertypes. By contrast, most of the differences between HLA-A supertypes lie within pF, the pocket presenting the lowest πtotal values for this gene. Therefore, at this locus, the supertypes do not account for most of the variation between alleles (Fig. 6). In other words, HLA-A presents more variation within than between supertypes.

Simulation approach to test selection on supertypes

According to the definition of Sidney et al. (1996), alleles included within the same supertype have overlapping peptide-binding specificities. To test the effects of the supertype classification on expected heterozygosities (He) and pairwise differentiation (GST), we generated null distributions for these two statistics under the hypothesis that alleles within supertypes are a random collection, with no shared functional attributes. To this end, the assignment of alleles to supertypes was randomized by permuting the supertype labels attributed to each allele motif, as described in the “Materials and methods” section. As the same patterns were obtained using the two different simulation approaches (see “Materials and methods” section), we only present the results for the case without any constraint on the number of alleles associated to a specific supertype.

For HLA-A, we do not observe any population with a significant difference in He in contrasts between the real and random supertype assignments. For HLA-B, 6 out of 55 populations exhibit significantly lower He (permutation-based p<0.05) than those acquired via simulations. These six populations belong to the reduced dataset. Because the number of populations with individually significant p values in either direction (i.e., with significantly lower or greater He compared to the simulated value) is small, we investigated whether the distribution of the p values itself was informative regarding selective effects. To do this, we used an exact binomial test to assess whether the observed distribution of p values deviated from one composed of equal numbers of values on either side of 0.5 (the expected proportion of deviation in either direction under the null hypothesis; Fig. 7). For HLA-A, no significant deviation is found (p value>0.05 for both complete and reduced datasets). For HLA-B, however, a significant skew towards p values greater than 0.5 is observed, indicating an overall significant excess of populations with lower He than

Fig. 5 Total nucleotide diversity (πtotal) at HLA-A and HLA-B PBR pockets. Each box represents the distribution of the total nucleotide diversity per pocket for the populations of the complete dataset
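The quantities summarized in Figs. 5 and 6 can be illustrated with a small Python sketch of the formula πst = (πtotal − πwithin) / πtotal. Two points are assumptions, since the text does not spell them out: π is computed here as the frequency-weighted sum of pairwise differences without a sample-size correction, and πwithin is taken to be the average of the within-supertype diversities weighted by supertype frequency. The four six-nucleotide "pocket" sequences, their frequencies, and the supertype labels S1 and S2 are hypothetical.

```python
def pairwise_difference(seq_a, seq_b):
    """Proportion of sites that differ between two equal-length sequences."""
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)

def nucleotide_diversity(freqs, seqs):
    """pi = sum over ordered pairs i != j of x_i * x_j * d_ij (uncorrected)."""
    alleles = list(freqs)
    return sum(freqs[i] * freqs[j] * pairwise_difference(seqs[i], seqs[j])
               for i in alleles for j in alleles if i != j)

def within_group_diversity(freqs, seqs, group_of):
    """Assumed definition: group-frequency-weighted mean of within-group pi."""
    total = 0.0
    for group in sorted(set(group_of.values())):
        members = {a: f for a, f in freqs.items() if group_of[a] == group}
        weight = sum(members.values())
        if weight == 0:
            continue
        relative = {a: f / weight for a, f in members.items()}
        total += weight * nucleotide_diversity(relative, seqs)
    return total

def pi_st(pi_total, pi_within):
    """Among-supertype share of diversity, following the formula in the text."""
    return (pi_total - pi_within) / pi_total

# Hypothetical pocket sequences (6 nucleotides) and allele frequencies.
seqs = {"a1": "AAAAAA", "a2": "AAAAAT", "b1": "TTTTAA", "b2": "TTTTAT"}
freqs = {"a1": 0.25, "a2": 0.25, "b1": 0.25, "b2": 0.25}
group_of = {"a1": "S1", "a2": "S1", "b1": "S2", "b2": "S2"}

total = nucleotide_diversity(freqs, seqs)
within = within_group_diversity(freqs, seqs, group_of)
between_share = pi_st(total, within)
print("pi_total =", round(total, 4))
print("pi_within =", round(within, 4))
print("pi_st =", round(between_share, 4))
```

In this toy example most pairwise differences separate the two groups, so πst is high; a pocket whose variation is mostly shared within supertypes would give a value near zero.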


Fig. 6 Nucleotide diversity between supertypes (πst) at HLA-A and HLA-B PBR pockets. Each box represents the distribution of the nucleotide diversity between supertypes per pocket for the populations of the complete dataset

those obtained through simulations (p value<0.05 and p value<0.005 for complete and reduced datasets, respectively).

For both HLA-A and HLA-B, GST values were not significantly different from those of the randomized data, when using the complete dataset. This is also true when using the reduced dataset for HLA-A but not for HLA-B. Indeed, after removing the Pacific, Australian, Taiwanese, and Native American populations, the observed GST is higher than 98 % of the simulations for HLA-B (Fig. 8). This finding differs from the expectations of Sidney et al. (1996), who predicted an overall decrease of differentiation at the supertype level. However, it is in agreement with our description of the observed data. Indeed, in our simulations, alleles were randomly assigned to supertypes, creating randomized supertypes with similar contents of common and rare alleles. The common alleles are expected to be assigned to different randomized supertypes in most of the simulations because they are less numerous than the rare alleles. Such a pattern is similar to that described for real HLA-A supertypes, which present a low number of common alleles per population (Fig. 1b, c). As discussed above, this pattern also explains the high correlation found between GST values measured at the allelic and supertype levels for this locus (Fig. 4). Finally, as also discussed above for the PBR pockets, less variation is found between than within HLA-A supertypes. This indicates that HLA-A supertypes are composed of heterogeneous sets of alleles with few sequence similarities at pF (Figs. 5 and 6), which explains the similarity between the results based on the observed and randomized data. On the other hand, HLA-B supertypes appear to be composed of alleles sharing more sequence similarities, as shown by the molecular analysis of the PBR pockets (Figs. 5 and 6).

In summary, HLA-B supertypes are sets of alleles with B pocket resemblances, and these similarities can be interpreted directly in terms of peptide presentation profiles because HLA-B supertypes exhibit major differences regarding the chemical properties of pB. Thus, our results showing an increased differentiation at the level of HLA-B supertypes are consistent with an effect of natural selection resulting in local adaptation of populations to different pathogen environments. Through our simulations, the functional grouping of alleles reflected by the HLA-B supertypes is disrupted, creating randomized groups in the same way as described for HLA-A. The frequent allocation of common alleles into different randomized supertypes in the simulations thus provokes both an increase of He and a decrease of population differentiations (GST), when compared with the observed data (Figs. 7 and 8). In agreement with this interpretation, the inclusion of the

Fig. 7 P value distributions obtained through simulations for the expected heterozygosity (He). The p value is defined as the proportion of simulated datasets with He larger than the observed He. The results obtained with the complete (top) and reduced (bottom) dataset are shown
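The two steps behind Fig. 7 can be sketched in Python: the per-population p value, defined in the caption as the proportion of simulated datasets with He larger than the observed He, and the exact binomial test asking whether p values fall on either side of 0.5 in equal numbers. The numeric inputs are hypothetical, and the two-sided convention used here (summing the probabilities of all outcomes no more likely than the observed count) is one common definition and an assumption about what the authors applied.

```python
from math import comb

def empirical_p(observed, simulated):
    """Proportion of simulated values strictly larger than the observed one."""
    return sum(s > observed for s in simulated) / len(simulated)

def binomial_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p value for k successes in n trials."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    # Sum every outcome that is no more probable than the observed count;
    # the small relative tolerance guards against floating-point ties.
    threshold = pmf[k] * (1 + 1e-7)
    return min(1.0, sum(q for q in pmf if q <= threshold))

# Hypothetical example for one population: observed supertype-level He and
# He values from four randomized supertype assignments.
p_one_population = empirical_p(0.70, [0.72, 0.75, 0.69, 0.74])

# Hypothetical p values for ten populations.
p_values = [0.91, 0.62, 0.77, 0.58, 0.83, 0.95, 0.66, 0.71, 0.55, 0.34]
above_half = sum(p > 0.5 for p in p_values)
skew_p = binomial_two_sided(above_half, len(p_values))

print("empirical p for one population:", p_one_population)
print("populations with p > 0.5:", above_half, "of", len(p_values))
print("exact binomial p value:", round(skew_p, 4))
```

Testing the distribution of p values in this way pools weak evidence across populations, so a consistent shift can be detected even when few populations are individually significant.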


Fig. 8 Simulation results for GST. The red line represents the average observed GST. We calculated the average GST value for each simulated step and then determined the significance as the proportion of simulated values smaller than the observed one. The results with the complete (top) and reduced (bottom) datasets are shown
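The randomization summarized in Fig. 8 can be sketched in Python as below. It covers both approaches listed in the “Materials and methods” section (supertype sizes fixed, or no size constraint), collapses allele frequencies into randomized supertypes, and averages pairwise GST over population pairs. Several points are assumptions: GST is computed in its simple uncorrected form, (HT − HS) / HT, rather than with the Nei and Chesser (1983) estimator used in the study; the unconstrained approach draws a label independently for each allele, which may leave a supertype empty; and the three populations, six alleles, and labels S1 to S3 are hypothetical.

```python
import random
from itertools import combinations

def collapse(allele_freqs, group_of):
    """Sum allele frequencies within each (real or randomized) supertype."""
    out = {}
    for allele, freq in allele_freqs.items():
        group = group_of[allele]
        out[group] = out.get(group, 0.0) + freq
    return out

def he(freqs):
    """Expected heterozygosity, 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs.values())

def gst_pair(f1, f2):
    """Uncorrected GST = (HT - HS) / HT for two equally weighted populations."""
    keys = set(f1) | set(f2)
    mean = {k: (f1.get(k, 0.0) + f2.get(k, 0.0)) / 2 for k in keys}
    ht = he(mean)
    hs = (he(f1) + he(f2)) / 2
    return (ht - hs) / ht if ht > 1e-12 else 0.0

def mean_gst(pops, group_of):
    """Average pairwise GST over all pairs of populations."""
    collapsed = [collapse(p, group_of) for p in pops]
    pairs = list(combinations(collapsed, 2))
    return sum(gst_pair(a, b) for a, b in pairs) / len(pairs)

def random_supertypes(group_of, rng, fix_sizes=True):
    """Randomize the allele-to-supertype assignment."""
    alleles = sorted(group_of)
    labels = [group_of[a] for a in alleles]
    if fix_sizes:
        rng.shuffle(labels)  # approach 1: keep the number of alleles per supertype
    else:
        pool = sorted(set(labels))
        labels = [rng.choice(pool) for _ in alleles]  # approach 2: no constraint
    return dict(zip(alleles, labels))

def null_gst(pops, group_of, n_rand=1000, seed=0, fix_sizes=True):
    """Null distribution of mean GST under random supertype assignment."""
    rng = random.Random(seed)
    return [mean_gst(pops, random_supertypes(group_of, rng, fix_sizes))
            for _ in range(n_rand)]

# Hypothetical allele frequencies in three populations.
pops = [
    {"a1": 0.4, "a2": 0.1, "a3": 0.2, "a4": 0.1, "a5": 0.1, "a6": 0.1},
    {"a1": 0.1, "a2": 0.4, "a3": 0.1, "a4": 0.2, "a5": 0.1, "a6": 0.1},
    {"a1": 0.1, "a2": 0.1, "a3": 0.1, "a4": 0.1, "a5": 0.3, "a6": 0.3},
]
group_of = {"a1": "S1", "a2": "S1", "a3": "S2", "a4": "S2", "a5": "S3", "a6": "S3"}

observed = mean_gst(pops, group_of)
null = null_gst(pops, group_of, n_rand=200, seed=1)
proportion_smaller = sum(g < observed for g in null) / len(null)
print("observed mean GST:", round(observed, 4))
print("proportion of simulated values smaller:", proportion_smaller)
```

The study repeated the randomization 10,000 times; the 200 replicates here only keep the example fast.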

Pacific, Australian, Taiwanese, and Native American populations reduces this effect because the patterns of variation at HLA-B for these populations resemble those observed at HLA-A, with a relatively low number of alleles belonging to different supertypes.

Conclusions

The supertype classification of HLA-A and HLA-B alleles has been widely used in medical research, with reports suggesting that supertype-level variation explains susceptibility or resistance to a series of pathogenic diseases (Alencar et al. 2013; Chakraborty et al. 2013; Cordery et al. 2012; Gilchuk et al. 2013; Karlsson et al. 2012, 2013; Kuniholm et al. 2013; Trachtenberg et al. 2003). This classification was proposed in the 1990s as an attempt to find, as described by Sette and Sidney (1999), “the common denominators and similarities hidden within this very large degree of polymorphism.” The same authors also stated that “the overall frequency of each of these supertypes is remarkably high and fairly conserved among very different ethnicities. Thus, there might be some advantage for human populations to present approximately five to ten main binding specificities and that each one of these is maintained at relatively high frequency.” According to our results, the variation among HLA-B supertypes does reflect the functional diversity at this locus and is thus in agreement with the above-mentioned hypothesis. Our results strongly indicate that the B pocket is likely to be the main target of natural selection at HLA-B, as it presents the highest levels of molecular variation and accounts for the main differences in the peptide presentation profiles for this gene. However, in contrast with classical expectations for loci evolving under balancing selection, our simulation results reveal that HLA-B supertype frequencies do not show a signature of balancing selection (i.e., we find lower He compared to those of randomly assigned groups of alleles), implying that each supertype is not maintained at relatively high frequencies in all populations. This result is supported by the geographically heterogeneous distributions of B58 and B62 (and, to a lesser extent, B27) frequencies among populations. Moreover, populations are more differentiated than expected for HLA-B supertypes (higher observed GST values than those obtained from randomly assigned groups of alleles). As most of the differences between HLA-B supertypes lie in the B pocket, this means that the differences in HLA-B supertype composition among populations can be interpreted in terms of peptide recognition. Thus, for HLA-B, our results support the idea that populations present more differences in peptide presentation profiles than expected, possibly due to local adaptations to pathogens.

By contrast, most of the differences between HLA-A alleles are not related to differences at the supertype level. This is supported by our simulation results showing that the randomly assigned groups of alleles often reproduce the observed patterns of variation and differentiation of HLA-A supertypes. Moreover, HLA-A alleles are more conserved at the sites involved in peptide binding, suggesting that they present a more conserved profile of peptides across populations, differing from what is observed for HLA-B. Of note, one possible caveat of inferring peptide binding through the supertype classification is that some peptides presented by HLA class I molecules are known to assume a looping conformation outside the peptide-binding groove. However, whatever conformations a peptide can adopt, the anchor amino acids located at the peptide ends remain the same, limited by the B and F pockets. In this way, this

258 Immunogenetics conformational variability exhibited by the peptides is also a R Development Core Team (2011) R: a language and environment for consequence of the interaction between the peptide anchors statistical computing. Vienna, Austria: the R Foundation for Statistical Computing. ISBN: 3-900051-07-0. Available online at and the B and F pockets and thus is not expected to change the http://www.R-project.org/ results obtained here. Dray S, Dufour AB (2007) The ade4 package: implementing the duality Our results suggest that the B pocket of the HLA-B mole- diagram for ecologists. J Stat Softw 22(4):1–20 cules is the main target of natural selection, whereas no such Excoffier L, Lischer HE (2010) Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and signals could be retrieved for the other HLA-B pockets nor for Windows. Mol Ecol Resour 10(3):564–567 the pockets of the HLA-A molecules in relation to the Gibert M, Sanchez-Mazas A (2003) Geographic patterns of functional supertype classification. This conclusion matches the expec- categories of HLA-DRB1 alleles: a new approach to analyse asso- tations that supertypes are the primary targets of selection for ciations between HLA-DRB1 and disease. Eur J Immunogenet – HLA-B but not for HLA-A. Following this idea, we could 30(5):361 374 Gilchuk P, Spencer CT, Conant SB, Hill T, Gray JJ, Niu X, Zheng M, state that HLA-A supertypes are composed by alleles whose Erickson JJ, Boyd KL, McAfee KJ, Oseroff C, Hadrup SR, Bennink resemblances are not the consequence of a shared phylogenet- JR, Hildebrand W, Edwards KM, Crowe JE, Williams JV, Buus S, ic origin. A future extension of this work could be to explore Sette A, Schumacher TN, Link AJ, Joyce S (2013) Discovering whether the central pockets C, D, and E that have been shown naturally processed antigenic determinants that confer protective T cell immunity. 
J Clin Invest 123(5):1976–1987 to contain most of the variation at HLA-A could be used as an Hedrick PW, Whittam TS, Parham P (1991) Heterozygosity at individual alternate functional classification for these alleles. amino acid sites: extremely high levels for HLA-A and -B genes. Proc Natl Acad Sci U S A 88(13):5897–5901 Hughes AL, Nei M (1988) Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selec- Acknowledgments This work was supported by the Swiss National tion. Nature 335(6186):167–170 Science Foundation (SNSF) grant no. 31003A_144180 to ASM and Jost L (2008) G(ST) and its relatives do not measure differentiation. Mol São Paulo Research Foundation (FAPESP) 12/18010-0 and a CNPq pro- Ecol 17(18):4015–4026 ductivity grant no. 308167/2012-0 to DM. RSF was supported by CNPq (grant no. 142130/2009-5) and CAPES (grant no. 12447/12-9). We also Karlsson I, Kløverpris H, Jensen KJ, Stryhn A, Buus S, Karlsson A, thank two anonymous reviewers for their useful comments. Vinner L, Goulder P, Fomsgaard A (2012) Identification of con- served subdominant HIV type 1 CD8(+) T cell epitopes restricted within common HLA supertypes for therapeutic HIV type 1 vac- cines. AIDS Res Hum Retroviruses 28(11):1434–1443 Karlsson I, Brandt L, Vinner L, Kromann I, Andreasen LV, Andersen P, Open Access This article is distributed under the terms of the Creative Gerstoft J, Kronborg G, Fomsgaard A (2013) Adjuvanted HLA- Commons Attribution 4.0 International License (http:// supertype restricted subdominant peptides induce new T-cell immu- creativecommons.org/licenses/by/4.0/), which permits unrestricted use, nity during untreated HIV-1-infection. Clin Immunol 146(2):120– distribution, and reproduction in any medium, provided you give 130 appropriate credit to the original author(s) and the source, provide a link Kuniholm MH, Anastos K, Kovacs A, Gao X, Marti D, Sette A, to the Creative Commons license, and indicate if changes were made. 
Greenblatt RM, Peters M, Cohen MH, Minkoff H, Gange SJ, Thio CL, Young MA, Xue X, Carrington M, Strickler HD (2013) Relation of HLA class I and II supertypes with spontaneous clear- – References ance of hepatitis C virus. Genes Immun 14(5):330 335 Lawlor DA, Zemmour J, Ennis PD, Parham P (1990) Evolution of class-I MHC genes and proteins: from natural selection to thymic selection. Alencar LXE, Braga-Neto UM, Nascimento EJM, Cordeiro MT, Silva Annu Rev Immunol 8:23–63 AM, Brito CAA, Silva PM, Gil LH, Montenegro SM, Marques Mack S, Sanchez-Mazas A, Meyer D, Single R, Tsai Y et al (2006) 13th Júnior ET Jr (2013) HLA-B*44 is associated with dengue severity International Histocompatibility Workshop Anthropology/Human caused by DENV-3 in a Brazilian population. J Trop Med 2013: Genetic Diversity Joint Report—Chapter 2: methods used in the 648475 generation and preparation of data for analysis in the 13th Apanius V, Penn D, Slev PR, Ruff LR, Potts WK (1997) The nature of International Histocompatibility Workshop. In: Hansen J (ed) selection on the major histocompatibility complex. Crit Rev Immunobiology of the human MHC: Proceedings of the 13th Immunol 17(2):179–224 International Histocompatibility Workshop and Conference. Borghans JA, Beltman JB, De Boer RJ (2004) MHC polymorphism IHWG Press, Seattle, pp 564–579 under host-pathogen coevolution. Immunogenetics 55(11):732–739 Mantel N (1967) The detection of disease clustering and a generalized Buhler S, Sanchez-Mazas A (2011) HLA DNA sequence variation regression approach. Cancer Res 27(2):209–220 among human populations: molecular signatures of demographic Naugler C, Liwski R (2008) An evolutionary approach to major histo- and selective events. PLoS One 6(2):e14643 compatibility diversity based on allele supertypes. Med Hypotheses Chakraborty S, Rahman T, Chakravorty R, Kuchta A, Rabby A, 70(5):933–937 Sahiuzzaman M (2013) HLA supertypes contribute in HIV type 1 Nei M (1987) Molecular evolutionary genetics. 
Columbia University cytotoxic T lymphocyte epitope clustering in Nef and Gag proteins. Press, New York AIDS Res Hum Retroviruses 29(2):270–278 Nei M, Chesser RK (1983) Estimation of fixation indices and gene diver- Cordery DV, Martin A, Amin J, Kelleher AD, Emery S, Cooper DA, sities. Ann Hum Genet 47(Pt 3):253–259 STEAL study group (2012) The influence of HLA supertype on Nunes JM (2014) Using Uniformat and Gene[rate] to analyse data with thymidine analogue associated with low peripheral fat in HIV. ambiguities in population genetics. http://dx.doi.org/10.6084/m9. AIDS 26(18):2337–2344 figshare.984299

259 Immunogenetics

Nunes JM, Buhler S, Roessli D, Sanchez-Mazas A, HLA-net 2013 col- Saper MA, Bjorkman PJ, Wiley DC (1991) Refined structure of the laboration (2014) The HLA-net Gene[rate] pipeline for effective human histocompatibility antigen HLA-A2 at 2.6 A resolution. J HLA data analysis and its application to 145 populations from MolBiol219(2):277–319 Europe and neighbouring areas. Tissue Antigens 83(5):307–323 Sette A, Sidney J (1999) Nine major HLA class I supertypes account for Parham P (2005) MHC class I molecules and KIRs in human history, the vast preponderance of HLA-A and -B polymorphism. health and survival. Nat Rev Immunol 5(3):201–214 Immunogenetics 50(3–4):201–212 Parham P, Benjamin RJ, Chen BP,Clayberger C, Ennis PD, Krensky AM, Sidney J, Grey HM, Kubo RT, Sette A (1996) Practical, biochemical and Lawlor DA, Littman DR, Norment AM, Orr HT et al (1989) evolutionary implications of the discovery of HLA class I Diversity of class I HLA molecules: functional and evolutionary supermotifs. Immunol Today 17(6):261–266 interactions with T cells. Cold Spring Harb Symp Quant Biol Sidney J, Peters B, Frahm N, Brander C, Sette A (2008) HLA class I 54(Pt 1):529–543 supertypes: a revised and updated classification. BMC Immunol 9:1 Prugnolle F, Manica A, Charpentier M, Guégan JF, Guernier V,Balloux F Slade RW, McCallum HI (1992) Overdominant vs. frequency-dependent – (2005) Pathogen-driven selection and worldwide HLA class I diver- selection at MHC loci. Genetics 132(3):861 864 sity. Curr Biol 15(11):1022–1027 Takahata N, Nei M (1990) Allelic genealogy under overdominant and frequency-dependent selection and polymorphism of major histo- QutobN,BallouxF,RajT,LiuH,MariondeProcéS,Trowsdale – J, Manica A (2011) Signatures of historical demography and compatibility complex loci. Genetics 124(4):967 978 Takahata N, Satta Y, Klein J (1992) Polymorphism and balancing selection pathogen richness on MHC class I genes. Immunogenetics – 64(3):165–175 at major histocompatibility complex loci. 
Genetics 130(4):925 938 Trachtenberg E, Korber B, Sollars C, Kepler TB, Hraber PT, Hayes E, Robinson J, Halliwell JA, Hayhurst JD, Flicek P, Parham P, Marsh SG Funkhouser R, Fugate M, Theiler J, Hsu YS, Kunstman K, Wu S, (2015) The IPD and IMGT/HLA database: allele variant databases. Phair J, Erlich H, Wolinsky S (2003) Advantage of rare HLA Nucleic Acids Res 43(Database issue):D423–D431 supertype in HIV disease progression. Nat Med 9(7):928–935 Sanchez-Mazas A, Lemaître JF, Currat M (2012) Distinct evolutionary van Deutekom HW, Kesmir C (2015) Zooming into the binding groove of strategies of human leucocyte antigen loci in pathogen-rich environ- – HLA molecules: which positions and which substitutions changes ments. Philos Trans R Soc Lond B Biol Sci 367(1590):830 839 peptide binding most? Immunogenetics 67(8):425–436

Appendix A.3.

Copy of the article “Kiwi genome provides insights into evolution of a nocturnal lifestyle”: Genome Biology (2015), 16(1): 1-15. In this work, I performed the dN/dS-based selection tests using the PAML package (and/or supervised their execution and interpretation) and was responsible for the discussion of the results of these analyses in the article. I also took part in the analyses of the ultra-conserved regions (ultra-conserved non-coding elements) that show more variation than expected in kiwi, indicating possibly altered developmental pathways in this species. Finally, I contributed corrections to the manuscript and discussions of the evolutionary aspects of the work.

Le Duc et al. Genome Biology (2015) 16:147 DOI 10.1186/s13059-015-0711-4

RESEARCH Open Access

Kiwi genome provides insights into evolution of a nocturnal lifestyle

Diana Le Duc1,2*, Gabriel Renaud2, Arunkumar Krishnan3, Markus Sällman Almén3, Leon Huynen4, Sonja J. Prohaska5, Matthias Ongyerth2, Bárbara D. Bitarello6, Helgi B. Schiöth3, Michael Hofreiter7, Peter F. Stadler5, Kay Prüfer2, David Lambert4, Janet Kelso2 and Torsten Schöneberg1*

Abstract

Background: Kiwi, comprising five species from the genus Apteryx, are endangered, ground-dwelling bird species endemic to New Zealand. They are the smallest and only nocturnal representatives of the ratites. The timing of kiwi adaptation to a nocturnal niche and the genomic innovations, which shaped sensory systems and morphology to allow this adaptation, are not yet fully understood.

Results: We sequenced and assembled the brown kiwi genome to 150-fold coverage and annotated the genome using kiwi transcript data and non-redundant protein information from multiple bird species. We identified evolutionary sequence changes that underlie adaptation to nocturnality and estimated the onset time of these adaptations. Several opsin genes involved in color vision are inactivated in the kiwi. We date this inactivation to the Oligocene epoch, likely after the arrival of the ancestor of modern kiwi in New Zealand. Genome comparisons between kiwi and representatives of ratites, Galloanserae, and Neoaves, including nocturnal and song birds, show diversification of kiwi’s odorant receptors repertoire, which may reflect an increased reliance on olfaction rather than sight during foraging. Further, there is an enrichment of genes influencing mitochondrial function and energy expenditure among genes that are rapidly evolving specifically on the kiwi branch, which may also be linked to its nocturnal lifestyle.

Conclusions: The genomic changes in kiwi vision and olfaction are consistent with changes that are hypothesized to occur during adaptation to nocturnal lifestyle in mammals. The kiwi genome provides a valuable genomic resource for future genome-wide comparative analyses to other extinct and extant diurnal ratites.

* Correspondence: [email protected]; [email protected]
1Institute of Biochemistry, Medical Faculty, University of Leipzig, Johannisallee 30, Leipzig 04103, Germany
Full list of author information is available at the end of the article

© 2015 Le Duc et al. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

New Zealand’s geographic isolation, after the separation from Gondwana around 80 million years ago, provides an unequaled opportunity to study the results of evolutionary processes following geographic isolation. In New Zealand, the ecological niches typically occupied by mammals in most other parts of the world are dominated by birds. Kiwi (genus Apteryx), the national symbol of New Zealand, belong to a group of flightless birds, the ratites. This group is geographically broadly distributed including both extant members, which are the ostrich in Africa, the emu in Australia, the cassowary in New Guinea, and the rhea in South America, and, as extinct members, the moa from New Zealand and the elephant birds from Madagascar. New Zealand is thus the only landmass to have been inhabited by two ratite lineages. Strikingly, the two lineages are highly divergent in size with moa having a body size of up to 3 m [1] while kiwi, the smallest of the ratites, reaches only the size of a chicken. Moreover, while moa occupied the diurnal niche, kiwi are the only ratites, and one of only a few bird lineages (less than 3 % of the bird species [2]), that are nocturnal. Although the kiwi eye is unusually small for a nocturnal bird, it has a nocturnal-type retina [3]. This may indicate that the nocturnal adaptation of kiwi is recent, or alternatively, that changes in eye size are not a prerequisite for nocturnality.

We have sequenced and assembled the genome of Apteryx mantelli, the North Island brown kiwi, to improve


our understanding of how genomic features evolve during adaptation to nocturnality and the ground-dwelling niche. We have also sequenced the transcriptome from embryonic tissue to provide support for the genome annotation. We identified genomic changes in kiwi that affect physiological functions, including vision and olfaction, which have been predicted to characterize nocturnal adaptation in the early history of mammals [4].

Results

Genome sequencing, assembly, and annotation

We prepared 11 libraries with several insert sizes from Apteryx mantelli genomic DNA and sequenced 83 billion base pairs (Gb) from small insert-size libraries and 120 Gb from large-insert mate-pair Illumina libraries (Additional file 1: Table S1). After read correction [5] we assembled contigs and scaffolds using SOAPdenovo [6] (Additional file 1: Note: Filtering and read correction; Genome assembly) to generate a draft assembly, which spanned 1.595 Gb (Additional file 1: Tables S2 and S3). The N50s of contigs and scaffolds were 16.48 kb and 3.95 Mb, respectively (Additional file 1: Table S3). Since the size of the kiwi genome is unknown, we estimated average coverage using a 19-mer frequency distribution (Additional file 1: Figure S1) which yielded a genome size estimate of 1.65 Gb, placing the kiwi among the largest bird genomes sequenced to date [7] (Table 1; Additional file 1: Table S4). The assembled contigs and scaffolds cover approximately 96 % of the complete genome with an average sequence coverage of 35.85-fold after correction (Additional file 1: Note: Filtering and read correction). Assembly quality was assessed by chaining the kiwi scaffolds to two Sanger-sequenced bird genomes: chicken [8] and zebra finch [9]. A total of 50.09 % (0.8 Gb) of the kiwi genome is alignable in syntenic chains to 79.67 % of the much smaller chicken genome (1.07 Gb). A similar fraction, 57.61 % (0.9 Gb), of the kiwi sequence was alignable to 76.92 % of the zebra finch genome (1.2 Gb) (Additional file 1: Table S5). For comparison, 69.86 % (0.84 Gb) of the zebra finch genome is syntenically alignable to 83.51 % of the chicken genome. However, 91.96 % of the zebra finch sequences that are syntenic-chain-alignable to chicken showed conserved synteny in kiwi, suggesting that the kiwi genome assembly includes the majority of conserved regions between birds.

We identified a set of 27,876 genes following de novo gene prediction on the assembled genome (Additional file 1: Note: De novo gene prediction and gene annotation). To refine these gene annotations we used 47.5 Gb of transcript sequence data from kiwi embryonic tissue together with the de novo gene predictions and protein evidence from three well-annotated bird species (G. gallus, T. guttata, M. gallopavo) as input to the MAKER genome annotation pipeline [10]. A validated set of 18,033 genes was selected based on their alignment to orthologous genes in other birds and on supporting evidence provided by kiwi transcript sequences. In total, the gene models spanned 306.62 Mb of the assembly, with exons accounting for 23.96 Mb (approximately 1.6 %) of the total kiwi genome.

Table 1 Kiwi genome assembly characteristics and genomic features compared with other avian genomes (see Additional file 1: Table S4)

Species                      Size of assembly (Gb)   N50 scaffolds (Mb)   Heterozygous SNP rate per kb
Apteryx mantelli             1.59                    4                    1.5
Falco cherrug [17]           1.18                    4.2                  0.8
Falco peregrinus [17]        1.17                    3.9                  0.7
Taeniopygia guttata [9]      1.2                     10.4                 1.4
Ficedula albicollis [90]     1.13                    7.3                  3.03
Anas platyrhynchos [18]      1.1                     1.2                  2.61
Gallus gallus [8]            1.07                    15.5                 4.5
Meleagris gallopavo [91]     0.93                    1.5                  ~1.36

Evolution of gene families

Gene family expansion and/or contraction have been proposed as important mechanisms underlying adaptation [11]. We explored patterns of protein family expansions and contractions in kiwi and used TreeFam [12] to define gene families in the kiwi and all bird and reptile genomes in Ensembl 73, as well as two nocturnal birds (barn owl, chuck-will’s-widow), two other ratites (ostrich, tinamou) [7] (GigaDB [13]), two mammals (human, mouse), and one fish (stickleback) (Ensembl 73 [14]). In total we identified 10,096 gene families shared between the inferred ancestral state and the 16 species considered, of which 623 represent single-gene families. For these single-gene families we constructed a maximum-likelihood phylogeny [15] (Fig. 1) and tested for changes in ortholog cluster sizes. In accordance with previous estimates, our results indicate a net gene loss on the avian branch [16].

Changes of gene-family sizes have been inferred for multiple de novo assembled genomes [17, 18]. However, many of these genomes have rather fragmented assemblies (Table 1); thus, results should be interpreted cautiously, only after manual inspection and ideally independent experimental confirmation.

We therefore manually examined the 130 gene families that had either significant expansion or contraction specifically to the kiwi branch. After excluding expansions that were caused by fragmentation of the assembly [19] only 85 gene families remained significant (Additional file 1: Table S6). Of these, 63 gene families are expanded in the kiwi. An analysis of gene family functions [20] showing expansion in kiwi identified enrichment in categories including signal transduction, calcium homeostasis,


and motor activity (FDR <0.0001, Additional file 1: Figure S2A). Among the gene families that show contraction on the kiwi branch we found an enrichment of development-related Gene Ontology (GO) categories (FDR <0.0001, Additional file 1: Figure S2B).

Fig. 1 Phylogenetic tree of 16 species built on 623 TreeFam [12] single-gene families. Branch lengths are scaled to estimate divergence times. All branches are supported by 100 bootstraps. The song bird clade is depicted in blue, Galliformes in purple, Anseriformes in green, and nocturnal birds in red. Ratites (Struthio camelus and Apteryx mantelli) and Tinamus guttatus are highlighted in light green. The number of genes gained (+ red) and lost (− blue) is given underneath each branch. The rate of gene gain and loss for the clades derived from the most common recent ancestor was estimated [77] to 0.0007 per gene per million years

Diversification of tetrapods and the colonization of terrestrial habitats are often accompanied by changes of physiological systems specifically in cellular signal transduction [21]. Membrane proteins are involved in cellular signaling, hence we aimed to determine more specifically which classes of membrane-expressed proteins have undergone changes in the number of coding genes. To this end we annotated the membrane proteome in kiwi, human, all birds, and reptiles present in Ensembl 74, two additional ratites (ostrich and tinamou) and two nocturnal birds (chuck-will’s-widow and barn owl) (Additional file 1: Note: Detection and classification of the membrane proteome; Additional file 1: Table S7). We manually inspected the classes which showed expansion in kiwi, to ensure that the higher number of predicted genes is not a result of assembly fragmentation. We found a significant expansion in kiwi of genes coding for adhesion and immune-related proteins (Additional file 1: Table S7). Additionally, we found a significant expansion of the Ephrin kinases class, which are functionally involved in the development of the sensory-motor innervation of the limb [22] and later on in tendons condensation and developing feather buds [23].

Patterns of natural selection

To determine whether any branch-specific selection is present in kiwi we estimated branch ω-values (Ka/Ks substitution ratios) for 4,152 orthologous genes in eight bird species: kiwi, ostrich, tinamou, chuck-will’s-widow, barn owl, chicken, zebra finch, and turkey using CODEML [24]. Ortholog assignment was based on the orthology relation among chicken, zebra finch, and turkey defined in Ensembl 73 (Additional file 1: Note: Orthologs and Ka/Ks calculation). The kiwi average ω across all the orthologs is comparable to that in ostrich, and higher than in tinamou and night birds (0.291, 0.313, 0.145, 0.202, and 0.200 for kiwi, ostrich, tinamou, chuck-will’s-widow, and barn owl, respectively). This implies a relatively faster overall rate of functional evolution in kiwi and ostrich.

In addition to gene-family expansions/contractions, we used evidence of branch-specific selection to identify genes and functional pathways that may underlie kiwi-specific adaptations. For the 4,152 orthologous genes in the eight bird species we used the branch models from CODEML to perform likelihood ratio tests [24], comparing a simple model of one ω for all sites and branches versus a model where kiwi is defined as the foreground branch and the other birds as background. We first considered genes with a significantly higher ω on the kiwi branch than that in all other birds (LRT >3.84, significance at 5 %, 1 degree of freedom). Functional enrichment


using GO [20] categories was tested using a hypergeometric test (Additional file 1: Note: Gene ontology and rapidly evolving genes). The same test was performed on genes evolving significantly slower in kiwi. To assign functional categories as either kiwi-specific, or shared with other ratites or nocturnal birds, a similar procedure was performed for each species of Palaeognathae (ostrich, tinamou) and night birds (chuck-will’s-widow, barn owl) by assigning each in turn as the foreground branch in CODEML.

After multiple testing correction using family-wise error rate none of the categories remained significant. For further analysis we considered only GO categories that had (1) a P value <0.05; (2) at least three significantly changed genes; and (3) the number of significant genes was at least 5 % of the total genes annotated in the GO category. GO categories that were over-represented (P value <0.05) on the kiwi branch, but not present in any of the other considered species, were identified as potentially kiwi-specific changes (Additional file 1: Note: Gene ontology and rapidly evolving genes). Notably, faster-evolving categories present in kiwi, but absent in any of the other species, are related to mitochondrion, feeding behavior and energy reserve metabolic process, visual perception, and eye photoreceptor cell differentiation (Additional file 1: Table S8A). Sensory perception of light stimulus is a faster evolving category shared, surprisingly, with the ostrich (Additional file 1: Table S8B). Among slower evolving categories, the mitochondrial outer membrane was one of the kiwi-specific categories (Additional file 1: Table S9A), while anion channel activity was a shared category with chuck-will’s-widow (Additional file 1: Table S9B). For the potentially biological meaningful categories which could explain kiwi-specific physiology we extracted the genes clustering in the node. GO categories have a high potential to deliver false-positive enrichment, which could be considered biologically meaningful a posteriori [25]. Therefore, future studies need to verify the adaptive functionality of genes belonging to the respective category (Additional file 1: Tables S8C and S9C).

It has been proposed that, in a nocturnal environment, genes involved in circadian rhythm have been under selective pressure [4]. Our species-specific selection screens did not identify circadian rhythm-related categories to be enriched for changed genes in either kiwi or the other nocturnal birds. However, since mutations in even a single gene may be relevant, we analyzed more closely biorhythm regulators from the neuropsin gene family. Encephalopsin (OPN3), melanopsin (OPN4-1), and neuropsin (OPN5) showed a similar ω in kiwi and the other branches and no obvious alterations could be detected in the sequence (Table 2). Similar to chicken [26], kiwi and the other tested birds have a duplication of the melanopsin gene (OPN4-2), which displayed significant signals of

Table 2 Annotated opsins in the Apteryx mantelli genome

AptMant0 annotation ID                                 External gene ID   Description                                         ω background   ω Apt. mantelli   LRT
augustus_masked-scaffold541-abinit-gene-7.0-mRNA-1     RHO                No obvious alteration                               0.044          0.14913           6.128*
augustus_masked-scaffold1311-abinit-gene-0.1-mRNA-1    OPN1LW             Partial sequence TM7                                0.15601        0.59702           1.503
maker-scaffold728-augustus-gene-1.2-mRNA-1             OPN1MW             Deleterious mutation Glu3.49Lys                     0.02093        0.26785           44.951*
augustus_masked-scaffold1068-abinit-gene-0.2-mRNA-1    OPN1SW†            Partial sequence, deleterious mutation Glu6.30Gly   0.03815        0.19244           5.162*
augustus_masked-scaffold9587-abinit-gene-0.0-mRNA-1    SWS2††             Partial sequence                                    0.02045        0.0001            0.514
maker-scaffold19-augustus-gene-28.1-mRNA-1             OPN3               No obvious alteration                               0.10965        0.54221           3.211
augustus_masked-scaffold39-abinit-gene-55.0-mRNA-1     OPN4-1             No obvious alteration                               0.14205        0.23127           2.733
augustus_masked-scaffold122-abinit-gene-6.0-mRNA-1     OPN4-2             No obvious alteration                               0.18597        2.57434           8.194*
maker-scaffold597-augustus-gene-1.2-mRNA-1             OPN5               No obvious alteration                               0.07114        0.0001            1.733
augustus_masked-scaffold1987-abinit-gene-3.0-mRNA-1    opsin-VA-like      No obvious alteration                               0.31735        0.26196           0.035

LRT = likelihood ratio testing with one degree of freedom, between the null model (model = 0) and a model where the kiwi branch differs from other birds: chicken, turkey, zebra finch, chuck-will’s-widow, barn owl, tinamou, and ostrich (model = 2), implemented in CODEML from the PAML package [24]. Extended selection analysis in which nocturnal birds, ostrich, and tinamou are sequentially appointed as foreground branch are presented in Additional file 1: Table S10.
*P value <0.05
†Tested on orthologs in Tinamus guttatus, Antrostomus carolinensis, Taeniopygia guttata, Gallus gallus, and Apteryx mantelli (not present in Struthio camelus and Tyto alba assemblies)
††Tested on orthologs in Chlamydera nuchalis, Chlamydera maculata, Sericulus chrysocephalus, Ptilonorhynchus violaceus, Scenopoeetes dentirostris, Ailuroedus crassirostris, Falco cherrug, Columba livia, and Apteryx mantelli
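The relation between the LRT column of Table 2 and its significance asterisks can be illustrated with a short calculation. The following is a minimal sketch, assuming only what the table footnote states (the statistic is compared against a chi-square distribution with one degree of freedom, at the 5 % level). The function name `lrt_p_value` and the dictionary `table2_lrt` are illustrative and are not taken from the article or from PAML; the values are copied from the LRT column of Table 2.

```python
import math


def lrt_p_value(lrt_statistic: float) -> float:
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom.

    For 1 df the survival function has the closed form erfc(sqrt(x / 2)),
    so no external statistics library is needed.
    """
    if lrt_statistic < 0:
        raise ValueError("the LRT statistic cannot be negative")
    return math.erfc(math.sqrt(lrt_statistic / 2.0))


# LRT values copied from Table 2 (kiwi as foreground branch).
table2_lrt = {
    "RHO": 6.128,
    "OPN1LW": 1.503,
    "OPN1MW": 44.951,
    "OPN1SW": 5.162,
    "SWS2": 0.514,
    "OPN3": 3.211,
    "OPN4-1": 2.733,
    "OPN4-2": 8.194,
    "OPN5": 1.733,
    "opsin-VA-like": 0.035,
}

# Genes whose kiwi-branch model fits significantly better at the 5 % level.
significant = [gene for gene, lrt in table2_lrt.items() if lrt_p_value(lrt) < 0.05]

if __name__ == "__main__":
    for gene, lrt in table2_lrt.items():
        print(f"{gene:15s} LRT={lrt:7.3f} p={lrt_p_value(lrt):.4g}")
    print("significant at 5 %:", significant)
```

Under these assumptions the genes flagged by the sketch are the same four that carry an asterisk in Table 2, and a statistic of 3.84 corresponds to a p-value of about 0.05, which is the cutoff quoted in the text.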


positive selection in kiwi but not in the other nocturnal birds. However, a branch-site selection analysis of this gene did not show any significant positively selected sites (Additional file 1: Note: Vision analysis).

Kiwi sensory adaptations – vision

Nocturnality is accompanied by a number of specific changes, including adaptations in visual processing [4]. In contrast to most nocturnal animals, which have large eyes relative to their body size, kiwi have small eyes and reduced optic lobes in the brain [27]. However, the kiwi retina has a higher proportion of rods than cones, which is consistent with adaptation to nocturnality [3]. Besides black/white vision mediated via rhodopsin (RHO), most birds have trichromatic or tetrachromatic vision, for which various additional opsins are responsible: OPN1LW (red), OPN1MW (green, RH2), OPN1SW (blue, subtypes SWS1, SWS2) [28]. We identified these genes in the kiwi assembly.

The RHO gene in kiwi shows no interruption and no obvious function-impairing amino acid changes compared to other vertebrates. We were able to assemble only a partial sequence of the red opsin OPN1LW (transmembrane (TM) helix 7) and found no previously described deleterious amino acid changes within this region [29].

In the green opsin, OPN1MW, we identified a Glu134 to Lys substitution (relative position 3.49 in the Ballesteros and Weinstein nomenclature) in the highly conserved D/ERY motif of this rhodopsin-like GPCR. We confirmed this mutation in a second Apteryx mantelli individual, as well as in other kiwi species (Fig. 2). To determine whether the change is kiwi-specific we sequenced this domain of OPN1MW in other ratites, including the extinct moa. We found that Glu3.49 is 100 % conserved in all birds for which sequence was available and also in over 250 other vertebrate orthologs. Previous experimental analysis showed that mutation of Glu3.49 to Arg – another basic amino acid – results in a non-functional receptor protein [30]. Furthermore, the Asp or Glu in the D/ERY motif is also highly conserved in most other rhodopsin-like GPCRs and the identical mutation of Glu3.49 to Lys in the thromboxane A2 receptor, for example, prevents the receptor from being functionally expressed on the plasma membrane [31].

Similarly, at the N-terminal end of TM6 in OPN1SW we identified a highly conserved Glu6.30 which is present in all bird orthologs sequenced so far, except for kiwi OPN1SW where Glu6.30 is substituted by Gly. Previous functional characterization has shown that mutation of Glu6.30 destabilizes the H-bond network, resulting in constitutively active opsins and other rhodopsin-like GPCRs [32, 33]. A constitutively active opsin is functionally incapable of light signal transmission [34] and is therefore non-functional.

Besides these two functionally well-characterized positions, we identified several other amino acid substitutions in kiwi OPN1MW and OPN1SW. Further, tests for branch and branch-site specific ω values for OPN1MW and OPN1SW on the kiwi branch showed no evidence for positively selected sites in kiwi (Additional file 1: Note: Vision analysis), suggesting that the greater ω values for kiwi are likely due to loss of constraint on these genes. Hence these genes are likely to be drifting and, considering the fact that only 8 % of all inactivating mutations in GPCRs are stop codons while almost 65 % are missense mutations [35–37], the described loss-of-function mutations in OPN1MW and OPN1SW render color vision of kiwi, unlike for other sequenced ratites (Fig. 2), absent – at least for the green and blue spectral ranges.

We tentatively dated the opsin-loss-of-function event as an indicator of the timing of adaptation to the nocturnal niche. Assuming that the loss of constraint happened on the kiwi branch in a short period of time and changed the rate of selection, measured by the ω value, from the average over bird lineages (0.021 for OPN1MW and 0.014 for OPN1SW, Table 2) to the neutral ω value of 1, the loss of function was dated to 30–38 million years ago (Additional file 1: Note: Vision analysis), which places the event shortly after the arrival of kiwi in New Zealand [38].

Kiwi sensory adaptations – olfaction

Kiwi are unique among birds in having nostrils present at the end of their prominent beaks and have been reported to depend largely on tactile and olfactory senses for foraging [39]. To investigate whether the genome shows signs of olfactory adaptation in kiwi we assessed the numbers of olfactory receptor (OR) genes [40] and the diversity in the OR sequence [41].

The only previous approach to molecular characterization of the olfactory system in kiwi was based on PCR amplification of ORs with degenerate primers [42]. This allowed only a rough estimation of the number of ORs of 478 genes (95 % confidence interval 156–1,708 genes). PCR with degenerate primers only produces incomplete fragments of the genes and hence the accurate quantification of gene families with highly similar sequences, as in the case of ORs, is prone to over-estimation [43]. In contrast, de novo genome assembly facilitates a global assessment of the gene repertoire [44] and can therefore be used to provide a more accurate estimate of the OR repertoire. We thus annotated the OR genes in kiwi, as part of the entire membrane proteome, on the basis of putative functionality and seven transmembrane helices (7TM) (Additional file 1: Note: Olfactory receptor genes identification and annotation). The number of non-OR receptor


Fig. 2 Protein sequence comparison revealed substitutions of Glu3.49 to Lys (E/DRY motif) and Glu6.30 to Gly in kiwi OPN1MW (RH2) and kiwi OPN1SW, respectively. Both residues are 100 % conserved in all birds sequenced so far and over 100 publicly available sequences of other vertebrate OPN1MW and OPN1SW orthologs. To assure the OPN1MW-change is kiwi-specific additional ratites were sequenced, including different kiwi species and the extinct moa. Glu3.49 of the E/DRY motif and Glu6.30 at the N-terminal end of helix 6 are parts of an ‘ionic lock’ interhelical hydrogen-bond network which is highly conserved in many rhodopsin-like GPCRs. Nb – North Island brown kiwi, Ob – Okarito brown kiwi, Gs – Great spotted kiwi, Ec – Emeus crassus (Eastern moa), Pg – Pachyornis geranoides (Mappin’s moa), Chuck-will – Chuck-will’s-widow

families was comparable to other avian species, suggesting that the membrane proteome is well annotated in kiwi (Additional file 1: Table S7). This analysis revealed an initial set of 82 OR genes in the kiwi genome. However, ORs are highly duplicated across the genome and such regions could be prone to being overcollapsed during the assembly process. We therefore estimated the copy number of each annotated OR using a correction based on coverage. To obtain the correction factor for each OR, read-coverage in the OR region was divided by the genome-wide average coverage corresponding to its GC bin. Following this correction we estimated that up to 141 OR genes are present in the kiwi genome, of which 86 encode for full-length receptors while the rest are most likely pseudogenes due to frameshifts, premature stop codons, or truncations (Additional file 1: Note: Olfactory receptor genes identification and annotation). The estimated proportion of intact ORs among all OR genes in kiwi (61 %) is lower than previously reported for Apteryx australis [42] (78.6 %), but much higher than in zebra finch (38 %) [45].

Comparative analysis of the OR repertoire shows that the kiwi genome has both the α and the γ subgroups of type 1 OR genes, as reported for other bird genomes


sequenced so far [45]. Unlike the majority of other birds analyzed so far, kiwi has a higher number of γ subgroup ORs. Gene family size estimates are highly dependent on genome quality [46] and continuous curation is ongoing even for well-annotated genomes: for example, in the chicken olfactory repertoire the number of annotated ORs changed by a factor of eight in two consecutive Ensembl releases (release 73 – 251 ORs and release 74 – 30 ORs). Further improvements in genome quality, including that of kiwi, are therefore required for the identification of a complete set of ORs. Thus, a correlation between olfactory acuity and the number of ORs in different birds could be subject to error.

Phylogenetic comparison of OR repertoires suggests that γ ORs within bird and reptile genomes exhibit contrasting evolutionary rates. Tree topology suggests that γ ORs in a few birds and reptiles show a species-specific clustering pattern (Fig. 3). This pattern was previously described in birds and it was suggested that these receptors have undergone adaptive evolution with respect to the occupied environmental niche [45]. However, a few γ ORs belonging to kiwi cluster with their reptilian counterparts, while some cluster basal to the clade containing most bird γ ORs (Fig. 3).

Phenotypic diversity in olfaction is, in part, attributable to genetic variation with a wider range of odors thought

Fig. 3 Maximum likelihood (ML) tree constructed using full-length intact α and γ group olfactory receptors from 10 birds (chicken, zebra finch, flycatcher, duck, turkey, chuck-will’s-widow, barn owl, ostrich, tinamou, and kiwi) and two reptile genomes (anole lizard and Chinese soft-shell turtle). The ML topology shown above was cross-verified using the neighbor joining (NJ) method. Three Class A (Rhodopsin) family GPCRs from chicken genome, dopamine receptor D1 (DRD1), dopamine receptor D2 (DRD2), and histamine receptor H1 (HRH1) were used as the out-group (shown as non-olfactory receptors). The red dot indicates confidence estimates (% bootstrap from 500 resamplings, >90 % bootstrap support from both ML and NJ methods) for the nodes that distinguish α and γ ORs. The scale bar represents the number of amino-acid substitutions per site. The topology supports lineage specific expansions of γ group olfactory genes in the bird and the reptile species. Note, a few of the γ group ORs in kiwi cluster with reptilian ORs (highlighted by orange arrowhead), while some cluster basal to the clade containing bird ORs (highlighted by green arrowhead). The topology supports contrasting evolutionary rates within the analyzed γ ORs, as indicated by short (blue arc with arrowheads) and long branch lengths (pale orange arc with arrowheads). The inset shows the number of intact olfactory receptors in each species that are analyzed using the ML tree topology


to be detectable given more genetic variation [41]. Since the absolute number of ORs might be a poor predictor of olfactory abilities, we investigated the variation in the γ ORs sequence as a measure of the range of possible detectable odors. The average protein sequence entropy was calculated to check for variation within the γ-c clade in each species (Additional file 1: Note: γ-c clade OR within-species protein sequence entropy).

Previous studies have shown that Shannon entropy (H) analysis is a sensitive tool for estimating the diversity of a system [47, 48]. For protein sequence, H ranges from 0 (only one residue is present at that position in the multiple sequence alignment) to 4.322 (all 20 residues are equally represented in that position). Typically H ≤2 is attributed to high conservation [49]. H values in birds were in the range of 0.34±0.05 (zebra finch) to 1.11±0.12 (chicken). The average entropy in kiwi sequences was 1.23±0.15, significantly higher than all other bird species investigated (P value = 0.003 Wilcoxon Signed-Rank test, Additional file 1: Note: γ-c clade OR within-species protein sequence entropy). We conclude that overall the γ-c clade of ORs are highly similar in sequence, in accordance with previously published data [45]. However, since detection of a wider range of odors is correlated to genetic variation of ORs [41], the significantly higher H in kiwi ORs is suggestive for a broad odor acuity in this species in comparison to other birds.

Kiwi morphology

The most prominent phenotype of kiwi, lack of wings, has been linked to energy conservation [50] and to the limited resources in New Zealand in late Oligocene [51]. Like most ratites, kiwi are flightless, but the phylogenetic tree of Palaeognathae implies that this phenotype evolved several times independently in this order [38]. Unlike ostriches and rheas, which possess prominent wings, kiwi show only vestigial invisible wings, while moa lack even vestiges [52].

To determine whether we can identify the genetic basis for the extremely regressed wings in kiwi we annotated genes in the highly conserved signaling pathways related to limb development (Additional file 1: Note: Kiwi morphology analysis; Additional file 1: Figure S3). These include genes belonging to the FGFs, TBX cluster, HOX cluster (Additional file 1: Figure S4; Additional file 1: Table S11), WNT, SALL, and FIBIN genes, known to be responsible for limb and wing development [53] (Additional file 1: Table S12). Growth and transcription factors typically influence the development of both upper and lower limbs, while FIBIN is currently the only gene described to be exclusively involved in the development of the upper limb [53].

For these clusters of genes, we aligned corresponding orthologs and translated multiple alignments, which were then manually inspected. No insertions, deletions, and/or stop codons that would clearly disrupt the open reading frame could be identified in the inspected genes. Additionally, we found all 39 HOX genes expected for the Sauropsid ancestor [54] and investigation of regulatory sequences within the HOX clusters by phylogenetic footprinting showed no preferential loss of conserved DNA elements in Apteryx mantelli compared to Galliformes (Additional file 1: Figure S4; Additional file 1: Table S11).

To detect signs of different evolution in kiwi wing and tail developmental genes we performed a selective constraint analysis using the CODEML branch test (Additional file 1: Note: Selection analysis on limb development genes; Additional file 1: Table S12). Of these genes FIBIN was the only gene that showed signals of positive selection on the avian tree including chicken, turkey, and zebra finch (Additional file 1: Figure S5). Three sites with signs of positive selection that were 100 % conserved in the other species show a different amino acid in kiwi: exchanges of Ser136Ala, Gln148Arg, and Phe162Cys (positions are relative to the mouse Fibin coding sequence). The functional relevance of these substitutions is unclear and needs to be studied when experimental tests of FIBIN function become available.

Since no obvious alterations could be found in the coding sequences of genes involved in developmental processes, which could explain the regressed-wing morphology of kiwi, we further analyzed ultra-conserved non-coding elements (UCNEs) (Additional file 1: Note: Ultra-conserved non-coding elements analysis). UCNEs are defined as DNA non-coding regions of ≥95 % sequence identity between human and chicken, longer than 200 bp [55]. The majority of UCNEs cluster in genomic regions containing genes coding for transcription factors and developmental regulators [56] and experimental studies in transgenic animals have shown that some of these sequences can act as tissue-specific enhancers during developmental processes [57]. Of the 4,351 UCNEs annotated in UCNEbase [55], 19 showed more than the expected 5 % sequence variation as defined in the database [55] (Additional file 1: Table S13). Among these, four were related to HOXA, TBX2, Sp8, and TFAP2A genes which have been previously described in limb development pathways [53, 58, 59], suggesting that changes in non-coding elements could be involved in kiwi’s loss of wings.

Discussion

With their small body size, extremely large egg size, nocturnal life style, and prominent nostrils at the end of their beaks, among several other traits, kiwi represent probably the most unusual member of the ratites [60]. A


recent mitochondrial DNA phylogeny placed kiwi as the closest relatives of the extinct Madagascan elephant birds [38]. Whether dispersal or vicariance best describes ratite distribution has been debated for over a century [61]. A phylogeny including 169 bird species, built on 32 kb from 19 independent loci, showed ostrich as basal in the Palaeognathae clade [62]. In contrast, our phylogeny, based on 623 1:1 orthologs in 16 species, totaling approximately 700 kb, places the tinamou as basal to Palaeognathae with 100 % bootstrap confidence (Fig. 1; Additional file 1: Figure S6). However, when the phylogeny was constructed for 10 bird species using just UCNEs (totaling >1 Mb) the topology of the tree matches that obtained from fewer loci from a larger number of species, which agrees with a previous publication [62] (Additional file 1: Figure S7). Including more ratites and a larger number of (hand-curated) loci should provide better resolution of the tree topology, and indeed the topology we obtain here is well-supported. However, we note that the topology changes depending on the gene sets that are included (Additional file 1: Figs. S6 and S7) and that when using ultra-conserved sequences the phylogeny differs from that obtained from a larger, more representative set of genes. Hence, future availability of additional genomes and ortholog sets from multiple ratites will allow a better understanding of their origin.

Nevertheless, a previous study has estimated that kiwi diverged from the Madagascan elephant birds about 50 million years ago [38] (Additional file 1: Figure S8). This estimate post-dates the split of Madagascar and New Zealand from Gondwana, which took place around 100 and 80 million years ago, respectively, and implies that ratites must have dispersed by flight and also that kiwi arrived on New Zealand less than 50 million years ago. This conclusion is supported by the fossil record in New Zealand, which includes a flighted kiwi ancestor [63]. At the time kiwi arrived, moa already inhabited New Zealand and it has been hypothesized that moa were monopolizing the diurnal ground niche, which forced kiwi to adapt to an alternative nocturnal lifestyle [38]. This would suggest that kiwi adapted to the nocturnal niche soon after arriving on the island. The loss of function that we observe in OPN1SW is indicative of adaptation to nocturnality [64]. We dated the loss of function in several color vision opsins to 30–38 million years ago, which is consistent with the arrival of the kiwi in New Zealand less than 50 million years ago, and their subsequent adaptation to a nocturnal niche.

In contrast to birds, which almost certainly have a diurnal origin, the nocturnal bottleneck hypothesis suggests that mammals were nocturnal for about 160 million years in their evolution as they were restricted to nighttime activity to avoid dinosaurs, which were the dominant diurnal taxon at this time [4]. According to this hypothesis, several traits typical for mammals, including a well-developed sense of smell, limited color vision, increased eye size, and an energetic metabolism optimized for sun radiation-independent body temperature regulation, have been shaped by the nocturnal environment [65, 66]. Nocturnally adapted Mesozoic mammals also tended to have a small body size, an insectivorous diet, and low energy metabolism [67]. Interestingly, kiwi has the smallest body size among flightless ratites, the lowest metabolic rate among birds [68, 69], and an insectivorous diet, suggesting a pattern of evolution that is similar to the evolution of mammals under nocturnality. Consistent with this hypothesis, our genome-wide scans for patterns of positive selection showed enrichment in GO categories like mitochondrion functions and energy reserve metabolic process (Additional file 1: Table S8A), both related to metabolic rate. Moreover, we found strong evidence for a loss of color vision in kiwi and their retinal structure also clearly supports adaptation to vision under low light levels [3]. Although the small eye size of kiwi [27] is unusual for a nocturnal species, based on the retinal anatomy Corfield et al. rejected a regressive evolution model for kiwi vision and suggested that kiwi have an acuity in detecting low light levels similar to other nocturnal species [3]. This suggests that molecular mutations and retinal structure changed faster than eye size. In birds, eye size was described to scale to body mass with an exponent similar to brain mass and metabolic rate [70]. Thus, the low metabolic rate of kiwi [68] could be the constraint for their relatively small eyes. Alternatively, kiwi might serve as an example that adaptations in the retinal structure could be sufficient, and changes in eye size are not absolutely necessary. This conclusion may be supported by the absence of variation in eye shape according to activity pattern observed in lizards and non-primate mammals [71].

It has long been hypothesized that unlike most bird species kiwi is more similar to mammals in their reliance on olfactory and mechanical cues for foraging, perceived by the nostrils and mechanoreceptors located at the end of its bill [72]. We found that the kiwi, unlike other ratites, has an increased diversity in the bird-specific γ-c clade ORs. Since OR diversity is hypothesized to correlate positively with olfactory acuity in vertebrates [42, 73], the significantly higher diversity in kiwi ORs compared to other birds (Additional file 1: Figure S9) suggests that kiwi may be able to distinguish a larger range of odors than other birds.

Steiger et al. formulated two possible scenarios that could explain γ ORs evolution in birds: the first hypothesizes that species-specific γ ORs arose from independent


expansion events in each species, while the second assumes that the ancient γ OR clade was more diverse and became homogenized by concerted evolution within species [45]. Some γ ORs of kiwi, ostrich, tinamou, and nocturnal birds clustered with their reptilian counterparts, while others clustered basal to the clade containing most bird γ ORs (Fig. 3). This supports a two-fold conclusion: (1) γ ORs in kiwi are more diverse in sequence than in other birds investigated, which was verified by the significantly higher sequence entropy; and (2) since kiwi is basal to the Neognathae (Fig. 1), the ancestral state of γ OR clade is probably diversified compared to other modern birds.

Conclusions

Since its arrival in New Zealand sometime after 50 million years ago, the kiwi adapted to a nocturnal, ground-dwelling niche. The onset of adaptation to nocturnality appears to have been approximately 30–38 million years ago, about one-fifth of the time proposed for the evolution of mammals in a nocturnal environment. The molecular changes present in the kiwi genome are in accordance with the adaptations that are hypothesized to have occurred during early mammalian adaptation to nocturnality. This suggests similar patterns of adaptation to the nocturnal niche both in kiwi and mammals. Further comparative analyses, including other diurnal Palaeognathae, as well as additional nocturnal bird groups and their diurnal sister species, should shed further light on the genomic imprints of adaptation to a nocturnal life style.

Methods and materials

Genome sequence assembly and annotation

We sequenced Apteryx mantelli female individuals, which originate from the far North (kiwi code 73) and central part – Lake Waikaremoana (kiwi code AT5 and kiwi code 16–12) of North Island (Additional file 1: Figure S10). They were sampled in 1986 (kiwi code 73) and 1997 (kiwi code AT5 and 16–12) in ‘operation nest egg’ carried out by Rainbow and Fairy Springs, Rotorua. No animals were killed or captured as a result of this study and genome assembly was performed with iwi approval from the Te Parawhau and Waikaremoana Māori Elders Trust.

We extracted genomic DNA from Apteryx mantelli embryos. Libraries with insert sizes of 240 bp, 420 bp, 800 bp, 2 kb, 3 kb, and 4 kb were obtained from individual kiwi code 73, and mate-paired-end libraries for 7 kb, 9 kb, 11 kb, and 13 kb, from individual kiwi code 16–12. DNA from individual AT5 was used to build a 350 bp insert-size library with the purpose of confirming kiwi-specific sequence polymorphisms and was not included in the genome assembly (Additional file 1: Note: Sampling, DNA library preparation and sequencing; Additional file 1: Table S1). Paired-end sequencing was performed on HiScanSQ and HiSeq platforms with read lengths of 101 bp and 96 bp, respectively.

Sequencing errors were corrected using Quake [5] (Additional file 1: Note: Filtering and read correction; Additional file 1: Figure S1). A total of 52.53 Gb of high-quality sequence was used for de novo assembly with SOAPdenovo [6]. The short-insert-size libraries (240 bp, 420 bp, 800 bp) were used to build contigs. Based on paired-end information scaffolds were generated using all libraries (2 kb, 3 kb, 4 kb, 7 kb, 9 kb, 11 kb, 13 kb). Remaining gaps in the scaffolds were closed using the paired-end information (Additional file 1: Note: Genome assembly). This final assembly (AptMant0) was used for all subsequent analyses.

Gene annotation was performed with the MAKER pipeline [10], using several sources of evidence: de novo gene predictions, RNA-Seq data, and protein evidence from three species (G. gallus, T. guttata, and M. gallopavo) (Ensembl version 72). Briefly, after repeat masking, gene models were predicted by Augustus version 2.7 [74] using the training dataset for chicken. Apteryx mantelli RNA-Seq data were then aligned to AptMant0 using NCBI BLASTN version 2.2.27+ [75] and BLASTX was used to align protein sequences to identify regions of homology. Finally, using both the ab initio and evidence-informed gene predictions, Maker updated features such as 5’ and 3’ UTRs based on RNA-Seq evidence and a consensus gene set was retrieved (Additional file 1: Note: De novo gene prediction and gene annotation).

Comparative genome analysis

Triplet orthologs between chicken, zebra finch, and turkey were downloaded from Ensembl 73. Kiwi genes were considered orthologs to a triplet if the ortholog assignment from Maker agreed with the orthologous gene assigned in each of the three considered species. The ostrich, tinamou, chuck-will’s-widow, and barn owl orthologs were assigned by orthology to the chicken proteins. After assigning orthology in the eight avian species, coding sequences were aligned and two different sets of alignments were compiled for further analysis:

Set 1: alignments of all eight species that do not contain a single frameshift indel.

Set 2: the longest uninterrupted run of at least 200 aligned bases in each multiple sequence alignment, for which we first ensured that gaps in the alignment were not introduced by unresolved bases in our assembly.

The CODEML program from the package PAML [24] was run first on four avian lineages: G. gallus, T. guttata, M. gallopavo, and A. mantelli to compare the kiwi genome to high-quality annotated ones. Six pairwise


combinations were run to obtain estimates of non-synonymous (Ka) and synonymous (Ks) changes in the four avian lineages. Ka and Ks distributions were compared pairwise between all four avian species on a set of 3,754 orthologous genes which presented no frameshifts or indels (Additional file 1: Figure S11).

We next scanned for differently evolving genes with the CODEML program under a branch model (model = 2, two ωs for foreground and background branches, respectively, vs. model = 0, one ω for all branches, compared via likelihood ratio test) [24] using the set of orthologs as defined above in the eight bird species (Additional file 1: Note: Orthologs and Ka/Ks calculation).

Branch specific ω values were used to identify GO categories that are evolving significantly differently on each of the following bird species: kiwi, ostrich, tinamou, barn owl, and chuck-will’s-widow. GO categories enrichment was tested using the FUNC [76] package. A hypergeometric test was run for each species separately on genes having a significantly higher ω. Multiple testing correction was done using family-wise error rate. Categories with P value <0.05 were considered for further analysis if at least three significantly changed genes were present in the GO category, and the number of significant genes was greater than or equal to 5 % of the total genes annotated in the respective GO category. The same test was applied on genes with a significantly smaller ω in each of the species. Kiwi-specific categories were considered those which showed no enrichment in any of the other ratites or night birds (Additional file 1: Note: Gene Ontology and rapidly evolving genes).

We used the TreeFam methodology to define gene families [12] across 16 genomes: Gallus gallus, Anas platyrhynchos, Ficedula albicollis, Meleagris gallopavo, Taeniopygia guttata, Pelodiscus sinensis, Anolis carolinensis, Homo sapiens, Mus musculus, Gasterosteus aculeatus, Ornithorhynchus anatinus, downloaded from Ensembl 73 [14], Tinamus guttatus, Struthio camelus, Antrostomus carolinensis, Tyto alba, downloaded from GigaDB [13], and Apteryx mantelli. The longest transcript was chosen for further analysis. For the single-copy orthologous families, genes were aligned against each other. To build a consensus phylogenetic tree (Fig. 1) the resulting alignments were loaded in PAUP* [15] version 4.0d105 and trees were inferred using maximum likelihood, with default parameters. To measure the confidence for certain subtrees, a series of 100 bootstrap replicates were performed (Additional file 1: Note: Nuclear loci phylogeny).

We determined the branch-specific expansion and contraction of the orthologous protein families among the 16 species using CAFE (computational analysis of gene family evolution) version 3.0 [77] with lambda option of 0.0007 (Additional file 1: Note: Gene families evolution using CAFE). Pfam IDs corresponding to the TreeFam families were assigned to GO categories. We tested whether significant (P <0.05) contraction/expansion events cluster in different GO categories using ClueGO with a hypergeometric test [78] (Additional file 1: Figure S2).

Membrane proteome annotation

Complete protein sequence sets for the following bird and reptile species were downloaded from Ensembl 74 [14]: Taeniopygia guttata, Meleagris gallopavo, Ficedula albicollis, Anas platyrhynchos, Pelodiscus sinensis, Gallus gallus, and Anolis carolinensis. Homo sapiens from the same Ensembl version was used as outgroup. Protein sequences of ratites (Tinamus guttatus, Struthio camelus) and nocturnal birds (Antrostomus carolinensis, Tyto alba) were downloaded from GigaDB [13]; although these genomes are more fragmented than the ones from Ensembl, annotation of the membrane proteome in birds adapted, like kiwi, to the nocturnal niche and the ones belonging to the same clade as kiwi, makes it possible to differentiate between events that are clade-specific or shaped by nocturnality. Only the longest protein sequence for each gene was considered for analysis. Membrane proteins and signal peptides were predicted for all species with Phobius [79]. These proteins were classified based on a manually curated human membrane proteome dataset, which describes family relationship and molecular function. The predicted membrane proteins were aligned to the human membrane proteome dataset with the BLASTP program of the BLAST package using default settings (v. 2.2.27+) [75]. Each predicted membrane protein was classified according to its best human hit with an e-value <10⁻⁶. Predicted membrane proteins with no hit were deemed unclassified, along with those proteins that hit an unclassified human protein (Additional file 1: Note: Detection and classification of the membrane proteome; Additional file 1: Table S7).

Vision evolutionary analysis

Opsins are G protein-coupled receptors known to play a role in light signal transduction and night-day cycle (Table 2). For these genes ω was estimated by appointing sequentially kiwi, ostrich, tinamou, chuck-will’s-widow, and barn owl as the foreground branch under the CODEML branch model (model = 2) [24] as described for comparative genome analysis. Inactivating mutations were verified by checking that they were present in reads from both sequenced individuals and in other kiwi species, by Sanger sequencing (OPN1MW) (Fig. 2; Additional file 1: Note: Vision analysis).
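
The branch-model comparison used above (model = 2 with separate foreground and background ω, against model = 0 with a single ω) is a likelihood ratio test between nested models that differ by one free parameter. A minimal sketch of that test in Python follows, assuming the two log-likelihoods have already been read from the CODEML output; the function name and the numerical values are hypothetical and serve only as illustration.

```python
import math


def lrt_one_df(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by one parameter.

    lnl_null: log-likelihood of the one-ratio model (model = 0).
    lnl_alt:  log-likelihood of the two-ratio branch model (model = 2).
    Returns (test statistic, p-value) using a chi-square with 1 degree of freedom.
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0.0:
        # The alternative model fits no better than the null model.
        return 0.0, 1.0
    # Chi-square survival function for 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods, not values from the study.
stat, p = lrt_one_df(-1000.0, -998.0792)
print(f"2*delta_lnL = {stat:.4f}, p = {p:.4f}")
```

With these illustrative inputs the statistic is 3.8416, which sits at the conventional 5 % threshold for one degree of freedom; a gene would be flagged as differently evolving on the foreground branch only when the statistic exceeds that value.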


Olfaction evolutionary analysis

Olfactory receptors (ORs) in kiwi were annotated using both the Augustus de novo gene prediction and the Maker information after scaffold positions were checked and redundant sequences were removed.

We then performed four steps (Additional file 1: Figure S12):

i. Functional ORs from chicken [45] were downloaded and aligned against the kiwi transcriptome using TblastN with default parameters. After collecting overall hits for each query (every chicken OR served as query), identical (same) hits from each run were removed to obtain a non-redundant dataset.
ii. A Pfam search against the kiwi proteome with a default e-value cutoff of 1.0 was used to identify sequences that contained the 7tm_4 domain (olfactory domain).
iii. The 7tm_4 domain was searched against the kiwi proteome by a CDD search (conserved domain database search).
iv. Separate HMM profiles were built from conserved 7tm regions of functional ORs of chicken, turkey, and zebra finch obtained from previous studies [45]. Using the three HMM profiles, HMM searches were performed against the kiwi proteome and non-redundant hits were retrieved from combined results of all three searches.

A CD-HIT (Cluster Database at High Identity with Tolerance) was performed to remove identical sequences with a cutoff of 100 %. Preliminary phylogenetic analysis was performed using a maximum likelihood approach (Additional file 1: Note: Olfactory receptor genes identification and annotation). Non-ORs were removed if they clustered separately from ORs. We excluded pseudogene candidates if at least one premature stop codon and/or frameshifts could be identified in the kiwi sequence.

OR repertoire estimates were curated based on genomic coverage calculated using samtools mpileup version 0.1.18 [80] on the alignment of the 240 bp, 420 bp, 800 bp insert-size libraries to AptMant0 (Additional file 1: Note: Olfactory receptor genes identification and annotation). The correction factor for each annotated OR was obtained by dividing the read coverage in that region by the GC-content corresponding average coverage over the entire genome. For example, if an OR sequence had a GC content of 50 %, we calculated the average genome-wide coverage corresponding to the GC bin of 50 % to be 35-fold (Additional file 1: Note: Genome coverage and estimation of genome size; Additional file 1: Figure S13). Given a coverage in the respective OR region of 105-fold, we obtained a correction factor of 3 after dividing the OR sequence coverage (that is, 105-fold) by the GC-bin corresponding coverage (that is, 35-fold). The final number of estimated ORs was obtained by multiplying the number of initially annotated genes with their corresponding correction factors.

Using the same annotation procedure, the OR gene repertoire was estimated in all bird and reptile genomes from Ensembl 74, two nocturnal birds (chuck-will’s-widow and barn owl) and two Palaeognathae (ostrich and tinamou) for comparative phylogenetic analysis with the kiwi OR dataset. All obtained OR genes were then aligned using MAFFT [81] v7, with BLOSUM62 as the scoring matrix and default settings of option E-INS-I. Phylogenetic analyses were run using both maximum likelihood (ML) and neighbor joining (NJ) methods (Additional file 1: Note: Comparative phylogenetic analysis on ORs from kiwi and other bird and reptile genomes). The reliability of the phylogenetic trees was evaluated with 500 bootstrap replicates.

We calculated Shannon entropy (H) using within species multiple sequence alignments of γ ORs for all birds and reptiles genomes separately with a built-in function from BioEdit [82] (Additional file 1: Note: γ-c clade OR within-species protein sequence entropy).

Kiwi morphology

Previously characterized wing development genes [53] were assigned orthologs in kiwi, chicken, zebra finch, and turkey (Additional file 1: Figure S3; Additional file 1: Table S12). We aligned the sequences and multiple alignments were translated and manually inspected for sequence differences as well as insertions/deletions and rearrangements. We examined selective pressures under the branch models implemented in CODEML [24]. The one-ratio model (model = 0, NSsites = 0) was used to estimate the same ω ratio for all branches in the phylogeny. Then, the two-ratio model (model = 2, NSsites = 0), with a background ω ratio and a different ω on the kiwi branch, was used to detect selective pressure acting specifically on the kiwi branch. These two models were compared via a LRT (1 degree of freedom), as mentioned above [83].

Scaffolds and isolated contigs harboring (putative) HOX genes were identified by BLAST and mapped to all 673 sauropsid HOX protein sequences from GenBank. Translated HOX sequences of Apteryx were aligned to the HOX proteins extracted from GenBank and differences were identified by manual inspection. Potential regulatory sequences in the HOX cluster region were identified by phylogenetic footprinting using tracker2 [84] (Additional file 1: Figure S4).

To retrieve the entire coding region of the FIBIN gene in kiwi, we designed primers based on the chicken and ostrich sequence (Additional file 1: Table S14). Using the 276-bp fragment amplified by Sanger sequencing, we blasted transcriptome sequences from kiwi and iteratively assembled the entire coding sequence. Since FIBIN showed signs of positive selection in the preliminary analysis as described above, extended selection analysis was performed using 15 species: human, mouse, bat, whale, dolphin, turtle, lizard, python, flycatcher, chicken, zebra finch, frog, zebrafish, and pufferfish (Additional file 1: Note: Fibin identification and selection analysis; Additional file 1: Figure S5). The branch-site tests were used to detect signals of selective pressure on each branch (NSsites = 2, model = 2, compared to the same model but with omega fixed to 1, via LRT). Amino acid changes with signs of selection and specific for the kiwi were visualized in both sequenced individuals.

Chicken UCNEs annotations were downloaded from the ultra-conserved non-coding element UCNEbase [55]. Orthologous regions in Apteryx mantelli and Struthio camelus, Tinamus guttatus, Tyto alba, Antrostomus carolinensis genomes, downloaded from GigaDB [13], and birds from Ensembl 74 [14] Ficedula albicollis, Taeniopygia guttata, Anas platyrhynchos, and Meleagris gallopavo were established using Blast 2.2.25 [85] with ‘blastn’ and default parameters. Gallus gallus genome Ensembl 74 was used as control in the orthology assignment. Orthologous regions from each of the species were aligned [86] to the reference UCNE and the number of mismatches between the UCNE and the target genomes were determined (Additional file 1: Note: Ultra-conserved non-coding elements analysis).

Data availability

Assembly, raw DNA, and RNA sequencing reads have been deposited in the European Nucleotide Archive under the BioProject with accession number: PRJEB6383.
HOX Cluster annotation files were deposited on [87] and [88].
UCNEs multiple fasta files and analysis have been deposited on [89].
The kiwi FIBIN sequence was deposited in GenBank under BankIt 1821198 FIBIN KR364000.

Additional file

Additional file 1: Supplementary Material contains Supplementary Figs. S1–S15, Supplementary Tables S1–S17, Supplementary Note, and Supplementary References.

Abbreviations

bp: base pair; CDD: Conserved domain database; CD-HIT: Cluster database at high identity with tolerance; Gb: Giga base pairs; GO: Gene ontology; GPCR: G protein-coupled receptor; H: Shannon entropy; HMM: Hidden Markov model; kb: kilo base pairs; LRT: Likelihood ratio test; Mb: Mega base pairs; ML: Maximum likelihood; NJ: Neighbor joining; OR: Olfactory receptor; PCR: Polymerase chain reaction; TM: Transmembrane; UCNE: Ultra-conserved non-coding element.

Competing interests

The authors declare no competing financial interests.

Authors’ contributions

DLD, LH, and TS performed the experiments. DLD, GR, KP, MO, AK, MSA, HBS, SJP, PFS, and BDB analyzed the data. DLD, MH, JK, and TS designed the study and wrote the paper with contributions from all authors. DL provided biological samples. All authors read and approved the final manuscript.

Acknowledgments

This work was supported by grants of the Deutsche Forschungsgemeinschaft and intramural support (Medical Faculty, University of Leipzig), as well as the Australian Research Council, the Swedish Research Council, NSERC (postgraduate fellowship to GR), and the Max Planck Society. BDB was funded by grant no. 2011/12500-2, São Paulo Research Foundation (FAPESP). This research was endorsed by Māori Elders from the Te Parawhau Trust and from Waikaremoana iwi. We are very thankful for technical and methodical support provided by Knut Finstermeier, Anne Butthof, Knut Krohn, Michael Dannemann, Udo Stenzel, Mathias Stiller, and Rigo Schulz. We thank Andreas Reichenbach for helpful discussions on kiwi vision and Petra Korlević for the drawings in Fig. 1 and Additional file 1: Figure S10.

Author details

1. Institute of Biochemistry, Medical Faculty, University of Leipzig, Johannisallee 30, Leipzig 04103, Germany.
2. Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig 04103, Germany.
3. Department of Neuroscience, Unit of Functional Pharmacology, Uppsala University, Box 593, Husargatan 3, Uppsala 751 24, Sweden.
4. Griffith School of Environment and School of Biomolecular and Physical Sciences, Griffith University, Nathan, Queensland 4111, Australia.
5. Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig 04103, Germany.
6. Department of Genetics and Evolutionary Biology, University of São Paulo, São Paulo, SP 05508-090, Brazil.
7. Adaptive Evolutionary Genomics, Institute for Biochemistry and Biology, University Potsdam, Potsdam 14469, Germany.

Received: 13 February 2015 Accepted: 1 July 2015

References

1. Bunce M, Worthy TH, Phillips MJ, Holdaway RN, Willerslev E, Haile J, et al. The evolutionary history of the extinct ratite moa and New Zealand Neogene paleogeography. Proc Natl Acad Sci U S A. 2009;106:20646–51.
2. Martin GR. Sensory capacities and the nocturnal habit of owls (Strigiformes). IBIS. 1986;128:266–77.
3. Corfield JR, Parsons S, Harimoto Y, Acosta ML. Retinal anatomy of the New Zealand kiwi: structural traits consistent with their nocturnal behavior. Anat Rec (Hoboken). 2015;298:771–9.
4. Gerkema MP, Davies WI, Foster RG, Menaker M, Hut RA. The nocturnal bottleneck and the evolution of activity patterns in mammals. Proc Biol Sci. 2013;280:20130508.
5. Kelley DR, Schatz MC, Salzberg SL. Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 2010;11:R116.
6. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012;1:18.
7. Zhang G, Li C, Li Q, Li B, Larkin DM, Lee C, et al. Comparative genomics reveals insights into avian genome evolution and adaptation. Science. 2014;346:1311–20.
8. International Chicken Genome Sequencing C. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature. 2004;432:695–716.
9. Warren WC, Clayton DF, Ellegren H, Arnold AP, Hillier LW, Kunstner A, et al. The genome of a songbird. Nature. 2010;464:757–62.
10. Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008;18:188–96.
11. Kondrashov FA. Gene duplication as a mechanism of genomic adaptation to a changing environment. Proc Biol Sci. 2012;279:5048–57.
12. Li H, Coghlan A, Ruan J, Coin LJ, Heriche JK, Osmotherly L, et al. TreeFam: a curated database of phylogenetic trees of gene families. Nucleic Acids Res. 2006;34:D572–80.


13. Sneddon TP, Zhe XS, Edmunds SC, Li P, Goodman L, Hunter CI. GigaDB: promoting data dissemination and reproducibility. Database (Oxford). 2014;2014:bau018.
14. Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, et al. Ensembl 2013. Nucleic Acids Res. 2013;41:D48–55.
15. Wilgenbusch JC, Swofford D. Inferring evolutionary trees with PAUP*. Curr Protoc Bioinformatics. 2003;Chapter 6:Unit 6 4.
16. Hughes AL, Friedman R. Genome size reduction in the chicken has involved massive loss of ancestral protein-coding genes. Mol Biol Evol. 2008;25:2681–8.
17. Zhan X, Pan S, Wang J, Dixon A, He J, Muller MG, et al. Peregrine and saker falcon genome sequences provide insights into evolution of a predatory lifestyle. Nat Genet. 2013;45:563–6.
18. Huang Y, Li Y, Burt DW, Chen H, Zhang Y, Qian W, et al. The duck genome and transcriptome provide insight into an avian influenza virus reservoir species. Nat Genet. 2013;45:776–83.
19. Denton JF, Lugo-Martinez J, Tucker AE, Schrider DR, Warren WC, Hahn MW. Extensive error in the number of genes inferred from draft genome assemblies. PLoS Comput Biol. 2014;10, e1003998.
20. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–9.
21. Zakon HH, Jost MC, Lu Y. Expansion of voltage-dependent Na+ channel gene family in early tetrapods coincided with the emergence of terrestriality and increased brain complexity. Mol Biol Evol. 2011;28:1415–24.
22. Luxey M, Jungas T, Laussu J, Audouard C, Garces A, Davy A. Eph:ephrin-B1 forward signaling controls fasciculation of sensory and motor axons. Dev Biol. 2013;383:264–74.
23. Patel K, Nittenberg R, D’Souza D, Irving C, Burt D, Wilkinson DG, et al. Expression and regulation of Cek-8, a cell to cell signalling receptor in developing chick limb buds. Development. 1996;122:1147–55.
24. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24:1586–91.
25. Pavlidis P, Jensen JD, Stephan W, Stamatakis A. A critical assessment of storytelling: gene ontology categories and the importance of validating genomic scans. Mol Biol Evol. 2012;29:3237–48.
26. Torii M, Kojima D, Okano T, Nakamura A, Terakita A, Shichida Y, et al. Two isoforms of chicken melanopsins show blue light sensitivity. FEBS Lett. 2007;581:5327–31.
27. Martin GR, Wilson KJ, Martin Wild J, Parsons S, Fabiana Kubke M, Corfield J. Kiwi forego vision in the guidance of their nocturnal activities. PLoS One. 2007;2, e198.
28. Osorio D, Vorobyev M. A review of the evolution of animal colour vision and visual communication signals. Vision research. 2008;48:2042–51.
29. Beukers MW, Kristiansen I, IJzerman AP, Edvardsen I. TinyGRAP database: a bioinformatics tool to mine G-protein-coupled receptor mutant data. Trends Pharmacol Sci. 1999;20:475–7.
30. Jansen JJ, Mulder WR, De Caluwe GL, Vlak JM, De Grip WJ. In vitro expression of bovine opsin using recombinant baculovirus: the role of glutamic acid (134) in opsin biosynthesis and . Biochim Biophys Acta. 1991;1089:68–76.
31. Capra V, Veltri A, Foglia C, Crimaldi L, Habib A, Parenti M, et al. Mutational analysis of the highly conserved ERY motif of the thromboxane A2 receptor: alternative role in G protein-coupled receptor signaling. Mol Pharmacol. 2004;66:880–9.
32. Schulz A, Schoneberg T, Paschke R, Schultz G, Gudermann T. Role of the third intracellular loop for the activation of gonadotropin receptors. Mol Endocrinol. 1999;13:181–90.
33. Vogel R, Mahalingam M, Ludeke S, Huber T, Siebert F, Sakmar TP. Functional role of the “ionic lock”–an interhelical hydrogen-bond network in family A heptahelical receptors. J Mol Biol. 2008;380:648–55.
34. Ebrey T, Koutalos Y. Vertebrate photoreceptors. Prog Retin Eye Res. 2001;20:49–94.
35. Schoneberg T, Schulz A, Biebermann H, Hermsdorf T, Rompler H, Sangkuhl K. Mutant G-protein-coupled receptors as a cause of human diseases. Pharmacol Ther. 2004;104:173–206.
36. Tao YX. Inactivating mutations of G protein-coupled receptors and diseases: structure-function insights and therapeutic implications. Pharmacol Ther. 2006;111:949–73.
37. Vassart G, Costagliola S. G protein-coupled receptors: mutations and endocrine diseases. Nat Rev Endocrinol. 2011;7:362–72.
38. Mitchell KJ, Llamas B, Soubrier J, Rawlence NJ, Worthy TH, Wood J, et al. Ancient DNA reveals elephant birds and kiwi are sister taxa and clarifies ratite bird evolution. Science. 2014;344:898–900.
39. Corfield JR, Eisthen HL, Iwaniuk AN, Parsons S. Anatomical specializations for enhanced olfactory sensitivity in kiwi, Apteryx mantelli. Brain Behav Evol. 2014;84:214–26.
40. Niimura Y, Nei M. Extensive gains and losses of olfactory receptor genes in mammalian evolution. PLoS One. 2007;2, e708.
41. Hasin-Brumshtein Y, Lancet D, Olender T. Human olfaction: from genomic variation to phenotypic diversity. Trends Genet. 2009;25:178–84.
42. Steiger SS, Fidler AE, Kempenaers B. Evidence for increased olfactory receptor gene repertoire size in two nocturnal bird species with well-developed olfactory ability. BMC Evol Biol. 2009;9:117.
43. Preston GM. Cloning gene family members using PCR with degenerate oligonucleotide primers. In: White BA (ed.) PCR cloning protocols: from molecular cloning to genetic engineering; In series: Methods in molecular biology (Clifton, N.J.) 67; Humana Press: 1997 pg 433-49. ISBN 0896034436
44. Liu S, Wei W, Chu Y, Zhang L, Shen J, An C. De novo transcriptome analysis of wing development-related signaling pathways in Locusta migratoria manilensis and Ostrinia furnacalis (Guenee). PLoS One. 2014;9, e106770.
45. Steiger SS, Kuryshev VY, Stensmyr MC, Kempenaers B, Mueller JC. A comparison of reptilian and avian olfactory receptor gene repertoires: species-specific expansion of group gamma genes in birds. BMC Genomics. 2009;10:446.
46. Morrison SS, Pyzh R, Jeon MS, Amaro C, Roig FJ, Baker-Austin C, et al. Impact of analytic provenance in genome analysis. BMC Genomics. 2014;15:S1.
47. Margulies DH, Natarajan K, Rossjohn J, McCluskey J. Fundamental Immunology. 7th ed. Philadelphia, PA: Wolters Kluwer Health/Lippincott Williams & Wilkins; 2012. p. 511.
48. Shannon CE. The mathematical theory of communication. Bell System Tech J. 1948;27:379–423, 623–56.
49. Litwin S, Jores R. Shannon information as a measure of amino acid diversity. In: Perelson AS, Weisbuch G, editors. Theoretical and experimental insights into immunology, vol. 66. NATO ASI Series. Berlin: Springer Berlin Heidelberg; 1992. p. 279–87.
50. McNab BK. Resource use and the survival of land and freshwater vertebrates on oceanic islands. American Naturalist. 1994;144:643–60.
51. Cooper A, Cooper RA. The Oligocene bottleneck and New Zealand biota: genetic record of a past environmental crisis. Proc Biol Sci. 1995;261:293–302.
52. Grzimek B, Schlager N, Olendorf D, McDade MC. Grzimek’s animal life encyclopedia. Gale: Gale, MI; 2004.
53. Tanaka M. Molecular and evolutionary basis of limb field specification and limb initiation. Dev Growth Differ. 2013;55:149–63.
54. Pascual-Anaya J, D’Aniello S, Kuratani S, Garcia-Fernandez J. Evolution of Hox gene clusters in deuterostomes. BMC Dev Biol. 2013;13:26.
55. Dimitrieva S, Bucher P. UCNEbase–a database of ultraconserved non-coding elements and genomic regulatory blocks. Nucleic Acids Res. 2013;41:D101–9.
56. Woolfe A, Elgar G. Organization of conserved elements near key developmental regulators in vertebrate genomes. Adv Genet. 2008;61:307–38.
57. Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, et al. In vivo enhancer analysis of human conserved non-coding sequences. Nature. 2006;444:499–502.
58. Bell SM, Schreiner CM, Waclaw RR, Campbell K, Potter SS, Scott WJ. Sp8 is crucial for limb outgrowth and neuropore closure. Proc Natl Acad Sci U S A. 2003;100:12195–200.
59. Gestri G, Osborne RJ, Wyatt AW, Gerrelli D, Gribble S, Stewart H, et al. Reduced TFAP2A function causes variable optic fissure closure and retinal defects and sensitizes eye development to mutations in other morphogenetic regulators. Hum Genet. 2009;126:791–803.
60. Reid B, Williams GR. The kiwi. In: Kuschel G, editor. Biogeography and Ecology in New Zealand, vol. 27. The Hague: Springer Netherlands; 1975. p. 301–30.
61. van Tuinen M, Sibley CG, Hedges SB. Phylogeny and biogeography of ratite birds inferred from DNA sequences of the mitochondrial ribosomal genes. Mol Biol Evol. 1998;15:370–6.
62. Hackett SJ, Kimball RT, Reddy S, Bowie RC, Braun EL, Braun MJ, et al. A phylogenomic study of birds reveals their evolutionary history. Science. 2008;320:1763–8.


63. Worthy TH, Worthy JP, Tennyson AJD, Salisbury SW, Hand SJ, Scofield RP. Miocene fossils show that kiwi (Apteryx, Apterygidae) are probably not phyletic dwarves. In: Göhlich UB, Kroh A, editors. Proceedings of the 8th International Meeting Society of Avian Paleontology and Evolution. Vienna, 2012, Verlag des Naturhistorischen Museums in Wien, Vienna; 2013. p. 63–80.
64. Jacobs GH. Losses of functional opsin genes, short-wavelength cone photopigments, and color vision–a significant trend in the evolution of mammalian vision. Vis Neurosci. 2013;30:39–53.
65. Striedter GF. Principles of brain evolution. Sinauer Associates Inc., U.S. ISBN: 978-0-87893-820-9. 2004/2005
66. Walls GL. The vertebrate eye and its adaptive radiation. Oxford: Cranbook Institute of Science; 1942.
67. Crompton AW, Taylor CR, Jagger JA. Evolution of homeothermy in mammals. Nature. 1978;272:333–6.
68. McNab BK. Metabolism and temperature regulation of kiwis (Apterygidae). The Auk. 1996;113:687–92.
69. Sales J. The endangered kiwi: a review. Folia Zoologica Praha. 2005;54:1.
70. Brooke ML, Hanley S, Laughlin SB. The scaling of eye size with body mass in birds. Proc Biol Sci. 1999;266:405–12.
71. Hall MI, Kamilar JM, Kirk EC. Eye shape and the nocturnal bottleneck of mammals. Proc Biol Sci. 2012;279:4962–8.
72. Cunningham S, Castro I, Alley M. A new prey-detection mechanism for kiwi (Apteryx spp.) suggests convergent evolution between paleognathous and neognathous birds. J Anat. 2007;211:493–502.
73. Gilad Y, Przeworski M, Lancet D. Loss of olfactory receptor genes coincides with the acquisition of full trichromatic vision in primates. PLoS Biol. 2004;2, E5.
74. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 2006;34:W435–9.
75. Gertz EM, Yu YK, Agarwala R, Schaffer AA, Altschul SF. Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol. 2006;4:41.
76. Prüfer K, Muetzel B, Do HH, Weiss G, Khaitovich P, Rahm E, et al. FUNC: a package for detecting significant associations between gene sets and ontological annotations. BMC Bioinform. 2007;8:41.
77. De Bie T, Cristianini N, Demuth JP, Hahn MW. CAFE: a computational tool for the study of gene family evolution. Bioinformatics. 2006;22:1269–71.
78. Bindea G, Mlecnik B, Hackl H, Charoentong P, Tosolini M, Kirilovsky A, et al. ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics. 2009;25:1091–3.
79. Kall L, Krogh A, Sonnhammer EL. An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics. 2005;21:i251–7.
80. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–9.
81. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30:772–80.
82. Hall TA. BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser. 1999;41:95–8.
83. Yang Z. Computational Molecular Evolution. Oxford: Oxford University Press; 2006.
84. Prohaska SJ, Fried C, Flamm C, Wagner GP, Stadler PF. Surveying phylogenetic footprints in large gene clusters: applications to Hox cluster duplications. Mol Phylogenet Evol. 2004;31:581–604.
85. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
86. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–7.
87. Kiwi Genome. Available at: http://www.bioinf.uni-leipzig.de/~studla/KIWI-HOX/.
88. Kiwi Annotated HOX Cluster. Available at: https://bioinf.eva.mpg.de/KIWI-HOX/
89. Kiwi Annotated UCNEs. Available at: https://bioinf.eva.mpg.de/KIWI-UCNEs/
90. Ellegren H, Smeds L, Burri R, Olason PI, Backstrom N, Kawakami T, et al. The genomic landscape of species divergence in Ficedula flycatchers. Nature. 2012;491:756–60.
91. Dalloul RA, Long JA, Zimin AV, Aslam L, Beal K, Le Blomberg A, et al. Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis. PLoS Biol. 2010;8:1–21.

Appendix A.4.

Personal copy of the manuscript “Heterogeneity of dN/dS Ratios at the Classical HLA Class I Genes over Divergence Time and Across the Allelic Phylogeny”: Journal of Molecular Evolution (2015) 82(1): 38-50. The final version of the article (after editorial processing) may not be redistributed from this document, because it is not Open Access. I therefore provide the version accepted for publication, without the journal's formatting; the published version is available under DOI 10.1007/s00239-015-9713-9. This article is the result of my master's work, which was refined over the course of my doctorate. It shares elements with the article presented in Appendix A.2, since in both we investigate the units of selection at the HLA genes: here, allelic lineages; in the other article (A.2), supertypes. In this work I was supervised by Diogo Meyer, who conceived the original ideas of the project. We jointly developed the methods adopted throughout the project. I carried out all the analyses and wrote the manuscript together with DM. RSF and I performed the detection of recombinant sequences, and all co-authors took part in discussing and verifying the results.

277 bioRxiv preprint first posted online Aug. 22, 2014; doi: http://dx.doi.org/10.1101/008342. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.


Heterogeneity of dN/dS ratios at the classical HLA class I genes over divergence time and across the allelic phylogeny

Bárbara Domingues Bitarello, Rodrigo dos Santos Francisco, Diogo Meyer

Abstract The classical class I HLA loci of humans show an excess of nonsynonymous with respect to synonymous substitutions at codons of the antigen recognition site (ARS), a hallmark of adaptive evolution. Additionally, high polymorphism, linkage disequilibrium and disease associations suggest that one or more balancing selection regimes have acted upon these genes. However, several questions about these selective regimes remain open. First, it is unclear if stronger evidence for selection on deep timescales is due to changes in the intensity of selection over time or to a lack of power of most methods to detect selection on recent timescales. Another question concerns the functional entities which define the selected phenotype. While most analyses focus on selection acting on individual alleles, it is also plausible that phylogenetically defined groups of alleles ("lineages") are targets of selection. To address these questions we analyzed how dN/dS (ω) varies with respect to divergence times between alleles and phylogenetic placement (position of branches). We find that ω for ARS codons of class I HLA genes increases with divergence time and is higher for inter-lineage branches. Throughout our analyses, we used non-selected codons to control for possible effects of inflation of ω associated with intra-specific analysis, and showed that our results are not artifactual. Our findings indicate the importance of considering the timescale effect when analysing ω over a wide spectrum of divergences. Finally, our results support the divergent allele advantage model, whereby heterozygotes with more divergent alleles have higher fitness than those carrying similar alleles.

Keywords balancing selection, HLA, MHC, dN/dS, allelic lineages, antigen recognition site, divergent allele advantage

Address: Department of Genetics and Evolutionary Biology, University of São Paulo, Rua do Matão, 277, São Paulo. Tel.: +55(11)3091-8092 E-mail: [email protected]


1 Introduction

MHC class I and II classical molecules are cell-surface glycoproteins which mediate presentation of peptides to T-cell receptors, and play a key role in triggering adaptive immune responses when the bound peptide is recognized as foreign (Klein and Sato 2000). In humans, they are coded by HLA class I (HLA-A, -B, and -C) and II (HLA-DR, -DQ, and -DP) classical genes. The class I and class II HLA classical genes are the most polymorphic in the human genome (Meyer and Thomson 2001), and knowledge about their function in the immune response supports a role for balancing selection in driving the diversity patterns at these loci.

A number of findings suggest MHC genes have experienced balancing selection: unusually high level of heterozygosity with respect to neutral expectations (Hedrick and Thomson 1983); existence of trans-species polymorphisms (Takahata and Nei 1990); high levels of linkage disequilibrium (Huttley et al 1999); site frequency spectra with excess of common variants (Garrigan and Hedrick 2003); high levels of identity-by-descent compared to genomic averages (Albrechtsen et al 2010); positive correlation between HLA polymorphism and pathogen diversity (Prugnolle et al 2005), and significant associations of HLA alleles with the course of infectious diseases (e.g. Apps et al 2013). Information on the crystal structure of MHC molecules (Bjorkman et al 1987) allowed the identification of a specific set of amino acids that make up the antigen recognition site (ARS), which determines the peptides that the molecule is able to bind (Bjorkman et al 1987; Chelvanayagam 1996). The codons of the ARS were shown to have increased nonsynonymous substitution rates (Hughes and Nei 1988, 1989), consistent with the hypothesis that adaptive evolution at HLA loci is driven by peptide binding properties.
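As a rough illustration of what an elevated nonsynonymous rate means at the codon level, the following minimal Python sketch classifies single-nucleotide codon differences between two aligned coding sequences as synonymous or nonsynonymous, using the standard genetic code. The function name and the toy sequences are hypothetical and are not taken from this study. The raw counts are also not a dN/dS estimate: estimators such as those used by Hughes and Nei additionally normalize by the numbers of synonymous and nonsynonymous sites and correct for multiple substitutions.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts for codons differing at exactly one site.

    Codons differing at two or three positions are skipped, because classifying
    them requires assumptions about the order of the underlying mutations.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and of equal length")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff != 1:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# Toy example (hypothetical sequences, not HLA data).
allele_1 = "GCTTTTGAA"  # Ala Phe Glu
allele_2 = "GCCTTAGAT"  # Ala Leu Asp
print(count_codon_differences(allele_1, allele_2))
```

In this toy pair, one codon changes without altering the amino acid and two codons change the amino acid, which is the kind of excess of nonsynonymous change reported for ARS codons.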

Several models of selection are compatible with balancing selection at MHC genes. Heterozygote advantage assumes that heterozygotes have higher fitness values because they are able to mount an immune response to a greater array of pathogens, an idea originally proposed by Doherty and Zinkernagel (1975), who showed that mice that were heterozygous for the MHC had increased immunological surveillance. Heterozygote advantage has received support from experiments in semi-natural populations of mice (Penn et al 2002), which show increased resistance of heterozygotes to multiple-strain infection, and from the finding that, among humans infected with HIV, those who are heterozygous for HLA genes have slower progression to AIDS (reviewed in Dean et al 2002). Heterozygote advantage has also received support from substitution rate studies (Hughes and Nei 1988, 1989) as well as simulation-based studies (e.g. Takahata and Nei 1990). A second model for balancing selection at MHC genes is negative frequency dependent selection (or apostatic selection), according to which rare variants have a selective advantage over common ones, because pathogens are more likely to evade presentation by common molecules (Slade and McCallum 1992). Although both models are biologically compelling, decades of research have shown that most ways of summarizing genetic data are incapable of differentiating these two modes of selection (Hughes and Nei 1989; Meyer and Thomson 2001; Spurgin and Richardson 2010), and the functional insights for the action of heterozygote advantage at least partially explain why it is usually favored over negative frequency dependence (Richman 2000).
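How heterozygote advantage maintains polymorphism can be illustrated with the textbook one-locus, two-allele model. The sketch below is only an illustration under assumed parameters, not an analysis from this study: the three genotype fitnesses are taken to be 1 - s, 1 and 1 - t, where s and t are hypothetical selection coefficients against the two homozygotes, and the allele frequency is iterated under random mating. From any polymorphic starting frequency the population approaches the classical equilibrium t / (s + t).

```python
def next_frequency(p, s, t):
    """One generation of selection with fitnesses w11 = 1 - s, w12 = 1, w22 = 1 - t."""
    q = 1.0 - p
    mean_fitness = p * p * (1 - s) + 2 * p * q + q * q * (1 - t)
    # Frequency of allele 1 after selection: its share of the surviving gene pool.
    return (p * p * (1 - s) + p * q) / mean_fitness


def iterate(p0, s, t, generations=2000):
    """Iterate the recursion for a fixed number of generations and return the final frequency."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, s, t)
    return p


s, t = 0.10, 0.05       # hypothetical selection coefficients against the homozygotes
expected = t / (s + t)  # classical overdominance equilibrium
for start in (0.01, 0.5, 0.99):
    print(start, round(iterate(start, s, t), 4), round(expected, 4))
```

Because the equilibrium is stable, both alleles persist indefinitely in this deterministic model, whereas directional selection would drive one of them to fixation.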


A third model involves selective pressures that are heterogeneous over space and/or time, favoring different alleles in different temporal or geographic compartments, and thus resulting in an overall increase in diversity at MHC loci. This model has been shown to be capable of accounting for features of HLA variation (Hedrick 2002). Many studies have investigated this model by comparing the degree of population differentiation at MHC and putatively neutral loci, with the expectation being that selection that is geographically heterogeneous will result in increased differentiation at HLA genes. As reviewed in Spurgin and Richardson (2010), the results are mixed, and interpretation is hampered by differences in the mutational models underlying the evolution of HLA genes and of the loci used as neutral controls. Although the specific form of selection acting on MHC genes remains an open question, the fact that these genes have evolved in a non-neutral way and are under balancing selection is an undisputed finding, which is robust to complications introduced by demographic history (Harris and Meyer 2006; Hughes and Yeager 1998; Garrigan and Hedrick 2003).

While studies of MHC have convincingly documented a role for selection, certain questions remain unresolved

in the context of variation of the human MHC genes (termed HLA loci). The first of these concerns the

"timescale" of selection: while most tests for selection have provided strong evidence for selection at classical

HLA class I genes over deep timescales, there is comparatively less support for selection at recent timescales

(Garrigan and Hedrick 2003). It has proved difficult to tease apart the possibility that selection differs across

timescales from reduced statistical power of tests for recent selection, and thus the question of the timescale

of selection on HLA genes remains open.

The second question concerns the targets of selection, i.e., which biological entity is targeted by selection in

HLA class I genes: individual alleles or groups of similar alleles? Classical MHC genes have many alleles,

which can be hierarchically classified into groups of alleles which reflect the phylogenetic relatedness and

shared functional attributes of these alleles. Wakeland et al (1990) proposed a mechanism coined "divergent

allele advantage", which is a specific case of heterozygote advantage, according to which the fitness values

of heterozygotes are proportional to the degree of divergence between the alleles they carry. This model was

motivated by the observation that, in MHC class II murine genes, alleles from a given allelic lineage often differ

by only minor structural variations in the ARS, while alleles in different lineages have functionally different

ARS. The open question is whether individual alleles or allelic lineages are the main targets of selection for

HLA genes. Although intra-lineage nucleotide diversity exceeds genome-wide averages, inter-lineage diversity

is substantially higher than intra-lineage diversity (Takahata and Satta 1998). This raises the question of whether intra-lineage

variation is under a different mode and intensity of selection than the differences between lineages.

We address these questions by analysing the temporal and phylogenetic dynamics of dN/dS (or ω) for ARS

codons at the classical class I loci (HLA-A, -B and -C), using both pairwise and phylogenetic approaches.

These loci are all highly polymorphic and there is an abundance of data available for most exons of their coding

sequence, which makes our analyses of non-ARS codons (as a control) possible. Our pairwise comparisons of

alleles show that more divergent pairs show higher ω for ARS codons than closely related pairs of alleles.

The phylogenetic analyses support the hypothesis that selection is stronger for inter-lineage branches (i.e.,

those connecting clades from different lineages, as opposed to those within a single lineage), and for branches that are

internal to the phylogeny (when compared to terminal branches), provided that a bias toward overestimating

ω for recent divergence is taken into account (Rocha et al 2006). Although evidence for balancing selection

on the intra-lineage scale is weaker than on the inter-lineage scale, our findings show that there is statistical

support for deviation from a regime of neutrality for intra-lineage branches of the allelic tree. We conclude

that intra-lineage divergence has also evolved under a regime of balancing selection, and that inter-lineage

divergence bears an even stronger signature of selection.

2 Materials and Methods

2.1 Data

Alignments for HLA-A, HLA-B and HLA-C were obtained from the IMGT/HLA Database (Robinson et al

2013). All dN/dS estimates and related analyses were implemented in CODEML (PAML package, Yang 2007).

First codon position was considered to be the first codon of exon 2, as indicated by annotation on IMGT

alignments. Our initial data sets comprised complete coding sequences, i.e., exons 2-7 (for HLA-A and

HLA-C) and 2-6 (HLA-B). These data sets were used for the site models (SM) approach. For the pairwise and

branch model (BM) approaches, we used two datasets: one with 48 ARS codons (Chelvanayagam 1996) and

the other, referred to as "non-ARS", consisting of the remaining codons (Table 1).

In order to be able to use the methods available in CODEML we restricted our analysis to HLA alleles

which had complete coding sequences, no stop codons, were expressed on the cell surface and only differed with

respect to others by base changes (i.e., no insertions or deletions). Alleles with mutations putatively linked to

low or absent cell surface expression were also removed from the analyses. The non-ARS data sets were used for

estimation of dS, used in the pairwise approach as a proxy for allelic divergence, and as an internal control for ARS

analyses. For the branch models, further pruning of the phylogenetic trees was done, as described below.

2.2 Trees and intragenic recombination detection

Phylogenetic trees. Complete alignments, described above, were used to generate NJ trees for each gene (Saitou

and Nei 1987). The program NEIGHBOR, from the PHYLIP package (Felsenstein 1989) was used with the

F84 method, k (transition/transversion ratio) = 2 and empirical base frequencies for the distance matrices

obtained in DNADIST (Felsenstein 1989).

Recombination detection. Intragenic recombinants were detected by applying RDP3 (Martin et al 2010) to

the complete alignments, followed by manual inspection. The RDP3 program combines several non-parametric

recombination detection methods for sequence data, and we used six largely independent tests:

RDP, Chimaera, Maxchi, GENECONV, BootScan and SiScan (see Martin et al

2010 and references therein). Window size was adjusted to 100 for BootScan and SiScan, and to 15 for RDP.

The number of variable sites per window was adjusted to 35 and 30 for Maxchi and Chimaera, respectively.

281 bioRxiv preprint first posted online Aug. 22, 2014; doi: http://dx.doi.org/10.1101/008342. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.

5

These sizes were chosen based on a test alignment we provided to the software, in which parental and daughter

HLA-B sequences were known a priori. Based on this training set, we adjusted the parameters as described, and

for other parameters default values were used. Since these six tests are mostly independent, and have different

strengths, we considered a recombination event to be significant when p < 0.05 in at least 3 of the above

methods, which means we were somewhat conservative in the removal of recombinants from the datasets. "Trace

evidence" cases, i.e., those that bear a signal of recombination but are technically not statistically significant,

were kept in the data sets. Following this initial procedure, we visually inspected the filtered alignments for

the detection of additional recombinant sequences. This procedure generated two data sets for each locus, one

with recombinants and one without ("recombinant" (R), and "non-recombinant" (NR), respectively, Table 1).
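
The decision rule described above (a recombination event is accepted when at least 3 of the 6 tests report p < 0.05) can be sketched as follows. This is only an illustration of the voting criterion, not the RDP3 implementation; the function name, the dictionary layout, and the choice to treat a missing test as non-significant are assumptions made for the example.

```python
from typing import Mapping

# The six tests named in the text.
METHODS = ("RDP", "Chimaera", "Maxchi", "GENECONV", "BootScan", "SiScan")


def is_recombinant(p_values: Mapping[str, float],
                   alpha: float = 0.05,
                   min_methods: int = 3) -> bool:
    """Return True when at least `min_methods` tests report p < alpha.

    A test absent from `p_values` is treated as non-significant
    (an assumption of this sketch).
    """
    hits = sum(1 for method in METHODS if p_values.get(method, 1.0) < alpha)
    return hits >= min_methods
```

Under this rule, a sequence supported by only one or two tests corresponds to the "trace evidence" cases, which were kept in the data sets.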

Clade filter. For the branch models, we used t (expected number of nucleotide substitutions per codon) matrices

obtained in pairwise analyses of the non-recombinant non-ARS data sets as input for NEIGHBOR. The trees

were visualized for manual pruning and labeling in Mesquite (v2.75, http://mesquiteproject.org/). We imposed

that alleles from a given HLA lineage (as defined by the standard HLA nomenclature, which identifies lineage

membership by the first field of an allele’s name) had to group together in a clade, and alleles which did not

group in such manner were manually pruned from trees in order to fulfill this "clade membership criterion".

The effect of this filtering on inclusion of alleles is presented in Figure S1 in Online Resource 1. After

pruning of the trees, the corresponding pruned alleles were removed from the NR data sets and these reduced

data sets were used for the branch model analyses. Table 1 shows the number of alleles used for each analysis.

2.3 CODEML analyses

Branch models (BM). With the pruned data sets we compared branch models 0 (one ω for all branches) and

2 (two or more categories of branches with independent ω) from CODEML. We provided CODEML with a

topology based on the non-ARS pruned data set, using branch lengths as starting points for ML estimation

(fix_blength=1). For all CODEML analyses (BM, site models and pairwise), the Goldman and Yang (1994)

model was used for estimation of substitution rates. Other parameters defined in the control file were as

follows: option F3x4 for codon frequency estimation, κ = 2 and ω = 0.4 as initial values. Tables S14-S16

(Online Resource 1) show likelihood convergence for the branch models, assuming different initial parameter

values and codon frequency estimation methods. BM analyses were performed solely for the NR (and pruned) data sets (see

Tables 2 and 3). Branches were labeled either as "intra" or "inter" lineage, or as "terminal" or "internal", and

models 0 and 2 were compared via a likelihood ratio test (LRT) with one degree of freedom (see below). See

Figure 3 for a schematic of the labels applied to the trees used in the BM analyses.
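
For readers who wish to set up a comparable run, the sketch below renders a CODEML control file containing the settings stated in the text (F3x4 codon frequencies, initial κ = 2 and ω = 0.4, fix_blength = 1, branch model 0 or 2). The file names are hypothetical, and any option not mentioned in the text (for example runmode = 0 for a user-supplied tree and NSsites = 0) is an assumption of this example rather than a record of the control files actually used.

```python
def render_codeml_ctl(seqfile: str, treefile: str, outfile: str,
                      branch_model: int) -> str:
    """Build the text of a codeml control file for a branch-model run.

    branch_model: 0 = one omega for all branches,
                  2 = separate omegas for branch classes labeled in the tree.
    """
    if branch_model not in (0, 2):
        raise ValueError("this sketch only covers branch models 0 and 2")
    options = {
        "seqfile": seqfile,      # hypothetical path to the codon alignment
        "treefile": treefile,    # hypothetical path to the labeled topology
        "outfile": outfile,
        "runmode": 0,            # user-supplied tree (assumed)
        "seqtype": 1,            # codon sequences
        "CodonFreq": 2,          # F3x4
        "model": branch_model,
        "NSsites": 0,            # no variation in omega among sites (assumed)
        "fix_kappa": 0,
        "kappa": 2,              # initial value
        "fix_omega": 0,
        "omega": 0.4,            # initial value
        "fix_blength": 1,        # branch lengths in the tree used as starting points
    }
    return "\n".join(f"{key} = {value}" for key, value in options.items()) + "\n"
```

Rendering the file once with `branch_model=0` and once with `branch_model=2` gives the two runs whose log-likelihoods enter the LRT.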

Site models (SM). For the SM approach, the clade filter was not applied, which resulted in minor differences

between this data set and the other two (pairwise and branch models approach, see Table 1). We used the

site models from CODEML to identify codons with ω > 1 and thus to test if ARS codons bear evidence for

adaptive evolution. M0 (one ratio) assumes the existence of only one ω ratio for all codons, while M1 (neutral)

assumes the existence of two categories of sites, one with ω1 = 1 (sites evolving in a neutral fashion) and

the other with ω0 < 1 (sites evolving under purifying selection). M2 (selection) adds an extra category

to M1, with ω2 > 1, corresponding to sites with evidence for adaptive evolution. M7 (beta) is a flexible

null model in which ω is sampled from a beta distribution, so that 0 < ω < 1, while M8

adds an extra category to M7, ω2, which is estimated from the data (Yang 2006). Codons with posterior probabilities P > 0.95 of ω > 1 in the Bayes Empirical Bayes (BEB) (Yang et al 2005) approach implemented

in CODEML were considered to have significant evidence for adaptive evolution, following criteria described

elsewhere (Yang and Swanson 2002; Yang et al 2005). The ARS codon classification proposed by Bjorkman

et al. (1987) is referred to as BJOR, while the "peptide binding environments", i.e., the amino acid residues in

a fixed neighborhood of the peptide binding residues known from crystal structure complexes (which provide

a less restrictive description of the antigen binding sites), are referred to as CHEV (Chelvanayagam 1996).

Finally, the list of codons in HLA genes with evidence of ω > 1 from Yang and Swanson (2002) is referred to as

YANG (Figure 1 and Online Resource 1, Table S9). M1 vs M2 and M7 vs M8 models were compared through

an LRT with two degrees of freedom. Tables S3-S8 (Online Resource 1) show likelihoods obtained when altering

initial CODEML conditions for the SM analyses. SM analyses were performed for R and NR data sets.

Codons with P > 0.95 for ω > 1 in M8 (34 in total) were combined for the three loci, and the R and NR

data sets, and compared to CHEV, BJOR and YANG. Of these 34 codons, only one was outside of the exons

2 and 3 range (codon 305), which is where all ARS codons are located. Figure 1 shows the overlap between

the codons defined as making up the ARS in the BJOR and CHEV classifications, as well as those identified as

under selection in the YANG set of codons and our analyses.

In order to evaluate if our site model analyses were robust to features of the estimation method, the analyses

were repeated with DATAMONKEY, from the HYPHY package (Pond et al, 2005). The substitution model

used for construction of the NJ tree was HKY85 (very closely related to F84, used for CODEML analyses).

Two criteria for detecting significant dN/dS > 1 were considered: SLAC and FEL (both with significance level

of 0.1), with the former being the most conservative criterion available in the package. Tables S10-S12 report

the overlap of sites with evidence for dN/dS > 1 for BEB (CODEML), SLAC and FEL.

LRT. When comparing two nested models, the LRT test statistic is given by doubling the log likelihood difference

between the more parameter rich model and the less parameter rich model. The difference in parameter number

yields the degrees of freedom. It is expected that the use of a chi-square distribution for significance evaluation

of this test is a conservative approach (Yang 2006). Both site models and branch models comparisons were

performed through LRTs.
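
As a worked illustration of the test just described, the sketch below doubles the log-likelihood difference and evaluates it against a chi-square distribution with one degree of freedom (branch models) or two (site models), using the closed-form tail probabilities for those two cases. The function name and the log-likelihood values used in the usage note are hypothetical.

```python
import math
from typing import Tuple


def lrt(lnl_null: float, lnl_alt: float, df: int) -> Tuple[float, float]:
    """Likelihood ratio test for two nested models.

    Returns the statistic 2 * (lnL_alt - lnL_null) and its chi-square p-value.
    Only df = 1 and df = 2 are supported, since they have closed forms.
    """
    # Guard against tiny negative values caused by optimization noise.
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    if df == 1:
        p_value = math.erfc(math.sqrt(stat / 2.0))
    elif df == 2:
        p_value = math.exp(-stat / 2.0)
    else:
        raise ValueError("this sketch only handles 1 or 2 degrees of freedom")
    return stat, p_value
```

For example, hypothetical log-likelihoods of -1000.0 (null) and -997.0 (alternative) give a statistic of 6.0, which corresponds to p ≈ 0.0498 with two degrees of freedom.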

Breslow-Day test. In order to compare ARS and non-ARS codons with respect to the distribution of synonymous

and nonsynonymous changes within and between lineages (or for internal or terminal branches), we used a

contingency table approach similar to the one described in Templeton (1996). We estimated the synonymous

(S) and nonsynonymous (N) changes on each branch in CODEML, using the branch models. Next we

counted N (nonsynonymous changes) and S (synonymous changes) for intra/inter or terminal/internal branches

for each locus, and for ARS and non-ARS codons (Table 5).

We defined the odds ratio (OR) as:

$$\mathrm{OR} = \frac{N_{\mathrm{intra}} \cdot S_{\mathrm{inter}}}{N_{\mathrm{inter}} \cdot S_{\mathrm{intra}}},$$

and used a Breslow-Day test for homogeneity of OR to test the hypothesis that contingency tables from

ARS and non-ARS codons have the same OR. We applied the same test to internal/terminal branches. Data

from the three loci were combined into the same analysis to increase power.
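
The odds ratio compares the ratio of nonsynonymous to synonymous changes on intra-lineage branches with the same ratio on inter-lineage branches. A minimal sketch of the calculation is given below; the function name is an assumption, the counts in the usage note are made up for illustration, and the Breslow-Day homogeneity test itself is not implemented here.

```python
def odds_ratio(n_intra: float, s_intra: float,
               n_inter: float, s_inter: float) -> float:
    """Odds ratio (N_intra * S_inter) / (N_inter * S_intra).

    OR > 1: proportionally more nonsynonymous changes on intra-lineage branches.
    OR < 1: proportionally more nonsynonymous changes on inter-lineage branches.
    Raises ZeroDivisionError if N_inter or S_intra is zero.
    """
    return (n_intra * s_inter) / (n_inter * s_intra)
```

With made-up counts of N_intra = 10, S_intra = 20, N_inter = 30 and S_inter = 15, the odds ratio is 0.25, i.e., proportionally more nonsynonymous changes on the inter-lineage branches. The same function applies to the internal/terminal contrast by substituting the corresponding counts.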

Pairwise approach. We also performed analyses where statistics were estimated in comparisons between all pairs

of alleles (pairwise analyses, see Table 1) using runmode=-2 in CODEML. This approach does not require a

phylogenetic tree. Because IMGT/HLA nomenclature allows information about allelic lineages to be known

without a tree, pairs were also classified as intra or inter-lineage. Correlations between allelic divergence and

omega values were tested with a Mantel test using Pearson’s correlation coefficient (Online Resource 1, Table S13).

We obtained quantiles of the dSnon−ARS distribution and divided pairwise values according to these quantiles (Online Resource 1, Table S1 for non-ARS data set and Table 4 in main text for ARS data set). Differences

in mean ω values for "intra" and "inter" comparisons were tested for significance by a Wilcoxon rank sum test

(Figure 2).
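
Because the first field of an allele name identifies its lineage, pairs can be labeled without a tree. The sketch below illustrates this classification; the helper names are assumptions, and the allele names in the usage note are included only as examples of the nomenclature.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def lineage(allele: str) -> str:
    """Return locus plus first field, e.g. 'A*02:01:01' -> 'A*02'."""
    return allele.split(":")[0]


def classify_pairs(alleles: List[str]) -> Dict[Tuple[str, str], str]:
    """Label every pair of alleles as 'intra' or 'inter' lineage."""
    return {
        (a, b): "intra" if lineage(a) == lineage(b) else "inter"
        for a, b in combinations(alleles, 2)
    }
```

For instance, `classify_pairs(["A*02:01", "A*02:05", "A*24:02"])` labels the pair (A*02:01, A*02:05) as "intra" and both pairs involving A*24:02 as "inter".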

2.4 Allele frequencies of HLA SNPs in the 1000 Genomes

The IMGT/HLA database contains all HLA alleles described to date, regardless of their population frequencies.

Therefore, it is possible that rare variants can contribute disproportionately to patterns identified in the dN/dS

analyses. To address this concern, we investigated patterns of variation at the HLA loci in a population (Yoruba,

YRI) from the 1000 Genomes Project (1000G), for which frequency of alleles at specific SNP positions is

available (N = 88 individuals).

To test for a possible enrichment of rare variants in the IMGT data we compared patterns of variation

seen in the IMGT and 1000G phase I data (The 1000 Genomes Project Consortium, 2012). To this end, we

defined a set of sites, for each locus, which were variable in our IMGT-derived data sets (referred to as the

"OVERALL" set of sites). Next, we classified these sites as variable only within a single lineage ("INTRA"), or

variable in more than one lineage ("INTER"). For each site, we converted the positions within the HLA locus

into a genomic coordinate for H. sapiens (hg19).

Next, we verified if these positions are polymorphic in the 1000G Phase I low-coverage dataset (ftp://ftp.1000genomes.ebi.ac.uk/v

and recorded the minor allele frequency in the YRI population.

3 Results

3.1 Evidence for selection and assessment of recombination

Before investigating how ω varies over time and phylogenetic context, we tested (a) whether selection is

detectable in our data set with pairwise comparisons and phylogenetic dN/dS approaches; (b) if the presence

of HLA alleles resulting from intragenic recombination influences our inferences; and (c) if there is agreement

between the ARS codons defined by crystal structure (Bjorkman et al 1987; Chelvanayagam 1996) and the

codons inferred to have ω > 1 in our data set. The results of these tests are prerequisites for subsequent

analyses addressing the more specific hypotheses about heterogeneity in dN/dS estimated across the allelic

phylogeny and divergence time.

We quantified the mean pairwise dN/dS (ω), and found ωARS > 1 for all loci (Table 4). We used the

non-ARS codons from the same sequences as an internal control, and found that ωARS is 3.9 (HLA-A), 4.0

(HLA-B) and 3.2-fold (HLA-C) greater than ωnon-ARS (Table 4). This effect is not driven by a subset of the pairwise comparisons, since dN > dS for the majority (between 67 and 84%) of ARS pairwise comparisons, in

contrast to the non-ARS comparisons, where fewer than 7% show dN > dS (Table 4). Importantly, we find

that the result ωARS > ωnon-ARS is due to increased dN (3.5 to 14-fold higher for ARS), and not to decreased dS (0.5 to 2.8-fold higher for ARS, Table 4). Qualitatively similar results were obtained when we computed the

ratio of mean substitution rates, dN/dS (Table 4). These findings are robust to the presence of recombinants

(Online Resource 1, Table S1). Overall, our results document that pairwise comparison of alleles provides

strong support for adaptive evolution on ARS codons, as expected.

Evidence for adaptive evolution in ARS codons was also strongly supported by phylogenetic methods

from CODEML (see Methods), where models allowing for selection (M2 and M8) in a subset of codons were

significantly favored over the neutral models M1 and M7 (Online Resource 1, Table S2; p < 0.01, LRT). Results

were robust to starting conditions for HLA-A and HLA-B (Online Resource 1, Tables S3-S6), and less so for

HLA-C (Online Resource 1, Tables S7 and S8).

We next quantified the overlap between codons we inferred to be under selection (using site models

from CODEML, "SM") and those defined as ARS based on structural analyses of HLA (Chelvanayagam

1996; Bjorkman et al 1987). Within exons 2 and 3 (which contain all ARS codons) we identified 33 codons

with significant ω > 1 for the M8 site model (see Methods and Table S2, Online Resource 1) in at least

one locus, of which 27 (82%) are contained within the set that forms the ARS according to the crystal

structure-based classification (Bjorkman et al 1987), 25 (76%) are contained within the peptide binding

environments (Chelvanayagam 1996), and 25 (76%) overlap with Yang and Swanson’s (2002) site models

approach to detect codons with ω > 1 in the three classical class I HLA loci (Figure 1 and Online Resource 1,

Table S9). The association between ARS and selected sites for all loci is highly significant (p < 10^-11, chi-square

test). There is extensive overlap between the two ARS classifications (Bjorkman et al 1987; Chelvanayagam

1996) (Figure 1) and we also find a high overlap of selected sites between the R and NR data sets for each

locus (27 out of 33) (Online Resource 1, Tables S10-S12).

Overall, our results show that: (a) the pairwise and phylogenetic site models methods implemented in

CODEML strongly support adaptive evolution on the ARS codons of HLA loci - as also described by Yang

and Swanson (2002) through site models; (b) there is an enrichment of codons with ω > 1 in the CHEV set

of codons (see Online Resource 1, Table S9, for the names given to the sets of codons), supporting the use of

this classification for our study; (c) although the detection of selection was robust to the presence of recombinants, a finding

consistent with simulation studies (Anisimova et al, 2003), the estimated values for ω appear to be sensitive

to the inclusion of recombinants. Therefore, where appropriate, in subsequent pairwise analyses, we contrast

results of non-recombinant (NR) and recombinant (R) datasets, while for the branch models we use the NR

data set exclusively.

In addition, the results of HLA-C, although following the same trend observed for HLA-A and HLA-B,

show that absolute divergence values for ARS codons are on average 1/2 of those observed for the other two

loci, both for dN and dS (Table 4). This result might reflect the fact that HLA-C not only has an

antigen presentation function, but also plays a major role in interactions with NK receptors (KIR) and that, unlike

HLA-A and HLA-B, all HLA-C allotypes form ligands for KIR receptors (Hilton et al 2015; Single et al 2007).

Because the KIR loci have been shown to evolve quite rapidly across primate species, plausibly faster than

their MHC class I ligands (Single et al, 2007), it is possible that this important selective pressure is responsible

for the lower substitution rates seen for the ARS of HLA-C, as well as for the lack of consistency observed in

ML estimates.

3.2 The time-dependence of ω at HLA class I loci

Having confirmed that selection at ARS sites is detectable with pairwise comparisons and phylogenetic approaches,

we investigated if recent evolutionary change (accounting for differences among recently diverged alleles) shows

different signatures of selection with respect to changes that occurred over greater timescales. Our first approach

consisted of examining the distribution of ωARS as a function of the time since divergence between allele pairs. Our estimate of divergence time between allele pairs was based on the values of dS (estimated from non-ARS

codons) for each allele pair, thus avoiding statistical non-independence with ωARS. Because very recently diverged

alleles have low synonymous divergence (dSnon-ARS), the corresponding ωARS values were often undefined or extremely large. We therefore followed a strategy adopted by Wolf et al (2009) to filter out the allele pairs with

ωARS > 5 (resulting in the removal of 1.1%, 1.4% and 3.9% of ω values for pairwise comparisons at HLA-A, -B, and -C, respectively).
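
The filtering step can be sketched as below: pairs whose ωARS exceeds 5 are dropped and the fraction removed is reported. Treating undefined values (represented here as None, infinity or NaN) as removed is an assumption of this example, as are the function name and the data layout.

```python
import math
from typing import Iterable, List, Optional, Tuple

# Each comparison: (dS at non-ARS codons, omega at ARS codons or None if undefined)
Pair = Tuple[float, Optional[float]]


def filter_omega(pairs: Iterable[Pair],
                 max_omega: float = 5.0) -> Tuple[List[Tuple[float, float]], float]:
    """Keep pairs with a finite omega <= max_omega.

    Returns the retained pairs and the fraction of comparisons removed.
    """
    all_pairs = list(pairs)
    if not all_pairs:
        raise ValueError("no pairwise comparisons supplied")
    kept = [
        (ds, omega) for ds, omega in all_pairs
        if omega is not None and math.isfinite(omega) and omega <= max_omega
    ]
    removed_fraction = 1.0 - len(kept) / len(all_pairs)
    return kept, removed_fraction
```

The retained (dS, ω) pairs are the ones that would then enter the correlation and quantile analyses.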

Pairwise estimates show that ωARS increases as a function of divergence time (Table 4). Indeed, ωARS and

dSnon-ARS are positively correlated (Online Resource 1, Table S13; rHLA-A = 0.17, p < 0.001; rHLA-B = 0.20,

p < 0.001; rHLA-C = 0.20, p < 0.001; Pearson, significance obtained by Mantel Test). Qualitatively similar results were found for NR data sets and were robust to different correlation measures (Online Resource 1,

Table S13). We also compared the ω between allele pairs classified as intra and inter-lineage (Figure 2). For

all loci, the median value of ωARS is > 1 for the inter-lineage contrasts, and < 1 for the intra-lineage contrasts, and the distribution of ωARS is significantly higher for inter-lineage contrasts (p < 0.001, Wilcoxon rank sum test;

Figure 2).
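
Pairwise values are not independent, since each allele takes part in many comparisons, which is why significance of the correlation is assessed by permutation. The sketch below shows a generic one-sided Mantel test with Pearson's correlation on two symmetric matrices (for example, pairwise dSnon-ARS and pairwise ωARS). It is an illustration under simplifying assumptions (complete matrices with no missing values, 999 permutations, a fixed random seed) and not the implementation used for the reported results.

```python
import math
import random
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def upper_triangle(matrix: Matrix, order: Sequence[int]) -> List[float]:
    """Upper-triangle entries after relabeling rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(m1: Matrix, m2: Matrix, permutations: int = 999,
           seed: int = 0) -> Tuple[float, float]:
    """Return the observed correlation and a one-sided permutation p-value."""
    identity = list(range(len(m1)))
    x = upper_triangle(m1, identity)
    r_obs = pearson(x, upper_triangle(m2, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)  # permute rows and columns of m2 together
        if pearson(x, upper_triangle(m2, perm)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)
```

Permuting rows and columns jointly preserves the dependence structure within each matrix while breaking the association between the two.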

The above pairwise comparison approach suffers from the limitation that allele pairs with ω > 5 were

treated as missing data, possibly underestimating ω for recently diverged alleles. This prompted us to use a

phylogenetic model to contrast alleles at different levels of differentiation, which is more robust to the effects

of low differentiation between specific allele pairs. We compared a branch model that estimates a single ω for

all branches to one that estimates two values of ω (inter versus intra-lineage; terminal versus internal; see

Figure 3). For all loci we found higher ωARS for inter-lineage branches than for intra-lineage branches, although significance was not attained for these tests (Table 2). For the contrast between internal and terminal branches,

we found higher ωARS for internal branches at all loci and this result was statistically significant for HLA-C (Table 2).

Our results show that both pairwise comparisons and branch models indicate a heterogeneity of ω throughout

the diversification of HLA alleles, with higher ω values associated with contrasts between more divergent alleles

(pairwise approach) or to branches connecting different lineages or that are internal to the phylogeny (BM

approach), although the difference was not significant for the "intra-inter" contrasts.

3.3 Significantly more nonsynonymous changes inter-lineages at ARS codons

In this study we estimate ω for allele pairs or branches sampled within a single species, and over varying

timescales. Both of these features may bias the estimation of ω, as we now discuss.

Kryazhimskiy and Plotkin (2008) used analytical and simulation approaches to show that under positive

selection the behavior of ω within a single population is not a monotonic function of the intensity of selection, so

that ω within a population can be low, even under positive selection. This occurs because, when an advantageous

nonsynonymous variant is fixed in a population, nonsynonymous variation can be decreased due to the

homogeneity generated by the selective sweep. However, this scenario clearly does not apply to HLA genes,

where balancing selection maintains multiple nonsynonymous polymorphisms simultaneously segregating within

a population, contributing to ω > 1.

Another challenge to the interpretation of ω arises from the fact that many studies have shown that

genes under purifying selection show surprisingly high ω (often close to 1) when samples with short divergence

times are analyzed (e.g., those from a single population or species). For example, Rocha et al (2006) showed

that dN/dS between two samples is negatively correlated with their divergence times, and exemplified these

predictions with bacterial genomes. Likewise, a decrease of dN/dS with divergence time has been described in

Wolf et al (2009), but considering a much deeper timescale. Kryazhimskiy and Plotkin (2008) demonstrated

that this pattern is expected even under a regime of purifying selection that is constant over time. Thus, it

is plausible that the recent divergence times among alleles within HLA allelic lineages could result in inflated

intra-lineage ω values, explaining the modest differences between intra and inter-lineage ω values seen in the

phylogenetic analyses (Tables 2 and 3). To explore this issue further, we used non-ARS codons as an internal

control for this putative build-up of dN/dS at recent timescales, and to do so we compared their patterns

of variation to those of ARS codons. We found that non-ARS codons have larger intra-lineage ω values than

inter-lineage values, and also higher ω for terminal than internal branches (p < 0.05 for HLA-A in the intra

versus inter-lineage contrast, and for HLA-A and HLA-C in the tips versus internal contrast; LRT; Table

3). This distribution of ω values is in the exact opposite direction to that observed for the ARS (Table 2),

consistent with an effect of short divergence times inflating the estimates of ω (Kryazhimskiy and Plotkin

2008).

In order to formally test whether ARS and non-ARS codons have a different distribution of synonymous

and nonsynonymous changes intra and inter-lineages (or for internal and terminal branches) we employed a

contingency table approach similar to that of Templeton (1996). We used the inferred number of synonymous

(S) and nonsynonymous (N ) changes on each branch of the allelic phylogeny from each locus to estimate the

total number of each type of change in a specific class of branches (see Figure 3 for a schematic representation

of the branch labeling). The odds ratio was defined as presented in the Methods. For all loci, we find that

OR > 1 for non-ARS codons (proportionally more nonsynonymous changes on the intra-lineage branches) and OR < 1

for ARS codons (proportionally more nonsynonymous changes on the inter-lineage branches), as shown in

Table 5. This finding is consistent with the maximum likelihood estimates of ω for branches (Tables 2 and 3),

and the increased pairwise ω inter-lineage, relative to intra-lineage (Figure 2). To test for differences between

ARS and non-ARS codons, we pooled the contingency tables of all loci (due to the fact that several cells

for individual loci had low counts) and rejected the null hypothesis that contingency tables from ARS and

non-ARS codons have the same OR (p-value = 0.0069; Breslow-Day test). Our analysis comparing internal and terminal branches showed the same pattern, with proportionally more nonsynonymous changes in internal

branches for ARS codons (p-value = 0.00013; Breslow-Day test; Table 5).

In summary, although there is evidence for an excess of inter-lineage nonsynonymous changes (or for

internal branches) for ARS codons, there is also an enrichment for intra-lineage nonsynonymous changes for

ARS codons, when compared to non-ARS codons (P < 0.001; Fisher’s exact test). Next, we discuss possible

biases in the data set which could lead to these results.

3.4 Comparing dN/dS results with 1000 genomes variation

Our analyses are based on allele sequences available in the IMGT/HLA data base, which is a curated resource

to which newly discovered alleles are contributed. This data set is likely to be biased with respect to population

frequencies: very rare HLA alleles probably make up a disproportionately larger fraction than in true

population samples, because submission of all newly discovered alleles to IMGT is encouraged. We

therefore investigated if this bias influenced our findings. Specifically, we were concerned that the enrichment

for rare variants could result in an inflation of weakly deleterious nonsynonymous variants for recent divergence,

a well documented population genetic signature (Henn et al 2015). This signature could create an artificially

inflated value of ω for intra-lineage variability.

We found that only a subset of variable positions present in our IMGT-derived datasets are present in the

1000 Genomes Phase I low coverage data (Tables 6, 7, and 8 for HLA-A, HLA-B, and HLA-C, respectively).

This is in accordance with the greater degree of sampling of rare variants in the IMGT data set.

We next divided positions into two groups: those which are only variable within a single lineage (’INTRA’),

and those variable in more than one lineage (’INTER’). For comparison, a third group, which consists of all

variable sites (’OVERALL’), was also defined. We found that, considering all INTRA and INTER positions

present in the 1000G data, there is no significant difference in minor allele frequency (MAF) between the two

categories (Tables 6, 7, and 8). Furthermore, when we classify the 1000G HLA SNPs into low (MAF <= 0.1)

and high frequency (MAF > 0.1), we do not see an enrichment for low frequency variants within the ’INTRA’

set of SNPs when compared to the ’INTER’ set (Wilcoxon test, not shown).
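
The grouping of positions described above can be sketched in Python as follows. This is a minimal illustration, not the code used in the study: the function names, the input layout (a mapping from lineage name to the bases observed at one position) and the example lineage names are assumptions, as is the decision to return `None` for positions that are fixed within every lineage, a case the text does not assign to either group.

```python
def classify_position(bases_by_lineage):
    """Classify one alignment position as 'INTRA' or 'INTER'.

    bases_by_lineage maps a lineage name to the list of bases observed
    at this position among the alleles of that lineage (hypothetical layout).
    """
    variable = [
        name for name, bases in bases_by_lineage.items() if len(set(bases)) > 1
    ]
    if len(variable) == 1:
        return "INTRA"  # variable within a single lineage only
    if len(variable) > 1:
        return "INTER"  # variable within more than one lineage
    return None  # assumption: fixed within every lineage, left unclassified


def maf_bin(maf, threshold=0.1):
    """Bin a minor allele frequency as in the text: low if MAF <= 0.1."""
    return "low" if maf <= threshold else "high"


# Hypothetical example position with two lineages.
example = {"lineage_1": ["A", "G", "A"], "lineage_2": ["A", "A"]}
print(classify_position(example), maf_bin(0.15))
```

Positions in the OVERALL set are then the union of both groups.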

These results reassure us that the intra-lineage variation we observe is not biased in the direction of

extremely rare variants, and that our observation that there is evidence for stronger intra-lineage balancing

selection for ARS codons than for neutrally evolving regions (non-ARS) is not a spurious result driven by an

enrichment for low-frequency SNPs.

4 Discussion

Our study documents a positive correlation between dN/dS values and the degree of divergence between

allele pairs. This result is supported by phylogenetic analyses, which show higher ω values for branches

connecting different lineages, or branches which are internal to the phylogeny. A heterogeneous nonsynonymous

substitution rate (dN) for HLA genes was also reported in a study which found that dN for ARS codons is

not linearly correlated with divergence time in classical HLA loci (Yasukochi and Satta 2014). By further

investigating the temporal dynamics in the DRB1 gene, these authors showed that this rate heterogeneity

is likely the consequence of a reduction in the substitution rates in specific allelic lineages, possibly as a

consequence of continuous selective pressure by a specific pathogen. In the present study our goal was to

explicitly test for heterogeneity in the ω ratios over a priori defined groups of alleles (the HLA allelic lineages)

and for timescales of divergence (low and high divergence). As was the case with the study of Yasukochi

and Satta (2014), we find heterogeneity in the intensity of selection, in our case with evidence of increased

selection at deeper timescales than at more recent ones, and for greater selection on inter-lineage branches of

the allelic phylogeny, with respect to intra-lineage branches. Our findings indicate that long-term balancing

selection has resulted in an enrichment for adaptive changes between allelic lineages for HLA class I genes,

with proportionally weaker signatures of molecular adaptation for recent (terminal and intra-lineage branches)

than for the inter-lineage and for the internal branches.

Although previous studies have shown that low divergence is often associated with inflated ω estimates (Rocha

et al, 2006), the phylogenetic analyses carried out in the present work relied on non-ARS codons as a control

to show that the low divergence times of intra-lineage contrasts do not explain the ω > 1 values within lineages,

at ARS codons. Thus, while we show that inter-lineage selection is stronger than intra-lineage selection, our

results also demonstrate that intra-lineage variation bears a signature of balancing selection.

Recently several papers have drawn attention to the effects of divergence times on dN/dS estimation (e.g.

Wolf et al 2009; Stoletzki and Eyre-Walker 2011), and the complexities of interpreting these values when data

is drawn from a single population (Rocha et al 2006; Kryazhimskiy and Plotkin 2008). Our finding of increased

ωARS among more divergent alleles (or for inter-lineage branches) is conservative in light of these findings, which predict decreased ω for more divergent alleles. We accounted for this effect by using non-ARS codons, which

have a similar phylogenetic structure to that of ARS codons (after removal of recombinants) to control for

the background inflation of ω in recently diverged alleles, and found that ARS codons have a very different

distribution of ω, with increased inter-lineage evidence for selection, exactly the opposite to what is seen for

non-ARS codons.

An important caveat to this interpretation is that the temporal dynamics of dN/dS appears to be sensitive

to the selective regime which is assumed to be operating. Thus, while several authors have shown that, under

purifying selection, increased dN/dS at low divergence is expected, positive selection can produce a positive

correlation with divergence times (Dos Reis and Yang 2013; Mugal et al 2014), which could account for part of

the results we describe in this study. However, the case of directional positive selection, involving the sequential

substitution of adaptive mutations, is markedly different from the dynamics of a balanced polymorphism, as

is the case for HLA genes.

Assuming that balancing selection has been the main selective regime shaping the molecular evolution

of HLA genes, and that heterozygote advantage is one (even if not the only one) of the mechanisms through

which selection has acted upon this system, our finding that inter-lineage ωARS is greater than intra-lineage is consistent with the divergent allele advantage model, according to which heterozygotes for more divergent

alleles have higher fitness than those carrying similar alleles (Wakeland et al 1990). Under this model, an excess

of inter-lineage nonsynonymous changes in HLA genes would be expected, which is a result we have shown for

the ARS data set. This model has been shown to explain patterns of variation in the DRB locus in Galapagos

sea lions, where local allelic divergence at this locus, and not mere heterozygosity or number of alleles at the MHC

locus, directly and positively influences fitness (Lenz et al 2013). Most likely several selective regimes have

shaped the evolutionary history of MHC genes, as suggested by previous observations, and our contribution

suggests that these selective regimes could be operating alongside divergent allele advantage.

Our results suggest that groups of functionally related alleles (in our analysis, the allelic lineages) should

be regarded as important targets of selection, rather than individual alleles. In line with our observations,

it has been proposed that HLA supertypes - groups of alleles sharing chemical properties at the B and F

pockets of the ARS region (Sidney et al 1996) - constitute the level of variation that is the primary target of

natural selection in HLA-B genes (Francisco et al 2015). Since there is a high overlap between allelic lineage

and supertype classifications (Sidney et al 1996), our results indicate that attempts to understand how natural

selection acts on HLA variation would benefit from comparing the effects of selection at the allelic, allelic lineage or

supertype levels of variation.

Electronic Supplementary Material

Supporting tables are available as an additional file.

Competing Interests

The authors declare that they have no competing interests.

Authors’ Contributions

BDB participated in the design of the study, performed analyses, discussed results and drafted the

manuscript. RSF performed analyses and discussions. DM conceived of the study, participated in its design

and discussion and in the drafting of the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors thank Kelly Nunes for thoughtful comments on the manuscript, Richard Single for comments on the statistical aspects of this work, Aida M. Andrés for general comments and Débora Y. C. Brandt for help with the 1000 Genomes data sets. This work was supported by the São Paulo Research Foundation (grants #2008/09127-8 and #2011/12500-2 to BDB; #08/56502-6 to DM) and Conselho Nacional de Desenvolvimento Científico e Tecnológico (#152676/2011-2 to BDB, #142130/2009-5 to RSF and #308960/2009-2 to DM). The final publication is available at Springer via http://dx.doi.org/10.1007/s00239-015-9713-9

Data available in public repositories

https://github.com/bbitarello/dNdS-hla-allelic-lineages

References

Albrechtsen A, Moltke I, Nielsen R (2010) Natural selection and the distribution of identity-by-descent in the

human genome. Genetics 186(1):295–308

Anisimova M, Nielsen R, Yang Z (2003) Effect of recombination on the accuracy of the likelihood method for

detecting positive selection at amino acid sites. Genetics 164(3):1229–36

Apps R, Qi Y, Carlson JM, Chen H, Gao X, Thomas R, Yuki Y, Del Prete GQ, Goulder P, Brumme ZL,

Brumme CJ, John M, Mallal S, Nelson G, Bosch R, Heckerman D, Stein JL, Soderberg Ka, Moody MA,

Denny TN, Zeng X, Fang J, Moffett A, Lifson JD, Goedert JJ, Buchbinder S, Kirk GD, Fellay J, McLaren

P, Deeks SG, Pereyra F, Walker B, Michael NL, Weintrob A, Wolinsky S, Liao W, Carrington M (2013)

Influence of HLA-C expression level on HIV control. Science 340(6128):87–91

Bjorkman PJ, Saper MA, Samraoui B, Bennett WS, Strominger JL, Wiley DC (1987) Structure of the human

class I histocompatibility antigen, HLA-A2. Nature 329(6139):506–12

Chelvanayagam G (1996) A roadmap for HLA-A, HLA-B, and HLA-C peptide binding specificities.

Immunogenetics 45(1):15–26

Dean M, Carrington M, O’Brien SJ (2002) Balanced polymorphism selected by genetic versus infectious human

disease. Annu Rev Genomics Hum Genet 3:263–92

Doherty PC, Zinkernagel RM (1975) Enhanced immunological surveillance in mice heterozygous at the H-2

gene complex. Nature 256(5512):50–52

Dos Reis M, Yang Z (2013) Why do more divergent sequences produce smaller nonsynonymous/synonymous

rate ratios in pairwise sequence comparisons? Genetics 195(1):195–204

Felsenstein J (1989) PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5:164–166

Francisco RS, Buhler S, Nunes JM, Bitarello BD, França GS, Meyer D, Sanchez-Mazas A (2015) HLA supertype

variation in human populations: new insights about the role of natural selection on the evolution of HLA-A

and HLA-B polymorphisms. Immunogenetics, DOI 10.1007/s00251-015-0875-9.

Garrigan D, Hedrick PW (2003) Detecting adaptive molecular polymorphism: lessons from the MHC.

Evolution 57(8):1707–1722

Goldman N, Yang Z (1994) A codon-based model of nucleotide substitution for protein-coding DNA sequences.

Mol Biol Evol 11(5):725–736

Harris E, Meyer F (2006) The Molecular Signature of Selection Underlying Human Adaptations. Yearb Phys

Anthropol 130:89-130

Hedrick PW (2002) Pathogen resistance and genetic variation at MHC loci. Evolution 56(10):1902–1908

Hedrick PW, Thomson G (1983) Evidence for balancing selection at HLA. Genetics 104(3):449–56

Henn B, Botigué LR, Bustamante C, Clark AG, Gravel S (2015) Estimating the mutation load in human

genomes. Nat Rev Genetics 16:333–343

Hilton HG, Guethlein LA, Goyos A, Nemat-Gorgani N, Bushnell DA, Norman PJ, Parham P (2015)

Polymorphic HLA-C Receptors Balance the Functional Characteristics of KIR Haplotypes. J Immunol

195:3160-3170

Hughes AL, Nei M (1988) Pattern of nucleotide substitution at major histocompatibility complex class I loci

reveals overdominant selection. Nature 335(6186):167–170

Hughes AL, Nei M (1989) Nucleotide substitution at major histocompatibility complex class II loci: evidence

for overdominant selection. Proc Natl Acad Sci U S A 86(3):958–962

Hughes AL, Yeager M (1998) Natural selection at major histocompatibility complex of vertebrates. Annu Rev

Genet pp 415–435

Huttley G, Smith MW, Carrington M, O’Brien S (1999) A scan for linkage disequilibrium across the human

genome. Genetics 152(4):1711–1722

Klein J, Sato A (2000) The HLA system. First of two parts. N Engl J Med 343(10):702–709

Kryazhimskiy S, Plotkin JB (2008) The Population Genetics of dN/dS. PLoS Genet 4(12):10

Lenz T, Mueller B, Trillmich F, Wolf JBW (2013) Divergent allele advantage at MHC-DRB through direct

and maternal genotypic effects and its consequences for allele pool composition and mating. Proc R Soc B

280: 20130714

Martin DP, Lemey P, Lott M, Moulton V, Posada D, Lefeuvre P (2010) RDP3: a flexible and fast computer

program for analyzing recombination. Bioinformatics 26(19):2462–3

Meyer D, Thomson G (2001) How selection shapes variation of the human major histocompatibility complex:

a review. Ann Hum Genet 65(1):1–26

Mugal CF, Wolf JBW, Kaj I (2014) Why time matters: codon evolution and the temporal dynamics of dN/dS.

Mol Biol Evol 31(1):212–31

Penn DJ, Damjanovich K, Potts WK (2002) MHC heterozygosity confers a selective advantage against

multiple-strain infections. Proc Natl Acad Sci U S A 99(17):11260–4

Pond SLK, Frost SDW, Muse SV (2005) HyPhy: hypothesis testing using phylogenies. Bioinformatics

21(5):676-679

Prugnolle F, Manica A, Charpentier M, Guégan JF, Guernier V, Balloux F (2005) Pathogen-driven selection

and worldwide HLA class I diversity. Curr Biol 15(11):1022–7

Richman A (2000) Evolution of balanced genetic polymorphism. Mol Ecol 9(12):1953–63

Robinson J, Halliwell Ja, McWilliam H, Lopez R, Parham P, Marsh SGE (2013) The IMGT/HLA database.

Nucleic Acids Res 41(Database issue):D1222–7

Rocha EPC, Smith JM, Hurst LD, Holden MTG, Cooper JE, Smith NH, Feil EJ (2006) Comparisons of dN/dS

are time dependent for closely related bacterial genomes. J Theor Biol 239(2):226–235

Saitou N, Nei M (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees.

Mol Biol Evol 4:406–425

Sidney J, Grey HM, Kubo RT, Sette A (1996) Practical, biochemical and evolutionary implications of the

discovery of HLA class I supermotifs. Immunol Today 17(6):261–6

Single RM, Martin MP, Gao X, Meyer D, Yeager M, Kidd JR, Kidd K, Carrington M (2007) Global diversity

and evidence for coevolution of KIR and HLA. Nat Genetics 9:1114–1119

Slade R, McCallum H (1992) Overdominant vs. frequency-dependent selection at MHC loci. Genetics

132:861–864

Spurgin LG, Richardson DS (2010) How pathogens drive genetic diversity: MHC, mechanisms and

misunderstandings. Proc Biol Sci 277(1684):979–88

Stoletzki N, Eyre-Walker A (2011) The positive correlation between dN/dS and dS in mammals is due to runs

of adjacent substitutions. Mol Biol Evol 28(4):1371–1380

Takahata N, Nei M (1990) Allelic Genealogy Under Overdominant and Frequency-Dependent Selection and

Polymorphism of Major Histocompatibility Complex Loci. Genetics 124(4):967–978

Takahata N, Satta Y (1998) Footprints of intragenic recombination at HLA loci. Immunogenetics 47(6):430–441

Templeton AR (1996) Contingency tests of neutrality using intra/interspecific gene trees: the rejection of

neutrality for the evolution of the mitochondrial Cytochrome Oxidase II gene in the hominoid primates.

Genetics 144(3):1263–1270

The 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human

genomes. Nature 491:56–65

Wakeland EK, Boehme S, She JX, Lu CC, McIndoe RA, Cheng I, Ye Y, Potts WK (1990) Ancestral

Polymorphisms of MHC Class II Genes: Divergent Allele Advantage. Immunol Res 9:115–122

Wolf JBW, Künstner A, Nam K, Jakobsson M, Ellegren H (2009) Nonlinear dynamics of nonsynonymous (dN)

and synonymous (dS) substitution rates affects inference of selection. Genome Biol Evol 1:308–319

Yang Z (2006) Computational molecular evolution. Oxford University Press, Oxford

Yang Z (2007) PAML 4: Phylogenetic Analysis by Maximum Likelihood. Mol Biol Evol 24(8):1586–1591

Yang Z, Swanson WJ (2002) Codon-Substitution Models to Detect Adaptive Evolution that Account for

Heterogeneous Selective Pressures Among Site Classes. Mol Biol Evol 19(1):49–57

Yang Z, Wong WSW, Nielsen R (2005) Bayes empirical bayes inference of amino acid sites under positive

selection. Mol Biol Evol 22(4):1107–1118

Yasukochi Y, Satta Y (2014) Nonsynonymous Substitution Rate Heterogeneity in the Peptide-Binding Region

Among Different HLA-DRB1 Lineages in Humans. G3 (Bethesda)

Locus   All alleles^a   SM (R/NR)^b   Pairwise (R/NR)^c   BM pruned data set^d   Codons: Total   Non-ARS   ARS
HLA-A   1193            144/107       138/104             93                     340             292       48
HLA-B   1799            233/78        173/71              63                     324             276       48
HLA-C   829             133/109       125/110             105                    341             293       48

Table 1 Number of alleles and codons for different data sets. a, all alleles available in release 3.1.0 (2010-07-15), including possible recombinants; b, SM, data set used for site models, i.e., after selection of alleles with complete coding sequences; c, R/NR, data sets with and without recombinants; d, BM (branch models) pruned data set is the NR data set after pruning of alleles which do not cluster within their respective allelic lineages (see Methods)

Locus   ω^a    ω_inter^b   ω_intra^c   2∆l^d   ω_internal^e   ω_terminal^f   2∆l′
HLA-A   1.84   2.03        1.68        0.06    2.35           1.39           0.49
HLA-B   0.99   1.16        0.73        0.71    1.2            0.69           0.97
HLA-C   1.89   4.14        1.19        2.61    4.91           0.95           7.36*

Table 2 Branch model dN/dS estimations and LRT results (ARS data sets). * significance at 5%; Data sets after removal of recombinants (NR); a, ω estimate under model 0 (one for all branches); b, ω inter lineages; c, ω intra lineages; d, negative log-likelihood difference between two nested models; e, ω for internal branches; f, ω for terminal branches

Locus   ω^a    ω_inter^b   ω_intra^c   2∆l^d   ω_int^e   ω_ter^f   2∆l′
HLA-A   0.53   0.40        0.77        2.8     0.39      0.95      4.57*
HLA-B   0.42   0.40        0.55        0.34    0.39      0.66      0.86
HLA-C   0.50   0.39        0.79        3.97*   0.38      0.92      5.27*

Table 3 Branch model dN/dS estimations and LRT results (non-ARS data set). * significance at 5%; Data sets after removal of recombinants (NR); a, ω estimate under model 0 (one for all branches); b, ω inter lineages; c, ω intra lineages; d, negative log-likelihood difference between two nested models; e, ω for internal branches; f, ω for terminal branches
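
The significance flags of the likelihood ratio tests in Tables 2 and 3 can be checked with a short Python sketch. It assumes, as our reading rather than a statement from the tables, that each comparison of a two-ratio model against the one-ratio model has one degree of freedom, so that 2∆l is compared with a chi-square distribution with 1 df. The function name `lrt_pvalue_1df` is hypothetical.

```python
import math


def lrt_pvalue_1df(two_delta_l):
    """P-value of a likelihood ratio statistic under chi-square with 1 df.

    For 1 df the survival function is P(X > x) = erfc(sqrt(x / 2)).
    """
    if two_delta_l <= 0:
        return 1.0
    return math.erfc(math.sqrt(two_delta_l / 2.0))


# 2*delta-l values for internal vs terminal branches (Table 3, non-ARS).
lrt_stats = {"HLA-A": 4.57, "HLA-B": 0.86, "HLA-C": 5.27}
for locus, stat in lrt_stats.items():
    p = lrt_pvalue_1df(stat)
    flag = "*" if p < 0.05 else ""
    print(f"{locus}: 2dl = {stat:.2f}, p = {p:.3f}{flag}")
```

Under this assumption, statistics above roughly 3.84 are significant at the 5% level, which matches the entries marked with an asterisk.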

                     non-ARS                                        ARS
Locus   Quantile^a   dN     dS     ω^b     dN/dS   dN > dS^d        dN     dS     ω      dN/dS   dN > dS
HLA-A   mean^c       0.02   0.05   0.35    0.35    628 (6.64%)      0.12   0.07   1.36   1.74    7364 (77.90%)
        1            0.00   0.01   0.35    0.42    628              0.05   0.04   1.08   1.41    2132
        2            0.02   0.05   0.398   0.397   0                0.12   0.06   1.47   1.94    2347
        3            0.02   0.06   0.37    0.37    0                0.14   0.09   1.34   1.55    2316
        4            0.02   0.08   0.29    0.29    0                0.15   0.08   1.50   1.97    2339
HLA-B   mean^c       0.01   0.04   0.33    0.30    470 (3.16%)      0.14   0.11   1.33   1.26    9908 (66.59%)
        1            0.01   0.02   0.46    0.46    470              0.10   0.09   1.17   1.08    2405
        2            0.01   0.03   0.35    0.35    0                0.15   0.12   1.25   1.21    2460
        3            0.01   0.05   0.27    0.27    0                0.15   0.13   1.28   1.18    2229
        4            0.02   0.06   0.25    0.25    0                0.17   0.11   1.58   1.59    2814
HLA-C   mean^c       0.02   0.05   0.38    0.37    474 (6.12%)      0.07   0.02   1.22   3.04    6514 (84.05%)
        1            0.00   0.01   0.44    0.46    474              0.04   0.02   0.99   1.71    1303
        2            0.01   0.04   0.31    0.31    0                0.07   0.02   1.04   3.52    1791
        3            0.02   0.06   0.41    0.41    0                0.08   0.02   1.63   3.95    1810
        4            0.03   0.08   0.37    0.37    0                0.09   0.03   1.55   3.35    1682

Table 4 Pairwise estimations for substitution rates (data sets prior to the removal of recombinants). a, quantiles of divergence (dS non-ARS); b, average pairwise dN/dS; c, rows labeled "mean" (bold in the original) give the average pairwise values for each locus; d, percentages correspond to the proportion of pairs for which dN > dS in relation to the total number of pairwise comparisons
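
The binning behind Table 4 can be illustrated with a minimal Python sketch: allele pairs are ordered by a divergence measure (here standing in for dS at non-ARS codons), split into four equal-sized groups, and the pairwise ω values are averaged within each group. The function name, the input layout and the example values are hypothetical and are not the data or code of the study.

```python
from statistics import mean


def mean_omega_by_quantile(pairs, n_bins=4):
    """Average pairwise omega within equal-sized bins of divergence.

    pairs is a list of (divergence, omega) tuples, one per allele pair.
    Returns one mean per bin, ordered from lowest to highest divergence.
    """
    ordered = sorted(pairs, key=lambda pair: pair[0])
    size = len(ordered)
    means = []
    for i in range(n_bins):
        chunk = ordered[i * size // n_bins:(i + 1) * size // n_bins]
        means.append(mean(omega for _, omega in chunk) if chunk else None)
    return means


# Hypothetical (divergence, omega) values for eight allele pairs.
example_pairs = [
    (0.01, 1.0), (0.02, 1.2), (0.03, 1.4), (0.04, 1.6),
    (0.05, 1.3), (0.06, 1.5), (0.07, 1.9), (0.08, 2.1),
]
print(mean_omega_by_quantile(example_pairs))
```

Splitting by rank rather than by fixed divergence thresholds keeps the number of pairs per bin roughly equal, which is one plausible way to build such quantiles.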

                         Branch category
Data set   Substitution  intra    inter    terminal   internal
non-ARS    N             118.8    148      89.3       158.5
           S             39.4     106      24.9       115.1
                         OR = 2.15         OR = 2.96
ARS        N             172.7    230.7    144.3      291.2
           S             18.5     17.5     17.4       17.7
                         OR = 0.71         OR = 0.21
                         p = 6.9 × 10^−3 * p = 1.3 × 10^−4 *

Table 5 Distribution of changes for ARS and non-ARS codons. Counts correspond to the total (combined) values for HLA-A, -B and -C; *significant at 1%; N, nonsynonymous change; S, synonymous change; intra, intra lineage; inter, inter lineage; terminal, terminal branches; internal, internal branches

Set of SNPs   Var. Pos   Var. Pos. 1000g   MAF <= 0.1   MAF > 0.1   MAF
Intra         68         29                5            24          0.15
Inter         88         55                12           43          0.14
Overall       156        84                17           67

Table 6 HLA-A: MAFs for SNPs in the 1000 Genomes dataset. Overall, set of variable positions considering all sequences in the site models dataset after removal of recombinants. Intra, subset of the ’Overall’ set which is variable only within one allelic lineage for the locus. Inter, subset of the ’Overall’ set which is variable within more than one allelic lineage. Var.Pos, set of all variable positions in the site models dataset. Var.Pos.1000g, subset of Var.Pos which is a SNP in the 1000G low coverage Phase I data. MAF, minor allele frequency. For details, see Methods.

Set of SNPs   Var. Pos   Var. Pos. 1000g   MAF <= 0.1   MAF > 0.1   MAF
Intra         44         24                6            18          0.30
Inter         59         38                8            30          0.39
Overall       103        62                14           48

Table 7 HLA-B: MAFs for SNPs in the 1000 Genomes dataset. Overall, set of variable positions considering all sequences in the site models dataset after removal of recombinants. Intra, subset of the ’Overall’ set which is variable only within one allelic lineage for the locus. Inter, subset of the ’Overall’ set which is variable within more than one allelic lineage. Var.Pos, set of all variable positions in the site models dataset. Var.Pos.1000g, subset of Var.Pos which is a SNP in the 1000G low coverage Phase I data. MAF, minor allele frequency. For details, see Methods.

Set of SNPs   Var. Pos   Var. Pos. 1000g   MAF <= 0.1   MAF > 0.1   MAF
Intra         78         27                8            19          0.26
Inter         68         55                19           36          0.24
Overall       146        82                27           55

Table 8 HLA-C: MAFs for SNPs in the 1000 Genomes dataset. Overall, set of variable positions considering all sequences in the site models dataset after removal of recombinants. Intra, subset of the ’Overall’ set which is variable only within one allelic lineage for the locus. Inter, subset of the ’Overall’ set which is variable within more than one allelic lineage. Var.Pos, set of all variable positions in the site models dataset. Var.Pos.1000g, subset of Var.Pos which is a SNP in the 1000G low coverage Phase I data. MAF, minor allele frequency. For details, see Methods.

Fig. 1 Overlap between two ARS classifications and two site models studies. BJOR and CHEV are ARS classifications (Bjorkman et al 1987; Chelvanayagam 1996); YANG is a list of codons with significant in HLA genes; BIT is the set of codons with from our SM (site models) approach (see Materials and Methods for details)

Fig. 2 Pairwise estimates for intra-lineage and inter-lineage pairs of alleles. These results refer to ARS data sets prior to the removal of recombinants, for pairwise analyses; Green, inter-lineage; purple, intra-lineage; gray, non-ARS; * significant difference between ω̄ (intra) and ω̄ (inter) (p < 0.001, Wilcoxon rank sum test)

Fig. 3 Schematic representation of the allelic phylogenies used in the branch models approach. Left: terminal vs internal branches; right: intra-lineage vs inter-lineage; For the branch models approach, we labeled branches of each tree (HLA-A, -B and -C) as “intra/inter” or “terminal/internal” and ran model 2 (CODEML), which allows for two independent ω values to be estimated, according to these labels
