Expansion of Disease Gene Families by Whole Genome Duplication in Early Vertebrates Param Priya Singh
Total Page:16
File Type:pdf, Size:1020Kb
Expansion of disease gene families by whole genome duplication in early vertebrates Param Priya Singh To cite this version: Param Priya Singh. Expansion of disease gene families by whole genome duplication in early verte- brates. Bioinformatics [q-bio.QM]. Institut Curie, Paris; Université Pierre et Marie Curie; Paris 6, 2013. English. tel-01162244 HAL Id: tel-01162244 https://tel.archives-ouvertes.fr/tel-01162244 Submitted on 10 Jun 2015 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Public Domain THÈSE DE DOCTORAT DE l’UNIVERSITÉ PIERRE ET MARIE CURIE Spécialité Informatique École doctorale Informatique, Télécommunications et Électronique (Paris) Présentée par Param Priya SINGH Pour obtenir le grade de DOCTEUR de l’UNIVERSITÉ PIERRE ET MARIE CURIE Sujet de la thèse : Expansion des familles de gènes impliquées dans des maladies par duplication du génome chez les premiers vertébrés (Expansion of disease gene families by whole genome duplication in early vertebrates) Soutenue le 11 Décembre 2013 Devant le jury composé de : M. Hugues ROEST-CROLLIUS Rapporteur M. Pierre PONTAROTTI Rapporteur M. Lluis QUINTANA-MURCI Examinateur M. Gilles FISCHER Examinateur Mme. Ilaria CASCONE Examinateur M. Hervé ISAMBERT Directeur de thèse Contents Acknowledgements iii Abbreviations & Definitions ix I Introduction1 1 Preamble 3 1.1 Résumé de la thèse......................................3 1.2 Thesis summary.......................................5 1.3 Organization of the thesis.................................6 1.4 Publications resulted/forthcoming from this thesis...................6 2 Evolution by Gene Duplication7 2.1 Mechanisms of gene duplication..............................7 2.1.1 Unequal crossing over...............................7 2.1.2 Retroposition.....................................9 2.1.3 Non-homologous mechanisms........................... 10 2.1.4 Whole genome duplication............................. 10 3 Evolutionary Constraints & Retention of Duplicated Genes 13 3.1 Neofunctionalization..................................... 13 3.2 Subfunctionalization..................................... 14 3.3 Buffering against deleterious mutations......................... 14 3.4 Dosage balance hypothesis................................. 15 4 Whole Genome Duplications & Evolution of Vertebrates 17 5 Objectives of This Thesis 21 II Materials & Methods 23 6 Identification of Ohnologs 25 6.1 Input genomes, orthologs and paralogs.......................... 25 6.1.1 Protein coding genes and their genomic coordinates.............. 25 6.1.2 Orthologs and paralogs............................... 27 6.2 Identification of the synteny blocks and anchors.................... 28 6.3 Calculation of P-value to rule out spurious synteny.................. 30 6.4 Identify putative ohnolog pairs.............................. 31 6.5 Combine P-value from anchors............................... 32 v vi CONTENTS 6.6 Sample genomes with multiple window sizes...................... 32 6.7 Combine P-value from all outgroups........................... 33 6.8 Filter ohnolog pairs to remove false positives...................... 33 6.9 Construction of ohnolog families.............................. 33 6.10 Randomization of the human genome.......................... 34 6.11 Small Scale Duplicates (SSD)............................... 34 6.12 Ohnologs in the teleost fish genomes: the 3R-WGD.................. 35 6.12.1 The 2R-WGD..................................... 35 6.12.2 The 3R-WGD..................................... 36 6.13 Development of the OHNOLOGS server......................... 36 7 Collection of Cancer/Disease Genes & Functional Genomic Data 37 7.1 Cancer genes......................................... 37 7.1.1 Oncogenes and tumor suppressors........................ 39 7.1.2 “Core” cancer genes................................. 40 7.2 Dominant & recessive disease genes........................... 40 7.3 Haploinsufficient and dominant negative genes.................... 40 7.4 Genes with autoinhibitory protein folds......................... 41 7.5 Genes coding for protein complexes............................ 42 7.5.1 Human protein reference database........................ 42 7.5.2 Comprehensive resource of mammalian protein complexes......... 42 7.5.3 Gene ontology.................................... 43 7.5.4 Census of soluble human protein complexes.................. 43 7.5.5 Permanent complexes................................ 43 7.6 Essential genes........................................ 43 7.6.1 Human orthologs of Mouse essential genes................... 43 7.6.2 Human essential genes from In-vitro knock-out experiments........ 44 7.7 Genes with copy number variations............................ 45 7.8 Expression Level....................................... 45 7.9 Disease genes in other vertebrates............................ 45 7.9.1 Mouse......................................... 46 7.9.2 Rat........................................... 46 7.10 Analysis of ohnologs conservation using Ka/Ks ratios................. 46 8 Causal Inference Analysis 49 8.1 Mediation Analysis...................................... 49 8.1.1 Total, direct & indirect effects........................... 49 8.1.2 Mediation calculations............................... 50 8.1.3 Interpretation of Mediation results........................ 51 8.1.4 Application on genomic properties........................ 52 III Results 53 9 Characterization of Vertebrate Ohnologs 55 9.1 Combining information from Multiple outgroups improves ohnolog detection... 55 9.1.1 Comparison with randomized human genome................. 56 9.2 Comparison of ohnologs with published datasets.................... 58 9.3 Ohnolog family size distribution.............................. 61 CONTENTS vii 9.4 Ohnolog pairs for other vertebrates............................ 62 9.5 The OHNOLOGS server.................................. 63 9.5.1 Search......................................... 63 9.5.2 Interpretation of an ohnolog family........................ 64 9.5.3 Browse & Download................................. 65 9.6 Ohnologs in the Teleost fish genomes........................... 66 9.6.1 Ohnologs from the 2R-WGD............................ 66 9.6.2 Ohnologs from the 3R-WGD............................ 67 10 Enhanced Retention of “Dangerous” Genes by WGD 69 10.1 The Majority of “dangerous” genes retain more ohnologs............... 69 10.1.1 Ohnolog–disease association is consistent for high confidence ohnolog datasets 72 10.1.2 Enhanced retention of “dangerous” ohnologs in Mouse & Rat genomes.. 73 10.2 “Dangerous” genes show no biased retention by SSD or CNV............ 74 10.2.1 Small scale duplicates from Ensembl...................... 74 10.2.2 Small scale duplicates from sequence comparisons.............. 76 10.2.3 Ohnolog and SSD retention bias in different human primary tumors... 76 10.3 Mapping cancer and disease gene duplications on Ensembl duplication nodes.. 77 10.4 Ohnologs are more conserved than non-ohnologs.................... 78 10.5 Dominant, and not recessive disease genes have retained more ohnologs..... 80 10.5.1 Recessive disease genes............................... 80 10.5.2 Essential genes.................................... 81 11 Dosage Balance, Expression level & Human Ohnologs 83 11.1 Mixed susceptibility of human ohnologs to dosage balance.............. 84 11.1.1 High retention of protein complexes in ohnologs................ 84 11.1.2 Transient versus permanent complexes..................... 85 11.1.3 Susceptibility of human protein complexes to disease mutations...... 86 11.2 Gene expression level and human ohnologs....................... 86 11.3 Sequence conservation and ohnolog retention...................... 88 12 Indirect Causes of Ohnolog Retention 91 12.1 The effect of dosage balance is mediated by mutation susceptibility........ 92 12.1.1 Mediation of ‘Dosage.Bal.’ ‘Ohnolog’ by ‘Delet.Mut.’ genes..... 95 Æ) 12.1.2 Mediation of ‘Delet.Mut.’ ‘Ohnolog’ by ‘Dosage.Bal.’ genes..... 96 Æ) 12.1.3 Mediation of ‘Dosage.Bal.’ ‘Ohnolog’ by ‘Delet.Mut.’ genes after Æ) excluding SSD and CNV genes.......................... 97 12.1.4 Mediation of ‘Delet.Mut.’ ‘Ohnolog’ by ‘Dosage.Bal.’ genes after Æ) excluding SSD and CNV genes.......................... 98 12.2 Small effect of essentiality on ohnolog retention.................... 99 12.3 Negative causal effect of high expression on ohnolog retention........... 100 12.4 Sequence conservation & ohnolog retention....................... 101 12.4.1 Mediation with low Ka/Ks values......................... 101 12.4.2 Mediation with high Ka/Ks values........................ 102 13 Population Genetic Model for the Retention of “Dangerous” Ohnologs 105 IV Discussion & Perspectives 109 14 Discussion & Perspectives 111 A Articles 117 List of Figures 140 List of Tables 141 Bibliography 156 Abbreviations & Definitions WGD Whole Genome Duplication Ohnolog(s) Gene(s) retained from Whole Genome Duplications 2R-WGD Two Rounds of Whole Genome Duplication in Vertebrate Ancestor 3R-WGD Third round of Whole