FUNCTIONAL ASSESSMENT OF AMINO ACID VARIATION IN HUMAN GENOMES
A Dissertation Presented to The Academic Faculty
By
Thanawadee Preeprem
In Partial Fulfillment Of the Requirements for the Degree Doctor of Philosophy in Bioinformatics
Georgia Institute of Technology
May 2014
Copyright © Thanawadee Preeprem 2014
FUNCTIONAL ASSESSMENT OF AMINO ACID VARIATION IN HUMAN GENOMES
Approved by:
Dr. Greg Gibson, Advisor Dr. Fredrik O. Vannberg School of Biology School of Biology Georgia Institute of Technology Georgia Institute of Technology Dr. Stephen C. Harvey Dr. May D. Wang School of Biology Dept. of Biomedical Engineering Georgia Institute of Technology Georgia Institute of Technology Dr. I. King Jordan School of Biology Georgia Institute of Technology Date Approved: March 27, 2014
To my parents
ACKNOWLEDGEMENTS
I have been fortunate and grateful to have Dr. Greg Gibson as my advisor. I joined his lab
late in my graduate life and he provided me an excellent research environment that I can discuss my ideas with him enthusiastically. I appreciate very much of his valuable scientific advice and his great help that has made my dissertation possible.
I also would like to thank Dr. Stephen C. Harvey, my former advisor, for his help in my personal and academic development. I am grateful for his teachings and trainings that have made me an independent and very capable researcher.
The mentorship from my both advisors has contributed a lot to my success.
I appreciate the guidance, kind advice and feedback from my committee members: Dr.
Greg Gibson, Dr. Stephen C. Harvey, Dr. I. King Jordan, Dr. Fredrik O. Vannberg, and
Dr. May D. Wang. Their encouragements also boost up the confident in what I do. All of their advices will continue to motivate me during my future academic career.
During my teaching assistantship, I had privileges to work with three distinct professors:
Dr. Mark Borodovsky, Dr. Ingeborg Schmidt-Krey, and Dr. Mirjana Brockett. Working with them helped me improve a great deal of understanding of academic life. I enjoyed teaching and was delighted to receive great feedbacks from these role models. The experience I gained had sharpened my teaching skills till these days.
I want to thank all of the past and present members of the Harvey’s and Gibson’s labs for their friendship and collaborations. My special thanks go to Minmin Pan, Burak Boz and,
iii
Jared Gossett from the Harvey’s lab for everything they shared with me in my early
Ph.D. career.
In addition to those individuals, I also have come across many more people during my
graduate study. I acknowledge their lively and inspiring scientific conversations which keep my fascination for science alive.
I am very pleased to have Ziming Zhao, Jianrong Wang, Burak Boz, Jared Gossett, and
Shefaet Rahman as friends who are always with me in every aspect of my ups and downs.
I appreciate my parents, sister and brother for their love, support, help, and patient. For me, pursuing a Ph.D. is quite a difficult and a lonely path. My family’s belief in me always keeps me focused and be positive.
iv
TABLE OF CONTENTS
ACKNOWLEDGEMENTS ...... iii
LIST OF TABLES ...... xi
LIST OF FIGURES ...... xiv
LIST OF ABBREVATIONS ...... xvi
SUMMARY ...... xviii
CHAPTER 1: INTRODUCTION ...... 1
Genomic variations of human genomes ...... 1
Current challenges in genome studies...... 2
Roles of bioinformatics for evaluating genome variants ...... 3
Limitations of sequence conservation-based approaches for predicting variant
deleteriousness ...... 4
Integrative analysis of variant effects: a novel approach for personal genome
interpretations ...... 4
Description of this dissertation ...... 6
CHAPTER 2: AN ASSOCIATION-ADJUSTED CONSENSUS DELETERIOUS
SCHEME TO CLASSIFY HOMOZYGOUS MISSENSE MUTATIONS FOR
PERSONAL GENOME INTERPRETATION [40] ……………..…………………… 9
v
Abstract ...... 9
Background ...... 10
Methods...... 13
Whole genome sequence dataset ...... 13
Sequence annotation using published algorithms ...... 17
Supervised and automated structure-based predictions of variant function ...... 21
Assessment of functional enrichment ...... 23
Results and discussion ...... 24
Sequence-based variant description in 12 genomes...... 24
Association-adjusted consensus deleterious scheme (AACDS) for variant
classification ...... 28
Supervised structure-based variant evaluation...... 33
Automated structure-based variant evaluation...... 42
Enrichment for mutations disrupting protein interactions ...... 43
Conclusions ...... 45
Acknowledgements ...... 46
CHAPTER 3: AACDS—A DATABASE FOR PERSONAL GENOME
INTERPRETATION [102] ...... 47
Abstract ...... 47
Background ...... 48
vi
Construction and content ...... 51
Data sources ...... 51
Database construction ...... 52
Utility and Discussion ...... 56
Performing the search via a single entry query ...... 58
Performing the search using batch query ...... 59
Conclusions ...... 60
Availability and requirements ...... 61
Acknowledgments...... 61
CHAPTER 4: SDS, A STRUCTURAL DISRUPTION SCORE FOR
ASSESSMENT OF MISSENSE VARIANT DELETERIOUSNESS [110] ...... 62
Abstract ...... 62
Introduction ...... 63
Materials and Methods ...... 65
Genomic dataset and candidate protein sequences ...... 67
Gene and variant annotations ...... 68
Protein structure dataset ...... 69
Inferring variant deleteriousness from sequence-based predictors ...... 71
Additional parameters for sequence-based analysis ...... 72
Inferring variant deleteriousness from structure-based predictors...... 72
vii
Additional parameters for structure-based analysis ...... 76
Statistical comparison of positive and negative SNPs ...... 76
Assigning a structural disruption score to candidate epilepsy variants ...... 82
Results ...... 83
Candidate gene and variant annotations ...... 83
Statistically significant differences between positive and negative SNPs ...... 84
Structural features predict deleteriousness of case SNPs ...... 86
Individual assessment of putative structurally deleterious variants ...... 94
Structural disruption score correlates with sequence-based deleterious score ...... 97
Discussion ...... 99
Current perspectives in prioritization of epilepsy variants ...... 99
Key findings ...... 100
Study limitations ...... 102
Study innovations...... 103
Conclusion ...... 104
Acknowledgments...... 105
CHAPTER 5: SYSTEMATIC 3D SCREENING OF AMINO ACID MUTATIONS
IN PHARMACOGENES ...... 106
Abstract ...... 106
Introduction ...... 107
viii
Methods...... 113
Dataset of 48 Very Important Pharmacogenes (VIPs) ...... 113
Genomic variation dataset...... 114
Protein structure dataset ...... 117
Identification of conserved protein domain families ...... 119
Characteristics of amino acid mutations at different protein regions ...... 120
Statistical comparison of disrupted protein residues and the implementation of SDS
Pharmacogene for predicting the structural significance of a variant ...... 127
Evaluation of conventional deleterious classifiers and protein characters for
assessing amino acid variants in pharmacogenes ...... 131
Results and discussion ...... 133
Location of missense variants in protein structures ...... 133
Physicochemical change of amino acids at different protein regions ...... 135
Implementation of systematic 3D screening for structure-function relationships of
amino acid mutations ...... 137
Implementation of SDS Pharmacogenes for predicting structural significance of a
variant ...... 139
Structural characteristics of functional variants ...... 140
An exceptional case ...... 142
Performance of conservation-based predictors and SDS for assessing amino acid
variants in pharmacogenes ...... 144
ix
Limitations of pharmacogenomics studies...... 150
Conclusions ...... 151
Acknowledgments...... 152
CHAPTER 6: CONCLUSIONS ...... 153
Sequencing technology development and the outlook for genomic variant analysis pipelines ...... 153
Complexities and future directions for interpreting amino acid variants ...... 156
Summary of this dissertation ...... 158
APPENDIX A: SUPPLEMENTARY INFORMATION FOR CHAPTER 2 ...... 164
APPENDIX B: SUPPLEMENTARY INFORMATION FOR CHAPTER 5 ...... 186
REFERENCES ...... 200
x
LIST OF TABLES
Table 2.1: Summary of genetic variations in genome sequences of 12 individuals...... 16
Table 2.2: AACDS classification of homozygous nsSNPs in 12 genomes...... 19
Table 2.3: Functional annotation of all homozygous nsSNP...... 25
Table 2.4: List of all four known disease-causal variants and six probable pathogenic variants...... 30
Table 3.1: Descriptions of the 12 combined AACDSS classes...... 54
Table 4.1: Number of variants within each gene set, classified into three classes (cases, negative controls, and positive controls), and numbers of 3D structures used in the analysis...... 71
Table 4.2: Categories for structural indicators, cutoff values for continuous numerical parameters, and number of SNPs with extreme measures...... 77
Table 4.3: T-test statistics for gene sets 1 and 2...... 78
Table 4.4: Fisher’s exact test statistics for gene sets 1 and 2...... 81
Table 4.5: Case SNPs with high structural disruption scores...... 89
Table 4.6: Summary of structural disrupted case SNPs...... 91
Table 4.7: Step-wise analysis for correlation of SDS and Condel score...... 98
Table 5.1: List of 48 VIPs and number of their pharmacogenomics associated variants.
...... 109
Table 5.2: Statistics of genomic variability data of the 48 VIPs and 3D structure maps.
...... 118
xi
Table 5.3: List of computational tools used to analyze structurally-related attributes for each amino acid residue...... 124
Table 5.4: Fisher’s exact test statistics for enriched structural features present in functional mutations and the selection for strongest predictive features for structural disturbance (Structural Disruption Index)...... 129
Table A.1: List of protein structures used in supervised structural analysis...... 171
Table A.2: List of protein structures used in automated structural analysis...... 172
Table A.3: List of all private variants in the 12 genomes...... 173
Table A.4: List of all Categories 2A/2B variants affecting the same gene in more than one individual...... 176
Table A.5: List of all Category 2B variants...... 178
Table A.6: List of all Category 3B variants...... 179
Table A.7: List of all Category 4 Variants...... 181
Table A.8: Summary of automated structural analysis...... 183
Table A.9: List of X-linked recessive mutations...... 184
Table B.1: List of molecular functions and the numbers of drug partners for the 48 VIPs.
...... 186
Table B.2: A list of selected protein 3D structures, their data sources and the quality parameters...... 188
Table B.3: Complete list of genomic variability data of the 48 VIPs and the statistics of
3D structure maps...... 191
Table B.4: Complete list of domain names and relative abundances of functional and
neutral variants...... 193
xii
Table B.5: Complete statistics of Fisher’s exact test for enriched structural features present in functional and neutral mutations...... 196
xiii
LIST OF FIGURES
Figure 1.1: Types and number of genetic variations found per a human genome and numbers of known nsSNPs that are associated with diseases……………………………. 1
Figure 2.1: Flow diagram for AACDS classification algorithm………………………... 12
Figure 2.2: Number of deleterious and conserved site predictions……………………... 26
Figure 2.3: Cumulative distribution plots for the six deleterious prediction scores……. 27
Figure 2.4: Distribution of homozygous nsSNPs by sequence type……………………. 35
Figure 2.5: Location of variants in six protein structures………………………………. 37
Figure 3.1: AACDS database schema. AACDS database constructs its data relationships from several sources……………………………………………………………………. 53
Figure 3.2: Number of nsSNPs within each AACDS category………………………… 55
Figure 3.3: Overview of the AACDS web interface……………………………………. 57
Figure 4.1: Flow diagram of the analysis pipeline……………………………………… 66
Figure 4.2: Density plots of six deleterious scores for Case, Neutral and Causal SNPs... 79
Figure 4.3: Density plots for relative solvent accessibility and free energy change for
Case, Neutral and Causal SNPs………………………………………………………… 80
Figure 4.4: 3D structures for the nine high-priority variants for epilepsy……………… 93
Figure 5.1: Relative abundance of functional and neutral variants across conserved protein domain families……………………………………………………………….. 134
Figure 5.2: Performance of conservation-based predictors across the three types of mutations in 48 VIPs…………………………………………………………………... 146
xiv
Figure 5.3: Percentages of variants with respect to their scores for structural significance
(Structural Disruption Score; SDS) and the consensus scores for conservation-based deleteriousness (Deleterious count)…………………………………………………… 147
Figure 6.1: Variant analysis and filtration in whole genome sequencing……………… 154
Figure 6.2: Tools for analyzing SNPs form different genome regions………………... 155
xv
LIST OF ABBREVATIONS