Characterization of Gene Expression Patterns in the Wild Pacific Salmon
by Evan Morien
B.Sc., State University of New York at Buffalo, 2007

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
December 2012
© Evan Morien 2012

Abstract

Declines in Pacific salmon stocks in recent decades have spurred much research into their physiology and survivorship, but comparatively little into their genomics. Sockeye salmon in particular are experiencing high levels of mortality during their migration upriver, and the numbers of returning sockeye have fluctuated wildly with respect to predictions in recent years. The goal of my project is to gain insight into the basic genomics of Pacific salmon stocks, including the sockeye, through bioinformatic approaches to gene expression profiling. Using microarray technology, I have conducted a large-scale analysis of over 1,000 samples from multiple tissues, stocks, and species of salmon. I identified tissue-specific and housekeeping genes and compared them to orthologs in mouse and human, respectively. I have also classified a number of microarray samples with a support vector machine (SVM) using qPCR data showing the presence of several common pathogens affecting Pacific salmon populations. Using identified housekeeping genes as normalizing factors, I modeled in silico a qPCR assay designed to identify salmon as infected or uninfected with a particular pathogen. With these data I hope to increase basic knowledge of the genomics of the Pacific salmon.

Preface

Contributions

The data used in this thesis were generated at the Department of Fisheries and Oceans Pacific Biological Station in Nanaimo, British Columbia by Kristina M. Miller, Ph.D., Molecular Genetics Section Head, and members of her lab. Shaorong Li is responsible for the generation of the qPCR data. Angela D.
Schulze and Karia H. Kaukinen are responsible for the generation of the microarray data. Paul Pavlidis, Ph.D., Associate Professor, University of British Columbia, supervised and directed my contributions to the project. I conducted the computational research detailed below, and wrote this thesis.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
1 Introduction
1.1 Consider the Salmon
1.2 FishManOmics - Genomic Tools for Fisheries Management
1.3 Thesis Overview
1.4 Literature Review
1.4.1 Genomics Research on Salmonids
1.4.2 Genomics Research on Non-Salmonid Teleost Fish
1.4.3 Housekeeping and Tissue-Specific Gene Research in Other Organisms
1.4.4 Pathogen Types and Host-Pathogen Interactions
1.4.5 Supervised Machine Learning
1.4.6 Using ROC Curves and AUC Values
2 Methods
2.1 Biological Methods
2.1.1 Introduction
2.1.2 Data Collection
2.1.3 Array Sample Preparation
2.1.4 Arrays
2.1.5 qRT-PCR
2.2 Computational Methods
2.2.1 Array Data Preparation
2.2.2 Identifying Housekeeping Genes
2.2.3 Identifying Tissue-Specific Genes
2.2.4 Comparison of TS Genes to Mouse Counterparts
2.2.5 qRT-PCR Data Prep
2.2.6 Functional Enrichment Analysis
2.2.7 SVM Classification
2.2.8 Using Control Probes to Normalize Data
3 Results
3.1 Housekeeping Gene Identification
3.2 Comparison of Housekeeping Genes to Human Counterparts
3.3 Tissue Specific/Selective Gene Identification
3.4 Comparison of TS Genes to Murine Counterparts
3.5 Functional Enrichment of TS Genes
3.6 SVM Classification and Prediction of Pathogen States
3.6.1 Functional Enrichment Analysis of Pathogen-specific Probes
3.6.2 Differences Between Pathogen Datasets
3.6.3 Classification of Fish Infected with Parvovirus Alone, or No Pathogens
3.6.4 Fish Infected with Multiple Pathogens
3.7 Using Control Probes to Normalize Data
4 Discussion
4.1 What is an appropriate reference gene?
4.2 Comparison of Salmon to Mouse and Human Genes
4.2.1 Biological Implications of Overlap
4.2.2 Functional Enrichment of TS Genes
4.3 SVM Classification
4.4 Functional Enrichment of Pathogen-Specific Signals, and Infection-Type-Specific Responses
4.5 Using Control Genes to Normalize Data
4.6 Conclusion
Bibliography
Appendices
A Reference Gene Candidates
B Tissue-Specific Probes
B.1 Brain-Specific Probes
B.2 Gill-Specific Probes
B.3 Liver-Specific Probes
B.4 Muscle-Specific Probes
C Tissue-Specific Functional Enrichment
C.1 Brain-Specific Enrichment
C.2 Gill-Specific Enrichment
C.3 Liver-Specific Enrichment
C.4 Muscle-Specific Enrichment
D Pathogen-Specific Functional Enrichment
D.1 Microsporidia-Specific Enrichment
D.2 Gill Chlamydia-Specific Enrichment
D.3 Parvovirus-Specific Enrichment

List of Tables

2.1 Tissue and species data sources. Fish were harvested from 2007 to 2010 in the Fraser River, Strait of Georgia, and surrounding ocean waters. All fish are smolts (juveniles).
3.1 Expected and observed overlap between the human housekeeping and salmon housekeeping genes.
3.2 Expected vs observed overlap between mouse and salmon tissue-specific genes. p-values calculated using the hypergeometric test.
3.3 Microsporidia training AUC values using the full dataset of 141 gill samples.
3.4 Training and test AUC values along with the ratio of support vectors (SVs) to total vectors (TVs) for microsporidia data.
3.5 Training and test AUC values along with the ratio of support vectors (SVs) to total vectors (TVs) for gill chlamydia data.
3.6 Classification performance using gill chlamydia pathogen qPCR data with linear and radial basis kernel functions.
3.7 Numbers of significantly differentially expressed probes and genes for fish infected with each pathogen alone and fish infected with all pathogens compared to pathogen-free fish.
3.8 Training AUC values along with the ratio of support vectors (SVs) to total vectors (TVs) for parvovirus-only data.
3.9 Training AUC values along with the ratio of support vectors (SVs) to total vectors (TVs) for classification of microsporidia-only fish, fish infected with more than one pathogen, fish infected with all three pathogens, and gill chlamydia-only fish.
3.10 Full dataset microsporidia training AUCs from non-normalized (NN) and quantile-normalized (QN) data that has been control corrected (CC) using the ten control probes selected. Alongside are shown the ratios of support vectors to total vectors (SVs/TVs) for each of the four sets of results.

List of Figures

2.1 Binarized qPCR data for three pathogens in gill samples. The data has been sorted in ascending order according to the number of positive data points present. Dark boxes indicate the presence of a pathogen, light boxes indicate no pathogen was detected. This was the only pathogen data used.
3.1 Heatmap showing tissue specific/selective probes found in each of the four tissues assayed. Expression level increases from orange to white. The 1,086 samples are organized in tissue groups, while probes are sorted by p-value. p-values calculated using t-tests for differential expression between tissues. 3,238 probes are represented, corresponding to 977 genes.
Numeric labels show the number of unique genes associated with each group of TS probes.
3.2 The degree of overlap between murine and salmonid TS genes in brain, liver, and muscle.
3.3 Training and test ROC curves from classification of fish using microsporidia labels.
3.4 Training ROC curves from gill chlamydia (top) at 15 features and parvovirus (bottom) at 50 features.
3.5 Test ROC curves from gill chlamydia (top) at 15 features and parvovirus (bottom) at 50 features.
3.6 CT values for fish testing positive for microsporidia, gill chlamydia, and parvovirus. Red points represent individual CT values. Low CT values indicate the presence of more pathogen DNA in the qPCR sample. A CT value of 38 indicates that there was no pathogen DNA present in the sample.
3.7 Histograms of CT values for microsporidia (top left), gill chlamydia (top right), and parvovirus (bottom).
3.8 CT values for fish testing positive for microsporidia, gill chlamydia, and parvovirus. Red points represent individual CT values. Low CT values indicate the presence of more pathogen DNA in the qPCR sample. A CT value of 38 indicates that there was no pathogen DNA present in the sample. True positive fish (n=58, mean CT=28.3) are those which are identified correctly as infected by the classifier, and false negative fish (n=12, mean CT=29.7) are those which are infected but are misclassified as uninfected. Remaining fish are either false positives or true negatives and have scores of 38 (not shown).
3.9 Heatmap showing the expression of the nine control probes alongside the expression from selected experimental probes.

List of Abbreviations

AUC - area under the (ROC) curve
cDNA - complementary DNA
cGRASP - Consortium for Genomic Research on All Salmonids Project
DFO - Department of Fisheries and Oceans
FC - fold change
q(RT)PCR - quantitative (real-time) polymerase chain reaction
ROC - receiver operating characteristic
SVM - support vector machine
TS - tissue-specific and tissue-selective

Acknowledgements

I wish to thank Paul Pavlidis, my supervisor, for his guidance, advice, and patience throughout my time in the lab; Kristina M.