Genome Annotation of Pogonomyrmex californicus
Westphalian Wilhelms-University Münster

Master's Thesis

Genome Annotation of Pogonomyrmex californicus

Author: Jonas Bohn
Examiners: Prof. Dr. Juergen Gadau, Prof. Dr. Wojciech Makalowski

A thesis submitted in fulfillment of the requirements for the degree of Master of Science at the Institute of Bioinformatics

April 2019

Declaration of Academic Integrity

I hereby confirm that this thesis on Genome Annotation of Pogonomyrmex californicus is solely my own work and that I have used no sources or aids other than the ones stated. All passages in my thesis for which other sources, including electronic media, have been used, be it direct quotes or content references, have been acknowledged as such and the sources cited.

(date and signature of student)

I agree to have my thesis checked in order to rule out potential similarities with other works and to have my thesis stored in a database for this purpose.

(date and signature of student)

Acknowledgements

I thank the whole team of Prof. Gadau and Prof. Makalowski for the excellent support and the friendly interaction. Thank you for always providing a contact person who was able to help me solve problems. I want to thank PhD student Reza Halabian for his very helpful support and educational conversations. In addition, I would like to thank the computer specialist Norbert Grundmann for his persistent and instructive support with several PC problems. I dedicate this thesis to my mother and my deceased father, who always supported me.

Abstract

Background

Over the last four decades, sequencing technologies have developed through three generations, driven by changes in the scope of genome projects. To analyze the increasing volume of data, new genome annotation methods continue to be developed. Ants such as Pogonomyrmex californicus belong to a very divergent subfamily of insects. Unfortunately, only a few genomes from Myrmicinae have been sequenced so far.
In order to increase the genetic knowledge about ants from this subfamily, P. californicus was annotated from a high-quality draft genome assembly.

Results

The analysis of the P. californicus sequence data identified 394,064 repeats, of which 30,292 are unclassified. These repeats mask about 22 % of the whole genome assembly. The structural annotation detected 22,844 unique proteins, which are encoded by 23,874 transcripts. About 9,000 proteins have detected Pfam domains. Of these, about 8,500 additionally have InterPro domains, and of those, about 5,500 additionally have gene ontology annotations. A function was detected for 49 % of the characterized unique proteins. In total, 1,572 ncRNAs were detected. These include 89 rRNA and 1,146 tRNA genes, the latter coding for 20 amino acids. Additionally, 182 pseudogenes, 21 undetermined genes, and one suppressor gene were identified in this pool of RNAs.

Conclusion

This thesis resulted in an annotation-directed improvement of the high-quality draft genome assembly. A large number of duplications was registered in the genome assembly, which led to a larger assembly than expected. Additional sequence data as well as further processing of the assemblies would lead to better quality. As about 56 % of the proteins were identified, detecting the functions of the predictions and manually annotating uncharacterized high-confidence proteins are required for a more complete annotation. However, the agreement of the described protein functions with published proteins for P. californicus, as well as the high number of identified high-confidence proteins, indicates good annotation quality. Improving the genome and transcript assemblies is necessary in order to complete the annotation and eliminate potential sources of error.
Contents

Declaration of Authorship
Acknowledgements
Abstract
Contents
List of Figures
List of Tables
Abbreviations

1 Introduction
  1.1 DNA Sequencing
    1.1.1 Next generation Sequencing (NGS)
    1.1.2 Third Generation Sequencing (TGS)
  1.2 Genome annotation
    1.2.1 Quality assessment
    1.2.2 Repeat identification and annotation
    1.2.3 Structural genome annotation
    1.2.4 Measuring the accuracy of gene prediction and gene annotation
    1.2.5 Functional genome annotation
  1.3 Pogonomyrmex californicus
2 Material and Methods
  2.1 Material
    2.1.1 genomic DNA data
    2.1.2 transcriptomic RNA data
  2.2 Methods
    2.2.1 Transcript assembly construction
    2.2.2 Genome annotation
3 Results
  3.1 Transcript assembly construction
  3.2 Repeat annotation
  3.3 Structural annotation
    3.3.1 GeneModelMapper (GeMoMa)
    3.3.2 MAKER
  3.4 Functional annotation
  3.5 non-coding RNA annotation
4 Discussion
  4.1 gDNA sequencing
  4.2 Genome assembly
  4.3 Transcript assembly
  4.4 Repeat annotation
  4.5 Structural annotation
  4.6 Functional annotation
5 Conclusion
6 Availability

A Quality of gDNA-sequencing
B Quality of RNA-sequencing
C Detection of duplicates
D Programs used for the Genome annotation
E Data from relative species
F Assessment of Annotation
G Repeat annotation
H Functional annotation

Bibliography

List of Figures

1 Sanger sequencing principle
2 Genome cost development
3 Development of sequencing
4 Gene prediction and gene annotation
5 Gene prediction statistic
6 Genome size estimations of ants
7 Cladogram of relative species
8 Karyotypes of P. californicus
9 Genome annotation work flow
10 Transcript assembly construction
11 GeMoMa annotation work flow
12 MAKER annotation procedure
13 AED distribution and functional classification
14 AED distribution of unknown proteins
15 Genome assembly sequence length distribution
16 BUSCO analysis of different P. californicus genome assembly versions
17 Missing genes in annotation
18 Source species of functional annotation
19 10x sequencing quality per base for forward reads
20 10x sequencing quality per base for reverse reads
21 RNA sequencing quality per base from forward RNA read done by Helmkampf et al. (2016)
22 RNA sequencing quality per base from reverse RNA read done by Helmkampf et al. (2016)
23 Duplicated transcripts in the genome
24 Assembly self-comparison
25 Similarity distribution for protein predictions
26 BUSCO assessment on genome assemblies
27 BUSCO assessment on transcripts
28 BUSCO assessment on proteins
29 tRNA classification distribution

List of Tables

2 Quality parameter of P. californicus genome assemblies
3 Classification of collected ants from Helmkampf et al. (2016)
4 Summary of transcript assemblies
5 Repeat annotation summary
6 Summary of GeMoMa predictions
7 Summary of MAKER predictions
8 Summary of functional predictions
9 structural ncRNA predictions
10 Regulatory ncRNA predictions
11 Signals from ncRNA
12 DOGMA result summary
13 Summary of used programs
14 Summary of data sources from used relative species
15 Summary of genome assembly quality values
16 Summary of redundancy in published proteins and transcripts of relative species. This data is based on data sources from table 14
17 Comparison of genome assemblies
18 Ortho-DB v. 9: Species in Hymenoptera DB
19 RepeatMasker result table from repeat annotation
20 Comparison of functional annotations from P. californicus. Public GenBank annotations and functional annotation of this thesis were compared.
Abbreviations

General:
cDNA    complementary DeoxyriboNucleic Acid
gDNA    genomic DNA
mRNA    messenger RiboNucleic Acid
ncRNA   non-coding RNA
miRNA   micro RNA
snRNA   small nuclear RNA
snRNP   small nuclear RiboNucleoprotein Particles
snoRNA  small nucleolar RNA
tRNA    transfer RNA
rRNA    ribosomal RNA
dNTP    deoxyriboNucleoside TriPhosphate
ddNTP