Interruptional Activity and Simulation of Transposable Elements

Interruptional Activity and Simulation of Transposable Elements A Thesis Submitted to the College of Graduate and Postdoctoral Studies in Partial Fulfillment of the Requirements for the degree of Doctor of Philosophy in the Department of Computer Science University of Saskatchewan Saskatoon By Lingling Jin c Lingling Jin, August 2017. All rights reserved. Permission to Use In presenting this thesis in partial fulfilment of the requirements for a Postgraduate degree from the University of Saskatchewan, I agree that the Libraries of this University may make it freely available for inspection. I further agree that permission for copying of this thesis in any manner, in whole or in part, for scholarly purposes may be granted by the professor or professors who supervised my thesis work or, in their absence, by the Head of the Department or the Dean of the College in which my thesis work was done. It is understood that any copying or publication or use of this thesis or parts thereof for financial gain shall not be allowed without my written permission. It is also understood that due recognition shall be given to me and to the University of Saskatchewan in any scholarly use which may be made of any material in my thesis. Requests for permission to copy or to make other use of material in this thesis in whole or part should be addressed to: Head of the Department of Computer Science 176 Thorvaldson Building 110 Science Place University of Saskatchewan Saskatoon, Saskatchewan Canada S7N 5C9 i Abstract Transposable elements (TEs) are interspersed DNA sequences that can move or copy to new positions within a genome. The active TEs along with the remnants of many transposition events over millions of years constitute 46.69% of the human genome. TEs are believed to promote speciation and their activities play a significant role in human disease. The 22 AluY and 6 AluS TE subfamilies have been the most active TEs in recent human history, whose transposition has been implicated in several inherited human diseases and in various forms of cancer by integrating into genes. Therefore, understanding the transposition activities is very important. Recently, there has been some work done to quantify the activity levels of active Alu transposable elements based on variation in the sequence. Here, given this activity data, an analysis of TE activity based on the position of mutations is conducted. Two different methods/simulations are created to computationally predict so-called harmful mutation regions in the consensus sequence of a TE; that is, mutations that occur in these regions decrease the transposition activities dramatically. The methods are applied to AluY, the youngest and most active Alu subfamily, to identify the harmful regions laying in its consensus, and verifications are presented using the activity of AluY elements and the secondary structure of the AluYa5 RNA, providing evidence that the method is successfully identifying harmful mutation regions. A supplementary simulation also shows that the identified harmful regions covering the AluYa5 RNA functional regions are not occurring by chance. Therefore, mutations within the harmful regions alter the mobile activity levels of active AluY elements. One of the methods is then applied to two additional TE families: the Alu family and L1 family, in detecting the harmful regions in these elements computationally. Understanding and predicting the evolution of these TEs is of interest in understanding their powerful evolutionary force in shaping their host genomes. In this thesis, a formal model of TE fragments and their interruptions is devised that provides definitions that are compatible with biological nomenclature, while still providing a suitable formal foundation for computational analysis. Essentially, this model is used for fixing terminology that was misleading in the literature, and it helps to describe further TE problems in a precise way. Indeed, later chapters include two other models built on top of this model: the sequential interruption model and the recursive interruption model, both used to analyze their activity throughout evolution. The sequential interruption model is defined between TEs that occur in a genomic sequence to estimate how often TEs interrupt other TEs, which has been shown to be useful in predicting their ages and their activity throughout evolution. Here, this prediction from the sequential interruptions is shown to be closely related to a classic matrix optimization problem: the Linear Ordering Problem (LOP). By applying a well-studied method of solving the LOP, Tabu search, to the sequential interruption model, a relative age order of all ii TEs in the human genome is predicted from a single genome. A comparison of the TE ordering between Tabu search and the method used in [47] shows that Tabu search solves the TE problem exceedingly more efficiently, while it still achieves a more accurate result. As a result of the improved efficiency, a prediction on all human TEs is constructed, whereas it was previously only predicted for a minority fraction of the set of the human TEs. When many insertions occurred throughout the evolution of a genomic sequence, the interruptions nest in a recursive pattern. The nested TEs are very helpful in revealing the age of the TEs, but cannot be fully represented by the sequential interruption model. In the recursive interruption model, a specific context- free grammar is defined, describing a general and simple way to capture the recursive nature in which TEs nest themselves into other TEs. Then, each production of the context-free grammar is associated with a probability to convert the context-free grammar into a stochastic context-free grammar that maximizes the applications of the productions corresponding to TE interruptions. A modified version of an algorithm to parse context-free grammars, the CYK algorithm, that takes into account these probabilities is then used to find the most likely parse tree(s) predicting the TE nesting in an efficient fashion. The recursive interruption model produces small parse trees representing local TE interruptions in a genome. These parse trees are a natural way of grouping TE fragments in a genomic sequence together to form interruptions. Next, some tree adjustment operations are given to simplify these parse trees and obtain more standard evolutionary trees. Then an overall TE-interaction network is created by merging these standard evolutionary trees into a weighted directed graph. This TE-interaction network is a rich representation of the predicted interactions between all TEs throughout evolution and is a powerful tool to predict the insertion evolution of these TEs. It is applied to the human genome, but can be easily applied to other genomes. Furthermore, it can also be applied to multiple related genomes where common TEs exist in order to study the interactions between TEs and the genomes. Lastly, a simulation of TE transpositions throughout evolution is developed. This is especially helpful in understanding the dynamics of how TEs evolve and impact their host genomes. Also, it is used as a verification technique for the previous theoretical models in the thesis. By feeding the simulated TE remnants and activity data into the theoretical models, a relative age order is predicted using the sequential interruption model, and a quantified correlation between this predicted order and the input age order in the simulation can be calculated. Then, a TE-interaction network is constructed using the recursive interruption model on the simulated data, which can also be converted into a linear age order by feeding the adjacency matrix of the network to Tabu search. Another correlation is calculated between the predicted age order from the recursive interruption model and the input age order. An average correlation of ten simulations is calculated for each model, which suggests that in general, the recursive interruption model performs better than the sequential interruption model in predicting a correct relative age order of TEs. Indeed, the recursive interruption model achieves an average correlation value of ρ = 0:939 with the correct simulated answer. iii Acknowledgements First and foremost, I would like to express my sincere gratitude to Dr. Ian McQuillan, my Master's and Ph.D. supervisor, for his incredible amount of guidance and support in overcoming numerous difficulties that I have faced through my research. Special thanks to Dr. Longhai Li, my cognate committee member, who has provided me with lots of ideas and guidance on statistical aspects of my thesis. I am also grateful to the other members of my advisory committee: Dr. Tony Kusalik, Dr. Mark Keil, and Dr. Michael Domaratzki for their invaluable suggestions and comments. I want to acknowledge my colleagues in the Bioinformatics lab and my friends for their friendship and helpful discussions. I am grateful to my husband, Shi, and my children, April and Avery, for being the love and happiness of my life. Very special thanks goes to my husband for helping me with the implementation and graphic design in the thesis. Last but not the least, I would like to acknowledge my parents, Xin and Zhimei, and my parents-in-law, Zhendong and Shuping. Thank you for all your love and support. iv Contents Permission to Use i Abstract ii Acknowledgements iv Contents v List of Tables vii List of Figures viii List of Abbreviations x 1 Introduction and objectives 1 1.1 Motivations . 1 1.1.1 The impact of TEs on genomes . 2 1.1.2 Transposable elements causing human diseases . 3 1.1.3 Perspective . 5 1.2 Objectives and layout of the thesis . 5 2 Background 8 2.1 Biological background . 10 2.1.1 Classification of transposable elements . 10 2.1.2 Structural features of TE sequences . 12 2.2 Bioinformatics background . 19 2.2.1 Sequence alignment . 19 2.2.2 Repbase Update | the database of repetitive sequences .

Load more