RNA Splice Sites Classification Using Convolutional Neural Network Models

Thanyathorn Thanapattheerakul
School of Information Technology, King Mongkut’s University of Technology Thonburi, Bangkok, Thailand
[email protected]

Worrawat Engchuan
The Centre for Applied Genomics, Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
[email protected]

Daniele Merico
Molecular Diagnostics, Deep Genomics, Toronto, Ontario, Canada
[email protected]

Narumol Doungpan
Faculty of Engineering, King Mongkut’s University of Technology Thonburi, Bangkok, Thailand
[email protected]

Kiyota Hashimoto
Faculty of Technology and Environment, Prince of Songkla University, Phuket, Thailand
[email protected]

Jonathan H. Chan
School of Information Technology, King Mongkut’s University of Technology Thonburi, Bangkok, Thailand
[email protected]

Abstract—RNA splicing refers to the elimination of non-coding regions from transcribed pre-messenger ribonucleic acid (RNA). Identifying splice sites is an essential step that can be used to gain novel insights into alternative splicing as well as splicing defects, which can potentially cause protein malfunction resulting from mutations at splice sites. In this work, we propose a data preprocessing step applied to RNA sequences and models leveraging Convolutional Neural Networks (CNN). The preprocessing step includes reducing sequence

completely make it lose its function. Alternative splicing can produce different functional proteins, which could lead to abnormal states in humans [3].

Many studies have proposed models to recognize splice sites in order to reveal which splice sites contain a mutation that may cause a splicing error. One common method for recognizing binding sites in motif sequences is the Position Weight Matrix (PWM).
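To make the PWM idea concrete, the following is a minimal sketch and not the method used in this paper or in any cited study. It assumes a set of aligned, equal-length example sites, a uniform background frequency of 0.25 per base, and a pseudocount of 1; the function names `build_pwm` and `score` and the toy sequences are hypothetical and chosen only for illustration.

```python
import math

ALPHABET = "ACGU"  # RNA bases


def build_pwm(aligned_sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from aligned, equal-length sites.

    Each column is a dict mapping a base to
    log2(observed frequency / background frequency).
    The pseudocount and uniform background are illustrative assumptions.
    """
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")

    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in ALPHABET}
        for site in aligned_sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {base: math.log2((counts[base] / total) / background)
             for base in ALPHABET}
        )
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores of a candidate window."""
    if len(window) != len(pwm):
        raise ValueError("window length must match PWM length")
    return sum(column[base] for column, base in zip(pwm, window))


# Toy, made-up training sites (not real data).
toy_sites = ["GUAAGU", "GUGAGU", "GUAAGA", "GUAAGU"]
pwm = build_pwm(toy_sites)

consensus_score = score(pwm, "GUAAGU")  # resembles the training sites
unrelated_score = score(pwm, "CCACCA")  # does not resemble them
print(consensus_score > unrelated_score)
```

A window that resembles the training sites receives a higher summed log-odds score than an unrelated one. In practice a threshold on this score would be used to call candidate sites. Because a PWM scores each position independently, it cannot capture dependencies between positions.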