Ensemble of Bidirectional Recurrent Networks and Random Forests for Protein Secondary Structure Prediction
Gabriel Bianchin de Oliveira, Helio Pedrini, Zanoni Dias
Institute of Computing, University of Campinas, Campinas, SP, Brazil
July 2nd, 2020

Agenda
1. Introduction
2. Background
3. Protein Secondary Structure Prediction Method
4. Experiments
5. Conclusions

Introduction
- Proteins are fundamental in biological processes.
- Chemical and physical interactions of attraction and repulsion between amino acids form 3-D structures.
- Analysis of protein secondary structure is crucial for developing applications.
- The cost of determining protein structures by laboratory methods is high.
- Global and local analyses can be used to predict secondary structures.
- We present a method that achieves good results using only the amino acid sequence, and competitive results using both the amino acid sequence and protein sequence similarity information.

Background
- Early methods used statistical concepts.
- In the 1980s, 1990s, and 2000s, machine learning approaches emerged based on sliding windows.
- After high accuracy was achieved on Q3, a finer subclassification, called Q8, was created.
- With deep learning, recurrent and convolutional networks gained ground.

Local Classifiers
- Classifiers with window sizes 3, 5, 7, 9, and 11.
- Ensemble of local classifiers.

Global Classifiers
- Prediction made by analyzing the amino acids in the standard direction and in the inverted direction, with the first-layer outputs of both directions concatenated.
- Bidirectional networks with 2, 3, 4, 5, and 6 layers.
- Ensemble of global classifiers.

Genetic Algorithm
- Creation of an initial population with 2,000 individuals.
- After the initial population is created, the search for weights is divided into two steps.
- First step: the top 100 individuals generate 900 new individuals using crossover, and 1,000 new individuals are created using mutation. This process runs for 100 generations.
- Second step: the top 100 individuals generate 900 new individuals using mutation. This process runs for 100 generations.
- At the end, the best weight vector is chosen.

Datasets
- PDB: proteins of up to 700 amino acids from 2018 PDB files; 20 features from the amino acid sequence; 6,479 proteins for training, 500 for validation, and 500 for testing.
- CB6133: proteins of up to 700 amino acids from PISCES CullPDB with less than 30% similarity; 21 features from the amino acid sequence and 21 sequence similarity profile features; 5,600 proteins for training, 256 for validation, and 272 for testing.
- CB513: proteins from CB6133 with less than 25% similarity for training and validation, and CB513 for testing; 21 features from the amino acid sequence and 21 sequence similarity profile features; 5,278 proteins for training, 256 for validation, and 513 for testing.

Evaluation
- Precision
- Recall
- Q3 Accuracy
- Q8 Accuracy

Training and Testing on PDB
- 81.5% Q3 accuracy using 800 neurons.
- 73.1% Q8 accuracy using 900 neurons.

Training and Testing on CB6133
- Using the amino acid sequence: 59.1% Q8 accuracy using 900 neurons.
- Using the amino acid sequence and sequence similarity: 73.4% Q8 accuracy using 900 neurons.

Results for the CB6133 dataset using amino acid sequence and sequence similarity:

Method                     Q8 Accuracy (%)
Ensemble of methods [1]    76.3
2DConv-BLSTM [2]           75.7
biRNN-CRF [3]              74.8
DeepACLSTM [4]             74.2
CNNH_PSS [5]               74.0
Our method                 73.4
GSN [6]                    72.1

Training on CB6133 and Testing on CB513
- Using the amino acid sequence: 55.8% Q8 accuracy using 500 neurons. Guo et al. [4] obtained 57.1% Q8 accuracy using only sequence information; however, the authors did not provide the precision and recall of each class.
- Using the amino acid sequence and sequence similarity: 68.9% Q8 accuracy using 900 neurons.

Results for the CB513 dataset using amino acid sequence and sequence similarity:

Method                     Q8 Accuracy (%)
Conditioned CNN [7]        71.4
DeepNRN [8]                71.1
biRNN-CRF [3]              70.9
Ensemble of methods [1]    70.9
Our method                 68.9
BLSTM large [9]            67.4
GSN [6]                    66.4
CNF [10]                   63.3
BRNN [11]                  51.1

Conclusions
- The method obtained good results using only the amino acid sequence and achieved competitive results compared to approaches that use both the amino acid sequence and protein sequence similarity.
- Sequence similarity information is important to improve the results, but for large datasets it is not feasible to generate this information.
- Q3 and Q8 accuracy are not the best evaluation metrics because the classes are unbalanced.
- For future work, attention modules and post-processing steps can assist in the classification of less frequent classes.

Acknowledgments

References

[1] Iddo Drori, Isht Dwivedi, Pranav Shrestha, Jeffrey Wan, Yueqi Wang, Yunchu He, Anthony Mazza, Hugh Krogh-Freeman, Dimitri Leggas, and Kendal Sandridge. High quality prediction of protein Q8 secondary structure by diverse neural network architectures. arXiv preprint arXiv:1811.07143, 2018.

[2] Yanbu Guo, Bingyi Wang, Weihua Li, and Bei Yang. Protein secondary structure prediction improved by recurrent neural networks integrated with two-dimensional convolutional neural networks. Journal of Bioinformatics and Computational Biology, 16(5):1850021, 2018.

[3] Alexander Rosenberg Johansen, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Deep recurrent conditional random field network for protein secondary prediction. In 8th Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB), pages 73-78, 2017.

[4] Yanbu Guo, Weihua Li, Bingyi Wang, Huiqing Liu, and Dongming Zhou. DeepACLSTM: deep asymmetric convolutional long short-term memory neural models for protein secondary structure prediction. BMC Bioinformatics, 20(1):341, 2019.

[5] Jiyun Zhou, Hongpeng Wang, Zhishan Zhao, Ruifeng Xu, and Qin Lu. CNNH_PSS: protein 8-class secondary structure prediction by convolutional neural network with highway. BMC Bioinformatics, 19(4):60, 2018.

[6] Jian Zhou and Olga G. Troyanskaya. Deep supervised and convolutional generative stochastic network for protein secondary structure prediction. In 31st International Conference on Machine Learning (ICML), pages 1121-1129, 2014.

[7] Akosua Busia and Navdeep Jaitly. Next-step conditioned deep convolutional neural networks improve protein secondary structure prediction. arXiv preprint arXiv:1702.03865, 2017.

[8] Chao Fang, Yi Shang, and Dong Xu. A new deep neighbor residual network for protein secondary structure prediction. In IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), pages 66-71. IEEE, 2017.

[9] Søren Kaae Sønderby and Ole Winther. Protein secondary structure prediction with long short term memory networks. arXiv preprint arXiv:1412.7828, 2014.

[10] Zhiyong Wang, Feng Zhao, Jian Peng, and Jinbo Xu. Protein 8-class secondary structure prediction using conditional neural fields. In IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 109-114. IEEE, 2010.

[11] Gianluca Pollastri, Dariusz Przybylski, Burkhard Rost, and Pierre Baldi. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins: Structure, Function, and Bioinformatics, 47(2):228-235, 2002.
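The two-step weight search described on the Genetic Algorithm slides can be illustrated in code. This is a minimal, scaled-down sketch, not the authors' implementation: the fitness function (per-residue accuracy of a weighted ensemble of classifier class probabilities), the population size, elite count, generation counts, and Gaussian mutation scale are all illustrative assumptions; the slides describe 2,000 individuals, the top 100 as parents, and 100 generations per step.

```python
import numpy as np

def ensemble_accuracy(weights, probs, labels):
    """Per-residue accuracy (Q3/Q8-style) of a weighted ensemble.

    weights: (n_classifiers,) non-negative combination weights.
    probs:   (n_classifiers, n_residues, n_classes) class probabilities.
    labels:  (n_residues,) true class indices.
    """
    combined = np.tensordot(weights, probs, axes=1)  # (n_residues, n_classes)
    return float(np.mean(combined.argmax(axis=1) == labels))

def evolve_weights(probs, labels, pop_size=100, elite=10,
                   gens_step1=10, gens_step2=10, seed=0):
    """Two-step genetic search over ensemble weights (scaled-down sketch)."""
    rng = np.random.default_rng(seed)
    n = probs.shape[0]
    pop = rng.random((pop_size, n))  # initial random population

    def top(pop, k):
        # k fittest individuals, best first
        fit = np.array([ensemble_accuracy(w, probs, labels) for w in pop])
        return pop[np.argsort(fit)[::-1][:k]]

    def crossover(parents, k):
        # uniform crossover between random pairs of elite parents
        out = np.empty((k, n))
        for i in range(k):
            a, b = parents[rng.integers(len(parents), size=2)]
            out[i] = np.where(rng.random(n) < 0.5, a, b)
        return out

    def mutate(parents, k):
        # Gaussian perturbation of random elite parents, clipped at zero
        base = parents[rng.integers(len(parents), size=k)]
        return np.clip(base + rng.normal(0.0, 0.1, base.shape), 0.0, None)

    # Step 1: new individuals come from both crossover and mutation.
    for _ in range(gens_step1):
        parents = top(pop, elite)
        half = (pop_size - elite) // 2
        pop = np.vstack([parents,
                         crossover(parents, half),
                         mutate(parents, pop_size - elite - half)])

    # Step 2: new individuals come from mutation only.
    for _ in range(gens_step2):
        parents = top(pop, elite)
        pop = np.vstack([parents, mutate(parents, pop_size - elite)])

    return top(pop, 1)[0]  # best weight vector found
```

Because the elite parents are carried over unchanged each generation, the best fitness is monotone non-decreasing, mirroring the coarse exploration (crossover) followed by fine-tuning (mutation only) structure described on the slides.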