Automatic Functional Annotation of Predicted Active Sites: Combining PDB and Literature Mining

Kevin Nagel
Wolfson College

A dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.

January 2009

Declaration

This dissertation is the result of my own work, and includes nothing which is the outcome of work done in collaboration, except where specifically indicated in the text. The dissertation does not exceed the specified length limit of 300 pages as defined by the Biology Degree Committee. This thesis has been typeset in 12pt font using LaTeX 2ε according to the specifications defined by the Board of Graduate Studies and the Biology Degree Committee.

Summary

Kevin Nagel
European Bioinformatics Institute
University of Cambridge

Dissertation title: Automatic functional annotation of predicted active sites: combining PDB and literature mining.

Proteins are essential to cellular function, and protein function is mainly identified through biological experiments. Structural models of proteins help to explain their function, but are not direct evidence for it. Nonetheless, we can mine structural databases, such as the Protein Data Bank (PDB), to extract shared structural components that are meaningful with regard to protein function. This thesis applied mining techniques to the PDB to identify evolutionarily conserved structural patterns, e.g. active sites. This analysis retrieved 3- and 4-bodies with assumed two- and three-way residue interactions, which were selected from a distribution analysis of residue triplets. A subset of the mined patterns is assumed to represent active sites, an assumption that should be confirmed by annotations gathered through automatic literature analysis.
Literature analysis for the functional annotation of proteins relies on the extraction of GO terms from the context of a protein mention. The annotation of protein residues requires the identification of chemical functions, which can be found in the context of residue mentions. MEDLINE abstracts have been processed to identify protein mentions in combination with species and residues (F1-measure 0.52; the F1-measure is a statistical measure of a test's accuracy based on the precision and recall of the test). The identified protein-species-residue triplets have been validated and benchmarked against reference data resources. Then, contextual features were extracted through shallow and deep parsing, and the features were classified into predefined categories (F1-measure ranging from 0.15 to 0.67). Furthermore, the feature sets have been aligned with annotation types in UniProtKB to assess the relevance of the annotations for ongoing curation projects. Altogether, the annotations have been assessed automatically and manually against reference data resources.

All of MEDLINE has been processed to extract annotations for residues. A subset of the identified catalytic sites could be cross-validated against the Catalytic Site Atlas (CSA; 44 out of 221). In addition, 429 out of 512 protein residues from MSDsite were annotated with contextual data. Altogether, MEDLINE does not provide sufficient data to fully annotate the content of the PDB. Conversely, residue annotation is achieved with a different feature set than that provided by GO, and incomplete annotations in the reference datasets can be filled from the public literature.

Acknowledgements

This thesis would not have been possible without the support, direction, and love of a multitude of people. First, I would like to thank my supervisor Dietrich Rebholz-Schuhmann for his trust, encouragement, and for all his unconditional support and guidance. Throughout, Dietrich has given me opportunity and a sound research methodology.
Working with him I have learned the value of vision, and persistence in achieving it.

I am blessed to have had Tom Oldfield as my second supervisor. Ever since I was interviewed by Tom, he has been inspiring, helpful and, most of all, patient. I will look back fondly on our discussions, the "insights" into protein science he gave me, and the cheerful and motivational chats. I am deeply indebted to him for his belief in me.

I would like to thank my thesis committee members for their constructive comments and valuable criticism: Michael Ashburner, Kim Henrick, and Rob Russell. They all seemed to find time for me despite their busy schedules. A special thank you must go to Kim Henrick; had he not encouraged me to pursue a research position, I would not be a scientist now.

I would also like to acknowledge Antonio Jimeno for his time, patience, and suggestions, and especially for reminding me always to keep my focus. But most of all I will remember the great times we had cycling to and from work.

I would like to thank the past and present members of the Rebholz Group (Text Mining). During my years of research, the group has expanded and I have had the chance to learn from its members as well as to have fun with them.

I am also thankful to the European Molecular Biology Laboratory (EMBL) for the scholarship and the organised EMBL International PhD programme, throughout which I have had the chance to meet many talented and cheerful PhD students from the EMBL/EBI Hinxton.

A special thank you to Christina Granroth and Dagmar Harzheim, who proofread this thesis. Thank you, Dagmar, for making clearer what I want to say.

Finally, I would like to acknowledge my wife Almut Nagel and my daughter Juli Nagel. Without Almut I would have become a working maniac with no joy in life; she helped me to maintain balance during my PhD research and also for the future. My special thanks and love go to Juli, aged one, from whom I have learned so much.

Contents

1 Introduction
    1.1 Proteins and functional sites
    1.2 Motivation
    1.3 Objective
    1.4 Related works
    1.5 Challenges
    1.6 Guide to remaining chapters
2 Background
    2.1 Protein related data resources
        2.1.1 Protein Data Bank
        2.1.2 Universal Protein Knowledge base
        2.1.3 Gene Ontology
        2.1.4 Biomedical literature
    2.2 Protein structure data mining
        2.2.1 Hypothesis-driven data analysis
        2.2.2 Discovery-driven data mining
    2.3 Biomedical literature mining
        2.3.1 Biological entity recognition
        2.3.2 Biological relation extraction
    2.4 Conclusion
3 Mining residue interactions as triads from PDB
    3.1 Algorithms
        3.1.1 Structural feature extraction
        3.1.2 Detection of significant configurations as interactions
        3.1.3 Grouping and selecting frequent configurations
    3.2 Analysing available non-redundant protein structure sets
    3.3 Evaluation methods
    3.4 Results
        3.4.1 Identification of residue interactions is dependent on data selection
        3.4.2 The interaction distance correlates with the distribution of residue triads
        3.4.3 Interaction classification is sensitive to the size of cross-validation
    3.5 Discussion
    3.6 Conclusion
4 Prediction of functions for mined residue triads
    4.1 Evaluation methods
    4.2 Results
        4.2.1 Identification of homologous metal binding sites
        4.2.2 Validation of convergent metal binding sites
        4.2.3 Recovering active sites and catalytic triads from the dataset
        4.2.4 Discovering the conserved serine residue in the catalytic triad (quartet)
    4.3 Discussion
    4.4 Conclusion
5 Identification of protein residues in MEDLINE
    5.1 Algorithms
        5.1.1 Protein and organism entity recognition
        5.1.2 Entity recognition of protein residue
        5.1.3 Association identification of the entity triplet organism, protein, and residue
    5.2 The construction of evaluation test corpora
    5.3 Evaluation methods
    5.4 Results
        5.4.1 Evaluation of organism, protein, and residue entity recognition
        5.4.2 Performance study on the entity triplet association
        5.4.3 Cross-validation of identified residues with UniProtKB
        5.4.4 Identified residues in MEDLINE for Uniprot/PDB proteins
    5.5 Discussion
    5.6 Conclusion
6 Information extraction from the context of a residue in text
    6.1 Algorithms
        6.1.1 Extraction of contextual features
        6.1.2 Categorisation of contextual
