Automatic functional annotation of predicted active sites: combining PDB and literature mining

Kevin Nagel Wolfson College

A dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.

Email: [email protected]

January 2009

Declaration

This dissertation is the result of my own work, and includes nothing which is the outcome of work done in collaboration, except where specifically indicated in the text. The dissertation does not exceed the specified length limit of 300 pages as defined by the Biology Degree Committee. This thesis has been typeset in 12pt font using LaTeX 2ε according to the specifications defined by the Board of Graduate Studies and the Biology Degree Committee.

Summary

Kevin Nagel European Bioinformatics Institute University of Cambridge

Dissertation title: Automatic functional annotation of predicted active sites: combining PDB and literature mining.

Proteins are essential to cell function, and their functions are mainly identified in biological experiments. Structural models of proteins help to explain their function, but are not direct evidence for it. Nonetheless, we can mine structural databases, such as the Protein Data Bank (PDB), to filter out shared structural components that are meaningful with regard to protein function. This thesis applied mining techniques to the PDB to identify evolutionarily conserved structural patterns, e.g. active sites. This analysis retrieved 3- and 4-bodies with assumed two- and three-way residue interactions that were selected from a distribution analysis of residue triplets. A subset of the mined patterns is assumed to represent active sites, which should be confirmed by annotations gathered by automatic literature analysis.

Literature analysis for the functional annotation of proteins relies on the extraction of GO terms from the context of a protein mention. The annotation of protein residues requires the identification of chemical functions, which can be found in the context of residue mentions. MEDLINE abstracts were processed to identify protein mentions in combination with species and residues (F1-measure 0.52; the F1-measure is a statistical measure of a test's accuracy based on its precision and recall). The identified protein-species-residue triplets were validated and benchmarked against reference data resources. Then, contextual features were extracted through shallow and deep parsing, and the features were classified into predefined categories (F1-measure ranging from 0.15 to 0.67). Furthermore, the feature sets were aligned with annotation types in UniProtKB to assess the relevance of the annotations for ongoing curation projects. Altogether, the annotations were assessed automatically and manually against reference data resources.

All of MEDLINE was processed to filter out annotations for residues. A subset of the identified catalytic sites could be cross-validated against the Catalytic Site Atlas (CSA; 44 out of 221). In addition, 429 out of 512 protein residues from MSDsite were annotated with contextual data. Altogether, MEDLINE does not provide sufficient data to fully annotate the content of the PDB. Conversely, residue annotation is achieved with a different feature set than that provided by GO, and incomplete annotations in the reference datasets can be filled from public literature.

Acknowledgements

This thesis would not have been possible without the support, direction, and love of a multitude of people. First, I would like to thank my supervisor Dietrich Rebholz-Schuhmann for his trust, encouragement, and for all his unconditional support and guidance. Dietrich has throughout given me opportunity and a sound research methodology. Working with him I have learned the value of vision, and persistence in achieving it. I am blessed to have had Tom Oldfield as my second supervisor. Ever since I was interviewed by Tom, he has been inspiring, helpful and most of all patient. I will look back fondly on our discussions, the "insights" into protein science he gave me, and the cheerful and motivational chats. I am deeply indebted to him for his belief in me. I would like to thank my thesis committee members, Michael Ashburner, Kim Henrick, and Rob Russell, for their valuable and constructive comments and criticism. They all seemed to find time for me despite their busy schedules. A special thank you must go to Kim Henrick; had he not encouraged me to pursue a research position I would not be a scientist now. I would also like to acknowledge Antonio Jimeno for his time, patience, and suggestions and especially for reminding me to keep my focus always. But most of all I will remember the great times we had cycling to and from work. I would like to thank the past and present members of the Rebholz Group (Text Mining). During my years of research, the group has expanded and I have had the chance to learn from its members as well as to have fun with them.

I am also thankful to the European Molecular Biology Laboratory (EMBL) for the scholarship and the organised EMBL International PhD programme, through which I have had the chance to meet many talented and cheerful PhD students from the EMBL/EBI Hinxton. A special thank you to Christina Granroth and Dagmar Harzheim, who proofread this thesis. Thank you, Dagmar, for helping me to express more clearly what I want to say. Finally, I would like to acknowledge my wife Almut Nagel and my daughter Juli Nagel. Without Almut I would have become a working maniac with no joy in life; she helped me to maintain balance during my PhD research and also for the future. My special thanks and love go to Juli, aged one, from whom I have learned so much.

Contents

1 Introduction
    1.1 Proteins and functional sites
    1.2 Motivation
    1.3 Objective
    1.4 Related works
    1.5 Challenges
    1.6 Guide to remaining chapters

2 Background
    2.1 Protein related data resources
        2.1.1 Protein Data Bank
        2.1.2 Universal Protein Knowledge base
        2.1.3 Gene Ontology
        2.1.4 Biomedical literature
    2.2 Protein structure data mining
        2.2.1 Hypothesis-driven data analysis
        2.2.2 Discovery-driven data mining
    2.3 Biomedical literature mining
        2.3.1 Biological entity recognition
        2.3.2 Biological relation extraction
    2.4 Conclusion

3 Mining residue interactions as triads from PDB
    3.1 Algorithms
        3.1.1 Structural feature extraction
        3.1.2 Detection of significant configurations as interactions
        3.1.3 Grouping and selecting frequent configurations
    3.2 Analysing available non-redundant protein structure sets
    3.3 Evaluation methods
    3.4 Results
        3.4.1 Identification of residue interactions is dependent on data selection
        3.4.2 The interaction distance correlates with the distribution of residue triads
        3.4.3 Interaction classification is sensitive to the size of cross-validation
    3.5 Discussion
    3.6 Conclusion

4 Prediction of functions for mined residue triads
    4.1 Evaluation methods
    4.2 Results
        4.2.1 Identification of homologous metal binding sites
        4.2.2 Validation of convergent metal binding sites
        4.2.3 Recovering active sites and catalytic triads from the dataset
        4.2.4 Discovering the conserved serine residue in the catalytic triad (quartet)
    4.3 Discussion
    4.4 Conclusion

5 Identification of protein residues in MEDLINE
    5.1 Algorithms
        5.1.1 Protein and organism entity recognition
        5.1.2 Entity recognition of protein residue
        5.1.3 Association identification of the entity triplet organism, protein, and residue
    5.2 The construction of evaluation test corpora
    5.3 Evaluation methods
    5.4 Results
        5.4.1 Evaluation of organism, protein, and residue entity recognition
        5.4.2 Performance study on the entity triplet association
        5.4.3 Cross-validation of identified residues with UniProtKB
        5.4.4 Identified residues in MEDLINE for Uniprot/PDB proteins
    5.5 Discussion
    5.6 Conclusion

6 Information extraction from the context of a residue in text
    6.1 Algorithms
        6.1.1 Extraction of contextual features
        6.1.2 Categorisation of contextual features
    6.2 Evaluation methods
    6.3 Results
        6.3.1 Contextual feature extraction evaluated
        6.3.2 Performance analysis of the classifiers
    6.4 Discussion
    6.5 Conclusion

7 Extraction of functional annotation for protein residues from MEDLINE
    7.1 Evaluation methods
    7.2 Results
        7.2.1 Evaluation of the developed functional annotation extraction system
        7.2.2 Studying mined functional annotations for the proteins p53 and Jak2
        7.2.3 Cross-validation of mined catalytic residues with CSA
        7.2.4 Annotation of protein residues in MSDsite
    7.3 Discussion
    7.4 Conclusion

8 Combining active site prediction with mined functional annotations
    8.1 Algorithms
        8.1.1 Combining protein structure data with literature data
    8.2 Evaluation methods
    8.3 Results
        8.3.1 Protein residue mapping between three data resources
        8.3.2 Rediscovery of active sites and catalytic residues
        8.3.3 Search for novel catalytic residues
        8.3.4 General correlation found between predicted functional sites and extracted functional annotations
    8.4 Discussion
    8.5 Conclusion

9 Conclusions and future work
    9.1 Summary of main contributions
    9.2 Limitations and future works

A Examples of errors in relation extraction
B Examples of extracted functional annotations compared with UniProtKB
C Examples of extracted functional annotations for the protein p53
D Examples of extracted functional annotations for the protein Jak2
E Examples of extracted functional annotations of the category binding event
F Examples of extracted functional annotations of active site residues
G Glossary

List of Figures

1.1 The standard amino acids
1.2 Examples of functional sites in proteins
1.3 The protein universe and its knowledge representation

2.1 Data banks in the protein universe
2.2 Three hyperlinked protein data banks
2.3 Categories for protein sequence annotation UniProtKB
2.4 GO terms are not suitable for protein residue annotation

3.1 Overview of processes and evaluation methods of the developed 3D pattern identification system
3.2 Four classes of interactions within a 3-body
3.3 Non-redundant structure set for 3D pattern mining
3.4 Distribution analysis of extracted residue triplets
3.5 Comparison of extracted residue triplets based on their interaction type
3.6 The effect of varying the cross-validation sample size on significance testing of residue interaction

4.1 A metal binding site with the 3Cys pattern in OLDFIELD
4.2 A metal binding site with the Cys-2His pattern in OLDFIELD
4.3 A metal binding site with the 3Cys pattern in SCOP40
4.4 A metal binding site with the Cys-2His pattern in SCOP40
4.5 Re-discovery of the catalytic triad as Asp-His-Ser pattern in OLDFIELD

5.1 Overview of processes and evaluation methods for the developed protein residue identification system
5.2 Test corpora for information extraction evaluation
5.3 Identified protein residues in MEDLINE
5.4 Cross-validation of citations from identified protein residues with UniProtKB/PDB

6.1 Overview of processes and evaluation methods of the developed contextual feature extraction system

7.1 Performance evaluation of the functional annotation extraction system
7.2 Cross-validation of text mined catalytic residues with CSA
7.3 Cross-validation of text mined binding residues with MSDsite

8.1 Overview of processes and evaluation methods of combining the protein structure dataset and literature dataset
8.2 Lookup table for PDB/UniProtKB mapping
8.3 Overview of the combined datasets from protein structure data and biomedical literature data

List of Tables

3.1 Study on the effect of varying the interaction distance threshold in structure triangulation

4.1 Summary of extracted data at each protein structure data mining step
4.2 Identification of metal binding sites in OLDFIELD
4.3 Convergent metal binding sites identified in SCOP40
4.4 List of cross-validated active site residues
4.5 Extending the catalytic triad into 4-bodies

5.1 Regular expression patterns for the detection of residue mentions in text
5.2 Performance evaluation of residue entity recognition
5.3 Performance evaluation of protein entity recognition
5.4 Performance evaluation of organism entity recognition
5.5 Performance evaluation of residue-protein-organism entity association detection
5.6 Performance evaluation of protein-organism and protein-residue entity association detections
5.7 A specialised performance evaluation between GC and XC2

6.1 Biological categories for the classification of protein residue related information
6.2 Category distribution in the text feature reference set
6.3 Evaluation of syntactical language parser performance
6.4 Performance analysis of the classifiers (confusion matrix)
6.5 Performance evaluation of the classifiers (precision, recall, F1 measure)

8.1 Extracted MEDLINE information on the catalytic residues in bovine chymotrypsinogen
8.2 Identified catalytic residues from MEDLINE extraction
8.3 Catalytic triad residues available from the mined functional annotations
8.4 Functional annotations of protein residues in predicted functional sites
8.5 Homology-based transfer of extracted functional annotations for protein residues in the mined pattern data

A.1 Examples of errors in the relation extraction for the detection of contextual features

B.1 Comparison of extracted functional annotations from GC with UniProtKB

C.1 Examples of literature mined annotations of protein residues in p53

D.1 Examples of literature mined annotations of protein residues in Jak2

E.1 Mined functional annotations of protein residues with information on binding events

F.1 Identified catalytic triad residues from MEDLINE extraction

Chapter 1

Introduction

1.1 Proteins and functional sites

The genomic information encodes the blueprint to build an organism. The decoding and implementation of genetic information depends on the functions of the proteins. Each protein is the result of transcribing a gene into mRNA, which is translated into a polypeptide. Hence, a protein is a gene product. The elementary units of a protein are the 20 natural standard amino acids, each with four invariant parts, namely a central chiral alpha carbon (Cα), an amine group (NH2), a carboxylic acid group (COOH), and a hydrogen (H), together with a characteristic side chain (R). Apart from the invariant amine and carboxylic acid groups, which give every amino acid the property of a zwitterion, distinctive physicochemical properties are defined by the side chain group. Side chains can be polar, acidic or basic, aromatic, bulky, or conformationally flexible, and can have cross-linking ability, hydrogen-bonding capability, or chemical reactivity. Figure 1.1 lists all the standard amino acids and their common classification on the basis of the nature of the side chain group.

Amino Acid      3-Letter  1-Letter  Side-chain polarity
Alanine         Ala       A         nonpolar
Arginine        Arg       R         polar
Asparagine      Asn       N         polar
Aspartic acid   Asp       D         polar
Cysteine        Cys       C         nonpolar
Glutamic acid   Glu       E         polar
Glutamine       Gln       Q         polar
Glycine         Gly       G         nonpolar
Histidine       His       H         polar
Isoleucine      Ile       I         nonpolar
Leucine         Leu       L         nonpolar
Lysine          Lys       K         polar
Methionine      Met       M         nonpolar
Phenylalanine   Phe       F         nonpolar
Proline         Pro       P         nonpolar
Serine          Ser       S         polar
Threonine       Thr       T         polar
Tryptophan      Trp       W         nonpolar
Tyrosine        Tyr       Y         polar
Valine          Val       V         nonpolar

Figure 1.1: The standard amino acids. The trivial names, 3-letter and 1-letter abbreviations are listed along with the physicochemical properties of their side chains.

During biosynthesis, ribosomes catalyse the polymerisation of amino acids through condensation and form peptide bonds between the NH2 and COOH groups of two consecutive amino acids. The backbone (main chain) of the resulting polypeptide is the repeating sequence NH2-C-CO-[NH-C-CO]n-NH-C-CO. This is the primary structure of a protein, and it will fold spontaneously due to different interactions of its amino acid composition with environmental factors, e.g. solvent, salt, chaperones. The most prominent formation during the folding process is the hydrophobic core, which stabilises the protein structure. Amino acids such as alanine, valine, leucine, isoleucine, phenylalanine, and methionine are clustered in the interior of a protein, while charged or polar side chains are turned to the solvent-exposed surface and interact with surrounding water molecules. Minimising the exposure of hydrophobic side chains to water is the principal driving force of folding.

The process of protein folding involves the formation of regular secondary structure elements (SSE), such as the alpha helix and the beta strand, which are stabilised by intramolecular hydrogen bonds and contacts between side chain atoms (van der Waals interaction). By following a helical path, the carboxyl group of residue i and the amino group of residue i+4 of the main chain are arranged in alignment and stabilise the local structure by hydrogen-bond formation. The side chains protrude from the helically coiled backbone and define the surface of the helix. In contrast, beta strands are formed by hydrogen bonds between distant regions of the peptide. Depending on the direction of the peptide region, two adjacent strands can be characterised as parallel or antiparallel. Because the backbone adopts an almost fully extended conformation, the side chains of every second residue (i, i+2, and so on) face the same direction. A set of interacting strands is called a sheet. Within the process of intramolecular stabilisation of the main chain, the regions between secondary structure elements adopt a loosely defined conformation such as turns and random coils or loops.

The attractive and repulsive forces (e.g. ionic or van der Waals interactions between residues) among the SSEs balance each other during the folding process and lead to a relatively stable and complex three-dimensional structure. Stabilisation of the conformation may involve covalent bonding, e.g. disulphide bridges between two cysteine residues, or the formation of metal-binding motifs. The spatial arrangement of sequentially proximate or distant residues allows the generation of biochemical functional sites. Identifying those and other novel biologically functional regions in the protein is one of the greatest research interests in the protein bioinformatics community, because they explain phenological data, e.g. cellular processes. Figure 1.2 lists some of the well known functional sites in various proteins, classified according to a categorisation scheme of my own design.

Finally, the formation of quaternary structure is the assembly of tertiary structures within a multi-chain protein. In this respect, each polypeptide chain is regarded as an individual functional unit (subunit or domain). Within the interfaces of the subunits, a multi-domain functional site can be formed, which is not present or functional in the individual domains. For example, the proteins cAMP-dependent protein kinase (PDBID:1rdq), hexokinase (PDBID:1bdq), and maltodextrin phosphorylase (PDBID:1l5w) contain ligand binding sites consisting of more than one protein structure domain (A. Kahraman, pers. comm.).
The identification of these multi-domain functional sites is another great challenge in protein bioinformatics. First, the prediction system has to find the correct assembly of tertiary structures (a crystal structure of a protein does not necessarily reflect the biological state of assembly). Second, the structure models have to be adjusted (proteins are not rigid molecules and have flexible parts), and finally co-factors, e.g. metal ions, have to be considered.

site
  1. evolutionary site
    1.1. conserved site
  2. functional site
    2.1. interaction site
      2.1.1. active site
        2.1.1.1. catalytic site / reactive site
          2.1.1.1.1. catalytic residue
          2.1.1.1.2. donor site
          2.1.1.1.3. acceptor site
        2.1.1.2. binding site / contact site / substrate binding site / ligand binding site / binding site / recognition site
          2.1.1.2.1. specificity residue / specific site
            2.1.1.2.1.1. high affinity binding site
            2.1.1.2.1.2. low affinity binding site
          2.1.1.2.2. peptide binding site
          2.1.1.2.3. protein binding / receptor site
            2.1.1.2.3.1. nf kappab site
            2.1.1.2.3.2. antibody binding site
            2.1.1.2.3.3. antigen binding site
            2.1.1.2.3.4. actin binding site
          2.1.1.2.4. sugar binding
          2.1.1.2.5. lipid binding
          2.1.1.2.6. nucleic acid binding
            2.1.1.2.6.1. atp binding site
          2.1.1.2.7. metal binding site
            2.1.1.2.7.1. calcium binding site / ca(2+) binding site
            2.1.1.2.7.2. copper site
      2.1.2. passive site / target site
        2.1.2.1. cleavage site / lesion site / processing site / proteolytic cleavage site
        2.1.2.2. PTM site
          2.1.2.2.1. phosphorylation site
            2.1.2.2.1.1. tyrosine phosphorylation site
          2.1.2.2.2. glycosylation site
          2.1.2.2.3. regulatory site
          2.1.2.2.4. inhibitory site
          2.1.2.2.5. activation site
    2.2. structural site
      2.2.1. hydrophobic site
        2.2.1.1. hydrophobic core
        2.2.1.2. hydrophobic patch
      2.2.2. n terminal site
      2.2.3. c terminal site
      2.2.4. transmembrane site
      2.2.5. intracellular site / cellular site
      2.2.6. extracellular site
      2.2.7. anionic site
      2.2.8. cationic site
      2.2.9. nucleation site

Figure 1.2: Examples of functional sites in proteins. An excerpt of a proposed classification scheme is shown, based on my own perspective of the biomolecular function of specific residue configurations in protein structures.
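The abbreviations and polarity classes of Figure 1.1 lend themselves to a simple lookup table. The following Python sketch is an illustration only: the names AMINO_ACIDS, to_one_letter and nonpolar_fraction are hypothetical helpers introduced here, and the polarity labels are copied from Figure 1.1 rather than derived independently.

```python
# Lookup built from Figure 1.1:
# three-letter code -> (one-letter code, side-chain polarity).
AMINO_ACIDS = {
    "Ala": ("A", "nonpolar"), "Arg": ("R", "polar"),
    "Asn": ("N", "polar"),    "Asp": ("D", "polar"),
    "Cys": ("C", "nonpolar"), "Glu": ("E", "polar"),
    "Gln": ("Q", "polar"),    "Gly": ("G", "nonpolar"),
    "His": ("H", "polar"),    "Ile": ("I", "nonpolar"),
    "Leu": ("L", "nonpolar"), "Lys": ("K", "polar"),
    "Met": ("M", "nonpolar"), "Phe": ("F", "nonpolar"),
    "Pro": ("P", "nonpolar"), "Ser": ("S", "polar"),
    "Thr": ("T", "polar"),    "Trp": ("W", "nonpolar"),
    "Tyr": ("Y", "polar"),    "Val": ("V", "nonpolar"),
}


def to_one_letter(residues):
    """Convert a sequence of three-letter codes into a one-letter string."""
    return "".join(AMINO_ACIDS[r.capitalize()][0] for r in residues)


def nonpolar_fraction(residues):
    """Return the fraction of residues whose side chain is listed as nonpolar."""
    residues = list(residues)
    nonpolar = sum(1 for r in residues if AMINO_ACIDS[r.capitalize()][1] == "nonpolar")
    return nonpolar / len(residues)


if __name__ == "__main__":
    triad = ["Asp", "His", "Ser"]
    print(to_one_letter(triad))
    print(nonpolar_fraction(["Ala", "Val", "Ser", "Lys"]))
```

A table of this kind is a convenient common key when residue names from structure files (three-letter codes) have to be matched against sequence data (one-letter codes).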

1.2 Motivation

The understanding of the biological function of proteins remains a central challenge in biology. Our knowledge of the protein universe can be partitioned into at least three knowledge spaces (cf. figure 1.3): protein sequence space, protein structure space, and protein function space. Each space represents a specific view of proteins. For example, the protein structure space contains information about the number of biological conformations of protein structures (cf. figure 1.3, top panel), whereas the function space describes the spectrum of protein function. Although the information from each space partially overlaps, few data are available to explain their relationship. For example, site-directed mutational analysis is often reported in the context of gain or loss of a protein function, while the biological correlation between sequence and function is not understood. This is because the mechanism of protein function is not explained by information within sequence space. In contrast, structural data are more expressive than sequence data, because a protein structure provides the spatial context of residues. Proteins are physical entities and, as such, they interact with other proteins or ligands. The shape of a protein, or more precisely the spatial configuration of a set of residues in a functional site, is one explanation for protein function. While protein structure data mining is concerned with the prediction of novel functional sites in proteins, a mined structural pattern carries no evidence of biological function. In contrast, biomedical literature reports a range of biological functions of protein residues without a structural context or an explanation of the molecular mechanism (cf. figure 1.3, middle panel). Combining information from protein structure space and protein function space is therefore an obvious approach to gaining new knowledge on protein function.

Figure 1.3: The protein universe and its knowledge representation. Information on a protein can be collected from at least three different knowledge domains: crystallography provides the spatial coordinates of a protein, protein sequencing determines the linear composition of amino acids in a protein, and biochemical experiments characterise the biological function (top panel). In principle, protein function prediction can be done based on information from each of these knowledge spaces; however, combining them can overcome some domain-specific limitations (middle panel).

1.3 Objective

This thesis aims to discover hypothetical functional sites from the Protein Data Bank (PDB) and to annotate them with functional information from biomedical literature. The main idea is to combine the information from two currently detached data resources: protein structure information from the PDB, and functional annotations of residues from MEDLINE (cf. figure 1.3, lower panel). More specifically, this research focuses on the prediction of active sites by data mining recurrent spatial residue configurations (3D patterns) in proteins. Contextual features of residues are extracted from biomedical literature to provide functional annotations. The results from both datasets are then combined to verify predicted functional sites with evidence of biological function. While existing approaches in protein structure data mining and biomedical literature mining have been used to generate data for each research domain, the combination of the datasets is a novel approach in protein bioinformatics research.
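To make the notion of a recurrent spatial residue configuration more concrete, the following Python sketch counts residue-type triplets whose members all lie within a distance cutoff of each other. It is a brute-force illustration under stated assumptions, not the mining algorithm developed in chapter 3: the function name count_residue_triplets, the use of one representative coordinate per residue, the 7.0 Å cutoff and the toy coordinates are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import dist


def count_residue_triplets(residues, cutoff=7.0):
    """Count residue-type triplets whose three pairwise distances are <= cutoff.

    residues: list of (residue_name, (x, y, z)) tuples, one representative
              point per residue (an assumption of this sketch).
    cutoff:   hypothetical interaction distance threshold in angstroms.
    """
    counts = Counter()
    for trio in combinations(residues, 3):
        # Keep the triplet only if every pair of residues is in contact.
        if all(dist(p[1], q[1]) <= cutoff for p, q in combinations(trio, 2)):
            # Sort the names so that the key is independent of residue order.
            key = tuple(sorted(name for name, _ in trio))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Invented toy coordinates, not taken from any PDB entry.
    toy = [
        ("Asp", (0.0, 0.0, 0.0)),
        ("His", (3.0, 0.0, 0.0)),
        ("Ser", (0.0, 4.0, 0.0)),
        ("Gly", (50.0, 50.0, 50.0)),
    ]
    print(count_residue_triplets(toy))
```

Enumerating all combinations scales cubically with the number of residues, so a practical system would first restrict the search to spatial neighbours. Counting the same triplet key across a non-redundant set of structures then gives the kind of frequency distribution from which over-represented configurations can be selected.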

1.4 Related works

To verify a predicted protein function with functional annotations extracted from biomedical literature, two different levels have to be considered: (1) the protein level, and (2) the residue level (i.e. groups of residues forming a functional site). The recent publication [JGLRS08] is one example of case (1): the prediction of protein function is based on the search for a conserved and connected subgraph (CCS) in protein-protein interaction graphs, generated from several biological databases. Within the set of CCS, all available functional annotations of a protein in a database are transferred to homologous proteins. The annotations consist of Gene Ontology (GO) terms, and the transfer constitutes the prediction of protein function. A predicted function was verified by identifying GO terms in the abstracts associated with the corresponding protein.

The approach of this thesis has some similarities to this report [JGLRS08]; for example, in both approaches results from data mining were verified by information extracted from biomedical literature. However, there are crucial differences between the two that need to be considered when assessing the results of this thesis. First, in contrast to the CCS identification, the data mining part of this work does not aim to identify known patterns, but to discover new structural features that may represent a novel functional site. Secondly, in [JGLRS08] the prediction of protein function utilises the terminology of a well-developed public resource, the Gene Ontology, while the same resource is not suitable for the annotation of protein residues. This is because GO is designed to describe the function of genes and gene products. From a conceptual point of view, the terms in GO describe a high level of biological function, while descriptions of residue function are at a lower level. For example, descriptions of protein-protein interactions are found in the context of metabolomics, signal transduction or other cellular processes. In contrast, the function of a protein residue can be explained in the light of molecular interactions or chemical reaction mechanisms. Finally, the distribution of information on biological function is expected to differ within biomedical publications. Because protein function is conceptually a high level of biological function, it is likely that the abstracts of biomedical articles contain information on this level. Conversely, the interaction of protein residues is a detailed description of protein function, and key information is expected to be mentioned in the results or discussion sections of full-text articles.

To my knowledge, the most closely related work in terms of functional annotation of protein residues (case (2)) is the system called Mutation extraction and STRucture Annotation Pipeline (mSTRAP) [KCRB07].
The key feature of mSTRAP is the visualisation of mutation annotations, which are projected onto a structure of a protein of interest. The advantage of mSTRAP is that the impact of a mutation can be interpreted in the context of the protein structure. However, the prediction of functional sites is done by visual analysis of the protein structure. The provided annotations are sets of complete sentences extracted from MEDLINE, which means that the interpretation of the information requires expert knowledge.

The system developed in this work differs from mSTRAP in that the extracted information is not used exclusively to annotate point mutations; other functional descriptions of wild-type residues are also collected. Another distinction from mSTRAP is that the mined information is represented in a so-called predicate-argument structure (PAS) format: only those text segments of a sentence are extracted that describe a biological function or a biological context of a mentioned residue. The structured format allows, to some extent, queries for specific information in the extracted annotation dataset. In conclusion, only a few related works have been reported that describe an automated system to verify a predicted protein function using functional annotations extracted from the literature. This work retains its originality, because it aims to find novel functional sites in proteins by mining the PDB, and by extracting functional annotations from a wide range of biomedical literature data.
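A predicate-argument record of the kind described above can be pictured as a small structured type that supports simple queries. The Python sketch below is an assumption made for illustration: the class ResidueAnnotation, its field names, the find helper and the two toy records are hypothetical and do not reproduce the data model or the output of the system developed in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResidueAnnotation:
    """One predicate-argument record attached to a residue mention (hypothetical layout)."""
    protein: str
    residue: str
    predicate: str
    argument: str


def find(annotations, protein=None, predicate_contains=None):
    """Return the records that match every filter that was supplied."""
    hits = []
    for a in annotations:
        if protein is not None and a.protein.lower() != protein.lower():
            continue
        if (predicate_contains is not None
                and predicate_contains.lower() not in a.predicate.lower()):
            continue
        hits.append(a)
    return hits


# Invented toy records for illustration; they are not extracted annotations.
TOY_ANNOTATIONS = [
    ResidueAnnotation("chymotrypsin", "Ser195", "acts as",
                      "nucleophile in peptide bond hydrolysis"),
    ResidueAnnotation("p53", "Ser15", "is phosphorylated by", "ATM kinase"),
]

if __name__ == "__main__":
    for hit in find(TOY_ANNOTATIONS, predicate_contains="phosphorylated"):
        print(hit.protein, hit.residue, hit.predicate, hit.argument)
```

Compared with storing whole sentences, a record of this form can be filtered by field, which is the practical benefit of the structured format noted above.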

1.5 Challenges

Is it possible to identify a functional site, e.g. an active site, on the basis of mining the PDB and the literature, and then to combine the information from both? We can expect that a significant population of similarly arranged residues can be identified from a non-redundant protein set, if this evolutionarily conserved interaction provides a functional or structural advantage. We can also expect that residues are mentioned in conjunction with their corresponding protein, and that the biological role of a protein residue is reported in the context of gain or loss of function of the overall protein in biomedical literature.

One task presented in this thesis is the identification of textual features as functional annotation. The problem differs from other information extraction tasks, e.g. the annotation of proteins, because the target is to provide knowledge on the biological role of a residue. For example, to extract protein-protein interactions from text, a list of protein names is used, and the task is reduced to finding associations between the listed proteins. In contrast, extracting a protein residue and its corresponding biological function is difficult, because an adequate dictionary of terms is not available.
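The first half of that task, spotting the residue mention itself, can be approached with patterns over amino acid names and sequence positions. The Python sketch below is a simplified illustration and not the set of patterns used in chapter 5: the names FULL_TO_THREE, RESIDUE_RE and find_residue_mentions are hypothetical, the example sentence is invented, and only mentions of the form "Ser195", "His-57" or "serine 214" are covered.

```python
import re

# Full amino acid names mapped to their three-letter codes (cf. Figure 1.1).
FULL_TO_THREE = {
    "alanine": "Ala", "arginine": "Arg", "asparagine": "Asn",
    "aspartic acid": "Asp", "cysteine": "Cys", "glutamic acid": "Glu",
    "glutamine": "Gln", "glycine": "Gly", "histidine": "His",
    "isoleucine": "Ile", "leucine": "Leu", "lysine": "Lys",
    "methionine": "Met", "phenylalanine": "Phe", "proline": "Pro",
    "serine": "Ser", "threonine": "Thr", "tryptophan": "Trp",
    "tyrosine": "Tyr", "valine": "Val",
}

# Accept full names and three-letter codes; try longer alternatives first.
_NAMES = sorted(
    set(FULL_TO_THREE) | {code.lower() for code in FULL_TO_THREE.values()},
    key=len,
    reverse=True,
)

# A residue name, an optional hyphen or space, then the sequence position.
RESIDUE_RE = re.compile(
    r"\b(" + "|".join(re.escape(n) for n in _NAMES) + r")[- ]?(\d+)\b",
    re.IGNORECASE,
)


def find_residue_mentions(text):
    """Return (three_letter_code, position) pairs in order of appearance."""
    mentions = []
    for match in RESIDUE_RE.finditer(text):
        name = match.group(1).lower()
        code = FULL_TO_THREE.get(name, name.capitalize())
        mentions.append((code, int(match.group(2))))
    return mentions


if __name__ == "__main__":
    sentence = ("Mutation of Ser195 and His-57 abolished activity, "
                "whereas serine 214 was tolerated.")
    print(find_residue_mentions(sentence))
```

Such patterns are deliberately naive: they miss one-letter mutation notation such as "S195A" and will mistake ordinary words followed by numbers (for example "his 3") for residues. The harder part of the problem named above, linking the mention to its protein and to a description of its function, is not addressed by pattern matching alone.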

1.6 Guide to remaining chapters

Chapter 2 presents background knowledge that is important for this work. Four different data resources are reviewed and their limitations discussed in the context of this thesis. An explanation of methods in the fields of protein structure data mining and biomedical literature mining follows. Some of the introduced methodologies are reused in this work, while ideas and approaches from others were adopted to develop task-specific extraction systems.

Chapter3 describes the developed protein structure data mining system for the iden- tification of 3D patterns in PDB. Algorithms for the identification of conserved spatial residue configurations are explained and the effects of algorithm-related and data-related parameters are discussed.

Chapter4 demonstrates the biological implication of the mined 3D patterns from chap- ter3. Two examples of rediscovered functional sites in proteins are shown to justify the presented data mining approach. The first biological validation is the identifica- tion of metal binding sites, while the second validation is the rediscovery of catalytic triad from the mined data.

Chapter5 is the first of three text mining chapters in this thesis. It explains the de- veloped protein residue identification system, which consists of two main modules: biological entity recognition of residue, protein, and organism, and association de- tection of the entity triplet.

Chapter6 describes the approach to detect contextual features of a mentioned residue in text. An automatic method is introduced to assign semantic labels to the extracted

24 textual features.

Chapter7 presents the third part of the three text mining chapters. Both text mining modules from the previous chapters (protein residue identification, and contextual feature extraction) are combined to form the functional annotation extraction sys- tem. The overall performance of this information extraction system is studied. The validity of the extracted information as functional annotation is demonstrated by manual analysis on two example proteins (p53 and Jak2), and by cross-validation of identified catalytic or binding residues with two reference databases: CSA and MSDsite.

Chapter8 presents results on combining protein structure data with literature data. The validity is studied by examining the correlation of predicted active site residues with enzyme-related functional annotations.

Chapter9 summarises the thesis and presents limitations and open questions for follow up research.

Chapter 2

Background

In the previous chapter, I have presented the motivation and objective of this thesis. The purpose of this chapter is to familiarise the reader with relevant concepts in protein science, data mining, and literature mining. The limitations of each reviewed data resource or methodology are discussed in context of this research work.

2.1 Protein related data resources

Proteins are both building blocks of cellular structures and the major machinery in cells. In order to perform their functions, proteins need to fold into their three-dimensional structures and thereby form functional sites. The prediction of a structural pattern associated with a biological function is an important aspect of protein bioinformatics. To interpret the multiple functions of proteins, annotations are linked with results from bioinformatics analysis tools. In addition, data are collected from generic and specific databases, from biological knowledge accumulated in the literature, and from genome-wide experiments such as transcriptomics and proteomics. One major goal is to describe protein function within its biological context by using a standardised hierarchical classification scheme and controlled vocabulary.

The biological community has developed databases and functional annotation schemes that are not only used to archive protein data, but also to describe protein function on a molecular, cellular and phenotypical level. Figure 2.1 shows some of the most popular and relevant databases in the field of protein bioinformatics. These protein-related data resources are hyperlinked in order to foster bioinformatics research. Statistics for three example databanks and their hyperlinked references are given in figure 2.2.

Figure 2.1: Data banks in the protein universe. This figure shows my interpretation of how our knowledge about proteins can be categorised. A selection of the most relevant data resources and web services are reproduced in this figure. UniProtKB = Universal Protein Knowledge base [WAB+06]; PIR = Protein Information Resource [BGH+00]; PDB SELECT = representative list of PDB chain identifiers [HSSS92]; PISCES = Protein Sequence Culling Server [WD03]; UniqueProt = web-service to create representative protein sequence sets [MR03]; MEROPS = the Peptidase Database [RMK+07]; CAZy = Carbohydrate-Active enZYmes [CCR+08]; TC-DB = Membrane Transport Protein Classification Database [STB06]; PMD = Protein Mutant Database [KON99]; Phospho.ELM = a database of S/T/Y phosphorylation sites [DCG+04]; PROSITE = Database of protein domains, families and functional sites [HBB+08]; PRINTS = Protein Motif Fingerprint Database [Att02]; BMC = BioMed Central [BMC08]; PMC = PubMed Central [PMC08]; PDB = Protein Data Bank [BWF+00]; SCOP = Structural Classification of Proteins [HMBC97]; CATH = Class, Architecture, Topology, Homologous superfamily - Protein structure classification [OMJ+97]; Relibase = database of protein-ligand complexes [HBGK03]; CSA = Catalytic Site Atlas [PBT04]; MSDmotif = an integrated resource of protein structure motifs.

Figure 2.2: Three hyperlinked protein data banks. Illustrated is the size of three databanks, PDB, UniProtKB, and MEDLINE, along with their cross-references. For example, the PDB contains in total 42,943 PDB identifiers (version November 2008) with cross-references to 42,085 out of 333,445 UniProt identifiers, which in turn point to 10,466 biomedical journal articles (PMIDs). Notice that PDB also holds a small number of primary citations for each record; however, these are mainly pointers to crystallographic publications and provide few hints about the biological function of the protein or the annotation of functional sites.

2.1.1 Protein Data Bank

The Protein Data Bank (PDB) is an archive of 3D structures of large biological molecules, such as proteins and nucleic acids. Currently, PDB lists 43,099 proteins determined by crystallography (version November 2008). Despite the large amount of structure data available for a range of proteins, the information in the PDB has three significant limitations.

First, the structure data cover only a small part of the known sequence data. Compared with the sequence data in UniProtKB (cf. section 2.1.2), the coverage of the sequence space is much larger than that of the structure space. Therefore, the information derived from PDB is only applicable to a limited set of proteins.

The second limitation is the coverage of annotation available for proteins. In the PDB, there are some facilities to annotate proteins; for example, the SITE record is used to annotate protein residues that are part of active sites. However, annotations are not mandatory, and many sites are not updated even when new evidence for the biological function of these residues has been found. An automatically derived database called PDBSITE [IPGK05] stores the SITE record information and makes these data searchable. Another, rather predictive, database of functional sites in protein structures is MSDmotif [GH08], which provides information about ligands, sequence and structure motifs, their relative position, and their neighbour environment. Another database of predicted functional sites is MSDtemplate [Old02], which contains small fragments generated by data mining on a structurally unique protein set from PDB. Examples of biologically relevant fragments were identified in this data collection, such as the catalytic triad and various metal binding sites. The Catalytic Site Atlas (CSA) [PBT04] is another database documenting active sites in the 3D structures of enzymes. The data are either manually curated or predicted, based on searches for homologous proteins.

Another serious limitation of PDB concerns its use for statistical analysis of structure data. The PDB represents a redundant and biased snapshot of the protein universe. Redundancy is due to the fact that many highly similar structures or identical folds are deposited in the database, leading to an over-representation of some proteins. In the past, structure determination has been guided by hypothesis-driven experiments, by short-listed target proteins in the medical or commercial field, and by the methodological tractability of small proteins for crystallisation. Consequently, the fold space has not been fully explored yet. Although techniques in protein crystallography are improving, there are still other underrepresented proteins, e.g. membrane proteins or large proteins, which define the boundaries of representativeness of the structure data.

While there is little we can do about exploring the complete ensemble of folds from a bioinformatics point of view, the over-representation can be filtered. For example, protein sequence based clustering [AGM+90][AMS+97] is the principal method used to produce the following datasets: PDB SELECT [HSSS92], PISCES [WD03], UniqueProt [MR03]. However, this approach is limited by the uncertainty of the sequence-structure relation in the so-called twilight zone: below 30 per cent sequence identity, proteins may or may not have similar folds [Ros99]. Another critical issue with sequence based clustering is that whole protein chain sequences are compared, rather than segments defined by protein domain boundaries. Structure based approaches cluster the data on the basis of domain structures. Several databases of domain based structure clustering were created, with the most prominent ranging from entirely manual work (SCOP [HMBC97]) and a semi-automatic approach (CATH [OMJ+97]) to entirely non-supervised methods (FSSP-Dali [HS94]). Differences between these classifications were studied by [HJ99] and [DBAD03].
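The idea of filtering over-representation by sequence identity can be illustrated with a minimal sketch. It is not the procedure used by PDB SELECT, PISCES or UniqueProt; it assumes toy sequences that are already aligned to equal length, a hypothetical helper `select_representatives`, and an illustrative 30 per cent threshold borrowed from the twilight zone discussion above.

```python
def identity(a, b):
    """Fraction of identical positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    if not a:
        return 0.0
    # Gap characters never count as a match.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


def select_representatives(seqs, threshold=0.3):
    """Greedy filter: keep a chain only if its identity to every chain
    kept so far is below the threshold (assumed selection rule)."""
    kept = []
    for name, seq in seqs.items():
        if all(identity(seq, seqs[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy input: chain names and sequences are invented for illustration.
chains = {
    "A": "MKTAYIAKQR",
    "B": "MKTAYIAKQK",  # differs from A at a single position
    "C": "GSHWLLPEDV",  # shares no position with A
}
representatives = select_representatives(chains, threshold=0.3)
print(representatives)
```

In this toy case chain B is discarded as redundant with chain A, while chain C is kept. A greedy filter depends on the input order, which is one reason why real culling servers add further criteria.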

2.1.2 Universal Protein Knowledge base

The major repository of protein sequence data is the Universal Protein Knowledge base (UniProtKB). Along with the collection of sequence data, it lists protein names and synonyms, taxonomic data, citation references, and other manually curated information from literature surveys. One important aspect of UniProtKB when evaluating structure-function relationships is the annotation of protein residues. In the feature table, the biological function of a residue site is described along with several other key categories (cf. figure 2.3). Currently, UniProtKB lists 333,445 entries with 2,088,573 site-specific annotations (version from January 2008).

Despite the high quality of the data contained in UniProtKB, the process of extracting functional annotations from literature remains laborious curation work for human experts. The curator surveys the biomedical literature, represents the experimentally determined functional information, and formulates the precise functional role by utilising standardised semantic resources (cf. section 2.1.3). Although manual curation is highly reliable, this approach is evidently inefficient considering the number of full-text publications curators have to distil. According to Frishman, if we assume

”[...] that one needs on average roughly 30 min to assess published fact and bioinformatics evidence for one protein, one thousand annotators would have to work 1 year long, 8 h a day, to annotate all 5 million sequences that are currently known. However, since the size of the protein database has been consistently doubling every 18 months, the moving target of annotating all proteins will never be achieved.” [Fri07]

Considering that the estimated total number of proteins is in excess of 10^10 [CK06], an automatic or semi-automatic solution is needed to facilitate the laborious work of human experts. Currently, methods for the automatic expansion of citation sets [YLPV07][HLC04][LHC07] and the automatic annotation of protein function with GO terminology [CSL+06][GJYLRS08][RSKA+07] are being developed in the field of text mining.
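The arithmetic behind the quoted estimate can be retraced with a short calculation. The figures are taken from the quote above; the variable names and the assumption that the doubling time is applied over exactly twelve months are mine.

```python
# Figures from the quoted estimate [Fri07].
sequences = 5_000_000        # known sequences
minutes_per_protein = 30     # curation time per protein
annotators = 1_000
hours_per_day = 8
doubling_months = 18         # doubling time of the protein database

total_hours = sequences * minutes_per_protein / 60
hours_per_annotator = total_hours / annotators
working_days = hours_per_annotator / hours_per_day

# Growth of the database during twelve months of annotation (assumption).
growth_factor = 2 ** (12 / doubling_months)

print(total_hours, hours_per_annotator, working_days)
print(round(growth_factor, 2))
```

The workload comes to 2.5 million hours, i.e. 312.5 eight-hour days for each of the thousand annotators, which corresponds to roughly one year of daily work. Over the same year the database grows by a factor of about 1.59, so the backlog is larger at the end of the year than at the beginning.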

Key        Description

INIT_MET   Initiator methionine.
SIGNAL     Extent of a signal sequence (prepeptide).
PROPEP     Extent of a propeptide.
TRANSIT    Extent of a transit peptide (mitochondrion, chloroplast, thylakoid, cyanelle, peroxisome etc.).
CHAIN      Extent of a polypeptide chain in the mature protein.
PEPTIDE    Extent of a released active peptide.
TOPO_DOM   Topological domain.
TRANSMEM   Extent of a transmembrane region.
DOMAIN     Extent of a domain, which is defined as a specific combination of secondary structures organised into a characteristic three-dimensional structure or fold.
REPEAT     Extent of an internal sequence repetition.
CA_BIND    Extent of a calcium-binding region.
ZN_FING    Extent of a zinc finger region.
DNA_BIND   Extent of a DNA-binding region.
NP_BIND    Extent of a nucleotide phosphate-binding region.
REGION     Extent of a region of interest in the sequence.
COILED     Extent of a coiled-coil region.
MOTIF      Short (up to 20 amino acids) sequence motif of biological interest.
COMPBIAS   Extent of a compositionally biased region.
ACT_SITE   Amino acid(s) involved in the activity of an enzyme.
METAL      Binding site for a metal ion.
BINDING    Binding site for any chemical group (co-enzyme, prosthetic group, etc.).
SITE       Any interesting single amino-acid site on the sequence that is not defined by another feature key. It can also apply to an amino acid bond which is represented by the positions of the two flanking amino acids.
NON_STD    Non-standard amino acid.
MOD_RES    Posttranslational modification of a residue.
LIPID      Covalent binding of a lipid moiety.
CARBOHYD   Glycosylation site.
DISULFID   Disulfide bond.
CROSSLNK   Posttranslationally formed amino acid bonds.
VAR_SEQ    Description of sequence variants produced by alternative splicing, alternative promoter usage, alternative initiation and ribosomal frameshifting.
VARIANT    Authors report that sequence variants exist.
MUTAGEN    Site which has been experimentally altered by mutagenesis.
CONFLICT   Different sources report differing sequences.

Figure 2.3: Categories for protein sequence annotation in UniProtKB. Key categories used to describe regions or sites of interest in a protein sequence are listed together with their definitions. The key and the corresponding information (value) are stored in the feature table (FT line) in UniProtKB.

Clearly, the annotation for a whole protein cannot be transferred to residue site annotation, because different groups of residues in the protein structure have different functions. In this respect, the biological community is missing an information extraction system for the annotation of proteins at residue level.

2.1.3 Gene Ontology

The Gene Ontology (GO) [AL02][GOC06] is one of the most widely used functional classification schemes and includes all of the most important criteria for annotations of biological data [PKS06]. Currently, the ontology lists a total of 26,302 terms with 15,643 biological process terms, 2,233 cellular component terms, and 8,426 molecular function terms (version November 2008). The UniProtKB/InterPro group at the European Bioinformatics Institute (EBI) belongs to the Gene Ontology Consortium, and uses its standard vocabulary for the annotation of protein function. The vocabulary is meant to describe the biological phenomenology of genes and gene products (proteins). This is the reason why GO terms are not well suited to describing the function and properties of a protein residue. Figure 2.4 lists some examples where the identification of GO terms [GJYLRS08] did not find the more relevant keywords for the annotation of residues. At the moment, an ontology dedicated solely to the functional annotation of protein residues has not been developed. However, terms can in general be collected from other substantial resources, such as the Open Biomedical Ontologies [SAR+07], which contain, for example, REX (an ontology of physico-chemical processes) and PSI-MOD (an ontology describing protein chemical modifications).

Sentence: ”The catalytic mechanism of the non-phosphorylating glyceraldehyde-3-phosphate dehydrogenase and the other aldehyde dehydrogenases resembles a thioester mechanism involving the universally conserved cysteine 298 (pea GAPN).” (PMID:9461340)
Manual annotation: thioester mechanism, conserved cysteine
GO annotation: glyceraldehyde-3-phosphate dehydrogenase (NADP+) (phosphorylating activity), glyceraldehyde-3-phosphate biosynthesis, glyceraldehyde-3-phosphate catabolism, phosphoglycerate dehydrogenase activity

Sentence: ”However, mutations of a key residue, His48, show significant deviation from the relationship, implying a role for the side chain in protection of the complex from hydroxide attack.” (PMID:2690955)
Manual annotation: protection of the complex from hydroxide attack
GO annotation: AT DNA binding, tRNA, tyrosine tRNA ligase activity

Sentence: ”Second, this reactive cysteinyl residue, which is required for L-cysteine desulfurization activity, was identified as Cys325 by the specific alkylation of that residue and by site-directed mutagenesis experiments.” (PMID:81615929)
Manual annotation: L-cysteine desulfurization activity
GO annotation: pyridoxal biosynthesis, phosphate binding, mutagenesis, nitrogenase activity, L-alanine biosynthesis, pyridoxal phosphate binding

Figure 2.4: GO terms are not suitable for protein residue annotation. The presented examples demonstrate that predicted GO terms are not always suitable for protein residue annotation. The prediction of GO terms was done with an information theory based parser [GJYLRS08].

2.1.4 Biomedical literature

Biomedical research tackles biological questions from a number of perspectives, and the published experimental data are always heterogeneous. The sum of the descriptions of biological phenomena enables scientists to understand mechanisms in biology within various contexts. This body of text has also been compared with an ”unstructured knowledge database”, where information is present, but difficult to retrieve due to the complexity of natural language. According to Sidhu,

”[...] it is generally acknowledged that only 20 per cent of biological knowledge and data is available in a structured format or a database. The remaining 80 per cent of biological information is hidden in the unstructured, free text of scientific publications.” [SDC06]

In the context of information extraction, the data to be extracted from an article are words (keywords) regarding biological concepts that could summarise the key message of the article. At first glance, abstract texts have a high density of keywords but a low coverage of information, while full-texts cover a larger but dispersed quantity of data [FKY+01][YHF+02][SPIBA03][SWS+04][NBD+06]. Another key distinction between abstract texts and full-texts is the availability of data resources. Biomedical abstract texts can be publicly downloaded from MEDLINE without restriction, while full-texts from various journals are only available to subscribers. Although some full-text articles are accessible through various initiatives [BMC08][Plo08][PMC08], the extraction of information from a whole document is expected to be much more complex than from an abstract text. For example, a biological feature of a residue may be expressed over several sentences, requiring co-reference resolution of the residue and the feature.

2.2 Protein structure data mining

Data mining is an analytic method to identify valid and novel patterns in data. A general data mining solution does not exist. Instead, human data mining expertise and human domain expertise are required to solve each specific data mining problem. A data mining process consists of the following main steps: data selection, feature extraction, and correlation analysis.

With respect to protein structure data mining, data selection means the identification of a non-redundant set of protein structures from PDB (cf. section 2.1.1). Although a protein structure contains only geometrical information, it is important to distinguish the types of structural features to be analysed. The following structural features can serve as targets: the configuration of amino acids as Cα, the configuration of backbone atoms, the spatial arrangement of chemical groups [JIDG03][YEC+07][Rus98][SSR03][Old02], and the physicochemical environments [OCR01][YEC+07]. In order to discover new information from the data, a data mining algorithm must not contain any biochemical knowledge. The target should be a mathematical model and not a biological template.

2.2.1 Hypothesis-driven data analysis

”Within the field of bioinformatics research, the term data mining is used very loosely to describe any type of data analysis.” (T. Oldfield, pers. comm.)

Hypothesis-driven data analysis consists of defining a biological target (hypothesis) and searching for the target. Consequently, the result of a hypothesis-driven data analysis is not the discovery of new information. A number of methods have been published that predict a known protein function on the basis of protein structure information. Initially, the research work focused on global fold recognition [HS96][WR97][MB99][KH04][HPS+03][AZP+05] to identify evolutionarily distant, but structurally conserved homologues. Once a match is found, functional annotations are transferred from the target to the query. Another, more specific, approach focuses on the search for matching local substructures in the proteins. The rationale is that a biological function can be mapped to a particular residue configuration in the protein, whose function is independent of the global fold of the structure. One obvious approach was to design structure templates, which contain all the essential residues for a biological function. Several specific types of sites or motifs have been studied in detail to capture metal binding sites [Glu91], the catalytic triad of the serine proteases [FWLN94][WBT97], and binding sites for anions such as sulphate and phosphate [Cha93][CB94]. Computer-assisted methods were subsequently developed to help experts design templates by analysing motifs over large sets of proteins corresponding to active sites [APG+94][Rus98][SSR03][Kle99][FS98][FGS98][WBT97][BT03][PB06], surface patches or clefts [Las95][KJ94][LEW98][SPNW04][BFL04], or structural binding site locations [GPP+03][KN03].

2.2.2 Discovery-driven data mining

The key feature of discovery-driven data mining is the search for common characteristics (patterns) in the data without providing any domain knowledge. More specifically, the target is mathematically defined and the system aims to identify over-representations, data variations, or singularities in the dataset. Hence discovery-driven data mining can deliver novel information, while the biological significance of the result is not trivial.

One important aspect in identifying residue interactions in protein structures is the consideration of contextual information, such as interaction distance, chemical environment, and evolutionary conservation, in the data mining algorithm. The systems called ET/MA [CFK+05] and ConSurf use evolutionary information in combination with structural and chemical data, in order to highlight regions of local structure with functional importance. In contrast, the systems PINTS [Rus98][SSR03] and SIDEMINE [Old02] find patterns within the distribution of a non-redundant structure set by using solely a mathematical model of interactions. One critical issue in the development of these data mining methods was the improvement of the signal/noise ratio. In order to boost the signal frequency, two structural features are merged if one is biologically equivalent to the other. While the analysis showed that the mined output contained biologically valid data, the result incurs some bias, because biological knowledge was introduced.

2.3 Biomedical literature mining

Biomedical text mining extracts information from text for integration into biological databases. Due to the complexity of natural language, text processing involves structuring the text input by means of parsing and the annotation of some linguistic features, e.g. part-of-speech tags. The majority of biological text analysis is concerned with the extraction of explicitly stated facts from text; a task referred to as biological information extraction [Hob02]. Biomedical text mining processes typically consist of two main analysis steps: biological entity recognition and biological relation extraction.

The vast number of published biomedical articles contains phenomenological data on proteins, such as their molecular function. The information is encoded in unstructured text, and mining it requires different levels of complexity. There are several levels of text mining challenges in extracting functional annotation: the identification of mutations [LHC07][WK07][BW05][RSMA+04][HLC04] or genetic sequences [MG03], the identification of gene or protein names [RSAG+08][PJYLRS08][TMA08][Fuk98] and chemical entities [CMR06], the extraction of annotations of molecular function [GJYLRS08][RSKA+07][DS05][KNT05][GDAW03][HNR+05], and the identification of semantic relations between the biological entities [BLK+08][LCM03][SB06].

2.3.1 Biological entity recognition

The process of entity recognition (ER) can be split into three parts: locating the mentioned entity in text, classifying the entity into a predefined category, and normalising the entity by referencing an entry in a database. Biological entities are often ambiguous in terms of their boundaries and categories. Probably the most challenging task is the correct identification of protein or gene names. For example, ”hunchback” is a protein in Drosophila, while it is also a general English term. Furthermore, protein names consist mostly of multiple words, e.g. ”Rho-like protein” or ”HIV-1 envelope glycoprotein gp120”. An ER system needs to identify all the constituents of a protein name in order to relate the detected entity to its reference entry in a database. The BioCreAtIvE challenge addressed this problem with the 1B subtask; the target is the identification of protein/gene names in text, and the annotation of their correct gene identifier. Various solutions were published, ranging from rule-based methods [HFM+05][TW02][Fuk98] to machine learning approaches [CMP05]. The developed methods are, in general, reusable for any other biological entity recognition or terminology identification problem.

Work has also been published that focuses on the extraction of protein point mutations [RSMA+04][HLC04][BW05][LHC07][YLPV07], which is one category of protein residue terminology. Other categories are residue sequences and residue interaction pairs. The most widely adopted method to identify these terms is the design of regular expression patterns.
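A regular expression approach of this kind can be sketched as follows. The two patterns are illustrative assumptions and not the patterns used by any of the cited systems: one matches a three-letter residue code followed by a position (e.g. His48), the other a point mutation in one-letter notation. The mention "A123T" in the test sentence is invented.

```python
import re

THREE_LETTER = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)
ONE_LETTER = "ACDEFGHIKLMNPQRSTVWY"

# Residue mention such as "His48", "Cys-325" or "Cys 325".
RESIDUE = re.compile(rf"\b({THREE_LETTER})[- ]?(\d+)\b")
# Point mutation in one-letter notation such as "A123T".
MUTATION = re.compile(rf"\b([{ONE_LETTER}])(\d+)([{ONE_LETTER}])\b")


def find_residues(text):
    """Return (amino acid, position) pairs for three-letter residue mentions."""
    return [(m.group(1), int(m.group(2))) for m in RESIDUE.finditer(text)]


def find_mutations(text):
    """Return (wild type, position, mutant) triples for one-letter mutations."""
    return [(m.group(1), int(m.group(2)), m.group(3))
            for m in MUTATION.finditer(text)]


sentence = ("This reactive cysteinyl residue was identified as Cys325; "
            "mutations of His48 and the substitution A123T were also studied.")
print(find_residues(sentence))
print(find_mutations(sentence))
```

Such patterns miss mentions written in full, e.g. "cysteine 298", and the one-letter pattern can also match strings that are not mutations, which illustrates why several pattern variants and additional filtering are usually needed.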

2.3.2 Biological relation extraction

Relation extraction (RE) aims to find associations between entities, or between an entity and a term, within a text phrase. One objective in biomedical information extraction is the mining of biological facts from text. An example of a biological fact is the semantic relation between two biological entities, such as a protein-protein interaction [TOT04].

Until now, three strategies have been investigated for biological relation extraction: co-occurrence based analysis [LC05][SB05], pattern-based approaches [HZH+04][LCM03], and machine learning based methods [BM05][BM06]. The common limitation of all of these extraction systems is that only the relation targets, e.g. the proteins within a protein-protein interaction, are extracted. Contextual information that would describe or explain the association of the entities is not considered in the extraction at all. Within the information extraction community, a consensus has been reached that deeper analysis of sentence structures is required in order to adequately acquire biomolecular relations from text [WSC04].

With respect to biological relation extraction, two classes of syntactical parsers have been studied. The first is the shallow parsing technique, which aims at detecting the main constituents of a sentence without determining the complete syntactical structure. Results were published where protein-protein interactions [KNT05] and general biological entity relations [LCM03] were extracted based on shallow parsing. The second class of syntactical parser is the full parser, which attempts a deep analysis of the syntactical structure of a sentence. Several systems have been reported [NED03][FKY+01] that utilise full parsing for relation extraction from biomedical literature. One interesting full parser is ENJU [YMTT05][MT05], a so-called head-driven phrase structure grammar (HPSG) parser, which identifies predicate-argument structures (PAS) in a sentence. The use of PAS as a template for biomolecular relation extraction was first reported in [TOT04][YMTT05]. Recently, two proposition banks were reported that are designed to capture relations in molecular biology: PASBio [WSC04] and BioProp [TCS+07].

Within this work, there are two types of semantic relations to be extracted. The first is the residue-protein association. The system called MEMA [RSMA+04] uses a word distance metric and associates the residue-protein pairs with the smallest word distance. Another approach is to look up valid associations between a residue and a protein in the context of a predetermined association of a protein and an organism. Three systems have been reported that adopt this approach: MuteXt [HLC04], MutationMiner [BW05], and MutationGraB [LHC07].
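The word distance idea can be sketched in a few lines. This is an illustration of the general principle only, not the MEMA implementation: the tokeniser, the function `nearest_protein`, the restriction to single-token names and the example sentence are all assumptions made for the sketch.

```python
def tokenize(sentence):
    """Very simple tokeniser: strips commas and full stops, splits on whitespace."""
    return sentence.replace(",", " ").replace(".", " ").split()


def nearest_protein(tokens, residue, proteins):
    """Return the protein whose mention is closest (in tokens) to the residue mention.

    Only single-token names are handled; returns None if nothing is found.
    """
    residue_positions = [i for i, t in enumerate(tokens) if t == residue]
    best = None  # (distance, protein)
    for protein in proteins:
        for j, token in enumerate(tokens):
            if token != protein:
                continue
            for i in residue_positions:
                distance = abs(i - j)
                if best is None or distance < best[0]:
                    best = (distance, protein)
    return None if best is None else best[1]


# Invented sentence, used only to exercise the function.
sentence = ("Mutation of Tyr1007 in Jak2 abolished kinase activity, "
            "whereas p53 was unaffected.")
tokens = tokenize(sentence)
print(nearest_protein(tokens, "Tyr1007", ["Jak2", "p53"]))
```

Here the residue mention lies two tokens away from "Jak2" and seven tokens away from "p53", so the residue is associated with Jak2. A pure distance criterion ignores sentence structure, which is the limitation that motivates the parser-based approaches described above.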
The other semantic relation to be extracted in this work is the association between a residue entity and the description of its function. The systems MuteXt [HLC04], MEMA [RSMA+04], MutationMiner [BW05], and MutationGraB [LHC07] are all dedicated to the extraction of point mutations, but provide no extraction of functional annotation. In a recent publication [WK07], an ontological model was proposed that should hold information extracted from MutationMiner as well as point mutation annotations. However, the authors did not provide any results of feature extraction, nor was a strategy proposed.

2.4 Conclusion

In this chapter, I have reviewed some of the most relevant data resources and research works in the field of protein structure data mining and text mining. Some of the data resources are used in this thesis. In the following, I will present the extraction systems I have developed during my PhD.

Chapter 3

Mining residue interactions as triads from PDB

In this chapter, I present a novel approach to mining 3D patterns from protein structures. More specifically, a pattern is defined as the irreducible interaction of a chemical and spatial configuration of residues. The goal is to identify new information from a non-redundant dataset on the basis of solely mathematical targets. The mined 3D patterns represent predictions of functional sites in proteins.

3.1 Algorithms

The novelty of the presented 3D pattern mining approach lies in the classification of residue triplets into one of four interaction classes. The idea of analysing side chain interactions within a residue triplet is based on the work of [Old02], while the classification of residue interactions relies on the methodology developed by [JB04]. The developed data mining method consists of three processing steps: structural feature extraction, detection of significant configurations as interactions, and grouping and selection of frequent configurations. Figure 3.1 illustrates the procedures of the entire protein structure data mining system developed in this thesis.

Figure 3.1: Overview of processes and evaluation methods of the developed 3D pattern identification system.

3.1.1 Structural feature extraction

Theory

Residue triplet as spatial pattern unit. The presented protein structure data mining algorithm aims to identify significant interactions of residues within a triplet configuration. The rationale for analysing residue triplets is described in the following. In order to form a functional site in a protein structure, residues need to be physically in close contact. In other words, there exists a mutual dependency or interaction among the residues. The interaction can be studied on a two-residue basis (doublet 3D pattern). However, given the size of the structure data, the probability of observing any given two-residue configuration is too high for it to be detected as specific. Hence, the signal/noise ratio is the reason why a two-residue 3D pattern is not the target of protein structure data mining [Old02].

A two-residue contact is completely defined by a scalar property, while a three-residue contact is defined by vectors. Consequently, a three-residue constellation encodes much more information. This makes information theory based methods tractable for finding conserved residue interactions as signals.

In reality, functional sites can be composed of more than three residues, e.g. various metal binding sites use four coordinating cysteine residues. However, data sparseness and the mathematical complexity [CL64][Sin04] make modelling interactions of four or more residues infeasible. In principle, the more variables are introduced in modelling residue interactions, the more specific the data mining becomes. It should be noted that the identification of N-body interactions of residues can be solved with a combinatorial approach. Two triplets are combined if they have two out of three residues in common [Old02]. This approach was adopted in this study to demonstrate that larger interaction configurations are extractable. However, this investigation concentrates mainly on the identification of three-residue interactions. The assumption is that if the output of the data mining provides valid results, the approach is justified and more complex residue configurations may inherit this property.

Side chain interaction model. The determination of residue interactions requires a transformation of a full-atom model into a simpler representation. This is because a mathematical model that describes all combinations of atom interactions of two residues would be too complex. The solution is to replace the all-atom structure model with a coarse-grained model, by reducing each residue to a single point. In principle, a residue point can be calculated either as the centre of mass or as the geometric centre (centroid). Each representation can be calculated from main chain atoms, main and side chain atoms, or side chain atoms only.

The focus in this study is on the side chain interactions within a residue triplet configuration. For this reason, a protein structure is represented as a point spread of side chain centroids.
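The reduction of a residue to its side chain centroid can be illustrated with a minimal sketch. The input format (a list of atom tuples), the set of backbone atom names, and the fallback to the C-alpha atom for residues without side chain atoms are assumptions made for illustration; they are not taken from the implementation used in this study.

```python
from statistics import fmean

# Backbone atom names in PDB convention; all other atoms count as side chain.
BACKBONE = {"N", "CA", "C", "O", "OXT"}


def side_chain_centroid(atoms):
    """Return the geometric centre of the side chain atoms of one residue.

    `atoms` is a list of (atom_name, x, y, z) tuples (hypothetical format).
    """
    side = [(x, y, z) for name, x, y, z in atoms if name not in BACKBONE]
    if not side:
        # Assumption: a residue without side chain atoms (e.g. glycine)
        # is represented by its C-alpha position.
        side = [(x, y, z) for name, x, y, z in atoms if name == "CA"]
    if not side:
        raise ValueError("residue has neither side chain atoms nor a CA atom")
    xs, ys, zs = zip(*side)
    return (fmean(xs), fmean(ys), fmean(zs))


# Toy coordinates, not taken from a real structure.
serine = [("N", 0.0, 0.0, 0.0), ("CA", 1.0, 0.0, 0.0), ("C", 2.0, 0.0, 0.0),
          ("O", 2.0, 1.0, 0.0), ("CB", 1.0, 1.0, 0.0), ("OG", 3.0, 3.0, 0.0)]
glycine = serine[:4]

serine_point = side_chain_centroid(serine)    # centre of CB and OG
glycine_point = side_chain_centroid(glycine)  # falls back to CA
```

A centre-of-mass variant would weight each coordinate by the atomic mass instead of taking the plain mean.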

Protein structure triangulation. The extraction of residue triplets from a protein is based on triangulation of structures. Here, structures are triangulated on the basis of three criteria. The first is the compositional constraint. Each residue in a triplet must be one of the 20 natural amino acids, while hetero atoms are excluded. One prominent reason is that the dataset does not contain enough examples of residue-hetero atom interactions to support a statistical analysis.

The second condition of triplet extraction requires that none of the residues are direct neighbours in the protein sequence. The assumption made here is that covalently bonded residues are more likely to be next to each other in space than any two residues that are not bonded. Similarly, the probability of finding three connected residues in space is higher than that of finding an unconnected triplet of residues. Consequently, the distribution of interacting residues in space would be over-represented. The definition of residue neighbourhood affects the data mining result, e.g. by requiring a pair interaction in the triplet to have a sequence distance of more than one residue, patches of residues on one side of a beta-sheet may not be discovered. While tuning this parameter can modify the result of the data mining, the objective here is to discover new knowledge from the input dataset by providing as little biological information as possible.

The last criterion in triplet extraction is concerned with the geometrical property of a triplet. The Euclidean distances between the residues must fulfil the triangle inequality, while only two interaction distances of less than 6 Å were allowed. Although the interaction distance threshold is based on an empirical study of a number of protein structures, this value may not be adequate, because it would favour close contacts of residue pairs with large side chains. For example, the pair interaction of two tryptophans may have a centroid distance near the maximal allowed interaction distance, while the contacting atoms are actually very close. The alternative is to set up a threshold system for residue pairs or triplets that depends on the types of residues. Although this approach was not studied in this thesis, it could improve the developed algorithm in future work.

Yet another approach to selecting residue interactions from a protein structure is based on the analysis of surface contacts of the side chain groups. While not all functional sites require their constituents to be in physicochemical contact (e.g. a metal binding site consists of metal ion coordinating residues without physical contacts), a protein binding site is an example where residues of two different proteins are in non-covalent interaction. However, the presented data mining approach aims at an unbiased search for residue interactions in a dataset of monomeric protein structure domains, and a surface-based selection criterion would therefore bias the analysis.

Implementation

A coarse-grained representation is used in this protein structure analysis. From a full-atom model of a protein structure, the centroid position of each protein residue was calculated on the basis of its side chain atoms. The resulting simplified structure model is then triangulated based on three criteria: (1) each residue in a triplet must be one of the 20 natural amino acids; (2) pairs of residues in the triplet must not be direct neighbours with respect to their protein sequence position; and (3) only two pairs of residues can have a maximal interaction distance of 6 Å, and only one pair an interaction distance of less than 12 Å.

For the interaction analysis it is necessary to define a hash table based on the integer values of the centroid distances and the names of the residues. The integer value of a distance is calculated by dividing the measured distance by a precision value (hash precision), which was set at +/- 0.5 Å. Given a 3-body with

\[ \mathrm{trip} = (A, B, C), \tag{3.1} \]

a three-dimensional hash table is defined as

\[ HT(A, B, C) = \text{3D hash bin}[i][j][k], \tag{3.2} \]

where i, j, and k are the integer values of the measured distances between the spatial coordinates of two residues. The integer values are given by the equations

\[
\begin{aligned}
i &= \mathrm{INT}(\mathrm{dist}(A, B) / \text{hash precision}) \\
j &= \mathrm{INT}(\mathrm{dist}(B, C) / \text{hash precision}) \\
k &= \mathrm{INT}(\mathrm{dist}(A, C) / \text{hash precision}).
\end{aligned}
\tag{3.3}
\]

For a detailed definition of the implemented hash table, cf. [Old01].
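The three triangulation criteria and the hash key of equations 3.1 to 3.3 can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of [Old01]: the residue tuple format and the function names are hypothetical, criterion (3) is read as "at least two pair distances of at most 6 Å and the remaining one below 12 Å", and the hash precision is taken at face value as a bin width of 0.5 Å.

```python
from itertools import combinations
from math import dist

NATURAL = {"ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
           "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"}


def triangulate(residues, d_close=6.0, d_far=12.0):
    """Yield residue triplets that satisfy the three extraction criteria.

    `residues` is a list of (sequence_position, name, centroid) tuples.
    """
    candidates = [r for r in residues if r[1] in NATURAL]      # criterion 1
    for a, b, c in combinations(candidates, 3):
        pairs = ((a, b), (b, c), (a, c))
        if any(abs(p[0] - q[0]) <= 1 for p, q in pairs):       # criterion 2
            continue
        d = sorted(dist(p[2], q[2]) for p, q in pairs)
        if d[1] <= d_close and d[2] < d_far:                   # criterion 3 (assumed reading)
            yield a, b, c


def hash_key(a, b, c, hash_precision=0.5):
    """Residue names plus integer distance bins i, j, k (cf. equation 3.3)."""
    i = int(dist(a[2], b[2]) / hash_precision)
    j = int(dist(b[2], c[2]) / hash_precision)
    k = int(dist(a[2], c[2]) / hash_precision)
    return (a[1], b[1], c[1], i, j, k)


# Toy structure: HOH is a hetero group, ALA 2 is a sequence neighbour of HIS 1.
toy = [(1, "HIS", (0.0, 0.0, 0.0)), (2, "ALA", (1.0, 0.0, 0.0)),
       (5, "ASP", (4.0, 0.0, 0.0)), (9, "SER", (4.0, 3.0, 0.0)),
       (12, "HOH", (2.0, 1.0, 0.0))]
triplets = list(triangulate(toy))
first_key = hash_key(*triplets[0])
```

The sketch keeps the residues in input order when building the key; a production hash table would additionally need a canonical ordering of A, B and C so that equivalent triplets map to the same bin.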

3.1.2 Detection of significant configurations as interactions

Theory

The method for residue interaction detection relies on the comparison of two probabilistic models: the reductionistic part-to-whole approximation model, and the holistic reference model. The part-to-whole approximation is modelled with a collection of marginal distributions defined by subsets of the variables. Formally, a 3-body consists of three variables (cf. equation 3.1). To verify whether the probability of a triplet, P(A, B, C), can be factorised, we attempt to approximate it by using all attainable marginals

\[ M = \{P(A, B), P(A, C), P(B, C), P(A), P(B), P(C)\}. \tag{3.4} \]

If the approximation fits the data, i.e. the probability of finding a particular triplet is explained by the approximation model, then there is no evidence for an interaction. In other words, a significant interaction is given when the two models are significantly different. The difference between two joint probability density functions O and M is measured by the Kullback-Leibler divergence

\[ D(O \| M) = \sum_i O(i) \log\left(\frac{O(i)}{M(i)}\right). \tag{3.5} \]

In this context O usually refers to the observed probability or the reference model, while M is the approximation model. The null hypothesis in testing the interaction model is that the part-to-whole approximation matches the observed data. The alternative one is that the approximation does not fit and that there is an interaction. Three cases can be listed:

\[
\begin{aligned}
D(O \| M) > 0 &: \text{there is a pattern among } k \text{ attributes} \\
D(O \| M) = 0 &: \text{there is no pattern of order } k \\
D(O \| M) < 0 &: \text{there is redundancy among the parts.}
\end{aligned}
\tag{3.6}
\]
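As a minimal sketch, the divergence of equation 3.5 can be computed for two discrete distributions stored as dictionaries. The dictionary representation and the function name are assumptions made for illustration; the natural logarithm is used, so the result is in nats.

```python
from math import log


def kl_divergence(o, m):
    """D(O||M) = sum_i O(i) * log(O(i) / M(i)) for dictionaries {outcome: value}."""
    total = 0.0
    for outcome, o_i in o.items():
        if o_i == 0.0:
            continue  # the term 0 * log(0 / m) is taken as 0 by convention
        m_i = m.get(outcome, 0.0)
        if m_i <= 0.0:
            raise ValueError(f"M assigns no mass to observed outcome {outcome!r}")
        total += o_i * log(o_i / m_i)
    return total


observed = {"x": 0.5, "y": 0.5}
model = {"x": 0.25, "y": 0.75}

same = kl_divergence(observed, observed)   # identical models
diff = kl_divergence(observed, model)      # 0.5*log(2) + 0.5*log(2/3)
```

The sketch does not renormalise M. A negative value, the third case listed in equation 3.6, can therefore only arise when M is a part-to-whole approximation whose values do not sum to one.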

Figure 3.2: Four classes of interactions within a 3-body. A circle represents a protein residue, and an intersection represents an interaction between two residues. k=0: no-interaction; k=1: one-way or one-pair interaction; k=2: two-way or two-pair interactions; k=3: three-way or three-pair interactions.

Within a 3-body system, four different configurations of interactions can be defined (cf. figure 3.2): no-interaction, one-pair interaction, two-pair interactions, and three-pair interactions. For each of these configurations it is possible to formulate a part-to-whole approximation model, i.e. the interaction can be factorised. In the case of no-interaction, the probability of the observable is expected to be estimated by its singlet probabilities

\[ k = 0: \quad \hat{P}_0(A, B, C) = P(A)P(B)P(C), \tag{3.7} \]

whereas in a system with one-pair interaction, two variables are dependent on each other. Consequently, within a 3-body state there are three isoforms of one-pair interactions:

\[
k = 1: \quad
\begin{cases}
\hat{P}_{1,1}(A, B, C) = \dfrac{P(A, B)P(C)}{P(A)P(B)} \\[2ex]
\hat{P}_{1,2}(A, B, C) = \dfrac{P(A, C)P(B)}{P(A)P(C)} \\[2ex]
\hat{P}_{1,3}(A, B, C) = \dfrac{P(B, C)P(A)}{P(B)P(C)}
\end{cases}
\tag{3.8}
\]

There are two forms of three-variable interactions, but with different dependencies: two-pair interaction (k=2) and three-pair interaction (k=3). These interactions represent the target of 3D pattern mining. In a two-pair interaction, two pairs of variables are dependent on each other, while sharing a common attribute. For example, given that A interacts with B, and B interacts with C, there is no clear observation that A also interacts with C. Three isoforms are formulated for this interaction:

\[
k = 2: \quad
\begin{cases}
\hat{P}_{2,1}(A, B, C) = \dfrac{P(A, B)P(B, C)}{P(B)} \\[2ex]
\hat{P}_{2,2}(A, B, C) = \dfrac{P(A, C)P(A, B)}{P(A)} \\[2ex]
\hat{P}_{2,3}(A, B, C) = \dfrac{P(B, C)P(A, C)}{P(C)}
\end{cases}
\tag{3.9}
\]

In the case of a three-pair interaction, all three variables are dependent on each other, and the approximation model is defined as

\[ k = 3: \quad \hat{P}_{3,1}(A, B, C) = \frac{P(A, B)P(B, C)P(A, C)}{P(A)P(B)P(C)}. \tag{3.10} \]

If the state is disturbed, e.g. by exchanging one variable, a partial interaction will not be observed. In respect of protein biology, this could mean that a residue mutation abolishes an intramolecular stabilising network. However, as this does not provide an evolutionary advantage the conservation of this residue is likely to be promoted and can be detected as a recurrent structural feature. The determined sets of two-way (k=2) and three-way (k=3) interactions are the targets in this data mining.
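To make the part-to-whole approximations concrete, the following sketch computes the k=0 model (equation 3.7), the first k=2 isoform with B as the shared residue (equation 3.9), and the k=3 model (equation 3.10) from a joint distribution. The dictionary representation and the function names are assumptions for illustration, and the approximations are evaluated only on outcomes present in the joint distribution.

```python
from collections import defaultdict


def marginal(joint, axes):
    """Sum a joint distribution {(a, b, c): p} down to the given axes."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[ax] for ax in axes)] += p
    return dict(out)


def approximations(joint):
    """Return the k=0, k=2 (shared B) and k=3 approximations of `joint`."""
    pa, pb, pc = marginal(joint, (0,)), marginal(joint, (1,)), marginal(joint, (2,))
    pab, pbc, pac = marginal(joint, (0, 1)), marginal(joint, (1, 2)), marginal(joint, (0, 2))
    k0, k2, k3 = {}, {}, {}
    for a, b, c in joint:
        singlets = pa[(a,)] * pb[(b,)] * pc[(c,)]
        k0[(a, b, c)] = singlets
        k2[(a, b, c)] = pab[(a, b)] * pbc[(b, c)] / pb[(b,)]
        k3[(a, b, c)] = pab[(a, b)] * pbc[(b, c)] * pac[(a, c)] / singlets
    return k0, k2, k3


# Toy chain A - B - C: B tends to copy A, C tends to copy B, so A and C
# interact only through B (a two-pair interaction).
chain = {(a, b, c): 0.5 * (0.8 if b == a else 0.2) * (0.9 if c == b else 0.1)
         for a in (0, 1) for b in (0, 1) for c in (0, 1)}
k0, k2, k3 = approximations(chain)
```

For this toy chain the k=2 model with shared B reproduces the joint distribution exactly, whereas the k=0 model does not; this is the sense in which a configuration "can be factorised" by one model but not by a simpler one.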

Implementation

Triplets of residues are classified into one of the four defined interaction configurations. The classification is based on a non-parametric cross-validation sampling method described by [JB04]. A significant interaction is given when the two models O and M are significantly different. Because the data can be regarded as a sample of a multinomial distribution, the representativeness of the approximation model can be tested by the self-loss function $D(P' \| P)$. Here, $P'$ and $P$ are the probability distributions from two samples of equal size. The weight of evidence for accepting the null hypothesis, i.e. the approximation model, can be estimated by $p_{cv}$-values from a 2-fold cross-validation. For each random sampling the dataset is partitioned into two equally sized subsets: one training set and one test set. From these subsets two joint probability distribution functions, $P'$ and $P$, are determined from the training and test set, respectively. The marginal distributions, singlets and doublets, are determined from $P'$ to construct the part-to-whole approximation $\hat{P}'$. The $p_{cv}$-value is defined as the probability that the self-loss is greater than or equal to the approximation loss

\[ p_{cv}\{D(P' \| P) \geq D(\hat{P}' \| P)\}. \tag{3.11} \]

On the basis of p_cv-values, an interaction is discovered if p_cv ≤ α, and an interaction is rejected when p_cv > α. High threshold values of α, e.g. 0.95, bias towards an interaction and risk overfitting, while lower values, e.g. 0.05, move the bias towards the no-interaction model and risk underfitting. In this study, a reductionistic bias was chosen, preferring the simpler no-interaction model, by selecting α = 0.05. The value of α used is based on the research work of [JB04].
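The cross-validation procedure can be sketched as below. This is not the implementation of [JB04]: the add-one smoothing (used to avoid division by zero), the order of the arguments in the two divergences, and all function names are assumptions. The k=0 model of equation 3.7 serves as the example approximation.

```python
import random
from collections import Counter, defaultdict
from itertools import product
from math import log


def distribution(sample, outcomes):
    """Relative frequencies with add-one smoothing (assumption, avoids log(0))."""
    counts = Counter(sample)
    n = len(sample) + len(outcomes)
    return {o: (counts[o] + 1) / n for o in outcomes}


def kl(o, m):
    return sum(p * log(p / m[x]) for x, p in o.items() if p > 0)


def independence(joint):
    """k=0 part-to-whole model: product of the three singlet marginals."""
    pa, pb, pc = defaultdict(float), defaultdict(float), defaultdict(float)
    for (a, b, c), p in joint.items():
        pa[a] += p
        pb[b] += p
        pc[c] += p
    return {(a, b, c): pa[a] * pb[b] * pc[c] for (a, b, c) in joint}


def p_cv(data, outcomes, approximate, iterations=100, seed=0):
    """Fraction of random 2-fold splits with self-loss >= approximation loss."""
    rng = random.Random(seed)
    data = list(data)
    half = len(data) // 2
    hits = 0
    for _ in range(iterations):
        rng.shuffle(data)
        p_train = distribution(data[:half], outcomes)          # P'
        p_test = distribution(data[half:2 * half], outcomes)   # P
        p_hat = approximate(p_train)                           # approximation built from P'
        if kl(p_train, p_test) >= kl(p_hat, p_test):
            hits += 1
    return hits / iterations


outcomes = list(product((0, 1), repeat=3))
# Toy data with a strong three-way dependency: the third value is a XOR b.
xor_data = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)] * 100
p_xor = p_cv(xor_data, outcomes, independence, iterations=20, seed=1)
interaction_found = p_xor <= 0.05
```

For the toy XOR data the independence model fits far worse than a second sample of the same data, so the self-loss never reaches the approximation loss and the interaction is accepted at α = 0.05.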

3.1.3 Grouping and selecting frequent configurations

Theory

The result of data mining protein structures can be a large set of 3D patterns. The data need to be clustered in order to select the most frequent patterns. The assumption behind data clustering is that residue configurations in protein structures are unlikely to be absolute and static. By grouping spatially similar configurations, the geometrical variation of patterns can be compensated for and their frequencies improved.

Implementation

The objective in this section is to identify frequent groups of geometrically similar triplets with identical chemical configurations. Data clustering was done in two steps. For each residue triplet combination, the initial step is to group geometrically similar patterns and then count the combined frequencies

\[ G(HT(i, j, k)) = \sum_{a=i-1}^{i+1} \sum_{b=j-1}^{j+1} \sum_{c=k-1}^{k+1} HT(a, b, c), \tag{3.12} \]

where HT is a hash table of the residue triplets (cf. equation 3.2). Then local geometrical peaks were searched for by comparing the frequencies of the grouped triplets

\[ \arg\max: \quad G(HT(a, b, c)) < G(HT(i, j, k)), \tag{3.13} \]

where HT(a, b, c) ≠ HT(i, j, k) with a ∈ {i − 1, i, i + 1}, b ∈ {j − 1, j, j + 1} and c ∈ {k − 1, k, k + 1}.

Dataset    PDB IDs   Domains   Domain definition   Data selection         Properties
OLDFIELD   1,442     2,320     mathematical        Sequence alignment     Homologous structural features of divergent proteins.
SCOP40     3,449     4,734     human expert        Structure comparison   Convergent structural features of divergent proteins.

Figure 3.3: Non-redundant structure sets for 3D pattern mining. The dataset OLDFIELD is based on the publication of [Old02], and SCOP40 was obtained from the ASTRAL Compendium [BKL00]. The size of the datasets, the method of data selection, and key properties are summarised.

The second step in data clustering finds subgroups of triplets from a local peak, based on an all-atom structure alignment. The determined clusters are ranked by their probability scores, which are defined as:

\[ P(\mathrm{cluster}) = \frac{\#\text{cluster members}}{\#\text{peak members}}. \tag{3.14} \]

On the basis of P(cluster), a cluster of residue interactions is selected if P(cluster) ≥ τ. In this study, the threshold τ for selecting a cluster was set to 0.66.
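The grouping step, the local peak search and the cluster selection can be sketched as follows for the hash table of a single residue triplet combination. The dictionary-based hash table and the function names are hypothetical, and the peak search compares a bin only with its occupied neighbours, which is an assumption about how equation 3.13 is applied.

```python
from itertools import product

OFFSETS = list(product((-1, 0, 1), repeat=3))  # the 27 bins of equation 3.12


def grouped(ht, key):
    """G(HT(i, j, k)): summed frequency of a bin and its 26 neighbours."""
    i, j, k = key
    return sum(ht.get((i + di, j + dj, k + dk), 0) for di, dj, dk in OFFSETS)


def local_peaks(ht):
    """Occupied bins whose grouped frequency exceeds that of all occupied neighbours."""
    peaks = []
    for key in ht:
        i, j, k = key
        neighbours = [(i + di, j + dj, k + dk)
                      for di, dj, dk in OFFSETS if (di, dj, dk) != (0, 0, 0)]
        if all(grouped(ht, n) < grouped(ht, key) for n in neighbours if n in ht):
            peaks.append(key)
    return peaks


def select_cluster(cluster_members, peak_members, tau=0.66):
    """Apply the threshold on P(cluster) from equation 3.14."""
    return cluster_members / peak_members >= tau


# Toy hash table {(i, j, k): frequency} for one residue triplet combination.
ht = {(5, 5, 5): 10, (5, 5, 6): 3, (5, 5, 4): 2, (9, 9, 9): 1}
peak_sum = grouped(ht, (5, 5, 5))
peaks = local_peaks(ht)
```

In the toy table the bins around (5, 5, 5) merge into one peak of 15 members, while the isolated bin (9, 9, 9) forms a peak of its own; a cluster holding 7 of 10 peak members would pass the threshold τ = 0.66, one holding 6 of 10 would not.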

3.2 Analysing available non-redundant protein structure sets

The significance of this data mining result is greatly dependent on the representativeness of the data. For the frequencies of structural features to be true, they would have to be taken from protein structures of all naturally occurring protein folds. However, such a data resource is not available at present (cf. section 2.1.1). This effectively means that protein structure data mining is bound by the availability of fold examples. While, from a bioinformatics point of view, little can be done to improve the coverage of the fold space, a number of efforts have been dedicated to the compilation of non-redundant datasets from PDB.

The results in this thesis are based on the study of two non-redundant protein structure sets: OLDFIELD [Old02] and SCOP40 [HMBC97][BKL00]. Figure 3.3 summarises key features of each dataset. The major distinction between the two datasets lies in the definition of a non-redundant dataset. The purpose in compiling OLDFIELD is to create a dataset that allows the detection of interesting structural equivalence from the non-specific structural features. The primary data selection is in sequence space. The resulting dataset contains only sequentially dissimilar protein fragments, while common fold motifs are preserved. This allows the detection of homologous structural components of divergent proteins. In contrast, SCOP represents a biased view of protein data by defining classes in structure space. The assignment of a novel protein to a class is based on structure and sequence comparisons. SCOP40 is the subset of SCOP in which sequentially divergent proteins with convergent structural features are retained. Because the classification contains structurally divergent proteins, any identified recurrent structural feature in SCOP40 is an indication of convergent evolution.

Another distinction between OLDFIELD and SCOP40 is the method of identifying domain structures. In OLDFIELD, protein fragmentation was done mathematically by analysing Cα distances [Old01], while in SCOP40 human experts were recruited to process a batch of protein structures. Both approaches have their advantages and caveats. On the one hand, an automatic structure domain identification system can deliver reproducible data, while the results may not be justified in some cases. On the other hand, expert-curated data represent a single precision view, but the information is difficult to reproduce as new data become available.

The difference between automatic and manual data selection is also reflected in the size of the datasets. In 2002, the compiled non-degenerate domain structure set from OLDFIELD listed 2,320 domain structures, corresponding to 1,442 PDB identifiers. In contrast, SCOP40 contained 4,734 domain structures determined from 3,449 PDB identifiers in the same year.

3.3 Evaluation methods

The presented 3D pattern identification system is a discovery-driven data mining solution. The assessment of performance is done on two levels: the study of parameter dependency (presented in this chapter), and the validation of the biological significance of the data (cf. chapter 4).

The effect of data-related parameters was studied by comparing the mined results from OLDFIELD and SCOP40. In the first part of the analysis, the distributions of extracted residue triplets were compared. Then the determined sets of k=2 and k=3 interactions were studied.

The developed data mining method is a three-step process, and the effect of algorithm-related parameters was studied on two levels. Although the developed data mining method is controlled by many different parameters, only the following key parameters were studied: the residue interaction distance, and the size of the cross-validation used to compute p-values. The effect of the interaction distance parameter was studied by varying the maximal distance between the centroids of residues. Three different distance settings were tested: 4 Å, 6 Å, and 8 Å. Repeated cross-validation sampling was used to determine confidence values for residue triplet classification. Various numbers of iterations were tested (from 100 to 1,500 in steps of 100) to study the effect on the size of the interaction datasets.

3.4 Results

3.4.1 Identification of residue interactions is dependent on data selection

The result of a data mining analysis is greatly dependent on the input dataset. The objective in this section is to study the effect of data-related parameters by comparing results from data mining on OLDFIELD and SCOP40.

With 590,255 unique triplet configurations in SCOP40 and 429,471 in OLDFIELD, the common set of triangulated triplets is 381,578 (cf. figure 3.4). Due to the difference in the probability distributions of the two datasets, the classification of residue interactions resulted in interaction classes of different sizes. A set analysis of the classification data shows that the classes have overlaps of different sizes (cf. figure 3.5). For example, OLDFIELD/k=3 and SCOP40/k=3 have a large common set of residue configurations, of around 89 per cent for OLDFIELD and 44 per cent for SCOP40. In contrast, the common set of k=2 interactions is much smaller, i.e. 21 per cent for OLDFIELD and 13 per cent for SCOP40. The analysis also found two proportions of non-agreed classifications (k2/k3 between OLDFIELD/SCOP40).

These results highlight the effect of data selection on the data mining result. A different probability distribution of residue triplets, singlets and doublets is the reason why certain residue configurations were classified as k=2 in one dataset and as k=3 in the other.

3.4.2 The interaction distance correlates with the distribution of residue triads

The extraction of residue configurations is controlled by the data representation, the feature extraction, and the feature selection method. Structural features were extracted by triangulation of a protein structure, which was modelled by a point spread of side chain centroids. The goal in this section is to study the effect of varying the interaction distance parameter. For this analysis the dataset OLDFIELD was used.

Table 3.1 summarises the sets of residue triplets determined with three different maximal interaction distances. With the change of the distance threshold, the number of extracted triplets and the probability distributions of the singlets and doublets change (data not shown). Consequently, the significance testing of residue interactions returns different results. It must be noted that a complete analysis with an 8 Å interaction distance was not done in this study.

In conclusion, the effect of varying the interaction distance on the triangulation output is in agreement with the expected result. While the frequencies of "small" triplet configurations are the same for increasing interaction distance thresholds, the calculated probabilities are different, because of the different distributions. This also affects the result of interaction classification.

Figure 3.4: Distribution analysis of extracted residue triplets. The determined residue triplet distribution from OLDFIELD is compared with SCOP40. The upper panel shows a set analysis of the extracted residue triplets (numbers are the unique counts of the residue configurations). The middle panel illustrates the frequency of each triplet (t) (represented as information, I(t)) from the set of triplets (T). For a better visualisation the difference of the distributions is measured by the Kullback-Leibler divergence (lower panel).

Figure 3.5: Comparison of extracted residue triplets based on their interaction type. The determined k=2 and k=3 classification sets from OLDFIELD and SCOP40 are compared by a set analysis. Due to the interaction classification (k=2 and k=3) there is no intersection of all four datasets.

                Triplets
Distance (Å)    Total        Unique       k=2      k=3
4               2,938        1,799        16       165
6               1,379,545    429,471      9,681    134,465
8               7,128,886    2,016,306    N/A      N/A

Table 3.1: Study on the effect of varying the interaction distance threshold in structure triangulation. The different sets of residue triplet configurations in OLDFIELD were obtained by using the interaction distance thresholds 4 Å, 6 Å, and 8 Å.

3.4.3 Interaction classification is sensitive to the size of cross-validation

Significance testing of residue interactions is a method for assigning confidence values to the classification of residue triplets. The p-values were calculated from a two-fold cross-validation with n iterations of random data sampling. Here, the effect of varying the number of iterations is studied. OLDFIELD is used as the dataset for this analysis.

Figure 3.6 shows the logarithmic dependency between the number of iterations and the size of the determined classification sets. Regression analysis indicates that the finite classification set was not reached after 1,500 iterations. The study of the classified residue interactions from each iteration revealed that the set from iteration i is always a subset of the set from iteration j with i < j.

In conclusion, the result of varying the number of iterations indicates that the classification sets are stable and reproducible. With an increasing number of iterations, the existing members of the determined sets are not altered, meaning that the classification result is reliable, but additional elements are identified.

3.5 Discussion

3D pattern identification is the result of a data mining method that finds recurrent structural features within a protein dataset. The developed analysis method consists of three major modules: triangulation of a protein structure, significance testing of residue interaction, and data clustering of the determined residue interactions.

Figure 3.6: The effect of varying the cross-validation sample size on significance testing of residue interaction. The diagram shows the increasing but converging number of determined residue triplet configurations with one-way, two-way, and three-way interactions at various iteration steps (from 100 to 1,500 in steps of 100) of a non-parametric cross-validation sampling.

Protein structure triangulation is the basis of collecting spatial configurations of residues. The definition of residue interaction is a complex task, because an amino acid consists of many atoms, many of which are candidate interaction partners. A coarse-grained model was used to overcome this problem, however, at the cost of redefining the interaction distance. Instead of measuring interaction distances between atoms of two different amino acids, the distance between the side chain centroids is used. The theoretical physicochemical interaction distance between two atoms cannot be transferred to measure the centroid-based side chain interactions. The upper bound of the interaction distance of 6 Å was determined from several visual inspections and measurements of residue configurations. The analysis shows that with d = 6 Å, various side chain rotamer configurations are captured, which may represent a physicochemical interaction. By reducing the interaction distance threshold, a bias towards tightly inert residue configurations is observed. Conversely, an increase in d results in a huge set of triplet combinations. Some of the larger triplets do not capture a 3-body interaction, but may be part of a four-body interaction, where the fourth residue is situated between all three residues. Although larger interaction states may reflect a complete picture of a structural unit, the primary aim here is to find local and adjacent interactions of residues.

The performance of correlation analysis based on hash tables is sensitive to positional errors, which typically translate into the computation of "wrong" hash bin indices. Consider the sample values a = 3.99, b = 4.01, and c = 4.99, where a is assigned to hash bin index i(a) = 1, while b and c are assigned to i(b) = i(c) = 2. The difference between a and b is actually smaller than that between b and c. The correlation analysis with these hashed data seems to be inadequate, although the "correct" hash bin is in the neighbourhood. A solution to this problem is to consider adjacent hash bins, i.e. a rectangular region, of the table [LW91].

The identification of an interaction class, e.g. a two-way interaction, is based on a probabilistic classification approach. Confidence values were assigned to the classification result by calculating p-values from non-parametric cross-validation sampling. Theoretically, the more sampling iterations are used, the more stable the calculated p-values become. At a certain point, the number of determined interacting residues should converge to some value. The implication of determining a stable p-value is the identification of a finite set of residue interactions. Within this study, the final set was not determined and, for practical reasons, the set after 100 iterations was used.

The output of extracted patterns depends on the distribution of structural features in the input dataset. The introduced algorithm is based on the assumption that there are significant trends of residue configurations in proteins if these interactions provide a significant functional or structural advantage. Obviously, we cannot expect that data mining on two differently defined data selections would deliver the same mining output. From a mathematical point of view, the results are still correct, because the algorithm is detecting recurrent residue configurations in the data.
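The hash bin boundary effect described in this discussion can be reproduced with a few lines. A bin width of 2 Å is assumed here only because it reproduces the bin indices quoted for the sample values; the neighbourhood lookup is a one-dimensional illustration of the adjacent-bin idea, not the method of [LW91].

```python
BIN_WIDTH = 2.0  # assumption: inferred from the quoted bin indices


def bin_index(distance, bin_width=BIN_WIDTH):
    """Integer hash bin of a distance."""
    return int(distance / bin_width)


def neighbouring_bins(index):
    """The bin itself and its two adjacent bins."""
    return (index - 1, index, index + 1)


a, b, c = 3.99, 4.01, 4.99
indices = [bin_index(x) for x in (a, b, c)]

# a and b differ by only 0.02 but fall into different bins; a lookup that
# also visits adjacent bins still finds b when starting from the bin of a.
b_found_from_a = bin_index(b) in neighbouring_bins(bin_index(a))
```

Visiting adjacent bins trades some specificity for robustness, since values up to one full bin width away are then treated as candidates.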

3.6 Conclusion

In this chapter, I have presented a novel data mining approach for the discovery of 3D patterns in protein structures. A pattern is a residue triplet with two- or three-way interaction of residues. The extraction of 3D patterns is not only dependent on algorithm-related parameters, but also on the data selection. The validity of the data mining approach is justified on the basis of knowing the limits and effects of data and parameters. In the following chapter, I will present the biological significance of the mined result.

Chapter 4

Prediction of functions for mined residue triads

In the previous chapter, a data mining approach was introduced that identifies recurrent interacting residues as triplets in protein structures. Assuming that a certain residue configuration is conserved in evolution if it provides a structural or functional advantage, the mined 3D pattern may represent a functional site in the protein. The objective in this chapter is to demonstrate the biological validity of the data-mined results by cross-validation with a reference database. I present two example cases of validated residue interactions. The first example represents the validation of a metal binding site, where the mined patterns represent either a homologous or a convergent structural feature. The second validation identifies the catalytic triad from the mined data. The analysis includes the search for a 4-body configuration of the catalytic triad (quartet), in order to find a previously reported conserved serine residue. The results presented in this chapter demonstrate the biological significance of the mined data and justify the data mining approach.

4.1 Evaluation methods

The biological significance of the mined 3D patterns is demonstrated by the rediscovery of known residue interactions. A systematic performance analysis, in terms of coverage and accuracy is not possible, because a test set with complete functional annotations of local residue interactions with biological function is not available. Therefore, various protein databases were used as references for cross-validations. The automatic cross-validation of metal binding sites is based on the comparison of the mined 3D patterns with a metal binding site database. Two reference databases were used and the results compared with each other: MSDsite [GDO+05] and MDB [CHR+02]. The identification of available metal binding sites in the input dataset considered only configurations with more than 2 residues. A hit was found, if all residues of a metal binding site were present in a protein structure. Likewise, a mined 3D pattern was identified as a metal binding site, if all residues of the pattern resemble a subset of a metal binding site. However, because a metal binding site can contain more than three residues, and the mined patterns can have two overlapping triplets, only identified metal binding sites were counted and not every matched pattern. The coverage is computed as:

\[ c_{\mathrm{coverage}} = \frac{\#\text{unique sites matched by all residues in a 3D pattern}}{\#\text{available sites in protein structure set}}. \tag{4.1} \]
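The matching rule and the coverage of equation 4.1 can be sketched as follows. The site identifiers and residue labels are hypothetical and do not correspond to entries of MSDsite or MDB; only the final line uses figures from this chapter (the MSDsite row of Table 4.2).

```python
def coverage(sites, patterns):
    """Fraction of reference sites matched by at least one mined pattern.

    A site counts as matched if all residues of some pattern form a subset
    of the site residues; each site is counted once, however many patterns hit it.
    """
    matched = {site_id for site_id, residues in sites.items()
               if any(pattern <= residues for pattern in patterns)}
    return len(matched) / len(sites)


# Hypothetical reference sites (more than two residues each) and mined triplets.
sites = {
    "site_1": {"ASP10", "HIS20", "HIS30", "HIS40"},
    "site_2": {"CYS5", "CYS8", "CYS25", "CYS28"},
    "site_3": {"GLU12", "HIS16", "HIS50"},
}
patterns = [
    {"ASP10", "HIS20", "HIS30"},   # subset of site_1
    {"HIS20", "HIS30", "HIS40"},   # second, overlapping hit on site_1
    {"SER1", "GLY2", "THR3"},      # matches no site
]
toy_coverage = coverage(sites, patterns)

# Coverage reported for the mined patterns against MSDsite: 85 of 567 sites.
msdsite_coverage = round(85 / 567, 2)
```

Counting sites rather than patterns prevents the two overlapping triplets on the first site from inflating the score.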

The result of metal binding site cross-validation is compared with the performance of SIDEMINE [Old02] extraction. Because such an experiment had not been performed with SIDEMINE before, the SIDEMINE extraction was run here. The cross-validation of a metal binding site is analogous to the identification of active sites in the dataset (cf. above). The identification of a convergent metal binding site was done by a manual search in the mined output from SCOP40. The protein structures of a found metal binding site pattern were analysed with respect to their SCOP classification identifiers.

Dataset    #Triangulated   Interaction   #Classified     #Clustered   #Pattern
           triplets        type          interactions    patterns     frequencies
OLDFIELD   429,471         k=2           9,681           925          5,697
                           k=3           134,465         1,007        11,957
SCOP40     590,255         k=2           15,455          765          927
                           k=3           269,683         2,019        2,361

Table 4.1: Summary of extracted data at each protein structure data mining step. The data mining was performed on OLDFIELD and SCOP40. The number of identified residue triplet interactions is given in "#Classified interactions", while the column "#Clustered patterns" indicates the number of unique residue interaction configurations after data clustering, and "#Pattern frequencies" is the total number of examples of the found residue interactions in the dataset.

The automatic cross-validation of catalytic residues was done by comparing residues from active site templates in CSA [PBT04]. The validation of a catalytic active site for all example protein structures was based on manual analysis. To test whether the mined result contains a second conserved serine residue in the catalytic triad (quartet) (Asp-His-Ser/Ser), larger residue configurations were constructed. The method for finding N-bodies is based on the algorithm of [Old02]: two 3D patterns (triplets) from the same protein structure were combined, if they share two common residues. The analysis considered only the search for 4-, 5-, and 6-bodies.
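The combination rule for constructing larger residue configurations can be sketched for the 4-body case. This follows only the rule stated above (two triplets from the same structure are merged if they share two residues) and is not the implementation of [Old02]; the residue labels are hypothetical.

```python
from itertools import combinations


def combine_triplets(triplets):
    """Merge pairs of triplets that share exactly two residues into 4-bodies."""
    quartets = set()
    for t1, t2 in combinations([frozenset(t) for t in triplets], 2):
        if len(t1 & t2) == 2:
            quartets.add(t1 | t2)
    return quartets


# Hypothetical mined triplets from a single protein structure.
mined = [
    ("ASP1", "HIS2", "SER3"),
    ("ASP1", "HIS2", "SER4"),   # shares ASP1 and HIS2 with the first triplet
    ("HIS2", "GLY7", "ALA9"),   # shares only one residue with the others
]
quartets = combine_triplets(mined)
```

Applying the same rule again to the resulting 4-bodies and the remaining triplets would be one way to reach 5- and 6-bodies, although the exact procedure used for those sizes is not spelled out here.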

4.2 Results

In the following sections, the biological significance of the mined 3D patterns is evaluated. Data mining was performed on the datasets OLDFIELD and SCOP40 with the following parameters: interaction distance d = 6 Å, cross-validation iterations = 100, and selection of clusters based on τ = 0.66 (cf. section 3.4). Table 4.1 summarises the extracted data at each processing step.

Reference   Method      Dataset    Determined   Validated   Coverage
MSDsite     this work   OLDFIELD   567          85          0.15
MSDsite     SIDEMINE    OLDFIELD   567          60          0.11
MDB         this work   OLDFIELD   302          36          0.12
MDB         SIDEMINE    OLDFIELD   302          18          0.06

Table 4.2: Identification of metal binding sites in OLDFIELD. The available metal binding sites in the protein domain structures in OLDFIELD (input dataset) were determined from two reference databases (MSDsite and MDB). The figures were compared with the cross-validated metal binding sites in the mined 3D pattern dataset. A hit was found in the pattern data if all three residues of a pattern are a subset of the residues of a metal binding site. The performance was measured in terms of coverage.
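The hit criterion and the coverage measure of table 4.2 can be sketched as follows. This is an illustration only: the function names are hypothetical, and coverage is assumed to be the ratio of validated to determined sites, which agrees with the figures in the table.

```python
def is_hit(pattern, site):
    """A pattern is a hit if all of its residues belong to the binding site."""
    return set(pattern) <= set(site)


def coverage(determined, validated):
    """Assumed definition: fraction of determined sites that were validated."""
    return validated / determined


# Iron binding site of PDB entry 1ar5 as listed by MSDsite (section 4.2.1).
site_1ar5 = {"ASP161", "HIS27", "HIS75", "HIS165"}

# The mined triplet contains TRP126, which is not part of the annotated site,
# so it does not count as a hit under the subset criterion.
print(is_hit({"HIS27", "HIS75", "TRP126"}, site_1ar5))  # False

# Coverage figures of table 4.2 for this work (MSDsite and MDB).
print(round(coverage(567, 85), 2))  # 0.15
print(round(coverage(302, 36), 2))  # 0.12
```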

4.2.1 Identification of homologous metal binding sites

Metal binding proteins play a vital role in a wide range of biological processes, such as structural stability and complex formation. The identification of metal binding proteins is therefore crucial. The objective in this section is to identify metal binding sites within the mined 3D patterns from OLDFIELD by cross-validation with the reference databases MSDsite [GDO+05] and MDB [CHR+02]. Table 4.2 lists the number of determined metal binding sites in the input dataset and the validated 3D patterns. The analysis shows that the determined coverage is similar for both references (0.15 and 0.12), which provides some confidence in the determined value. While the mined result covers only a small fraction of the available metal binding sites, the performance is comparable with SIDEMINE (0.11 and 0.06).

A manual analysis shows that some of the annotated metal binding sites can be partially recovered by merging two 3-bodies into a single 4-body. For example, MSDsite lists the iron binding site, Asp-3His, for the PDB entry 1ar5 with the residues ASP161, HIS27, HIS75, and HIS165. The mined result from OLDFIELD contains the patterns 2His-Trp and Asp-His-Trp, with the residues HIS27, HIS75, TRP126, and ASP161, HIS75, TRP126, respectively. Both triplets can be merged into the 4-body Asp-2His-Trp.

A systematic analysis of false negatives is beyond the scope of this work. However, preliminary studies indicate that the selection of the interaction distance plays an important role in discovering 3D patterns. For example, by setting the interaction distance d to 8 Å, various triplet configurations can be extracted that contain the missing histidine, HIS165, from the example above.

The validity of a mined 3D pattern as a metal binding site is demonstrated by manual analysis of several example structures. The examples show that the residues of a metal binding site have a strong conservation of the side chain groups, indicating a high energy bond in the formation of a coordinative tetrahedral site. Figure 4.1 illustrates an example configuration with three cysteines from six structure examples. The listed proteins are heterogeneous in nature but have the 3Cys-mediated ion binding site in common. Except for one entry, all structures coordinate a zinc ion in a tetrahedral configuration. Another metal binding site with the configuration Cys-2His is shown in figure 4.2. The cluster lists 11 proteins, the majority being electron transfer proteins.

In conclusion, the mined 3D pattern data contain validated metal coordinating residue configurations. The result indicates that the presented data mining system is able to identify homologous structural features, which are recurrent in the dataset.

4.2.2 Validation of convergent metal binding sites

Proteins with different folds can share a common structural feature. For example, various metal binding sites share a common residue arrangement, while the global fold of the metal binding proteins is quite different. In this case, the common pattern represents a convergent structural feature. The objective in this section is to test whether the developed data mining algorithm is able to find patterns of convergent structural features. For this analysis, the data mining was performed on SCOP40.

PDBID   Description                 Bound metal
1h2r    periplasmic hydrogenase     nickel-iron
1lat    glucocorticoid receptor     zinc
2nll    retinoic acid receptor      zinc
1ptq    protein kinase c            zinc
2ohx    alcohol dehydrogenase       zinc
4mt2    metallothionein isoform II  zinc

Figure 4.1: A metal binding site with the 3Cys pattern. Cross-validation of metal binding sites with 3D pattern from OLDFIELD identified the 3Cys configuration (top panel). List of protein structures with the common 3Cys residue configuration (bottom panel).

PDBID   Description        Bound metal
1kdi    plastocyanin       Cu
1aoz    ascorbate oxidase  Cu
6paz    pseudoazurin       Cu
1jer    stellacyanin       Cu
2azu    azurin             Cu
1bqk    pseudoazurin       Cu
1aac    amicyanin          Cu
1byo    plastocyanin       Cu
1as7    nitrite reductase  Cu
1nic    nitrite reductase  Cu
1rcy    rusticyanin        Cu

Figure 4.2: A metal binding site with the Cys-2His pattern. Cross-validation of metal binding sites with 3D pattern from OLDFIELD identified the Cys-2His configuration (top panel). List of protein structures with the common Cys-2His residue configuration (bottom panel).

PDBID   Description                        Bound metal
1iml    metal-binding protein              Zn
1zin    phosphotransferase                 Zn
1kk1    translation                        Zn
1ibi    metal-binding protein              Zn
1dgs    ligase                             Zn
1hc7    aminoacyl-trna synthetase          Zn
1gax    ligase/rna                         Zn
1dsv    virus/virus protein                Zn
1i50                                       Zn
1ptq    phosphotransferase                 Zn
1zbd    complex (gtp-binding/effector)     Zn
1kb4    transcription/dna                  Zn
1dcq    metal binding protein              Zn
1jj2    ribosome                           Cd
1vfy    transport protein                  Zn
1ffy    ligase/rna                         Zn
1dcq    metal binding protein              Zn
1dsz    transcription/dna                  Zn
1d66    transcription regulation           Cd
2alc    dna binding protein                Zn
1tfi    transcription regulation           Zn
4mt2    metallothionein                    Zn
1jr3    transferase                        Zn
1a5t    zinc finger                        Zn
1jjd    metal binding protein              Zn
1bor    transcription regulation           Zn
1zbd    complex (gtp-binding/effector)     Zn
1g25    metal binding protein              Zn
1pyi    complex (dna-binding protein/dna)  Zn
1hwt    complex (activator/dna)            Zn
1het    oxidoreductase                     Zn

Figure 4.3: A metal binding site with the 3Cys pattern. Cross-validation of metal binding sites with 3D pattern from SCOP40 identified the 3Cys configuration (top panel). List of protein structures with the common 3Cys residue configuration (bottom panel).

PDBID   Description                             Bound metal
1ncs    transcription regulation                Zn
1rmd    dna-binding protein                     Zn
2drp    complex (transcription regulation/dna)  Zn
1yuj    complex (dna-binding protein/dna)       Zn
1a1i    complex (zinc finger/dna)               Zn
1ubd    complex (transcription regulation/dna)  Zn
5znf    zinc finger dna binding domain          Zn
2gli    complex (dna-binding protein/dna)       Co
1tf3    complex (transcription regulation/dna)  Zn
1bhi    dna-binding regulatory protein          n/a
1e53    transcription                           Zn
1g2a    hydrolase                               Ni
1jym    hydrolase                               Co

Figure 4.4: A metal binding site with the Cys-2His pattern. Cross-validation of metal binding sites with 3D pattern from SCOP40 identified the Cys-2His configuration (top panel). List of protein structures with the common Cys-2His residue configuration (bottom panel).

3Cys
SCOP classification   SCOP domain identifiers
a.4.11.1              1i50j_
a.27.1.1              1ffya1
a.60.2.2              1dgsa1
b.35.1.2              1heta1
c.26.1.1              1gaxa3
c.37.1.8              1kk1a3
c.37.1.13             1jr3a2, 1a5t_2
g.38.1.1              1d66a1, 2alca_, 1pyia1, 1hwtc1
g.39.1.2              1kb4b_, 1dsza_
g.39.1.3              1iml_2, 1ibia1, 1ibia2
g.39.1.6              1jj2t_
g.40.1.1              1dsva_
g.41.2.1              1zin_2
g.41.3.1              1tfi__
g.44.1.1              1bor__, 1g25a_
g.45.1.1              1dcqa2
g.46.1.1              4mt2__, 1jjda_
g.49.1.1              1ptq__
g.50.1.1              1vfya_, 1zbdb_
g.56.1.1              1hc7a3

Cys2His
SCOP classification   SCOP domain identifiers
g.37.1.1              d1ncs__, d1rmd_1, d2drpa1, d2drpa2, d1yuja_, d1a1ia1, d1ubdc1, d5znf__, d1ubdc2, d2glia4, d2glia2, d2glia3, d1tf3a1, d1bhi__
g.49.1.2              d1e53a_
d.167.1.1             d1g2aa_, d1jyma_

Table 4.3: Convergent metal binding sites identified in SCOP40. The determined metal binding sites from the 3D patterns in SCOP40 belong to different fold classes of unrelated proteins (convergent structural feature).

Two patterns were identified in this study that represent metal binding sites. The first is the 3Cys configuration with 31 structure examples (cf. figure 4.3). The second is the Cys-2His pattern with 17 structure examples (cf. figure 4.4). A visual analysis determined that the identified metal binding sites from SCOP40 are similar to the mined result from OLDFIELD (cf. previous section). According to the SCOP classification scheme, groups of protein structures can be determined that have different domain structures but share the same metal binding site (cf. table 4.3). This indicates that the pattern was found as a recurrent structural feature in evolutionarily distant proteins.

The result of this analysis suggests that the developed data mining algorithm is able to find recurrent and convergent structural features in a non-redundant structure set.

4.2.3 Recovering active sites and catalytic triads from the dataset

The catalytic triad of serine proteases is one of the best characterised non-metal active sites. The enzymatic reaction is based on the conserved residues serine, aspartate, and histidine, which work together in a specific spatial arrangement. Previously, the identification of the catalytic triad has been described as the key evaluation analysis in protein structure data mining, because the occurrence of this pattern is just above the noise level in a dataset of analogous proteins [Old02]. The objective in this section is the search for active sites, and the catalytic triad in particular, by cross-validation with CSA. The mined result from OLDFIELD was analysed in this study.

Within OLDFIELD, 235 active sites were determined, while the number of cross-validated active sites from the mined output was 27. Table 4.4 lists the validated protein residues. The majority of these residues are found in the Asp-His-Ser pattern, which was validated as the catalytic triad by manual analysis. The identified catalytic triad configuration lists 22 structure examples, with the majority belonging to the enzyme class hydrolase and only a few belonging to the class oxidoreductase. In comparison, [Old02] identified 9 proteins, of which 7 were rediscovered in this analysis. The remaining 15 of the 22 are additional, validated solutions. Figure 4.5 shows the superimposed structures for the Asp-His-Ser configuration.

This study shows that the presented data mining system is able to find the catalytic triad in OLDFIELD. The mined result contains 15 additional valid solutions that were not discovered in [Old02].

3D pattern (k=2)
Pattern       PDBID   RID                                  CSA / SIDEMINE   EC          UID
Ala-Arg-Asn   1qgj    A ALA 71, A ARG 38, A ASN 67         +                1.11.1.7    PER59_ARATH
Ala-Arg-Asn   7atj    A ALA 74, A ARG 38, A ASN 70                          1.11.1.7    PER1A_ARMRU
His-2Ser      1elt    A HIS 57, A SER 195, A SER 214       +                3.4.21.36   ELA1_SALSA
His-2Ser      1ppf    E HIS 57, E SER 195, E SER 214                        3.4.21.37   ELNE_HUMAN
His-2Ser      1bma    A HIS 60, A SER 203, A SER 222       +                3.4.21.36   ELA1_PIG
His-2Ser      1avw    A HIS 57, A SER 195, A SER 214       +                3.4.21.4    N/A
His-2Ser      1hyl    A HIS 57, A SER 195, A SER 214       +                3.4.21.-    COGS_HYPLI
His-2Ser      1bit    A HIS 57, A SER 195, A SER 214                        3.4.21.4    TRY1_SALSA
His-2Ser      1jrt    A HIS 57, A SER 195, A SER 214       +                3.4.21.4    TRY1_BOVIN
His-2Ser      1try    A HIS 57, A SER 195, A SER 214       +                3.4.21.4    TRYP_FUSOX
His-2Ser      1au8    A HIS 57, A SER 195, A SER 214                        3.4.21.20   CATG_HUMAN
His-2Ser      1ct0    E HIS 57, E SER 195, E SER 214       +                N/A         N/A
Asp-His-Ser   1a8q    A ASP 223, A HIS 252, A SER 94       +                1.11.1.10   BPA1_STRAU
Asp-His-Ser   1a7u    A ASP 228, A HIS 257, A SER 98       +                1.11.1.10   PRXC_STRAU
Asp-His-Ser   1a88    A ASP 226, A HIS 255, A SER 96       +                1.11.1.10   PRXC_STRLI
Asp-His-Ser   1a8s    A ASP 224, A HIS 253, A SER 94       +                1.11.1.10   PRXC_PSEFL
Asp-His-Ser   1tib    A ASP 201, A HIS 258, A SER 146                       3.1.1.3     LIP_THELA
Asp-His-Ser   3tgl    A ASP 203, A HIS 257, A SER 144                       3.1.1.3     LIP_RHIMI
Asp-His-Ser   1bs9    A ASP 175, A HIS 187, A SER 90       +                3.1.1.6     AXE2_PENPU
Asp-His-Ser   1avw    A ASP 102, A HIS 57, A SER 195       + +              3.4.21.4    N/A
Asp-His-Ser   1acb    E ASP 102, E HIS 57, E SER 195       + +              3.4.21.4    CTRA_BOVIN
Asp-His-Ser   1taw    A ASP 102, A HIS 57, A SER 195       +                3.4.21.4    N/A
Asp-His-Ser   1au8    A ASP 102, A HIS 57, A SER 195       + +              3.4.21.20   CATH_HUMAN
Asp-His-Ser   1elt    A ASP 102, A HIS 57, A SER 195       +                3.4.21.36   ELA1_SALSA
Asp-His-Ser   3tgi    E ASP 102, E HIS 57, E SER 195       +                3.4.21.4    TRY2_RAT
Asp-His-Ser   1agj    A ASP 120, A HIS 72, A SER 195       +                3.4.21.-    ETA_STAAU
Asp-His-Ser   1auo    A ASP 168, A HIS 199, A SER 114      +                3.4.22.38   CATK_HUMAN
Asp-His-Ser   1arb    A ASP 113, A HIS 57, A SER 194                        3.4.21.50   API_ACHLY
Asp-His-Ser   1jrt    A ASP 102, A HIS 57, A SER 195       +                3.4.21.4    TRY1_BOVIN
Asp-His-Ser   1try    A ASP 102, A HIS 57, A SER 195                        3.4.21.4    TRYP_FUSOX
Asp-His-Ser   2tec    E ASP 38, E HIS 71, E SER 225        +                3.4.21.66   THET_THEVU
Asp-His-Ser   1ppf    E ASP 102, E HIS 57, E SER 195       + +              3.4.21.37   ELNE_HUMAN
Asp-His-Ser   1jfr    A ASP 177, A HIS 209, A SER 131                       N/A         P83850_STREX
Asp-His-Ser   1ct0    E ASP 102, E HIS 57, E SER 195       + +              N/A         N/A

3D pattern (k=3)
Pattern       PDBID   RID                                  CSA / SIDEMINE   EC          UID
Ala-Asp-Ser   1brt    A ALA 123, A ASP 228, A SER 98       +                1.11.1.10   BPOA2_STRAU
Ala-Asp-Ser   1onr    A ALA 225, A ASP 17, A SER 176                        2.2.1.2     TALB_ECOLI
Asp-Cys-Lys   1nba    A ASP 51, A CYS 177, A LYS 144       +                3.5.1.59    CSH_ARTSP

Table 4.4: List of cross-validated active site residues. The catalytic residues in the mined k=2 or k=3 residue triplets were compared against active site templates in CSA. RID = residue identifier, consisting of a chain identifier + a residue name + a residue sequence position.

Figure 4.5: Re-discovery of the catalytic triad in OLDFIELD. Examples of protein structures with the Asp-His-Ser pattern were cross-validated by CSA.

4.2.4 Discovering the conserved serine residue in the catalytic triad (quartet)

The catalytic triad template (Asp-His-Ser) has been reported as a four-residue configuration (Asp-His-Ser/Ser) [WBT97][BFW+94]. Based on the catalytic triad pattern identified in OLDFIELD (cf. previous section), the objective in this section is to test whether a 4-body or even larger residue configurations can be generated from the mined 3D patterns. In addition, the analysis searches for the conserved serine residue in these extended configurations.

The result of extending the catalytic triad is summarised in table 4.5. Of the 22 structure examples, 10 have a single residue extension, and only 7 of these 10 4-bodies contain the conserved serine residue (Asp-His-2Ser). Other 4-bodies were found with an additional alanine or cysteine residue. Preliminary studies indicate that even larger configurations can be obtained by combining the determined 4-bodies into a 5- or 6-body. However, the biological validity of the additional alanine or cysteine in a 4-body, or even other amino acids in larger residue configurations, needs to be determined.

PDBID   Asp-His-Ser   His-2Ser   Ala-His-Ser   Cys-His-Ser   Ala-Asp-His
1jrt    + + +
1au8    + + +
1ppf    + + +
1avw    + + + +
1ct0    + + + +
1elt    + + + +
1try    + + + +
3tgi    + + + +
1acb    + + + +
1arb    + + +
2tec    +
1agj    +
1taw    +
1a8s    +
1jfr    +
1a7u    +
1auo    +
1a88    +
1a8q    +
1tib    +
3tgl    +

Table 4.5: Extending the catalytic triad into 4-bodies. Two residue triplets from the same protein structure are merged if two of their residues are identical. The first column indicates the catalytic triad configuration, while the second column represents an extension with a previously reported conserved serine residue. The remaining columns show other solutions of 3-body extensions with the catalytic triad.

In conclusion, the presented algorithm is able to find the catalytic triad (quartet), i.e. the second conserved serine residue was rediscovered from data mining. While other residue configurations of 4-bodies were also found, the biological role of these residues is being investigated further.

4.3 Discussion

The biological cross-validation of the mined 3D patterns requires an adequate knowledge base as reference. A precision score cannot be estimated from cross-validation studies, because the result is the solution of discovery-driven data mining, and current knowledge bases have an incomplete coverage of functional sites. In this respect, the mined 3D patterns may contain known biological motifs, which are the detectable true positives, or unknown functional sites, which cannot be confirmed yet. In addition, the result may contain noise, which cannot be identified as false positives.

The biological significance of the presented data mining was evaluated by examples of known biological functional sites: the metal binding site and the catalytic triad. In particular, only known functional sites for proteins in the input structure set were used as benchmark. An alternative to this stringent evaluation is to transfer functional sites from homologous proteins, e.g. based on the Homology-derived Secondary Structure of Proteins (HSSP) database [SS96], and to consider this information as true positive reference.

About one third of the data in the PDB are protein structures co-crystallised with metal ions, which allows the study of metal binding sites [BW03]. Within the analysis, only a small fraction of proteins with metal binding sites was rediscovered. A systematic optimisation of the developed data mining algorithm, e.g. by modification of feature selection criteria, was not pursued, because this would have exceeded the scope of this thesis. Preliminary studies on the source of the false negatives indicate that the interaction distance threshold is the first parameter to be optimised. However, changing this parameter also modifies the probability distribution of triangulated structural features, and the effect cannot be estimated easily.

The datasets OLDFIELD and SCOP40 are quite different (cf. section 3.2). OLDFIELD consists of sequentially dissimilar protein structures, while the proteins may still share structure similarity. This property allows the mining of homologous structural features of divergent proteins, such as metal binding sites or the catalytic triad. The developed data mining method was also tested for its ability to extract convergent structural features by analysing SCOP40. This dataset consists only of divergent proteins with no global structural similarities. As a consequence, structural components are mainly represented by convergent features, and the detection of these residue configurations might be below detection level. That is, the occurrences of convergent structural features are similar to background level. However, metal binding sites are examples of convergent patterns that were found in this study. The coordination of metal ions is greatly dependent on the distances and orientations of the coordinating residues. For that reason, data mining can detect these convergent structural features in structurally unrelated proteins.

The presented data mining system identifies local three-residue interactions with respect to their spatial and chemical configuration. In addition, examples of 4- and 5-body interactions were shown as a solution in extending the catalytic triad pattern. The analysis shows that larger residue configurations can be found with the presented combinatorial approach. However, the search for larger structural patterns might deliver only protein stabilising features or other biological units in protein structures that are difficult to interpret.

4.4 Conclusion

The solution of the developed data mining algorithm is justified by the cross-validation of biologically relevant structure motifs provided in this study. The mining system is able to detect recurrent homologous or convergent structural features in the dataset. More importantly, two biological motifs, the metal binding site and the catalytic triad, were rediscovered, indicating that the mined output contains biologically valid solutions. While the prediction of functional sites is an important task in structural biology, the biological interpretation of a 3D pattern requires evidence of biological significance. The combination with published biochemical and experimental data can provide evidence and a biological context for data interpretation. In the next chapter, I will present a biomedical literature mining system for the extraction of functional annotation of protein residues.

Chapter 5

Identification of protein residues in MEDLINE

In this chapter, I present a text mining method to identify protein residues in biomedical texts. In the first step, the algorithm identifies the biological entities of residue, protein, and organism, and then determines the association of entity triplets. As a result, a residue is linked to its source protein, and the protein is mapped to its hosting organism. Because the developed text mining solution relies on information from UniProtKB, an identified protein residue is directly linked to a unique Uniprot entry. One application of this method is to search MEDLINE for abstract texts that mention protein residues and to use the result to update citations in UniProtKB. The identification of protein residues in biomedical texts is a prerequisite for the extraction of functional annotation of residues.

5.1 Algorithms

The developed protein residue identification system is based on the algorithm of [HLC04]. Basically, the developed method is a four-step procedure: the recognition of the biological entities organism, protein, and residue, followed by the association of the entity triplet. Figure 5.1 illustrates the procedures of this text mining system.

Figure 5.1: Overview of processes and evaluation methods for the developed protein residue identification system.

5.1.1 Protein and organism entity recognition

Theory

The recognition of protein and organism entities in text is based on a dictionary lookup approach. Basically, names of proteins, their synonyms, and their gene names are collected from UniProtKB to populate a protein terminology dictionary. The lookup of the protein dictionary considers the matching of morphological variants. The dictionary is not expanded by syntactical variants of terminological entries, such as structural or formal variants or the addition of a modifier or head word, because a lookup over the vast number of permutations would require considerably more memory. The alternative is to use a probabilistic approach. A similar method is also used to populate the organism terminology dictionary with names and synonyms from the NCBI Taxonomy database [WBB+06]. The lookup of terminologies also considers the matching of morphological variants.

Implementation

The recognition of protein entities was based on an approach that combined dictionary lookup with basic disambiguation [RSKA+07]. All protein names and synonyms were collected from UniProtKB. Names of species were extracted from the NCBI Taxonomy references in UniProtKB, and their scientific and common names were collected. The dictionary was complemented with terminologies describing only the referenced genus. Full organism names were augmented with abbreviated genus forms, i.e. first letter abbreviation of genus + species. The annotation of texts with protein and organism names was based on the publicly available web service Whatizit [RSAG+08], which provides a fast and efficient method. The result is an annotation of protein and organism names in text with references to UniProtKB and NCBI Taxonomy.
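The expansion of organism names can be illustrated with a small sketch. The function name `organism_variants` is hypothetical, and the exact surface form of the abbreviation (first letter of the genus, a full stop, and the species epithet) is an assumption, since only the principle is stated above.

```python
def organism_variants(scientific_name):
    """Return dictionary terms derived from a scientific organism name.

    The variants are the full name, the genus alone, and an abbreviated
    genus form (assumed here to be written like "E. coli").
    """
    parts = scientific_name.split()
    variants = {scientific_name}
    if len(parts) >= 2:
        genus = parts[0]
        species = " ".join(parts[1:])
        variants.add(genus)                     # genus-only term
        variants.add(f"{genus[0]}. {species}")  # abbreviated genus + species
    return variants


print(sorted(organism_variants("Escherichia coli")))
# ['E. coli', 'Escherichia', 'Escherichia coli']
```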

5.1.2 Entity recognition of protein residue

Theory

The identification of residue entities is based on the re-implementation of previously published regular expression patterns for point mutations [HLC04] [RSMA+04]. Here, the patterns are extended to capture in total three types of residues: wild-type, point mutation, and range of residues or pair of residues. Although amino acid sequences can be considered in the residue entity identification, the lack of information about sequence position prevents the precise detection of associations with proteins.

The first basic type of residue mention is the single protein residue sequence reference, which consists of the name of an amino acid, followed by the sequence position number, e.g. "Gly-12", "arginine 4", "Tyr74", "Arg(53)". A point mutation is the second type of residue mention, where the description details the exchange of an amino acid at a given position. The common notation is the name of the amino acid, its sequence position number, followed by the exchange. The following are examples of point mutations found in text: "W77R", "Cys560Arg", "ser-52->ala", "ala2-methionine". Finally, the third type of residue mention describes either a range of residues or an interaction pair, e.g. "Tyr 85 to Ser 85", "Trp27–Cys29". The common notation is the string sequence: amino acid name, sequence position, a connection symbol or connection word, amino acid name, and then sequence position. The correct identification of this type of residue mention requires the consideration of contextual information, which is not handled in this version.

In addition to the abbreviated notation, protein residues can be expressed in syntactical form, e.g. "isoleucine at position 3", "substitution of Ala at position 4 to Gly", "Ser472 to glutamic acid". Additional patterns were developed to accommodate these and other less precisely defined residue mentions in syntactical form, e.g. "residue at position 22, 34, and 40". Although the entity triplet association algorithm does not utilise the latter identified residue mentions, annotation can generally be extracted for these underspecified residues to increase the recall in information extraction.

Implementation

The extraction of residue mentions reuses the idea of designing regular expressions to find residue entities in text [RSMA+04][HLC04]. Some of the previously published regular expression patterns were adopted, while other patterns were created to cover other types of residue mentions, such as basic abbreviated point mutation patterns. In this thesis, sets of regular expressions were developed and implemented as finite state transducers to identify three types of residue entities (cf. table 5.1): wild-type, point mutation, and range or pair of residues. The result is an annotation of residue mentions in text with normalised expressions.
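As an illustration of this kind of pattern, the following sketch uses Python regular expressions to recognise two of the simplest notations: a wild-type site written with a three-letter code, and a point mutation written with one-letter codes. It is a strongly reduced stand-in for the pattern set of table 5.1, not a reproduction of it; the names `SITE`, `MUTATION`, `find_sites`, and `find_mutations` as used here are assumptions made for the example.

```python
import re

# Reduced building blocks (20 standard amino acids only).
RESN1 = "[ARNDCQEGHILKMFPSTWYV]"
RESN3 = ("(?:Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|"
         "Phe|Pro|Ser|Thr|Trp|Tyr|Val)")
POS = "[1-9][0-9]*"

# Wild-type site: three-letter code, optional hyphen/space/bracket, position.
SITE = re.compile(rf"\b({RESN3})[- ]?\(?({POS})\)?", re.IGNORECASE)
# Point mutation: one-letter code, position, one-letter code.
MUTATION = re.compile(rf"\b({RESN1})({POS})({RESN1})\b")


def find_sites(text):
    """Return normalised (residue, position) pairs for wild-type mentions."""
    return [(m.group(1).capitalize(), int(m.group(2)))
            for m in SITE.finditer(text)]


def find_mutations(text):
    """Return (wild-type, position, mutant) triples for point mutations."""
    return [(m.group(1), int(m.group(2)), m.group(3))
            for m in MUTATION.finditer(text)]


print(find_sites("Gly-12, Tyr74 and Arg(53)"))
# [('Gly', 12), ('Tyr', 74), ('Arg', 53)]
print(find_mutations("the W77R mutant"))
# [('W', 77, 'R')]
```

Full amino acid names, ranges, and the syntactical forms described above are deliberately left out of this sketch.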

5.1.3 Association identification of the entity triplet organism, protein, and residue

Theory

The association of the entities organism, protein, and residue is a difficult text mining task. Unlike the association of two proteins, e.g. the physical interaction of two proteins (protein-protein interaction), the binary semantic relationships of organism-protein and protein-residue are not necessarily explicitly stated in biomedical texts. For example, a protein may be mentioned at the beginning of a paragraph, while a site-directed mutation on the same protein is described in later sections. This is one reason why approaches relying only on language patterns or word distance metrics are not sufficient to find protein-residue associations. The association task becomes more complex when multiple proteins are mentioned in the text. Usually a residue has a one-to-one relationship with a protein; however, two proteins can have the same residue at the same sequence position. While this ambiguity cannot be solved without deeper natural language processing techniques, the problem can be tackled with a knowledge-based approach.

RANGE-TO = ("-"+ ("to" "-+")? | "to");
CONVERT-TO = ("to" | "-"+ ">"?);
XAA = ( "X" | "XAA" | "xaa" );
POS = (1-9)(0-9)*;
RESN1 = [ARNDCQEGHILKMFPSTWYVOUBZX];
RESN3 = ( [aA]la|ALA | [aA]rg|ARG | [aA]sn|ASN | [aA]sp|ASP | [cC]ys|CYS | [gG]ln|GLN | [gG]lu|GLU | [gG]ly|GLY | [hH]is|HIS | [iI]le|ILE | [lL]eu|LEU | [lL]ys|LYS | [mM]et|MET | [pP]he|PHE | [pP]ro|PRO | [sS]er|SER | [tT]hr|THR | [tT]rp|TRP | [tT]yr|TYR | [vV]al|VAL | [pP]yl|PYL | [sS]ec|SEC | [aA]sx|ASX | [gG]lx|GLX | [xX]aa|XAA);
RESNF = ( [aA]lanine | [aA]rginine | [aA]sparagine | [aA]spart(ate|ic acid) | [cC]ysteine | [gG]lutamine | [gG]lutam(ate|ic acid) | [gG]lycine | [hH]istidine | [iI]soleucine | [lL]eucine | [lL]ysine | [mM]ethionine | [pP]henylalanine | [pP]roline | [sS]erine | [tT]hreonine | [tT]ryptophan | [tT]yrosine | [vV]aline | [pP]yrrolysine | [sS]elenocysteine | [aA]spartic acid or [aA]sparagine | [gG]lutamic acid or [gG]lutamine);
SITE = ( (RESN3 | RESNF) POS "residue"?
  | (RESN3 | RESNF) "-"+ POS "residue"?
  | (RESN3 | RESNF) "residue"? "at position"? POS "residue"?
  | (RESN3 | RESNF) "(" POS ")" "residue"?
  | "amino acid"? "residue" "at position"? POS
  | "amino acid" "residue"? "at position"? POS
  | RESNF "residue" POS);
SITES = ( RESNF"s" (("," | "and" | "or") RESNF"s")*
  | RESNF"s"? ("at position""s"?)? ("," | "and" | "or") (("at position""s"?)? ("," | "and" | "or") POS)+
  | RESNF "residue""s"?
  | RESN3 "residue""s"? ("at position""s"?)? POS (("at position""s"?)? ("," | "and" | "or") POS)+
  | RESN3 "residue""s"?
  | "residue""s"? ("at position""s"?)? POS ("," | "and" | "or") POS)+
  | (RESN3 | RESNF) "for" (RESN3 | RESNF) "at position" POS ("," | "and" | "or") POS)+
  | RESNF ("," | "and" | "or") POS)* "residue""s"?);
RANGE/PAIR = ( "residue""s"? ("," | "and" | "or") RANGE-TO POS)+
  | "amino acid" "residue"? "s"? ("," | "and" | "or") RANGE-TO POS)+
  | ("residue""s"?)? "at position""s"? ("," | "and" | "or") RANGE-TO POS)+
  | RESI RANGE-TO RESI);
MUTATION = ( RESN1 POS RESN1
  | RESN1 "-" POS "-" RESN1
  | RESN1 "(" POS ")" RESN1
  | RESI CONVERT-TO (RESN3 | RESNF)
  | RESI RESN3
  | "from" (RESNF | RESN3) CONVERT-TO (RESNF | RESN3) "at position" POS
  | (RESN3 | RESNF) "for" (RESN3 | RESNF) "at position" POS
  | RESI ("-"+ | CONVERT-TO) RESI "substitution");

Table 5.1: Regular expression patterns for the detection of residue mentions in text. The patterns recognise single (SITE) or multiple wild-type residue sites (SITES), a sequence range or residue pair (RANGE/PAIR), and point mutation (MUTATION). The set covers abbreviated notations of residues as well as grammatical expressions found in text.

The developed method in this work is based on the algorithm of [HLC04]. Basically, the identification of a protein residue can only be validated if it is part of the protein sequence as denoted in a reference database, e.g. UniProtKB. This requires that the protein mentioned in the text is further supported by evidence for the organism under scrutiny, in order to select the appropriate protein sequence from the bioinformatics database; this excludes the risk of using orthologous protein sequences.

Implementation

In this study, the system developed to identify the entity triplet association of organism, protein, and residue was based on the algorithm described by [HLC04], with some modifications. In the first step, proteins were associated with their hosting organisms. Given a protein, all protein-organism (species) pairs were determined from the text and ranked according to a word distance measure. The word distance between two entities was defined as the smallest number of words between them. The identification of protein-organism associations began with the pair with the smallest word distance. A valid association was found if a semantic relation between the two entities was specified in UniProtKB. If an association was validated, the search was terminated and the protein was annotated with the corresponding Uniprot identifier; otherwise the next entity pair from the list was tested. If no match between protein and organism (species) was found, the search was relaxed to genus matching. This relaxed matching is an extension of the [HLC04] algorithm. Because entries in UniProtKB are species specific, a protein-organism (genus) association results in a list of Uniprot identifiers as annotation of the protein.

The second step of the algorithm was the association of residues with their source proteins. The procedure for selecting and ranking the residue-protein pairs was similar to that used for the protein-organism association. For each pair to be tested, the annotated Uniprot identifier of the protein was used to retrieve the protein sequence from the database. Three cases can be distinguished: (1) the residue correctly matches the protein sequence; (2) several alternative sequences from a list of proteins match; and (3) no match can be found for the residue in the available protein sequences. If a match was found, the residue was annotated with references to the protein; otherwise the search continued with the next pair from the ranked list.
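The first step described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this work: the `Mention` type, the two lookup dictionaries standing in for UniProtKB, and the identifiers `UID_A`, `UID_B`, `UID_C` are all hypothetical, and the genus is assumed to be the first word of the organism name.

```python
from typing import Dict, List, NamedTuple, Tuple


class Mention(NamedTuple):
    text: str   # surface form, e.g. "Escherichia coli"
    start: int  # index of the first token of the mention
    end: int    # index of the last token of the mention (inclusive)


def word_distance(a: Mention, b: Mention) -> int:
    """Number of words lying between two mentions (0 if adjacent or overlapping)."""
    if a.start > b.start:
        a, b = b, a
    return max(0, b.start - a.end - 1)


def associate(protein: Mention,
              organisms: List[Mention],
              species_index: Dict[Tuple[str, str], List[str]],
              genus_index: Dict[Tuple[str, str], List[str]]) -> List[str]:
    """Return identifiers for the closest organism that the (toy) database confirms."""
    ranked = sorted(organisms, key=lambda o: word_distance(protein, o))
    # Species-level matching: stop at the first validated pair.
    for organism in ranked:
        ids = species_index.get((protein.text, organism.text))
        if ids:
            return ids
    # Relaxed matching: fall back to the genus, which may yield several identifiers.
    for organism in ranked:
        genus = organism.text.split()[0]  # assumption: genus is the first word
        ids = genus_index.get((protein.text, genus))
        if ids:
            return ids
    return []


# Toy example with hypothetical identifiers.
protein = Mention("lysozyme", 1, 1)
organisms = [Mention("Bacillus subtilis", 8, 9), Mention("Escherichia coli", 3, 4)]
species_index = {("lysozyme", "Escherichia coli"): ["UID_A"]}
genus_index = {("lysozyme", "Bacillus"): ["UID_B", "UID_C"]}
result = associate(protein, organisms, species_index, genus_index)
```

The residue-protein step would follow the same ranking scheme, with the validation replaced by a lookup of the residue at the stated position in the retrieved sequence.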

5.2 The construction of evaluation test corpora

UniProtKB is one of the most comprehensive protein knowledge bases (cf. section 2.1.2). It contains manually curated functional annotations on three levels: protein, protein sequence, and protein residue. Information is derived from surveys of biomedical articles, and entries are annotated with citation references (PMIDs; PubMed identifiers). However, the precise association between a citation and a protein residue in the context of functional annotation is generally not available.

The test dataset for the developed functional annotation extraction is based on the citation references from UniProtKB. A Uniprot corpus was generated by retrieving the abstract texts from MEDLINE that are indexed by the knowledge base. From the 136,566 citations listed in UniProtKB, a virtually complete set of 136,559 abstract texts was retrieved from MEDLINE. Although not all information presented in UniProtKB is necessarily available in the Uniprot corpus, the corpus is a starting point for the evaluation of the developed text mining modules. In particular, three test corpora were derived from the Uniprot corpus: the gold standard corpus with manual annotation (GC), and two cross-validation corpora with annotations derived from UniProtKB (XC1 and XC2). Figure 5.2 summarises the key features of the test corpora.

For the automatic evaluation of extracted data, a cross-validation corpus (XC) was derived from the Uniprot corpus. This test set was used to analyse the performance of protein-organism (XC1) and residue-protein (XC2) association detection. The test set was annotated automatically, i.e. the biological entities were detected with the same ER systems. The documents in the Uniprot corpus were scanned for tri-occurrences of organism, protein,

Dataset                     Gold standard corpus (GC)         Cross-validation corpus (XC1)   Cross-validation corpus (XC2)
Abstracts count             100                               55,998                          4,503
Method of annotation        manual                            automatic                       automatic
Total/unique residues       362/262 (262/191 with residue     N/A                             N/A
                            name + sequence position)
Total/unique proteins       990/511                           N/A                             N/A
Total/unique organisms      323/123                           N/A                             N/A
Total/unique associations   240/172 residue-protein-          N/A / 70,401 protein-organism   N/A / 10,152 protein-residue
                            organism associations             as UTP                          as URP

Application (GC): Test the type, amount and reliability of the extracted information (reproduction of manually annotated information).
Application (XC1, XC2): Test set is assumed to contain the same type of information as GC, but certainty is not clear. Study the reproduction of information contained in the database.

Figure 5.2: Test corpora for information extraction evaluation. Based on the citation references from UniProtKB, a base corpus was generated by retrieving abstract texts from MEDLINE. Two kinds of test corpora were derived from this corpus: (1) the gold standard corpus (GC), which is a manually annotated test set; and (2) the cross-validation corpora (XC1, XC2), which contain automatically assigned annotations based on information from UniProtKB.

and residue in the text, and a subset was retained for which the identifier triplet (UID+TID+PMID) of a document could be found in the database. UID is the Uniprot ID, TID is the NCBI Taxonomy ID, and PMID is the PubMed identifier. A document was selected if at least a single match was found. For the non-matching combinations, the corresponding annotations were removed from the text. This results in the test set XC1 with the associated set of triple identifier combinations UTP = (UID+TID+PMID). XC2 is a subselection of XC1, obtained by filtering for documents where the identifier combination URP = (UID+RID+PMID) was validated by entries in UniProtKB. RID is a residue identifier, which consists of a residue name + sequence position. 70,401 UTPs from 55,998 abstract texts were determined for XC1, and correspondingly 10,152 URPs were derived from 4,503 MEDLINE abstracts in XC2.

The gold standard corpus (GC) was created through manual curation, since no suitable annotated corpora are available for this study. A random sample of 100 MEDLINE

abstract texts was drawn from the Uniprot corpus, where every abstract text had to contain a tri-occurrence of organism, protein, and residue. Note that the detection of the entities was based on the entity recognition (ER) systems described in the previous section. The ER systems are not expected to perform perfectly, and therefore a certain proportion of the filtered abstract texts contains false positive entities.

From this set of 100 abstract texts, manual analysis provided four types of annotation. The first type is the annotation of the biological entities organism, protein, and residue, while the second is the annotation of entity triplet associations, i.e. organism-protein-residue. Note that this process did not include the grounding of protein or organism entities to entries in the specialised databases, i.e. UniProtKB and NCBI Taxonomy. In addition, text segments of sentences with a residue entity were annotated if they represent keywords for functional annotation. Finally, the association of a keyword and a residue was also annotated in GC.

Note that the set of documents in GC is only partially contained in XC2; 26 abstracts are shared between the two datasets. For these shared abstracts, manual annotation determined 38 entity triplet associations, while the corresponding number from XC2 was 58. The total number of manually annotated triplet associations in GC is 172 (cf. figure 5.2).

The major difference between the evaluation corpora is that GC contains manually confirmed biological entities and their associations. In contrast, the annotations in XC1 and XC2 were derived from UniProtKB, based on the assumption that the same database information is present in the abstract texts. The interpretation of the performance analysis has to consider the properties of these evaluation test corpora.
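The filtering that yields XC1 and XC2 in this section can be illustrated with a short sketch. It is only an assumption about how such a filter might look: the function name, the dictionary layout and the identifiers in the example are hypothetical, and the database is reduced to a plain set of identifier triples.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]         # (UID, TID) for XC1, or (UID, RID) for XC2
Triple = Tuple[str, str, str]  # (UID, TID, PMID) or (UID, RID, PMID)


def build_cross_validation_corpus(detected: Dict[str, Set[Pair]],
                                  database: Set[Triple]) -> Dict[str, Set[Pair]]:
    """Keep only the detected combinations that the database confirms for the same PMID.

    A document is retained if at least one combination is confirmed;
    non-matching combinations are dropped from the retained documents.
    """
    corpus = {}
    for pmid, pairs in detected.items():
        confirmed = {(uid, other) for (uid, other) in pairs
                     if (uid, other, pmid) in database}
        if confirmed:
            corpus[pmid] = confirmed
    return corpus


# Toy example with hypothetical identifiers.
detected = {
    "pmid1": {("U1", "T1"), ("U2", "T9")},
    "pmid2": {("U3", "T3")},
}
database = {("U1", "T1", "pmid1")}
xc1 = build_cross_validation_corpus(detected, database)
```

The same function could be applied a second time to the documents retained in XC1, with (UID, RID) pairs and URP triples, to obtain XC2.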

5.3 Evaluation methods

The performance of each process of the developed protein residue identification system was scored against a manually annotated gold standard corpus. Proteins for which the protein entity recognition system and manual curation assigned the same entity (full term matching) were considered true positives (TP). The same rule also applied for counting TP in the detection of residue and organism entities.

The evaluation of the entity triplet association detection considered an association as TP only if both pair relations, organism-protein and protein-residue, were determined correctly. If one of the relations was incorrect, the association was counted as a false positive (FP).

In contrast, the automatic evaluation of the entity recognition and entity association detection systems was performed on XC. A true positive of an annotated entity within an abstract text was identified if UniProtKB lists the same entity in the context of the given PMID. For example, if organism X in text Y is also indexed in UniProtKB as a combination of TID+PMID, then a TP was counted.

A correct protein-organism association was detected if the determined identifier combination UTP was found in XC. Similarly, a correct residue-protein association was found if the derived identifier combination URP was found in the test corpus. The effectiveness of the ER and the association detection systems was measured in terms of precision, recall and the balanced F-measure (F1):

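\begin{equation}
\text{precision} = \frac{\#\,\text{true positive}}{\#\,\text{true positive} + \#\,\text{false positive}}, \tag{5.1}
\end{equation}

\begin{equation}
\text{recall} = \frac{\#\,\text{true positive}}{\#\,\text{true positive} + \#\,\text{false negative}}, \tag{5.2}
\end{equation}

\begin{equation}
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. \tag{5.3}
\end{equation}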
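As a worked check of these measures, the sketch below computes them from the three counts reported in the result tables of this chapter. It assumes that "Common" is the number of true positives, "Extracted" is true positives plus false positives, and "Available" is true positives plus false negatives, and it uses the standard definition of recall, TP / (TP + FN); the function name is illustrative only.

```python
from typing import Tuple


def precision_recall_f1(available: int, extracted: int, common: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from table-style counts.

    common    = true positives
    extracted = true positives + false positives
    available = true positives + false negatives
    """
    precision = common / extracted
    recall = common / available
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Residue entity recognition on the gold standard corpus (table 5.2):
# 191 available, 203 extracted, 187 common.
p, r, f1 = precision_recall_f1(available=191, extracted=203, common=187)
rounded = (round(p, 2), round(r, 2), round(f1, 2))  # (0.92, 0.98, 0.95)
```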

5.4 Results

The protein residue identification system developed in this study consists of four modules. The following sections first assess the performance of biological entity recognition, and then

                                       Unique residue entities
Reference       Dataset                Available   Extracted   Common   Precision   Recall   F1
This study      Gold standard corpus   191         203         187      0.92        0.98     0.95
MutationGraB    GPCR corpus            N/A         N/A         N/A      0.98        0.77     0.86
MutationMiner   Xylanase corpus        N/A         N/A         N/A      1.00        0.85     0.92
MEMA            Mutation corpus        N/A         N/A         N/A      0.98        0.75     0.85

Table 5.2: Performance evaluation of residue entity recognition. The performance is compared with other published residue entity recognition systems: MutationGraB (GPCR corpus) [LHC07]; MutationMiner (Xylanase corpus) [BW05]; and MEMA (Mutation corpus) [RSMA+04]. Performance was measured in terms of precision, recall, and F1 measure.

the association of the entity triplet organism, protein, and residue. The final section presents an application of the presented text mining solution that can be used to update the citation set of UniProtKB or any other derived database.

5.4.1 Evaluation of organism, protein, and residue entity recognition

The goal of biological entity recognition in this study is to detect the mentions of residue, protein, and organism in biomedical abstract texts. In order to evaluate the performance of the developed ER systems, the detections were compared against the manually curated test set, the gold standard corpus (GC).

The evaluation shows that the developed regular expression patterns are highly usable for the detection of residue mentions in biomedical texts. ER for residue mentions yields a precision of 0.92 and a recall of 0.98. With an F1 measure of 0.95, the performance of this ER system is within the range of previous reports on point mutation identification [LHC07][BW05][RSMA+04] (cf. table 5.2).

The protein mention identification achieves 65% precision and 60% recall (62% F1 measure). The result is difficult to compare with previously reported systems, e.g. ProMiner and MutationMiner (cf. table 5.3), due to the different experimental setup. ProMiner was evaluated on the BioCreAtIvE corpus (80% F1 measure)

                                       Unique protein entities
Reference       Dataset                Available   Extracted   Common   Precision   Recall   F1
This study      Gold standard corpus   511         471         305      0.65        0.60     0.62
ProMiner        BioCreAtIvE corpus     N/A         N/A         N/A      0.8         0.8      0.8
MutationMiner   Xylanase corpus        N/A         N/A         N/A      0.88        0.71     0.79

Table 5.3: Performance evaluation of protein entity recognition. The performance is compared with the other published protein entity recognition systems: ProMiner (BioCreAtIvE corpus, Task 1B, protein and gene name identification) [HFM+05]; and MutationMiner (Xylanase corpus) [BW05]. Performance was measured in terms of precision, recall, and F1 measure.

                                       Unique organism entities
Reference       Dataset                Available   Extracted   Common   Precision   Recall   F1
This study      Gold standard corpus   123         109         88       0.81        0.72     0.76
MutationMiner   Xylanase corpus        N/A         N/A         N/A      0.88        0.71     0.79

Table 5.4: Performance evaluation of organism entity recognition. The performance is compared with the NER system of MutationMiner (Xylanase corpus) [BW05]. Performance was measured in terms of precision, recall, and F1 measure.

which links the contained protein mentions to only a small set of organisms. However, we have repeated the experiment on the BioCreAtIvE dataset, and the result suggests that our method yields a comparable performance (76% F1 measure). Conversely, the evaluation of MutationMiner considers not only abstract texts but also the content of the full-text articles, which should improve the results (79% F1 measure).

Although the developed organism entity recognition system relies on a dictionary lookup approach similar to that of protein entity recognition, the performance is higher (precision of 0.81 and recall of 0.72; cf. table 5.4). This indicates that the list of terms is precise and covers a wide range of expressions. In conclusion, with F1 measures of 0.95, 0.62, and 0.76 for the entity recognition of residue, protein, and organism, the developed text mining system is able to detect these three biological entities in biomedical abstract texts.

                                       Unique resi.-prot.-org.-associations
Reference       Dataset                Available   Extracted   Common   Precision   Recall   F1
This study      Gold standard corpus   172         79          65       0.82        0.38     0.52
MutationGraB    Mutation corpus        N/A         N/A         N/A      0.85        0.69     0.76
MEMA            Mutation corpus        N/A         N/A         N/A      0.93        0.35     0.51
MuteXt          tinyGRAP               N/A         N/A         N/A      0.88        0.83     0.85

Table 5.5: Performance evaluation of residue-protein-organism entity association detection. The performance is compared with the other published point mutation detection systems: MutationGraB (Mutation corpus1) [LHC07]; and MEMA (Mutation corpus2) [RSMA+04]. Notice that MEMA identified only associations but without grounding. Performance was measured in terms of precision, recall, and F1 measure.

5.4.2 Performance study on the entity triplet association

The objective of the developed association detection system is to identify the entity triplet of organism, protein, and residue. In this section, the performance of this detection system is studied by comparing the predicted associations with the manually annotated associations in the gold standard corpus (GC).

With a precision of 0.82 and a recall of 0.38, the developed detection system is a reliable method for association detection, and the precision is comparable to other related reports (cf. table 5.5). In comparison to the systems MutationGraB and MuteXt, the low recall can be explained by the differences in the test corpora; both systems were evaluated on protein family specific full-text articles. The precision reported for MEMA is not directly comparable to that of this study, because MEMA identifies only associations without grounding to Uniprot entries.

Manual analysis isolated two main reasons for the low recall. First, the association of all three entities failed in several cases because the system did not find an association between protein and organism. Second, cases were encountered where a protein-organism association was correctly identified, but a protein-residue association could not be found. A detailed explanation is given in the discussion section.

Despite the low recall of this text mining module, the evaluation indicates that the developed method is able to detect associations of residue, protein, and organism. More

UTP
Dataset   Available   Extracted   Common   Precision   Recall   F1
XC1       70,401      77,407      62,068   0.82        0.88     0.85

URP
Dataset   Available   Extracted   Common   Precision   Recall   F1
XC2       10,152      10,876      9,325    0.86        0.92     0.89

Table 5.6: Performance evaluation of protein-organism and protein-residue entity association detection. A cross-validation corpus (XC) was obtained by first retrieving the abstract texts cited in UniProtKB from MEDLINE, searching them for tri-occurrences of the named entities residue, protein, and organism, and then retaining only those entries for which the identifier combination UTP (Uniprot identifier + NCBI Taxonomy identifier + PubMed identifier) was found in UniProtKB. The result is the test set XC1 for the protein-organism association study. XC2 is a subset of XC1, obtained by scanning for documents where the identifier combination URP (Uniprot identifier + Residue identifier + PubMed identifier) was validated by UniProtKB. Performance was measured in terms of precision, recall, and F1 measure.

importantly, the detected associations are in accordance with manually identified semantic relations between the three biological entities. With a precision of 0.82, the developed method is able to identify protein residues in biomedical texts precisely.

5.4.3 Cross-validation of identified residues with UniProtKB

In the previous section, the system for the association of the entity triplet organism, protein, and residue was evaluated manually on the gold standard corpus. The objective in this section is to perform an analysis on a larger test set by cross-validation with UniProtKB. For this task, the cross-validation corpora XC1 and XC2 were used. The analysis consists of a two-step association study, i.e. the associations of protein-organism and residue-protein were evaluated individually. Table 5.6 summarises the results.

With a precision of 0.82 and a recall of 0.88, the result for organism-protein association indicates that the system is able to extract correct semantic relations from XC1. The second step of the evaluation determines the performance of the residue-protein association detection. A similar precision score of 0.86 was determined, while the recall (0.92) was

triplet association/UTRP
Resource   Available   Extracted   Common   Precision   Recall   F1
GC         38          61          29       0.48        0.76     0.59
XC2        58          61          52       0.84        0.90     0.87

Table 5.7: A specialised performance evaluation between GC and XC2. The test set consists of the 26 common documents between GC and XC2. A comparison of the annotated entity triplet associations from both resources shows that the lists of targets are different.

more than twice as high as that of the triple entity association determined with GC (cf. table 5.5). This can be explained by the differences between the annotation methods used for the two test corpora. The entities and their associations in GC were determined manually and did not consider a grounding step.

To better compare the performance between the GC and XC2 data, the common set of 26 abstract texts from both corpora was studied (cf. section 5.2). When the URP information from the cross-validation corpus is reused, the determined performance is similar to the one evaluated on the whole XC2 dataset (compare table 5.7 with table 5.6). However, this result is different from the evaluation based on manual annotation. A detailed analysis shows that manual annotation determined 38 entity triplets, whereas XC2 lists 58 associations, and only 25 of these are common to both data sets (data not shown). This indicates that the annotated targets in GC and XC2 are different and cannot be compared directly.

The results indicate that the developed method is able to detect correct associations of residue, protein, and organism.

5.4.4 Identified residues in MEDLINE for Uniprot/PDB proteins

The developed text mining system annotates an identified protein residue in a text passage with references to its source protein and its hosting organism. Therefore, each MEDLINE

Figure 5.3: Identified protein residues in MEDLINE. From a MEDLINE extraction, a subset of 2,884 Uniprot proteins was identified, with cross-references to 14,007 PDB entries and a corresponding set of 18,427 MEDLINE records. In comparison, the citation set of the corresponding entries in UniProtKB has only 4,652 PMIDs. Only 657 out of 18,427 PMIDs are cross-validated by UniProtKB data. Dashed line = MEDLINE-based extraction; solid line = database values.

record with an identified protein residue can be used to update the citation set of the corresponding protein entry in UniProtKB, or in any other hyperlinked database, e.g. PDB (UniProtKB/PDB). In this study, the whole of MEDLINE was scanned with the developed protein residue identification method, and the determined set of PMIDs was compared with the citation sets in UniProtKB/PDB (cf. figure 5.3; for an overview of databank hyperlinks and citation references cf. section 2.1).

The protein residue identification system found a total of 40,750 MEDLINE records where residues were associated with co-mentioned proteins. The unique count of Uniprot proteins within the entity triplet associations is 9,354, where 2,884 out of 9,364 proteins have hyperlinks to 14,007 PDB entries. Corresponding to these 2,884 Uniprot proteins

is the set of 18,427 out of 40,750 PMIDs. In comparison, UniProtKB indexes a set of 4,652 PMIDs for these 2,884 Uniprot entries. A set analysis determined that the two datasets share 657 PMIDs. This means that only 3.6 per cent of the identified PMIDs can be cross-validated with UniProtKB (cf. figure 5.4). The low rate of rediscovery can be explained by the fact that most of the annotations in UniProtKB are derived from sections only available in full-text articles.

Although the analysis was based on MEDLINE abstracts, the extraction was able to find a large number of relevant abstract texts for citation expansion. With a precision of 0.82 (determined by gold standard evaluation), the estimated number of true positives in the PMID set is 15,110. In the context of the 4,652 citations from the database for the 2,884 Uniprot proteins, and considering the 657 rediscovered abstract texts, the result of the MEDLINE analysis expands the citation set roughly threefold. In conclusion, the presented text mining system can be used to determine relevant literature data for the update of the citation sets in UniProtKB/PDB.

The extracted abstract texts for those proteins provide the basis for functional annotation extraction.
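The figures quoted in this section can be re-derived with a few lines of arithmetic. The sketch below is only a worked check; in particular, the way the fold expansion is computed (estimated true positives minus the rediscovered citations, divided by the size of the curated citation set) is an assumption about how the estimate was obtained, and the function name is illustrative.

```python
from typing import Tuple


def citation_expansion(identified: int, overlap: int, curated: int,
                       precision: float) -> Tuple[float, float, float]:
    """Return (rediscovery rate, estimated true positives, fold expansion)."""
    rediscovery_rate = overlap / identified
    estimated_true = precision * identified
    # Assumption: only citations not already in the curated set count as new.
    fold = (estimated_true - overlap) / curated
    return rediscovery_rate, estimated_true, fold


rate, true_positives, fold = citation_expansion(
    identified=18427,  # PMIDs found by the residue identification system
    overlap=657,       # PMIDs also listed in UniProtKB
    curated=4652,      # PMIDs in UniProtKB for the 2,884 proteins
    precision=0.82,    # from the gold standard evaluation
)
summary = (round(rate * 100, 1), round(true_positives), round(fold, 1))  # (3.6, 15110, 3.1)
```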

5.5 Discussion

The presented text mining method identifies protein residues in biomedical texts. The first step is the recognition of the entities residue, protein, and organism in texts. The language expressions of the three biological entities are quite different. A residue entity, for example, is generally mentioned in the text by its three-letter abbreviation form + protein sequence position. The regular expression patterns were designed specifically for these and other derived expressions, which explains the high precision and recall of the residue entity recognition system. However, a residue can also be expressed by its one-letter abbreviation or syntactical form. While the latter expression is considered and implemented in this thesis, it was suggested that these expressions represent only a small

Figure 5.4: Cross-validation of citations from identified protein residues with UniProtKB/PDB. For a subset of UniProtKB/PDB proteins (i.e. proteins with UID and PDBID) the determined PMIDs can be cross-validated with the relevant citation set from UniProtKB. Dashed line = the number of common PMIDs; uni = UniProtKB/PDB based citations; med = protein residue identification based citations; comm = common set of citations between uni and med.

fraction [LHC07] in biomedical texts. The implementation of one-letter abbreviations would increase the recall, but the method would become less precise. For example, the matched string “C4” could be a nucleotide, a gene, an atom in a chemical compound, or any other acronym.

The identification of protein terminology in text is a great challenge in the biomedical text mining community. Protein names are not standardised, and the use of many alternative names is common, e.g. abbreviations, pet names, or synonymous names. In addition, there is no guideline for the construction of names; a name can therefore be short or long in terms of word count, e.g. “MAP kinase kinase” and “MAP kinase kinase kinase”. The developed protein entity recognition system is based on a lookup of names and synonyms in a dictionary. Because the entries are finite, syntactical variants of protein names cannot be detected if they are not covered by the dictionary. This explains the low recall of this ER system. In contrast, sub-matching of a whole protein name or the tagging of ambiguous protein names reduces the precision of the method. For example, “SNF” could be a protein in yeast or the funding agency “Swiss National Science Foundation”.

The principal method for organism entity recognition in this investigation is the same as for protein name identification. A list of terms from NCBI Taxonomy was utilised to generate an organism name dictionary. Although the developed method is the same as for protein entity recognition, the system yielded a higher performance. One explanation is that the dictionary contains predominantly unambiguous terms. However, some ambiguous terms can also be found, e.g. “RAT” could be a protein, an organism, or a method. To my knowledge, dedicated research on organism entity recognition has not been published, nor is a gold standard for performance evaluation available.
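The trade-off between three-letter and one-letter residue expressions discussed above can be made concrete with two regular expressions. These patterns are a simplified illustration and are not the patterns used in this thesis; the names `RESIDUE_3`, `RESIDUE_1` and `find_residues`, as well as the example sentence, are invented for this sketch.

```python
import re

# The twenty standard amino acids in three-letter notation.
THREE_LETTER = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
                "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")

# Three-letter abbreviation + sequence position, e.g. "Ser77", "His-142", "Lys 12".
RESIDUE_3 = re.compile(rf"\b({THREE_LETTER})[- ]?(\d+)\b")

# One-letter abbreviation + sequence position, e.g. "S77"; much more ambiguous.
RESIDUE_1 = re.compile(r"\b([ACDEFGHIKLMNPQRSTVWY])(\d+)\b")


def find_residues(text, pattern=RESIDUE_3):
    """Return (residue name, position) pairs matched by the given pattern."""
    return [(m.group(1), int(m.group(2))) for m in pattern.finditer(text)]


text = "Mutation of Ser77 and His-142 abolished binding, whereas C4 was unchanged."
three_letter_hits = find_residues(text)           # [('Ser', 77), ('His', 142)]
one_letter_hits = find_residues(text, RESIDUE_1)  # [('C', 4)]
```

In this example the one-letter pattern matches only "C4", which illustrates the ambiguity: without further context the string could equally denote a cysteine residue or something unrelated.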
Based on the detection of residue, protein, and organism entities in a text, the developed system identifies semantic relations between these biological entities. The approach is based on the idea of reusing explicitly stated relations contained in UniProtKB. The correct association between protein and residue relies on several factors: the ER performance; the correct protein sequence retrieval, which depends on the correct organism-protein association; and the correct alignment of a residue with a protein sequence at the specified position. On the one hand, a low recall in residue-protein association can be explained by a missing protein sequence variant in the repository. On the other hand, an incorrect protein-organism association leads to the retrieval of a wrong protein sequence. Another consideration is that the protein sequence in the database could deviate from the author’s data, because the two sides may have used different indexing rules. Conversely, the true positive rate can be blurred for the same reason, in that a non-corresponding residue sequence index can match a protein sequence by chance. One solution to this specific problem is to consider all residues of the same protein in the sequence alignment. However, this method may only be applicable to full-text analysis, as abstract texts rarely mention multiple residues of the same protein.

The evaluation of the entity recognition and the association detection systems was done by a manual analysis on the gold standard corpus, and by an automatic cross-validation study, for the following reasons. Protein annotations in UniProtKB are primarily derived from manual information extraction from full-text articles. Although a considerable amount of this information may not be present in MEDLINE, the combination of X+PMID, where X is either UID or TID, can be used to estimate the information extraction performance. However, the false positive rate in this cross-validation study cannot be determined, because the knowledge base is incomplete, even for the indexed citations. Therefore, manual evaluation on a gold standard test set has the advantage that both the false positive and the false negative rate can be studied.
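The suggestion of considering all residues of the same protein can be illustrated as follows. The sketch is an assumption about how such a check might be realised and is not part of the developed system: it tests which index offsets between the author's numbering and the database sequence are consistent with every mentioned residue. The sequence and the residue positions are invented.

```python
from typing import List, Tuple


def consistent_offsets(sequence: str,
                       residues: List[Tuple[str, int]],
                       max_shift: int = 5) -> List[int]:
    """Return all index shifts under which every residue matches the sequence.

    residues holds (one-letter code, 1-based position as stated in the text).
    """
    hits = []
    for shift in range(-max_shift, max_shift + 1):
        matches_all = True
        for amino_acid, position in residues:
            index = position + shift - 1  # convert to 0-based index
            if not (0 <= index < len(sequence)) or sequence[index] != amino_acid:
                matches_all = False
                break
        if matches_all:
            hits.append(shift)
    return hits


sequence = "MASKTHLSEQ"  # invented 10-residue sequence
single = consistent_offsets(sequence, [("S", 5)])            # [-2, 3]
double = consistent_offsets(sequence, [("S", 5), ("H", 8)])  # [-2]
```

With a single residue, two different shifts are compatible with the sequence, so a match may occur by chance; adding a second residue of the same protein leaves only one consistent shift.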
An identified protein residue is annotated with references to its source protein (Uniprot identifier) and the hosting organism (NCBI Taxonomy identifier). Based on these annotations, a link can be made between MEDLINE and biological knowledge bases. One immediate application is to scan MEDLINE for protein residues and to use the Uniprot identifier annotations in combination with the MEDLINE identifier (or PubMed identifier; PMID) to update the citation sets of the corresponding Uniprot entries. The significance of this approach was studied by automatic cross-validation analysis. The results indicate that only a small proportion of Uniprot proteins can be found and associated with residues from MEDLINE analysis, and that the identified set of PMIDs has only a small overlap with the corresponding citation sets. One explanation is that the annotations were extracted from full-text articles, where the same information is not present in the abstract texts; they represent the true negative fraction in the sense that the information cannot be identified from abstract sections. Another explanation is based on the fact that curators provide only a list of relevant citations from a batch of processed biomedical articles. In other words, the information on irrelevant citations (false positives) or the complete list of true positive citations from the sample of reviewed biomedical articles is not available in UniProtKB; this information would have allowed a more precise evaluation.

5.6 Conclusion

The developed text mining solution identifies protein residues in text and annotates them with references to UniProtKB and NCBI Taxonomy. Based on these references, a link between MEDLINE and UniProtKB is created. Although the identification of protein residues in MEDLINE does not necessarily mean that functional annotations are present in the abstract texts, the analysis is a prerequisite for the mining of functional annotations. The extraction of contextual features as annotations of a protein residue is the topic of the following chapter.

Chapter 6

Information extraction from the context of a residue in text

In the previous chapter, I introduced a method for the identification of protein residues in biomedical texts. The objective in this chapter is to extract textual features from the context of protein residues that can be used as functional annotation. Because a terminological resource is not utilised, the developed method can discover new information from text. The extracted contextual features are then enriched with semantic labels according to a categorisation scheme. The design of this scheme was data-driven, and it contains concepts of biological interest. The overall result of this text mining solution is the annotation of protein residues with text segments that are classified by a set of biological categories.

6.1 Algorithms

The developed information extraction system can be divided into two parts: extraction of contextual features associated with protein residues, and classification of the extracted textual features. Figure 6.1 illustrates the procedures involved in the developed information extraction system.

Figure 6.1: Overview of processes and evaluation methods of the developed contextual feature extraction system.

6.1.1 Extraction of contextual features

Theory

Finding functional annotations of protein residues in biomedical text. In this study, several assumptions were made for the extraction of functional annotations from biomedical texts; they are explained in the following. The first assumption is that noun phrases in a text are semantically rich in the sense that they are able to represent a subject content (keyword) [JK95]. Consequently, they are good candidates for textual features for the functional annotation of protein residues. The second assumption is that a biological function of a protein residue can be found as a verbal or nominal expression in natural language. In other words, a syntactical relation between a residue and a term can capture their semantic relation. Therefore, a syntactical analysis of a sentence enables the identification of an explicitly stated biological function. For example, from the phrase

“A inhibits B by phosphorylation of C”,

the relations

A—inhibits—B-by-phosphorylation-of-C
A—inhibits—B-by-phosphorylation
A—inhibits—B
UNK—phosphorylate—C

can be identified. Although the identification of a residue-keyword association can be attempted with co-occurrence analysis, the target is to extract reliable associations together with contextual information on the association. In other words, the type of association expressed by a verb or by a preposition, and the context expressed by a prepositional phrase, are important pieces of information that represent a justifiable functional annotation. A discussion on semantic relation and syntactical relation extraction can be found in section 2.3.2.

Generally, to identify descriptions of biological function in text, the terminology of GO can be reused. However, this ontology is not specialised on protein residues; for example, the term “active site” does not even appear as a stand-alone term in the repository. Generally, a description of protein function refers to a higher level of biological function, e.g. metabolomics or cell signalling. In contrast, the annotation of protein residues requires a different set of terms that describe molecular interactions or chemical reactions. Because a suitable terminological resource is not available, the extraction of syntactical relations focuses on semantic relations with the elements residue entity and contextual feature (keyword).

The following demonstrates how a description of function can be identified from a parsed sentence. Given the example sentence from MEDLINE

“Parathyroid hormone inhibits renal phosphate transport by phosphorylation of serine 77 of sodium-hydrogen exchanger regulatory factor-1.” (PMID:17975671),

a syntactical analysis produces the following phrase structure representation

[Parathyroid hormone]/NP [inhibits]/V [renal phosphate transport]/NP [by]/P [phosphorylation]/NP [of]/P [serine 77]/NP [of]/P [sodium-hydrogen exchanger regulatory factor-1]/NP,

where NP is a noun phrase, P a preposition, and V a verb. From this parsed sentence, the following semantic relations can be determined:

Parathyroid hormone—inhibits—renal phosphate transport-by-phosphorylation-of-serine 77
Parathyroid hormone—inhibits—renal phosphate transport-by-phosphorylation
Parathyroid hormone—inhibits—renal phosphate transport
UNK—phosphorylate—serine 77.

In the next section, a template for storing the extracted relation information is discussed.

Semantic representation of extracted relations. The objective of syntactical relation extraction is to identify biological relations in a sentence, i.e. a semantic relation between a residue entity and a term. Because the result is a set of syntactical relations with different levels of contextual specification (cf. the example in the previous section), a suitable data collation method is necessary to avoid data redundancy. That is, the set of relations determined within a given syntactic frame contains relations that are specifications of others. For example, the relation

A—inhibits—B-by-phosphorylation, is a specification of the relation

A—inhibits—B.

Here, the predicate-argument structure (PAS) is proposed as a semantic representation of extracted syntactical relations. A PAS is a template for information extraction in which the predicate and the arguments are the slots to be filled. In this study, the predicate (pred) of a PAS is defined as the verb, while the arguments of the verb are the numerically labelled arguments arg1 and arg2, or arguments with higher numerical labels. The arg1 label is assigned to arguments that are understood as agents, causers, or experiencers, i.e. the semantic subject. Conversely, the arg2 label is usually assigned to the patient argument, i.e. the argument that undergoes the change of state or is affected by the action.

The transformation of the extracted relations into PAS data does not analyse the semantic roles of the verb arguments, i.e. argument modifiers such as location, time, or cause. Noun phrases of the extracted relations can have prepositional attachments, and the prepositions are often indicators of the thematic roles of the verb arguments. Therefore, prepositional phrases are listed as modifiers of arguments with labels of the form main argument label + preposition, e.g. arg1-of and arg2-by. The following illustrates the transformation of relations into a PAS for the previous example:

pred = inhibit
arg1 = Parathyroid hormone
arg2 = renal phosphate transport
arg2-by = phosphorylation
arg2-of = serine 77,

which corresponds to the following verb frame set:

inhibit sub-arg1 obj-arg2 P by-arg2 P of-arg2.
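As an illustration only, the slot filling described above can be sketched in Python. The sketch assumes that the sentence has already been chunked into noun phrases (NP), a verb (V) and prepositions (P). The names fill_pas, _attach and LEMMAS are hypothetical and are not part of the system described in this chapter, and keeping only the first attachment per preposition is a simplification chosen to reproduce the example.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, str]  # (text, tag), with tag in {"NP", "V", "P"}

# Hypothetical lemma lookup, used only for this illustration.
LEMMAS = {"inhibits": "inhibit", "abolishes": "abolish"}


def _attach(side: List[Chunk], label: str, pas: Dict[str, str]) -> None:
    """Store a leading NP under `label` and each 'P NP' pair under `label-P`."""
    i = 0
    if side and side[0][1] == "NP":
        pas[label] = side[0][0]
        i = 1
    while i + 1 < len(side):
        (prep, tag1), (text, tag2) = side[i], side[i + 1]
        if tag1 == "P" and tag2 == "NP":
            # Simplification: keep only the first attachment per preposition.
            pas.setdefault(f"{label}-{prep}", text)
            i += 2
        else:
            i += 1


def fill_pas(chunks: List[Chunk]) -> Dict[str, str]:
    """Fill the PAS template from chunks of the form NP (P NP)* V NP? (P NP)*."""
    verb_idx = next(i for i, (_, tag) in enumerate(chunks) if tag == "V")
    verb = chunks[verb_idx][0]
    pas: Dict[str, str] = {"pred": LEMMAS.get(verb, verb)}
    _attach(chunks[:verb_idx], "arg1", pas)      # material left of the verb
    _attach(chunks[verb_idx + 1:], "arg2", pas)  # material right of the verb
    return pas


example = [
    ("Parathyroid hormone", "NP"), ("inhibits", "V"),
    ("renal phosphate transport", "NP"), ("by", "P"), ("phosphorylation", "NP"),
    ("of", "P"), ("serine 77", "NP"),
    ("of", "P"), ("sodium-hydrogen exchanger regulatory factor-1", "NP"),
]
pas = fill_pas(example)
```

For the example sentence, the sketch fills the same five slots as listed above. A fuller implementation would have to decide how to store repeated prepositions, which this sketch silently drops.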

Note that the PAS defined here does not conform to the PAS schemes of some proposition banks, e.g. PropBank or PASBio. For example, for the verb “inhibit” PropBank lists the following frame sets:

inhibit sub-ARG0 obj-ARG1
inhibit sub-ARG0 S-ARG1,

while additional arguments are not defined (note that ARG0 in PropBank is equivalent to arg1 in this definition, and ARG1 corresponds to arg2). Although verb frame sets from publicly available proposition banks could be used in this study, the verbs they list cover only a small part of the verbs that co-occur with residue mentions in MEDLINE. This low coverage and the non-domain-specific verb frame sets are the main reasons why these resources were not reused.

Implementation

The extraction of contextual features is based on a syntactical analysis of natural language sentences. Two approaches were developed in this work and compared in the performance evaluation study: shallow parser based relation extraction, and full parser based relation extraction.

Shallow parser based relation extraction. The first approach was to develop a shallow parser, which aims to find the boundaries of major constituents in a sentence, such as noun phrases. The design is based on heuristics and on the idea of finding general relations between closed-class English words [LCM03]. The parser reported there finds verbal relations between noun phrases, and prepositional relations for a set of the most frequent prepositions, i.e. “of”, “in”, and “by”. Here, the parser is implemented as a general relation extraction method in which the list of prepositions is not limited to these three. The purpose is to find more contextual features, and thereby discover more information.

Initially, an abstract was split into sentences, which were then annotated with part-of-speech (POS) tags using the CISTAGGER. The tagger was trained on the CISLEX lexical resource, which contains a rich terminological set of the biomedical domain [Gue96]. Based on a rule set and the POS information, the developed shallow parser identified noun phrases, verb groups, verb phrases, and prepositional phrases in the analysed sentences:

NP = Det? (Adj|Adv|N)* N
PP = P NP
VG = (Adv|Aux|V|InfTo)* V
VP = VG NP PP*.

N is a noun, Det a determiner, Adj an adjective, Adv an adverb, P a preposition, PP a prepositional phrase, VP a verb phrase, and VG a verb group. Note that the grammar does not consider coordinating conjunctions, e.g. with “and”, “or” and “,”. The grammar can easily be extended to capture conjunctions by

NPx = NP (CC NP)*, where

CC = (”and” | ”or” | ”,”){1,2}.

However, the pattern would then also find false positives as illustrated in the following example. The sentence

”Highly conserved phosphopantothenate binding residues include Asn59, Ala179, Ala180, and Asp183 from one monomer and Arg55’ from the adjacent monomer.” (PMID:12906824), contains the noun phrases

NP1 = ”Asn59, Ala179, Ala180, and Asp183 from one monomer” NP2 = ”Arg55’ from the adjacent monomer”.

The extended pattern would have extracted a single noun phrase, from which the correct post-nominal prepositional phrase attachment cannot easily be identified:

NPx = ”Asn59, Ala179, Ala180, and Asp183 from one monomer and Arg55’ from the adjacent monomer”.
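The base grammar given earlier in this section (without the NPx extension) can be sketched as regular expressions over a sequence of POS tags. This is a minimal illustration, not the parser developed in this work: the one-letter tag encoding, the names TAG_CODE and chunk, and the hand-assigned tags in the example are all assumptions. The sketch covers NP, PP and VG only and does not assemble verb phrases.

```python
import re
from typing import List, Tuple

# One-letter codes for the POS classes of the grammar (the encoding is an assumption).
TAG_CODE = {"Det": "D", "Adj": "J", "Adv": "R", "N": "N",
            "P": "P", "V": "V", "Aux": "X", "InfTo": "T"}

NP = r"D?[JRN]*N"   # NP = Det? (Adj|Adv|N)* N
PP = rf"P{NP}"      # PP = P NP
VG = r"[RXVT]*V"    # VG = (Adv|Aux|V|InfTo)* V

# PP is tried first so that a preposition is grouped with its noun phrase.
CHUNK = re.compile(rf"(?P<PP>{PP})|(?P<NP>{NP})|(?P<VG>{VG})")


def chunk(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (phrase type, text) pairs for a POS-tagged sentence."""
    # Tags outside the grammar are encoded as "O" and never match a pattern.
    codes = "".join(TAG_CODE.get(tag, "O") for _, tag in tagged)
    phrases = []
    for match in CHUNK.finditer(codes):
        words = " ".join(word for word, _ in tagged[match.start():match.end()])
        phrases.append((match.lastgroup, words))
    return phrases


# Hand-tagged example; the tags are illustrative, not CISTAGGER output.
sentence = [("Parathyroid", "N"), ("hormone", "N"), ("inhibits", "V"),
            ("renal", "Adj"), ("phosphate", "N"), ("transport", "N"),
            ("by", "P"), ("phosphorylation", "N"),
            ("of", "P"), ("serine", "N"), ("77", "N")]
phrases = chunk(sentence)
```

Because each token is encoded as exactly one character, match offsets in the code string map directly back to token positions, which keeps the sketch short.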

Based on the determined phrase structure, the parser then extracts verbal relations of noun phrases or prepositional phrases. A condition of the extraction is that at least one relation element must contain one or more residue mentions:

REL = NP PP* VP.

The extracted relation is then transformed to fill the slots of the predefined PAS template.

Full parser based relation extraction. The second approach to contextual feature extraction uses the full parser ENJU [MT05] (version 2.3), which generates a so-called head-driven parse tree from a sentence. The advantage of this parser is that it uses a parsing model adapted to biomedical text. The parser generates predicate-argument relations between words. Because the generated output contains a lot of information, different interpretations are possible. In this study, a wrapper was developed that converts the parser’s output into the presented PAS data format. The assumption is that the phrase structure of a verb argument can be found by following the direct links of a verb to its arguments in the tree, and then collecting all the sub-branches of each argument. The identified NP PP* VP structures are then decomposed to fill the PAS template.

6.1.2 Categorisation of contextual features

Theory

A PAS captures a verb frame within a sentence, where the arguments may represent subject content. A semantic interpretation is needed to evaluate the relevance of these arguments. Here, a classification method was developed that automatically assigns semantic labels to the arguments of a PAS. For this task, categories have to be defined that are suitable labels for interpreting the information. Although an ontological model of protein residue function is not available, there are two approaches to this problem. The first is to adopt annotation schemes from protein databases, e.g. UniProtKB. This represents a top-down approach. One motivation for reusing the categorisation scheme of UniProtKB is that information classified with this scheme can be used directly to update the relevant fields in the database.

Alternatively, a bottom-up approach can propose new categories. In this study, text segments from MEDLINE were analysed to determine whether they represent suitable functional annotations for residues. The result is an overview of the information distribution in MEDLINE, which led to the proposal of a categorisation scheme. The categories defined in both schemes are compared in table 6.1. Both categorisation schemes reflect concepts of biological interest. However, the bottom-up approach has the advantage that the proposed categories are data-driven, while in a top-down approach examples of listed categories may not be present in natural language text, or other categories may be missing from the scheme.

The assignment of categories to contextual features is based on the endogenous classification approach [Cer00]. In contrast, the exogenous, i.e. corpus-based, approach requires large amounts of contextual cues, which are difficult to obtain. According to the author, the endogenous approach produces results more reliably, even under conditions of sparse data.

From a reference set of terms with manually assigned labels according to a categorisation scheme, the algorithm computes the mutual information of the lexical constituents of terms and their assigned categories. These scores are then used to calculate and select the highest scoring association of a term and a category. The algorithm was re-implemented and used in this study.

Implementation

The semantic interpretation of contextual features, which are the arguments of the extracted PAS, relies on the endogenous classification approach described by [Cer00]. The method was re-implemented in this study. The algorithm relies only on the mutual information of the lexical constituents of terms and their assigned categories. During the training phase, lexical constituents of multi-word terms were extracted from a labelled reference set. They represent the features of the predefined categories.

MAN category: definition
    Corresponding FEAT category: definition

STR COMP: Structure component. Class denoting concepts that represent pieces and parts of the protein structure.
    DOMAIN: Extent of a domain, which is defined as a specific combination of secondary structures organised into a characteristic three-dimensional structure or fold.
    MOTIF: Short (up to 20 amino acids) sequence motif of biological interest.
    TOPO DOM: Topological domain.
    CHAIN: Extent of a polypeptide chain in the mature protein.
    TRANSMEM: Extent of a transmembrane region.
    COILED: Extent of a coiled-coil region.

CHEM MOD: Chemical modification. Class denoting changes to the protein sequence and the chemical composition.
    VARIANT: Authors report that sequence variants exist.
    MOD RES: Posttranslational modification of a residue.
    PEPTIDE: Extent of a released active peptide.
    VAR SEQ: Description of sequence variants produced by alternative splicing, alternative promoter usage, alternative initiation and ribosomal frameshifting.
    LIPID: Covalent binding of a lipid moiety.
    CARBOHYD: Glycosylation site.

STR MOD: Structural modification. Class denoting the changes to the protein structure without changes to the chemical composition.
    REGION: Extent of a region of interest in the sequence.
    SITE: Any interesting single amino-acid site on the sequence, that is not defined by another feature key.

BINDING: Binding type. Class denoting different physico-chemical forces leading to a bond formation between a protein structure component and a chemical entity.
    BINDING: Binding site for any chemical group (co-enzyme, prosthetic group, etc.).
    METAL: Binding site for a metal ion.
    DISULFID: Disulfide bond.
    CROSSLNK: Posttranslationally formed amino acid bonds.
    DNA BIND: Extent of a DNA-binding region.
    NP BIND: Extent of a nucleotide phosphate-binding region.
    ZN FING: Extent of a zinc finger region.
    CA BIND: Extent of a calcium-binding region.

ENZ ACT: Enzymatic activity. Types of enzymatic reactions as a subpart to protein functions.
    ACT SITE: Amino acid(s) involved in the activity of an enzyme.

CELL: Cellular phenotype. Class denoting different cellular phenotypes that can be affected by structural or compositional changes of a protein.
    N/A

Table 6.1: Biological categories for the classification of protein residue related information. Two sets of schemes were used: a text data motivated definition of categories (MAN) determined from manual analysis of sentences with annotations for protein residues from MEDLINE, and key categories from the feature table of UniProtKB (FEAT).

The association between a feature (w) and a category (c) was estimated from their mutual information score

\[ I(w, c) = \log_2 \frac{P(w, c)}{P(w)\,P(c)}. \tag{6.1} \]

The association between the multi-word term $T = \{w_i\}_{i=1}^{n}$ and a category $c$ was computed as the sum of the associations of its words

\[ A(T, c) = P^{*}(c) \sum_{i=1}^{n} I(w_i, c), \tag{6.2} \]

where $P^{*}(c)$ is the probability of a category associated with a term. The categorisation of a multi-word term into one of the categories amounts to the identification of the best fitting category $c^{*}$ for the term, based on the words in the term

\[ c^{*} = \arg\max_{c} A(T, c). \tag{6.3} \]
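Equations 6.1 to 6.3 can be sketched in a few lines of Python. This is not the re-implementation used in this study. The probability estimates are assumptions: P(w, c), P(w) and P(c) are taken as relative frequencies over word occurrences in the labelled reference set, P*(c) as the fraction of reference terms labelled c, and word-category pairs never seen in training contribute zero. The names train and classify and the toy reference set are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def train(reference: List[Tuple[str, str]]):
    """Estimate I(w, c) (eq. 6.1) and P*(c) from (term, category) pairs."""
    pair, word, cat_words, cat_terms = Counter(), Counter(), Counter(), Counter()
    for term, cat in reference:
        cat_terms[cat] += 1
        for w in term.lower().split():
            pair[(w, cat)] += 1
            word[w] += 1
            cat_words[cat] += 1
    n_words = sum(word.values())
    n_terms = sum(cat_terms.values())
    mi = {
        (w, c): math.log2((k / n_words)
                          / ((word[w] / n_words) * (cat_words[c] / n_words)))
        for (w, c), k in pair.items()
    }
    prior = {c: k / n_terms for c, k in cat_terms.items()}
    return mi, prior


def classify(term: str, mi: Dict[Tuple[str, str], float],
             prior: Dict[str, float]) -> str:
    """Return c* = argmax_c A(T, c) (eqs. 6.2 and 6.3)."""
    words = term.lower().split()
    # Assumption: unseen (word, category) pairs contribute 0 to the sum.
    scores = {c: p * sum(mi.get((w, c), 0.0) for w in words)
              for c, p in prior.items()}
    return max(scores, key=scores.get)


# Toy reference set, invented for illustration only.
reference = [
    ("complex formation", "BINDING"), ("disulfide bond", "BINDING"),
    ("point mutation", "CHEM MOD"), ("missense mutation", "CHEM MOD"),
    ("beta sheet", "STR COMP"), ("alpha helix", "STR COMP"),
]
mi, prior = train(reference)
label = classify("covalent complex formation", mi, prior)
```

In the toy set, the unseen word "covalent" contributes nothing, so the score is driven by "complex" and "formation". Treating unseen pairs as zero is a simplification: a word with negative mutual information for its own category would score below a category it never occurred with.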

The reference set was generated by using maximal length noun phrase (MLNP) analysis. The assumption of this approach is that textual features co-occurring with a residue within a noun phrase (NPr) are good candidate terms for functional annotation. In order to identify the boundaries of these candidate terms, the MLNP algorithm relies on the lookup of a determined set of noun phrases without nested residue entities (NP¬r). In other words, the algorithm assumes that nested terms in NPr are also expressed as stand-alone noun phrases, which can be identified by a broad syntactical analysis on MEDLINE. The following is an example for illustration. Consider the term

”complex formation”,

which is identified as a stand-alone noun phrase NP¬r in the sentence

“The GlyNH2 was removed and the reactive-site peptide bond X18-Glu19 was synthesized by complex formation with proteinase K.” (PMID:9047374).

The same term co-occurs with a residue entity within another noun phrase (NPr)

”Rb-E2F-DNA complex formation” in the sentence

“MDM2 also interacts with Rb through its central acidic domain and inhibits Rb function in part by blocking Rb-E2F-DNA complex formation.” (PMID:16337594).

The determined MLNP in this example is “complex formation”. Once the set of MLNPs had been extracted, each item (NP) was manually labelled according to a categorisation scheme. Within this study, two categorisation schemes (cf. table 6.1) were used independently and studied: the categories defined by manual analysis of MEDLINE sentences (bottom-up approach), and the categories defined as keys in the feature table of UniProtKB (top-down approach). The sets of categories from the bottom-up approach and from the top-down approach are referred to as MAN and FEAT in this study. Table 6.2 compares the distribution of labels within the reference set. The following example illustrates how a determined MLNP can be used to find relevant information in the contextual features of a protein residue. From the sentence

MAN category   Frequency   FEAT category   Frequency

STR COMP       433         DOMAIN          28
                           MOTIF           8
                           TOPO DOM        4
                           CHAIN           2
                           TRANSMEM        2
                           COIL            1
CHEM MOD       361         VARIANT         275
                           MOD RES         59
                           PEPTIDE         13
                           VAR SEQ         6
                           LIPID           3
                           CARBOHYD        1
STR MOD        25          REGION          100
                           SITE            246
BINDING        195         BINDING         139
                           METAL           25
                           DISULFID        11
                           CROSSLNK        10
                           DNA BIND        6
                           NP BIND         5
                           ZN FING         2
                           CA BIND         1
ENZ ACT        90          ACT SITE        110
CELL           161         N/A
GEN BIOL       2,172       GEN BIOL        2,372
GEN ENG        643         GEN ENG         651

Table 6.2: Category distribution in the text feature reference set. The text feature reference set was compiled from maximal length noun phrase analysis (MLNP) from two sets of noun phrases: one without residue mentions and the other with identified protein residue entities. The features in the reference set were manually assigned with labels of the categorisation scheme MAN and FEAT. GEN BIOL = general biological terminologies; GEN ENG = general English words.

“Mutation K241Q completely abolishes DNA glycosylase activity and covalent complex formation in the presence of NaBH4.” (PMID:9241232),

the following relation can be identified

mutation K241Q—abolish—covalent complex formation.

A semantic label can be assigned to the relation argument ”covalent complex formation” because the term ”complex formation” is labelled in the reference set.
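The lookup of nested terms described in this section can be sketched as a longest-match search over the word sequence of a noun phrase. The sketch below is only an approximation under stated assumptions: noun phrases are split on whitespace, the set of stand-alone noun phrases (NP¬r) is given as lower-case strings, and the function name maximal_length_np and the small example sets are hypothetical.

```python
from typing import Dict, Optional, Set


def maximal_length_np(noun_phrase: str, standalone_nps: Set[str]) -> Optional[str]:
    """Return the longest contiguous word sequence that is a stand-alone NP."""
    tokens = noun_phrase.split()
    n = len(tokens)
    for length in range(n, 0, -1):              # longest candidates first
        for start in range(0, n - length + 1):  # leftmost candidate first
            candidate = " ".join(tokens[start:start + length])
            if candidate.lower() in standalone_nps:
                return candidate
    return None


# Illustrative stand-alone noun phrases and reference labels (not the thesis data).
standalone = {"complex formation", "formation", "dna glycosylase activity"}
reference_labels: Dict[str, str] = {"complex formation": "BINDING"}

mlnp = maximal_length_np("Rb-E2F-DNA complex formation", standalone)
nested = maximal_length_np("covalent complex formation", standalone)
label = reference_labels.get(nested.lower()) if nested else None
```

The second call mirrors the example above: the relation argument "covalent complex formation" is not itself in the reference set, but its nested term "complex formation" is, so the label of the nested term can be reused. The label BINDING is invented for the illustration.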

6.2 Evaluation methods

The extraction of contextual features of residues results in a set of syntactical relations, which are represented as PAS. The performance of this extraction module was evaluated by comparing the returned PAS data with the manual annotations in the gold standard test corpus (cf. section 5.2). A true positive was counted if the syntactical relations in a PAS were correct, and if the arguments in the PAS contained the annotated residue entity and the marked keyword(s) in the test corpus. If any of these conditions was not met, a false positive was registered. The performance was measured in terms of precision, recall and F1-measure, as described earlier in section 5.3.

The performance of the developed classification method was evaluated by 100 repetitions of 5-fold cross-validation. For each iteration, the terms in the reference set were shuffled and partitioned into a test set (1/5 of the data) and a training set (4/5 of the data). The average precision, recall and F1-measure (cf. section 5.3) were calculated for each classifier from the determined confusion matrix.
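The repeated cross-validation protocol can be sketched as a generator of training and test partitions. The text does not state whether each shuffle was followed by a single 1/5 versus 4/5 split or by all five folds. The sketch assumes the latter, and the function name repeated_kfold and the fixed random seed are hypothetical.

```python
import random
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def repeated_kfold(items: Sequence[T], k: int = 5, repeats: int = 100,
                   seed: int = 0) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (training set, test set) pairs for `repeats` shuffled k-fold runs."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    for _ in range(repeats):
        shuffled = list(items)
        rng.shuffle(shuffled)
        for fold in range(k):
            # Every k-th item, starting at `fold`, forms the test set.
            test = shuffled[fold::k]
            train = [x for i, x in enumerate(shuffled) if i % k != fold]
            yield train, test


terms = [f"term{i}" for i in range(20)]
splits = list(repeated_kfold(terms, k=5, repeats=3))
```

Each yielded pair can be passed to a classifier's training and prediction steps, and the predictions accumulated into one confusion matrix over all iterations.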

Method            PAS available   PAS extracted   PAS common   Precision   Recall   F1

Shallow parsing   117             82              56           0.68        0.48     0.56
Full parsing      117             86              32           0.37        0.27     0.31

Table 6.3: Evaluation of syntactical language parser performance. The performance of the two language parsers (shallow and full parsing) were evaluated on the basis of precision, recall and F1 measures by comparing the annotated PAS data in the test set with the returned PAS output from the parsers.
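The measures in table 6.3 follow from the three counts in each row. The sketch below assumes the usual definitions: precision is common / extracted, recall is common / available, and F1 is their harmonic mean. The function name prf is hypothetical.

```python
from typing import Tuple


def prf(available: int, extracted: int, common: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from annotated, returned and matching PAS counts."""
    precision = common / extracted if extracted else 0.0
    recall = common / available if available else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


shallow = prf(available=117, extracted=82, common=56)
full = prf(available=117, extracted=86, common=32)
```

For both rows, the rounded precision and recall agree with the table. The F1 value for full parsing evaluates to about 0.315 from the unrounded counts, which lies within rounding distance of the reported 0.31.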

6.3 Results

In this section, the performances of contextual feature extraction and categorisation are studied. The test dataset is the gold standard corpus.

6.3.1 Contextual feature extraction evaluated

The objective of contextual feature extraction is to find textual features that are suitable as functional annotations for protein residues. In this section, the performance of the extraction system is studied by comparing the results produced with two different language parsers: the shallow parser and the full parser. Sentences from the gold standard corpus (GC) were used as the test dataset for this analysis. The analysis determined that the developed shallow parser performs better than the full parser ENJU. The shallow parser yielded an F1 measure of 0.56 (precision of 0.68 and recall of 0.48), while the full parser ENJU yielded an F1 measure of 0.31 (precision of 0.37 and recall of 0.27) (cf. table 6.3). The results suggest that contextual information of a residue entity can be extracted by syntactical analysis with an F1 measure of 0.56 and 0.31 for shallow parsing and full parsing, respectively.

6.3.2 Performance analysis of the classifiers

One problem in functional annotation extraction is the semantic interpretation of the extracted text data. The solution proposed in this work is based on a classification approach. Two different categorisation schemes were tested in this study: MAN and FEAT. The performance of the developed classification method was evaluated by repeated cross-validation studies. Table 6.5 summarises the results from the determined confusion matrix (cf. table 6.4).

For MAN, the top three performing classifiers, with F1 measures of 0.62, 0.57, and 0.57, are STR COMP (precision of 0.56, recall of 0.69), CHEM MOD (precision of 0.54, recall of 0.59) and BINDING (precision of 0.63, recall of 0.52). The whole classification system for this categorisation scheme yielded an average precision of 0.48 and an average recall of 0.42. In comparison, the classification based on FEAT has a much lower average performance: average precision of 0.24, average recall of 0.18. The weak performance of the FEAT classifiers is explained by the distribution of examples in the categories; for some categories the number of corresponding features or examples is low (cf. table 6.2). A discussion is presented in section 6.4.

Examining the false positive rate in the confusion matrix of MAN reveals that the classifiers are confused with the category GEN BIOL (general biological terms) or GEN ENG (general English terms). This is not surprising considering that English terms are ambiguous. In addition, some categories show confusions with others, e.g. STR COMP with CHEM MOD, and ENZ ACT with STR COMP. One explanation is that some terms can be assigned to more than one category. For example, “mutant structure” refers to an altered protein structure state, which is based on a chemical change in the protein sequence. Despite the average performance of some classifiers, the presented method can be used to assign categories to textual features. However, significant improvements in the performance of some classifiers are necessary before the system can be used automatically.

Rows: actual category. Columns: predicted category.

Actual      BINDING   GEN BIOL   CELL   CHEM MOD   GEN ENG   ENZ ACT   STR COMP   STR MOD

BINDING     1,772     762        28     93         165       26        546        0
GEN BIOL    560       15,815     525    1,496      4,514     159       1,714      65
CELL        96        1,167      836    150        325       91        67         0
CHEM MOD    38        1,103      12     3,742      761       79        546        25
GEN ENG     144       2,556      126    510        1,820     46        480        35
ENZ ACT     33        338        80     201        226       324       457        0
STR COMP    160       783        64     551        592       35        4,914      11
STR MOD     1         91         1      129        125       0         21         43

Table 6.4: Performance analysis of the classifiers (confusion matrix). Classification with categories from MAN was analysed by cross-validation studies with 100 iterations. The result is represented as a confusion matrix.

MAN category   Precision   Recall   F1     FEAT category   Precision   Recall   F1

STR COMP       0.56        0.69     0.62   DOMAIN          0.50        0.24     0.32
                                           MOTIF           0.98        0.36     0.53
                                           TOPO DOM        0           0        0
                                           CHAIN           0           0        0
                                           TRANSMEM        0           0        0
                                           COIL            0           0        0
CHEM MOD       0.54        0.59     0.57   VARIANT         0.50        0.69     0.58
                                           MOD RES         0.40        0.23     0.29
                                           PEPTIDE         0.05        0.06     0.05
                                           VAR SEQ         0           0        0
                                           LIPID           1           0.32     0.48
                                           CARBOHYD        0           0        0
STR MOD        0.24        0.10     0.15   REGION          0.44        0.44     0.44
                                           SITE            0.40        0.55     0.46
BINDING        0.63        0.52     0.57   BINDING         0.41        0.45     0.43
                                           METAL           0.05        0.02     0.03
                                           DISULFID        0.53        0.15     0.23
                                           CROSSLNK        0           0        0
                                           DNA BIND        0           0        0
                                           NP BIND         0           0.06     0
                                           ZN FING         0           0        0
                                           CA BIND         0           0        0
ENZ ACT        0.43        0.20     0.27   ACT SITE        0.45        0.31     0.36
CELL           0.50        0.31     0.38   N/A
GEN BIOL       0.70        0.64     0.67   GEN BIOL        0.76        0.65     0.70
GEN ENG        0.21        0.32     0.26   GEN ENG         0.23        0.32     0.27

Average        0.48        0.42     0.43   Average         0.25        0.18     0.19

Table 6.5: Performance evaluation of the classifiers (precision, recall, F1 measure). Evaluation of classification of textual features (noun phrases). Classification with categories from MAN and FEAT was analysed by cross-validation studies with 100 iterations. The performance was measured in terms of precision, recall, and F1 measure.
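The per-category precision and recall of the MAN classifiers can be recomputed from the confusion matrix in table 6.4. The sketch below assumes that rows hold the actual category and columns the predicted category. The names LABELS, MATRIX and per_class are hypothetical.

```python
from typing import Dict, List, Tuple

LABELS = ["BINDING", "GEN BIOL", "CELL", "CHEM MOD",
          "GEN ENG", "ENZ ACT", "STR COMP", "STR MOD"]

# Counts from table 6.4: rows are actual, columns are predicted categories.
MATRIX = [
    [1772, 762, 28, 93, 165, 26, 546, 0],
    [560, 15815, 525, 1496, 4514, 159, 1714, 65],
    [96, 1167, 836, 150, 325, 91, 67, 0],
    [38, 1103, 12, 3742, 761, 79, 546, 25],
    [144, 2556, 126, 510, 1820, 46, 480, 35],
    [33, 338, 80, 201, 226, 324, 457, 0],
    [160, 783, 64, 551, 592, 35, 4914, 11],
    [1, 91, 1, 129, 125, 0, 21, 43],
]


def per_class(matrix: List[List[int]],
              labels: List[str]) -> Dict[str, Tuple[float, float]]:
    """Return {category: (precision, recall)} from a square confusion matrix."""
    scores = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


scores = per_class(MATRIX, LABELS)
avg_precision = sum(p for p, _ in scores.values()) / len(scores)
avg_recall = sum(r for _, r in scores.values()) / len(scores)
```

Rounded to two decimals, the values obtained in this way agree with the MAN columns of table 6.5, including the averages of 0.48 and 0.42, which supports the reading of rows as actual and columns as predicted categories.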

One option is to increase the amount of training data, or the number of features for each classifier. Another option is to modify the definition of the classes. The results suggest that the algorithm is, in general, suitable for classification.

6.4 Discussion

The presented text mining solution extracts textual features from the context of residue entities. The identification of the contextual features, and their association with the residue entity, is based on the syntactical analysis of the sentence. More specifically, only the subset of semantic relations that are found in verbal and prepositional relations is extracted from text. The advantage of this approach is that not only the semantic relation partners and the semantic relation type are found, but contextual information is extracted as well. Two approaches to syntactical analysis were compared in this study, i.e. shallow parsing and full parsing, and the results indicate that the ENJU parser performed more weakly than the developed shallow parser. Manual analysis of the false positives indicates that incorrectly determined syntactical structures originate from false part-of-speech tagging. For example, in the sentence

“Conversely, K382Q displays a highly altered responsiveness to the activator, suggesting that Lys(382) is involved in both activator binding and allosteric transition mechanism.” (PMID:10751408),

both parsers identified “altered” as a verb in the past tense, although the correct POS is a noun modifier. The performance of the POS tagger is critical for the detection of phrase boundaries. However, the two parsers rely on different methods for POS tagging, so the performance of the POS tagger has to be considered as well when comparing the shallow and the full parser. Table A.1 lists some examples where a parser failed to extract the annotated PAS data from GC.

The extracted information is difficult to normalise, because there is no gold standard for how to represent the association and how to qualify the contextual information. In this work, the predicate-argument structure is used as a template for the extracted information. Although verb frame sets from PropBank or PASBio can be used to normalise the extracted data, they are not designed to capture descriptions of protein residue function. On the other hand, this gives the extraction method the advantage of being able to discover new knowledge. Because the extracted information is not normalised, the performance can only be measured in terms of sensitivity.

The evaluation of the classification method indicates that the presented approach can provide an automatic solution for text interpretation. However, some of the categories have only a few examples, which is reflected in the weak performance of the corresponding classifiers. One solution to this problem is to balance the example sets of the categories, for example by collecting more terms from MEDLINE. Alternatively, other categories may be defined to balance the ratio between a category and the associated set of examples. Yet another approach is not to classify the arguments of a PAS, but to cluster them, for example based on their contextual usage. The advantage here is that more similarities among the PAS data can be found, because the analysis no longer depends on how representative a training (reference) set is.

Although semantic labels can be assigned to the arguments in a PAS, the developed method is not able to interpret the meaning of the whole extracted text segment. For example, in the sentence

“Specific binding of the WT and mutant receptors Cys14Ala and Cys199Ala was inhibited in the presence of the disulfide bond reducing agent, DTT, implying that disulfide bonds are formed and can be reduced in these mutant receptors.” (PMID:9202220),

the following information was extracted and semantic categories were assigned to the arguments of the PAS

pred = inhibited
arg1 = Specific binding
arg1-of = [the WT and mutant receptors CYS14 ALA and CYS199 ALA]/CHEM MOD
arg2-in = the presence
arg2-of = the disulfide bond reducing agent.

Although one part of the information in the example has been correctly assigned with the label CHEM MOD, the entire text phrase should be labelled with BINDING. A solution to this problem is not trivial and requires several levels of linguistic analysis.

6.5 Conclusion

In this chapter, I have presented the developed contextual feature extraction system for the annotation of residue entities. Because a suitable terminological resource is not available, the identification of functional annotation is based on the extraction of syntactical relations between a residue entity and a noun phrase. The developed method allows the discovery of novel information that can provide key information for functional annotation. In the next chapter, I will demonstrate the validity of the extracted information as functional annotation of protein residues.

Chapter 7

Extraction of functional annotation for protein residues from MEDLINE

In the previous two chapters, two fundamental text mining components for functional annotation extraction were presented. In this chapter, I present the results of the combined extraction and assess the performance of the combined system. The objective of this study is to determine the qualitative and quantitative distribution of information in MEDLINE. Because the information is derived solely from biomedical abstracts, it is necessary to examine the data in terms of validity, novelty, and biological significance.

In the first part of the evaluation, the performance of the functional annotation extraction is studied on the gold standard corpus. Then the biological significance of the data extracted from MEDLINE is studied on two example proteins, the suppressor protein p53 and the Janus kinase 2 protein. Finally, the distribution of information is examined by two specific analyses: the cross-validation of identified active site residues with CSA, and the cross-validation of binding residues with MSDsite.

7.1 Evaluation methods

The evaluation of the functional annotation extraction system was based on the performance analysis of its extraction components: protein residue identification and contextual feature extraction (cf. section 5.3 and section 6.2).

The biological validity of the mined functional annotations was assessed by manual analysis. For each protein residue, the set of extracted annotations was reviewed and grouped by similar topics. Because the set of annotations for a protein residue can be very large, random samples were drawn from a list of annotations sorted by residue name and position. The result is a set of sample annotations for each extracted residue of a protein. The information was compared with the corresponding annotations in UniProtKB.

Catalytic residues were validated by cross-validation with CSA [PBT04]. The analysis was performed on three levels: the comparison of protein residues identified in MEDLINE with CSA, the comparison of residues with extracted functional annotations, and the comparison of residues with extracted annotations classified as ENZ ACT (cf. section 6.1.2). The residues were compared by using the combination of the identifiers RID+UID (cf. section 5.3). Binding residues extracted from MEDLINE were validated accordingly, with the third level of validation comparing residues with extracted annotations classified as BINDING.

7.2 Results

7.2.1 Evaluation of the developed functional annotation extraction system

The presented functional annotation extraction system consists of two basic modules: identification of protein residues, and contextual feature extraction. The following describes an analysis of the overall performance of the combined text mining system. The test set is the gold standard corpus (GC; cf. section 5.2). The evaluation was done in two respects: manual validation of extracted information, and cross-validation with UniProtKB annotations.

Manual validation of extracted information. The gold standard corpus consists of 100 abstracts with tri-occurrences of the triplet protein, residue and organism. However, manual analysis identified only 51 abstracts with residue entities that can be associated with their proteins and host organisms. The number of associations (OPR) is 172. This represents the target for protein residue identification. Corresponding to these OPRs is the set of functional annotations (PAS data). For 109 out of 172 OPRs, keywords were co-mentioned in verbal relations. The number of PAS associated with the 109 OPRs is 117. This represents the target of functional annotation extraction.

Figure 7.1 summarises the performance of the functional annotation extraction. With a previously determined precision of 0.82 and a recall of 0.38, the protein residue identification module detects 79 OPRs, of which 65 are correct. Contextual feature extraction for these 65 protein residues resulted in 35 PAS. In comparison with the 117 annotated PAS of the 109 OPRs, only 16 out of the 35 extracted PAS are true positives. However, the total number of extracted PAS is 46, which results in a precision of 0.35 and a recall of 0.13. A systematic analysis revealed that the rate of false positives

PAS data
Dataset   Available   Extracted   Common   Precision   Recall   F1
GC        117         46          16       0.35        0.13     0.25

Figure 7.1: Performance evaluation of the functional annotation extraction system. The performance is dependent on the two combined text mining modules: protein residue identification; and contextual feature extraction. The performance was measured in terms of precision, recall, and F1 measure

has the following sources: a false positive OPR with extracted PAS, a true positive OPR with no annotated PAS, and a true positive OPR with a false positive PAS. In comparison, if the system had identified all protein residues correctly, the whole extraction would have yielded a precision of 0.68 and a recall of 0.48 (cf. section 6.3). Considering that the presented text mining solution is a pilot approach to extract functional annotations for the validation of predicted functional sites, the result is good for this area and comparable to first studies in BioCreAtIvE or the Critical Assessment of Techniques for Protein Structure Prediction (CASP). The recall can be explained by the performance of the contextual feature extraction module.

The result indicates that the extracted functional annotations have a reasonable precision in this first attempt at functional annotation extraction, but are low in coverage. This can be explained by the combined performance of the text mining modules. On the one hand, an incorrectly determined protein residue leads to a false positive PAS. On the other hand, a failed entity recognition contributes to the false negative rate. In addition, language complexity and incorrectly parsed sentences are further reasons for the false positive and false negative rates of functional annotation extraction.

In conclusion, the presented functional annotation extraction system delivers precise information, but has a low coverage of extraction. However, in the context of the bioinformatics work of this thesis, a precision-driven extraction system is preferred over a recall-oriented text mining solution.
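The precision, recall, and F1 figures discussed above follow from three counts per evaluation: the number of available (gold standard) items, the number of extracted items, and the number common to both. The following is a minimal sketch of that arithmetic, using the counts quoted in the text; the class and attribute names are illustrative assumptions and are not taken from the thesis software. Note that the harmonic-mean F1 computed from the PAS counts is about 0.20, so the F1 value reported in figure 7.1 may have been derived differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionCounts:
    """Counts for one evaluation (names are illustrative assumptions)."""
    available: int  # items in the gold standard (the target)
    extracted: int  # items returned by the system
    common: int     # extracted items that are also in the gold standard

    @property
    def precision(self) -> float:
        return self.common / self.extracted if self.extracted else 0.0

    @property
    def recall(self) -> float:
        return self.common / self.available if self.available else 0.0

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


# Counts quoted in the text for the gold standard corpus (GC).
residues = ExtractionCounts(available=172, extracted=79, common=65)  # OPR
pas = ExtractionCounts(available=117, extracted=46, common=16)       # PAS

for name, counts in (("OPR", residues), ("PAS", pas)):
    print(f"{name}: P={counts.precision:.3f} "
          f"R={counts.recall:.3f} F1={counts.f1:.3f}")
```

Working from raw counts rather than rounded rates avoids compounding rounding errors when the two modules are evaluated in sequence.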

Cross-validation with UniProtKB functional annotations. Despite the low coverage of the functional annotation extraction system, the extracted information is correct and reusable for the annotation of protein residues. Table B.1 lists the 16 verified PAS data, corresponding to 17 verified protein residues. A comparison with UniProtKB shows that 5 out of 16 are rediscovered knowledge. The remaining 11 out of 16 contain novel information that can be used to update the protein knowledge base.

The extraction of functional annotations is a multi-step system. Although the performance of each module may not be at an optimal level, the results demonstrate that functional annotations are available and extractable from MEDLINE.

7.2.2 Studying mined functional annotations for the proteins p53 and Jak2

UniProtKB curates functional annotations for proteins on three levels: protein level, protein domain level, and protein residue level. The objective in this section is to study the validity and novelty of mined functional annotations from whole MEDLINE extraction. The result provides an indication of the biological significance for automatic extraction from MEDLINE. The annotations of two example proteins, p53 and Jak2, are analysed and compared with relevant information from UniProtKB.

Tumour suppressor protein p53. p53 plays a critical role in preventing human cancer formation. In the native state, the protein assembles to a tetrameric phosphoprotein. It consists of four functional domains: (1) the proline-rich, acidic, N-terminus, which is involved in transcriptional activation, e.g. Mdm2 binding; (2) the central core, which binds DNA; (3) the oligomerisation domain with nuclear localisation signals, which allows the transfer into the nucleus; and (4) the C-terminus, which regulates DNA-binding [SYH+03].

The extraction of functional annotations from MEDLINE for the human tumour protein p53 resulted in 1,665 PAS data. A manual analysis of samples of mined functional annotations indicates that there are two main topics: the regulatory post-translational modification, and the binding activity of residues, where in some cases the interaction partner is also stated. Table C.1 lists example annotations grouped by similar topics. For 5 out of 6 of the identified residues with post-translational modification, i.e. THR18, SER46, SER15, THR55, and SER315, the extracted information is similar to the annotations in the UniProtKB entry. The remaining residue, SER6, has no annotation in UniProtKB.

The knowledge base does not provide further information on the biological implication of these residues, while the extracted data contain more contextual information. For example:

”[...]ATM-mediated phosphorylation of the ser15 site of p53[...]” (PMID:14757188),

”[...]Ser46 phosphorylation activates p53-dependent apoptosis[...]” (PMID:17172844).

The analysis also found annotations for some critical residues that are not recorded in UniProtKB. For example:

”[...]the amino acid change C135R generates the loss of TP53 DNA-binding activity[...]” (PMID:17914575),

”[...]R248W abolish the association with p63[...]” (PMID:11172034).

The activity of p53 is thought to be regulated through a number of post-translational modifications at the N- and C-terminal regions. Review articles report that seven serines (SER6, SER9, SER15, SER20, SER33, SER37, and SER46) and two threonines (THR18, and THR81) in the N-terminal domain are modified by kinases upon exposure of cells to ionising radiation or UV light. The analysis shows that MEDLINE extraction can recover this information for the residues SER6, SER15, SER46, and THR18.

Janus Kinase 2 (Jak2). Jak2 plays a crucial part in various growth factor and cytokine signalling pathways. Similar to other protein tyrosine kinases of the Janus kinase family, Jak2 consists of a tyrosine kinase domain and a tyrosine kinase-like domain. It is thought that the kinase-like domain can negatively regulate the kinase domain.

The set of extracted functional annotations for Jak2 comprises 624 PAS data and contains information on only seven residues: L539 (1 annotation), W515 (1 annotation), K607 (2 annotations), V617 (630 annotations), F617 (5 annotations; a reported variant associated with Budd-Chiari syndrome), V678 (3 annotations), and D816 (1 annotation). A comparison with UniProtKB data shows that the extracted information for F617, K607, and L539 is similar to the annotations in the database. These and other annotations for D816, V678, and W515 describe mutation events (data not shown).

In order to assess the extracted information on V617, random samples were selected and studied manually. The result of the analysis indicates that the set of annotations contains a lot of redundant information. The data can be grouped into two main topics: disease, and genetic origin. Table D.1 lists some examples of extracted functional annotations.

The effect of mutating residue 617 on cellular function, and its association with particular diseases, has already been reported, but none of the extracted annotations provides a molecular explanation. A survey of research publications on Jak2 revealed that myeloid and lymphoid malignancies are associated with Jak2 V617F. It is proposed that the mutation of residue 617 destabilises the interaction between the kinase and kinase-like domains, and thereby promotes activation of kinase activity [POHS05]. These results suggest that the extracted information reflects pieces of evidence; however, their biological relations may not be available in the mined output or even in MEDLINE.

In summary, the study of the mined functional annotations of residues for the two proteins presented here indicates that MEDLINE contains information that recurs in a number of abstract texts. Despite the data redundancy, some functional annotations are not contained in UniProtKB, indicating that MEDLINE extraction still yields original information.

7.2.3 Cross-validation of mined catalytic residues with CSA

In the previous section, functional annotations were extracted from MEDLINE, and for a range of annotations, the contained information was analysed for its biological validity and novelty. This section focuses on enzyme-related information in the extracted annotations. The objective is to study how reliable the extracted information is for the validation of catalytic residues. The identified residues with these associated annotations are compared with CSA. Figure 7.2 summarises the result of this analysis.

The CSA lists 12,971 protein residues (RID+UID), of which 799 were identified in MEDLINE. The 12,172 CSA residues that were not found can be explained by the performance of the identification system (cf. section 5.4). Another explanation is that CSA is curated from full-text publications, and the same information may not be available in MEDLINE.

By selecting residues with extracted functional annotations from MEDLINE, 691 out of 799 protein residues were retained. This result indicates that many functional descriptions are available as contextual features of the identified protein residues. The result is consistent with previous performance evaluation studies (cf. section 6.4). With a precision of 0.43 and recall of 0.20, the classifier for the category ENZ ACT (cf. section 6.3) identified enzyme-related functional annotations for 77 out of 691 protein residues. Manual analysis shows that this reduction can be explained by the classifier’s performance. Another explanation is the absence of relevant contextual cues in the extracted text.

A search for the term ”catalytic triad” in the sentences of the identified protein residues yielded a sub-selection of 221 out of 46,750 residues. A comparison with CSA shows that 44 out of 221 are re-discoveries of active site residues. The annotations for the remaining 177 may contain supporting evidence to identify the residues as catalytic. A systematic analysis of these predicted catalytic residues should start with the 27 out of 177 residues which have annotations classified as ENZ ACT.

In conclusion, the developed text mining system rediscovers active site residues by solely mining abstract text from MEDLINE. While the rate of false positives is not known, the extraction identified 1,391 protein residues with enzyme-related functional annotations. The significance of these potentially new CSA residues is being studied further in ongoing work.

Figure 7.2: Cross-validation of text mined catalytic residues with CSA. The analysis was done based on the comparison of the determined RID+UID pairs. The numbers reflect the determined RID+UID pairs. RID = Residue identifier; UID = Uniprot identifier.

Figure 7.3: Cross-validation of text mined binding residues with MSDsite. Annotation was studied on the level of using solely the mentioned protein residue, the residue with PAS data, and residue with information on binding. The number indicates the counted RID+UID pairs in the data. RID = Residue identifier; UID = Uniprot identifier.

7.2.4 Annotation of protein residues in MSDsite

The MSDsite [GDO+05] holds a number of predicted ligand binding sites, by automatically analysing ligand contacting residues in the PDB. The objective in this section is to analyse how many of these binding residues can be annotated from mining MEDLINE.

The analysis shows that 512 out of the 46,750 identified protein residues in MEDLINE are also contained in MSDsite (cf. figure 7.3). A large proportion of these residues are associated with PAS data (429 out of 512), while only a small subset of 12 have information classified as BINDING. Manual analysis shows that all of these 12 annotations are correct. They can be used to validate the predicted ligand binding residues in MSDsite (table E.1). For the remaining 417 of the 429 residues with PAS data, the associated PAS data may still contain valid information for the annotation. However, a systematic analysis was not performed at this stage of the study.

In summary, a relatively small set of protein residues recovered from MEDLINE extraction can be used for the annotation of MSDsite entries.

7.3 Discussion

The extraction of functional annotations is a multi-step process, and the quality of the result has to be interpreted in the context of each subprocess’ performance. Although the performance of each extraction module may not be at an optimal level, the evaluation results indicate that the mined output contains biologically meaningful data. Considering that the validation of a predicted function requires evidence of biological function, the developed text mining system can become a valuable tool, for example for the protein function prediction assessment in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) [LRTV07]. With the improvement of the information extraction modules, the mined functional annotations are expected to become more reliable.

The biological relevance of the extracted functional annotations was demonstrated on two different proteins, p53 and Jak2. The results show that information in UniProtKB can be rediscovered from MEDLINE, and that novel information can be extracted as well. These functional annotations can be considered to complement existing annotations in UniProtKB. However, manual analysis of subsets of the extracted annotations indicates that the information is represented redundantly in MEDLINE. One major reason is that biological facts are expressed repeatedly within the biological community.

The study of identifying catalytic residues and binding residues from the mined functional annotations, and the cross-validation with CSA and MSDsite, shows that the developed text mining solution is able to find relevant data in MEDLINE. Although the developed classifiers have a weak performance, it is not clear whether this completely explains the cross-validation results. It is possible that key information that would identify the biological role of the protein residues is not mentioned in abstract texts. Another explanation is the protein residue identification performance, which was evaluated with a low recall score.

Although abstract texts cover only a subset of the information in full-text articles, and information is represented repeatedly in MEDLINE, this study shows that the text mined information is biologically valid and contains snippets of additional information that are relevant for UniProtKB. For example, the extracted annotations complement existing information in UniProtKB and provide first data on functional sites in proteins that are not yet curated.

7.4 Conclusion

In this chapter, two text mining components were combined to form the functional annotation extraction system. Performance analysis shows that the system is precise, but has a low coverage. However, the low recall is compensated by the fact that information is distributed redundantly. The extracted information is biologically valid, and contains some novel data, which can be used to update UniProtKB. So far, functional annotations of residues have been evaluated in isolation, i.e. independent of the structural context in proteins. In the following chapter, a biological context is created by combining functional annotations with protein structure data (cf. chapter 3 and chapter 4).

Chapter 8

Combining active site prediction with mined functional annotations

The goal of this thesis is to combine information from two disjoint information resources. To this end, various methodologies were developed for the prediction of functional sites in proteins, and for the extraction of relevant information for the functional annotation of protein residues from scientific articles. More specifically, a predicted functional site can be validated by a set of functional annotations of protein residues. Conversely, a set of functional annotations requires a structural context to understand the molecular mechanism of a protein function.

In the previous chapters, I have presented the results on 3D pattern mining from PDB (cf. chapter 3) and functional annotation extraction from MEDLINE (cf. chapters 5, 6, and 7). Here, the produced datasets are combined and analysed. The objective in this chapter is to validate predicted active sites that the data mining output may contain, by combining them with specific functional annotations extracted from MEDLINE. The result is compared with data from CSA.

Figure 8.1: Overview of processes and evaluation methods of combining the protein structure dataset and literature dataset.

8.1 Algorithms

8.1.1 Combining protein structure data with literature data

Theory

The method to combine PDB with MEDLINE data, i.e. the functional annotation of a residue from a protein structure, is based on the combination of two identifiers: RID+UID (cf. section 5.3). There are two major subtasks to combine the datasets (cf. figure 8.1): linking PDB entries to a Uniprot entry, and associating a residue with its co-mentioned protein in text.

Mapping residues in PDB to UniProtKB. The mapping between PDB and UniProtKB, and the inherited mapping of a protein residue from a PDB entry to its UniProtKB sequence index, is a non-trivial task. One problem is that the author of a determined protein structure may have used an arbitrary residue index system that is not in accordance with the wild-type protein sequence. Furthermore, residues in a protein deletion mutant may have been numbered sequentially, irrespective of sequence gaps. Another problem is that UniProtKB may not have the corresponding protein sequence for a crystallised protein, which may be, for example, a novel splice variant.

In some cases, cross-links from PDB to UniProtKB, or from UniProtKB to PDB, are available. However, over time the links may have become outdated. In order to find the correct mapping between the protein residue indices in both databases, an exhaustive sequence alignment is required. Various solutions and services have been provided for the periodic update of UniProtKB-PDB mappings [VMMR+05][Mar05][VZHC05][MSD08]. Here, I reuse a previously published lookup table file [Mar05] for the mapping of protein residues in PDB to UniProtKB. Note that the lookup table is based on the alignment analysis work of the Macromolecular Structure Database (MSD) group at the European Bioinformatics Institute [MSD08].

Mapping protein residue in text to UniProtKB. The mapping of a residue entity in text to its co-mentioned protein, and ultimately the mapping to UniProtKB, is explained in section 5.1.

Implementation

The correct sequence index mapping of a PDB entry to its corresponding UniProtKB entry was based on the lookup table produced by [Mar05] (version October 2008). An example of the lookup table data is shown in figure 8.2. The combination of the following keys was used to unambiguously map a residue from PDB to its UniProtKB native sequence position: PDBID + chainID + RID.

PDB                                          UniProtKB
PDBID  chainID  serial  resName  resSeq      UID          resName  seqIndex
11gs   B        1       PRO      2           GSTP1_HUMAN  P        3
11gs   B        2       TYR      3           GSTP1_HUMAN  Y        4
11gs   B        3       THR      4           GSTP1_HUMAN  T        5
11gs   B        4       VAL      5           GSTP1_HUMAN  V        6
11gs   B        5       VAL      6           GSTP1_HUMAN  V        7
11gs   B        6       TYR      7           GSTP1_HUMAN  Y        8
11gs   B        7       PHE      8           GSTP1_HUMAN  F        9
11gs   B        8       PRO      9           GSTP1_HUMAN  P        10
11gs   B        9       VAL      10          GSTP1_HUMAN  V        11
11gs   B        10      ARG      11          GSTP1_HUMAN  R        12

Figure 8.2: Lookup table for PDB/UniProtKB mapping. Excerpt of the lookup table to map protein residues from a PDB entry to the corresponding UniProtKB entry.
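The key-based mapping described above can be sketched as a dictionary lookup. The example below is a minimal illustration and not the thesis implementation: it assumes a whitespace-separated file layout with the columns of figure 8.2, renames the second resName column to uniResName to keep the header unique, assumes that the PDB-side RID is the three-letter residue name joined with resSeq, and writes the UID in the underscore form of UniProtKB entry names. All function and type names are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Tuple

# Three rows from figure 8.2; the whitespace-separated layout is an assumption.
EXCERPT = [
    "PDBID chainID serial resName resSeq UID uniResName seqIndex",
    "11gs B 1 PRO 2 GSTP1_HUMAN P 3",
    "11gs B 2 TYR 3 GSTP1_HUMAN Y 4",
    "11gs B 3 THR 4 GSTP1_HUMAN T 5",
]


class UniProtResidue(NamedTuple):
    uid: str        # UniProtKB entry name
    res_name: str   # one-letter residue code in the UniProtKB sequence
    seq_index: int  # position in the UniProtKB sequence


LookupKey = Tuple[str, str, str]  # (PDBID, chainID, RID)


def load_lookup(lines: Iterable[str]) -> Dict[LookupKey, UniProtResidue]:
    """Index the lookup table by PDBID + chainID + RID."""
    header, *rows = [line.split() for line in lines if line.strip()]
    table: Dict[LookupKey, UniProtResidue] = {}
    for fields in rows:
        row = dict(zip(header, fields))
        rid = f"{row['resName']}{row['resSeq']}"  # assumed RID form, e.g. PRO2
        key = (row["PDBID"], row["chainID"], rid)
        table[key] = UniProtResidue(
            row["UID"], row["uniResName"], int(row["seqIndex"])
        )
    return table


def to_rid_uid(table: Dict[LookupKey, UniProtResidue],
               pdb_id: str, chain_id: str, rid: str) -> Optional[Tuple[str, str]]:
    """Return the (RID, UID) pair in UniProtKB numbering, or None if unmapped."""
    hit = table.get((pdb_id, chain_id, rid))
    if hit is None:
        return None
    return (f"{hit.res_name}{hit.seq_index}", hit.uid)


lookup = load_lookup(EXCERPT)
print(to_rid_uid(lookup, "11gs", "B", "PRO2"))
```

Returning None for unmapped residues makes it explicit which PDB residues drop out of the later set analyses because no UniProtKB sequence index is available for them.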

8.2 Evaluation methods

The validation of identified catalytic residues was done by manual examination of the functional descriptions of annotated protein residues. Within this analysis, six datasets were used (cf. section 7.2): CSA is the set of active site residues from the Catalytic Site Atlas [PBT04]; OLDFIELD is the set of residues in the non-redundant structure set from [Old02]; PATTERN is the set of residues from the data mined 3D patterns; OPR is the set of protein residues identified from MEDLINE extraction; FA is the subset of OPR with functional annotations extracted from MEDLINE; and ENZ is the subset of FA where the contained information is classified as ENZ ACT, i.e. the information is enzyme-related.

8.3 Results

8.3.1 Protein residue mapping between three data resources

This section gives an overview of the analysed datasets. Figure 8.3 summarises the data. OLDFIELD contains in total 341,365 protein residues, counted as RID+PDBID. 328,796 out of 341,365 residues are found in the lookup table, which corresponds to 280,521 RID+UID. In parallel, the residues from the mined 3D pattern set (PATTERN) were

Figure 8.3: Overview of the combined datasets from protein structure data and biomedical literature data. The combined dataset is analysed to identify active site residues. CSA = active site database; OPR = identified protein residues; PAS = contextual feature assigned to a protein residue; ENZ = contextual feature with enzyme-related information; OLDFIELD = protein structure subset from PDB; PATTERN = data mined structural features from OLDFIELD.

mapped to 24,500 RID+UID. The identification of protein residues in MEDLINE found a total of 132,476 RID+UID with a unique count of 46,750 RID+UID. This dataset is referred to as OPR. 36,569 out of 46,750 protein residues have functional annotations (FA), while a further subset of 1,467 out of 36,569 have annotations classified as ENZ ACT (ENZ). A set analysis between OLDFIELD and OPR determined 2,402 common protein residues, of which 197 are also listed in CSA.

In summary, for a large fraction of protein residues in OLDFIELD, a mapping to UniProtKB sequence indices is available. However, only 2,402 are recovered from MEDLINE extraction and can be used for validation.
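Once every dataset is expressed as a set of RID+UID pairs, the set analysis described above reduces to intersections and differences. The sketch below illustrates this on toy data; the identifiers (for example PROT1_TEST) are hypothetical placeholders rather than real dataset members, and the function name is an assumption.

```python
from typing import Dict, Set, Tuple

ResidueKey = Tuple[str, str]  # (RID, UID)


def set_analysis(oldfield: Set[ResidueKey],
                 opr: Set[ResidueKey],
                 csa: Set[ResidueKey]) -> Dict[str, Set[ResidueKey]]:
    """Split the residues shared by structure and literature data by CSA status."""
    common = oldfield & opr  # residues found in both structure and text
    known = common & csa     # rediscovered catalytic residues
    return {"common": common, "known": known, "candidates": common - csa}


# Toy sets; membership is illustrative only.
oldfield = {("A10", "PROT1_TEST"), ("H57", "PROT1_TEST"),
            ("S195", "PROT1_TEST"), ("K3", "PROT2_TEST")}
opr = {("H57", "PROT1_TEST"), ("S195", "PROT1_TEST"),
       ("K3", "PROT2_TEST"), ("V617", "PROT3_TEST")}
csa = {("H57", "PROT1_TEST"), ("S195", "PROT1_TEST")}

result = set_analysis(oldfield, opr, csa)
for name, members in result.items():
    print(name, len(members))
```

With the counts reported in the text, the same partition gives 2,402 common residues, 197 of them listed in CSA, which leaves 2,402 - 197 = 2,205 residues outside CSA.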

8.3.2 Rediscovery of active sites and catalytic residues

The identification of catalytic residues from protein structure data mining and from biomedical literature mining was studied previously (cf. sections 4.2 and 7.2). Each result was evaluated by cross-validation with CSA. This section studies the validation of predicted active sites from the combined datasets.

Previously, three structural patterns were identified as active sites by cross-validation with CSA (cf. chapter 4). One of the patterns represents the well-known catalytic triad. This pattern was found in 19 proteins within the dataset (cf. section 4.2). Associated with these 19 proteins is a set of 57 protein residues. The analysis shows that only 3 out of 57 residues were identified in MEDLINE. The 3 residues identified in text belong to the same protein, bovine chymotrypsinogen (cf. table 8.1). The associated functional annotations for the residues ASP102 and HIS57 were not classified as ENZ ACT. The information contained in these annotations only indirectly indicates the catalytic property of these residues; the annotations do not mention them as part of the catalytic triad. In conclusion, the structure-based prediction of this active site was not validated by literature data.

The intersection of PATTERN, OPR, and CSA results in a set of 15 protein residues.

RID+UID: S195 CTRA_BOVIN; D102 CTRA_BOVIN; H57 CTRA_BOVIN
Sentence: ”These include the NH2-terminal four residues, the sequences near histidine-57 (chymotrypsinogen A numbering system), aspartic acid-102, aspartic acid-189, and serine-195, the regions of the three disulfide bridges, and the COOH-terminal end (residues 225-229) of the proteins. When aligned to maximize homology the identity of residues is 34%.” (PMID:804314)
PAS: N/A

RID+UID: D102 CTRA_BOVIN; H57 CTRA_BOVIN
Sentence: ”In bovine chymotrypsinogen A in 2H2O at 31 degrees C, histidine-57 has a pK’ of 7.3 and aspartate-102 a pK’ of 1.4, and the histidine-40-aspartate-194 system exhibits inflections at pH 4.6 and 2.3.” (PMID:31898)
PAS: pred = has arg1 = HIS57 arg2 = a pK arg2-of = 7.3 and ASP102 a pK arg2-of = 1.4

RID+UID: D102 CTRA_BOVIN
Sentence: ”In bovine chymotrypsin Aalpha under the same conditions, the histidine-57-aspartate-102 system has pK’ values of 6.1 and 2.8, and histidine-40 has a pK’ of 7.2.” (PMID:31898)
PAS: pred = have arg1 = the HIS57 ASP102 system arg2 = pK values arg2-of = 6.1 and 2.8

RID+UID: D102 CTRA_BOVIN; H57 CTRA_BOVIN
Sentence: ”The results suggest that the pK’ of histidine-57 is higher than the pK’ of aspartate-102 in both zymogen and enzyme.” (PMID:31898)
PAS: pred = is arg1 = that the pK arg1-of = HIS57 arg2 = higher than the pK arg2-of = ASP102 arg2-in = both zymogen and enzyme

RID+UID: H57 CTRA_BOVIN
Sentence: ”The 1H NMR chemical shift of the Cepsilon1 H of histidine-57 in the chymotrypsin Aalpha-pancreatic trypsin inhibitor (Kunitz) complex is constant between pH 3 and 9 at a value similar to that of histidine-57 in the porcine trypsin-pancreatic trypsin inhibitor complex [Markley, J.L., and Porubcan, M. A. (1976), J. Mol. Biol. 102, 487–509], suggesting that the mechanisms of interaction are similar in the two complexes.” (PMID:31898)
PAS: pred = is arg1 = complex arg2 = constant arg2-between = pH 3 and 9 arg2-at = a value similar arg2-to = that arg2-of = HIS57 arg2-in = the porcine trypsin-pancreatic trypsin inhibitor complex

Table 8.1: Extracted MEDLINE information on the catalytic residues in bovine chymotrypsinogen. Owing to the performance of the functional annotation extraction system and the availability of information in MEDLINE, only little information was extracted. The mined information on the active site residues mentions their catalytic properties only indirectly.

RID+UID: C32 THIO_HUMAN; C35 THIO_HUMAN
Sentence: ”A hydrogen bond between the sulfhydryls of Cys32 and Cys35 may reduce the pKa of Cys32 and this pKa depression probably results in increased nucleophilicity of the Cys32 thiolate group.” (PMID:8805557)
PAS: pred = reduce arg1 = A hydrogen bond arg1-between = the sulfhydryls arg1-of = CYS32 and CYS35 arg2 = the pKa arg2-of = [CYS32 and this pKa depression]/ENZ ACT

RID+UID: C215 PTN1_HUMAN
Sentence: ”The structure of the catalytically inactive mutant (C215S) of the human protein-tyrosine phosphatase 1B (PTP1B) has been solved to high resolution in two complexes.” (PMID:9391040)
PAS: pred = solved arg1 = [inactive mutant (C215S)]/ENZ ACT arg1-of = the human protein-tyrosine phosphatase 1B (PTP1B) arg2 = unk arg2-to = to high resolution arg2-in = in two complexes

Table 8.2: Identified catalytic residues from MEDLINE extraction. The mined functional annotations were classified as enzyme-related, suggesting that the corresponding protein residue has some catalytic properties. The identified residues were also cross-validated by CSA; however, the mined 3D patterns containing these residues were not validated as active sites by the database.

The analysis shows that only 3 out of 15 protein residues have enzyme-related annotations. 2 out of 3 residues belong to the protein human thioredoxin (cf. table 8.2). However, none of the mined 3D patterns can provide a structural context for the identified catalytic residues. A manual analysis of the remaining 12 out of 15 residues shows that some of the associated annotations were not correctly classified as enzyme-related, which can be explained by the performance of the classifier (cf. section 6.3).

For 16 out of 197 protein residues, i.e. the intersection between OLDFIELD, OPR, and CSA, the term ”catalytic triad” is co-mentioned within sentences. While none of the 16 residues is associated with a mined 3D pattern, 6 out of 16 residues have enzyme-related functional annotations (cf. table 8.3).

In conclusion, the results of this study indicate that the coverage of relevant information to validate predicted active sites is too low. Some of the enzyme-related annotations are biologically valid, but have no correlation with a 3D pattern.

RID+UID: S80 HNL_HEVBR; D207 HNL_HEVBR; H235 HNL_HEVBR
Sentence: ”Our results yielded further support for an enzymatic mechanism involving the catalytic triad Ser80, His235, and Asp207 as a general acid/base.” (PMID:11354003)
PAS: pred = involving arg1 = further support arg1-for = for an enzymatic mechanism arg2 = [the catalytic triad SER80, HIS235, and ASP207]/ENZ ACT

RID+UID: E132 LINB_PSEPA; D108 LINB_PSEPA; H272 LINB_PSEPA
Sentence: ”The enzyme belongs to the alpha/beta hydrolase family and contains a catalytic triad (Asp108, His272, and Glu132) in the lipase-like topological arrangement previously proposed from mutagenesis experiments.” (PMID:11087355)
PAS: pred = contains arg1 = unk arg1-to = the alpha/beta hydrolase family and arg2 = [a catalytic triad (ASP108, HIS272, and GLU132)]/ENZ ACT

Table 8.3: Catalytic triad residues available from the mined functional annotations. The active site residues were identified by a search for the term ”catalytic triad” in the mined functional annotation data. The validity was also confirmed by comparison with CSA.

8.3.3 Search for novel catalytic residues

In the previous section, the combined dataset was evaluated by cross-validation with CSA. Thus the identified catalytic residues represent only re-discoveries of known data. The goal in this section is to search for novel catalytic residues by combining enzyme-related annotations with mined 3D patterns.

A set analysis between CSA, OLDFIELD, and OPR revealed that 2,205 residues are included in OLDFIELD and OPR, but not in CSA (cf. figure 8.3). A search for the term ”catalytic triad” in sentences of these 2,205 identified residues resulted in a subselection of 24 residues. The analysis shows that none of the 24 residues was found in the mined 3D patterns. However, 15 out of 24 residues have enzyme-related annotations (cf. table F.1), suggesting they are catalytic residues. A manual analysis determined that the annotations contain valid evidence to identify the residues as catalytic.

The result of this study indicates that MEDLINE extraction can find some additional catalytic residues that are not represented in CSA. However, a correlation with the mined 3D patterns was not found, and functional annotations were not interpreted in a structural context.
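The two filters used in this search, a term match on the source sentence and a check for an enzyme-related classification, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis code: the record type, the function names, and the case-insensitive substring match are hypothetical, and the two sentences are taken from tables 8.3 and 8.1 purely to exercise the filters.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set, Tuple

ResidueKey = Tuple[str, str]  # (RID, UID)
ENZ = "ENZ ACT"  # category label as printed in the text


@dataclass(frozen=True)
class Annotation:
    """One mined annotation for one residue (hypothetical record layout)."""
    residue: ResidueKey
    sentence: str
    labels: FrozenSet[str] = frozenset()


def residues_with_term(annotations: Iterable[Annotation],
                       term: str) -> Set[ResidueKey]:
    """Residues whose source sentence mentions the term (case-insensitive)."""
    needle = term.lower()
    return {a.residue for a in annotations if needle in a.sentence.lower()}


def residues_with_label(annotations: Iterable[Annotation],
                        label: str) -> Set[ResidueKey]:
    """Residues with at least one annotation carrying the given category."""
    return {a.residue for a in annotations if label in a.labels}


TRIAD_SENTENCE = (
    "Our results yielded further support for an enzymatic mechanism "
    "involving the catalytic triad Ser80, His235, and Asp207 as a "
    "general acid/base."
)
PK_SENTENCE = (
    "The results suggest that the pK' of histidine-57 is higher than "
    "the pK' of aspartate-102 in both zymogen and enzyme."
)

annotations = [
    Annotation(("S80", "HNL_HEVBR"), TRIAD_SENTENCE, frozenset({ENZ})),
    Annotation(("H235", "HNL_HEVBR"), TRIAD_SENTENCE, frozenset({ENZ})),
    Annotation(("H57", "CTRA_BOVIN"), PK_SENTENCE),
]

triad = residues_with_term(annotations, "catalytic triad")
prioritised = triad & residues_with_label(annotations, ENZ)
print(sorted(prioritised))
```

Keeping the two filters separate allows the term match to be reported on its own, as for the 24 residues above, before it is narrowed to the subset with enzyme-related annotations.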

8.3.4 General correlation found between predicted functional sites and extracted functional annotations

Previously, the validation of predicted active sites was studied by cross-validation of known catalytic residues. In this section a more general correlation analysis between structure and function data is studied. Because the coverage of extracted functional annotations of protein residues is too low to be useful to annotate the residues of the prediction, we cannot expect that all residues in one prediction are annotated with description of biological function. However, if a predicted functional site has some feature which point to a common concept of function, then this can be used to prioritise the prediction. Table 8.4 (left panel) shows the top 25 mined structural patterns which were ranked by the number of distinct residues with PAS data. In total 168 patterns have annotations ranging from one residue to a maximal of nine distinct residues with annotations. Another view is to take into consideration the number of annotated residues in context of the total number of residues in a prediction (cf. table 8.4, right panel). This gives an indication of how frequent a pattern is and how much do we know on each residue from the text mined data. The extraction of biological features from text for protein residues matches to a num- ber of various proteins, including homologues proteins. So far the annotation of residues in a predicted functional site considered only first level information (annotations for exact protein), however, the correlation analysis can also exploit information from homologous proteins (second level information). Based on the information from the Homology-derived Secondary Structure of proteins (HSSP) database [SS96], the annotation of the prediction was expanded by extracted information from homologues. The result of this study shows, that the number of residue annotation is increased by 10% (cf. table 8.5). 
A control analysis of how many residues in the non-redundant protein dataset OLDFIELD are identified in MEDLINE, and how many of these have an association with PAS data, indicates that the low recall of the developed text mining system is the reason for the weak annotation expansion. In conclusion, a general correlation between protein structure and function data is found in this study. The set of available annotations for protein residues is an indication of biological function for a predicted functional site. The biological significance of this result is being investigated further.

A = #residues with PAS; B = #residues in pattern

Left panel                              | Right panel
A  Pattern                 B    A/B     | A  Pattern                 B    A/B
6  9 10 16 CYS CYS PHE-1   12   0.5     | 4  10 11 11 ALA HIS HIS-1  6    0.6667
4  10 15 11 ASP HIS TRP-2  18   0.2222  | 4  9 15 11 GLN LEU TRP-2   6    0.6667
4  10 11 20 HIS MET PHE-1  12   0.3333  | 6  9 10 16 CYS CYS PHE-1   12   0.5
4  9 18 11 GLY MET TYR-1   12   0.3333  | 3  10 13 10 CYS PHE TYR-1  6    0.5
4  9 11 17 ALA LEU VAL-1   30   0.1333  | 4  10 11 20 HIS MET PHE-1  12   0.3333
4  8 9 10 CYS CYS HIS-1    12   0.3333  | 4  11 18 9 CYS ILE PHE-1   12   0.3333
4  11 8 18 HIS HIS SER-1   12   0.3333  | 4  11 8 18 HIS HIS SER-1   12   0.3333
4  11 18 9 CYS ILE PHE-1   12   0.3333  | 4  18 10 10 ASP CYS PHE-1  12   0.3333
4  11 11 12 HIS HIS MET-1  21   0.1905  | 4  19 11 10 ASP CYS ILE-1  12   0.3333
4  9 15 11 GLN LEU TRP-2   6    0.6667  | 4  20 9 11 ASP GLY MET-1   12   0.3333
4  10 15 11 ASP HIS TRP-1  15   0.2667  | 4  8 9 10 CYS CYS HIS-1    12   0.3333
4  10 11 11 ALA HIS HIS-1  6    0.6667  | 4  9 18 11 GLY MET TYR-1   12   0.3333
4  20 9 11 ASP GLY MET-1   12   0.3333  | 3  9 10 8 CYS HIS MET-1    9    0.3333
4  18 10 10 ASP CYS PHE-1  12   0.3333  | 2  11 13 9 ASN LYS SER-1   6    0.3333
4  19 11 10 ASP CYS ILE-1  12   0.3333  | 2  11 14 8 ALA ARG ASN-2   6    0.3333
4  11 14 7 ASP MET SER-1   18   0.2222  | 2  11 17 10 CYS PHE PRO-1  6    0.3333
4  9 17 10 ALA ILE PHE-1   18   0.2222  | 2  18 10 11 ARG GLU PRO-1  6    0.3333
3  9 10 8 CYS HIS MET-1    9    0.3333  | 2  19 9 11 ALA PRO TYR-1   6    0.3333
3  10 13 10 CYS PHE TYR-1  6    0.5     | 2  9 11 9 ASP CYS LYS-1    6    0.3333
3  21 11 10 CYS GLY VAL-1  21   0.1429  | 1  10 10 20 HIS PRO TYR-1  3    0.3333
3  11 9 9 ASP MET SER-1    15   0.2     | 1  10 12 11 ILE LEU PHE-1  3    0.3333
3  17 11 9 ALA LEU VAL-1   102  0.0294  | 1  14 8 7 ASP HIS SER-1    3    0.3333
3  10 10 19 ALA HIS MET-1  18   0.1667  | 1  8 11 17 GLU THR THR-1   3    0.3333
3  8 8 15 ASP HIS SER-1    33   0.0909  | 4  10 15 11 ASP HIS TRP-1  15   0.2667
3  10 9 11 CYS VAL VAL-1   33   0.0909  | 4  10 15 11 ASP HIS TRP-2  18   0.2222

Table 8.4: Functional annotations of protein residues in predicted functional sites. A functional site is predicted as a structure pattern that is recurrent among a non-redundant set of proteins. The left panel lists the top 25 patterns ranked by the total number of annotated protein residues for each pattern (A), while the right panel ranks the patterns by the number of annotated protein residues relative to the total number of residues found in all structure examples (A/B).

Residue annotations
            -HSSP             +HSSP
            OPR     FA        OPR   FA
OLDFIELD    2,402   1,963     243   192
PATTERN     168     132       16    19

Table 8.5: Homology-based transfer of extracted functional annotations for protein residues in the mined pattern data. Based on the HSSP information, the identified protein residues and their associated functional annotations were transferred from homologous proteins to the target proteins and residues in the mined structure pattern data.
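The two rankings in Table 8.4 differ only in their sort key: the left panel sorts by the number of annotated residues A, the right panel by the ratio A/B. The following is a minimal Python sketch using four rows copied from Table 8.4. The tuple layout and variable names are illustrative assumptions, not the implementation used in this study.

```python
# (pattern, A = residues with PAS data, B = residues in pattern)
rows = [
    ("9 10 16 CYS CYS PHE-1", 6, 12),
    ("10 11 11 ALA HIS HIS-1", 4, 6),
    ("10 15 11 ASP HIS TRP-2", 4, 18),
    ("17 11 9 ALA LEU VAL-1", 3, 102),
]

# Left panel: rank by the absolute number of annotated residues (A).
by_count = sorted(rows, key=lambda row: row[1], reverse=True)

# Right panel: rank by the annotated fraction of all pattern residues (A/B).
by_ratio = sorted(rows, key=lambda row: row[1] / row[2], reverse=True)

# Ratios rounded to four decimals, as reported in the table.
ratios = {pattern: round(a / b, 4) for pattern, a, b in rows}
```

The example shows why both views are useful: the ALA LEU VAL-1 pattern occurs very often (B = 102) but only a small fraction of its residues is annotated, so it ranks last by ratio.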

8.4 Discussion

The distribution of information in the combined data was studied by a search for active site residues. Another approach to sampling the dataset is the identification of ligand-binding residues. Such a search can start from the protein structure data, by selecting only the residues of an identified metal binding site and then consulting the literature for relevant annotations.

The validation of a predicted active site in this study demonstrates that the amount of extracted functional annotations was not sufficient for this task. Considering that the catalytic triad is a well-characterised structural feature, the information should be available in MEDLINE. In fact, a search for the term "catalytic triad" in the text-mined data finds several associations between the term and residues. Close examination reveals that some are annotations for homologous proteins with the Asp-His-Ser catalytic triad motif (data not shown). However, the results of the presented studies indicate that the recall of the text mining system is too low to capture sufficient annotations for protein homologues.

Despite the identification of some catalytic residues in this analysis, it must be noted that literature-based verification of predicted active sites cannot rule out the detection of false positives. The absence of biological evidence in the literature does not mean that the prediction is wrong, only that no knowledge is currently available. Biological research is hypothesis-driven, so predicted active site residues that have not been a biological research target are not expected to be reported in the literature.

8.5 Conclusion

In this chapter I performed a correlation analysis between the datasets from protein structure data mining and literature mining. The result suggests that the combined data show little correlation. For example, a structure-based prediction of an active site had no functional annotations with biological evidence, even though the result was cross-validated with CSA. Conversely, literature-based identification of catalytic residues could not be interpreted in an evolutionarily conserved structural context, because data mining did not find a suitable recurrent structure pattern.

Chapter 9

Conclusions and future work

9.1 Summary of main contributions

The goal of this thesis was to identify functional sites in proteins. For this purpose a novel approach that combines protein structure data mining and literature mining was used. Below is a summary of contributions.

Significance testing of residue interaction is a novel approach to identify statistically significant spatial and chemical configurations of residues. The developed method relies solely on mathematical models, and the analysis shows that recurrent homologous or convergent structural features can be extracted. More importantly, the mined result contains biologically valid data. For example, 22 proteins with the catalytic triad were identified from cross-validation studies. Altogether, the developed data mining method can be used to discover novel information; the result is a prediction of functional sites.

Identification of protein residues is an important text mining component developed in this study for the extraction of functional annotations. The implemented solution uses regular expression patterns and lists of terminologies from UniProtKB and NCBI Taxonomy to find and associate biological entities. Ultimately, an identified protein residue is mapped to a UniProt protein, which means that other extracted information can be integrated into UniProtKB. With a precision of 0.82 and a recall of 0.38, residues can be identified and associated precisely with their UniProt proteins. A whole-MEDLINE analysis found 15,110 abstracts that can be used for information extraction for 2,884 UniProtKB/PDB proteins.

Contextual feature extraction is a discovery-driven information extraction approach to find descriptions of function associated with a residue entity in the text. The developed method extracts, from a parsed sentence, the verbal and prepositional relations between a residue and its contextual features. The Gene Ontology was not used, because it does not contain suitable terminology for identifying functional descriptions of residues. With a precision of 0.68 and a recall of 0.48, the language parser found 46,750 annotations for the identified protein residues from MEDLINE. Manual analysis indicates that some of the extracted annotations are valid and contain novel information that can be used to update the feature table in UniProtKB.

Annotation of protein structures is the main objective of this thesis. The goal is to create a synthesis between protein structure data and protein function data. The hypothesis is that the intersection of information from both datasets can lead to the discovery of new biological information. For example, a predicted active site can be validated with evidence from the set of functional annotations. Although cross-validation demonstrates that the information mined from PDB and from the literature contains correct results, no correlation was found between the two datasets. Nevertheless, the text-mined information is valid, and 1,391 catalytic residues were found that can be used to update CSA.
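The residue identification component summarised above relies on regular expression patterns. The following is a minimal Python sketch of that idea, assuming a single hypothetical pattern for three-letter residue codes followed by a sequence position. The pattern, the names `RESIDUE_RE` and `find_residues`, and the example sentence are illustrative; the patterns actually used in this study are not reproduced here, and the mapping to UniProt proteins via term lists is not modelled.

```python
import re

# Three-letter codes of the 20 standard amino acids.
AA3 = ["Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
       "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val"]

# Hypothetical pattern: a capitalised or upper-case code, an optional hyphen
# or single space, then a position of one to four digits, e.g. "Ser195",
# "His-57", "ASP 102". Lower-case forms are excluded on purpose to avoid
# matching ordinary words such as "his".
_forms = AA3 + [code.upper() for code in AA3]
RESIDUE_RE = re.compile(r"\b(" + "|".join(_forms) + r")[- ]?(\d{1,4})\b")


def find_residues(sentence):
    """Return (CODE, position) pairs for residue mentions in a sentence."""
    return [(m.group(1).upper(), int(m.group(2)))
            for m in RESIDUE_RE.finditer(sentence)]


mentions = find_residues(
    "The catalytic triad consists of Ser195, His-57 and Asp 102."
)
```

A pattern of this kind favours precision over recall: mentions written in full ("serine 195") or in one-letter notation ("S195") are missed unless further patterns are added.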

9.2 Limitations and future work

During the work on this thesis, various research techniques and three major analysis components have been developed. Their algorithms and implementations were explained, their performance analysed, and suggestions for improvement made. What follows is a discussion of improvements for the combined dataset analysis.

To biologically validate a predicted functional site with published experimental results, it has to be assumed that the functional annotations extracted from the literature provide sufficient supporting evidence for a biological function. This has been shown to be partly correct for some examples. However, it will probably not work in all cases. My results suggest that other factors have to be considered in order to achieve one of the following: (1) a standardised description of the function of protein residues; (2) identification of a representative functional concept of a structural feature; and (3) verification of the validity of the pattern as a consensus functional site, where the other protein examples share the same annotations. Although the verification approach uses the vast and broad information in MEDLINE, the analysis indicates that this might not be sufficient for this task. Another serious limitation of the literature-based verification of functional sites is that our knowledge of the protein function space could be incomplete or even incorrect.

Protein structure data mining aims to deliver biologically unbiased results, since 3D pattern mining relies on mathematical models and uses no biological knowledge. The result is a prediction of functional sites. However, the input is biologically biased. Currently, we do not have complete knowledge of the fold space, which means that the observed distribution of structural features may be skewed. As a consequence, the prediction may contain a large fraction of false positives.
In the long run, various structural genomics initiatives may expand our knowledge of the fold space. In the meantime, the literature is the main resource of biological evidence to validate predictions. Yet our knowledge of protein residue function, and even the spectrum of biological function, has still to be determined. This can lead to four scenarios: (1) a true functional site is fully supported by evidence (true positive); (2) a true functional site is partly supported by evidence (incomplete knowledge); (3) a falsely predicted functional site is partly supported by evidence (incomplete knowledge); and (4) a falsely predicted functional site is fully supported by contradictory evidence (false positive). While, from a bioinformatics point of view, there is little we can do about this problem, identifying cases (2), (3), and (4) can suggest further biological experiments to find the missing data.

Bibliography

[AGM+90] SF Altschul, W Gish, W Miller, EW Myers, and DJ Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–10, 1990.

[AL02] M Ashburner and SE Lewis. On ontologies for biologists: the gene ontology - uncoupling the web. Novartis Foundation Symposium, 2002.

[AMS+97] SF Altschul, TL Madden, AA Schaffer, J Zhang, Z Zhang, W Miller, and DJ Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–402, 1997.

[APG+94] PJ Artymiuk, AR Poirrette, HM Grindley, DW Rice, and P Willett. A graph-theoretic approach to the identification of three-dimensional patterns of amino acid side-chains in protein structures. Journal of Molecular Biology, 243(2):327–44, 1994.

[Att02] TK Attwood. The PRINTS database: a resource for identification of protein families. Brief Bioinform, 3(3):252–63, 2002.

[AZP+05] G Ausiello, A Zanzoni, D Peluso, A Via, and M Helmer-Citterich. pdbFun: mass selection and fast comparison of annotated PDB residues. Nucleic Acids Research, 33:W133–137, Jul 2005.

[BFL04] T Binkowski, P Freeman, and J Liang. pvSOAR: detecting similar surface patterns of pocket and void surfaces of amino acid residues on proteins. Nucleic Acids Research, 32:555–558, 2004.

[BFW+94] A Barth, K Frost, M Wahab, W Brandt, HD Schadler, and R Franke. Classification of serine proteases derived from steric comparisons of their active sites, part ii: "ser, his, asp arrangements in proteolytic and nonproteolytic proteins". Drug Design Discovery, 2:89–111, November 1994.

[BGH+00] WC Barker, JS Garavelli, H Huang, PB Mcgarvey, BC Orcutt, GY Srinivasarao, C Xiao, LL Yeh, RS Ledley, JF Janda, F Pfeiffer, HW Mewes, A Tsugita, and C Wu. The protein information resource (pir). Nucleic Acids Research, 28(1):41–44, January 2000.

[BKL00] SE Brenner, P Koehl, and M Levitt. The astral compendium for protein structure and sequence analysis. Nucleic Acids Research, 28(1):254–256, January 2000.

[BLK+08] E Beisswanger, V Lee, JJ Kim, D Rebholz-Schuhmann, A Splendiani, O Dameron, S Schulz, and U Hahn. Gene regulation ontology (gro): design principles and use cases. Studies in health technology and informatics, 136:9–14, 2008.

[BM05] R Bunescu and RJ Mooney. A shortest path dependency kernel for relation extraction. In Proceedings of the Joint Conference on Human Language Technology / Empirical Methods in Natural Language Processing (HLT/EMNLP'05), 2005.

[BM06] R Bunescu and RJ Mooney. Subsequence kernels for relation extraction. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 171–178. MIT Press, 2006.

[BMC08] BMC. Biomed central. http://www.biomedcentral.com/, November 2008.

[BT03] JA Barker and JM Thornton. An algorithm for constraint-based structural template matching: application to 3D templates with statistical analysis. Bioinformatics, 19(13):1644–1649, September 2003.

[BW03] PE Bourne and H Weissig. Structural Bioinformatics (Methods of Biochemical Analysis, V. 44). Wiley-Liss, 1 edition, February 2003.

[BW05] CJO Baker and R Witte. Mutation miner - textual annotation of protein structures. CERMM Symposium, 2005.

[BWF+00] HM Berman, J Westbrook, Z Feng, G Gilliland, TN Bhat, H Weissig, IN Shindyalov, and PE Bourne. The protein data bank. Nucleic Acids Research, 28(1):235–242, January 2000.

[CB94] RR Copley and GJ Barton. A structural analysis of phosphate and sulphate binding sites in proteins. Estimation of propensities for binding and conservation of phosphate binding sites. Journal of Molecular Biology, 242:321–329, Sep 1994.

[CCR+08] BL Cantarel, PM Coutinho, C Rancurel, T Bernard, V Lombard, and B Henrissat. The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics. Nucleic Acids Research, Oct 2008.

[Cer00] F Cerbah. Exogenous and endogenous approaches to semantic categorization of unknown technical terms. In Proceedings of the 18th International Conference on Computational Linguistics (COLING), pages 145–151, 2000.

[CFK+05] BY Chen, VY Fofanov, DM Kristensen, M Kimmel, O Lichtarge, and LE Kavraki. Algorithms for structural comparison and statistical analysis of 3D protein motifs. Pacific Symposium on Biocomputing, pages 334–345, 2005.

[Cha93] P Chakrabarti. Anion binding sites in protein structures. Journal of Molecular Biology, 234:463–482, Nov 1993.

[CHR+02] JM Castagnetto, SW Hennessy, VA Roberts, ED Getzoff, JA Tainer, and ME Pique. Mdb: the metalloprotein database and browser at the scripps research institute. Nucleic Acids Research, 30(1):379–382, January 2002.

[CK06] IG Choi and SH Kim. Evolution of protein structural classes and protein sequence families. Proceedings of the National Academy of Sciences, September 2006.

[CL64] RV Cochran and LH Lund. On the kirkwood superposition approximation. Journal of Physical Chemistry, 1964.

[CMP05] J Crim, R McDonald, and F Pereira. Automatically annotating documents with normalized gene lists. BMC Bioinformatics, 6 Suppl 1, 2005.

[CMR06] P Corbett and P Murray-Rust. High-throughput identification of chemistry in life science texts. In Computational Life Sciences II, pages 107–118. Springer, 2006.

[CSL+06] FM Couto, MJ Silva, V Lee, E Dimmer, E Camon, R Apweiler, H Kirsch, and D Rebholz-Schuhmann. Goannotator: linking protein go annotations to evidence text. Journal of Biomedical Discovery and Collaboration, 1:19+, December 2006.

[DBAD03] R Day, DA Beck, RS Armen, and V Daggett. A consensus view of fold space: combining SCOP, CATH, and the Dali Domain Dictionary. Protein Science, 12:2150–2160, Oct 2003.

[DCG+04] F Diella, S Cameron, C Gemuend, R Linding, A Via, B Kuster, ST Ponten, N Blom, and TJ Gibson. Phospho.elm: a database of experimentally verified phosphorylation sites in eukaryotic proteins. BMC Bioinformatics, 5, June 2004.

[DS05] A Doms and M Schroeder. Gopubmed: exploring with the gene ontology. Nucleic Acids Research, 33(Web Server issue), July 2005.

[FGS98] JS Fetrow, A Godzik, and J Skolnick. Functional analysis of the escherichia coli genome using the sequence-to-structure-to-function paradigm: identification of proteins exhibiting the glutaredoxin/thioredoxin disulfide oxidoreductase activity. Journal of Molecular Biology, 282(4):703–711, October 1998.

[FKY+01] C Friedman, P Kra, H Yu, M Krauthammer, and A Rzhetsky. Genies: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17 Suppl 1, 2001.

[Fri07] D Frishman. Protein annotation at genomic scale: the current status. Chem Rev, 107(8):3448–3466, August 2007.

[FS98] JS Fetrow and J Skolnick. Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T1 ribonucleases. Journal of Molecular Biology, 281(5), September 1998.

[Fuk98] K Fukuda. Toward information extraction: identifying protein names from biological papers, 1998.

[FWLN94] D Fischer, H Wolfson, SL Lin, and R Nussinov. Three-dimensional, sequence order-independent structural comparison of a serine protease against the crystallographic database reveals active site similarities: potential implications to evolution and to protein folding. Protein Science, 3(5):769–778, May 1994.

[GDAW03] R Gaizauskas, G Demetriou, PJ Artymiuk, and P Willett. Protein structures and information extraction from biological texts: the pasta system. Bioinformatics, 19(1):135–143, January 2003.

[GDO+05] A Golovin, D Dimitropoulos, TJ Oldfield, A Rachedi, and K Henrick. Msdsite: A database search and retrieval system for the analysis and viewing of bound ligands and active sites. Proteins: Structure, Function, and Bioinformatics, 58(1):190–199, 2005.

[GH08] A Golovin and K Henrick. Msdmotif: exploring protein sites and motifs. BMC Bioinformatics, 9(1), 2008.

[GJYLRS08] S Gaudan, A Jimeno Yepes, V Lee, and D Rebholz-Schuhmann. Combining evidence, specificity, and proximity towards the normalization of gene ontology terms in text. EURASIP journal on bioinformatics & systems biology, 2008.

[Glu91] JP Glusker. Structural aspects of metal liganding to functional groups in proteins. Advances in Protein Chemistry, 42:1–76, 1991.

[GOC06] GOConsortium. The gene ontology (go) project in 2006. Nucleic Acids Research, 34(Database issue), January 2006.

[GPP+03] F Glaser, T Pupko, I Paz, RE Bell, D Bechor-Shental, E Martz, and N Ben-Tal. ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. Bioinformatics, 19(1):163–164, January 2003.

[Gue96] F Guenthner. Electronic lexica and corpora research at cis. CIS Bericht-96-100, 1996.

[HBB+08] N Hulo, A Bairoch, V Bulliard, L Cerutti, BA Cuche, E de Castro, C Lachaize, PS Langendijk-Genevaux, and CJ Sigrist. The 20 years of PROSITE. Nucleic Acids Research, 36:D245–249, Jan 2008.

[HBGK03] M Hendlich, A Bergner, J Günther, and G Klebe. Relibase: design and development of a database for comprehensive analysis of protein-ligand interactions. Journal of Molecular Biology, 326(2):607–620, February 2003.

[HFM+05] D Hanisch, K Fundel, HT Mevissen, R Zimmer, and J Fluck. Prominer: rule-based protein and gene entity recognition. BMC Bioinformatics, 6 Suppl 1, 2005.

[HJ99] C Hadley and DT Jones. A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure, 7:1099–1112, Sep 1999.

[HLC04] F Horn, AL Lau, and FE Cohen. Automated extraction of mutation data from the literature: application of mutext to g protein-coupled receptors and nuclear hormone receptors. Bioinformatics, 20(4):557–568, March 2004.

[HMBC97] TJ Hubbard, AG Murzin, SE Brenner, and C Chothia. SCOP: a structural classification of proteins database. Nucleic Acids Research, 25:236–239, Jan 1997.

[HNR+05] ZZ Hu, M Narayanaswamy, KE Ravikumar, K Vijay-Shanker, and CH Wu. Literature mining and database annotation of protein phosphorylation using a rule-based system. Bioinformatics, 21(11):2759–2765, June 2005.

[Hob02] JR Hobbs. Information extraction from biomedical text. Journal of Biomedical Informatics, 35(4):260–264, August 2002.

[HPS+03] A Harrison, F Pearl, I Sillitoe, T Slidel, R Mott, JM Thornton, and CA Orengo. Recognizing the fold of a protein structure. Bioinformatics, 19(14):1748–1759, September 2003.

[HS94] L Holm and C Sander. The fssp database of structurally aligned protein fold families. Nucleic Acids Research, 22(17):3600–3609, September 1994.

[HS96] L Holm and C Sander. Mapping the protein universe. Science, 273(5275):595–603, August 1996.

[HSSS92] U Hobohm, M Scharf, R Schneider, and C Sander. Selection of representative protein data sets. Protein Science, 1(3):409–417, March 1992.

[HZH+04] M Huang, X Zhu, Y Hao, DG Payan, K Qu, and M Li. Discovering patterns to extract protein-protein interactions from full texts. Bioinformatics, 20(18):3604–3612, December 2004.

[IPGK05] VA Ivanisenko, SS Pintus, DA Grigorovich, and NA Kolchanov. PDBSite: a database of the 3D structure of protein functional sites. Nucleic Acids Research, 33:D183–187, Jan 2005.

[JB04] A Jakulin and I Bratko. Testing the significance of attribute interactions. In ICML, pages 409–416. ACM Press, 2004.

[JGLRS08] S Jaeger, S Gaudan, U Leser, and D Rebholz-Schuhmann. Integrating protein-protein interactions and text mining for protein function prediction. BMC Bioinformatics, 9(Suppl 8), 2008.

[JIDG03] M Jambon, A Imberty, G Deléage, and C Geourjon. A new bioinformatic approach to detect common 3d sites in protein structures. Proteins: Structure, Function, and Genetics, 52:137–145, 2003.

[JK95] J Justeson and S Katz. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, pages 9–27, 1995.

[KCRB07] R Kanagasabai, KH Choo, S Ranganathan, and CJ Baker. A workflow for mutation extraction and structure annotation. Journal of Bioinformatics and Computational Biology, 5(6):1319–1337, December 2007.

[KH04] E Krissinel and K Henrick. Secondary-structure matching (ssm), a new tool for fast protein structure alignment in three dimensions. Acta Crystallographica Section D: Biological Crystallography, 60(1):2256–2268, December 2004.

[KJ94] GJ Kleywegt and TA Jones. Detection, delineation, measurement and display of cavities in macromolecular structures. Acta Crystallographica Section D: Biological Crystallography, 50(Pt 2):178–185, March 1994.

[Kle99] GJ Kleywegt. Recognition of spatial motifs in protein structures. Journal of Molecular Biology, 285(4):1887–1897, January 1999.

[KN03] K Kinoshita and H Nakamura. Identification of protein biochemical functions by similarity search using the molecular surface database ef-site. Protein Science, 12(8):1589–1595, August 2003.

[KNT05] A Koike, Y Niwa, and T Takagi. Automatic extraction of gene/protein biological functions from biomedical text. Bioinformatics, 21(7):1227–1236, April 2005.

[KON99] T Kawabata, M Ota, and K Nishikawa. The protein mutant database. Nucleic Acids Research, 27(1):355–357, January 1999.

[Las95] RA Laskowski. Surfnet: a program for visualizing molecular surfaces, cavities, and intermolecular interactions. Journal of Molecular Graphics, 13(5), October 1995.

[LC05] G Leroy and H Chen. Genescene: An ontology-enhanced integration of linguistic and co-occurrence based relations in biomedical texts: Research articles. Journal of the American Society for Information Science and Technology, 56(5):457–468, March 2005.

[LCM03] G Leroy, H Chen, and JD Martinez. A shallow parser based on closed-class words to capture relations in biomedical text. Journal of Biomedical Informatics, pages 145–158, June 2003.

[LEW98] J Liang, H Edelsbrunner, and C Woodward. Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Science, 7(9):1884–1897, September 1998.

[LHC07] LC Lee, F Horn, and FE Cohen. Automatic extraction of protein point mutations using a graph bigram association. PLoS Computational Biology, 3(2):e16+, February 2007.

[LRTV07] G Lopez, A Rojas, M Tress, and A Valencia. Assessment of predictions submitted for the CASP7 function prediction category. Proteins, 69 Suppl 8:165–74, 2007.

[LW91] Y Lamdan and HJ Wolfson. On the error analysis of 'geometric hashing'. Computer Vision and Pattern Recognition, 1991. Proceedings CVPR '91., IEEE Computer Society Conference on, pages 22–27, June 1991.

[Mar05] AC Martin. Mapping pdb chains to uniprotkb entries. Bioinformatics, 21(23):4297–4301, December 2005.

[MB99] Y Matsuo and SH Bryant. Identification of homologous core structures. Proteins, 35:70–79, Apr 1999.

[MG03] J McCallum and S Ganesh. Text mining of DNA searches. Applied Bioinformatics, 2:59–63, 2003.

[MR03] S Mika and B Rost. UniqueProt: Creating representative protein sequence sets. Nucleic Acids Research, 31:3789–3791, Jul 2003.

[MSD08] MSDmapping. Msdmapping. http://www.ebi.ac.uk/msd-as/MSDMapping/, November 2008.

[MT05] Y Miyao and J Tsujii. Probabilistic disambiguation models for wide-coverage hpsg parsing. In ACL '05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 83–90. Association for Computational Linguistics, 2005.

[NBD+06] J Natarajan, D Berrar, W Dubitzky, C Hack, Y Zhang, C Desesa, JR Van Brocklyn, and EG Bremer. Text mining of full-text journal articles combined with gene expression analysis reveals a relationship between sphingosine-1-phosphate and invasiveness of a glioblastoma cell line. BMC Bioinformatics, 7:373+, August 2006.

[NED03] S Novichkova, S Egorov, and N Daraselia. Medscan, a natural language processing engine for medline abstracts. Bioinformatics, 19(13):1699–1706, September 2003.

[OCR01] MJ Ondrechen, JG Clifton, and D Ringe. Thematics: A simple computational predictor of enzyme function from structure. Proceedings of the National Academy of Sciences, 98(22):12473–12478, October 2001.

[Old01] TJ Oldfield. Creating structure features by data mining the PDB to use as molecular-replacement models. Acta Crystallographica Section D: Biological Crystallography, 57:1421–1427, Oct 2001.

[Old02] TJ Oldfield. Data mining the protein data bank: residue interactions. Proteins, 49(4):510–528, December 2002.

[OMJ+97] CA Orengo, AD Michie, S Jones, DT Jones, MB Swindells, and JM Thornton. CATH-a hierarchic classification of protein domain structures. Structure, 5:1093–1108, Aug 1997.

[PB06] BJ Polacco and PC Babbitt. Automated discovery of 3d motifs for protein function annotation. Bioinformatics, 22(6):723–730, March 2006.

[PBT04] CT Porter, GJ Bartlett, and JM Thornton. The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Research, 32(Database issue), January 2004.

[PJYLRS08] P Pezik, A Jimeno Yepes, V Lee, and D Rebholz-Schuhmann. Static dictionary features for term polysemy identification. Building and evaluating resources for biomedical text mining, LREC Workshop, 2008.

[PKS06] G Pandey, V Kumar, and M Steinbach. Computational approaches for protein function prediction: A survey. Technical Report 06-028, Department of Computer Science and Engineering, University of Minnesota, Twin Cities, 2006.

[Plo08] PloS. Public library of science. http://www.plos.org/, November 2008.

[PMC08] PMC. Pubmed central. http://www.pubmedcentral.nih.gov/, November 2008.

[POHS05] M Pesu, J O'Shea, L Hennighausen, and O Silvennoinen. Identification of an acquired mutation in Jak2 provides molecular insights into the pathogenesis of myeloproliferative disorders. Molecular Interventions, 5:211–215, Aug 2005.

[RMK+07] ND Rawlings, FR Morton, CY Kok, J Kong, and AJ Barrett. Merops: the peptidase database. Nucleic Acids Research, pages gkm954+, November 2007.

[Ros99] B Rost. Twilight zone of protein sequence alignments. Protein Engineering Design and Selection, 12(2):85–94, February 1999.

[RSAG+08] D Rebholz-Schuhmann, M Arregui, S Gaudan, H Kirsch, and A Jimeno Yepes. Text processing through web services: Calling whatizit. Bioinformatics, 2008.

[RSKA+07] D Rebholz-Schuhmann, H Kirsch, M Arregui, S Gaudan, M Riethoven, and P Stoehr. Ebimed-text crunching to gather facts for proteins from medline. Bioinformatics, 23(2), January 2007.

[RSMA+04] D Rebholz-Schuhmann, S Marcel, S Albert, R Tolle, G Casari, and H Kirsch. Automatic extraction of mutations from medline and cross-validation with omim. Nucleic Acids Research, 2004.

[Rus98] RB Russell. Detection of protein three-dimensional side-chain patterns: new examples of convergent evolution. Journal of Molecular Biology, 279(5):1211–1227, June 1998.

[SAR+07] B Smith, M Ashburner, C Rosse, K Bard, W Bug, W Ceusters, LJ Goldberg, K Eilbeck, A Ireland, CJ Mungall, N Leontis, P Rocca-Serra, A Ruttenberg, SA Sansone, RH Scheuermann, N Shah, PL Whetzel, and S Lewis. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology, 25(11):1251–5, 2007.

[SB05] A Schutz and P Buitelaar. Relext: A tool for relation extraction from text in ontology extension. The Semantic Web - ISWC 2005, pages 593–606, 2005.

[SB06] J Schuman and S Bergler. Postnominal prepositional phrase attachment in proteomics. In Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology. Association for Computational Linguistics, 2006.

[SDC06] A Sidhu, T Dillon, and E Chang. Unification of protein data and knowledge sources. Knowledge-Based Intelligent Information and Engineering Systems, pages 728–737, 2006.

[Sin04] A Singer. Maximum entropy formulation of the Kirkwood superposition approximation. Journal of Chemical Physics, 121:3657–3666, Aug 2004.

[SPIBA03] PK Shah, C Perez-Iratxeta, P Bork, and MA Andrade. Information extraction from full text scientific articles: where are the keywords? BMC Bioinformatics, 4(1), May 2003.

[SPNW04] A Shulman-Peleg, R Nussinov, and HJ Wolfson. Recognition of functional sites in protein structures. Journal of Molecular Biology, 339(3):607–633, June 2004.

[SS96] R Schneider and C Sander. The HSSP database of protein structure- sequence alignments. Nucleic Acids Research, 24(1):201–5, 1996.

[SSR03] A Stark, S Sunyaev, and RB Russell. A model for statistical significance of local similarities in structure. Journal of Molecular Biology, 326(5):1307– 1316, March 2003.

[STB06] MH Saier, CV Tran, and RD Barabote. Tcdb: the transporter classification database for membrane transport protein analyses and information. Nucleic Acids Research, 34(Database issue), January 2006.

[SWS+04] MJ Schuemie, M Weeber, BJ Schijvenaars, EM van Mulligen, CC van der Eijk, R Jelier, B Mons, and JA Kors. Distribution of information in biomedical abstracts and full-text publications. Bioinformatics, 20(16):2597–2604, November 2004.

[SYH+03] S Saito, H Yamaguchi, Y Higashimoto, C Chao, Y Xu, AJ Fornace, E Appella, and CW Anderson. Phosphorylation site interdependence of human p53 post-translational modifications in response to stress. Journal of Biological Chemistry, 278:37536–37544, Sep 2003.

[TCS+07] RT Tsai, WC Chou, YS Su, YC Lin, CL Sung, HJ Dai, IT Yeh, W Ku, TY Sung, and WL Hsu. Biosmile: A semantic role labeling system for biomedical verbs using a maximum-entropy model with automatically generated template features. BMC Bioinformatics, 8:325+, September 2007.

[TMA08] Y Tsuruoka, J Mcnaught, and S Ananiadou. Normalizing biomedical terms by minimizing ambiguity and variability. BMC Bioinformatics, 9(Suppl 3), 2008.

[TOT04] Y Tateisi, T Ohta, and J Tsujii. Annotation of predicate-argument structure on molecular biology text. In First International Joint Conference on Natural Language Processing In the IJCNLP-04 workshop on Beyond Shallow Analyses, March 2004.

[TW02] L Tanabe and WJ Wilbur. Tagging gene and protein names in biomedical text. Bioinformatics, 18(8):1124–1132, August 2002.

[VMMR+05] S Velankar, P McNeil, V Mittard-Runte, A Suarez, D Barrell, R Apweiler, and K Henrick. E-msd: an integrated data resource for bioinformatics. Nucleic Acids Research, 33(Database issue), January 2005.

[VZHC05] A Via, A Zanzoni, and M Helmer-Citterich. Seq2Struct: a resource for establishing sequence-structure links. Bioinformatics, 21(4):551–3, 2005.

[WAB+06] CH Wu, R Apweiler, A Bairoch, DA Natale, WC Barker, B Boeckmann, S Ferro, E Gasteiger, H Huang, R Lopez, M Magrane, MJ Martin, R Mazumder, C O’Donovan, N Redaschi, and B Suzek. The universal protein resource (UniProt): an expanding universe of protein information. Nucleic Acids Research, 34(Database issue), January 2006.

[WBB+06] DL Wheeler, T Barrett, DA Benson, SH Bryant, K Canese, V Chetvernin, DM Church, M Dicuccio, R Edgar, S Federhen, LY Geer, W Helmberg, Y Kapustin, DL Kenton, O Khovayko, DJ Lipman, TL Madden, DR Maglott, J Ostell, KD Pruitt, GD Schuler, LM Schriml, E Sequeira, ST Sherry, K Sirotkin, A Souvorov, G Starchenko, TO Suzek, R Tatusov, TA Tatusova, L Wagner, and E Yaschenko. Database resources of the national center for biotechnology information. Nucleic Acids Research, 34(Database issue), January 2006.

[WBT97] AC Wallace, N Borkakoti, and JM Thornton. Tess: a geometric hashing algorithm for deriving 3d coordinate templates for searching structural databases. application to enzyme active sites. Protein Science, 6(11):2308–2323, November 1997.

[WD03] G Wang and RL Dunbrack. Pisces: a protein sequence culling server. Bioinformatics, 19(12):1589–1591, August 2003.

[WK07] R Witte and T Kappler. Enhanced semantic access to the protein engineering literature using ontologies populated by text mining. International Journal of Bioinformatics Research and Applications, 2007.

[WR97] HJ Wolfson and I Rigoutsos. Geometric hashing: an overview. Computational Science and Engineering, IEEE [see also Computing in Science & Engineering], 4(4):10–21, 1997.

[WSC04] T Wattarujeekrit, PK Shah, and N Collier. Pasbio: predicate-argument structures for event extraction in molecular biology. BMC Bioinformatics, 5, October 2004.

[YEC+07] S Yoon, JC Ebert, EY Chung, G De Micheli, and RB Altman. Clustering protein environments for function prediction: finding prosite motifs in 3d. BMC Bioinformatics, 8 Suppl 4, 2007.

[YHF+02] H Yu, V Hatzivassiloglou, C Friedman, A Rzhetsky, and WJ Wilbur. Automatic extraction of gene and protein synonyms from medline and journal articles. Proceedings of the AMIA Symposium, pages 919–923, 2002.

[YLPV07] YL Yip, N Lachenal, V Pillet, and AL Veuthey. Retrieving mutation-specific information for human proteins in UniProt/Swiss-Prot Knowledgebase. Journal of Bioinformatics and Computational Biology, 5:1215–1231, Dec 2007.

[YMTT05] A Yakushiji, Y Miyao, Y Tateisi, and J Tsujii. Biomedical information extraction with predicate-argument structure patterns. In SMBM, 2005.

Appendix A

Examples of errors in relation extraction.

Table A.1: Examples of errors in the relation extraction for the detection of contextual features.

Sentence: "This observation provides a rationale for the reduced electron-transfer efficiency displayed by the E92K mutant." (PMID:10089511)
Annotated residue: GLU92
Annotated keywords: reduced electron-transfer efficiency
Annotated PAS: pred = displayed; arg1 = the reduced electron-transfer efficiency; arg2-by = the E92K mutant
TP shallow parsing: pred = displayed; arg1 = a rationale; arg1-for = the reduced electron-transfer efficiency; arg2-by = the GLU92 LYS mutant
FP full parsing: pred = displayed; arg1-by = the GLU92 LYS mutant

Sentence: "An apparent 'acceptor consensus overlap' at Ser474 suggests that the mechanism behind the glycosaminoglycan split of TM may involve a competition for substrate between xylosyltransferase and N-acetylgalactosaminyltransferase." (PMID:8216207)
Annotated residue: SER474
Annotated keywords: acceptor consensus overlap
Annotated PAS: pred = suggests; arg1 = An apparent 'acceptor consensus overlap'; arg1-at = SER474; arg2 = the mechanism behind the glycosaminoglycan split; arg2-of = TM
FP shallow parsing: pred = suggests; arg1-at = SER474; arg2 = that the mechanism; arg2-behind = the glycosaminoglycan split; arg2-of = 
TP full parsing: pred = suggests; arg1 = An apparent 'acceptor consensus overlap'; arg1-at = SER474; arg2 = that the mechanism; arg2-behind = the glycosaminoglycan split; arg2-of = TM

Sentence: "Using this approach, coupled with Edman degradation of the 32PO4-labeled tryptic peptides, and comparison with tryptic peptides analyzed after labeling normal human colonic tissues, we identified ser-52 as the major K18 physiologic phosphorylation site." (PMID:7523419)
Annotated residue: SER52
Annotated keywords: physiologic phosphorylation site
Annotated PAS: pred = identified; arg1 = unk; arg2 = SER52; arg2-as = the major K18 phosphorylation phosphorylation site
FP shallow parsing: pred = identified; arg2 = SER52; arg2-as = the major
FP full parsing: pred = identified; arg1 = we; arg2 = SER52
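The PAS entries in the tables of these appendices use a flat key = value notation (pred, arg1, arg2, and preposition-qualified arguments such as arg1-of or arg2-by). As a minimal sketch, assuming only this notation, such an entry can be split into an ordered list of key/value pairs. The function name parse_pas and the regular expression are illustrative assumptions; they are not the relation extraction module evaluated in this thesis.

```python
import re

# Keys are "pred", "arg1", "arg2", or a preposition-qualified form
# such as "arg1-of" or "arg2-by" (assumed from the appendix tables).
_KEY = re.compile(r"\b(pred|[Aa]rg[12](?:-[a-z]+)?)\s*=")


def parse_pas(flat: str) -> list[tuple[str, str]]:
    """Split a flat 'key = value key = value' PAS entry into (key, value) pairs.

    A list is returned instead of a dict because a key such as
    'arg1-of' may occur several times in one entry.
    """
    matches = list(_KEY.finditer(flat))
    pairs = []
    for i, m in enumerate(matches):
        # The value runs from the end of this key up to the next key.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(flat)
        # Strip surrounding whitespace and any ';' separator.
        value = flat[m.end():end].strip(" ;")
        pairs.append((m.group(1).lower(), value))
    return pairs


entry = ("pred = displayed arg1 = a rationale "
         "arg1-for = the reduced electron-transfer efficiency "
         "arg2-by = the GLU92 LYS mutant")
for key, value in parse_pas(entry):
    print(f"{key:10s} {value}")
```

An argument without a head, such as the empty arg2 that precedes arg2-of in several entries, is returned with an empty string as its value, so the positional structure of the entry is preserved.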

Appendix B

Examples of extracted functional annotations compared with UniProtKB

Table B.1: Comparison of extracted protein residue annotations from GC with UniProtKB. Mined functional annotations are listed as PAS, while the relevant information from UniProtKB is reproduced from the feature table (FT) entry line.

RID+UID: SER15 P53 HUMAN
Sentence: "Previous studies have demonstrated that phosphorylation of human p53 on serine 15 contributes to protein stabilization after DNA damage and that this is mediated by the ATM family of kinases." (PMID:11865061)
UniProtKB/FT:
  SER15 MOD RES: Phosphoserine; by PRPK
  SER15 VARIANT: S->R in a sporadic cancer; somatic mutation.
PAS: pred = contributes; arg1 = ; arg1-on = SER15; arg2 = ; arg2-to = protein stabilization; arg2-after = DNA damage and that

RID+UID: GLU189 CP27B HUMAN, LEU343 CP27B HUMAN
Sentence: "The R389G mutant was totally inactive, but mutant L343F retained 2.3% of wild-type activity, and mutant E189G retained 22% of wild-type activity." (PMID:12050193)
UniProtKB/FT:
  GLU189 VARIANT: E-K in VDDR I; 11% of wild-type activity.
  LEU343 VARIANT: L->F in VDDR I; 2.3% of wild-type activity.
PAS: pred = retained; arg1 = but mutant LEU343 PHE; arg2 = 2.3 %; arg2-of = wild-type activity
PAS: pred = retained; arg1 = and mutant GLU189 GLY; arg2 = 22 %; arg2-of = wild-type activity

RID+UID: CYS260 TGA1 ARATH, CYS266 TGA1 ARATH
Sentence: "Furthermore, site-directed mutagenesis of TGA1 Cys-260 and Cys-266 enables the interaction with NPR1 in yeast and Arabidopsis." (PMID:12953119)
UniProtKB/FT:
  C260/C266 DISULFID: (potential).
  C260 MUTAGEN: C->N; Gain of interaction with NPR1; when associated with S-266.
  C266 MUTAGEN: C->S: Gain of interaction with NPR1; when associated with S-260.
PAS: pred = enables; arg1 = site-directed mutagenesis; arg1-of = TGA1 CYS260 and CYS266; arg2 = the interaction; arg2-with = NPR1; arg2-in = yeast and Arabidopsis

RID+UID: THR13 RUM1 SCHPO, SER19 RUM1 SCHPO
Sentence: "Direct in vitro kinase assay using GST-fusion proteins of wild-type as well as various mutants of p25(rum1) demonstrated that MAPK phosphorylates the N-terminal portion of p25(rum1) and residues Thr13 and Ser19 are major phosphorylation sites for MAPK." (PMID:12135491)
UniProtKB/FT:
  THR13 MOD RES: Phosphothreonine; by MAPK
  SER19 MOD RES: Phosphoserine; by MAPK
  SER19 MUTAGEN: S->E:reduces activity as a cdc2 inhibitor; when associated with E-13
PAS: pred = are; arg1 = the N-terminal portion; arg1-of = p25(rum1) and residues THR13 and SER19; arg2 = major phosphorylation sites; arg2-for = MAPK

RID+UID: THR13 RUM1 SCHPO, SER19 RUM1 SCHPO
Sentence: "Together with the fact that replacement of both Thr13 and Ser19 with Glu, which mimics the phosphorylated state of these residues, also significantly reduces the activity of p25(rum1) as a Cdc2 inhibitor, it was suggested that the phosphorylation of Thr13 and Ser19 negatively regulates the function of p25(rum1)." (PMID:12135491)
UniProtKB/FT:
  THR13 N/A
  SER19 N/A
PAS: pred = suggested; arg2 = that the phosphorylation; arg2-of = THR13 and SER19
PAS: pred = regulates; arg1 = that the phosphorylation; arg1-of = THR13 and SER19; arg2 = the function; arg2-of = p25(rum1)

RID+UID: THR13 RUM1 SCHPO, SER19 RUM1 SCHPO
Sentence: "Further evidence indicates that phosphorylation of Thr13 and Ser19 may retain a negative effect on the function of p25(rum1) even in vivo." (PMID:12135491)
UniProtKB/FT:
  THR13 N/A
  SER19 N/A
PAS: pred = retain; arg1 = that; arg1-of = THR13 and SER19; arg2 = a negative effect; arg2-on = the function; arg2-of = p25(rum1)

RID+UID: GLU55 DHMA MYCAV, ASP123 DHMA MYCAV, TRP124 DHMA MYCAV
Sentence: "Many residues essential for the dehalogenation reaction are conserved in DhmA; the putative catalytic triad consists of Asp123, His279, and Asp250, and the putative oxyanion hole consists of Glu55 and Trp124." (PMID:12147465)
UniProtKB/FT:
  GLU55 N/A
  ASP123 ACT SITE: Nucleophile (by similarity).
  TRP124 N/A
PAS: pred = consists; arg1 = the putative catalytic triad; arg2 = ; arg2-of = ASP123
PAS: pred = consists; arg1 = and the putative oxyanion hole; arg2 = ; arg2-of = GLU55 and TRP124

RID+UID: CYS48 THIO RAT, CYS152 THIO RAT, CYS73 THIO RAT
Sentence: "Thus, PrxV mutants lacking Cys(48) or Cys(152) showed no detectable thioredoxin-dependent peroxidase activity, whereas mutation of Cys(73) had no effect on activity." (PMID:10751410)
UniProtKB/FT: N/A
PAS: pred = showed; arg1 = CYS48 or CYS152; arg2 = no detectable thioredoxin-dependent peroxidase activity
PAS: pred = had; arg1 = whereas mutation; arg1-of = CYS73; arg2 = no effect on activity

RID+UID: GLY43 PPCS HUMAN
Sentence: "Highly conserved ATP binding residues include Gly43, Ser61, Gly63, Gly66, Phe230, and Asn258." (PMID:12906824)
UniProtKB/FT: N/A
PAS: pred = include; arg1 = conserved ATP binding residues; arg2 = GLY43

RID+UID: ASN59 PPCS HUMAN
Sentence: "Highly conserved phosphopantothenate binding residues include Asn59, Ala179, Ala180, and Asp183 from one monomer and Arg55’ from the adjacent monomer." (PMID:12906824)
UniProtKB/FT: N/A
PAS: pred = include; arg1 = conserved phosphopantothenate binding residues; arg2 = ASN59

RID+UID: GLU50 SHD HUMAN, GLU51 SHD HUMAN
Sentence: "Rab3A binding-defective mutants of rabphilin (E50A) and Noc2 (E51A) were still localized in the distal portion of the neurites (where dense-core vesicles had accumulated) in nerve growth factor-differentiated PC12 cells, the same as the wild-type proteins, whereas Rab27A binding-defective mutants of rabphilin (E50A/I54A) and Noc2 (E51A/I55A) were present throughout the cytosol." (PMID:14722103)
UniProtKB/FT: N/A
PAS: pred = localized; arg1 = Rab3A binding-defective mutants; arg1-of = rabphilin ( GLU50 ALA ) and Noc2 ( GLU51 ALA ); arg2 = ; arg2-in = the distal portion; arg2-of = the neurites ( where dense-core vesicles

RID+UID: TRP124 DHMA MYCAV
Sentence: "Trp124 should be involved in substrate binding and product (halide) stabilization, while the second halide-stabilizing residue cannot be identified from a comparison of the DhmA sequence with the sequences of three dehalogenases with known tertiary structures." (PMID:12147465)
UniProtKB/FT: N/A
PAS: pred = involved; arg1 = TRP124; arg2 = ; arg2-in = substrate binding and product (halide) stabilization
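In Tables A.1 and B.1, residue mentions from the sentences (e.g. E92K, ser-52, Cys(48)) appear in the PAS in a normalised form (GLU92 LYS, SER52, CYS48). The following is a minimal sketch of such a normalisation, assuming only these three surface patterns. The function name normalise and both regular expressions are hypothetical and much simpler than a full entity recognition step; spelled-out forms such as "serine 15" are not covered.

```python
import re

# Standard one-letter to three-letter amino acid codes.
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}
THREE_LETTER = set(ONE_TO_THREE.values())

# Point mutation in one-letter notation, e.g. E92K.
_MUTATION = re.compile(
    r"\b([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])\b")
# Three-letter mention with optional hyphen or parentheses,
# e.g. Ser474, ser-52, Cys(48).
_RESIDUE = re.compile(r"\b([A-Za-z]{3})[-(]?(\d+)\)?")


def normalise(text: str) -> str:
    """Rewrite residue and point-mutation mentions in upper-case three-letter form."""
    def mutation(m):
        wild, pos, mutant = m.groups()
        return f"{ONE_TO_THREE[wild]}{pos} {ONE_TO_THREE[mutant]}"

    def residue(m):
        name = m.group(1).upper()
        # Leave the text untouched if the three letters are not an amino acid.
        return f"{name}{m.group(2)}" if name in THREE_LETTER else m.group(0)

    return _RESIDUE.sub(residue, _MUTATION.sub(mutation, text))


print(normalise("displayed by the E92K mutant"))
print(normalise("mutants lacking Cys(48) or Cys(152)"))
```

Mutations are rewritten before plain residue mentions so that the second pass only has to confirm the already normalised wild-type residue.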

Appendix C

Examples of extracted functional annotations for the protein p53

Table C.1: Examples of literature-mined annotations of protein residues in p53. The listed data are grouped by topic.

regulatory PTM

RID+UID: SER6 P53 HUMAN
PMID: 10930428
PAS: pred = creased; arg1 = a background; arg1-of = constitutive phosphorylation; arg1-at = SER6 that; arg2 = 10-fold; arg2-upon = upon exposure; arg2-to = either ionizing radiation or UV light
PAS: pred = exhibited; arg1 = Untreated A549 cells; arg2 = a background; arg2-of = constitutive phosphorylation; arg2-at = SER6 that
PAS: pred = is; arg1 = The relative phosphorylation; arg1-of = THR18; arg1-by = VRK2B; arg2 = similar; arg2-in = magnitude; arg2-to = that induced; arg2-by = taxol

RID+UID: THR18 P53 HUMAN
PMID: 12487430
PAS: pred = compared; arg1 = that phosphorylation; arg1-at = THR18 decreased binding; arg1-to = recombinant Mdm2 protein; arg2 = ; arg2-with = the unphosphorylated and the two other single phosphorylated analogues

RID+UID: SER46 P53 HUMAN
PMID: 11030628
PAS: pred = regulates; arg1 = and phosphorylation; arg1-of = SER46; arg2 = the transcriptional activation; arg2-of = this apoptosis-inducing gene

RID+UID: SER46 P53 HUMAN
PMID: 11875057
PAS: pred = hibited; arg1 = IR-induced phosphorylation; arg1-at = SER46; arg2 = ; arg2-by = wortmannin

RID+UID: SER15 P53 HUMAN
PMID: 14757188
PAS: pred = duce; arg1 = ; arg1-in = synergy; arg2 = ATM-mediated phosphorylation; arg2-of = the SER15 site; arg2-of = 

RID+UID: SER15 P53 HUMAN
PMID: 17292432
PAS: pred = suppressed; arg2 = both NaVO(3)-induced SER15 phosphorylation and accumulation; arg2-of = 

RID+UID: SER15 P53 HUMAN
PMID: 11850826
PAS: pred = observed; arg1 = Increased phosphorylation; arg1-of = SER15; arg2 = ; arg2-in = heat shocked GM638

RID+UID: THR55 P53 HUMAN
PMID: 10933801
PAS: pred = define; arg1 = These data; arg2 = THR55; arg2-as = a novel phosphorylation site and; arg2-for = the first time show threonine phosphorylation; arg2-of = human

RID+UID: THR55 P53 HUMAN
PMID: 15116093
PAS: pred = clarify; arg1 = This study; arg2 = the biological significance; arg2-of = doxorubicin-induced THR55 phosphorylation
PAS: pred = reduced; arg1 = phosphorylation; arg1-at = SER15; arg2 = and phosphorylation; arg2-at = SER392

RID+UID: SER315 P53 HUMAN
PMID: 9246643
PAS: pred = reversed; arg1 = but SER315; arg2 = the effect; arg2-of = phosphorylation; arg2-at = SER392

binding activity

RID+UID: PHE19 P53 HUMAN
PMID: 7926727
PAS: pred = are; arg1 = PHE19; arg2 = crucial; arg2-for = the interactions; arg2-between = 

RID+UID: SER20 P53 HUMAN
PMID: 11323395
PAS: pred = play; arg1 = ; arg1-of = SER20; arg2 = a key role; arg2-in = the dissociation; arg2-of = mdm2; arg2-in = response; arg2-to = Cr(VI)

RID+UID: CYS135 P53 HUMAN
PMID: 17914575
PAS: pred = generates; arg1 = that the amino acid change CYS135˜ARG; arg1-in = the human TP53; arg2 = the loss; arg2-of = TP53 DNA-binding activity

RID+UID: SER315 P53 HUMAN
PMID: 16784539
PAS: pred = dephosphorylates; arg1 = both; arg1-in = vitro and; arg1-in = vivo and; arg2 = the SER315 site; arg2-of = 

protein-protein-interaction

RID+UID: SER20 P53 HUMAN
PMID: 10432310
PAS: pred = containing; arg2 = phosphate; arg2-at = SER20 inhibited DO-1 binding

RID+UID: SER166 P53 HUMAN
PMID: 11960368
PAS: pred = mutated; arg1 = analysis; arg1-of = HDM2 proteins; arg2 = ; arg2-at = the consensus Akt recognition sites; arg2-at = SER166

RID+UID: ARG175 P53 HUMAN
PMID: 11172034
PAS: pred = abolish; arg1 = mutations ARG175˜HIS or ARG248˜TRP; arg2 = the association; arg2-of = 

RID+UID: SER315 P53 HUMAN
PMID: 7624134
PAS: pred = abolished; arg1 = ; arg1-to = alanine ( p53- SER315˜ALA ); arg2 = phosphorylation; arg2-by = cdk2 kinase

biological activity

RID+UID: SER315 P53 HUMAN
PMID: 7624134
PAS: pred = required; arg1 = SER315; arg1-of = wtp53; arg2 = ; arg2-for = transcriptional activity; arg2-in = vivo

RID+UID: CYS238 P53 HUMAN
PMID: 16818505
PAS: pred = retains; arg1 = ( CYS238˜TYR ) mutant; arg2 = functional wild-type

RID+UID: ARG175 P53 HUMAN
PMID: 16707427
PAS: pred = displayed; arg1 = the ARG175˜LEU mutant; arg2 = an attenuated tumor suppressor activity; arg2-in = the regulation; arg2-of = transcription

disease

RID+UID: ARG72 P53 HUMAN
PMID: 10616523
PAS: pred = suggests; arg1 = The acquisition; arg1-of = both mutations ( GLY245˜VAL and ARG72˜PRO ); arg1-in = the transformation; arg1-from = transient leukemia; arg1-to = overt acute megakaryoblastic leukemia; arg2 = a functional role; arg2-of = mutant

RID+UID: ARG72 P53 HUMAN
PMID: 18181044
PAS: pred = sociated; arg1 = the development; arg1-of = lung carcinoma and that ARG72˜PRO genotype; arg2 = ; arg2-with = a poorer prognosis; arg2-of = lung cancer

molecular stability

RID+UID: VAL138 P53 HUMAN
PMID: 7761089
PAS: pred = showed; arg1 = The human VAL138 mutant; arg2 = temperature-sensitive transformation; arg2-of = rat embryo fibroblasts ( REFs ); arg2-in = collaboration assay; arg2-with = activated

RID+UID: ARG249 P53 HUMAN
PMID: 15703170
PAS: pred = duce; arg1 = oncogenic mutations HIS168˜ARG and z:resi ty ARG249˜SER; arg2 = substantial structural perturbation; arg2-around = the mutation site; arg2-in = the L2 and L3 loops

Appendix D

Examples of extracted functional annotations for the protein Jak2

Table D.1: Examples of literature-mined annotations of protein residues in Jak2. The listed data are grouped by topic.

disease

PMID: 16896569
RID+UID: VAL617 JAK2 HUMAN
PAS: pred = improved; arg1 = The improved knowledge; arg1-of = the molecular basis; arg1-of = the disease because; arg1-of = the discovery; arg1-of = the VAL617˜PHE mutation; arg1-in = the JAK2 gene; arg2 = the molecular diagnosis and

PMID: 16503548
RID+UID: VAL617 JAK2 HUMAN
PAS: pred = is; arg1 = that the JAK2 VAL617˜PHE mutation; arg2 = rare; arg2-in = patients; arg2-with = idiopathic erythrocytosis

PMID: 16247455
RID+UID: VAL617 JAK2 HUMAN
PAS: pred = reported; arg1 = A missense somatic mutation; arg1-in = JAK2 gene ( JAK2 VAL617˜PHE ); arg2 = ; arg2-in = chronic myeloproliferative disorders

PMID: 18024388
RID+UID: VAL617 JAK2 HUMAN
PAS: pred = is; arg1 = The JAK2 VAL617˜PHE point mutation; arg2 = rare; arg2-in = hypereosinophilic syndrome and/or chronic eosinophilic leukemia

genetic

PMID: 15858187
RID+UID: VAL617 JAK2 HUMAN
PAS: pred = had; arg1 = All 51 patients; arg1-with = 9pLOH; arg2 = the VAL617˜PHE mutation
PAS: pred = is; arg1 = VAL617˜PHE; arg2 = a somatic mutation present; arg2-in = hematopoietic cells

molecular function

PMID: 15970705
RID+UID: VAL617 JAK2 HUMAN
PAS: pred = sociated; arg1 = JAK2 ( VAL617˜PHE ); arg2 = ; arg2-with = constitutive phosphorylation; arg2-of = JAK2 and its downstream effectors; arg2-as = 

PMID: 16239216
RID+UID: VAL617 JAK2 HUMAN
PAS: pred = duces; arg1 = that the homologous VAL617˜PHE mutation; arg2 = activation; arg2-of = JAK1 and Tyk2

PMID: 16384930
RID+UID: VAL617 JAK2 HUMAN
PAS: pred = link; arg1 = the presence; arg1-in = PV erythroblasts; arg1-of = proliferative and antiapoptotic signals that; arg2 = the JAK2 VAL617˜PHE mutation; arg2-with = the inhibition; arg2-of = death receptor signaling

PMID: 16442619
RID+UID: VAL617 JAK2 HUMAN
PAS: pred = does; arg1 = crease; arg1-of = expression and kinase activity; arg1-of = JAK2; arg1-in = CML cells; arg2 = result; arg2-from = the JAK2 VAL617˜PHE activation mutation and that transformation; arg2-into = to blast crisis

PMID: 16461300
RID+UID: VAL617 JAK2 HUMAN
PAS: pred = sociated; arg1 = the presence; arg1-of = the JAK2 VAL617˜PHE mutation; arg2 = ; arg2-with = higher platelet activation

PMID: 16904848
RID+UID: VAL617 JAK2 HUMAN
PAS: pred = transmit; arg1 = that JAK2 VAL617˜PHE; arg2 = signals; arg2-from = ligand-activated TpoR or EpoR

PMID: 15863514
RID+UID: VAL617 JAK2 HUMAN
PAS: pred = changes; arg2 = conserved VAL617˜PHE; arg2-in = the pseudokinase domain; arg2-of = JAK2 that

Appendix E

Examples of extracted functional annotations of the category binding event

Table E.1: Mined functional annotations of protein residues with information on binding events. The mined information corresponds to 17 protein residues listed in MSDsite. The extracted information can be used for functional annotation and validation of predicted binding sites in the database.

RID+UID: T199 CAH2 HUMAN
Sentence: "The three-dimensional structures of azide-bound and sulfate-bound T199V CAIIs were determined by x-ray crystallographic methods at 2.25 and 2.4 A, respectively (final crystallographic R factors are 0.173 and 0.174, respectively)." (PMID:8262987)
PAS: pred = determined; arg1 = The three-dimensional structures; arg1-of = [azide-bound and sulfate-bound THR199 VAL CAIIs]/BINDING; arg2 = ; arg2-by = x-prot:ray crystallographic methods; arg2-at = at 2.25 and 2.4 A ,respectively ( final crystallographic

RID+UID: R55 PPIA HUMAN
Sentence: "On the basis of the structure, it is proposed that Arg55 hydrogen-bonds to the nitrogen to deconjugate the resonance of the prolyl amide bond and thus facilitates the cis-trans rotation." (PMID:8652511)
PAS: pred = proposed; arg2 = [that ARG55 hydrogen-bonds]/BINDING; arg2-to = the nitrogen
PAS: pred = deconjugate; arg1 = [that ARG55 hydrogen-bonds]/BINDING; arg1-to = the nitrogen; arg2 = the resonance; arg2-of = the prolyl amide bond and

RID+UID: L255 PH4H HUMAN
Sentence: "Only for the R252Q and L255V mutants were catalytically active tetramer and dimer recovered and for R252G some dimer, i.e. 20% (R252Q, tetramer), 44% (L255V, tetramer) and 4.4% (R252G, dimer) of the activity for the respective wild-type (wt) forms." (PMID:9799096)
PAS: pred = recovered; arg1 = active tetramer and dimer; arg2 = and; arg2-for = [ARG252 GLY some dimer]/BINDING

RID+UID: Y156 HGXR TRIFO
Sentence: "But the forces involved in recognizing the exocyclic C2-substituents of the purine ring, which involve the Tyr156 hydroxyl, Ile157 backbone carbonyl, and Asp163 side-chain carboxyl, may be weakened by the shifted conformation of the peptide backbone resulted from loss of the Glu11-Arg155 salt bridge." (PMID:9843428)
PAS: pred = resulted; arg1 = ; arg1-by = the shifted conformation; arg1-of = the peptide backbone; arg2 = ; arg2-from = loss; arg2-of = [the GLU11 ARG155 salt bridge]/BINDING

RID+UID: K79 HGXR TOXGO
Sentence: "The Leu78-Lys79 peptide bond in the active site adopts the cis configuration, which it must to bind PRPP or pyrophosphate." (PMID:10545171)
PAS: pred = adopts; arg1 = [The LEU78 LYS79 peptide bond]/BINDING; arg1-in = the active site; arg2 = the

RID+UID: G57 FLAV CLOBE
Sentence: "In the Clostridium beijerinckii flavodoxin, the reduction of the flavin mononucleotide (FMN) cofactor is accompanied by a local conformation change in which the Gly57-Asp58 peptide bond 'flips' from primarily the unusual cis O-down conformation in the oxidized state to the trans O-up conformation such that a new hydrogen bond can be formed between the carbonyl group of Gly57 and the proton on N(5) of the neutral FMN semiquinone radical [Ludwig, M. L., Pattridge, K. A., Metzger, A. L., Dixon, M. M., Eren, M., Feng, Y., and Swenson, R. P. (1997) Biochemistry 36, 1259-1280]." (PMID:10353827)
PAS: pred = accompanied; arg1 = ) cofactor; arg2 = ; arg2-by = a local conformation change; arg2-in = [which the GLY57 ASP58 peptide bond]/BINDING

RID+UID: D160 APX STRGR; M161 APX STRGR; G201 APX STRGR; R202 APX STRGR; F219 APX STRGR
Sentence: "These studies allowed the tracing of the previously disordered region of the enzyme (Glu196-Arg202) and the identification of some of the functional groups of the enzyme that are involved in enzyme-substrate interactions (Asp160, Met161, Gly201, Arg202 and Phe219)." (PMID:10771423)
PAS: pred = involved; arg1 = disordered region; arg1-of = the enzyme ( GLU196 ARG202 ) and the identification; arg1-of = some; arg1-of = the functional groups; arg1-of = the enzyme that; arg2 = ; arg2-in = [enzyme-substrate interactions ( ASP160, MET161, GLY201, ARG202, PHE219)]/BINDING

RID+UID: I209 FIXL RHIME
Sentence: "Interaction between the iron-bound O(2) and Ile209 was also observed in the resonance Raman spectra of RmFixLH as evidenced by the fact that the Fe-O(2) and Fe-CN stretching frequencies were shifted from 575 to 570 cm(-1) (Fe-O(2)), and 504 to 499 cm(-1), respectively, as the result of the replacement of Ile209 with an Ala residue." (PMID:10926518)
PAS: pred = observed; arg1 = Interaction; arg1-between = [the iron-bound O(2) and ILE209]/BINDING; arg2 = ; arg2-in = the resonance Raman spectra; arg2-of = RmFixLH as

Appendix F

Examples of extracted functional annotations of active site residues

Table F.1: Catalytic triad residues identified by MEDLINE extraction. The listed sentences describe the mentioned protein residues as catalytic (co-mention with the term "catalytic triad"); however, none of them is recorded in CSA, so the identified information is novel.

RID+UID: D44 TPP2 HUMAN, H264 TPP2 HUMAN, S449 EPHA3 HUMAN
Sentence: "The amino acids forming the putative catalytic triad (Asp-44, His-264, Ser-449) as well as the conserved Asn-362, potentially stabilizing the transition state, were replaced by alanine and the mutated cDNAs were transfected into human embryonic kidney (HEK) 293 cells." (PMID:12445476)
PAS: pred = forming; arg1 = The amino acids; arg2 = [the putative catalytic triad ( ASP44, HIS264, SER449)]/ENZ ACT

RID+UID: C25 CYSP1 CARCN, H159 CYSP1 CARCN, D175 CYSP1 CARCN
Sentence: "The seven cysteine residues are aligned with those of papain and the catalytic triad (Cys25, His159, Asn175) of all cysteine peptidases of the papain family is conserved." (PMID:10355634)
PAS: pred = aligned; arg1 = The seven CYS+; arg2 = ; arg2-with = with those; arg2-of = of papain and the catalytic triad ( CYS25

RID+UID: C176 NADE MYCTU, E52 NADE MYCTU, K121 NADE MYCTU
Sentence: "The residues forming the putative catalytic triad (Cys176, Glu52 and Lys121) were replaced by alanine; the mutated enzymes were expressed in the Escherichia coli Origami (DE3) strain and purified." (PMID:15748981)
PAS: pred = forming; arg1 = The residues; arg2 = [the putative catalytic triad ( CYS176, GLU52, and LYS121)]/ENZ ACT

RID+UID: S1752 POLG BVDVS
Sentence: "Our study provides experimental evidence that histidine at position 1658 and aspartic acid at position 1686 constitute together with the previously identified serine at position 1752 (S1752) the catalytic triad of the pestiviral NS3 serine protease." (PMID:10915606)
PAS: pred = identified; arg1 = ; arg1-with = the; arg2 = [SER1752 ( S1752 ) the catalytic triad]/ENZ ACT; arg2-of = the pestiviral NS3 serine protease.

RID+UID: D167 POLS SFV, H145 POLS SFV, S219 POLS SFV
Sentence: "After this autoproteolytic cleavage, the free carboxylic group of Trp267 interacts with the catalytic triad (His145, Asp167 and Ser219) and inactivates the enzyme." (PMID:18177892)
PAS: pred = interacts; arg1 = the free carboxylic group; arg1-of = TRP267; arg2 = ; arg2-with = [the catalytic triad ( HIS145, ASP167, and SER219)]/ENZ ACT

RID+UID: D122 ARY2 RAT
Sentence: "Substitution of the catalytic triad Asp-122 with either alanine or asparagine resulted in the complete loss of protein structural integrity and catalytic activity." (PMID:15209520)
PAS: pred = resulted; arg1 = Substitution; arg1-of = the catalytic triad ASP122; arg1-with = either alanine or asparagine; arg2 = ; arg2-in = the complete loss; arg2-of = [protein structural integrity and catalytic activity]/ENZ ACT

RID+UID: D156 LYPA1 HUMAN
Sentence: "To investigate whether this bridging function occurs in vivo, two transgenic mouse lines were established expressing a muscle creatine kinase promoter-driven human LPL (hLPL) minigene mutated in the catalytic triad (Asp156 to Asn)." (PMID:9811888)
PAS: pred = mutated; arg1 = ( hLPL ) minigene; arg2 = ; arg2-in = [the catalytic triad (ASP156 ASN)]/ENZ ACT

Appendix G

Glossary

3D pattern – a recurrent residue triplet configuration (with k=2 or k=3 interaction of residues) within a dataset of protein structures.

arg – the argument of a PAS

BIND – the set of binding-related functional annotations of extracted protein residues, i.e. annotations are labelled as BINDING.

BINDING – a category in MAN, describing binding events of a protein residue.

CSA – a database of manually curated active sites with structure templates derived from PDB.

Contextual feature .

EC – Enzyme classification identifier.

ER – entity recognition.

ENZ – the set of enzyme-related functional annotations of extracted protein residues, i.e. annotations are labelled as ENZ ACT.

ENZ ACT – a category in MAN, describing enzyme-related information.

FA – a functional annotation; or the set of extracted protein residues with functional annotations.

FEAT – a categorisation scheme based on UniProtKB.

FN – a false negative.

FP – a false positive.

FT – a record in a UniProtKB data file with functional annotation.

Functional annotation – Information on biological function assigned to a protein residue.

GC – a manually annotated test set with abstract texts drawn from a random selection of UniProtKB citations.

GO – Gene Ontology.

MAN – a categorisation scheme based on manual analysis of MEDLINE.

MEDLINE – a database of citations and abstract texts from biomedical publications.

NP – a noun phrase is defined as a nominal sequence.

OLDFIELD – a non-redundant structure dataset of protein domains selected from PDB by sequence alignments.

OPR – a semantic relation between a residue, its source protein, and hosting organism; or the set of mined protein residues.

PAS – a predicate-argument structure: a data structure to accommodate the semantic relation between a predicate and its arguments.

PDBID – PDB identifier.

PDB – the primary database of protein structure with spatial coordinates.

PMID – a PubMed identifier.

POS – a class of words, e.g. noun, verb, adjective, used for linguistic analysis.

PP – a prepositional phrase is defined as preposition + noun phrase.

pred – the predicate of a PAS.

Protein residue – a residue with known association to its source protein within a hosting organism (OPR).

RE – Relation extraction.

RID – a residue identifier: residue name + position of the residue in the protein sequence.

SCOP40 – a non-redundant protein structure dataset derived from SCOP.

SCOP – a derived protein structure database with manual classification of proteins based on structure similarities.

SITE – a record in the PDB data file denoting residues of a functional site.

Structure pattern – cf. 3D pattern.

TN – a true negative.

TP – a true positive.

TID – a Taxonomy identifier based on the NCBI Taxonomy guideline.

UID – a Protein identifier based on the UniProtKB guideline.

UniProtKB – a protein sequence database with manual annotations on protein residues.

VG – a verb group is a sequence of verbs, auxiliaries, or verb modifiers.

VP – a verb phrase, consisting of a verb group + noun phrase.

XC – a cross-validation corpus based on references from UniProtKB.

chainID – a protein chain identifier in a PDB entry.

k=2, k=3 – a residue triplet configuration with two-way or three-way interaction.

resName – a residue name.

resSeq – a protein residue sequence identifier from a PDB entry.

seqIndex – a protein residue sequence identifier from a UniProtKB entry.
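The glossary terms OPR, RID, UID, TID and PAS can be read as a small data model. The following is a minimal sketch of that model, assuming the definitions given above. The class and field names (ProteinResidue, res_name, position, mentions) are illustrative choices, not identifiers used by the software developed for this thesis.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProteinResidue:
    """OPR: a residue linked to its source protein and hosting organism."""
    res_name: str   # three-letter residue name, e.g. "SER"
    position: int   # position of the residue in the protein sequence
    uid: str        # UniProtKB protein identifier
    tid: int        # NCBI Taxonomy identifier

    @property
    def rid(self) -> str:
        """RID: residue name followed by sequence position, e.g. 'SER15'."""
        return f"{self.res_name}{self.position}"


@dataclass
class PAS:
    """Predicate-argument structure: one predicate and its ordered arguments."""
    pred: str
    args: list[tuple[str, str]] = field(default_factory=list)

    def mentions(self, residue: ProteinResidue) -> bool:
        """True if any argument contains the RID as a whole token."""
        pattern = re.compile(rf"\b{re.escape(residue.rid)}\b")
        return any(pattern.search(value) for _, value in self.args)


# Human p53 serine 15; 9606 is the NCBI Taxonomy identifier of Homo sapiens.
ser15 = ProteinResidue("SER", 15, "P53_HUMAN", 9606)
pas = PAS("contributes", [("arg1-on", "SER15"),
                          ("arg2-to", "protein stabilization")])
print(ser15.rid, pas.mentions(ser15))
```

Matching on whole tokens keeps SER15 from being found inside a longer identifier such as SER152.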
