<<

IN SYSTEMS PHYSIOLOGY

Interpretation of the consequences of mutations in kinases:

combined use of and

Jose M.G. Izarzugaza, Martin Krallinger and

Journal Name: Frontiers in Physiology

ISSN: 1664-042X

Article type: Review Article

Received on: 23 May 2012

Accepted on: 23 Jul 2012

Provisional PDF published on: 23 Jul 2012

Frontiers website link: www.frontiersin.org

Citation: Izarzugaza JM, Krallinger M and Valencia A(2012) Interpretation of the consequences of mutations in protein kinases: combined use of bioinformatics and text mining. Front. Physio. 3:323. doi:10.3389/fphys.2012.00323

Article URL: http://www.frontiersin.org/Journal/Abstract.aspx?s=1092& name=systems%20physiology&ART_DOI=10.3389/fphys.2012.00323

(If clicking on the link doesn't work, try copying and pasting it into your browser.)

Copyright statement: © 2012 Izarzugaza, Krallinger and Valencia. This is an open‐access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.

This Provisional PDF corresponds to the article as it appeared upon acceptance, after rigorous peer-review. Fully formatted PDF and full text (HTML) versions will be made available soon.

Interpretation of the consequences of mutations in protein kinases: combined use of bioinformatics and text mining

Jose MG Izarzugaza, Martin Krallinger and Alfonso Valencia

Abstract Protein kinases play a crucial role in a plethora of significant physiological functions and a number of mutations in this superfamily have been reported in the literature to disrupt and/or function.

Computational and experimental research aims to discover the mechanistic connection between mu- tations in protein kinases and disease with the final aim of predicting the consequences of mutations on protein function and the subsequent phenotypic alterations.

In this article, we will review the possibilities and limitations of current computational methods for the prediction of the pathogenicity of mutations in the protein kinase superfamily. In particu- lar we will focus on the problem of benchmarking the predictions with independent gold standard datasets. We will propose a pipeline for the curation of mutations automatically extracted from the literature. Since many of these mutations are not included in the databases that are commonly used to train the computational methods to predict the pathogenicity of protein kinase mutations we propose them to build a valuable gold standard dataset in the benchmarking of a number of these predictors.

Finally, we will discuss how text mining approaches constitute a powerful tool for the interpreta- tion of the consequences of mutations in the context of disease genome analysis with particular focus on .

1 The Human Kinome

Protein kinases are a family of that catalyze the transfer of a phosphate from ATP to a ser- ine, threonine or tyrosine hydroxyl group in the target protein. Phosphorylation often implies activation or inhibition, alteration of interaction surfaces, and conformational changes, among the most common consequences. It is due to the importance of the processes regulated, that protein kinases gen- erally do not act alone but rather, they form part of a finely tuned signaling cascade that is strictly controlled spatiotemporally. Therefore, protein kinases are metaphorically referred to as the metabolic switches of the cell.

Protein kinases are one of the most ubiquitous families of signaling molecules in the human cell. The total number of genes encoding kinases has been a matter of discussion in the last decade and, for instance, in 1998 Wang estimated between 1000 and 2000 different human kinase genes [1]. With the completion of the human genome, the current estimate is that 518 genes protein kinases, corresponding to more than 2% of the total number of genes in the human genome [2].

All members of the superfamily share a characteristic domain - the protein kinase domain - that confers them the ability to phosphorylate other . Empirical studies suggest that the residues conforming the ATP binding site tend to be conserved and that phosphotransfer is carried out by a shared set of amino acids [3–7].

In spite of these similarities, experiments in yeast models [8, 9] suggest that although protein kinases individually present a remarkable substrate specificity, the superfamily as a whole is very promiscuous, phosphorylating a wide range of protein substrates. This observation, may be attributed to the different domain architectures present in the protein kinase superfamily. In addition to the aforementioned protein kinase domain committed to the general function of phosphorylation, a number of modular domains are combined to, for example, confer substrate specificity, to tightly control the activity of the enzyme or anchor the kinase to the membrane [10].

These differences in terms of functionality and domain architecture can be used to classify members of the protein kinase superfamily into different categories. Indeed, there are several different classifica- tions of kinases from the main model organisms: yeast [11], worm [12], fruit fly [8] and mouse [13]. The reference classification in humans is KinBase [2,14], which has also been incorporated into UniProt [15], albeit with minor modifications.

Mutations in the protein kinase superfamily

Due to their important regulatory function, a number of mutations in protein kinases have been associated with different human diseases [16], including cancer. For example, Greenman and coworkers [17] carried out the first large scale study of the variation of 518 human kinases in 210 samples of cancer tissues and cell-lines. Moreover, other high-throughput studies [18, 19] also yielding interesting information about the role that variation of human protein kinases plays in cancer. For a detailed review, refer to Baudot et al. [20].

The results from these high-throughput resequencing projects is often available through research publi- cations. However, in order to make the information more easily accessible, several efforts are devoted to compile, store, annotate and characterize mutations, including mutations in the protein kinase superfam- ily. Some examples are Uniprot [21], COSMIC [22], SAAPdb [23], MoKCa [24] and KinMutBase [25]. To- gether they constitute a powerful resource to understand disease association and the functional/structural properties of the mutations that affect human protein kinases.

2 Unfortunately, database curators are not able to store and annotate the vast amount of information provided by large-scale variation studies at the same pace it is generated. Mainly, because the process generally involves the manual inspection and curation of specific variation studies, which requires con- siderable resources. As a consequence, although growing in number, the mutations totally characterized and well-understood only represent a small fraction of all the human variome.

Methods to predict pathogenic mutations

In the section Mutations in the protein kinase superfamily we mentioned that high-throughput rese- quencing screenings represent a powerful set of techniques to discover large numbers of mutations. Of these, only a small fraction are causally implicated in disease onset and therefore, separating the wheat from the chaff is still a major challenge [20]. For a small subset of the new mutations discovered, exper- imental information is available regarding the relationship between the mutation and disease, and fora smaller number of cases the underlying biochemical mechanism is known. Little information is available for the remaining mutations. The requirement of a lot of investment, both in terms of time and money, means that it is not feasible to experimentally test the association of all these mutations to disease, and to characterize their functional effects. Nevertheless, this problem is very amenable to in silico predictors.

Cline and Karchin [26] wisely summarized the two different approaches as follows: “A bench biologist interested in whether a mutation of interest impacts the transcription of a gene might perform site- directed mutagenesis on genomic DNA, transfect mutated DNA into cell culture and use readouts of the gene’s transcriptional activity to measure changes with respect to wild type. In contrast, a bioinformatics approach typically involves computational analysis of the DNA sequence surrounding the mutation, pos- sibly supplemented with information from published bench experiments.”

This is just one example of the very different methods available to predict in silico the probability of a newly discovered mutation being implicated in disease. Different approaches have been developed in the last decade and several detailed reviews on this subject have been published [20, 26, 27].

These methodologies can be classified according to their underlying principles: Some methods make use of several features to identify relevant positions in a given protein, and hence, rules are derived to predict the pathogenicity of mutations. Another group of implementations assumes that evolutionarily conserved protein residues are important for protein structure, folding and function, whereby mutations in these residues are considered deleterious [28]. Variations on this principle lead to methods that predict deleterious mutations by assessing the changes in evolutionarily conserved PFAM motifs [29]. Further- more, a group of methodologies use protein structures to characterize substitutions that significantly destabilize the folded state. A growing number of systems integrate prior knowledge in the form of both sequence-based and structure-based features from a set of mutations (for which their characterization as pathogenic or neutral exists) to train an automatic system. Once trained, the system can infer the pathogenicity of new mutations automatically. Different machine-learning methods can be implemented depending on their individual needs. Among them, probably the most popular ones are: rule-based systems [30–32], decision trees [33], random forests [34,35], neural networks [36,37], Bayesian methods [38] and SVMs [35, 39–42]. In addition, some meta approaches have been implemented re- cently [43], for instance, Condel [44] integrates five of the most widely employed computational tools for sorting missense single nucleotide variations.

Methods also differ in the of the protein properties used to determine the pathogenicity of new mutations. Some of the predictors require sequence-oriented features that are easily applicable to any polymorphism. Recurrent examples of this category are: type, sequence conservation, do-

3 main type, functional annotations, post-translational modifications and so on. A second set of predictors calculate features that require a protein structure. Common examples to illustrate these are: secondary structure, solvent accessibility, flexibility, etc. The major drawback of these methodologies is that al- though they may increase the accuracy, the need for either an experimentally solved or a precisely modelled protein structure implies a loss of coverage. The number of features and their combinations is infinite. Moreover, features can also either be general or apply only to a defined subset of proteins, as is the membership to a kinase group [40, 45]

4 Method Description Further information Uniprot [46] General information about proteins, http://www.uniprot.org/ including human protein kinases PDB [47] Catalogue of protein structures, pro- http://www.rcsb.org/ tein kinases widely represented PDBsum [48] Annotation on protein structures http://www.ebi.ac.uk/pdbsum KinBase [2, 14] Hierarchical classification of protein http://kinase.com/kinbase/ kinases SwissVar [49] Detailed information about muta- http://swissvar.expasy.org/ tions present in Uniprot COSMIC [22] Catalogue of somatic mutations in http://www.sanger.ac.uk/perl/ cancer genetics/CGP/cosmic Ensembl [50] Infrastructure for the integrated an- http://www.ensembl.org notation on chordate and selected eukaryotic genomes dbSNP [51] Annotated catalogue of SNPs http://www.ncbi.nlm.nih.gov/ projects/SNP HapMap [52] Catalogue of common genetic vari- www.hapmap.org ants in the human genome 1000Genomes [53] Deep catalogue of human variations http://www.1000genomes.org/ derived from the next-generation se- quencing of 1000 people TCGA [54] The Cancer Genome Atlas is a col- http://cancergenome.nih.gov/ lection of genetic variations found in 20 different ICGC [55] The International Cancer Genome http://www.icgc.org Consortium project aims to a comprehensive description of ge- nomic, transcriptomic and epige- nomic changes in 50 tumor types and sub-types OMIM [56] Catalogue of Mendelian mutations http://www.ncbi.nlm.nih.gov/ known to cause disease omim SAAPdb [23] Calculation of the structural conse- http://www.bioinf.org.uk/ quences of mutations saap/db/ SNPeffect 2.0 [57] A database mapping molecular phe- http://snpeffect.switchlab. notypic effects of human non- org synonymous coding SNPs ModBase [58] Structural models of mutant pro- http://salilab.org/modbase teins TopoSNP [59] topoSNP: a topographic database http://gila.bioengr.uic.edu/ of non-synonymous single nucleotide snp/toposnp/ polymorphisms with and without known disease association MoKCa [24] Annotated catalogue of cancer- http://strubiol.icr.ac.uk/ associated mutations in protein ki- extra/mokca/ nases KinMutBase [25] Registry of disease-causing muta- http://bioinf.uta.fi/ tions in protein kinase domains KinMutBase

Table 1: Summary of resources providing information about kinases and mutations.

5 Method Main features Further information SIFT [28] Threshold-based, conservation http://sift.jcvi.org PMUT [60] Neural Network, sequence- and structure-based features http://mmb.pcb.ub.es/PMut SNPs3D [61] Support Vector Machine, structure-based features http://www.snps3d.org PANTHER [62] Threshold-based, conservation (PSEC) http://www.pantherdb.org/tools/csnpScoreForm.jsp Pfam LogRE [29] Threshold-based, probability of a PFAM domain to be pathogenic using a log-odds ratio LS-SNP [27] Support Vector Machine, sequence- and structure-based http://ls-snp.icm.jhu.edu/ls-snp-pdb features CanPredict [63] Combines SIFT, Pfam LogRE and Gene Ontology terms in http://research-public.gene.com/Research/ a single prediction genentech/canpredict SNAP [37] Neural Network, sequence- and structure-based features http://cubic.bioc.columbia.edu/services/snap Torkamani [40] Support Vector Machine, sequence- and structure-based features, kinase-specific MutaGeneSys [64] Whole-genome marker correlation dataset to identify asso- http://www.cs.columbia.edu/~jds1/MutaGeneSys ciation to causal SNPs in OMIM

6 stSNP [65] Intregates non-synonymous SNPs from dbSNP, structural http://ilyinlab.org/StSNP models from Modeller and KEGG pathways. Comparative native/mutant analysis. F-SNP [43] Metaserver, combines PolyPhen, SNPeffect2.0, SNPs3D, http://compbio.cs.queensu.ca/F-SNP LS-SNP. SNP&GO [39] Support Vector Machine, several sequence-derived features http://snps-and-go.biocomp.unibo.it/snps-and-go/ and information from Gene Ontology terms PolyPhen-2 [38] Bayesian classifier, sequence- and structure-based features http://genetics.bwh.harvard.edu/pph2 MuD [35] Random forest, sequence- and structure-based features http://mud.tau.ac.il CHASM [66] Random forest, sequence-based features http://wiki.chasmsoftware.org/index.php Mutation Assessor [30] Threshold-based, differential evolutionary conservation in http://mutationassessor.org subfamilies Condel [44] Metaserver, combines the output of other predictors http://bg.upf.edu/condel/ wKinMut [45, 67] Framework for the analysis of kinase mutations. Integrates http://wkinmut.bioinfo.cnio.es annotations, predictions and information from the litera- ture

Table 2: Summary of methods to predict the pathogenicity of mutations. Benchmarking prediction methods

In the previous section we discussed the differences between the various methods, both in terms of im- plementation and prediction features. Equally important are the differences found in the composition of the datasets used to train the methods. This is particularly relevant in the case of machine-learning ap- proaches. Machine learning approaches are developed in two independent consecutive steps: During the initial development phase, the developers aim to optimize the combination of features, internal parame- ters and prediction algorithms to obtain a trained classifier. In a later phase, blind tests are conducted to evaluate the performance simulating a more realistic scenario. Consequently, three separate datasets are needed: (i) a training dataset to allow the classifier to learn, (ii) a validation dataset to optimize the selection of parameters and (iii) an evaluation dataset to conduct blind tests to assess the expected performance of the classifier.

Consequently, the datasets used highly influence the overall performance of the prediction and, if not pondered cautiously might become a source of evaluation errors. Probably, the most common of them being overtraining as a result from the evaluation of the methodologies with mutations that have also been considered in the training dataset. In other words, if a predictor were evaluated using a test set whose correct answers the method had previously been provided with, this may yield unfair over-estimation of the prediction capability. An extension of this problem, especially if the features considered predict at the protein level, is that mutations occurring in the same protein or closely related homologs should not span two different datasets.

The selection of a benchmark dataset that is fair and does not lead to artifacts is not a trivial task [68] and clean datasets that were not used in the development of any of the methods are required. Following a similar approach to those in the detection of bio-entities from the literature (BioCreative), protein structure (CASP) and protein interaction prediction (CAPRI), a successful recent example is CAGI (http://genomeinterpretation.org). In summary, CAGI is intended to assess a battery of computa- tional methods for predicting the phenotypic impacts of genome variation. Participants are provided a number of different sets of genetic variants and are expected to make predictions of resulting, molecular, cellular or organismal phenotype. These predictions are later on evaluated by independent assessors against experimental characterizations.

Although CAGI constitutes an undoubtedly powerful tool to provide insights on the performance of state-of-the-art methodologies, the major drawback is that provided datasets are gathered from very specialized projects, and consequently are seldom universally applicably to all methodologies, which con- sequently, limits the benchmark. An example of the previous would be the intrinsic limitation to predict mutations outside the protein kinase superfamily for kinase-specific methodologies.

Complementary to the CAGI experiment, current text mining methodologies enable the generation of clean sets of experimentally validated mutation mentions from the literature. Those mutations that were not recorded in the databases used to provide the training and evaluation datasets are of special interest. Here we propose a pipeline for the curation of mutations automatically extracted from the literature and their use as a gold standard in the benchmarking of pathogenicity predictors. We will describe this approach thoroughly in the following sections.

Mining kinase mutations from the literature

Previously, we discussed how the efforts of database curators to store and annotate mutations can hardly keep the pace of the vast amount of information generated by current large-scale variation studies. To bridge this growing gap, automatic extraction of entities and their relationships from the existing litera- ture can be applied. This includes text mining techniques such as regular expressions, pattern recognition

7 and natural language processing, among others. Indeed, these approaches have been successfully applied to other fields of research, for instance for the automatic extraction of protein-protein interactions [69,70] and in the annotation of genes and proteins [71, 72]. Despite the success of these methods, it must be born in mind that this technology does not aim to replace manual curation and validation. Rather, text mining approaches are better understood as systematic tools to assist the efforts of human curators by helping them to find information, prioritize documents and highlight potentially relevant items [71,73,74].

Here we will use our recently published pipeline for extracting mutation mentions in protein kinases from the literature, SNP2L [75], as an example of a typical text mining workflow. The pipeline (Figure 1) integrates article retrieval, detection of mutations and proteins in the corresponding article, correct mutation-protein association and, finally, validation of the results. To the best of our knowledge there is currently no pipeline similar to the one presented here. Two main aspects make our pipeline unique. First, our system is specifically designed to extract mutations occurring in the protein kinase superfamily. Second, we perform an additional filtering step to ensure the quality of the extracted mutations as we will disclose in the following sections.

Article selection (triage): constructing a text mining corpus Following a common approach in text mining, we tested SNP2L with two different datasets: One consti- tuted by the whole collection of Pubmed abstracts and the other by a collection of either manually or automatically selected full-text articles. In order to construct the corpus, full-text articles were automat- ically downloaded using an in-house retrieval system [71] prioritized under three different criteria: 1. Relevance of the abstract: information contained in the corresponding abstracts such as the mention of mutations, mention of human kinases and a combination of keywords (including ‘human kinase mutation’). 2. A priori relevance of the full text articles: extracting all references in Pubmed for human kinases contained in multiple databases (e.g., SwissProt, MINT and IntAct). 3. Relevance of the journal: based on analyzing a fraction of mutation-mentioning abstracts of each journal and prioritizing a set of journals (and thus their articles) to retrieve their full-text arti- cles. This set consisted of the following journals: American Journal of Human Genetics, European Journal of Human Genetics, Human Genetics, Human Mutation and Human Molecular Genetics. Before proceeding to the next step, all articles should be split in sentences using a sentence boundary detection system [71].

Entity recognition: Mutations and protein kinases The consistent nomenclature used to describe mutations in the literature makes these entities especially amenable to this type of approach and accordingly, a growing number of such methods have been de- scribed in the literature over the years. A summary of several of these literature-mining tools to extract information on mutations is presented in Table 3. In the example discussed here, we used Mutation- Finder [83] for the initial extraction of single amino acid substitutions. MutationFinder constitutes a valuable tool to detect the mention of mutations in a given set of manuscripts and it relies on language expressions used to describe mutation events. MutationFinder is very competitive for recall and preci- sion when compared to other strategies [49], and it has been evaluated using a manually-generated gold standard collection of abstracts.

After recognizing all the mutations mentioned in the text, we attempted to identify all human protein

8 Figure 1: SNP2L Pipeline as an example of a typical automatic method to extract mutation mentions from the literature. The pipeline integrates article retrieval, detection of mutations and proteins in the corresponding article, correct mutation-protein association and, finally, validation of the results. kinases co-mentioned with them in the same document. Existing systems that try to link mentions of genes and proteins to database identifiers generally rely on approaches that compare the names appearing in the text to gene names or aliases contained in database records. The actual task of determining the exact database record for a gene/protein mention is commonly referred to as gene mention grounding or normalization, and has been evaluated in the second BioCreative community challenge, illustrating that dictionary look-up approaches can obtain competitive results for this purpose [84]. Following this line, we constructed a lexicon specifically for human protein kinases, derived from gene and protein symbols, names and aliases contained in the UniProt database (see figure 1, Get names, symbols and aliases). Because this gene/protein lexicon did not capture all representative typographical variants of a given name, we used a rule-based approach and heuristics for generating typographical variants for the kinase lexicon entries. With this respect, the alternative use of hyphens, capitalization (upper-case and capitalized names) and different word order variants were captured. The gene/protein lexicon was filtered to eliminate highly ambiguous names through comparison with a stop word list and by, after an initial look-up step, checking manually potential outlier names that show a very high mention frequency.

9 Method Main Features MEMA [76] Regular expressions, gene and protein mentions, co-mention prox- imity, OMIM validation. MuteXt [77] Regular expressions, GPCR and NR mentions detection, co- mention proximity, sequence check. Yip [49] Regular expressions, protein mentions detection, SwissProt vali- dation, sequence check. MutationGraB [78] Regular expressions, protein mentions detection, graph shorted distance, sequence check. MutationMiner [79] Regular expressions, protein mentions detection, sentence co- mention. MuGeX [80] Regular expressions, protein mentions, protein and DNA mutation disambiguation. VTag [81] Machine learning detection of acquired sequence variation men- tions detection (mutations, translocations and deletions). OSIRIS [82] Detection of human gene variations corresponding to SNPs. MutationFinder [83] Regular expressions and patterns, protein mutations mentions de- tection, complex language expressions.

Table 3: Summary of text mining implementations for mutation extraction.

The extended and pruned human kinase lexicon was then used for the detection of corresponding mentions in our document collections containing mutation mentions. As a given name can correspond to different records (ambiguity), both at the level of human genes as well as in case of genes from different species sharing the same name, we calculated for each article, two different scores reflecting a) the contextual similarity of the article to the reference (UniProt) protein record and b) the overall association of the article to human species terms from the total set of tagged species terms. A conceivable alternative would be to simply apply very strict protein-organism co-mention criteria based on relative textual distances, which is rather problematic in case of human proteins were often the organism source is not explicitly stated.

Mutation-sequence linking The next step is to link mutation mentions with their corresponding human kinases. This step would be trivial if a single protein was mentioned per article, however, for most of the articles this is not the case and more than one protein is mentioned per article. A reasonable solution would be to check the existence of the amino acid at the specified position for each mutation mention-protein combination. In addition to this basic sequence look-up validation method additional mutation mapping strategies could be implemented. They should consider errors resulting from the wrong detection of the directionality of the extracted mutation mention (using the wild type as mutant residue and viceversa) and inconsistencies and alternative sequence counting between the article and the kinase sequence. For example:

- Sliding window algorithms that look for relative positions of mutations (pattern) rather than exact position co-occurrences. With this approach, mutation mentions would be scanned looking for positions relative to the starting one attending to the distance between all the mutations in the same abstract. The strength of this approach is that it is able to deal with alternative sequence coordinates. There are many examples in the literature: Mutations F175P, R178L and Y530L in the proto-oncogene tyrosine- protein kinase Src, are mentioned in the considered article (PMID 2108315) as F172P, R175L and Y527F respectively. Since the probability of finding simple patterns by chance can be high in some trivial cases, it is reasonable to consider only those cases where a minimum number of mutated positions (3 in our example) could be detected.

10 - Bidirectional mutation to sequence position mapping. Either the wild type or the mutant residue of an extracted mutation mention might be accepted in the corresponding sequence position. - Pro-peptides and mature protein mutation mapping. In order to allow alternative residue counting due to the presence of a signal peptide, a displacement equal to the length of the corresponding signal peptide might be allowed. - Methionine cleavage: The mutation mapping might be carried out taking into consideration the possi- bility of neglecting the N-terminal methionine.

Using the literature to generate a benchmark dataset The main focus of this article has been the construction of a gold standard dataset to benchmark pre- diction methods. Following this thread of reasoning, mutations already present in common databases are discarded, while new ones form the benchmark dataset. This procedure will ensure a dataset that enables fair comparison and is less prone to overestimation of the classifiers’ performance as we discussed previously in the Benchmarking prediction methods section.

In spite of constituting a powerful tool for the extraction of knowledge from the literature, text min- ing approaches to recover kinase mutations still have some limitations in terms of recall and a number mutations escape detection by even the most accurate state-of-the-art algorithms. Among the challeng- ing aspects in this respect are the detection of mutations that are described in additional materials or contained in tables and figures. This is because they can not easily be converted efficiently to plain text. Another key issue is the appropriate detection of the kinase mentions, which can be referred to through a range of different typographical variations and aliases, of which text mining approaches can only cover some. To this issue one also needs to add the underlying limitations in terms of recall of the mutation extraction process [83] and inconsistencies of sequence descriptions in reference databases as compared to those examined in scientific articles.

Using the literature to understand the consequences of mutation From a parallel perspective, text mining approaches can be used to enhance our understanding of both new and existing mutations. Text mining approaches output mutations extracted from the literature along with all their contextual information. Pointers to the relevant literature are provided, these include: experimental conditions, organism or population subtypes, information regarding observed phenotypes including association to disease, or in a best case scenario, the underlying biochemical mechanisms.

This information can help to interpret the consequences of mutations and is often complementary to the valuable clues provided by the methods to predict the pathogenicity of mutations. Indeed, the emerging trend in the field is to integrate information from diverse sources [43, 44], as we have done recently with the development of wKinMut (http://wkinmut.bioinfo.cnio.es) to help in the interpretation of mu- tations in the protein kinase superfamily.

In addition to the predictions of pathogenicity directly from our in-house classifier [45] and the val- ues of the features used in the classification, wKinMut combines information from different external sources to help in the interpretation of the prediction. These include the results from other classifiers focusing on different aspects of mutation pathogenicity (SIFT [28], MutationAssessor [30]), the repre- sentation of the mutation in the context of its three-dimensional structure and records of the mutation in other databases such as SAAPdb [23], Uniprot [49], COSMIC [22] and KinMutBase [25]. Two text mining resources complement the framework: iHop [85] a literature mining system to extract gene-gene and protein-protein interactions and SNP2L [75] whose capabilities to detect mutation mentions from the literature have been described thoroughly here.

In summary, wKinMut can be useful to predict the pathogenicity of novel mutations and to interpret the

11 biochemical mechanisms leading to pathogenicity and it can be applied to the interpretation of genomes from cancer patients.

Overview and Summary

Current research aims to discover the mechanistic connection between mutations and disease. We focused on the protein kinase superfamily due to the enormous wealth of mentions in the literature associating different diseases, including cancer, with mutations in members of this superfamily.

In this article we have reviewed the different possibilities and limitations of state-of-the-art computa- tional methods for the prediction of the pathogenicity of mutations and we have discussed the difficulties that arise to benchmark and evaluate the performance of the classifiers. We have proposed our recently published pipeline, SNP2L, for the automatic extraction and curation of mentions in the literature to col- lect a gold standard dataset that might be used in the benchmarking of the different predictors. Finally, we have introduced wKinMut as an example the integration of text mining with prediction methodologies to help in the interpretation of the consequences of mutations in the context of disease genome analysis with particular focus on cancer. We think that such applications might be of interest in the interpretation of patient genomes in the emerging field of personalized/stratified medicine in, hopefully, a near future.

References

[1] J Y Wang. Protein kinases entering the information age. J Biomed Sci, 5(2):73, Jan 1998. [2] G Manning, D B Whyte, R Martinez, T Hunter, and S Sudarsanam. The protein kinase complement of the human genome. , 298(5600):1912–34, Dec 2002. [3] James D R Knight, Bin Qian, David Baker, and Rashmi Kothary. Conservation, variability and the modeling of active protein kinases. PLoS ONE, 2(10):e982, Jan 2007. [4] Gonzalo L´opez, Alfonso Valencia, and Michael L Tress. Firedb–a database of functionally important residues from proteins of known structure. Nucleic Acids Res, 35(Database issue):D219–23, Jan 2007. [5] Sarah L Kinnings and Richard M Jackson. Binding site similarity analysis for the functional classi- fication of the protein kinase family. Journal of chemical information and modeling, Jan 2009. [6] Duangrudee Tanramluk, Adrian Schreyer, William R Pitt, and Tom L Blundell. On the origins of enzyme inhibitor selectivity and promiscuity: a case study of protein kinase binding to staurosporine. Chem Biol Drug Des, 74(1):16–24, Jul 2009. [7] Eric D Scheeff and Philip E Bourne. Structural of the protein kinase-like superfamily. PLoS Comput Biol, 1(5):e49, Oct 2005. [8] Gerard Manning, Gregory D Plowman, Tony Hunter, and Sucha Sudarsanam. Evolution of protein kinase signaling from yeast to man. Trends Biochem Sci, 27(10):514–20, Oct 2002. [9] Jeffrey A Ubersax, Erika L Woodbury, Phuong N Quang, Maria Paraz, Justin D Blethrow, Kavita Shah, Kevan M Shokat, and David O Morgan. Targets of the -dependent kinase cdk1. Nature, 425(6960):859–64, Oct 2003. [10] Robert D Finn, Jaina Mistry, John Tate, Penny Coggill, Andreas Heger, Joanne E Pollington, O Luke Gavin, Prasad Gunasekaran, Goran Ceric, Kristoffer Forslund, Liisa Holm, Erik L L Sonnham- mer, Sean R Eddy, and . The pfam protein families database. Nucleic Acids Res, 38(Database issue):D211–22, Jan 2010.

12 [11] T Hunter and G D Plowman. The protein kinases of budding yeast: six score and more. Trends Biochem Sci, 22(1):18–22, Jan 1997.

[12] Gerard Manning. Genomic overview of protein kinases. WormBook, pages 1–19, Jan 2005. [13] Sean Caenepeel, Glen Charydczak, Sucha Sudarsanam, Tony Hunter, and Gerard Manning. The mouse kinome: discovery and comparative of all mouse protein kinases. Proc Natl Acad Sci USA, 101(32):11707–12, Aug 2004. [14] Diego Miranda-Saavedra and Geoffrey J Barton. Classification and functional annotation of eukary- otic protein kinases. Proteins, 68(4):893–914, Sep 2007. [15] , , Cathy H Wu, Winona C Barker, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang, Rodrigo Lopez, Michele Magrane, Maria J Martin, Darren A Natale, Claire O’Donovan, Nicole Redaschi, and Lai-Su L Yeh. The universal protein resource (uniprot). Nucleic Acids Res, 33(Database issue):D154–9, Jan 2005.

[16] I Shchemelinin, L Sefc, and E Necas. Protein kinases, their function and implication in cancer and other diseases. Folia Biol (Praha), 52(3):81–100, Jan 2006. [17] Christopher Greenman, Philip Stephens, Raffaella Smith, Gillian L Dalgliesh, Christopher Hunter, Graham Bignell, Helen Davies, Jon Teague, Adam Butler, Claire Stevens, Sarah Edkins, Sarah O’Meara, Imre Vastrik, Esther E Schmidt, Tim Avis, Syd Barthorpe, Gurpreet Bhamra, Gemma Buck, Bhudipa Choudhury, Jody Clements, Jennifer Cole, Ed Dicks, Simon Forbes, Kris Gray, Kelly Halliday, Rachel Harrison, Katy Hills, Jon Hinton, Andy Jenkinson, David Jones, Andy Menzies, Tatiana Mironenko, Janet Perry, Keiran Raine, Dave Richardson, Rebecca Shepherd, Alexandra Small, Calli Tofts, Jennifer Varian, Tony Webb, Sofie West, Sara Widaa, Andy Yates, Daniel P Cahill, David N Louis, Peter Goldstraw, Andrew G Nicholson, Francis Brasseur, Leendert Looijenga, Barbara L Weber, Yoke-Eng Chiew, Anna DeFazio, Mel F Greaves, Anthony R Green, Peter Camp- bell, , Douglas F Easton, Georgia Chenevix-Trench, Min-Han Tan, Sok Kean Khoo, Bin Tean Teh, Siu Tsan Yuen, Suet Yi Leung, Richard Wooster, P Andrew Futreal, and Michael R Stratton. Patterns of somatic mutation in human cancer genomes. Nature, 446(7132):153–8, Mar 2007.

[18] Laura D Wood, D Williams Parsons, SiˆanJones, Jimmy Lin, Tobias Sj¨oblom,Rebecca J Leary, Dong Shen, Simina M Boca, Thomas D Barber, Janine Ptak, Natalie Silliman, Steve Szabo, Zoltan Dezso, Vadim Ustyanksky, Tatiana Nikolskaya, Yuri Nikolsky, Rachel Karchin, Paul A Wilson, Joshua S Kaminker, Zemin Zhang, Randal Croshaw, Joseph Willis, Dawn Dawson, Michail Shipitsin, James K V Willson, Saraswati Sukumar, Kornelia Polyak, Ben Ho Park, Charit L Pethiyagoda, P V Krishna Pant, Dennis G Ballinger, Andrew B Sparks, James Hartigan, Douglas R Smith, Erick Suh, Nickolas Papadopoulos, Phillip Buckhaults, Sanford D Markowitz, Giovanni Parmigiani, Kenneth W Kinzler, Victor E Velculescu, and . The genomic landscapes of human breast and colorectal cancers. Science, 318(5853):1108–13, Nov 2007. [19] Tobias Sj¨oblom,SiˆanJones, Laura D Wood, D Williams Parsons, Jimmy Lin, Thomas D Barber, Diana Mandelker, Rebecca J Leary, Janine Ptak, Natalie Silliman, Steve Szabo, Phillip Buckhaults, Christopher Farrell, Paul Meeh, Sanford D Markowitz, Joseph Willis, Dawn Dawson, James K V Willson, Adi F Gazdar, James Hartigan, Leo Wu, Changsheng Liu, Giovanni Parmigiani, Ben Ho Park, Kurtis E Bachman, Nickolas Papadopoulos, Bert Vogelstein, Kenneth W Kinzler, and Vic- tor E Velculescu. The consensus coding sequences of human breast and colorectal cancers. Science, 314(5797):268–74, Oct 2006.

[20] A Baudot, F Real, J Izarzugaza, and A Valencia. From cancer genomes to cancer models: bridging the gaps. EMBO Rep, Mar 2009.

13 [21] Yum L Yip, Maria Famiglietti, Arnaud Gos, Paula D Duek, Fabrice P A David, Alain Gateau, and Amos Bairoch. Annotating single amino acid polymorphisms in the uniprot/swiss-prot knowledge- base. Hum Mutat, 29(3):361–6, Mar 2008.

[22] S Bamford, E Dawson, Simon Forbes, J Clements, R Pettett, A Dogan, A Flanagan, J Teague, P Andrew Futreal, M R Stratton, and R Wooster. The cosmic (catalogue of somatic mutations in cancer) database and website. Br J Cancer, 91(2):355–8, Jul 2004. [23] J Hurst, L McMillan, C Porter, J Allen, A Fakorede, and A Martin. The saapdb web resource: A large-scale structural analysis of mutant proteins. Hum Mutat, Feb 2009. [24] Christopher J Richardson, Qiong Gao, Costas Mitsopoulous, Marketa Zvelebil, Laurence H Pearl, and Frances M G Pearl. Mokca database–mutations of kinases in cancer. Nucleic Acids Res, 37(Database issue):D824–31, Jan 2009. [25] Csaba Ortutay, Jouni V¨aliaho,Kaj Stenberg, and Mauno Vihinen. Kinmutbase: a registry of disease- causing mutations in protein kinase domains. Hum Mutat, 25(5):435–42, May 2005. [26] Melissa Cline and Rachel Karchin. Using bioinformatics to predict the functional impact of snvs. Bioinformatics, Dec 2010. [27] Rachel Karchin. Next generation tools for the annotation of human snps. Brief Bioinformatics, 10(1):35–52, Jan 2009. [28] Pauline C Ng and S Henikoff. Predicting deleterious amino acid substitutions. Genome Res, 11(5):863–74, May 2001. [29] Robert J Clifford, Michael N Edmonson, Cu Nguyen, and Kenneth H Buetow. Large-scale analysis of non-synonymous coding region single nucleotide polymorphisms. Bioinformatics, 20(7):1006–14, May 2004. [30] Boris Reva, Yevgeniy Antipin, and . Predicting the functional impact of protein muta- tions: application to cancer genomics. Nucleic Acids Res, Jul 2011. [31] Vasily Ramensky, , and Shamil Sunyaev. Human non-synonymous snps: server and survey. Nucleic Acids Res, 30(17):3894–900, Sep 2002. [32] Z Wang and John Moult. Snps, protein structure, and disease. Hum Mutat, 17(4):263–70, Apr 2001. [33] V G Krishnan and D R Westhead. A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function. Bioinformatics, 19(17):2199–209, Nov 2003.

[34] Joshua S Kaminker, Yan Zhang, Allison Waugh, Peter M Haverty, Brock Peters, Dragan Sebisanovic, Jeremy Stinson, William F Forrest, J Fernando Bazan, Somasekar Seshagiri, and Zemin Zhang. Distinguishing cancer-associated missense mutations from common polymorphisms. Cancer Res, 67(2):465–73, Jan 2007.

[35] Gilad Wainreb, Haim Ashkenazy, Yana Bromberg, Alina Starovolsky-Shitrit, Turkan Haliloglu, Ey- tan Ruppin, Karen B Avraham, , and Nir Ben-Tal. Mud: an interactive web server for the prediction of non-neutral substitutions using protein structural data. Nucleic Acids Res, 38 Suppl:W523–8, Jul 2010. [36] Carles Ferrer-Costa, Modesto Orozco, and Xavier de la Cruz. Characterization of disease-associated single amino acid polymorphisms in terms of sequence and structure properties. J Mol Biol, 315(4):771–86, Jan 2002.

14 [37] Yana Bromberg and Burkhard Rost. Snap: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Res, 35(11):3823–35, Jan 2007.

[38] Ivan A Adzhubei, Steffen Schmidt, Leonid Peshkin, Vasily E Ramensky, Anna Gerasimova, Peer Bork, Alexey S Kondrashov, and Shamil R Sunyaev. A method and server for predicting damaging missense mutations. Nat Methods, 7(4):248–9, Apr 2010. [39] Remo Calabrese, Emidio Capriotti, Piero Fariselli, Pier Luigi Martelli, and . Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum Mutat, 30(8):1237–44, Aug 2009. [40] Ali Torkamani and Nicholas J Schork. Accurate prediction of deleterious protein kinase polymor- phisms. Bioinformatics, 23(21):2918–25, Nov 2007. [41] Peng Yue, Zhaolong Li, and John Moult. Loss of protein structure stability as a major causative factor in monogenic disease. J Mol Biol, 353(2):459–73, Oct 2005.

[42] Rachel Karchin, Mark Diekhans, Libusha Kelly, Daryl J Thomas, Ursula Pieper, Narayanan Eswar, , and Andrej Sali. Ls-snp: large-scale annotation of coding non-synonymous snps based on multiple information sources. Bioinformatics, 21(12):2814–20, Jun 2005. [43] Phil Hyoun Lee and Hagit Shatkay. F-snp: computationally predicted functional snps for disease association studies. Nucleic Acids Res, 36(Database issue):D820–4, Jan 2008. [44] Abel Gonz´alez-P´erezand Nuria L´opez-Bigas. Improving the assessment of the outcome of nonsyn- onymous snvs with a consensus deleteriousness score, condel. Am J Hum Genet, 88(4):440–9, Apr 2011. [45] Jose MG Izarzugaza, Angela Pozo, Miguel Vazquez, and Alfonso Valencia. Prioritization of pathogenic mutations in the protein kinase superfamily. BMC Genomics, Submitted, 2012. [46] UniProt Consortium. The universal protein resource (uniprot). Nucleic Acids Res, 35(Database issue):D193–7, Jan 2007. [47] H M Berman, J Westbrook, Z Feng, G Gilliland, T N Bhat, H Weissig, Ilya N Shindyalov, and Philip E Bourne. The . Nucleic Acids Res, 28(1):235–42, Jan 2000. [48] Roman A Laskowski, Victor V Chistyakov, and Janet M Thornton. Pdbsum more: new summaries and analyses of the known 3d structures of proteins and nucleic acids. Nucleic Acids Res, 33(Database issue):D266–8, Jan 2005. [49] Yum Lina Yip, Nathalie Lachenal, Violaine Pillet, and Anne-Lise Veuthey. Retrieving mutation- specific information for human proteins in uniprot/swiss-prot knowledgebase. J Bioinform Comput Biol, 5(6):1215–31, Dec 2007. [50] Paul Flicek, M Ridwan Amode, Daniel Barrell, Kathryn Beal, Simon Brent, Yuan Chen, Peter Clapham, Guy Coates, Susan Fairley, Stephen Fitzgerald, Leo Gordon, Maurice Hendrix, Thibaut Hourlier, Nathan Johnson, Andreas K¨ah¨ari,Damian Keefe, Stephen Keenan, Rhoda Kinsella, Fe- lix Kokocinski, Eugene Kulesha, Pontus Larsson, Ian Longden, William McLaren, Bert Overduin, Bethan Pritchard, Harpreet Singh Riat, Daniel Rios, Graham R S Ritchie, Magali Ruffier, Michael Schuster, Daniel Sobral, Giulietta Spudich, Y Amy Tang, Stephen Trevanion, Jana Vandrovcova, Albert J Vilella, Simon White, Steven P Wilder, Amonida Zadissa, Jorge Zamora, Bronwen L Aken, Ewan Birney, Fiona Cunningham, Ian Dunham, Richard Durbin, Xos´eM Fern´andez-Suarez,Javier Herrero, Tim J P Hubbard, Anne Parker, Glenn Proctor, Jan Vogel, and Stephen M J Searle. Ensembl 2011. Nucleic Acids Res, 39(Database issue):D800–6, Jan 2011.

15 [51] S T Sherry, M H Ward, M Kholodov, J Baker, L Phan, E M Smigielski, and K Sirotkin. dbsnp: the ncbi database of genetic variation. Nucleic Acids Res, 29(1):308–11, Jan 2001.

[52] International HapMap 3 Consortium, David M Altshuler, Richard A Gibbs, Leena Peltonen, David M Altshuler, Richard A Gibbs, Leena Peltonen, Emmanouil Dermitzakis, Stephen F Schaffner, Fuli Yu, Leena Peltonen, Emmanouil Dermitzakis, Penelope E Bonnen, David M Altshuler, Richard A Gibbs, Paul I W de Bakker, Panos Deloukas, Stacey B Gabriel, Rhian Gwilliam, Sarah Hunt, Michael Inouye, Xiaoming Jia, Aarno Palotie, Melissa Parkin, Pamela Whittaker, Fuli Yu, Kyle Chang, Alicia Hawes, Lora R Lewis, Yanru Ren, David Wheeler, Richard A Gibbs, Donna Marie Muzny, Chris Barnes, Katayoon Darvishi, Matthew Hurles, Joshua M Korn, Kati Kristiansson, Charles Lee, Steven A McCarrol, James Nemesh, Emmanouil Dermitzakis, Alon Keinan, Stephen B Montgomery, Samuela Pollack, Alkes L Price, Nicole Soranzo, Penelope E Bonnen, Richard A Gibbs, Claudia Gonzaga-Jauregui, Alon Keinan, Alkes L Price, Fuli Yu, Verneri Anttila, Wendy Brodeur, Mark J Daly, Stephen Leslie, Gil McVean, Loukas Moutsianas, Huy Nguyen, Stephen F Schaffner, Qingrun Zhang, Mohammed J R Ghori, Ralph McGinnis, William McLaren, Samuela Pollack, Alkes L Price, Stephen F Schaffner, Fumihiko Takeuchi, Sharon R Grossman, Ilya Shlyakhter, Elizabeth B Hostet- ter, Pardis C Sabeti, Clement A Adebamowo, Morris W Foster, Deborah R Gordon, Julio Licinio, Maria Cristina Manca, Patricia A Marshall, Ichiro Matsuda, Duncan Ngare, Vivian Ota Wang, Deepa Reddy, Charles N Rotimi, Charmaine D Royal, Richard R Sharp, Changqing Zeng, Lisa D Brooks, and Jean E McEwen. Integrating common and rare genetic variation in diverse human populations. Nature, 467(7311):52–8, Sep 2010. [53] Consortium, Richard M Durbin, Gon¸calo R Abecasis, David L Alt- shuler, Adam Auton, Lisa D Brooks, Richard M Durbin, Richard A Gibbs, Matt E Hurles, and Gil A McVean. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061–73, Oct 2010.

[54] Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature, 474(7353):609–15, Jun 2011. [55] International Cancer Genome Consortium, Thomas J Hudson, Warwick Anderson, Axel Artez, Anna D Barker, Cindy Bell, Rosa R Bernab´e,M K Bhan, Fabien Calvo, Iiro Eerola, Daniela S Gerhard, Alan Guttmacher, Mark Guyer, Fiona M Hemsley, Jennifer L Jennings, David Kerr, Peter Klatt, Patrik Kolar, Jun Kusada, David P Lane, Frank Laplace, Lu Youyong, Gerd Nettekoven, Brad Ozenberger, Jane Peterson, T S Rao, Jacques Remacle, Alan J Schafer, Tatsuhiro Shibata, Michael R Stratton, Joseph G Vockley, Koichi Watanabe, Huanming Yang, Matthew M F Yuen, Bartha M Knoppers, Martin Bobrow, Anne Cambon-Thomsen, Lynn G Dressler, Stephanie O M Dyke, Yann Joly, Kazuto Kato, Karen L Kennedy, Pilar Nicol´as,Michael J Parker, Emmanuelle Rial-Sebbag, Carlos M Romeo-Casabona, Kenna M Shaw, Susan Wallace, Georgia L Wiesner, Nikolajs Zeps, Peter Lichter, Andrew V Biankin, Christian Chabannon, Lynda Chin, Bruno Cl´ement, Enrique de Alava, Fran¸coiseDegos, Martin L Ferguson, Peter Geary, D Neil Hayes, Thomas J Hudson, Amber L Johns, Arek Kasprzyk, Hidewaki Nakagawa, Robert Penny, Miguel A Piris, Rajiv Sarin, Aldo Scarpa, Tat- suhiro Shibata, Marc van de Vijver, P Andrew Futreal, Hiroyuki Aburatani, M´onicaBay´es,David D L Botwell, Peter J Campbell, Xavier Estivill, Daniela S Gerhard, Sean M Grimmond, Ivo Gut, Martin Hirst, Carlos L´opez-Ot´ın,Partha Majumder, Marco Marra, John D McPherson, Hidewaki Nakagawa, Zemin Ning, Xose S Puente, Yijun Ruan, Tatsuhiro Shibata, Michael R Stratton, Hen- drik G Stunnenberg, Harold Swerdlow, Victor E Velculescu, Richard K Wilson, Hong H Xue, Liu Yang, Paul T Spellman, Gary D Bader, Paul C Boutros, Peter J Campbell, Paul Flicek, Gad Getz, Roderic Guig´o,Guangwu Guo, David Haussler, Simon Heath, Tim J Hubbard, Tao Jiang, Steven M Jones, Qibin Li, Nuria L´opez-Bigas, Ruibang Luo, Lakshmi Muthuswamy, B F Francis Ouellette, John V Pearson, Xose S Puente, Victor Quesada, Benjamin J Raphael, Chris Sander, Tatsuhiro Shibata, Terence P Speed, Lincoln D Stein, Joshua M Stuart, Jon W Teague, Yasushi Totoki, Tat- suhiko Tsunoda, Alfonso Valencia, David A Wheeler, Honglong Wu, Shancen Zhao, Guangyu Zhou,

16 Lincoln D Stein, Roderic Guig´o,Tim J Hubbard, Yann Joly, Steven M Jones, Arek Kasprzyk, Mark Lathrop, Nuria L´opez-Bigas, B F Francis Ouellette, Paul T Spellman, Jon W Teague, Gilles Thomas, Alfonso Valencia, Teruhiko Yoshida, Karen L Kennedy, Myles Axton, Stephanie O M Dyke, P An- drew Futreal, Daniela S Gerhard, Chris Gunter, Mark Guyer, Thomas J Hudson, John D McPherson, Linda J Miller, Brad Ozenberger, Kenna M Shaw, Arek Kasprzyk, Lincoln D Stein, Junjun Zhang, Syed A Haider, Jianxin Wang, Christina K Yung, Anthony Cros, Anthony Cross, Yong Liang, Sara- vanamuttu Gnaneshan, Jonathan Guberman, Jack Hsu, Martin Bobrow, Don R C Chalmers, Karl W Hasel, Yann Joly, Terry S H Kaan, Karen L Kennedy, Bartha M Knoppers, William W Lowrance, Tohru Masui, Pilar Nicol´as,Emmanuelle Rial-Sebbag, Laura Lyman Rodriguez, Catherine Vergely, Teruhiko Yoshida, Sean M Grimmond, Andrew V Biankin, David D L Bowtell, Nicole Cloonan, Anna DeFazio, James R Eshleman, Dariush Etemadmoghadam, Brooke B Gardiner, Brooke A Gar- diner, James G Kench, Aldo Scarpa, Robert L Sutherland, Margaret A Tempero, Nicola J Waddell, Peter J Wilson, John D McPherson, Steve Gallinger, Ming-Sound Tsao, Patricia A Shaw, Glo- ria M Petersen, Debabrata Mukhopadhyay, Lynda Chin, Ronald A DePinho, Sarah Thayer, Lakshmi Muthuswamy, Kamran Shazand, Timothy Beck, Michelle Sam, Lee Timms, Vanessa Ballin, Youyong Lu, Jiafu Ji, Xiuqing Zhang, Feng Chen, Xueda Hu, Guangyu Zhou, Qi Yang, Geng Tian, Lianhai Zhang, Xiaofang Xing, Xianghong Li, Zhenggang Zhu, Yingyan Yu, Jun Yu, Huanming Yang, Mark Lathrop, J¨orgTost, Paul Brennan, Ivana Holcatova, David Zaridze, Alvis Brazma, Lars Egevard, Egor Prokhortchouk, Rosamonde Elizabeth Banks, Mathias Uhl´en,Anne Cambon-Thomsen, Juris Viksna, Fredrik Ponten, Konstantin Skryabin, Michael R Stratton, P Andrew Futreal, Ewan Birney, Ake Borg, Anne-Lise Børresen-Dale, Carlos Caldas, John A Foekens, Sancha Martin, Jorge S Reis- Filho, Andrea L Richardson, Christos Sotiriou, Hendrik G Stunnenberg, Giles Thoms, Marc van de Vijver, Laura van’t Veer, Fabien Calvo, Daniel Birnbaum, H´el`eneBlanche, Pascal Boucher, San- drine Boyault, Christian Chabannon, Ivo Gut, Jocelyne D Masson-Jacquemier, Mark Lathrop, Iris Pauport´e,Xavier Pivot, Anne Vincent-Salomon, Eric Tabone, Charles Theillet, Gilles Thomas, J¨org Tost, Isabelle Treilleux, Fabien Calvo, Paulette Bioulac-Sage, Bruno Cl´ement, Thomas Decaens, Fran¸coiseDegos, Dominique Franco, Ivo Gut, Marta Gut, Simon Heath, Mark Lathrop, Didier Samuel, Gilles Thomas, Jessica Zucman-Rossi, Peter Lichter, Roland Eils, Benedikt Brors, Jan O Korbel, Andrey Korshunov, Pablo Landgraf, Hans Lehrach, Stefan Pfister, Bernhard Radlwimmer, Guido Reifenberger, Michael D Taylor, Christof von Kalle, Partha P Majumder, Rajiv Sarin, T S Rao, M K Bhan, . International network of cancer genome projects. Nature, 464(7291):993–8, Apr 2010. [56] Joanna Amberger, Carol Bocchini, and Ada Hamosh. A new face and new challenges for online mendelian inheritance in man (omim( R ) ). Hum Mutat, 32(5):564–7, May 2011. [57] Joke Reumers, Sebastian Maurer-Stroh, Joost Schymkowitz, and Frederic Rousseau. Snpeffect v2.0: a new step in investigating the molecular phenotypic effects of human non-synonymous snps. Bioin- formatics, 22(17):2183–5, Sep 2006. [58] Ursula Pieper, Narayanan Eswar, Fred P Davis, Hannes Braberg, M S Madhusudhan, Andrea Rossi, Marc Marti-Renom, Rachel Karchin, Ben M Webb, David Eramian, Min-Yi Shen, Libusha Kelly, Francisco Melo, and Andrej Sali. Modbase: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res, 34(Database issue):D291–5, Jan 2006.

[59] Nathan O Stitziel, T Andrew Binkowski, Yan Yuan Tseng, Simon Kasif, and Jie Liang. toposnp: a topographic database of non-synonymous single nucleotide polymorphisms with and without known disease association. Nucleic Acids Res, 32(Database issue):D520–2, Jan 2004. [60] Carles Ferrer-Costa, Josep Lluis Gelp´ı, Leire Zamakola, Ivan Parraga, Xavier de la Cruz, and Modesto Orozco. Pmut: a web-based tool for the annotation of pathological mutations on pro- teins. Bioinformatics, 21(14):3176–8, Jul 2005.

17 [61] Peng Yue, Eugene Melamud, and John Moult. Snps3d: candidate gene and snp selection for associ- ation studies. BMC Bioinformatics, 7:166, Jan 2006. [62] Paul D Thomas, Michael J Campbell, Anish Kejariwal, Huaiyu Mi, Brian Karlak, Robin Daverman, Karen Diemer, Anushya Muruganujan, and Apurva Narechania. Panther: a library of protein families and subfamilies indexed by function. Genome Res, 13(9):2129–41, Sep 2003. [63] Joshua S Kaminker, Yan Zhang, Colin Watanabe, and Zemin Zhang. Canpredict: a computa- tional tool for predicting cancer-associated missense mutations. Nucleic Acids Res, 35(Web Server issue):W595–8, Jul 2007. [64] J Stoyanovich and I Pe’er. Mutagenesys: Estimating individual disease susceptibility based on genome-wide snp array data. Bioinformatics, Nov 2007. [65] Alper Uzun, Chesley M Leslin, Alexej Abyzov, and Valentin Ilyin. Structure snp (stsnp): a web server for mapping and modeling nssnps on protein structures with linkage to metabolic pathways. Nucleic Acids Res, 35(Web Server issue):W384–92, Jul 2007. [66] Wing Chung Wong, Dewey Kim, Hannah Carter, Mark Diekhans, Michael C Ryan, and Rachel Karchin. Chasm and snvbox: toolkit for detecting biologically important single nucleotide mutations in cancer. Bioinformatics, 27(15):2147–8, Aug 2011. [67] Jose MG Izarzugaza, Miguel Vazquez, Angela Pozo, and Alfonso Valencia. wkinmut: An integrated framework for the analysis and interpretation of mutations in the protein kinase superfamily. Nucleic Acids Res, Submitted, 2012. [68] Matthew A Care, Chris J Needham, Andrew J Bulpitt, and David R Westhead. Deleterious snp prediction: be mindful of your training data! Bioinformatics, 23(6):664–72, Mar 2007. [69] C Blaschke and A Valencia. The potential use of suiseki as a protein interaction discovery tool. Genome Inform, 12:123–34, Jan 2001. [70] Martin Krallinger, Alfonso Valencia, and Lynette Hirschman. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol, 9 Suppl 2:S8, Jan 2008. [71] Martin Krallinger, Florian Leitner, Carlos Rodriguez-Penagos, and Alfonso Valencia. Overview of the protein-protein interaction annotation extraction task of biocreative ii. Genome Biol, 9 Suppl 2:S4, Jan 2008. [72] Martin Krallinger, Florian Leitner, and Alfonso Valencia. Analysis of biological processes and diseases using text mining approaches. Methods Mol Biol, 593:341–82, Jan 2010. [73] M. Krallinger, A. Morgan, L. Smith, F. Leitner, L. Tanabe, J. Wilbur, L. Hirschman, and A. Valen- cia. Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol., 9 Suppl 2:S1, 2008. [74] Florian Leitner, Andrew Chatr-aryamontri, Scott A Mardis, Arnaud Ceol, Martin Krallinger, Luana Licata, Lynette Hirschman, Gianni Cesareni, and Alfonso Valencia. The febs letters/biocreative ii.5 experiment: making biological information accessible. Nat Biotechnol, 28(9):897–9, Sep 2010. [75] Martin Krallinger, Jos´eM G Izarzugaza, Carlos Rodriguez-Penagos, and Alfonso Valencia. Extrac- tion of human kinase mutations from literature, databases and genotyping studies. BMC Bioinfor- matics, 10 Suppl 8:S1, Jan 2009. [76] Dietrich Rebholz-Schuhmann, Stephane Marcel, Sylvie Albert, Ralf Tolle, Georg Casari, and Harald Kirsch. Automatic extraction of mutations from medline and cross-validation with omim. Nucleic Acids Res, 32(1):135–42, Jan 2004.

18 [77] Florence Horn, Anthony L Lau, and Fred E Cohen. Automated extraction of mutation data from the literature: application of mutext to g protein-coupled receptors and nuclear hormone receptors. Bioinformatics, 20(4):557–68, Mar 2004.

[78] Lawrence C Lee, Florence Horn, and Fred E Cohen. Automatic extraction of protein point mutations using a graph bigram association. PLoS Comput Biol, 3(2):e16, Feb 2007. [79] Christopher J O Baker and Witte Rene. Mutation mining - a prospector’s tale. Journal of Informa- tion Systems Frontiers, 8(1):47–57, Apr 2006.

[80] Muge Erdogmus and Osman Ugur Sezerman. Application of automatic mutation-gene pair extraction to diseases. J Bioinform Comput Biol, 5(6):1261–75, Dec 2007. [81] Ryan T McDonald, R Scott Winters, Mark Mandel, Yang Jin, Peter S White, and Fernando Pereira. An entity tagger for recognizing acquired genomic variations in cancer literature. Bioinformatics, 20(17):3249–51, Nov 2004.

[82] Laura I Furlong, Holger Dach, Martin Hofmann-Apitius, and Ferran Sanz. Osirisv1.2: a named entity recognition system for sequence variants of genes in biomedical literature. BMC Bioinformatics, 9:84, Jan 2008. [83] J Gregory Caporaso, William A Baumgartner, David A Randolph, K Bretonnel Cohen, and . Mutationfinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics, 23(14):1862–5, Jul 2007. [84] Alexander A Morgan, Zhiyong Lu, Xinglong Wang, Aaron M Cohen, Juliane Fluck, Patrick Ruch, Anna Divoli, Katrin Fundel, Robert Leaman, J¨orgHakenberg, Chengjie Sun, Heng hui Liu, Rafael Torres, Michael Krauthammer, William W Lau, Hongfang Liu, Chun-Nan Hsu, Martijn Schuemie, K Bretonnel Cohen, and Lynette Hirschman. Overview of biocreative ii gene normalization. Genome Biol, 9 Suppl 2:S3, Jan 2008. [85] Robert Hoffmann and Alfonso Valencia. Implementing the ihop concept for navigation of biomedical literature. Bioinformatics, 21 Suppl 2:ii252–8, Sep 2005.

19 Figure 1.JPEG Copyright of Frontiers in Physiology is the property of Frontiers Media S.A. and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use.