CHIMIA 2011, 65, No. 1/2 · Glycochemistry Today · doi:10.2533/chimia.2011.10 · Chimia 65 (2011) 10–13 · © Schweizerische Chemische Gesellschaft

Glycoinformatics: Data Mining-based Approaches

Hiroshi Mamitsuka*

Abstract: Oligosaccharides and glycans are major cellular macromolecules, working for a variety of vital biological functions. Due to long-term efforts by experimentalists, the number of structurally distinct, determined carbohydrates now exceeds 10,000. As a result, data mining-based approaches for glycans (or trees, in a computer science sense) have attracted attention and have been developed over the last five years, presenting new techniques even from computer science viewpoints. This review summarizes cutting-edge techniques for glycans in each of the three categories of data mining: classification, clustering and frequent pattern mining, and shows results obtained by applying these techniques to real sets of structures.

Keywords: Data mining · Frequent subtrees · Glycan structures · Machine learning · Probabilistic models

*Correspondence: Prof. H. Mamitsuka, Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji 611-0011, Japan; Tel.: +81 774 38 3023; Fax: +81 774 38 3037; E-mail: [email protected]

1. Introduction

Oligosaccharides and glycans are major cellular macromolecules associated with a variety of important biological phenomena, including antigen–antibody interaction[1] and cell fate control.[2] Proteins and DNAs consist of twenty types of amino acids and four types of nucleotides, respectively. Likewise, glycans have building blocks, called monosaccharides, which however are more diverse, the major ones being fructose, galactose, glucose and mannose.[3] The uniqueness of carbohydrates lies in the connection of monosaccharides, where two or more monosaccharides can connect to one monosaccharide without any cycle, forming branch-shaped extensions. In a computer science sense, glycans can be trees, consisting of nodes and edges (which connect nodes), corresponding to monosaccharides and chemical linkages, respectively. Moreover, glycans are rooted directed ordered trees because i) a glycan connects to a protein (an amino acid) by only one monosaccharide, called the root, ii) connections are directed from the root to the leaves, meaning that of two connected monosaccharides one is a parent (closer to the root) and the other a child (closer to a leaf), with ancestors and descendants defined in a similar manner, and iii) monosaccharides (children) can be ordered by the carbon numbers attached to their linkages to another monosaccharide (the parent), meaning that children are ordered from the oldest sibling to the youngest sibling. Hereafter a subtree means a node in a tree together with all its descendants and the edges from that node to the leaves, while a supertree of a tree T is a tree having T as a subtree. This unique structure of carbohydrates makes them different from other macromolecules such as proteins and DNAs, which are simpler sequences of building blocks. Furthermore, this uniqueness has made it hard to determine the tree structures of glycans experimentally, by which glycan structure databases have been kept small and their growth slow. However, due to the development of research, the current number of structurally different glycans in a major database on glycans reaches more than ten thousand, and it will be further increased in the near future by high-throughput techniques.[4] Thus, in glycoinformatics, developing efficient approaches for mining rules or patterns from trees and applying these approaches to a glycan database would be a reasonable and promising way to find the biological significance embedded in glycan structures.[5,6]

2. Mining from Glycan Structures: Data Mining Techniques for Trees

In general, current approaches for mining from data, or machine learning, can be classified into roughly three types: i) classification (or supervised learning), ii) clustering (or unsupervised learning) and iii) frequent pattern mining.[7,8] Here we briefly explain the difference between these concepts, under which the input is always called examples (which are in our case rooted directed ordered trees). In classification, examples are labeled, i.e. class labels are attached to examples, and the purpose is to build a classifier/predictor which can predict the class to be assigned to a newly given example. This is supervised learning. On the other hand, in clustering, examples are not labeled, and the purpose is to assign examples to groups by which we can see the similarity between examples: if two examples are in the same group, they are similar. This is unsupervised learning, which can give labels to examples, whereas in supervised learning the labels are given. In frequent pattern mining, the input is unlabeled examples (so in some sense this is also unsupervised learning, but in this review we do not categorize frequent pattern mining as unsupervised learning). The purpose is to find patterns which appear a larger number of times than a prefixed number, by which we can see what patterns appear frequently.

2.1 Classification over Trees

Currently the most popular approach in supervised learning is the so-called support vector machine (SVM), in which we use a kernel function that represents the similarity between two examples.[9] The objective here is to find a classifier which can separate examples in one class from
those in the other class (if class labels are binary). The simplest way is to use a linear function, corresponding to using an inner product as the kernel function. However, it would be hard to separate examples according to class labels by using a simple linear function. This means that it is important to design a good kernel function by which we can separate examples easily. For the case that examples are trees, a kernel function for computing the similarity between two input trees has already been proposed.[10] The idea behind this 'tree' kernel is to check the subtrees which appear commonly in the two input trees: the bigger the common subtrees and/or the larger their number, the more similar the two input trees should be. This can be computed in a very efficient manner by using dynamic programming, and a tree kernel specialized for glycans was also proposed.[11]
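To make the subtree-matching idea concrete, here is a minimal sketch of a co-rooted fragment-counting kernel with a memoized (dynamic-programming) recursion. This is a simplified illustration in the spirit of convolution tree kernels, not the exact kernels of refs [10,11]; the Node class, labels and toy structures are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A monosaccharide in a glycan, modeled as a node of a rooted ordered tree."""
    label: str                                            # monosaccharide type, e.g. 'Gal'
    children: List["Node"] = field(default_factory=list)  # ordered oldest -> youngest

def _common(u: Node, v: Node, memo: dict) -> int:
    """Number of common tree fragments co-rooted at u and v (memoized recursion)."""
    key = (id(u), id(v))
    if key not in memo:
        if u.label != v.label or len(u.children) != len(v.children):
            memo[key] = 0
        else:
            # Each paired child slot is either cut off or extended by a common fragment.
            c = 1
            for cu, cv in zip(u.children, v.children):
                c *= 1 + _common(cu, cv, memo)
            memo[key] = c
    return memo[key]

def _nodes(t: Node) -> List[Node]:
    out, stack = [], [t]
    while stack:
        n = stack.pop()
        out.append(n)
        stack.extend(n.children)
    return out

def tree_kernel(t1: Node, t2: Node) -> int:
    """K(t1, t2) = total number of common subtree fragments over all node pairs."""
    memo: dict = {}
    return sum(_common(u, v, memo) for u in _nodes(t1) for v in _nodes(t2))

# Two toy 'glycans' sharing a Gal->GlcNAc branch (structures illustrative only)
g1 = Node("Glc", [Node("Gal", [Node("GlcNAc")])])
g2 = Node("Man", [Node("Gal", [Node("GlcNAc")])])
print(tree_kernel(g1, g2))  # 3 common fragments: GlcNAc, Gal, Gal->GlcNAc
```

Such a kernel value can be plugged directly into an SVM, since only pairwise similarities between training examples are needed.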

2.2 Clustering: Probabilistic Models for Trees

There are a variety of approaches in unsupervised learning. In this review, we focus on probabilistic models, which are models with probability parameters to be estimated from given data. Besides clustering, probabilistic models also allow binary classification if the two classes are, say, the examples in question and all others. In this case, after training a probabilistic model on the examples in question, we can compute the likelihood of any given example, showing how likely the given example is to belong to the class in question.

A standard probabilistic model for time-series data or sequences is the hidden Markov model (HMM), which has been used in many applications, including speech recognition,[12] natural language processing[13] and the analysis of biological sequences, e.g. amino acid sequences.[14] A HMM can be defined by a state transition diagram, in which states are connected by edges. A HMM (or its state transition diagram) has two types of probability parameters, one being letter generation probabilities attached to states to generate letters, i.e. amino acid types, and the other being state transition probabilities attached to edges. A standard assumption on the Markov process is the first-order Markov property, which is very simple for sequences and means that the current state depends upon only the state one step away. Thus, given a sequence, we can just repeat the following two steps, from left to right on the sequence: we first generate the corresponding letter, i.e. an amino acid, at a state with a letter generation probability and then transit to another state with a state transition probability. Using these probabilities, we can compute the likelihood that a given sequence is generated under a HMM (or its state transition diagram). Training (or learning) the probability parameters of a HMM is usually based on maximum likelihood, which means that the probability parameters are estimated so that the likelihoods of the given sequences are maximized. There are a variety of extensions of HMMs, major ones being context-free grammars[15] and tree grammars[16] to deal with complex dependencies in sequences. Another direction is models for trees, the first trial being the hidden tree Markov model (HTMM),[17] in which the left-to-right scheme on a sequence is simply applied parent-to-child over a tree. This model can be applied to glycans. However, glycans are rooted directed ordered trees, while ordered children are not considered in HTMM at all. Thus the ordered tree Markov model (OTMM)[18,19] and the probabilistic sibling-dependent tree Markov model (PSTMM)[20,21] have been developed for glycans.

Fig. 1 shows the difference between HTMM, OTMM and PSTMM. Fig. 1(a) shows a sample rooted directed ordered tree, where the top square means the root, solid lines show directed edges (chemical linkages) from parents to children and dotted lines show ordered children. Fig. 1(b) shows a HTMM over the tree in (a), where you can see that the thick lines of state transitions are always directed from parents to children. Fig. 1(c) shows an OTMM over the tree in (a), where you can see that the thick lines of state transitions are from parents to children if the children are the oldest; otherwise the thick lines are from older siblings to younger siblings. That is, OTMM considers ordered children as well as parent-to-child dependencies, via parents and the oldest siblings. Fig. 1(d) shows a PSTMM over the tree in (a), where you can see that the thick lines of state transitions are always from parents to children as well as from older siblings to younger siblings. This means that PSTMM always considers both parent-to-child dependencies and dependencies between ordered children.

Fig. 1. (a) A sample rooted directed ordered tree and probabilistic dependencies of (b) HTMM, (c) OTMM and (d) PSTMM.

Thus we can see that PSTMM is the most complex and well-considered
model for glycans. However, we need to think about other points, such as the computational complexity of probabilistic models. HTMM is a simple modification of HMM from left-to-right to parent-to-child, meaning that these two models share the same time and space complexities, which are roughly O(n³), where n is the size of the given sequences or trees and the state transition diagram is assumed to be the complete graph. This good property of HTMM is also shared by OTMM, which keeps the same complexities. However, the time and space complexities of PSTMM are higher, roughly O(n⁴). This difference is computationally significant, because the state transition diagram can be a left-to-right one, by which the complexity of OTMM can be O(n²), while PSTMM must keep O(n³). Another issue is that the number of glycans we can have is always relatively small, meaning that a complex model is likely to overfit to the training data. In fact, ref. [19] shows that the predictive performance of PSTMM is high for training data but very low for test data, while that of OTMM is high for both training and test data, because the high complexity of PSTMM makes this model overfit to the training data. These results imply that OTMM is the most appropriate probabilistic model for glycans. On the other hand, PSTMM was applied to aligning multiple glycans by modifying it into a model called ProfilePSTMM,[22] a modification similar to that of HMM into ProfileHMM. This modification reduces the complexity of the original PSTMM, by which ProfilePSTMM can avoid the overfitting issue.

2.3 Mining Frequent Subtrees

Classification and clustering are very useful in that they can give some information to examples. On the other hand, if classifiers or probabilistic models are complex, it is hard to obtain any rules from the given examples. Mining frequent patterns is useful to find rules or patterns embedded in given examples, and the obtained patterns can further be used for classification and clustering. Thus frequent pattern mining has been developed for different types of examples, such as items,[8] trees[23] and graphs.[8] Here we focus on trees, i.e. glycans, and in fact conserved patterns have already been reported in glycans,[24] implying that similar subtree patterns could still be hidden and not yet found.

Given a set of trees, the number of trees having a certain subtree is called the support of the subtree. A subtree is frequent if its support is equal to or higher than a prefixed number, which is called the minimum support or minsup. In general, we can have a huge number of frequent subtrees, which are at the same time redundant, since if a subtree is frequent, all its subtrees are frequent. This means that frequent subtrees are sometimes very small and meaningless. To overcome these issues in mining frequent subtrees, we present a new method which has two important features: 1) controlling the number of outputted frequent subtrees and 2) using hypothesis testing to output only significant subtrees.[23] That is, we first reduce the number of frequent subtrees by step 1 and then remove meaningless frequent subtrees by step 2. Step 1 was realized by the concept called α-closed frequent subtrees, which means that a frequent subtree T is not outputted if the support of its supertree is larger than or equal to α × the support of T, where α takes a value between zero and one. That is, the intuitive idea of α-closed frequent subtrees is that a frequent subtree T is not outputted if the support of its supertree has a value similar to the support of T. Step 2 needs a control dataset, which is for example randomly generated, and the support of the frequent subtrees in the control dataset is computed. Step 2 then runs Fisher's exact test to retrieve only the frequent subtrees which appear often in the given dataset but not so frequently in the control dataset. The final results are ranked by the p-values of Fisher's exact test.

Fig. 2 shows the top five patterns obtained by running this approach over all glycans in a glycan database, revealing that each of the five patterns corresponds to a known, conserved pattern of glycans. Furthermore, you can see that these patterns are not ranked by their supports but by their p-values.

Fig. 2. Top five significant frequent subtrees: rank 1, p-value 1.6e-46, support 381, Lewis X; rank 2, p-value 1.1e-40, support 164, O-glycan core; rank 3, p-value 5.0e-26, support 109, glycosphingolipid core; rank 4, p-value 5.6e-26, support 233, glycosphingolipid core; rank 5, p-value 8.6e-26, support 83, Lewis A. Monosaccharide symbols: fucose (Fuc), galactose (Gal), N-acetylgalactosamine (GalNAc), glucose (Glc), N-acetylglucosamine (GlcNAc).

The obtained patterns can be used to generate a binary feature vector for each glycan, by assigning 1 to a feature if the glycan has the pattern corresponding to this feature, and zero otherwise. We can then apply these feature vectors to a classification problem of glycans by using a linear kernel and an SVM. Fig. 3 shows the performance (measured by AUC, the Area Under the ROC Curve, a standard measure in machine learning) of this approach (displayed as 'Proposed method'), outperforming those using three well-known tree kernels. This result shows that patterns are very useful even for classification, mainly because the patterns are obtained from the given data, meaning that the patterns can change depending upon the given data while tree kernels cannot.

Finally, Table 1 shows a summary of the data mining techniques for glycans, or rooted directed ordered trees, which we have introduced in this review.
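The two filtering steps described above (α-closed pruning, then a one-sided Fisher's exact test against a control dataset) can be sketched with the standard library alone. The pattern names, supports and dataset sizes below are purely illustrative, and the enumeration of candidate frequent subtrees (the actual miner of ref. [23]) is assumed to have been done already.

```python
from math import comb

def alpha_closed(patterns: dict, alpha: float) -> list:
    """Step 1: keep a frequent subtree only if every supertree of it has
    support < alpha * (its own support).
    `patterns` maps a pattern name to (support, set of names of its supertrees)."""
    return [name for name, (sup, supers) in patterns.items()
            if all(patterns[s][0] < alpha * sup for s in supers if s in patterns)]

def fisher_pvalue(k: int, n: int, K: int, N: int) -> float:
    """Step 2: one-sided Fisher's exact test (hypergeometric tail): probability
    of seeing >= k pattern-containing trees among the n given trees, when K of
    all N = given + control trees contain the pattern."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Toy run: supports and supertree relations are illustrative, not real data.
patterns = {
    "core":     (120, {"core+Fuc"}),  # its supertree retains most of its support
    "core+Fuc": (110, set()),
    "rare_big": (40,  set()),
}
minsup, alpha = 30, 0.8
kept = alpha_closed({p: v for p, v in patterns.items() if v[0] >= minsup}, alpha)
# "core" is pruned: support("core+Fuc") = 110 >= 0.8 * 120 = 96

n_given = n_control = 500  # given trees vs. randomly generated control trees
control_support = {"core+Fuc": 20, "rare_big": 35}
ranked = sorted(kept, key=lambda p: fisher_pvalue(
    patterns[p][0], n_given,
    patterns[p][0] + control_support[p], n_given + n_control))
print(ranked)  # most significant pattern first
```

Note that the ranking follows the p-values, not the raw supports, mirroring the behavior seen in Fig. 2.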

Fig. 3. Classification performance (AUC, between 0.91 and 0.94) of the proposed method and of SVMs with three well-known tree kernels (convolution kernel, co-rooted subtree kernel and 3-mer kernel), plotted against α from 0.7 to 1; the proposed method outperforms the three kernels.

Table 1. Summary of the data mining techniques for glycans (rooted directed ordered trees) introduced in this review

Technique                                      References
Supervised learning: kernel-based method
  Glycan kernel                                [11]
Unsupervised learning: probabilistic models
  OTMM                                         [18,19]
  PSTMM                                        [20,21]
  ProfilePSTMM                                 [22]
Frequent pattern mining
  Frequent subtree mining                      [23]

3. Conclusion

We have shown three types of problem settings in data mining and the latest techniques for glycans, or rooted directed ordered trees, in each type. We have further shown that these techniques have proved to be useful for real glycans. In particular, frequent pattern mining automatically recovered well-known patterns embedded in a glycan database, and the obtained patterns outperformed SVMs with tree kernels in a supervised learning setting. This implies that using frequent patterns is a very useful and promising direction for further exploring data mining approaches for glycans.

Received: August 31, 2010

[1] A. Varki, R. Cummings, J. Esko, H. Freeze, G. Hart, J. Marth, 'Essentials of glycobiology', CSHL Press, New York, USA, 1999.
[2] P. Stanley, Curr. Opin. Struct. Biol. 2007, 17, 530.
[3] J. D. Marth, Nat. Cell Biol. 2008, 10, 1015.
[4] S. J. Haslam, S. J. North, A. Dell, Curr. Opin. Struct. Biol. 2006, 16, 584.
[5] R. Raman, S. Raguram, G. Venkataraman, J. C. Paulson, R. Sasisekharan, Nat. Methods 2005, 2, 817.
[6] H. Mamitsuka, Drug Discov. Today 2008, 13, 118.
[7] C. M. Bishop, 'Pattern recognition and machine learning', Springer, New York, USA, 2006.
[8] J. Han, H. Cheng, D. Xin, X. Yan, Data Min. Knowl. Disc. 2007, 15, 55.
[9] J. Shawe-Taylor, N. Cristianini, 'Kernel methods for pattern analysis', Cambridge University Press, Cambridge, UK, 2004.
[10] M. Collins, N. Duffy, in 'Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics', ACL, 2002, p. 263.
[11] Y. Yamanishi, F. Bach, J.-P. Vert, Bioinformatics 2007, 23, 1211.
[12] L. R. Rabiner, B. H. Juang, IEEE ASSP Mag. 1986, 3, 4.
[13] C. Manning, H. Schütze, 'Foundations of Statistical Natural Language Processing', MIT Press, 1999.
[14] R. Durbin, S. R. Eddy, A. Krogh, G. Mitchison, 'Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids', Cambridge University Press, 1998.
[15] K. Lari, S. J. Young, Comput. Speech Lang. 1990, 4, 35.
[16] N. Abe, H. Mamitsuka, Mach. Learn. 1997, 29, 275.
[17] M. Diligenti, P. Frasconi, M. Gori, IEEE T. Pattern Anal. 2003, 25, 519.
[18] K. F. Aoki, N. Ueda, A. Yamaguchi, M. Kanehisa, T. Akutsu, H. Mamitsuka, Bioinformatics 2004, 20, i6.
[19] N. Ueda, K. F. Aoki-Kinoshita, A. Yamaguchi, T. Akutsu, H. Mamitsuka, IEEE T. Knowl. En. 2005, 17, 1051.
[20] K. Hashimoto, K. F. Aoki-Kinoshita, N. Ueda, M. Kanehisa, H. Mamitsuka, in 'Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006)', Eds. T. Eliassi-Rad, L. H. Ungar, M. Craven, D. Gunopulos, ACM Press, 2006, p. 177.
[21] K. Hashimoto, K. F. Aoki-Kinoshita, N. Ueda, M. Kanehisa, H. Mamitsuka, ACM Trans. Knowl. Discov. Data 2008, 2, Article 6.
[22] K. F. Aoki-Kinoshita, N. Ueda, H. Mamitsuka, M. Kanehisa, Bioinformatics 2006, 22, e25.
[23] K. Hashimoto, I. Takigawa, M. Shiga, M. Kanehisa, H. Mamitsuka, Bioinformatics 2008, 24, i167.
[24] P. M. Lanctot, F. H. Gage, A. J. Varki, Curr. Opin. Chem. Biol. 2007, 11, 373.