CHIMIA 2011, 65, No. 1/2 · Glycochemistry Today · doi:10.2533/chimia.2011.10 · Chimia 65 (2011) 10–13 · © Schweizerische Chemische Gesellschaft

Glycoinformatics: Data Mining-based Approaches

Hiroshi Mamitsuka*

Abstract: Oligosaccharides and glycans are major cellular macromolecules, working for a variety of vital biological functions. Due to long-term efforts by experimentalists, the number of structurally distinct, determined carbohydrates now exceeds 10,000. As a result, data mining-based approaches for glycans (or trees, in a computer science sense) have attracted attention and have been developed over the last five years, presenting new techniques even from computer science viewpoints. This review summarizes cutting-edge techniques for glycans in each of the three categories of data mining: classification, clustering and frequent pattern mining, and shows results obtained by applying these techniques to real sets of structures.

Keywords: Data mining · Frequent subtrees · Glycan structures · Machine learning · Probabilistic models

*Correspondence: Prof. H. Mamitsuka, Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji 611-0011, Japan; Tel.: +81 774 38 3023; Fax: +81 774 38 3037; E-mail: [email protected]

1. Introduction

Oligosaccharides and glycans are major cellular macromolecules associated with a variety of important biological phenomena, including antigen–antibody interaction[1] and cell fate control.[2] Proteins and DNAs consist of twenty types of amino acids and four types of nucleotides, respectively. Likewise, glycans have building blocks, called monosaccharides, which however are more diverse, the major ones being fructose, galactose, glucose and mannose.[3] The uniqueness of carbohydrates lies in the connection of monosaccharides, where two or more monosaccharides can connect to one monosaccharide without any cycle, forming branch-shaped extensions. In a computer science sense, glycans can be trees, consisting of nodes and edges (which connect nodes), corresponding to monosaccharides and chemical linkages, respectively. Moreover, glycans are rooted directed ordered trees because i) a glycan connects to a protein (an amino acid) by only one monosaccharide, called the root, ii) connections are directed from the root to the leaves, meaning that of two connected monosaccharides one is a parent (closer to the root) and the other a child (closer to a leaf), with ancestors and descendants defined in a similar manner, and iii) monosaccharides (children) can be ordered by the carbon numbers attached to their linkages to another monosaccharide (the parent), meaning that children are ordered from the oldest sibling to the youngest sibling. Hereafter a subtree means a node in a tree together with all its descendants and the edges from that node to the leaves, while a supertree of a tree T is a tree having T as a subtree. This unique structure of carbohydrates makes them different from other macromolecules such as proteins and DNAs, which are simpler sequences of building blocks. Furthermore, this uniqueness has made it hard to determine the tree structures of glycans experimentally, by which glycan structure databases have been kept small and their growth slow. However, due to the development of research, the current number of structurally different glycans in a major database on glycans reaches more than ten thousand, and it will be further increased in the near future by high-throughput techniques.[4] Thus, in glycoinformatics, developing efficient approaches for mining rules or patterns from trees and applying these approaches to a glycan database would be a reasonable and promising way to find the biological significance embedded in glycan structures.[5,6]

2. Mining from Glycan Structures: Data Mining Techniques for Trees

In general, current approaches for mining from data, or machine learning, can be classified into roughly three types: i) classification (or supervised learning), ii) clustering (or unsupervised learning) and iii) frequent pattern mining.[7,8] Here we briefly explain the difference between these concepts, under which the input is always called examples (which are in our case rooted directed ordered trees). In classification, examples are labeled, i.e. class labels are attached to examples, and the purpose is to build a classifier/predictor which can predict the class to be assigned to a newly given example. This is supervised learning. On the other hand, in clustering, examples are not labeled, and the purpose is to assign examples to groups by which we can see the similarity between examples: if two examples are in the same group, they are similar. This is unsupervised learning, which can give labels to examples, whereas in supervised learning the labels are given. In frequent pattern mining, the input is unlabeled examples (so in some sense this is also unsupervised learning, but in this review we do not categorize frequent pattern mining as unsupervised learning). The purpose is to find patterns which appear a larger number of times than a prefixed number, by which we can see what patterns appear frequently.

2.1 Classification over Trees

Currently the most popular approach in supervised learning is the so-called support vector machine (SVM), in which we use a kernel function that represents the similarity between two examples.[9] The objective here is to find a classifier which can separate examples in one class from
those in the other class (if class labels are binary). The simplest way is to use a linear function, corresponding to using an inner product as the kernel function. However, it would be hard to separate examples according to class labels by using a simple linear function. This means that it is important to design a good kernel function by which we can separate examples easily. For the case that examples are trees, a kernel function for computing the similarity between two input trees has already been proposed.[10] The idea behind this 'tree' kernel is to check the subtrees which appear commonly in the two input trees: the bigger the common subtrees and/or the larger their number, the more similar the two input trees should be. This can be computed in a very efficient manner by using dynamic programming, and a tree kernel specialized for glycans was also proposed.[11]
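To make the subtree-matching idea concrete, here is a minimal sketch of a co-rooted fragment-counting kernel with a memoized (dynamic-programming) recursion. This is a simplified illustration in the spirit of convolution tree kernels, not the exact kernels of refs [10,11]; the Node class, labels and toy structures are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A monosaccharide in a glycan, modeled as a node of a rooted ordered tree."""
    label: str                                            # monosaccharide type, e.g. 'Gal'
    children: List["Node"] = field(default_factory=list)  # ordered oldest -> youngest

def _common(u: Node, v: Node, memo: dict) -> int:
    """Number of common tree fragments co-rooted at u and v (memoized recursion)."""
    key = (id(u), id(v))
    if key not in memo:
        if u.label != v.label or len(u.children) != len(v.children):
            memo[key] = 0
        else:
            # Each paired child slot is either cut off or extended by a common fragment.
            c = 1
            for cu, cv in zip(u.children, v.children):
                c *= 1 + _common(cu, cv, memo)
            memo[key] = c
    return memo[key]

def _nodes(t: Node) -> List[Node]:
    out, stack = [], [t]
    while stack:
        n = stack.pop()
        out.append(n)
        stack.extend(n.children)
    return out

def tree_kernel(t1: Node, t2: Node) -> int:
    """K(t1, t2) = total number of common subtree fragments over all node pairs."""
    memo: dict = {}
    return sum(_common(u, v, memo) for u in _nodes(t1) for v in _nodes(t2))

# Two toy 'glycans' sharing a Gal->GlcNAc branch (structures illustrative only)
g1 = Node("Glc", [Node("Gal", [Node("GlcNAc")])])
g2 = Node("Man", [Node("Gal", [Node("GlcNAc")])])
print(tree_kernel(g1, g2))  # 3 common fragments: GlcNAc, Gal, Gal->GlcNAc
```

Such a kernel value can be plugged directly into an SVM, since only pairwise similarities between training examples are needed.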

2.2 Clustering: Probabilistic Models for Trees

There are a variety of approaches in unsupervised learning. In this review, we focus on probabilistic models, which are models with probability parameters to be estimated from given data. Besides clustering, probabilistic models also allow binary classification if the two classes are, say, the examples in question and all others. In this case, after training a probabilistic model on the examples in question, we can compute the likelihood of any given example, showing how likely the given example is to belong to the class in question.

A standard probabilistic model for time-series data or sequences is the hidden Markov model (HMM), which has been used in many applications, including speech recognition,[12] natural language processing[13] and the analysis of biological sequences, e.g. amino acid sequences.[14] A HMM can be defined by a state transition diagram, in which states are connected by edges. A HMM (or its state transition diagram) has two types of probability parameters, one being letter generation probabilities attached to states to generate letters, i.e. amino acid types, and the other being state transition probabilities attached to edges. A standard assumption on the Markov process is the first-order Markov property, which is very simple for sequences and means that the current state depends upon only the state one step away. Thus, given a sequence, we can just repeat the following two steps, from left to right on the sequence: we first generate the corresponding letter, i.e. an amino acid, at a state with a letter generation probability and then transit to another state with a state transition probability. Using these probabilities, we can compute the likelihood that a given sequence is generated under a HMM (or its state transition diagram). Training (or learning) the probability parameters of a HMM is usually based on maximum likelihood, which means that the probability parameters are estimated so that the likelihoods of the given sequences are maximized. There are a variety of extensions of HMMs, major ones being context-free grammars[15] and tree grammars[16] to deal with complex dependencies in sequences. Another direction is models for trees, the first trial being the hidden tree Markov model (HTMM),[17] in which the left-to-right scheme on a sequence is simply applied parent-to-child over a tree. This model can be applied to glycans. However, glycans are rooted directed ordered trees, while ordered children are not considered in HTMM at all. Thus the ordered tree Markov model (OTMM)[18,19] and the probabilistic sibling-dependent tree Markov model (PSTMM)[20,21] have been developed for glycans.

Fig. 1 shows the difference between HTMM, OTMM and PSTMM. Fig. 1(a) shows a sample rooted directed ordered tree, where the top square means the root, solid lines show directed edges (chemical linkages) from parents to children and dotted lines show ordered children. Fig. 1(b) shows a HTMM over the tree in (a), where you can see that the thick lines of state transitions are always directed from parents to children. Fig. 1(c) shows an OTMM over the tree in (a), where you can see that the thick lines of state transitions are from parents to children if the children are the oldest; otherwise the thick lines are from older siblings to younger siblings. That is, OTMM considers ordered children as well as parent-to-child dependencies, via parents and the oldest siblings. Fig. 1(d) shows a PSTMM over the tree in (a), where you can see that the thick lines of state transitions are always from parents to children as well as from older siblings to younger siblings. This means that PSTMM always considers both parent-to-child dependencies and dependencies between ordered children.

Fig. 1. (a) A sample rooted directed ordered tree and probabilistic dependencies of (b) HTMM, (c) OTMM and (d) PSTMM.

Thus we can see that PSTMM is the most complex and well-considered
model for glycans. However, we need to think about other points, such as the computational complexity of probabilistic models. HTMM is a simple modification of HMM from left-to-right to parent-to-child, meaning that these two models share the same time and space complexities, which are roughly O(n³), where n is the size of the given sequences or trees and the state transition diagram is assumed to be the complete graph. This good property of HTMM is also shared by OTMM, which keeps the same complexities. However, the time and space complexities of PSTMM are higher, roughly O(n⁴). This difference is computationally significant, because the state transition diagram can be a left-to-right one, by which the complexity of OTMM can be O(n²), while PSTMM must keep O(n³). Another issue is that the number of glycans we can have is always relatively small, meaning that a complex model is likely to overfit to the training data. In fact, ref. [19] shows that the predictive performance of PSTMM is high for training data but very low for test data, while that of OTMM is high for both training and test data, because the high complexity of PSTMM makes this model overfit to the training data. These results imply that OTMM is the most appropriate probabilistic model for glycans. On the other hand, PSTMM was applied to aligning multiple glycans by modifying it into a model called ProfilePSTMM,[22] a modification similar to that of HMM into ProfileHMM. This modification reduces the complexity of the original PSTMM, by which ProfilePSTMM can avoid the overfitting issue.

2.3 Mining Frequent Subtrees

Classification and clustering are very useful in that they can give some information to examples. On the other hand, if classifiers or probabilistic models are complex, it is hard to obtain any rules from the given examples. Mining frequent patterns is useful to find rules or patterns embedded in given examples, and the obtained patterns can further be used for classification and clustering. Thus frequent pattern mining has been developed for different types of examples, such as items,[8] trees[23] and graphs.[8] Here we focus on trees, i.e. glycans, and in fact conserved patterns have already been reported in glycans,[24] implying that similar subtree patterns could still be hidden and not yet found.

Given a set of trees, the number of trees having a certain subtree is called the support of the subtree. A subtree is frequent if its support is equal to or higher than a prefixed number, which is called the minimum support or minsup. In general, we can have a huge number of frequent subtrees, which are at the same time redundant, since if a subtree is frequent, all its subtrees are frequent. This means that frequent subtrees are sometimes very small and meaningless. To overcome these issues in mining frequent subtrees, we present a new method which has two important features: 1) controlling the number of outputted frequent subtrees and 2) using hypothesis testing to output only significant subtrees.[23] That is, we first reduce the number of frequent subtrees by step 1 and then remove meaningless frequent subtrees by step 2. Step 1 was realized by the concept called α-closed frequent subtrees, which means that a frequent subtree T is not outputted if the support of its supertree is larger than or equal to α × the support of T, where α takes a value between zero and one. That is, the intuitive idea of α-closed frequent subtrees is that a frequent subtree T is not outputted if the support of its supertree has a value similar to the support of T. Step 2 needs a control dataset, which is for example randomly generated, and the support of the frequent subtrees in the control dataset is computed. Step 2 then runs Fisher's exact test to retrieve only the frequent subtrees which appear often in the given dataset but not so frequently in the control dataset. The final results are ranked by the p-values of Fisher's exact test.

Fig. 2 shows the top five patterns obtained by running this approach over all glycans in a glycan database, revealing that each of the five patterns corresponds to a known, conserved pattern of glycans. Furthermore, you can see that these patterns are not ranked by their supports but by their p-values.

Fig. 2. Top five significant frequent subtrees: rank 1, p-value 1.6e-46, support 381, Lewis X; rank 2, p-value 1.1e-40, support 164, O-glycan core; rank 3, p-value 5.0e-26, support 109, glycosphingolipid core; rank 4, p-value 5.6e-26, support 233, glycosphingolipid core; rank 5, p-value 8.6e-26, support 83, Lewis A. Monosaccharide symbols: fucose (Fuc), galactose (Gal), N-acetylgalactosamine (GalNAc), glucose (Glc), N-acetylglucosamine (GlcNAc).

The obtained patterns can be used to generate a binary feature vector for each glycan, by assigning 1 to a feature if the glycan has the pattern corresponding to this feature, and zero otherwise. We can then apply these feature vectors to a classification problem of glycans by using a linear kernel and an SVM. Fig. 3 shows the performance (measured by AUC, the Area Under the ROC Curve, a standard measure in machine learning) of this approach (displayed as 'Proposed method'), outperforming those using three well-known tree kernels. This result shows that patterns are very useful even for classification, mainly because the patterns are obtained from the given data, meaning that the patterns can change depending upon the given data while tree kernels cannot.

Finally, Table 1 shows a summary of the data mining techniques for glycans, or rooted directed ordered trees, which we have introduced in this review.
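The two filtering steps described above (α-closed pruning, then a one-sided Fisher's exact test against a control dataset) can be sketched with the standard library alone. The pattern names, supports and dataset sizes below are purely illustrative, and the enumeration of candidate frequent subtrees (the actual miner of ref. [23]) is assumed to have been done already.

```python
from math import comb

def alpha_closed(patterns: dict, alpha: float) -> list:
    """Step 1: keep a frequent subtree only if every supertree of it has
    support < alpha * (its own support).
    `patterns` maps a pattern name to (support, set of names of its supertrees)."""
    return [name for name, (sup, supers) in patterns.items()
            if all(patterns[s][0] < alpha * sup for s in supers if s in patterns)]

def fisher_pvalue(k: int, n: int, K: int, N: int) -> float:
    """Step 2: one-sided Fisher's exact test (hypergeometric tail): probability
    of seeing >= k pattern-containing trees among the n given trees, when K of
    all N = given + control trees contain the pattern."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Toy run: supports and supertree relations are illustrative, not real data.
patterns = {
    "core":     (120, {"core+Fuc"}),  # its supertree retains most of its support
    "core+Fuc": (110, set()),
    "rare_big": (40,  set()),
}
minsup, alpha = 30, 0.8
kept = alpha_closed({p: v for p, v in patterns.items() if v[0] >= minsup}, alpha)
# "core" is pruned: support("core+Fuc") = 110 >= 0.8 * 120 = 96

n_given = n_control = 500  # given trees vs. randomly generated control trees
control_support = {"core+Fuc": 20, "rare_big": 35}
ranked = sorted(kept, key=lambda p: fisher_pvalue(
    patterns[p][0], n_given,
    patterns[p][0] + control_support[p], n_given + n_control))
print(ranked)  # most significant pattern first
```

Note that the ranking follows the p-values, not the raw supports, mirroring the behavior seen in Fig. 2.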

Fig. 3. Classification performance (AUC, between 0.91 and 0.94) of the proposed method and of SVMs with three well-known tree kernels (convolution kernel, co-rooted subtree kernel and 3-mer kernel), plotted against α from 0.7 to 1; the proposed method outperforms the three kernels.

Table 1. Summary of the data mining techniques for glycans (rooted directed ordered trees) introduced in this review

Technique                                      References
Supervised learning: kernel-based method
  Glycan kernel                                [11]
Unsupervised learning: probabilistic models
  OTMM                                         [18,19]
  PSTMM                                        [20,21]
  ProfilePSTMM                                 [22]
Frequent pattern mining
  Frequent subtree mining                      [23]

3. Conclusion

We have shown three types of problem settings in data mining and the latest techniques for glycans, or rooted directed ordered trees, in each type. We have further shown that these techniques have proved to be useful for real glycans. In particular, frequent pattern mining automatically recovered well-known patterns embedded in a glycan database, and the obtained patterns outperformed SVMs with tree kernels in a supervised learning setting. This implies that using frequent patterns is a very useful and promising direction for further exploring data mining approaches for glycans.

Received: August 31, 2010

[1] A. Varki, R. Cummings, J. Esko, H. Freeze, G. Hart, J. Marth, 'Essentials of glycobiology', CSHL Press, New York, USA, 1999.
[2] P. Stanley, Curr. Opin. Struct. Biol. 2007, 17, 530.
[3] J. D. Marth, Nat. Cell Biol. 2008, 10, 1015.
[4] S. J. Haslam, S. J. North, A. Dell, Curr. Opin. Struct. Biol. 2006, 16, 584.
[5] R. Raman, S. Raguram, G. Venkataraman, J. C. Paulson, R. Sasisekharan, Nat. Methods 2005, 2, 817.
[6] H. Mamitsuka, Drug Discov. Today 2008, 13, 118.
[7] C. M. Bishop, 'Pattern recognition and machine learning', Springer, New York, USA, 2006.
[8] J. Han, H. Cheng, D. Xin, X. Yan, Data Min. Knowl. Disc. 2007, 15, 55.
[9] J. Shawe-Taylor, N. Cristianini, 'Kernel methods for pattern analysis', Cambridge University Press, Cambridge, UK, 2004.
[10] M. Collins, N. Duffy, in 'Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics', ACL, 2002, p. 263.
[11] Y. Yamanishi, F. Bach, J.-P. Vert, Bioinformatics 2007, 23, 1211.
[12] L. R. Rabiner, B. H. Juang, IEEE ASSP Mag. 1986, 3, 4.
[13] C. Manning, H. Schütze, 'Foundations of Statistical Natural Language Processing', MIT Press, 1999.
[14] R. Durbin, S. R. Eddy, A. Krogh, G. Mitchison, 'Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids', Cambridge University Press, 1998.
[15] K. Lari, S. J. Young, Comput. Speech Lang. 1990, 4, 35.
[16] N. Abe, H. Mamitsuka, Mach. Learn. 1997, 29, 275.
[17] M. Diligenti, P. Frasconi, M. Gori, IEEE T. Pattern Anal. 2003, 25, 519.
[18] K. F. Aoki, N. Ueda, A. Yamaguchi, M. Kanehisa, T. Akutsu, H. Mamitsuka, Bioinformatics 2004, 20, i6.
[19] N. Ueda, K. F. Aoki-Kinoshita, A. Yamaguchi, T. Akutsu, H. Mamitsuka, IEEE T. Knowl. En. 2005, 17, 1051.
[20] K. Hashimoto, K. F. Aoki-Kinoshita, N. Ueda, M. Kanehisa, H. Mamitsuka, in 'Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006)', Eds. T. Eliassi-Rad, L. H. Ungar, M. Craven, D. Gunopulos, ACM Press, 2006, p. 177.
[21] K. Hashimoto, K. F. Aoki-Kinoshita, N. Ueda, M. Kanehisa, H. Mamitsuka, ACM Trans. Knowl. Discov. Data 2008, 2, Article 6.
[22] K. F. Aoki-Kinoshita, N. Ueda, H. Mamitsuka, M. Kanehisa, Bioinformatics 2006, 22, e25.
[23] K. Hashimoto, I. Takigawa, M. Shiga, M. Kanehisa, H. Mamitsuka, Bioinformatics 2008, 24, i167.
[24] P. M. Lanctot, F. H. Gage, A. J. Varki, Curr. Opin. Chem. Biol. 2007, 11, 373.