Gene Function Prediction

Amal S. Perera

Computer Science & Eng.

University of Moratuwa.

Moratuwa, Sri Lanka.

William Perrizo

Computer Science

North Dakota State University

Fargo, ND 58105

Abstract

Comprehensive genome databases provide several catalogues of information on some genomes, such as yeast (Saccharomyces cerevisiae). This paper presents the potential for using gene annotation data, which includes phenotype, localization, protein class, complexes, enzyme catalogues, pathway information, and protein-protein interactions, in predicting the functional class of yeast genes. We predict a rank-ordered list of genes for each functional class from the available information and determine the relevance of the different input data through systematic weighting using a genetic algorithm. The classification task has many unusual aspects, such as multi-valued attributes, classification into overlapping classes, a hierarchical structure of attributes, many unknown values, and interactions between genes. We use a bit-vector approach that treats each domain value as an attribute. The weight optimization uses Receiver Operating Characteristic evaluation in the fitness measure to handle the sparseness of the data.

1  Introduction

During the last three decades, the emergence of molecular and computational biology has shifted classical genetics research toward the study, at the molecular level, of the genome structure of different organisms. The resulting rapid development of more accurate and powerful tools has produced overwhelming volumes of genomic data, which are being accumulated in more than 300 biological databases. Gene sequencing projects on different organisms have helped to identify tens of thousands of potential new genes, yet their biological function often remains unclear. Different approaches have been proposed for large-scale gene function analysis. Traditionally, functions are inferred through sequence similarity algorithms [1] such as BLAST or PSI-BLAST [2]. Similarity searches have some shortcomings: the function of genes that are identified as similar may itself be unknown, or differences in the sequence may be so significant as to make conclusions unclear. For this reason some researchers use the sequence as an input to classification algorithms that predict function [3]. Another common approach to function prediction uses two steps: genes are first clustered based on similarity in expression data, and the clusters are then used to infer function from genes in the same cluster with known function [4]. Alternatively, function has been directly predicted from gene expression data using classification techniques such as Support Vector Machines [5].

We show the use of gene annotation data, which includes phenotype, localization, protein class, complexes, enzyme catalogues, pathway information, and protein-protein interactions, in predicting the functional class of yeast genes. Phenotype data has previously been used to construct individual rules that predict function for certain genes, based on the C4.5 decision tree algorithm [6]. The gene annotation data we used was extracted from the Munich Information Center for Protein Sequences (MIPS) database [7] and then processed and formatted to fit our purpose. MIPS is a genomic repository of data on Saccharomyces cerevisiae (yeast). Functional classes from the MIPS database include, among many others, Metabolism, Protein Synthesis, and Transcription. Each of these is in turn divided into classes and then again into subclasses, yielding a hierarchy of up to five levels. Each gene may have more than one function associated with it. In addition, MIPS has catalogues with other information such as phenotype, localization, protein class, protein complexes, enzyme catalogue, and data on protein-protein interactions. In this work, we predict functions for genes at the highest level of the functional hierarchy. Despite the fact that yeast is one of the most thoroughly studied organisms, the function of 30-40% of its ORFs currently remains unknown. For about 30% of the ORFs no information whatsoever is available, and for the remaining ones unknown attribute values are very common. This lack of information creates interesting challenges that cannot be addressed with standard data mining techniques. We have developed novel tools with the hope of helping biologists by providing experimental direction concerning the function of genes with unknown functions.

2  Experimental Method

Gene annotations are atypical data mining attributes in many ways. Each of the MIPS catalogues has a hierarchical structure. Each attribute, including the class label, gene function, is furthermore multi-valued. We therefore consider each domain value as a separate binary attribute rather than assigning labels to protein classes, phenotypes, etc. For the class label, this means that we must classify genes into overlapping classes, also referred to as multi-label classification, rather than multi-class classification, in which class labels are disjoint. To this end we represent each domain value of each property as a binary attribute that is either one (the gene has the property) or zero (the gene does not have the property). This representation has some similarity to bit-vector representations in Market Basket Research, in which the items in a shopping cart are represented as 1-values in the bit-vector of all items in a store. The classification problem is correspondingly broken up into a separate binary classification problem for each value in the function domain. The resulting classification problem has more than one thousand binary attributes, each of which is very sparse. Two attribute values should be considered related only if both are '1', i.e., both genes have a particular property. Not much can be concluded if two attribute values are '0', i.e., neither gene has the particular property.
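The encoding step described above can be illustrated with a minimal Python sketch; the gene names, property values, and function names here are purely illustrative, not taken from the MIPS data:

```python
# Encode multi-valued gene annotations as binary attribute vectors:
# one bit per domain value, 1 if the gene has that property.

def encode_bit_vectors(genes, domain):
    """genes: dict mapping gene -> set of property values it has.
    domain: ordered list of all possible property values.
    Returns dict mapping gene -> list of 0/1 bits, one per domain value."""
    return {g: [1 if v in props else 0 for v in domain]
            for g, props in genes.items()}

# Hypothetical example data (not from the paper):
genes = {
    "YAL001C": {"nucleus", "kinase"},
    "YBR072W": {"cytoplasm"},
}
domain = ["nucleus", "cytoplasm", "kinase", "phosphatase"]
vectors = encode_bit_vectors(genes, domain)
# vectors["YAL001C"] -> [1, 0, 1, 0]
```

Each gene's annotation set becomes a sparse bit vector over the full domain, which is the flat-table form later compressed into P-trees.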

Classification is furthermore influenced by the hierarchical nature of the attribute domains. Many models exist for the integration of hierarchical data into data mining tasks, such as text classification, mining of association rules, and interactive information retrieval, among others [16][17][18][19]. Recent work [15] introduces similarity measurements that can exploit a hierarchical domain, but focuses on the case where matching attributes are confined to the leaf level of the hierarchy. The data set we consider in this work has poly-hierarchical attributes, where attributes must be matched at multiple levels. Hierarchical information is represented using a separate set of attribute-bit columns for each level, where each binary position indicates the presence or absence of the corresponding category. Evidence for the use of multiple binary similarity metrics is available in the literature, and usage is based on computability and the requirements of the application [22]. In this work we use the similarity metric identified in the literature as "Russell-Rao", defined as follows [22]. Given two binary vectors Zi and Zj with N dimensions (categorical values), the similarity is

S(Zi, Zj) = (1/N) Σ(k=1..N) Zik · Zjk

The above similarity measure counts the number of matching "1" bits in the two binary vectors and divides it by the total number of bits. A similarity metric is a function that assigns a value to the degree of similarity between objects i and j. For each similarity measure, the corresponding dissimilarity is defined as the distance between the two objects; to be categorized as a metric it must be non-negative, symmetric, reflexive, and satisfy the triangle inequality. Dissimilarity functions that show only partial conformity to these properties are considered pseudo-metrics. The corresponding Russell-Rao dissimilarity satisfies the triangle inequality only when M ≤ N/2, where N is the number of dimensions and M is the number of matching "1" bits. For this application we find the use of the above measure appropriate. It is also important to note that in this application the existence of a categorical attribute for a given object is more informative than its non-existence; in other words, a "1" is more valuable than a "0" in the data. Therefore, the count of matching "1"s is more important for the task than the count of matching "0"s. The P-tree data structure we use also allows us to easily count the number of matching "1"s with the use of a root count operation.
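As a concrete sketch, the Russell-Rao similarity and the matching-"1" count M can be computed directly from two bit vectors (plain Python for illustration only; in the actual system this count is obtained from the P-tree root count operation):

```python
# Russell-Rao similarity: number of positions where both bits are 1,
# divided by the total number of positions N.

def russell_rao(zi, zj):
    assert len(zi) == len(zj)
    m = sum(a & b for a, b in zip(zi, zj))  # M: count of matching "1" bits
    return m / len(zi)

# Two vectors agreeing on two "1" bits out of four positions:
# russell_rao([1, 0, 1, 1], [1, 1, 1, 0]) -> 0.5
```

Note that matching "0"s contribute nothing, which matches the observation above that a shared "1" is informative while a shared "0" is not.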

Similarity is calculated by considering the matching similarity at each individual level. The total similarity is the weighted sum of the individual similarities at each level of the hierarchy. The total weight for attributes that match at multiple levels is thereby higher, indicating a closer match. Counting matching values corresponds to a simple additive similarity model. Additive models are often preferred for problems with many attributes because they can better handle the low density in attribute space, also referred to as the "curse of dimensionality" [8].
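The weighted per-level combination can be sketched as follows; the function name and example weights are ours, and the per-level weights are the quantities that the genetic algorithm in the next section optimizes:

```python
def hierarchical_similarity(zi_levels, zj_levels, weights):
    """Weighted sum of per-level Russell-Rao similarities.
    zi_levels, zj_levels: one bit vector per hierarchy level.
    weights: one weight per level."""
    total = 0.0
    for zi, zj, w in zip(zi_levels, zj_levels, weights):
        m = sum(a & b for a, b in zip(zi, zj))  # matching "1" bits at this level
        total += w * m / len(zi)
    return total

# A pair of genes matching at both hierarchy levels accumulates weight
# from both terms, scoring higher than a pair matching only at the top.
```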

3  Similarity Weight Optimization

Similarity models that consider all attributes as equal, such as K-Nearest-Neighbor classification (KNN), work well when all attributes are similar in their relevance to the classification task. This is, however, often not the case. The problem is particularly pronounced for categorical attributes, which can only have two distances: distance zero if attribute values are equal, and one (or some other fixed distance) if they differ. Many solutions have been proposed that weight dimensions according to their relevance to the classification problem. The weighting can be derived as part of the algorithm [9]. In an alternative strategy, the attribute dimensions are scaled, using, e.g., a genetic algorithm, to optimize the classification accuracy of a separate algorithm such as KNN [10]. Our algorithm is similar to the second approach, which is slower but more accurate. Modifications were necessary due to the nature of the data. Because class label values had a relatively low probability of being '1', we chose to use AROC values instead of accuracy as the criterion for the optimization [11]. Nearest-neighbor evaluation was replaced by the counting of matches as described above. We furthermore included importance measurements in the classification that are entirely independent of the neighbor concept. We evaluate the importance of a gene based on the number of possible genetic and physical interactions its protein has with the proteins of other genes. Interactions with lethal genes, i.e., genes that cannot be removed in gene deletion experiments because the organism cannot survive without them, were considered separately. The number of items of known information, such as localization and protein class, was also considered as an importance criterion.
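A minimal sketch of such a weight-optimization loop, assuming a caller-supplied fitness function (in our setting, the AROC of the weighted-match classifier under a given weight vector). The population size, mutation scale, and selection scheme below are illustrative choices, not the paper's actual parameters:

```python
import random

def evolve_weights(fitness, n_weights, pop_size=20, generations=50, seed=0):
    """Minimal genetic-algorithm sketch for real-valued attribute weights:
    truncation selection, uniform crossover, single-gene Gaussian mutation.
    'fitness' maps a weight vector to a score to maximize (e.g., AROC)."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_weights)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]        # keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each weight from either parent.
            child = [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]
            # Mutate one weight, clamped to [0, 1].
            i = rng.randrange(n_weights)
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```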

4  ROC Evaluation

Many measures of prediction quality exist, the best known being prediction accuracy. There are several reasons why accuracy is not a suitable tool for our purposes. One main problem derives from the fact that, commonly, only a few genes are involved (positive) in a given function. This leads to large fluctuations in the number of correctly predicted participant genes (true positives). Secondly, we would like to get a ranking of genes rather than a strict separation into participants and non-participants, since our results may have to be combined with independently derived experimental probability levels. Furthermore, we have to account for the fact that not all functions of all genes have been determined yet. Similarly, there may be genes that are similar to ones involved in the function but are not experimentally seen as such due to masking. Therefore it may be more important, and more feasible, to recognize a potential candidate than to exclude an unlikely one. This corresponds to the situation faced in hypothesis testing: a false negative, i.e., a gene that is not recognized as a candidate, is considered more costly than a false positive, i.e., a gene that is considered a candidate although it is not involved in the function.

The method of choice for this type of situation is ROC (Receiver Operating Characteristic) analysis [11]. ROC analysis is designed to determine the quality of prediction of a given property, such as a gene being involved in a phenotype. Samples that are predicted as positive and indeed have that property are referred to as true positives, whereas samples that are negative but are incorrectly classified as positive are false positives. The ROC curve depicts the rate of true positives as a function of the false positive rate, for all possible probability thresholds. A measure of the quality of prediction is the area under the ROC curve; our prediction results are all given as values of the area under the ROC curve (AROC). To construct a ROC curve, samples are ordered in decreasing likelihood of being positive, and the threshold that delimits prediction as positive is then continuously varied. If all true positive samples are listed first, the ROC curve starts out following the y-axis until all positive samples have been plotted, and then continues horizontally for the negative samples. With appropriate normalization the area under this curve is one. If samples are listed in random order, the rates of true positives and false positives will be equal and the ROC curve will be the diagonal, with area 0.5.
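The AROC can be computed directly from a ranked list without plotting, using the standard identity that the area under the ROC curve equals the probability that a randomly chosen positive sample is ranked above a randomly chosen negative one; the code below is our illustrative sketch, not the paper's implementation:

```python
def area_under_roc(scores, labels):
    """scores: predicted likelihood of being positive (higher = more likely).
    labels: 1 for positive samples, 0 for negative samples."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # Count positive/negative pairs where the positive outranks the
    # negative; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect ranking (all positives scored above all negatives) gives 1.0:
# area_under_roc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]) -> 1.0
```

Because it depends only on the ranking, this measure is stable even when positives are rare, which is exactly the sparse-class situation described above.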

5  Data Representation

The input data was converted to P-trees [12][13][14]. P-trees are a lossless, compressed, and data-mining-ready data structure. This data structure has been successfully applied in data mining applications on real-world data, ranging from classification and clustering with K-Nearest-Neighbor, to classification with decision tree induction, to association rule mining [12][13][14]. A basic P-tree represents one attribute bit, reorganized into a tree structure by recursive sub-division, recording for each division a predicate truth value regarding purity. Each level of the tree contains truth bits that represent pure sub-trees and can be used for fast computation of counts. The construction continues recursively down each tree path until an entirely pure sub-division is reached. Basic and complement P-trees are combined using Boolean algebra operations to produce P-trees for values, entire tuples, value intervals, or any other attribute pattern. The root count of any pattern tree indicates the occurrence count of that pattern, which makes the P-tree data structure well suited to counting patterns efficiently. The data representation can be conceptualized as a flat table in which each row is a bit vector containing a bit for each attribute, or part of an attribute, for each gene. Representing each attribute bit as a basic P-tree generates a compressed form of this flat table.
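A toy sketch of a basic P-tree over a single bit column, heavily simplified: real P-trees use a fixed fan-out, store compressed pure sub-trees, and support Boolean AND/OR/complement between trees, none of which is shown here:

```python
class PTree:
    """Toy P-tree: recursively halve the bit column, stopping at pure runs."""

    def __init__(self, bits):
        self.n = len(bits)
        self.count = sum(bits)                 # number of 1-bits in this subtree
        self.pure = self.count in (0, self.n)  # all-0 or all-1: no recursion needed
        if not self.pure and self.n > 1:
            mid = self.n // 2
            self.left = PTree(bits[:mid])
            self.right = PTree(bits[mid:])

    def root_count(self):
        # Occurrence count of the pattern, read off the root in O(1).
        return self.count

# PTree([1, 0, 1, 1, 0, 0, 1, 0]).root_count() -> 4
```

In the full scheme, ANDing the P-trees of two genes' bit columns and taking the root count of the result yields the matching-"1" count used in the Russell-Rao similarity.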

Experimental class labels and the other categorical attributes were each encoded in single bit columns. Protein interactions were encoded using a bit column for each gene in the data set, where the existence of an interaction with that particular gene was indicated by a truth bit.