R.ROSETTA: a Package for Analysis of Rule-Based Classification Models

Supplementary material

Mateusz Garbulowski1,*, Klev Diamanti1,#, Karolina Smolińska1,#, Patricia Stoll2, Susanne Bornelöv3, Aleksander Øhrn4 and Jan Komorowski1,*

1Department of Cell and Molecular Biology, Uppsala University, Sweden
2Department of Biosystems Science and Engineering, ETH Zurich, Switzerland
3Wellcome Trust Medical Research Council Stem Cell Institute, University of Cambridge, England
4Department of Informatics, Oslo University, Norway

#These authors contributed equally to the work as second authors.
*Corresponding author
E-mail: [email protected] (Mateusz Garbulowski)
E-mail: [email protected] (Jan Komorowski)

R.ROSETTA is freely available at: https://github.com/komorowskilab/R.ROSETTA

Table of contents

1. Supplementary notes
1.1. Package architecture
1.2. Main upgrades
1.1.1 Undersampling
1.1.2 Rule p-value estimation
1.1.3 Votes normalization in class prediction
1.1.4 Retrieval of support sets from rules
1.1.5 Rule visualization
1.3. Data preprocessing
1.4. Feature selection
1.5. Feature validation
1.6. Classification
2. Supplementary figures
3. Supplementary tables
4. Supplementary references

1. Supplementary notes

1.1. Package architecture

The ROSETTA framework comes in a GUI version for Windows systems and a command-line version for UNIX-based systems. The R.ROSETTA package is a cross-platform application that uses the command-line version of ROSETTA. However, UNIX-based systems require installation of the compatibility layer software Wine. For more information, we recommend reading the original ROSETTA articles (Øhrn, 1999; Øhrn, 2000; Øhrn et al., 1998) and the technical reference manual (Øhrn, 2001). Detailed instructions for R.ROSETTA installation, a description of its functions and sample code are available in the package manual.

1.2. Main upgrades

1.1.1 Undersampling

Classification models are trained with labeled examples that are a priori assigned to a decision class. Ideally, each decision class should contain approximately the same number of examples. Decision classes that are not represented equally, however, may bias the performance of the model.
In such cases, we suggest randomly sampling, without replacement, a number of examples equal to the size of the minority class, and repeating this a sufficient number of times. This approach to balancing the data (Liu et al., 2009) is generally known as undersampling. We have implemented an option that creates sets of equal sizes by undersampling the larger classes of the input dataset. By default, each example is selected at least once, although the user can set a custom number of sampled sets, as well as a custom size for each set. The classification models built for the undersampled sets are merged into a single model that consists of the unique rules from each classifier. The overall accuracy of the model is estimated as the mean accuracy of all the submodels. Finally, all the statistics of each rule are recalculated on the original input dataset.

1.1.2 Rule p-value estimation

Classification models generated by R.ROSETTA consist of a set of rules whose number varies with the reduct computation algorithm used. With the Johnson algorithm (Johnson, 1974) this set is manageable, while with the Genetic algorithm (Wróblewski, 1995) it is considerably larger (Supplementary Figure S3; Supplementary Table S4). In both cases, supervised pruning of rules from the models would not heavily affect the overall performance of the classifier. To better assess the quality of each rule, we assume a hypergeometric distribution to calculate p-values, followed by multiple testing correction. The hypergeometric test follows the idea of estimating how strongly the rule support is represented relative to the total number of examples. Models optimized in this way illustrate the essential interactions among features and can be applied to external data.

1.1.3 Votes normalization in class prediction

The rule-based models allow straightforward class prediction. Rules can be applied for the classification of unseen data.
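How a conjunctive IF-THEN rule is applied to an unseen object can be illustrated with a minimal Python sketch. It is not the package's R implementation: the rules, feature names and levels are hypothetical, and casting exactly one vote per firing rule is an assumption made for simplicity.

```python
from collections import Counter

# Hypothetical rules: IF every condition holds THEN vote for `decision`.
RULES = [
    ({"geneA": "high", "geneB": "low"}, "autism"),
    ({"geneC": "medium"}, "control"),
    ({"geneA": "high"}, "autism"),
]


def rule_fires(conditions, obj):
    """A conjunctive rule fires when every condition matches the object."""
    return all(obj.get(feature) == level for feature, level in conditions.items())


def count_votes(rules, obj):
    """Collect one vote (an assumption) per firing rule for its decision class."""
    return Counter(
        decision for conditions, decision in rules if rule_fires(conditions, obj)
    )


# Hypothetical unseen object with discretized feature levels.
unseen = {"geneA": "high", "geneB": "low", "geneC": "low"}
votes = count_votes(RULES, unseen)
predicted = votes.most_common(1)[0][0] if votes else None
```

The object must carry the same features as the rules refer to; a missing feature simply prevents the corresponding rule from firing in this sketch.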
We collect the votes from the rules for each object individually and examine the predominance of each class. For some models, an imbalanced number of rules per outcome is generated. We propose to adjust for this rule imbalance by normalizing the vote counts. We implemented various vote normalization methods in R.ROSETTA. Vote normalization can be performed by dividing the number of votes by its mean, median, maximum, the total number of rules or the square root of the sum of squares. We compared the performance of these methods in Supplementary Table S8. An additional necessary condition is that the validation set shall contain the same set of features as the training set. If this condition is fulfilled, the class prediction is performed by applying the rules to the data and the votes are estimated.

1.1.4 Retrieval of support sets from rules

R.ROSETTA is able to recover support sets that represent the contribution of the objects to rules (Supplementary Figure S4). As a result, each rule is characterized by the set of objects that fulfill the given IF or IF-THEN component. Knowing this information has several advantages. Support sets help uncover objects whose variable levels might share patterns. Such sets may be further investigated to uncover specific subsets within decision classes. Moreover, low-accuracy support sets allow detecting objects that may potentially introduce a bias in the model.

1.1.5 Rule visualization

The package provides several ways of visualizing rules. One of these is a heatmap (Supplementary Figure S4) that illustrates, for each object, the discrete levels of each feature that is a member of the selected conjunctive rule. The heatmap groups the objects into those that belong to the support set for the given class, those of the given class that do not belong to the support set, and the remaining objects from the other classes.
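The three-way grouping of objects can be sketched in a few lines of Python. This is an illustration rather than the package's code: the function name `group_objects`, the object names and the feature levels are hypothetical, and the support set is assumed to hold the objects of the given class that satisfy every condition of the rule.

```python
def group_objects(objects, conditions, decision):
    """Split objects into the three heatmap groups for one rule.

    objects:    {name: (feature_levels, class_label)}
    conditions: {feature: level}, the IF part of a conjunctive rule
    decision:   the class named in the THEN part
    """
    groups = {"support": [], "non_support": [], "other_class": []}
    for name, (features, label) in objects.items():
        if label != decision:
            groups["other_class"].append(name)
        elif all(features.get(f) == v for f, v in conditions.items()):
            groups["support"].append(name)
        else:
            groups["non_support"].append(name)
    return groups


# Hypothetical discretized data for the rule
# IF geneA = high AND geneB = low THEN autism.
objects = {
    "s1": ({"geneA": "high", "geneB": "low"}, "autism"),
    "s2": ({"geneA": "high", "geneB": "high"}, "autism"),
    "s3": ({"geneA": "low", "geneB": "low"}, "control"),
    "s4": ({"geneA": "high", "geneB": "low"}, "control"),
}
groups = group_objects(objects, {"geneA": "high", "geneB": "low"}, "autism")
```

Object `s4` matches the IF part but carries a different class label, so in this sketch it is placed with the other classes; such objects are the ones that lower the accuracy of a rule.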
The heatmap can assist the interpretation of individual rules of interest and the visualization of interactions. A more holistic approach displays the entire model as an interaction network (Shmulevich et al., 2002). The package allows exporting the rules in a line-by-line format that is compatible with rule visualization software such as Ciruvis (Bornelöv et al., 2014) or VisuNet (Supplementary Figure S5) (Anyango, 2016). These approaches provide a different point of view on the rule-based models and allow discovering known proof-of-concept and novel interactions among the features (Dramiński et al., 2016; Enroth et al., 2012).

1.3. Data preprocessing

The gene expression data (Alter et al., 2011) for autistic and healthy young males (Supplementary Table S1) was downloaded from the GEO repository under the accession number GSE25507. Expression levels of 54678 genes were measured with the Affymetrix Human Genome U133 Plus 2.0 array. The so-called autism-control dataset was loaded and processed using the getGEO function from the GEOquery R library (Davis and Meltzer, 2007). The data was normalized with the Robust Multi-array Average (RMA) method using the ReadAffy and rma functions from the affy R package (Gautier et al., 2004). Gene names were annotated using the AnnotationDbi (Pages et al., 2008) and hgu133plus2.db (Carlson, 2016) R packages. To identify unknown probe names, the annotation table HG-U133_Plus_2.na36 (http://www.affymetrix.com/site/mainPage.affx) was processed and the probe coordinates were intersected with the unknown probes using the GenomicRanges (Lawrence et al., 2013) R package. For one probe, the gene name could not be identified due to the lack of coordinates. The clinical data was investigated for potential batch effects using Pearson correlation with the rcorr function from the Hmisc R library (Harrell and Dupont, 2008). The age of the samples was highly correlated with the outcome (Supplementary

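As an illustration of the undersampling scheme described in section 1.1.1, the following Python sketch builds balanced index sets. It is not the package's R implementation: the function name `undersample`, its parameters, the fixed seed and the strategy of slicing a shuffled class into chunks (topping up short chunks at random) are assumptions chosen so that, by default, each example is selected at least once.

```python
import math
import random


def undersample(labels, n_sets=None, set_size=None, seed=0):
    """Return lists of indices in which every class has the same size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    minority = min(len(members) for members in by_class.values())
    size = minority if set_size is None else set_size
    if size > minority:
        raise ValueError("set_size cannot exceed the minority class size")
    if n_sets is None:
        # Just enough sets for every example of the largest class to appear.
        largest = max(len(members) for members in by_class.values())
        n_sets = math.ceil(largest / size)
    sets = [[] for _ in range(n_sets)]
    for members in by_class.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        for k in range(n_sets):
            # Consecutive chunks guarantee coverage of the whole class.
            chunk = shuffled[k * size:(k + 1) * size]
            # Top up a short chunk without repeating an index within a set.
            rest = [i for i in members if i not in chunk]
            chunk += rng.sample(rest, size - len(chunk))
            sets[k].extend(chunk)
    return sets


# Hypothetical imbalanced labels: 3 minority and 7 majority examples.
labels = ["autism"] * 3 + ["control"] * 7
balanced_sets = undersample(labels)
```

Following section 1.1.1, one classifier would then be trained per balanced set, the unique rules merged into a single model, and the rule statistics recalculated on the original dataset.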
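The rule p-value estimation of section 1.1.2 can likewise be sketched in Python. The parametrization is an assumption, since the text does not spell it out: N is the total number of objects, K the number of objects in the rule's decision class, n the number of objects matching the IF part, and k the number matching both the IF part and the decision. The p-value is the upper tail P(X >= k). Bonferroni is used here only as one example of a multiple testing correction; the text does not state which method the package applies.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Upper-tail probability P(X >= k) for X ~ Hypergeometric(N, K, n)."""
    upper = min(n, K)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / comb(N, n)


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


# Hypothetical rule: 10 objects in total, 5 in the decision class,
# 3 objects match the IF part and all 3 belong to the decision class.
p = hypergeom_pvalue(k=3, n=3, K=5, N=10)
adjusted = bonferroni([p, 0.5, 0.04])
```

Rules whose adjusted p-value exceeds a chosen threshold could then be pruned from the model, in line with the supervised pruning mentioned in section 1.1.2.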
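The vote normalization options listed in section 1.1.3 can be sketched as follows. Several details are assumptions, because the text does not define them: each rule is given a vote weight (here its support), the votes of the fired rules are summed per class, and the normalizer is computed over the weights of all rules in the model that predict that class. The method names and the toy model are hypothetical.

```python
import math
import statistics

# Hypothetical model: (decision class, vote weight) for each rule.
# Using the rule support as the vote weight is an assumption.
MODEL = [
    ("autism", 12), ("autism", 8), ("autism", 10), ("autism", 6),
    ("control", 9), ("control", 15),
]

NORMALIZERS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "max": max,
    "n_rules": len,  # total number of rules for the class
    "root_sum_squares": lambda w: math.sqrt(sum(x * x for x in w)),
}


def normalized_votes(model, fired, method):
    """Sum the votes of fired rules per class and divide by the normalizer."""
    weights = {}
    for decision, weight in model:
        weights.setdefault(decision, []).append(weight)
    raw = {decision: 0 for decision in weights}
    for i in fired:
        decision, weight = model[i]
        raw[decision] += weight
    normalize = NORMALIZERS[method]
    return {d: raw[d] / normalize(weights[d]) for d in raw}


# Rules 0, 1, 2 (autism) and 4 (control) fire for a hypothetical object.
fired = [0, 1, 2, 4]
by_rule_count = normalized_votes(MODEL, fired, "n_rules")
by_maximum = normalized_votes(MODEL, fired, "max")
```

In this toy model the "autism" class has twice as many rules as the "control" class, so dividing by a class-level quantity reduces the advantage that a larger rule set gives to raw vote counts.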