R.ROSETTA: a package for analysis of rule-based classification models

Supplementary material

Mateusz Garbulowski1,*, Klev Diamanti1,#, Karolina Smolińska1,#, Patricia Stoll2, Susanne Bornelöv3, Aleksander Øhrn4 and Jan Komorowski1,*

1Department of Cell and Molecular Biology, Uppsala University, Sweden
2Department of Biosystems Science and Engineering, ETH Zurich, Switzerland
3Wellcome Trust Medical Research Council Stem Cell Institute, University of Cambridge, England
4Department of Informatics, Oslo University, Norway
#These authors contributed equally to the work as second authors.
*Corresponding author
E-mail: [email protected] (Mateusz Garbulowski)
E-mail: [email protected] (Jan Komorowski)

R.ROSETTA is freely available at: https://github.com/komorowskilab/R.ROSETTA

Table of contents

1. Supplementary notes
   1.1. Package architecture
   1.2. Main upgrades
      1.2.1. Undersampling
      1.2.2. Rule p-value estimation
      1.2.3. Vote normalization in class prediction
      1.2.4. Retrieval of support sets from rules
      1.2.5. Rule visualization
   1.3. Data preprocessing
   1.4. Feature selection
   1.5. Feature validation
   1.6. Classification
2. Supplementary figures
3. Supplementary tables
4. Supplementary references

1. Supplementary notes

1.1. Package architecture

The ROSETTA framework comes in a GUI version for Windows systems and a command line version for UNIX-based systems. The R.ROSETTA package is a cross-platform application that uses the command line ROSETTA. However, UNIX-based systems require installation of the Wine compatibility layer. For more information we recommend reading the original ROSETTA articles (Øhrn, 1999; Øhrn, 2000; Øhrn et al., 1998) and the technical reference manual (Øhrn, 2001). Detailed instructions for R.ROSETTA installation, its functions and sample code are available in the package manual.

1.2. Main upgrades

1.2.1 Undersampling

Classification models are trained with labeled examples that are a priori assigned to a decision class. Ideally, each decision class should contain approximately the same number of examples; decision classes that are unequally represented may bias the performance of the model. In such cases, we suggest randomly sampling, without replacement and a sufficient number of times, a number of examples equal to the size of the minority class. This approach to balancing the data (Liu et al., 2009) is generally known as undersampling. We have implemented an option that creates sets of equal sizes by undersampling the larger classes of the input dataset. By default, each example is selected at least once, although the user can set a custom number of sampled sets, as well as a custom size for each set. The classification models built for the undersampled sets are merged into a single model that consists of the unique rules from each classifier. The overall accuracy of the model is estimated as the mean accuracy of all the submodels. Finally, the statistics of each rule are recalculated against the original input dataset.
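The balancing step described above can be sketched as follows. This is a minimal Python illustration of the idea only; R.ROSETTA itself is an R package, and the function and argument names here are invented for the example.

```python
import random

def undersample(examples, labels, n_sets=None, seed=0):
    """Split an imbalanced dataset into class-balanced subsets by
    sampling each larger class without replacement down to the size
    of the minority class.  Illustrative sketch, not the R.ROSETTA API."""
    rng = random.Random(seed)
    by_class = {}
    for example, label in zip(examples, labels):
        by_class.setdefault(label, []).append(example)
    minority = min(len(members) for members in by_class.values())
    if n_sets is None:
        # heuristic default: enough subsets for every example of the
        # largest class to be drawn at least once on average
        n_sets = max(-(-len(m) // minority) for m in by_class.values())
    subsets = []
    for _ in range(n_sets):
        subset = []
        for label, members in by_class.items():
            subset.extend((e, label) for e in rng.sample(members, minority))
        subsets.append(subset)
    return subsets
```

Each returned subset contains the same number of examples per class; a classifier would then be trained on each subset and the resulting rule sets merged.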

1.2.2 Rule p-value estimation

Classification models generated by R.ROSETTA consist of sets of rules whose number varies with the reduct computation algorithm. For the Johnson algorithm (Johnson, 1974) this set is manageable, while for the Genetic algorithm (Wróblewski, 1995) it is considerably larger (Supplementary Figure S3; Supplementary Table S4). In both cases, supervised pruning of rules from the models does not heavily affect the overall performance of the classifier. To better assess the quality of each rule, we assume a hypergeometric distribution to calculate p-values, followed by multiple testing correction. The hypergeometric test estimates how strongly the support of a rule is overrepresented in its decision class relative to the total number of examples. Models optimized in this way capture the essential interactions among features and can be applied to external data.
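The hypergeometric test above can be written out directly. The following Python sketch (illustrative names, not R.ROSETTA code) computes a right-tailed hypergeometric p-value for a single rule and applies the Bonferroni correction used throughout this supplement.

```python
from math import comb

def rule_p_value(n_total, n_class, n_support, n_correct):
    """Right-tailed hypergeometric p-value of a rule: the probability of
    drawing at least `n_correct` objects of the rule's decision class in
    `n_support` draws (the rule's support set) from `n_total` objects,
    `n_class` of which belong to that class."""
    upper = min(n_class, n_support)
    tail = sum(
        comb(n_class, k) * comb(n_total - n_class, n_support - k)
        for k in range(n_correct, upper + 1)
    )
    return tail / comb(n_total, n_support)

def bonferroni(p_values):
    """Bonferroni multiple-testing correction: scale each p-value by the
    number of tests and cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

For example, a rule whose support set of 5 objects all belong to a class of 5 out of 10 examples has a p-value of 1/C(10,5), i.e. about 0.004 before correction.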

1.2.3 Vote normalization in class prediction

Rule-based models allow straightforward class prediction: the rules can be applied directly to the classification of unseen data. Votes from the rules are collected for each object individually, and the predominant class is examined. For some models, an imbalanced number of rules per decision class is generated. We propose to adjust for this rule imbalance by normalizing the vote counts. R.ROSETTA implements several vote normalization methods: the number of votes can be divided by its mean, median, maximum, the total number of rules, or the square root of the sum of squares. We compared the performance of these methods in Supplementary Table S8. An additional necessary condition is that the validation set contains the same set of features as the training set. If this condition is fulfilled, the class prediction is performed by applying the rules to the data and the votes are estimated.
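One plausible reading of the normalization options listed above is sketched below in Python: the votes gathered by each class are divided by a per-class factor derived from that class's rules. This is a hedged illustration of the idea, assuming per-class factors; it is not the exact R.ROSETTA implementation.

```python
from math import sqrt
from statistics import mean, median

# Per-class normalization factors; "rulnum" divides by the number of
# rules that voted for the class, the others by a statistic of the
# per-rule vote counts.
_FACTORS = {
    "mean": mean,
    "median": median,
    "max": max,
    "srss": lambda v: sqrt(sum(x * x for x in v)),
    "rulnum": len,
}

def normalize_votes(rule_votes, method="mean"):
    """Adjust class votes for rule imbalance.  `rule_votes` maps each
    decision class to the list of votes cast by the rules of that class;
    the summed votes are divided by the factor chosen by `method`."""
    if method not in _FACTORS:
        raise ValueError(f"unknown normalization method: {method}")
    factor = _FACTORS[method]
    return {cls: sum(v) / factor(v) for cls, v in rule_votes.items()}
```

With "rulnum", for instance, a class backed by many weak rules no longer dominates a class backed by a few strong ones.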

1.2.4 Retrieval of support sets from rules

R.ROSETTA is able to recover support sets that represent the contribution of the objects to rules (Supplementary Figure S4). As a result, each rule is characterized by the set of objects that fulfill its IF or IF-THEN component. Knowing this information has several advantages. Support sets help uncover objects whose variable levels share common patterns; such sets may be further investigated to identify specific subsets within decision classes. Moreover, support sets of low-accuracy rules allow detecting objects that may potentially introduce a bias into the model.
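Conceptually, recovering the support set of a rule's IF-part is a matching operation over discretized objects, as this short Python sketch shows (data structures and names are invented for illustration):

```python
def support_set(objects, conditions):
    """Return the IDs of objects satisfying every descriptor in a rule's
    IF-part.  `objects` maps object ID -> {feature: discrete level};
    `conditions` is the rule's {feature: level} conjunction.
    Illustrative sketch, not the R.ROSETTA API."""
    return [
        obj_id for obj_id, features in objects.items()
        if all(features.get(f) == level for f, level in conditions.items())
    ]
```

Restricting the matched objects further to those carrying the rule's decision class yields the IF-THEN support set.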

1.2.5 Rule visualization

The package provides several ways of visualizing rules. One of these is a heatmap (Supplementary Figure S4) that illustrates, for each object, the discrete levels of each feature that is a member of the selected conjunctive rule. The heatmap groups the objects into those that belong to the support set for the given class, those that do not belong to the support set for the given class, and the remaining objects from the other classes. The heatmap can assist the interpretation of individual rules of interest and the visualization of interactions. A more holistic approach displays the entire model as an interaction network (Shmulevich et al., 2002). The package allows exporting the rules in a line-by-line format that is compatible with rule visualization software such as Ciruvis (Bornelöv et al., 2014) or VisuNet (Supplementary Figure S5) (Anyango, 2016). These approaches provide different views of the rule-based models that allow discovering both known (proof-of-concept) and novel interactions among the features (Dramiński et al., 2016; Enroth et al., 2012).

1.3. Data preprocessing

The expression data (Alter et al., 2011) for autistic and healthy young males (Supplementary Table S1) was downloaded from the GEO repository under the accession number GSE25507. Expression levels of 54678 probes were measured with the Affymetrix Human Genome U133 Plus 2.0 array. The so-called autism-control dataset was loaded and processed using the getGEO function from the GEOquery R library (Davis and Meltzer, 2007). The data was read and normalized with the Robust Multi-array Average (RMA) method using the ReadAffy and rma functions from the affy R package (Gautier et al., 2004). Gene names were annotated using the AnnotationDbi (Pages et al., 2008) and hgu133plus2.db (Carlson, 2016) R packages. To identify unknown probe names, the annotation table HG-U133_Plus_2.na36 (http://www.affymetrix.com/site/mainPage.affx) was processed and the probe coordinates were intersected with the unknown probes using the GenomicRanges (Lawrence et al., 2013) R package. For one probe the gene name could not be identified due to the lack of coordinates. The clinical data was investigated for potential batch effects using Pearson correlation with the rcorr function from the Hmisc R library (Harrell and Dupont, 2008). The age of the subjects was highly correlated with the outcome (Supplementary Figure S1, Supplementary Table S1). The sva R package (Leek et al., 2012) was used to correct the data for the effect of age on the decision.

1.4. Feature selection

The decision table was ill-defined, with the number of features (in this case genes) being much larger than the number of samples. Dimensionality reduction of the autism-control dataset was performed with the Fast Correlation-Based Filter (FCBF) method (Yu and Liu, 2003), available as a function in the Biocomb R package (Novoselova et al., 2018). This feature selection method retains features with a predominant correlation with the decision while removing redundant ones. We chose FCBF as the method that selected the highest number of important features among the tested software (Supplementary Table S2). Another advantage of using FCBF is the compatibility of its discretization method with R.ROSETTA. The FCBF function was set to Equal Frequency discretization with 3 states. The method selected 35 features with an Information Gain above 0 (Supplementary Table S3). This step can alternatively be performed with other dimensionality reduction methods such as Monte Carlo Feature Selection (Dramiński et al., 2010; Dramiński and Koronacki, 2018), Boruta (Kursa and Rudnicki, 2010), Student's t-test, the caret R package (Kuhn, 2008) and many others.
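The Information Gain criterion used for the IG > 0 screening above has a compact definition: IG(decision; feature) = H(decision) − H(decision | feature). A minimal Python sketch of this computation for a discretized feature (illustrative only; the actual selection was done with the Biocomb FCBF implementation):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, decision):
    """IG(decision; feature) = H(decision) - H(decision | feature)
    for a discretized feature."""
    n = len(decision)
    groups = {}
    for value, label in zip(feature, decision):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(decision) - conditional
```

A feature whose discretized levels perfectly separate the classes attains the full decision entropy as its gain, while an uninformative feature scores 0 and is dropped by the IG > 0 threshold.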

1.5. Feature validation

We validated the gene list by exploring databases and articles reporting genes related to autism. We intersected the set of genes obtained after feature selection with the genes listed in commonly used databases of autism spectrum disorder (ASD) associations: SFARI (Abrahams et al., 2013), https://gene.sfari.org/; ASD (Krishnan et al., 2016), genes with p-value < 0.05, http://asd.princeton.edu/; and AutDB (Basu et al., 2009), http://autism.mindspec.org/autdb/Welcome.do (Supplementary Figure S2). Additionally, we searched for human gene information in GeneCards (Rebhan et al., 1997) and performed a literature review to find genes that likely match ASD associations. Sample genes that are likely associated with autism are described in the results section of the main article. We also highlighted genes that had previously been linked to the brain or the nervous system: migraine and headache – PPOX (Makki et al., 2017), smell perception – OR51B5 (Chen et al., 2018) and ataxia – ATXN8OS (Moseley et al., 2006). We found that NCKAP5L is a gene likely to be involved in neurodevelopmental dysfunction in autism (Chahrour et al., 2012) and showed that the expression of the NCKAP5L gene is down-regulated in non-autistic subjects (Supplementary Figure S2; Supplementary Figure S5). Among the most relevant features, we discovered a group of zinc fingers (Alter et al., 2011), such as ZSCAN18/ZNF447, ZFP36L2 and KLF8/ZNF741, and a group of genes related to the control of calcium homeostasis (Palmieri et al., 2010), such as SCIN, NCS1 and CAPS2.

We also utilized the VisuNet framework, which supports visualization and exploration of rule-based models. We created a network from the Johnson-based model to illustrate the division of the subnetworks for autism and control (Supplementary Figure S5). The visualization allowed us to categorize genes in various ways; for instance, we observed nodes with many connections, which can be defined as hubs.
We investigated the networks (Supplementary Figure S5) for each class separately and identified the genes forming hubs; for autism: 234817_at, MAP7, NCKAP5L, NPR2, PPOX, ZSCAN18, and for control: COX2, CSTB, PSMG4, RHPN1, SCIN, TSPOAP1, ZFP36L2. Additionally, the edge connection value determined the strongest interactions in the autism class: COX2-MAP7, COX2-RHPN1, CSTB-NCS1, NPR2-PSMG4, PSMG4-TSPOAP1, PSMG4-ZSCAN18, SCIN-ZFP36L2.

1.6. Classification

Rough set-based classification is a transparent machine learning method for constructing rule-based models. Rules are commonly characterized by two fundamental measures: accuracy and support. Accuracy represents the correctness of the rule, while support is the number of examples the rule applies to. The final decision table was constructed upon 35 important genes, 146 objects/samples and the decision class (autism or control). The models were created using 10-fold Cross-Validation with the standard voter method, equal frequency discretization into 3 states, discernibility of the objects, and the Bonferroni method to adjust rule p-values for multiple testing. Undersampling was omitted, since the distribution of objects between the classes was not heavily unbalanced (Supplementary Table S1). The paper focuses on the Johnson (Johnson, 1974) and Genetic (Wróblewski, 1995) reduction methods. For reasons of model simplicity and rule significance (Supplementary Figure S3, Supplementary Table S4), most of the interpretations were based on the Johnson reducer. In the final Johnson-based model, the rules for autism and control contained 31 and 33 genes, respectively, of which 29 were shared between the decision classes (Supplementary Figure S2).
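The two rule measures named above have direct definitions, sketched here in Python (data layout and names are invented for the example; this is not R.ROSETTA code):

```python
def rule_statistics(objects, conditions, decision_class):
    """Support and accuracy of an IF-THEN rule.  Support is the number
    of objects matching the IF-part; accuracy is the fraction of those
    that also belong to the rule's decision class.  `objects` is a list
    of ({feature: level}, class) pairs."""
    matched = [
        cls for features, cls in objects
        if all(features.get(f) == level for f, level in conditions.items())
    ]
    support = len(matched)
    accuracy = matched.count(decision_class) / support if support else 0.0
    return support, accuracy
```

For instance, a rule IF COX2(up) THEN autism that matches 40 objects of which 34 are autistic would have support 40 and accuracy 0.85, as in Supplementary Table S7.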

2. Supplementary figures

Supplementary Figure S1. A heatmap of Pearson correlation r values derived from the clinical data. Stars denote the significance level of the correlations (*p < 0.05, **p < 0.01, ***p < 0.001). The decision column is separated by dark blue lines to illustrate the effect of the clinical data on the decision.


Supplementary Figure S2. A heatmap illustrating gene characteristics: expression levels in autism (GEA) and control (GEC), presence of the feature in rules for autism (RA) and control (RC), and the intersection of autism-related genes with the SFARI, AutDB and ASD databases. Features are sorted by Information Gain (IG). The gene expression level in autism is mostly unchanged (gray) or up-regulated (red), while for the control it is mostly unchanged (gray) or down-regulated (green). The intersection shows that 11 genes are likely to be associated with autism.



Supplementary Figure S3. Distributions of rule p-values for the rule reduction methods: uncorrected (a), Bonferroni-corrected (b) and Bonferroni-corrected after rule recalculation (c). The histograms compare the autism-control models for the Genetic and Johnson reducers. Due to the large imbalance in the number of rules between the reducers, rule frequencies are shown as percentages. Rules with support 1-5 and accuracy 0-0.6 were filtered out of the model. A dashed gray line marks the 0.05 significance threshold.



Supplementary Figure S4. Sample rule heatmaps representing the contribution of objects to rule estimation. Heatmaps are displayed for one of the most significant rules of each class of the autism-control model. The rules were generated by the Johnson reducer and recalculated. The significance level is denoted by (*p < 0.05, **p < 0.01, ***p < 0.001). The rule is displayed under each plot. The colors on the heatmap symbolize three levels of gene expression: down-regulated (green), unchanged (gray) and up-regulated (red). The numbers in round brackets after the gene names correspond to the level of gene expression: down-regulated (1), unchanged (2) and up-regulated (3).



Supplementary Figure S5. Feature interaction networks generated by the VisuNet framework. The networks are displayed for the rules of different outcomes: control (a), autism (b) and combined control-autism (c). The colors of the nodes symbolize three levels of gene expression: down-regulated (green), unchanged (gray) and up-regulated (red). The colors of the edges symbolize the strength of the connection: gray (weak), yellow (marginally strong), orange (strong), red (very strong). The thickness of the blue border of a node corresponds to the number of rules. Rules with support 1-5 and accuracy 0-0.6 were filtered out of the model. The rules were generated by the Johnson reducer and the model was recalculated. The plot was generated with the default parameters of VisuNet. Network (c) is a characteristic example of a class division into subnetworks.

3. Supplementary tables

Supplementary Table S1. Correlation between the outcome and the clinical data of the autism-control dataset. The p-values were estimated with Student's t-test. No significant differences were found between batches or parental ages, while a highly significant correlation between subject age and outcome was identified.

                      control          autism           p-value
Number of samples     64               82               -
Subject age           7.9±2.1          5.5±2.1          4.26×10⁻⁹
Maternal age          29.7±5           30.1±5.6         0.71
Paternal age          30.3±5.3         31.5±6.3         0.27
Batch                 B1(32), B2(32)   B1(36), B2(46)   0.46

Supplementary Table S2. A comparison of the performance of 4 dimensionality reduction methods applied to the autism-control dataset. For Boruta and Student's t-test, a 0.05 p-value threshold was used. The Fast Correlation-Based Filter (FCBF) method selected genes with an Information Gain greater than 0. Monte Carlo Feature Selection (MCFS) estimated a Relative Importance (RI) threshold with the critical angle method.

                      Boruta     FCBF      MCFS         Student's t-test
R package             Boruta     Biocomb   rmcfs        stats
Threshold             p < 0.05   IG > 0    RI > 0.036   p < 0.05
Number of features    12         35        16           13
Data discretization   No         Yes       No           No

Supplementary Table S3. The result of the Fast Correlation-Based Filter (FCBF) method applied to the autism-control dataset. The list is sorted decreasingly by Information Gain (IG), and the position of each feature in the ranking is given in the first column. The mapping between gene ID and microarray probe ID is shown in the last two columns.

position   IG           gene ID      probe ID
1          0.10860441   MAP7         202890_at
2          0.0995046    COX2         1553569_at
3          0.094478     NCKAP5L      1562457_at
4          0.08567925   ZSCAN18      217593_at
5          0.08391652   RHPN1        235998_at
6          0.07773368   PPOX         238118_s_at
7          0.07625543   NPR2         204310_s_at
8          0.07608236   NCS1         222570_at
9          0.07128859   PSMG4        233443_at
10         0.06793192   SCIN         1552367_a_at
11         0.06564102   CSTB         236449_at
12         0.06244449   TSPOAP1      205839_s_at
13         0.06222545   TCP11L1      205796_at
14         0.06185506   234817_at    234817_at
15         0.06116627   TMLHE-AS1    1560797_s_at
16         0.0606206    PSMD4        200882_s_at
17         0.05976664   ZFP36L2      201367_s_at
18         0.05938284   B3GNT7       1555962_at
19         0.05752864   MSI2         225238_at
20         0.05732106   CAPS2        224370_s_at
21         0.05700458   MIR646HG     1562051_at
22         0.05581372   CLDN17       221328_at
23         0.05510055   BAHD1        203051_at
24         0.05288485   OR51B5       1570516_s_at
25         0.050778     C11orf95     218641_at
26         0.0490634    ATXN8OS      216404_at
27         0.04738852   NRG2         242303_at
28         0.04694185   LOC400655    216703_at
29         0.04694185   GJA9         221415_s_at
30         0.0446669    VPS8         234028_at
31         0.04336461   FLRT2        240259_at
32         0.0388937    C1QTNF7      239349_at
33         0.03606374   KLF8         219930_at
34         0.03568021   CWF19L2      1566515_at
35         0.03116567   DEPDC1       222958_s_at

Supplementary Table S4. A comparison of the efficiency of the Johnson and Genetic reducers for basic (not recalculated) and recalculated rules. The data was discretized using the Equal Frequency method and the model was constructed with 10-fold Cross-Validation using the standard voter classification method. Rules with support 1-5 and accuracy 0-0.6 were filtered out of the models, and the p-values were Bonferroni-corrected.

                                 Johnson   Genetic
Mean accuracy                    82%       91%
Mean AUC                         0.891     0.996
Total number of rules            189       25571
Basic rules
  Number of rules *p < 0.05      112       12
  Number of rules **p < 0.01     80        2
  Number of rules ***p < 0.001   28        1
Recalculated rules
  Number of rules *p < 0.05      134       56
  Number of rules **p < 0.01     96        30
  Number of rules ***p < 0.001   48        6

Supplementary Table S5. Overall statistics of the Johnson reducer model for the autism-control dataset. The content was inspected for basic (not recalculated) and recalculated rules. The Bonferroni correction was applied to the p-values. The rule ratio for a given class was calculated as the number of rules for that class divided by the number of rules for the opposite class.

                               control        autism
Number of rules                79             110
Rule ratio                     0.72           1.39
Basic rules
  Mean support                 14.76          15.4
  Mean accuracy                0.94           0.99
  Lowest p-value               5.14×10⁻⁶      6.70×10⁻⁵
  Highest ranked interaction   ZSCAN18-NPR2   MAP7-COX2
Recalculated rules
  Mean support                 16.24          16.95
  Mean accuracy                0.91           0.96
  Lowest p-value               6.36×10⁻⁷      6.70×10⁻⁵
  Highest ranked interactions  ZSCAN18-NPR2   PSMG4-TSPOAP1, NCS1-CSTB

Supplementary Table S6. Lists of features retrieved from the rules of the recalculated Johnson model. The genes were selected for a Bonferroni-corrected p-value < 0.001. The table contains: genes from rules for the autism class, genes from rules for the control class, genes in the intersection of the autism and control lists, genes used only to classify autism, and genes used only to classify control. Genes that may suggest an association with autism (Main article – Results, Supplementary Note – Feature validation, Supplementary Figure S2) are marked in red bold font. The proportion of autism-related genes is given below each list. Genes are sorted decreasingly by frequency and rule p-value.

autism (proportion of autism-related genes: 67%):
TSPOAP1, COX2, NCS1, RHPN1, CSTB, PSMG4, MAP7, FLRT2, SCIN, TMLHE-AS1, ZFP36L2, MIR646HG, TCP11L1, BAHD1, NPR2

control (50%):
MAP7, NCKAP5L, NPR2, PPOX, 234817_at, ZSCAN18, ATXN8OS, CAPS2, MSI2, OR51B5, TSPOAP1, B3GNT7, C1QTNF7, CLDN17, NCS1, VPS8, PSMG4, LOC400655, SCIN, KLF8, C11orf95, COX2, CSTB, PSMD4

Intersection of autism and control (63%):
TSPOAP1, COX2, NCS1, CSTB, PSMG4, MAP7, SCIN, NPR2

Unique for autism (71%):
RHPN1, FLRT2, TMLHE-AS1, ZFP36L2, MIR646HG, TCP11L1, BAHD1

Unique for control (44%):
NCKAP5L, PPOX, 234817_at, ZSCAN18, ATXN8OS, CAPS2, MSI2, OR51B5, B3GNT7, C1QTNF7, CLDN17, VPS8, LOC400655, KLF8, C11orf95, PSMD4

Supplementary Table S7. A review of the top 5 rules for the autism and control classes. The rules were generated from the recalculated Johnson model with Bonferroni-adjusted p-values; rules with support 1-5 and accuracy 0-0.6 were filtered out. The table shows: the rule ID in the order of the rules, the rule in IF-THEN format, rule accuracy, rule support and p-value. The rules are sorted by increasing p-value.

ID   Rule                                                Accuracy   Support   p-value
control
1    IF ZSCAN18(down) AND NPR2(no change) THEN control   1          21        6.36×10⁻⁷
2    IF MAP7(up) AND NCKAP5L(down) THEN control          0.96       23        1.21×10⁻⁶
3    IF NPR2(no change) AND CAPS2(up) THEN control       1          19        1.82×10⁻⁶
4    IF MAP7(up) AND ATXN8OS(down) THEN control          1          19        5.14×10⁻⁶
5    IF NCKAP5L(down) AND 234817_at(down) THEN control   0.95       21        9.40×10⁻⁶
autism
14   IF PSMG4(up) AND TSPOAP1(up) THEN autism            1          23        6.70×10⁻⁵
15   IF NCS1(no change) AND CSTB(down) THEN autism       1          23        6.70×10⁻⁵
23   IF MAP7(no change) AND COX2(up) THEN autism         0.96       26        1.07×10⁻⁴
24   IF COX2(up) THEN autism                             0.85       40        1.14×10⁻⁴
25   IF RHPN1(up) AND SCIN(no change) THEN autism        1          22        1.38×10⁻⁴

Supplementary Table S8. The performance of vote normalization methods in reclassifying the autism-control dataset. Several p-value based filtration settings were investigated for the Johnson and Genetic reducers. The vote counts were normalized by different factors: median, mean, maximum (max), square root of the sum of squares (srss) or the number of rules (rulnum). The values represent accuracy measures.

Normalization method    none   median   mean   max    srss   rulnum   Mean accuracy
Johnson, *p < 0.05      0.59   0.86     0.87   0.64   0.86   0.62     0.74
Genetic, *p < 0.05      0.49   0.69     0.85   0.60   0.81   0.69     0.69
Johnson, **p < 0.01     0.72   0.86     0.88   0.64   0.86   0.62     0.76
Genetic, **p < 0.01     0.84   0.49     0.87   0.58   0.75   0.57     0.68
Johnson, ***p < 0.001   0.88   0.66     0.88   0.66   0.85   0.67     0.77
Genetic, ***p < 0.001   -      -        -      -      -      -        -
Mean accuracy           0.70   0.71     0.87   0.62   0.83   0.63     -

4. Supplementary references

Abrahams, B.S. et al. (2013) SFARI Gene 2.0: a community-driven knowledgebase for the autism spectrum disorders (ASDs). Mol Autism, 4.1, 36.
Alter, M.D. et al. (2011) Autism and increased paternal age related changes in global levels of gene expression regulation. PloS one, 6.2, e16715.
Anyango, S.O.O. (2016) VisuNet: Visualizing Networks of feature interactions in rule-based classifiers.
Basu, S.N. et al. (2009) AutDB: a gene reference resource for autism research. Nucleic Acids Res, 37.Database issue, D832-836.
Bornelöv, S. et al. (2014) Ciruvis: a web-based tool for rule networks and interaction detection using rule-based classifiers. BMC Bioinformatics, 15.1, 139.
Carlson, M. (2016) hgu133plus2.db: Affymetrix Human Genome U133 Plus 2.0 Array annotation data (chip hgu133plus2).
Chahrour, M.H. et al. (2012) Whole-exome sequencing and homozygosity analysis implicate depolarization-regulated neuronal genes in autism. PLoS genetics, 8.4, e1002635.
Chen, Z. et al. (2018) The diversified function and potential therapy of ectopic olfactory receptors in non-olfactory tissues. Journal of cellular physiology, 233.3, 2104-2115.
Davis, S. and Meltzer, P.S. (2007) GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics, 23.14, 1846-1847.
Dramiński, M. et al. (2016) Discovering networks of interdependent features in high-dimensional problems. Big Data Analysis: New Algorithms for a New Society, 285-304.
Dramiński, M. et al. (2010) Monte Carlo feature selection and interdependency discovery in supervised classification. In Advances in Machine Learning II. Springer, 371-385.
Dramiński, M. and Koronacki, J. (2018) rmcfs: An R Package for Monte Carlo Feature Selection and Interdependency Discovery. Journal of Statistical Software, 85.1, 1-28.
Enroth, S. et al. (2012) Combinations of histone modifications mark exon inclusion levels. PloS one, 7.1, e29911.
Gautier, L. et al. (2004) affy--analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20.3, 307-315.
Harrell, F.E. and Dupont, C. (2008) Hmisc: Harrell miscellaneous. R package version 4.1-1. https://CRAN.R-project.org/package=Hmisc. (2019-01-25 last accessed).
Johnson, D.S. (1974) Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9.3, 256-278.
Krishnan, A. et al. (2016) Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nature neuroscience, 19.11, 1454.
Kuhn, M. (2008) Building predictive models in R using the caret package. Journal of statistical software, 28.5, 1-26.
Kursa, M.B. and Rudnicki, W.R. (2010) Feature selection with the Boruta package. J Stat Softw, 36.11, 1-13.
Lawrence, M. et al. (2013) Software for computing and annotating genomic ranges. PLoS computational biology, 9.8, e1003118.
Leek, J.T. et al. (2012) The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics, 28.6, 882-883.
Liu, X.-Y. et al. (2009) Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 39.2, 539-550.
Makki, A.Y. et al. (2017) An Unusual Cause of Headache and Fatigue in a Division 1 Collegiate Athlete. Clinical Journal of Sport Medicine, 27.4, e58.
Moseley, M.L. et al. (2006) Bidirectional expression of CUG and CAG expansion transcripts and intranuclear polyglutamine inclusions in spinocerebellar ataxia type 8. Nature genetics, 38.7, 758.
Novoselova, N. et al. (2018) Biocomb: Feature Selection and Classification with the Embedded Validation Procedures for Biomedical Data Analysis. R package version 0.4. https://CRAN.R-project.org/package=Biocomb. (1 October 2018 last accessed).
Øhrn, A. (1999) Discernibility and rough sets in medicine: tools and applications. Norwegian University of Science and Technology, Trondheim, Norway, 41-51.
Øhrn, A. (2000) The Rosetta C++ Library: Overview of files and classes. Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway.
Øhrn, A. (2001) Rosetta technical reference manual. Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway.
Øhrn, A. et al. (1998) The design and implementation of a knowledge discovery toolkit based on rough sets: the ROSETTA system. In Polkowski, L. and Skowron, A. (eds), Rough Sets in Knowledge Discovery 1. Physica-Verlag.
Pages, H. et al. (2008) AnnotationDbi: Annotation database interface. R package version 1.2.
Palmieri, L. et al. (2010) Altered calcium homeostasis in autism-spectrum disorders: evidence from biochemical and genetic studies of the mitochondrial aspartate/glutamate carrier AGC1. Molecular psychiatry, 15.1, 38.
Rebhan, M. et al. (1997) GeneCards: integrating information about genes, proteins and diseases. Trends in Genetics, 13.4, 163.
Shmulevich, I. et al. (2002) Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics, 18.2, 261-274.
Wróblewski, J. (1995) Finding minimal reducts using genetic algorithms. Proceedings of the Second Annual Joint Conference on Information Sciences, 2, 186-189.
