Rapidminer-4.3-Tutorial.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
RapidMiner 4.3 User Guide Operator Reference Developer Tutorial 2 Rapid-I GmbH Stockumer Str. 475 44227 Dortmund, Germany http://www.rapidminer.com/ Copyright 2001-2008 by Rapid-I November 22, 2008 Contents 1 Introduction 27 1.1 Modeling Knowledge Discovery Processes as Operator Trees.. 28 1.2 RapidMiner as a Data Mining Interpreter........... 28 1.3 Different Ways of Using RapidMiner .............. 30 1.4 Multi-Layered Data View Concept................ 30 1.5 Transparent Data Handling.................... 31 1.6 Meta Data............................. 31 1.7 Large Number of Built-in Data Mining Operators........ 31 1.8 Extending RapidMiner ..................... 32 1.9 Example Applications....................... 33 1.10 How this tutorial is organized................... 34 2 Installation and starting notes 35 2.1 Download............................. 35 2.2 Installation............................. 35 2.2.1 Installing the Windows executable............ 35 2.2.2 Installing the Java version (any platform)........ 36 2.3 Starting RapidMiner ...................... 36 2.4 Memory Usage........................... 38 2.5 Plugins............................... 38 2.6 General settings.......................... 38 2.7 External Programs......................... 39 2.8 Database Access.......................... 39 3 4 CONTENTS 3 First steps 43 3.1 First example........................... 43 3.2 Process configuration files..................... 46 3.3 Parameter Macros......................... 47 3.4 File formats............................ 48 3.4.1 Data files and the attribute description file........ 49 3.4.2 Model files......................... 53 3.4.3 Attribute construction files................ 53 3.4.4 Parameter set files..................... 54 3.4.5 Attribute weight files................... 54 3.5 File format summary....................... 55 4 Advanced processes 57 4.1 Feature selection.......................... 57 4.2 Splitting up Processes....................... 59 4.2.1 Learning a model..................... 59 4.2.2 Applying the model.................... 59 4.3 Parameter and performance analysis............... 61 4.4 Support and tips.......................... 64 5 Operator reference 67 5.1 Basic operators.......................... 68 5.1.1 ModelApplier....................... 68 5.1.2 ModelGrouper....................... 68 5.1.3 ModelUngrouper...................... 69 5.1.4 ModelUpdater....................... 70 5.1.5 OperatorChain....................... 70 5.2 Core operators........................... 72 5.2.1 CommandLineOperator.................. 72 5.2.2 DataMacroDefinition................... 73 5.2.3 Experiment........................ 74 November 22, 2008 CONTENTS 5 5.2.4 FileEcho.......................... 75 5.2.5 IOConsumer........................ 76 5.2.6 IOMultiplier........................ 76 5.2.7 IORetriever........................ 77 5.2.8 IOSelector......................... 78 5.2.9 IOStorer.......................... 79 5.2.10 MacroDefinition...................... 80 5.2.11 MaterializeDataInMemory................. 81 5.2.12 MemoryCleanUp...................... 81 5.2.13 Process.......................... 82 5.2.14 SQLExecution....................... 83 5.2.15 SingleMacroDefinition................... 83 5.3 Input/Output operators...................... 85 5.3.1 AccessExampleSource................... 85 5.3.2 ArffExampleSetWriter................... 86 5.3.3 ArffExampleSource.................... 86 5.3.4 AttributeConstructionsLoader............... 88 5.3.5 AttributeConstructionsWriter............... 89 5.3.6 AttributeWeightsLoader.................. 90 5.3.7 AttributeWeightsWriter.................. 90 5.3.8 BibtexExampleSource................... 91 5.3.9 C45ExampleSource.................... 92 5.3.10 CSVExampleSetWriter.................. 94 5.3.11 CSVExampleSource.................... 95 5.3.12 CachedDatabaseExampleSource............. 96 5.3.13 ChurnReductionExampleSetGenerator.......... 98 5.3.14 ClusterModelReader.................... 99 5.3.15 ClusterModelWriter.................... 99 5.3.16 DBaseExampleSource................... 100 5.3.17 DatabaseExampleSetWriter................ 100 The RapidMiner 4.3 Tutorial 6 CONTENTS 5.3.18 DatabaseExampleSource................. 101 5.3.19 DirectMailingExampleSetGenerator............ 104 5.3.20 ExampleSetGenerator................... 104 5.3.21 ExampleSetWriter..................... 105 5.3.22 ExampleSource...................... 107 5.3.23 ExcelExampleSetWriter.................. 109 5.3.24 ExcelExampleSource.................... 110 5.3.25 GnuplotWriter....................... 111 5.3.26 IOContainerReader.................... 112 5.3.27 IOContainerWriter..................... 113 5.3.28 IOObjectReader...................... 113 5.3.29 IOObjectWriter...................... 114 5.3.30 MassiveDataGenerator.................. 115 5.3.31 ModelLoader........................ 115 5.3.32 ModelWriter........................ 116 5.3.33 MultipleLabelGenerator.................. 117 5.3.34 NominalExampleSetGenerator............... 118 5.3.35 ParameterSetLoader.................... 119 5.3.36 ParameterSetWriter.................... 119 5.3.37 PerformanceLoader.................... 120 5.3.38 PerformanceWriter.................... 121 5.3.39 ResultWriter........................ 121 5.3.40 SPSSExampleSource................... 122 5.3.41 SalesExampleSetGenerator................ 123 5.3.42 SimpleExampleSource................... 124 5.3.43 SparseFormatExampleSource............... 126 5.3.44 StataExampleSource................... 127 5.3.45 TeamProfitExampleSetGenerator............. 128 5.3.46 ThresholdLoader...................... 129 5.3.47 ThresholdWriter...................... 129 November 22, 2008 CONTENTS 7 5.3.48 TransfersExampleSetGenerator.............. 130 5.3.49 UpSellingExampleSetGenerator.............. 131 5.3.50 WekaModelLoader..................... 131 5.3.51 XrffExampleSetWriter................... 132 5.3.52 XrffExampleSource.................... 133 5.4 Learning schemes......................... 136 5.4.1 AdaBoost......................... 136 5.4.2 AdditiveRegression.................... 137 5.4.3 AgglomerativeClustering................. 138 5.4.4 AssociationRuleGenerator................. 139 5.4.5 AttributeBasedVote.................... 140 5.4.6 Bagging.......................... 140 5.4.7 BasicRuleLearner..................... 141 5.4.8 BayesianBoosting..................... 142 5.4.9 BestRuleInduction..................... 144 5.4.10 Binary2MultiClassLearner................. 145 5.4.11 CHAID........................... 146 5.4.12 ClassificationByRegression................ 147 5.4.13 ClusterModel2ExampleSet................ 148 5.4.14 CostBasedThresholdLearner................ 149 5.4.15 DBScanClustering..................... 150 5.4.16 DecisionStump...................... 151 5.4.17 DecisionTree........................ 152 5.4.18 DefaultLearner....................... 153 5.4.19 EvoSVM.......................... 154 5.4.20 ExampleSet2ClusterModel................ 156 5.4.21 ExampleSet2Similarity................... 157 5.4.22 ExampleSet2SimilarityExampleSet............ 157 5.4.23 FPGrowth......................... 158 5.4.24 FlattenClusterModel.................... 160 The RapidMiner 4.3 Tutorial 8 CONTENTS 5.4.25 GPLearner......................... 160 5.4.26 HyperHyper........................ 161 5.4.27 ID3............................. 162 5.4.28 ID3Numerical....................... 163 5.4.29 IteratingGSS........................ 164 5.4.30 JMySVMLearner...................... 166 5.4.31 KMeans.......................... 168 5.4.32 KMedoids......................... 168 5.4.33 KernelKMeans....................... 169 5.4.34 KernelLogisticRegression................. 171 5.4.35 LibSVMLearner...................... 172 5.4.36 LinearRegression...................... 174 5.4.37 LogisticRegression..................... 175 5.4.38 MetaCost......................... 176 5.4.39 MultiCriterionDecisionStump............... 177 5.4.40 MyKLRLearner...................... 178 5.4.41 NaiveBayes........................ 179 5.4.42 NearestNeighbors..................... 180 5.4.43 NeuralNet......................... 181 5.4.44 OneR........................... 183 5.4.45 Perceptron......................... 183 5.4.46 PolynomialRegression................... 184 5.4.47 PsoSVM.......................... 185 5.4.48 RVMLearner........................ 187 5.4.49 RandomFlatClustering................... 188 5.4.50 RandomForest....................... 189 5.4.51 RandomTree........................ 190 5.4.52 RelativeRegression..................... 191 5.4.53 RelevanceTree....................... 192 5.4.54 RuleLearner........................ 193 November 22, 2008 CONTENTS 9 5.4.55 Similarity2ExampleSet................... 194 5.4.56 Stacking.......................... 195 5.4.57 SubgroupDiscovery.................... 196 5.4.58 SupportVectorClustering................. 197 5.4.59 TopDownClustering.................... 198 5.4.60 TransformedRegression.................. 199 5.4.61 Tree2RuleConverter.................... 200 5.4.62 Vote............................ 201 5.4.63 W-ADTree......................... 202 5.4.64 W-AODE......................... 203 5.4.65 W-AODEsr........................ 204 5.4.66 W-AdaBoostM1...................... 205 5.4.67 W-AdditiveRegression................... 206 5.4.68 W-Apriori......................... 207 5.4.69 W-BFTree......................... 208 5.4.70 W-BIFReader....................... 209 5.4.71 W-Bagging........................ 210 5.4.72 W-BayesNet........................ 211 5.4.73 W-BayesNetGenerator................... 212 5.4.74 W-BayesianLogisticRegression.............