WEKA & KNIME Open Source Machine Learning Tools
Abd-ur-Rehman Sajid Mahmood

Agenda
• Introduction
• List of Open Source Machine Learning Tools
  – WEKA
  – KNIME
• Supported Formats by WEKA & KNIME
  – CSV
  – ARFF
• Techniques presented
• Data Sets Used
• Demonstration

Introduction
• Open source software is becoming increasingly accepted.
• A variety of open source Machine Learning tools is available.
• Equally popular among researchers and practitioners.
• Increasing demand for integrated environments to experiment with and evaluate Machine Learning algorithms.

List of Open Source Machine Learning Tools
• Weka 3, Data Mining Software in Java
• KNIME, Konstanz Information Miner (Java)
• D2K, Data to Knowledge (Java)
• RapidMiner (formerly YALE, Yet Another Learning Environment) (Java)
• Orange, a component-based data mining software (C++)
• MLC++, a library of C++ classes for supervised machine learning

WEKA: Main Features
• 49 data preprocessing tools
• 76 classification/regression algorithms
• 8 clustering algorithms
• 10 feature selection algorithms
• 3 algorithms for finding association rules
• 3 graphical user interfaces
  – “The Explorer” (exploratory data analysis)
  – “The Experimenter” (experimental environment)
  – “The KnowledgeFlow” (new process model inspired interface)

WEKA Purpose
• Used for research, education, and applications
• Main features:
  – Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods
  – Graphical user interfaces (incl. data visualization)
  – Environment for comparing learning algorithms
• Can be used in two different ways:
  – User approach
    • Experimenter and Explorer options
  – Developmental approach
    • Using the compressed library source code

User Approach
• The Explorer view allows options for:
  – Import Data
    • From files in various formats, from a URL, or from an SQL database (using JDBC)
  – Pre-processing
    • Tools in WEKA are called “filters”
  – Classification
    • Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets
  – Clustering
    • k-Means, EM, Cobweb, X-means, FarthestFirst
  – Associations
    • Contains a version of the Apriori algorithm; works only with discrete data

Supported File Formats
• CSV
• ARFF
• URL
• Database using a JDBC connection

Flat file in .CSV format (Heart-Disease)
    Age,sex,chest_pain_type,cholesterol,exercise_induced_angina,class
    63,male,typ_angina,233,no,not_present
    67,male,asympt,286,yes,present
    67,male,asympt,229,yes,present
    38,female,non_anginal,?,no,not_present

Flat file in .ARFF format (Heart-Disease)
• WEKA only deals with flat files, e.g.,
    @relation heart-disease
    @attribute age numeric
    @attribute sex { female, male}
    @attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
    @attribute cholesterol numeric
    @attribute exercise_induced_angina { no, yes}
    @attribute class { present, not_present}
    @data
    63,male,typ_angina,233,no,not_present
    67,male,asympt,286,yes,present
    67,male,asympt,229,yes,present
    38,female,non_anginal,?,no,not_present

KNIME: Interactive Data Exploration
Features:
• Modular Data Pipeline Environment
• Large collection of Data Mining techniques
• Data and Model Visualizations
• Interactive Views on Data and Models
• Java Code Base as Open Source Project
• Integration with: R Library, Weka, etc.
• Based on the Eclipse plug-in technology
• Easy extensibility
• New nodes via open API and integrated wizard

Data Sets Used
• Manually Generated
  – 2 features
  – 3 classes
  – 10 instances per class
• Iris Data Set
  – 4 features
  – 3 classes
  – 50 instances per class

Manually Generated
    X     Y     class | X     Y     class | X     Y     class
    2.2   2.9   c1    | 7.2   2.9   c2    | 7.2   7.9   c3
    3.1   2.1   c1    | 7.9   2.1   c2    | 8.1   7.1   c3
    2.5   2.9   c1    | 7.5   2.9   c2    | 7.5   7.9   c3
    2.6   3.3   c1    | 7.6   3.3   c2    | 7.6   8.3   c3
    2.5   2.1   c1    | 7.5   2.1   c2    | 7.5   7.1   c3
    2.8   2.6   c1    | 7.8   2.6   c2    | 7.8   7.6   c3
    3     2.4   c1    | 7.4   2.4   c2    | 8     7.4   c3
    3.1   3.1   c1    | 8.1   3.1   c2    | 7.4   8.1   c3
    2.8   3.1   c1    | 7.8   3.1   c2    | 7.8   8.1   c3
    3.1   3.3   c1    | 8.1   3.3   c2    | 7.3   8.3   c3

[Figure: scatter plot of the manually generated data set; three series (Series1, Series2, Series3), both axes from 0 to 9]

Iris Data Set
    Sepal  Sepal  Petal  Petal  Class           | Sepal  Sepal  Petal  Petal  Class           | Sepal  Sepal  Petal  Petal  Class
    Length Width  Length Width                  | Length Width  Length Width                  | Length Width  Length Width
    5.1    3.5    1.4    0.2    Iris-setosa     | 7      3.2    4.7    1.4    Iris-versicolor | 6.3    3.3    6      2.5    Iris-virginica
    4.9    3      1.4    0.2    Iris-setosa     | 6.4    3.2    4.5    1.5    Iris-versicolor | 5.8    2.7    5.1    1.9    Iris-virginica
    4.7    3.2    1.3    0.2    Iris-setosa     | 6.9    3.1    4.9    1.5    Iris-versicolor | 7.1    3      5.9    2.1    Iris-virginica
    4.6    3.1    1.5    0.2    Iris-setosa     | 5.5    2.3    4      1.3    Iris-versicolor | 6.3    2.9    5.6    1.8    Iris-virginica
    5      3.6    1.4    0.2    Iris-setosa     | 6.5    2.8    4.6    1.5    Iris-versicolor | 6.5    3      5.8    2.2    Iris-virginica
    5.4    3.9    1.7    0.4    Iris-setosa     | 5.7    2.8    4.5    1.3    Iris-versicolor | 7.6    3      6.6    2.1    Iris-virginica
    4.6    3.4    1.4    0.3    Iris-setosa     | 6.3    3.3    4.7    1.6    Iris-versicolor | 4.9    2.5    4.5    1.7    Iris-virginica
    5      3.4    1.5    0.2    Iris-setosa     | 4.9    2.4    3.3    1      Iris-versicolor | 7.3    2.9    6.3    1.8    Iris-virginica
    4.4    2.9    1.4    0.2    Iris-setosa     | 6.6    2.9    4.6    1.3    Iris-versicolor | 6.7    2.5    5.8    1.8    Iris-virginica
    4.9    3.1    1.5    0.1    Iris-setosa     | 5.2    2.7    3.9    1.4    Iris-versicolor | 7.2    3.6    6.1    2.5    Iris-virginica
    5.4    3.7    1.5    0.2    Iris-setosa     | 5      2      3.5    1      Iris-versicolor | 6.5    3.2    5.1    2      Iris-virginica
    4.8    3.4    1.6    0.2    Iris-setosa     | 5.9    3      4.2    1.5    Iris-versicolor | 6.4    2.7    5.3    1.9    Iris-virginica
    4.8    3      1.4    0.1    Iris-setosa     | 6      2.2    4      1      Iris-versicolor | 6.8    3      5.5    2.1    Iris-virginica
    4.3    3      1.1    0.1    Iris-setosa     | 6.1    2.9    4.7    1.4    Iris-versicolor | 5.7    2.5    5      2      Iris-virginica
    5.8    4      1.2    0.2    Iris-setosa     | 5.6    2.9    3.6    1.3    Iris-versicolor | 5.8    2.8    5.1    2.4    Iris-virginica
    5.7    4.4    1.5    0.4    Iris-setosa     | 6.7    3.1    4.4    1.4    Iris-versicolor | 6.4    3.2    5.3    2.3    Iris-virginica
    5.4    3.9    1.3    0.4    Iris-setosa     | 5.6    3      4.5    1.5    Iris-versicolor | 6.5    3      5.5    1.8    Iris-virginica
    5.1    3.5    1.4    0.3    Iris-setosa     | 5.8    2.7    4.1    1      Iris-versicolor | 7.7    3.8    6.7    2.2    Iris-virginica
    5.7    3.8    1.7    0.3    Iris-setosa     | 6.2    2.2    4.5    1.5    Iris-versicolor | 7.7    2.6    6.9    2.3    Iris-virginica
    5.1    3.8    1.5    0.3    Iris-setosa     | 5.6    2.5    3.9    1.1    Iris-versicolor | 6      2.2    5      1.5    Iris-virginica
    5.4    3.4    1.7    0.2    Iris-setosa     | 5.9    3.2    4.8    1.8    Iris-versicolor | 6.9    3.2    5.7    2.3    Iris-virginica
    5.1    3.7    1.5    0.4    Iris-setosa     | 6.1    2.8    4      1.3    Iris-versicolor | 5.6    2.8    4.9    2      Iris-virginica
    4.6    3.6    1      0.2    Iris-setosa     | 6.3    2.5    4.9    1.5    Iris-versicolor | 7.7    2.8    6.7    2      Iris-virginica
    5.1    3.3    1.7    0.5    Iris-setosa     | 6.1    2.8    4.7    1.2    Iris-versicolor | 6.3    2.7    4.9    1.8    Iris-virginica
    4.8    3.4    1.9    0.2    Iris-setosa     | 6.4    2.9    4.3    1.3    Iris-versicolor | 6.7    3.3    5.7    2.1    Iris-virginica

Algorithms Presented
• Decision trees
  – C4.5
• Clustering
  – K-Means
• Classification
  – Naïve Bayes

References and Resources
• References:
  – WEKA website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
  – WEKA Tutorial:
    • Machine Learning with WEKA: A presentation demonstrating all graphical user interfaces (GUI) in Weka.
    • A presentation which explains how to use Weka for exploratory data mining.
  – WEKA Data Mining Book:
    • Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
  – WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/Main_Page
  – Others:
    • Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd ed.
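
To make the K-Means entry from the list of presented algorithms concrete, here is a minimal Python sketch of Lloyd's algorithm run on the manually generated data set (2 features, 3 classes, 10 instances per class). It is not the WEKA or KNIME implementation. The function names (`sq_dist`, `farthest_first`, `kmeans`) and the deterministic farthest-first initialisation are assumptions made for this illustration only; the tools themselves may initialise centroids differently.

```python
# Minimal k-means (Lloyd's algorithm) sketch on the manually generated data set.
# Assumption: farthest-first initialisation is used so the run is reproducible;
# this is an illustrative choice, not the behaviour of WEKA or KNIME.

POINTS = [
    # class c1
    (2.2, 2.9), (3.1, 2.1), (2.5, 2.9), (2.6, 3.3), (2.5, 2.1),
    (2.8, 2.6), (3.0, 2.4), (3.1, 3.1), (2.8, 3.1), (3.1, 3.3),
    # class c2
    (7.2, 2.9), (7.9, 2.1), (7.5, 2.9), (7.6, 3.3), (7.5, 2.1),
    (7.8, 2.6), (7.4, 2.4), (8.1, 3.1), (7.8, 3.1), (8.1, 3.3),
    # class c3
    (7.2, 7.9), (8.1, 7.1), (7.5, 7.9), (7.6, 8.3), (7.5, 7.1),
    (7.8, 7.6), (8.0, 7.4), (7.4, 8.1), (7.8, 8.1), (7.3, 8.3),
]


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def farthest_first(points, k):
    """Pick k starting centres: the first point, then repeatedly the point
    farthest from all centres chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, max_iter=100):
    """Alternate assignment and centroid update until labels stop changing."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centers, labels


centers, labels = kmeans(POINTS, 3)
for j, (cx, cy) in enumerate(centers):
    print(f"cluster {j}: centre=({cx:.2f}, {cy:.2f}), size={labels.count(j)}")
```

Because the three groups in this data set are well separated, the sketch recovers one cluster per class; on overlapping data such as Iris, cluster assignments would not generally coincide with the class labels.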

Demonstration
• Drag & Drop Nodes from Repository to Workbench
• Configure Nodes individually
• Connect Nodes via Simple dragging
• Execute one or more nodes
• Open individual views per node
• Mark (hilite) selected points
• HiLiting also spreads to other views
• Many more views and also other types available…
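
As a supplement to the Supported File Formats slides, the relationship between the CSV and ARFF layouts of the heart-disease sample can be sketched in Python. This is a rough illustration, not WEKA's own converter. The helper names (`is_number`, `csv_to_arff`) are hypothetical, and the sketch assumes that a column is numeric when every non-missing value parses as a number, and that nominal values are simply those observed in the data. As a result a value such as `atyp_angina`, which the slide's ARFF header declares but the sample rows never use, will not appear.

```python
import csv
import io

# The heart-disease sample from the CSV slide.
CSV_TEXT = """\
Age, sex, chest_pain_type, cholesterol, exercise_induced_angina,class
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
"""


def is_number(text):
    """True if the cell parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Build ARFF text from CSV text (illustrative sketch only).

    Assumptions: the first row is the header, '?' marks a missing value,
    and attribute names contain no spaces (so no quoting is attempted).
    """
    rows = [
        [cell.strip() for cell in row]
        for row in csv.reader(io.StringIO(csv_text))
        if row
    ]
    header, data = rows[0], rows[1:]

    lines = [f"@relation {relation}"]
    for i, name in enumerate(header):
        values = [r[i] for r in data if r[i] != "?"]
        if values and all(is_number(v) for v in values):
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attribute: list the observed values in sorted order.
            nominal = ", ".join(sorted(set(values)))
            lines.append(f"@attribute {name} {{ {nominal} }}")
    lines.append("@data")
    lines.extend(",".join(r) for r in data)
    return "\n".join(lines) + "\n"


arff = csv_to_arff(CSV_TEXT, "heart-disease")
print(arff)
```

In practice the nominal value sets are usually declared by hand or taken from the full data set, since a small sample may not contain every legal value.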