Statistical Analysis on Iris Data


Workshop: Getting Started with R (for newbies), UTM, 14 October 2018
© Dr. Norhaiza Ahmad, Department of Mathematical Sciences, Faculty of Science,
Universiti Teknologi Malaysia. http://science.utm.my/norhaiza/

IRIS DATASET

The Iris flower data set is a collection of data used to quantify the morphological variation of Iris flowers. The flowers were collected in the Gaspé Peninsula from the same pasture, picked on the same day, and measured at the same time by the same person with the same apparatus. The data set consists of 50 samples from each of three species: Iris setosa (far left), Iris versicolor (centre) and Iris virginica (far right). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.

Iris dataset

The iris data is included in the R base package as a data frame.

> iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
...
50           5.0         3.3          1.4         0.2     setosa
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
...
100          5.7         2.8          4.1         1.3 versicolor
101          6.3         3.3          6.0         2.5  virginica
102          5.8         2.7          5.1         1.9  virginica
...
149          6.2         3.4          5.4         2.3  virginica
150          5.9         3.0          5.1         1.8  virginica

TASK

Call up the iris dataset in R and analyse it using the codes given to you.
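Before working through the codes given, it can help to confirm the shape of the data frame in the console. A minimal sketch, using base R only:

```r
# iris ships with base R, so no package is needed
data(iris)
dim(iris)               # 150 observations of 5 variables
str(iris)               # four numeric measurements plus the Species factor
table(iris$Species)     # 50 samples from each of the three species
```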
Iris dataset

iris          # multivariate data on flower measurements
head(iris)

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

tail(iris)

summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
# mean and median appear close - an indication the data are symmetric

# Say we want to analyse species versicolor only
# Create a subset of species versicolor
iris.vs = iris[51:100, 1:4]
# or
iris.vs = iris[iris$Species == "versicolor", 1:4]

names(iris.vs)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"

# DISPLAY DATA
hist(iris.vs[,1])
# tidy
hist(iris.vs[,1], main = names(iris)[1], xlab = NULL)

# change layout of graphs
par(mfrow = c(1,2))   # 1 row, 2 column layout
hist(iris.vs[,1])
hist(iris.vs[,1], main = names(iris)[1], xlab = NULL)
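The "mean and median appear close" remark can be checked numerically for the versicolor subset; a short sketch placing the two side by side:

```r
iris.vs <- iris[iris$Species == "versicolor", 1:4]
# means and medians in one table; close values suggest roughly symmetric data,
# e.g. Sepal.Length: mean 5.936 vs median 5.900
round(rbind(mean   = sapply(iris.vs, mean),
            median = sapply(iris.vs, median)), 3)
```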
# Graphs for all measurements of iris versicolor
par(mfrow = c(2,2))
hist(iris.vs[,1], main = names(iris)[1], xlab = NULL)
hist(iris.vs[,2], main = names(iris)[2], xlab = NULL)
hist(iris.vs[,3], main = names(iris)[3], xlab = NULL)
hist(iris.vs[,4], main = names(iris)[4], xlab = NULL)

# multi scatter-plots between variables
pairs(iris.vs)

# Correlation between variables
cor(iris.vs)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000   0.5259107    0.7540490   0.5464611
Sepal.Width     0.5259107   1.0000000    0.5605221   0.6639987
Petal.Length    0.7540490   0.5605221    1.0000000   0.7866681
Petal.Width     0.5464611   0.6639987    0.7866681   1.0000000
# The pairs Sepal.Length vs Petal.Length and Petal.Length vs Petal.Width
# are most strongly correlated, with respective correlations of 0.7540 and 0.7867

# Correlation test between Petal.Length and Petal.Width
cor.test(iris.vs[,3], iris.vs[,4])

        Pearson's product-moment correlation

data:  iris.vs[, 3] and iris.vs[, 4]
t = 8.828, df = 48, p-value = 1.272e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6508311 0.8737034
sample estimates:
      cor
0.7866681
# Significant linear correlation

# How to build a statistical model to predict a new petal length given a new
# petal width? Use a simple linear regression model
(irisVS.lm = lm(iris.vs[,3] ~ iris.vs[,4]))

Call:
lm(formula = iris.vs[, 3] ~ iris.vs[, 4])

Coefficients:
 (Intercept)  iris.vs[, 4]
       1.781         1.869

# Petal.Length = 1.781 + 1.869 * Petal.Width

# Is there a difference between the average petal length of species
# setosa and versicolor?
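The two-sample t-test used below assumes each sample is roughly normally distributed. A quick sketch of checking that assumption with Shapiro-Wilk tests (a large p-value gives no evidence against normality):

```r
iris.s  <- iris[iris$Species == "setosa", 1:4]
iris.vs <- iris[iris$Species == "versicolor", 1:4]
# Shapiro-Wilk test of normality on Petal.Length within each species
shapiro.test(iris.s[, 3])
shapiro.test(iris.vs[, 3])
```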
# Assume the data are normally distributed
iris.s = iris[iris$Species == "setosa", 1:4]
t.test(iris.s[,3], iris.vs[,3])

        Welch Two Sample t-test

data:  iris.s[, 3] and iris.vs[, 3]
t = -39.4927, df = 62.14, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.939618 -2.656382
sample estimates:
mean of x mean of y
    1.462     4.260

# Significant evidence to reject the null hypothesis that there is no
# difference between the average petal length of Iris setosa and versicolor

Advanced: Use package ggplot2 on IRIS

library(ggplot2)
p1 = ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width)); p1  # set graph paper
p2 = p1 + geom_point(aes(color = Species)); p2   # use geom to specify what to plot
p3 = p2 + geom_smooth(method = 'lm'); p3         # add a linear regression fit to the data
p4 = p3 + xlab("Petal Length (cm)") + ylab("Petal Width (cm)") +
     ggtitle("Petal Length vs Petal Width"); p4  # create/modify title

Other Example

Example: Student Admissions data

Aggregate data on applicants to postgraduate school at Berkeley for the six largest departments, classified by admission and gender.

> UCBdt
      Admit Gender Dept Freq
1  Admitted   Male    A  512
2  Rejected   Male    A  313
3  Admitted Female    A   89
4  Rejected Female    A   19
5  Admitted   Male    B  353
6  Rejected   Male    B  207
7  Admitted Female    B   17
8  Rejected Female    B    8
9  Admitted   Male    C  120
10 Rejected   Male    C  205
11 Admitted Female    C  202
12 Rejected Female    C  391
13 Admitted   Male    D  138
14 Rejected   Male    D  279
15 Admitted Female    D  131
16 Rejected Female    D  244
17 Admitted   Male    E   53
18 Rejected   Male    E  138
19 Admitted Female    E   94
20 Rejected Female    E  299
21 Admitted   Male    F   22
22 Rejected   Male    F  351
23 Admitted Female    F   24
24 Rejected Female    F  317

Admission levels: Admitted/Rejected. Gender: Male/Female. Department: A-F.

Simple Visual: Student Admissions - package plyr

[Barplots] More males than females were admitted to the university. Admission is highest for department A compared to the rest, and lowest for department F. Dept. A & B discriminate gender for admission.

# Advanced - use package plyr: Student Admissions
library(plyr)
library(datasets)
UCBdt <- as.data.frame(UCBAdmissions)

overall <- ddply(UCBdt, .(Gender), function(gender) {
  temp <- c(sum(gender[gender$Admit == "Admitted", "Freq"]),
            sum(gender[gender$Admit == "Rejected", "Freq"])) / sum(gender$Freq)
  names(temp) <- c("Admitted", "Rejected")
  temp
})

departmentwise <- ddply(UCBdt, .(Gender, Dept), function(gender) {
  temp <- gender$Freq / sum(gender$Freq)
  names(temp) <- c("Admitted", "Rejected")
  temp
})

# A barplot of the overall admission percentage for each gender
p1 <- ggplot(data = overall, aes(x = Gender, y = Admitted, width = 0.2))
p1 <- p1 + geom_bar(stat = "identity") +
      ggtitle("Overall admission percentage") + ylim(0, 1); p1

# A 1x6 panel of barplots, each of which shows the number of
# admitted students for a department
p2 <- ggplot(data = UCBdt[UCBdt$Admit == "Admitted", ], aes(x = Gender, y = Freq))
p2 <- p2 + geom_bar(stat = "identity") + facet_grid(. ~ Dept) +
      ggtitle("Number of admitted students\nfor each department") +
      theme(axis.text.x = element_text(angle = 90, hjust = 1)); p2

# A 1x6 panel of barplots, each of which shows the admission
# percentage for a department
p3 <- ggplot(data = departmentwise, aes(x = Gender, y = Admitted))
p3 <- p3 + geom_bar(stat = "identity") + facet_grid(.
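The claim that some departments discriminate by gender can be tested formally. A sketch, assuming the standard 2 x 2 x 6 UCBAdmissions table from the datasets package, applying a chi-square test of independence within each department and on the aggregated table:

```r
library(datasets)
# UCBAdmissions is a 2 x 2 x 6 array: Admit x Gender x Dept
# test admission vs gender separately within each department
round(apply(UCBAdmissions, 3, function(tab) chisq.test(tab)$p.value), 4)
# aggregate over departments (this is where the apparent bias appears)
chisq.test(margin.table(UCBAdmissions, c(1, 2)))
```

In the classical analysis of these data, the aggregate table shows a strong gender/admission association while, within departments, only A clearly does: an instance of Simpson's paradox, driven by women applying disproportionately to the more selective departments.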