Comparison of Open Source Data Mining Software (Açık Kaynak Kodlu Veri Madenciliği Yazılımlarının Karşılaştırılması)


Mümine KAYA1, Selma Ayşe ÖZEL2
1 Adana Bilim ve Teknoloji Üniversitesi (Adana Science and Technology University), Department of Computer Engineering, Adana
2 Çukurova Üniversitesi (Çukurova University), Department of Computer Engineering, Adana

Abstract: Data mining is the process of discovering hidden correlations and rules within large amounts of data using computer software and statistical methods. The aim of data mining methods and software is to process large amounts of data efficiently and effectively. In this study, the open source data mining tools Keel, Knime, Orange, R, RapidMiner (Yale), and Weka were compared. The comparison makes it possible to determine which software works more efficiently for a given kind of data set.

Keywords: Data Mining, Open Source, Data Mining Software.

1. Introduction

Information technology, data communication technologies, and data collection tools are now highly developed and widespread. This rapid development has produced very large data sources and has brought problems with it [1]. Chief among these problems is uncovering meaningful and useful information that resides in databases but cannot be found with simple SQL queries. Analysis methods were therefore needed to process such data, and data mining emerged to meet this need. Data mining is the process of obtaining previously unknown, valid, and actionable information from large data sources [2]. Put even more briefly, data mining is the process of finding and extracting the useful and understandable items from large-scale data [1].

Data mining makes it possible to reveal relationships among data and to make predictions about the future. It differs from traditional database queries in two ways: i) in traditional databases a query is expressed in a well-defined query language such as SQL, whereas in data mining the query may not be well defined or fully defined; ii) in traditional databases the result of a query is a subset of the data stored in the database, whereas in data mining the result is usually not a subset of the database but an analysis of its contents.

The aim of data mining is to turn raw data into meaningful, effective, and useful information [3]. Software developed for data mining makes the data mining process easier to carry out.

Previous studies have either described or applied only some data mining tools, without going into detail [4, 5, 6]. Unlike those studies, this study compares six data mining tools in more detail, so that the tools that can be used more effectively for a given need are identified.

The second section describes the data mining process. The third section introduces the open source software used in the study, and the fourth section compares these tools from different angles. The last section presents conclusions and recommendations.

2. The Data Mining Process

Data mining is one of the most important steps of knowledge discovery. Knowledge discovery consists of seven steps: Data Cleaning, Data Integration, Data Selection, Data Transformation, Data Mining, Pattern Evaluation, and Knowledge Presentation [7].

As Figure 1 shows, the process starts with defining the problem at hand. It continues, in order, with collecting the data related to the problem, preparing the data, designing a model suited to the data and the problem, and evaluating the suitability and adequacy of the designed model. It ends with applying the model. Along the way, the data cleaning step removes noisy and inconsistent data from the data set; the data integration step combines data in different formats coming from many data sources; the data selection step identifies the data relevant to the analysis to be performed; the data transformation step converts the data into a form that the data mining technique can use; the data mining step applies intelligent methods to capture data patterns; the pattern evaluation step identifies, according to certain measures, the interesting patterns that represent the knowledge obtained; and the knowledge presentation step presents the obtained knowledge to the user [7, 8].

Figure 1. The Knowledge Discovery Process

2.1 Application Areas of Data Mining

Data mining is applied in many fields, including banking, the stock market, marketing management, retail sales, signal processing, insurance, telecommunications, electronic commerce, health, medicine, biology, genetics, industry, education, intelligence, science, and engineering [1, 2].

2.2 Data Mining Software

Many software packages have been developed for data mining. Some are commercial and others are open source, so data mining software falls into two groups: commercial and open source. Examples of commercial software are SPSS Clementine, Excel, SPSS, SAS, Angoss, KXEN, MS SQL Server, MATLAB, and the modules Oracle has developed for this purpose. Examples of open source software are Orange, RapidMiner, WEKA, R, Keel, Knime, Tanagra, Scriptella ETL, jHepWork, and Elki [4, 5].

3. Open Source Data Mining Software

This section gives brief information about Keel, Knime, Orange, R, RapidMiner (Yale), and Weka, all of which are open source and free to use.

3.1 Keel

Keel [9] is software written in Java and developed by the University of Granada with the support of the Spanish national science projects agency. Keel is not rich in classical data mining algorithms such as clustering and classification. Instead, it includes many varieties of fuzzy classifiers, artificial-intelligence-based classification algorithms, and rule-based clustering algorithms [6]. Keel is one of the weakest tools in terms of data visualization.

3.2 Knime

Konstanz Information Miner (KNIME) [10] is developed on the Eclipse Rich Client Platform by the visual data mining research group at the University of Konstanz. Knime stands out for its extensibility. It is the only application that provides users with a software development kit so that they can write their own modules [6]. It can run without installation. Knime can read data from .txt text files and from the .arff and .table formats. Knime is one of the tools offering the richest visualization facilities.

3.3 Orange

Orange [11] was developed by the artificial intelligence research team of the Faculty of Computer and Information Science at the University of Ljubljana, Slovenia [6]. Orange itself is written in C++. Its interfaces and graphical environment were developed using the Qt3 library and Python [6]. It is weak in terms of visualization. It can import data only from text files.

3.4 R

R was developed by Robert Gentleman and Ross Ihaka of the Department of Statistics at the University of Auckland for graphics, statistical computing, and data analysis [4]. R [12] is widely used on Unix machines. R is not often chosen as a data mining tool.

3.5 RapidMiner (Yale)

RapidMiner [13] was developed by Ralf Klinkenberg, Ingo Mierswa, and Simon Fischer at the Artificial Intelligence Unit of the Technical University of Dortmund. The software is written in Java and was originally released under the name Yale [14] (Yet Another Learning Environment); it continues to be used under the name RapidMiner [13]. Unlike the other data mining tools, it can process data in 22 file formats. RapidMiner covers data mining and machine learning algorithms and, like Weka, offers a large number of them. It includes data analysis, preprocessing, classification, clustering, association rule mining, and attribute selection. It supports Oracle, MS SQL Server, PostgreSQL, MySQL, JDBC, Sybase, Access, IBM DB2, and Ingres databases as well as text files [6], which makes it one of the most comprehensive tools in this respect. It can connect to Excel files. It runs easily and without errors on MS Windows, GNU/Linux, and Mac OS X. It is also one of the richest tools in terms of visualization and graphical interface. Scripts can be written from within it.

3.6 Weka

[...] software, by contrast, is R, which is a statistics package. Orange, RapidMiner, and Weka are among the tools that are weaker than R at statistical computation.
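As a rough illustration of the seven knowledge-discovery steps described in Section 2 (cleaning, integration, selection, transformation, mining, pattern evaluation, presentation), the following Python sketch chains one small function per step. Everything in it is hypothetical and not taken from the paper or from any of the compared tools: the toy records, the function names, the age threshold of 40, and the very simple "majority class per age band" rule used as the mining step. It is only meant to show how the steps feed into one another.

```python
from collections import Counter

# Hypothetical toy records from two sources (illustrative only, not from the paper).
# None marks a missing value.
SOURCE_A = [
    {"id": 1, "age": 25, "income": 30.0, "buys": "no"},
    {"id": 2, "age": 47, "income": None, "buys": "yes"},   # missing income
    {"id": 3, "age": 52, "income": 81.0, "buys": "yes"},
]
SOURCE_B = [
    {"id": 4, "age": 31, "income": 35.0, "buys": "no"},
    {"id": 5, "age": 58, "income": 90.0, "buys": "yes"},
    {"id": 6, "age": -1, "income": 40.0, "buys": "no"},    # inconsistent age
]


def clean(rows):
    """Data cleaning: drop rows with missing or inconsistent values."""
    return [r for r in rows if r["income"] is not None and r["age"] > 0]


def integrate(*sources):
    """Data integration: combine rows coming from several sources."""
    return [row for source in sources for row in source]


def select(rows, columns):
    """Data selection: keep only the attributes relevant to the analysis."""
    return [{c: r[c] for c in columns} for r in rows]


def transform(rows):
    """Data transformation: discretize age into two bands (threshold is an assumption)."""
    return [
        {"age_band": "40+" if r["age"] >= 40 else "<40", "buys": r["buys"]}
        for r in rows
    ]


def mine(rows):
    """Data mining: learn the majority class for each age band."""
    counts = {}
    for r in rows:
        counts.setdefault(r["age_band"], Counter())[r["buys"]] += 1
    return {band: c.most_common(1)[0][0] for band, c in counts.items()}


def evaluate(rules, rows):
    """Pattern evaluation: fraction of rows the rules label correctly."""
    hits = sum(rules[r["age_band"]] == r["buys"] for r in rows)
    return hits / len(rows)


def present(rules, accuracy):
    """Knowledge presentation: render the rules as readable text."""
    lines = [f"age {band} -> buys={label}" for band, label in sorted(rules.items())]
    lines.append(f"training accuracy: {accuracy:.2f}")
    return "\n".join(lines)


def run_pipeline():
    """Run the seven steps in the order the paper lists them."""
    cleaned = [clean(src) for src in (SOURCE_A, SOURCE_B)]
    merged = integrate(*cleaned)
    selected = select(merged, ["age", "buys"])
    prepared = transform(selected)
    rules = mine(prepared)
    accuracy = evaluate(rules, prepared)  # measured on the training rows themselves
    return rules, accuracy, present(rules, accuracy)


if __name__ == "__main__":
    print(run_pipeline()[2])
```

Note that the accuracy here is computed on the same rows the rules were learned from, so it says nothing about how the rules would perform on new data; a real evaluation would use held-out data, which tools such as Weka or RapidMiner provide through cross-validation.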