Machine Learning in Java

Design, build, and deploy your own applications by leveraging key Java machine learning libraries

Boštjan Kaluža

BIRMINGHAM - MUMBAI

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: April 2016

Production reference: 1260416

Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.

ISBN 978-1-78439-658-9

www.packtpub.com

Credits

Author: Boštjan Kaluža
Reviewers: Abhik Banerjee, Wei Di, Manjunath Narayana, Ravi Sharma
Commissioning Editor: Amarabha Banerjee
Acquisition Editor: Aaron Lazar
Content Development Editor: Rohit Singh
Technical Editor: Suwarna Patil
Copy Editor: Vibha Shukla
Project Coordinator: Izzat Contractor
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Disha Haria
Production Coordinator: Nilesh Mohite
Cover Work: Nilesh Mohite

About the Author

Boštjan Kaluža, PhD, is a researcher in artificial intelligence and machine learning. Boštjan is the chief data scientist at Evolven, a leading IT operations analytics company focusing on configuration and change management. He uses machine learning, predictive analytics, and pattern mining to turn data into understandable, relevant information and actionable insight.

Prior to Evolven, Boštjan served as a senior researcher in the department of intelligent systems at the Jozef Stefan Institute, a leading Slovenian scientific research institution, and led research projects involving pattern and anomaly detection, ubiquitous computing, and multi-agent systems. Boštjan was also a visiting researcher at the University of Southern California, where he studied suspicious and anomalous agent behavior in the context of security applications. Boštjan has extensive experience in Java and Python, and he also lectures on Weka in the classroom.

Focusing on machine learning and data science, Boštjan has published numerous articles in professional journals, delivered conference papers, and authored or contributed to a number of patents. In 2013, Boštjan published his first book on data science, Instant Weka How-to, Packt Publishing, exploring how to leverage machine learning using Weka. Learn more about him at http://bostjankaluza.net.

About the Reviewers

Abhik Banerjee is a data science leader who has led teams comprising data scientists and engineers. He completed his master's degree at the University of Cincinnati, where his research focused on various techniques related to itemset mining and biclustering applied to biomedical informatics datasets. He has been working in the areas of machine learning and data mining in the industry for the past 7-8 years, solving various problems related to classification and regression (using techniques such as SVM, Bayes net, GBM, GLM, neural networks, deep nets, and so on), clustering (biclustering, LDA, and so on), and various NLP techniques. He has been working on how these techniques can be applied to the e-mail, biomedical informatics, and retail domains in order to understand the customer better and improve their experience.

Abhik has strong problem-solving skills spanning various technological solutions and architectures, such as Hadoop, MapReduce, Spark, Java, Python, machine learning, data mining, NLP, and so on.

Wei Di is a data scientist. She is passionate about creating smart and scalable analytics and data mining solutions that can impact millions of individuals and empower successful businesses.

Her interests cover wide areas, including artificial intelligence, machine learning, and computer vision. She was previously associated with the eBay Human Language Technology team and eBay Research Labs, with a focus on image understanding for large-scale applications and joint learning from both visual and text information. Prior to this, she was with Ancestry.com, working on large-scale data mining and machine learning models in the areas of record linkage, search relevance, and ranking. She received her PhD from Purdue University in 2011, focusing on data mining and image classification.

Manjunath Narayana received his PhD in computer science from the University of Massachusetts, Amherst, in 2014. He obtained his MS degree in computer engineering from the University of Kansas in 2007 and his BE degree in electronics and communications engineering from B. M. S. College of Engineering, Bangalore, India, in 2004. He is currently a robotics scientist at iRobot Corporation, USA, developing algorithms for consumer robots. Prior to iRobot, he was a research engineer at metaio, Inc., working on computer vision research for augmented reality applications and 3D reconstruction. He has worked in the Computer Vision Systems Toolbox group at The MathWorks, Inc., developing object detection algorithms. His research interests include machine learning, robotics, computer vision, and augmented reality. His research has been published at top conferences such as CVPR, ICCV, and BMVC.

Ravi Sharma is a lead data scientist with expertise in artificial intelligence and natural language processing. He currently leads the data science research team at Msg.ai Inc.; his commercial applications of data science include developing chat bots for CPG brands, the health care industry, and the entertainment industry. He has designed data collection systems and other strategies that optimize statistical efficiency and data quality. He has implemented corporate big data-based data warehouse systems and distributed algorithms for high traffic. His areas of interest comprise big data management platforms, model building and tuning, exploratory data analysis, pattern analysis, outlier detection, collaborative filtering algorithms to provide recommendations, and text analysis using NLP.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.


https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser

Table of Contents

Preface

Chapter 1: Applied Machine Learning Quick Start
• Machine learning and data science
• What kind of problems can machine learning solve?
• Applied machine learning workflow
• Data and problem definition
• Measurement scales
• Data collection
• Find or observe data
• Generate data
• Sampling traps
• Data pre-processing
• Data cleaning
• Fill missing values
• Remove outliers
• Data transformation
• Data reduction
• Unsupervised learning
• Find similar items
• Euclidean distances
• Non-Euclidean distances
• The curse of dimensionality
• Clustering
• Supervised learning
• Classification
• Decision tree learning
• Probabilistic classifiers
• Kernel methods
• Artificial neural networks
• Ensemble learning
• Evaluating classification
• Regression
• Evaluating regression
• Generalization and evaluation
• Underfitting and overfitting
• Train and test sets
• Cross-validation
• Leave-one-out validation
• Stratification
• Summary

Chapter 2: Java Libraries and Platforms for Machine Learning
• The need for Java
• Machine learning libraries
• Weka
• Java machine learning
• Apache Mahout
• Apache Spark
• Deeplearning4j
• MALLET
• Comparing libraries
• Building a machine learning application
• Traditional machine learning architecture
• Dealing with big data
• Big data application architecture
• Summary

Chapter 3: Basic Algorithms – Classification, Regression, and Clustering
• Before you start
• Classification
• Data
• Loading data
• Feature selection
• Learning algorithms
• Classify new data
• Evaluation and prediction error metrics
• Confusion matrix
• Choosing a classification algorithm
• Regression
• Loading the data
• Analyzing attributes
• Building and evaluating regression model
• Linear regression
• Regression trees
• Tips to avoid common regression problems
• Clustering
• Clustering algorithms
• Evaluation
• Summary

Chapter 4: Customer Relationship Prediction with Ensembles
• Customer relationship database
• Challenge
• Dataset
• Evaluation
• Basic baseline
• Getting the data
• Loading the data
• Basic modeling
• Evaluating models
• Implementing naive Bayes baseline
• Advanced modeling with ensembles
• Before we start
• Data pre-processing
• Attribute selection
• Model selection
• Performance evaluation
• Summary

Chapter 5: Affinity Analysis
• Market basket analysis
• Affinity analysis
• Association rule learning
• Basic concepts
• Database of transactions
• Itemset and rule
• Support
• Confidence
• Apriori algorithm
• FP-growth algorithm
• The supermarket dataset
• Discover patterns
• Apriori
• FP-growth
• Other applications in various areas
• Medical diagnosis
• Protein sequences
• Census data
• Customer relationship management
• IT Operations Analytics
• Summary

Chapter 6: Recommendation Engine with Apache Mahout
• Basic concepts
• Key concepts
• User-based and item-based analysis
• Approaches to calculate similarity
• Collaborative filtering
• Content-based filtering
• Hybrid approach
• Exploitation versus exploration
• Getting Apache Mahout
• Configuring Mahout in Eclipse with the Maven plugin
• Building a recommendation engine
• Book ratings dataset
• Loading the data
• Loading data from file
• Loading data from database
• In-memory database
• Collaborative filtering
• User-based filtering
• Item-based filtering
• Adding custom rules to recommendations
• Evaluation
• Online learning engine
• Content-based filtering
• Summary

Chapter 7: Fraud and Anomaly Detection
• Suspicious and anomalous behavior detection
• Unknown-unknowns
• Suspicious pattern detection
• Anomalous pattern detection
• Analysis types
• Pattern analysis
• Transaction analysis
• Plan recognition
• Fraud detection of insurance claims
• Dataset
• Modeling suspicious patterns
• Vanilla approach
• Dataset rebalancing
• Anomaly detection in website traffic
• Dataset
• Anomaly detection in time series data
• Histogram-based anomaly detection
• Loading the data
• Creating histograms
• Density based k-nearest neighbors
• Summary

Chapter 8: Image Recognition with Deeplearning4j
• Introducing image recognition
• Neural networks
• Feedforward neural networks
• Restricted Boltzmann machine
• Deep convolutional networks
• Image classification
• Deeplearning4j
• Getting DL4J
• MNIST dataset
• Loading the data
• Building models
• Building a single-layer regression model
• Building a deep belief network
• Build a Multilayer Convolutional Network
• Summary

Chapter 9: Activity Recognition with Mobile Phone Sensors
• Introducing activity recognition
• Mobile phone sensors
• Activity recognition pipeline
• The plan
• Collecting data from a mobile phone
• Installing Android Studio
• Loading the data collector
• Feature extraction
• Collecting training data
• Building a classifier
• Reducing spurious transitions
• Plugging the classifier into a mobile app
• Summary

Chapter 10: Text Mining with Mallet – Topic Modeling and Spam Detection
• Introducing text mining
• Topic modeling
• Text classification
• Installing Mallet
• Working with text data
• Importing data
• Importing from directory
• Importing from file
• Pre-processing text data
• Topic modeling for BBC news
• BBC dataset
• Modeling
• Evaluating a model
• Reusing a model
• Saving a model
• Restoring a model
• E-mail spam detection
• E-mail spam dataset
• Feature generation
• Training and testing
• Model performance
• Summary

Chapter 11: What is Next?
• Machine learning in real life
• Noisy data
• Class unbalance
• Feature selection is hard
• Model chaining
• Importance of evaluation
• Getting models into production
• Model maintenance
• Standards and markup languages
• CRISP-DM
• SEMMA methodology
• Predictive Model Markup Language
• Machine learning in the cloud
• Machine learning as a service
• Web resources and competitions
• Datasets
• Online courses
• Competitions
• Websites and blogs
• Venues and conferences
• Summary

Appendix: References

Index


Preface

Machine learning is a subfield of artificial intelligence. It helps computers learn and act like human beings with the help of algorithms and data. Given a set of data, an ML algorithm learns different properties of the data and infers the properties of data that it may encounter in the future.

This book will teach the readers how to create and implement machine learning algorithms in Java by providing fundamental concepts as well as practical examples. In this process, it will also talk about some machine learning libraries that are frequently used, such as Weka, Apache Mahout, Mallet, and so on. This book will help the user to select appropriate approaches for particular problems and compare and evaluate the results of different techniques. This book will also cover performance improvement techniques, including input preprocessing and combining output from different methods.

Without shying away from the technical details, you will explore machine learning with Java libraries using clear and practical examples. You will also explore how to prepare data for analysis, choose a machine learning method, and measure the success of the process.

What this book covers

Chapter 1, Applied Machine Learning Quick Start, introduces the basics of machine learning, laying down the common concepts, machine learning principles, and applied machine learning workflow.

Chapter 2, Java Libraries and Platforms for Machine Learning, reviews the various Java libraries and platforms dedicated to machine learning, what each library brings to the table, and what kind of problems it is able to solve. The review includes Weka, Java-ML, Apache Mahout, Apache Spark, deeplearning4j, and Mallet.


Chapter 3, Basic Algorithms – Classification, Regression, and Clustering, starts with basic machine learning tasks, introducing the key algorithms for classification, regression, and clustering, using small, easy-to-understand datasets.

Chapter 4, Customer Relationship Prediction with Ensembles, dives into a real-world marketing database, where the task is to predict customers that will churn, upsell, and cross-sell. The problem is attacked with ensemble methods, following the steps of the KDD Cup-winning solution.

Chapter 5, Affinity Analysis, discusses how to analyze co-occurrence relationships using association rule mining. We will look into market basket analysis to understand the purchasing behavior of customers and discuss applications of the approach to other domains.

Chapter 6, Recommendation Engine with Apache Mahout, explains the basic concepts required to understand recommendation engine principles, followed by two applications leveraging Apache Mahout to build a content-based filter and a collaborative recommender.

Chapter 7, Fraud and Anomaly Detection, introduces the background to anomalous and suspicious pattern detection, followed by two practical applications on detecting frauds in insurance claims and detecting anomalies in website traffic.

Chapter 8, Image Recognition with Deeplearning4j, introduces image recognition and reviews fundamental neural network architectures. We will then discuss how to implement various deep learning architectures with the deeplearning4j library to recognize handwritten digits.

Chapter 9, Activity Recognition with Mobile Phone Sensors, tackles the problem of recognizing patterns from sensor data. This chapter introduces the activity recognition process, explains how to collect data with an Android device, and presents a classification model to recognize activities of daily living.

Chapter 10, Text Mining with Mallet – Topic Modeling and Spam Detection, explains the basics of text mining, introduces the text processing pipeline, and shows how to apply this to two real-world problems: topic modeling and document classification.

Chapter 11, What is Next?, concludes the book with practical advice about how to deploy models and gives you further pointers about where to find additional resources, materials, venues, and technologies to dive deeper into machine learning.

What you need for this book

To follow the examples throughout the book, you'll need a personal computer with the JDK installed. All the examples and source code that you can download assume Eclipse IDE with support for Maven, a dependency management and build automation tool; and Git, a version control system. Examples in the chapters rely on various libraries, including Weka, deeplearning4j, Mallet, and Apache Mahout. Instructions on how to get and install the libraries are provided in the chapter where the library will be first used.

Who this book is for

The book is intended for those who want to learn how to use Java's machine learning libraries to gain insights from their data. Perhaps you already know a bit about machine learning, but have never used Java; or perhaps you know a little Java, but are new to machine learning. In either case, this book will get you up and running quickly, providing you with the skills that you need to successfully create, customize, and deploy machine learning applications in real life. It would be helpful to be a little familiar with basic programming and data mining concepts, but no prior experience with data mining packages is necessary.

Supporting materials

The book has a dedicated web site, http://machine-learning-in-java.com, where you can find all the example code, errata, and additional materials that will help you to get started.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "For example, Bob has attributes named height, eye color, and hobbies with values 185cm, blue, climbing, and sky diving, respectively."


A block of code is set as follows:

Bob = { height: 185cm, eye color: blue, hobbies: climbing, sky diving }

Any command-line input or output is written as follows:

12,3,7,2,0,1,8,9,13,4,11,5,15,10,6,14,16

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Right-click on the project properties, select Java Build Path, click on the Libraries tab, and select Add External JARs."

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from the supplementary web page http://machine-learning-in-java.com. Navigate to the Downloads section and follow the link to a Git repository.

Optionally, you can also download the example code from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

• WinRAR / 7-Zip for Windows
• Zipeg / iZip / UnRarX for Mac
• 7-Zip / PeaZip for Linux

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Applied Machine Learning Quick Start

This chapter introduces the basics of machine learning, laying down the common themes and concepts and making it easy to follow the logic and familiarize yourself with the topic. The goal is to quickly learn the step-by-step process of applied machine learning and grasp the main machine learning principles. In this chapter, we will cover the following topics:

• Introducing machine learning and its relation to data science
• Discussing the basic steps in applied machine learning
• Discussing the kind of data we are dealing with and its importance
• Discussing approaches of collecting and preprocessing the data
• Making sense of data using machine learning
• Using machine learning to extract insights from data and build predictors

If you are already familiar with machine learning and are eager to start coding, then quickly jump to the following chapters. However, if you need to refresh your memory or clarify some concepts, then it is strongly recommended that you revisit the topics presented in this chapter.

Machine learning and data science

Nowadays, everyone talks about machine learning and data science. So, what exactly is machine learning anyway? How does it relate to data science? These two terms are commonly confused, as they often employ the same methods and overlap significantly. Therefore, let's first clarify what they are.


Josh Wills tweeted:

"Data scientist is a person who is better at statistics than any software engineer and better at software engineering than any statistician".

– Josh Wills

More specifically, data science encompasses the entire process of obtaining knowledge from data by integrating methods from statistics, computer science, and other fields to gain insight from data. In practice, data science encompasses an iterative process of data harvesting, cleaning, analysis and visualization, and deployment.

Machine learning, on the other hand, is mainly concerned with fairly generic algorithms and techniques that are used in the analysis and modeling phases of the data science process. Arthur Samuel proposed the following definition back in 1959:

"Machine Learning relates with the study, design and development of the algorithms that give computers the capability to learn without being explicitly programmed."

– Arthur Samuel

What kind of problems can machine learning solve?

Among the different machine learning approaches, there are three main ways of learning, as shown in the following list:

• Supervised learning
• Unsupervised learning
• Reinforcement learning

Given a set of example inputs, X, and their outcomes, Y, supervised learning aims to learn a general mapping function, f, that transforms inputs to outputs, as f: X → Y.

An example of supervised learning is credit card fraud detection, where the learning algorithm is presented with credit card transactions (matrix X) marked as normal or suspicious (vector Y). The learning algorithm produces a decision model that marks unseen transactions as normal or suspicious (that is, the f function).


In contrast, unsupervised learning algorithms do not assume given outcome labels, Y, as they focus on learning the structure of the data, such as grouping similar inputs into clusters. Unsupervised learning can, hence, discover hidden patterns in the data. An example of unsupervised learning is an item-based recommendation system, where the learning algorithm discovers similar items bought together; for example, people who bought book A also bought book B.

Reinforcement learning addresses the learning process from a completely different angle. It assumes that an agent, which can be a robot, bot, or computer program, interacts with a dynamic environment to achieve a specific goal. The environment is described with a set of states and the agent can take different actions to move from one state to another. Some states are marked as goal states, and if the agent achieves such a state, it receives a large reward. In other states, the reward is smaller, non-existent, or even negative. The goal of reinforcement learning is to find an optimal policy, that is, a mapping function that specifies the action to take in each of the states without a teacher explicitly telling whether this leads to the goal state or not. An example of reinforcement learning is a program for driving a vehicle, where the states correspond to the driving conditions—for example, current speed, road segment information, surrounding traffic, speed limits, and obstacles on the road—and the actions can be driving maneuvers such as turn left or right, stop, accelerate, and continue. The learning algorithm produces a policy that specifies the action that is to be taken in a specific configuration of driving conditions.

In this book, we will focus on supervised and unsupervised learning only, as they share many concepts. If reinforcement learning sparked your interest, a good book to start with is Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew Barto.

Applied machine learning workflow

This book's emphasis is on applied machine learning. We want to provide you with the practical skills needed to get learning algorithms to work in different settings. Instead of the math and theory of machine learning, we will spend more time on the practical, hands-on skills (and dirty tricks) to get this stuff to work well in an application. We will focus on supervised and unsupervised machine learning and cover the essential steps from data science to build the applied machine learning workflow.

A typical workflow in applied machine learning applications consists of answering a series of questions that can be summarized in the following five steps:

1. Data and problem definition: The first step is to ask interesting questions. What is the problem you are trying to solve? Why is it important? Which format of result answers your question? Is this a simple yes/no answer? Do you need to pick one of the available questions?


2. Data collection: Once you have a problem to tackle, you will need the data. Ask yourself what kind of data will help you answer the question. Can you get the data from the available sources? Will you have to combine multiple sources? Do you have to generate the data? Are there any sampling biases? How much data will be required?

3. Data preprocessing: The first data preprocessing task is data cleaning. For example, filling missing values, smoothing noisy data, removing outliers, and resolving inconsistencies. This is usually followed by integration of multiple data sources and data transformation to a specific range (normalization), to value bins (discretized intervals), and to reduce the number of dimensions.

4. Data analysis and modeling with unsupervised and supervised learning: Data analysis and modeling includes unsupervised and supervised machine learning, statistical inference, and prediction. A wide variety of machine learning algorithms are available, including k-nearest neighbors, naïve Bayes, decision trees, support vector machines, k-means, and so on. The choice of method to be deployed depends on the problem definition discussed in the first step and the type of collected data. The final product of this step is a model inferred from the data.

5. Evaluation: The last step is devoted to model assessment. The main issue that models built with machine learning face is how well they model the underlying data—if a model is too specific, that is, it overfits the data used for training, it is quite possible that it will not perform well on new data. The model can also be too generic, meaning that it underfits the training data. For example, when asked how the weather is in California, it always answers sunny, which is indeed correct most of the time. However, such a model is not really useful for making valid predictions. The goal of this step is to correctly evaluate the model and make sure it will work on new data as well. Evaluation methods include separate test and train sets, cross-validation, and leave-one-out validation.

In the following sections, we will take a closer look at each of the steps. We will try to understand the type of questions we must answer during the applied machine learning workflow and also look at the accompanying concepts of data analysis and evaluation.

Data and problem definition

Data is simply a collection of measurements in the form of numbers, words, observations, descriptions of things, images, and so on.

Measurement scales

The most common way to represent the data is using a set of attribute-value pairs. Consider the following example:

Bob = { height: 185cm, eye color: blue, hobbies: climbing, sky diving }

For example, Bob has attributes named height, eye color, and hobbies with values 185cm, blue, climbing, sky diving, respectively.

A set of data can be simply presented as a table, where columns correspond to attributes or features and rows correspond to particular data examples or instances. In supervised machine learning, the attribute whose value we want to predict from the values of the other attributes, X, is called the class or target variable and is denoted as the outcome, Y, as follows:

Name    Height [cm]    Eye color    Hobbies
Bob     185.0          Blue         Climbing, sky diving
Anna    163.0          Brown        Reading
…       …              …            …

The first thing we notice is how varied the attribute values are. For instance, height is a number, eye color is text, and hobbies are a list. To gain a better understanding of the value types, let's take a closer look at the different types of data or measurement scales. Stevens (1946) defined the following four scales with increasingly more expressive properties:

• Nominal data are mutually exclusive, but not ordered. Examples include eye color, marital status, type of car owned, and so on.
• Ordinal data correspond to categories where order matters, but not the difference between the values, such as pain level, student letter grade, service quality rating, IMDB movie rating, and so on.
• Interval data are data where the difference between two values is meaningful, but there is no concept of zero. For instance, standardized exam score, temperature in Fahrenheit, and so on.


• Ratio data has all the properties of an interval variable and also a clear definition of zero; when the variable equals zero, there is none of this variable. Variables such as height, age, stock price, and weekly food spending are ratio variables.

Why should we care about measurement scales? Well, machine learning heavily depends on the statistical properties of the data; hence, we should be aware of the limitations each data type possesses. Some machine learning algorithms can only be applied to a subset of measurement scales.

The following table summarizes the main operations and statistics properties for each of the measurement types:

Property                                     Nominal   Ordinal   Interval   Ratio
Frequency of distribution                    ✓         ✓         ✓          ✓
Mode and median                                        ✓         ✓          ✓
Order of values is known                               ✓         ✓          ✓
Can quantify difference between each value                       ✓          ✓
Can add or subtract values                                       ✓          ✓
Can multiply and divide values                                              ✓
Has true zero                                                               ✓

Furthermore, nominal and ordinal data correspond to discrete values, while interval and ratio data can correspond to continuous values as well. In supervised learning, the measurement scale of the attribute values that we want to predict dictates the kind of machine learning algorithm that can be used. For instance, predicting discrete values from a limited list is called classification and can be achieved using decision trees, while predicting continuous values is called regression, which can be achieved using model trees.

Data collection

So, where does the data come from? We have two choices: observe the data from existing sources or generate the data via surveys, simulations, and experiments. Let's take a closer look at both approaches.

Find or observe data

Data can be found or observed in many places. An obvious data source is the Internet. Intel (2013) presented an infographic showing the massive amount of data collected by different Internet services. In 2013, digital devices created four zettabytes (10^21 bytes, that is, a billion terabytes) of data. In 2017, it is expected that the number of connected devices will reach three times the number of people on earth; hence, the amount of data generated and collected will increase even further.


To get the data from the Internet, there are multiple options, as shown in the following:

• Bulk downloads from websites such as Wikipedia, IMDb, and the Million Song Dataset.
• Accessing the data through APIs (NY Times, Twitter, Facebook, Foursquare).
• Web scraping—it is OK to scrape public, non-sensitive, and anonymized data. Be sure to check the terms and conditions and to fully reference the information.

The main drawbacks of found data are that it takes time and space to accumulate the data, and that it covers only what happened; intentions and internal motivations, for instance, are not collected. Finally, such data might be noisy, incomplete, inconsistent, and may even change over time.

Another option is to collect measurements from sensors such as inertial and location sensors in mobile devices, environmental sensors, and software agents monitoring key performance indicators.

Generate data

An alternative approach is to generate the data by yourself, for example, with a survey. In survey design, we have to pay attention to data sampling, that is, who the respondents answering the survey are. We only get data from the respondents who are accessible and willing to respond. Also, respondents can provide answers that are in line with their self-image and the researcher's expectations.

Next, the data can be collected with simulations, where a domain expert specifies the behavior model of users at a micro level. For instance, crowd simulation requires specifying how different types of users will behave in a crowd, for example, following the crowd, looking for an escape, and so on. The simulation can then be run under different conditions to see what happens (Tsai et al. 2011). Simulations are appropriate for studying macro phenomena and emergent behavior; however, they are typically hard to validate empirically.

Furthermore, you can design experiments to thoroughly cover all the possible outcomes, where you keep all the variables constant and only manipulate one variable at a time. This is the most costly approach, but usually provides the best quality of data.

Sampling traps

Data collection may involve many traps. To demonstrate one, let me share a story. There is supposed to be a global, unwritten rule for sending regular mail between students for free. If you write student to student in the place where the stamp should be, the mail is delivered to the recipient for free. Now, suppose Jacob sends a set of postcards to Emma, and given that Emma indeed receives some of the postcards, she concludes that all the postcards are delivered and that the rule indeed holds true. Emma reasons that as she received the postcards, all the postcards were delivered. However, she does not possess any information about the postcards that were sent by Jacob but were undelivered; hence, she is unable to account for this in her inference. What Emma experienced is survivorship bias, that is, she drew the conclusion based on the surviving data only. For your information, postcards that are sent with a student to student stamp get a circled black letter T stamp on them, which means postage is due and that the receiver should pay it, including a small fine. However, mail services often have higher costs on applying such a fee and hence do not do it (Magalhães, 2010).

Another example is a study that found that the profession with the lowest average age of death was student. Being a student does not cause you to die at an early age; being a student means you are young. This is what makes the average age of those who die so low (Gelman and Nolan, 2002).

Furthermore, a study found that only 1.5% of drivers in accidents reported they were using a cell phone, whereas 10.9% reported that another occupant in the car distracted them. Can we conclude that using a cell phone is safer than speaking with another occupant (Uts, 2003)? To answer this question, we need to know the prevalence of cell phone use. It is likely that, during the period when the data was collected, far more people were talking to another occupant in the car while driving than were talking on a cell phone.

Data pre-processing

The goal of data pre-processing tasks is to prepare the data for a machine learning algorithm in the best possible way, as not all algorithms are capable of addressing issues with missing data, extra attributes, or denormalized values.

Data cleaning

Data cleaning, also known as data cleansing or data scrubbing, is the process of the following:

• Identifying inaccurate, incomplete, irrelevant, or corrupted data to remove it from further processing


• Parsing data, extracting information of interest, or validating whether a string of data is in an acceptable format
• Transforming data into a common encoding format, for example, utf-8 or int32, time scale, or normalized range
• Transforming data into a common data schema, for instance, if we collect temperature measurements from different types of sensors, we might want them to have the same structure

Now, let's look at some more concrete pre-processing steps.

Fill missing values

Machine learning algorithms generally do not work well with missing values. Rare exceptions include decision trees, the naïve Bayes classifier, and some rule-based learners. It is very important to understand why a value is missing. It can be missing due to many reasons, such as random error, systematic error, and sensor noise. Once we have identified the reason, there are multiple ways to deal with the missing values, as shown in the following list:

• Remove the instance: If there is enough data, and only a couple of non-relevant instances have some missing values, then it is safe to remove these instances.
• Remove the attribute: Removing an attribute makes sense when most of the values are missing, the values are constant, or the attribute is strongly correlated with another attribute.
• Assign a special value N/A: Sometimes a value is missing due to valid reasons, such as the value being out of scope, the discrete attribute value not being defined, or it not being possible to obtain or measure the value, which can be an indicator as well. For example, a person never rates a movie, so his rating on this movie is nonexistent.
• Take the average attribute value: In case we have a limited number of instances, we might not be able to afford removing instances or attributes. In that case, we can estimate the missing values, for example, by assigning the average attribute value or the average value over similar instances.
• Predict the value from other attributes: Predict the value from the previous entries if the attribute possesses time dependencies.

As we have seen, the value can be missing for many reasons, and hence, it is important to understand why the value is missing, absent, or corrupted.
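
As a minimal sketch of the take-the-average strategy, Weka's ReplaceMissingValues filter substitutes missing numeric values with the attribute mean and missing nominal values with the mode. The dataset.arff file name here is just a hypothetical placeholder:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class FillMissing {
    public static void main(String[] args) throws Exception {
        // Load a dataset (the file name is just an example)
        Instances data = DataSource.read("dataset.arff");

        // Replace missing nominal values with the mode and
        // missing numeric values with the attribute mean
        ReplaceMissingValues fillMissing = new ReplaceMissingValues();
        fillMissing.setInputFormat(data);
        Instances cleanData = Filter.useFilter(data, fillMissing);

        System.out.println(cleanData.numInstances() + " instances, no missing values left.");
    }
}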

[ 10 ] Chapter 1 Remove outliers Outliers in data are values that are unlike any other values in the series and affect all learning methods to various degrees. These can be extreme values, which could be detected with confidence intervals and removed by threshold. The best approach is to visualize the data and inspect the visualization to detect irregularities. An example is shown in the following diagram. Visualization applies to low-dimensional data only:

Data transformation

Data transformation techniques tame the dataset to a format that a machine learning algorithm expects as an input, and may even help the algorithm to learn faster and achieve better performance. Standardization, for instance, assumes that data follows a Gaussian distribution and transforms the values in such a way that the mean value is zero and the deviation is 1, as follows:

X_{new} = \frac{X - \mathrm{mean}(X)}{\mathrm{st.dev}(X)}

Normalization, on the other hand, scales the values of attributes to a small, specified range, usually between 0 and 1:

X_{new} = \frac{X - \min}{\max - \min}

Many machine learning toolboxes automatically normalize and standardize the data for you.
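
To make the two transformations concrete, here is a minimal plain-Java sketch (no particular library is assumed) that standardizes and normalizes an array of values:

public class Rescaling {
    // Z-score standardization: zero mean, unit standard deviation
    public static double[] standardize(double[] x) {
        double mean = 0, sd = 0;
        for (double v : x) mean += v;
        mean /= x.length;
        for (double v : x) sd += (v - mean) * (v - mean);
        sd = Math.sqrt(sd / x.length);
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - mean) / sd;
        return out;
    }

    // Min-max normalization to the [0, 1] range
    public static double[] normalize(double[] x) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double v : x) { min = Math.min(min, v); max = Math.max(max, v); }
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - min) / (max - min);
        return out;
    }
}

In Weka, the same effect can be achieved with the unsupervised Normalize and Standardize attribute filters.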


The last transformation technique is discretization, which divides the range of a continuous attribute into intervals. Why should we care? Some algorithms, such as decision trees and naïve Bayes, prefer discrete attributes. The most common ways to select the intervals are shown in the following list:

• Equal width: The interval of the continuous variable is divided into k equal-width intervals
• Equal frequency: Supposing there are N instances, each of the k intervals contains approximately N/k instances
• Min entropy: This approach recursively splits the intervals until the entropy, which measures disorder, decreases more than the entropy increase introduced by the interval split (Fayyad and Irani, 1993)

The first two methods require us to specify the number of intervals, while the last method sets the number of intervals automatically; however, it requires the class variable, which means it won't work for unsupervised machine learning tasks.
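
As an illustration, the following sketch discretizes all numeric attributes into ten equal-width bins using Weka's unsupervised Discretize filter; switching setUseEqualFrequency(true) gives equal-frequency bins, while the supervised variant of the filter implements the entropy-based method. The dataset file name is hypothetical:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff");

        Discretize discretize = new Discretize();
        discretize.setBins(10);                  // number of intervals, k
        discretize.setUseEqualFrequency(false);  // equal-width binning
        discretize.setInputFormat(data);

        Instances discreteData = Filter.useFilter(data, discretize);
        System.out.println(discreteData);
    }
}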

Data reduction

Data reduction deals with abundant attributes and instances. The number of attributes corresponds to the number of dimensions in our dataset. Dimensions with low prediction power not only contribute very little to the overall model, but can also cause a lot of harm. For instance, an attribute with random values can introduce some random patterns that will be picked up by a machine learning algorithm.

To deal with this problem, the first set of techniques removes such attributes, or in other words, selects the most promising ones. This process is known as feature selection or attribute selection and includes methods such as ReliefF, information gain, and the Gini index. These methods are mainly focused on discrete attributes.

Another set of tools, focused on continuous attributes, transforms the dataset from the original dimensions into a lower-dimensional space. For example, if we have a set of points in three-dimensional space, we can make a projection into a two-dimensional space. Some information is lost, but in case the third dimension is irrelevant, we don't lose much as the data structure and relationships are almost perfectly preserved. This can be performed by the following methods:

• Singular value decomposition (SVD)
• Principal Component Analysis (PCA)
• Neural network autoencoders


The second problem in data reduction is related to too many instances; for example, they can be duplicates or come from a very frequent data stream. The main idea is to select a subset of instances in such a way that the distribution of the selected data still resembles the original data distribution, and more importantly, the observed process. Techniques to reduce the number of instances involve random data sampling, stratification, and others. Once the data is prepared, we can start with the data analysis and modeling.

Unsupervised learning

Unsupervised learning is about analyzing the data and discovering hidden structures in unlabeled data. As no notion of the right labels is given, there is also no error measure to evaluate a learned model; however, unsupervised learning is an extremely powerful tool. Have you ever wondered how Amazon can predict what books you'll like? How Netflix knows what you want to watch before you do? The answer can be found in unsupervised learning. The following is one such example.

Find similar items

Many problems can be formulated as finding similar sets of elements, for example, customers who purchased similar products, web pages with similar content, images with similar objects, users who visited similar websites, and so on.

Two items are considered similar if they are a small distance apart. The main questions are how each item is represented and how the distance between items is defined. There are two main classes of distance measures: Euclidean distances and non-Euclidean distances.

Euclidean distances

In a Euclidean space with n dimensions, the distance between two elements is based on the locations of the elements in such a space, which is expressed as the p-norm distance. Two commonly used distance measures are the L2 and L1 norm distances. The L2 norm, also known as Euclidean distance, is the most frequently applied distance measure; it measures how far apart two items in a two-dimensional space are. It is calculated as the square root of the sum of the squares of the differences between elements a and b in each dimension, as follows:

d_{L2}(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}


L1 norm, also known as Manhattan distance, city block distance, and taxicab norm, simply sums the absolute differences in each dimension, as follows:

d_{L1}(a, b) = \sum_{i=1}^{n} |a_i - b_i|
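
Both norms are straightforward to implement; the following is a small plain-Java sketch (assuming the two vectors have the same length):

public class Distances {
    // L2 norm (Euclidean distance)
    public static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // L1 norm (Manhattan distance)
    public static double manhattan(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }
}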

Non-Euclidean distances

A non-Euclidean distance is based on the properties of the elements, but not on their location in space. Some well-known distances are Jaccard distance, cosine distance, edit distance, and Hamming distance.

Jaccard distance is used to compute the distance between two sets. First, we compute the Jaccard similarity of two sets as the size of their intersection divided by the size of their union, as follows:

\mathrm{sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}

The Jaccard distance is then defined as 1 minus Jaccard similarity, as shown in the following:

d(A, B) = 1 - \mathrm{sim}(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}

Cosine distance between two vectors focuses on the orientation and not the magnitude; therefore, two vectors with the same orientation have a cosine similarity of 1, while two perpendicular vectors have a cosine similarity of 0. Suppose we have two multidimensional points; think of a point as a vector from the origin (0, 0, …, 0) to its location. Two vectors make an angle, and the cosine distance is computed from the normalized dot product of the vectors, as follows:

d(A, B) = \arccos\frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}

Cosine distance is commonly used in high-dimensional feature spaces; for instance, in text mining, where a text document represents an instance, features correspond to different words, and their values correspond to the number of times a word appears in the document. By computing cosine similarity, we can measure how likely two documents are to describe similar content.
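
A small plain-Java sketch of the Jaccard and cosine measures described above:

import java.util.HashSet;
import java.util.Set;

public class Similarities {
    // Jaccard similarity: size of intersection divided by size of union
    public static <T> double jaccard(Set<T> a, Set<T> b) {
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<T> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Cosine distance: angle between two vectors (in radians)
    public static double cosineDistance(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return Math.acos(dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }
}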


Edit distance makes sense when we compare two strings. The distance between the strings a = a_1 a_2 a_3 … a_n and b = b_1 b_2 b_3 … b_n is the smallest number of insert/delete operations of single characters required to convert string a into string b. For example, if a = abcd and b = abbd, then to convert a into b, we have to delete c and insert b in its place. No smaller number of operations would convert a into b; thus, the distance is d(a, b) = 2.

Hamming distance compares two vectors of the same size and counts the number of dimensions in which they differ. In other words, it measures the number of substitutions required to convert one vector into another.

There are many distance measures focusing on various properties; for instance, correlation measures the linear relationship between two elements; Mahalanobis distance measures the distance between a point and a distribution of other points; SimRank, which is based on graph theory, measures the similarity of the structures in which elements occur; and so on. As you can already imagine, selecting and designing the right similarity measure for your problem is more than half of the battle. An impressive overview and evaluation of similarity measures is collected in Chapter 2, Similarity and Dissimilarity Measures, of the book Image Registration: Principles, Tools and Methods by A. A. Goshtasby (2012).

The curse of dimensionality

The curse of dimensionality refers to a situation where we have a large number of features, often hundreds or thousands, which lead to an extremely large space with sparse data and, consequently, to distance anomalies. For instance, in high dimensions, almost all pairs of points are equally distant from each other; in fact, almost all the pairs have distance close to the average distance. Another manifestation of the curse is that any two vectors are almost orthogonal, which means all the angles are close to 90 degrees. This practically makes any distance measure useless.

A cure for the curse of dimensionality might be found in one of the data reduction techniques, where we want to reduce the number of features; for instance, we can run a feature selection algorithm such as ReliefF or feature extraction/reduction algorithm such as PCA.

Clustering

Clustering is a technique for grouping similar instances into clusters according to some distance measure. The main idea is to put instances that are similar (that is, close to each other) into the same cluster, while keeping the dissimilar points (that is, the ones further apart from each other) in different clusters.

The clustering algorithms follow two fundamentally different approaches. The first is a hierarchical or agglomerative approach that first considers each point as its own cluster, and then iteratively merges the most similar clusters together. It stops when further merging reaches a predefined number of clusters or if the clusters to be merged are spread over a large region.

The other approach is based on point assignment. First, initial cluster centers (that is, centroids) are estimated—for instance, randomly—and then, each point is assigned to the closest cluster, until all the points are assigned. The most well-known algorithm in this group is k-means clustering.


k-means clustering picks the initial cluster centers either as points that are as far as possible from one another, or (hierarchically) clusters a sample of data and picks a point that is the closest to the center of each of the k clusters.
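
As a quick preview of what this looks like in code (clustering with Weka is covered properly in Chapter 3), the following hedged sketch runs SimpleKMeans with k=3 on a hypothetical dataset.arff:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff");

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3);     // the number of clusters, k
        kmeans.buildClusterer(data);

        // Print the cluster centroids and assign an instance to a cluster
        System.out.println(kmeans);
        System.out.println("First instance belongs to cluster " +
            kmeans.clusterInstance(data.firstInstance()));
    }
}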

Supervised learning

Supervised learning is the key concept behind amazing things such as voice recognition, e-mail spam filtering, face recognition in photos, and detecting credit card fraud. More formally, given a set D of learning examples described with features, X, the goal of supervised learning is to find a function that predicts a target variable, Y. The function f that describes the relation between features X and class Y is called a model:

f(X) → Y

The general structure of supervised learning algorithms is defined by the following decisions (Hand et al., 2001):

1. Define the task.
2. Decide on the machine learning algorithm, which introduces a specific inductive bias, that is, the a priori assumptions that it makes regarding the target concept.
3. Decide on the score or cost function, for instance, information gain, root mean square error, and so on.
4. Decide on the optimization/search method to optimize the score function.
5. Find a function that describes the relation between X and Y.

Many decisions are already made for us by the type of the task and dataset that we have. In the following sections, we will take a closer look at the classification and regression methods and the corresponding score functions.

Classification

Classification can be applied when we deal with a discrete class, and the goal is to predict one of the mutually exclusive values in the target variable. An example would be credit scoring, where the final prediction is whether the person is credit liable or not. The most popular algorithms include decision trees, the naïve Bayes classifier, support vector machines, neural networks, and ensemble methods.

Decision tree learning

Decision tree learning builds a classification tree, where each node corresponds to one of the attributes, edges correspond to a possible value (or intervals) of the attribute from which the node originates, and each leaf corresponds to a class label. A decision tree can be used to visually and explicitly represent the prediction model, which makes it a very transparent (white box) classifier. Notable algorithms are ID3 and C4.5, although many alternative implementations and improvements (for example, J48 in Weka) exist.
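
As a preview (classification with Weka is covered in detail in Chapter 3), a J48 tree could be trained with a few lines such as the following sketch; the dataset.arff file name is hypothetical and the class is assumed to be the last attribute:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DecisionTreeExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff");
        data.setClassIndex(data.numAttributes() - 1); // last attribute is the class

        J48 tree = new J48();
        tree.buildClassifier(data);

        // The tree can be printed and inspected, which makes it a white-box model
        System.out.println(tree);
    }
}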

Probabilistic classifiers

Given a set of attribute values, a probabilistic classifier is able to predict a distribution over a set of classes, rather than an exact class. This can be used as a degree of certainty, that is, how sure the classifier is in its prediction. The most basic classifier is naïve Bayes, which happens to be the optimal classifier if, and only if, the attributes are conditionally independent. Unfortunately, this is extremely rare in practice.
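
In Weka, for example, every classifier can return such a distribution through its distributionForInstance() method; the following hedged sketch prints the class probabilities that naïve Bayes assigns to the first instance of a hypothetical dataset:

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ProbabilisticExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);

        // Probability distribution over classes for the first instance
        double[] distribution = nb.distributionForInstance(data.firstInstance());
        for (int i = 0; i < distribution.length; i++) {
            System.out.println(data.classAttribute().value(i) + ": " + distribution[i]);
        }
    }
}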

There is an enormous subfield denoted as probabilistic graphical models, comprising hundreds of algorithms, for example, Bayesian networks, dynamic Bayesian networks, hidden Markov models, and conditional random fields, which can handle not only specific relationships between attributes, but also temporal dependencies. Karkera (2014) wrote an excellent introductory book on this topic, Building Probabilistic Graphical Models with Python, while Koller and Friedman (2009) published a comprehensive theory bible, Probabilistic Graphical Models.

Kernel methods

Any linear model can be turned into a non-linear model by applying the kernel trick, that is, by replacing its features (predictors) with a kernel function. In other words, the kernel implicitly transforms our dataset into higher dimensions. The kernel trick leverages the fact that it is often easier to separate the instances in more dimensions. Algorithms capable of operating with kernels include the kernel perceptron, Support Vector Machines (SVM), Gaussian processes, principal component analysis (PCA), canonical correlation analysis, ridge regression, spectral clustering, linear adaptive filters, and many others.

Artificial neural networks

Artificial neural networks are inspired by the structure of biological neural networks and are capable of machine learning as well as pattern recognition. They are commonly used for both regression and classification problems, comprising a wide variety of algorithms and variations for all manner of problem types. Some popular classification methods are the perceptron, the restricted Boltzmann machine (RBM), and deep belief networks.

Ensemble learning

Ensemble methods combine a set of diverse weaker models to obtain better predictive performance. The individual models are trained separately and their predictions are then combined in some way to make the overall prediction. Ensembles, hence, contain multiple ways of modeling the data, which hopefully leads to better results. This is a very powerful class of techniques, and as such, it is very popular; examples include boosting, bagging, AdaBoost, and random forest. The main differences among them are the type of weak learners that are combined and the ways in which to combine them.

Evaluating classification

Is our classifier doing well? Is this better than the other one? In classification, we count how many times we classify something right and wrong. Suppose there are two possible classification labels—yes and no—then there are four possible outcomes, as shown in the next figure:

• True positive—hit: This indicates a yes instance correctly predicted as yes
• True negative—correct rejection: This indicates a no instance correctly predicted as no
• False positive—false alarm: This indicates a no instance predicted as yes
• False negative—miss: This indicates a yes instance predicted as no

                          Predicted as positive?
                          Yes                      No
Really positive?   Yes    TP—true positive         FN—false negative
                   No     FP—false positive        TN—true negative

The two basic performance measures of a classifier are classification error and accuracy, defined as follows:

$$\text{Classification error} = \frac{\text{errors}}{\text{total}} = \frac{FP + FN}{TP + TN + FP + FN}$$

$$\text{Classification accuracy} = 1 - \text{error} = \frac{\text{correct}}{\text{total}} = \frac{TP + TN}{TP + TN + FP + FN}$$


The main problem with these two measures is that they cannot handle unbalanced classes. Classifying whether a credit card transaction is an abuse or not is an example of a problem with unbalanced classes: 99.99% of the transactions are normal and only a tiny percentage are abuses. A classifier that says that every transaction is a normal one is 99.99% accurate, yet we are mainly interested in those few classifications that occur very rarely.

Precision and recall

The solution is to use measures that don't involve TN (correct rejections). Two such measures are as follows:

• Precision: This is the proportion of positive examples correctly predicted as positive (TP) out of all examples predicted as positive (TP + FP):

$$\text{Precision} = \frac{TP}{TP + FP}$$

• Recall: This is the proportion of positive examples correctly predicted as positive (TP) out of all positive examples (TP + FN):

$$\text{Recall} = \frac{TP}{TP + FN}$$

It is common to combine the two and report the F-measure, which takes both precision and recall into account by calculating their harmonic mean, where the score reaches its best value at 1 and its worst at 0, as follows:

$$F\text{-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
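As a quick illustration of the formulas above, the following plain-Java sketch computes accuracy, precision, recall, and F-measure from the four confusion-matrix counts; the counts themselves are made-up numbers used only for the example:

public class ClassificationMetrics {
    public static void main(String[] args) {
        // Made-up confusion-matrix counts for illustration
        double tp = 40, tn = 9940, fp = 10, fn = 10;

        double accuracy  = (tp + tn) / (tp + tn + fp + fn);
        double error     = 1 - accuracy;
        double precision = tp / (tp + fp);
        double recall    = tp / (tp + fn);
        double fMeasure  = 2 * precision * recall / (precision + recall);

        System.out.printf("accuracy=%.4f error=%.4f%n", accuracy, error);
        System.out.printf("precision=%.4f recall=%.4f F-measure=%.4f%n",
            precision, recall, fMeasure);
    }
}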

ROC curves

Most classification algorithms return a classification confidence denoted as f(X), which is, in turn, used to calculate the prediction. Following the credit card abuse example, a rule might look similar to the following:

$$F(X) = \begin{cases} \text{abuse}, & \text{if } f(X) > \text{threshold} \\ \text{not abuse}, & \text{otherwise} \end{cases}$$


The threshold determines both the false positive rate and the true positive rate. The outcomes for all the possible threshold values can be plotted as a Receiver Operating Characteristic (ROC) curve, as shown in the following diagram:

A random predictor is plotted with a red dashed line and a perfect predictor is plotted with a green dashed line. To compare whether classifier A is better than classifier C, we compare the area under the curve. Most of the toolboxes provide all of the previous measures out-of-the-box.
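To illustrate how an ROC curve is traced out, the following sketch computes the true positive rate and false positive rate for a few threshold values; the confidence scores and labels are invented purely for demonstration:

public class RocPoints {
    public static void main(String[] args) {
        // Invented classifier confidences f(X) and true labels (1 = abuse)
        double[] scores = {0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1};
        int[] labels    = {1,   1,   0,   1,    0,   0,   1,   0};

        for (double threshold : new double[]{0.25, 0.5, 0.75}) {
            int tp = 0, fp = 0, fn = 0, tn = 0;
            for (int i = 0; i < scores.length; i++) {
                boolean predictedAbuse = scores[i] > threshold;
                if (predictedAbuse && labels[i] == 1) tp++;
                else if (predictedAbuse) fp++;
                else if (labels[i] == 1) fn++;
                else tn++;
            }
            double tpr = (double) tp / (tp + fn); // true positive rate
            double fpr = (double) fp / (fp + tn); // false positive rate
            System.out.printf("threshold=%.2f -> TPR=%.2f, FPR=%.2f%n",
                threshold, tpr, fpr);
        }
    }
}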

Regression

Regression deals with a continuous target variable, unlike classification, which works with a discrete one. For example, in order to forecast the outside temperature over the following few days, we would use regression, while classification would be used to predict whether it will rain or not. Generally speaking, regression is a process that estimates the relationship among features, that is, how varying a feature changes the target variable.

Linear regression

The most basic regression model assumes a linear dependency between the features and the target variable. The model is often fitted using the least squares approach, that is, the best model minimizes the squares of the errors. In many cases, linear regression is not able to model complex relations. For example, the next figure shows four different sets of points having the same linear regression line: the upper-left model captures the general trend and can be considered a proper model; the lower-left model fits the points much better, except for an outlier, which should be carefully checked; and the upper-right and lower-right linear models completely miss the underlying structure of the data and cannot be considered proper models.

Evaluating regression

In regression, we predict numbers Y from inputs X and the predictions are usually wrong and not exact. The main question that we ask is: by how much? In other words, we want to measure the distance between the predicted and true values.

Mean squared error

Mean squared error is an average of the squared difference between the predicted and true values, as follows:

$$MSE(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \left( f(X_i) - Y_i \right)^2$$

The measure is very sensitive to outliers, for example, 99 exact predictions and one prediction off by 10 is scored the same as all predictions being wrong by 1. Moreover, the measure is sensitive to the mean. Therefore, the relative squared error, which compares the MSE of our predictor to the MSE of the mean predictor (which always predicts the mean value), is often used instead.

Mean absolute error

Mean absolute error is an average of the absolute difference between the predicted and the true values, as follows:

$$MAE(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \left| f(X_i) - Y_i \right|$$

The MAE is less sensitive to outliers, but it is still sensitive to the mean and scale.

Correlation coefficient

The correlation coefficient compares how the predictions deviate from their mean with how the true values deviate from theirs. A negative value means the predictions tend to move in the opposite direction from the true values, a positive value means they tend to move together, and a value close to zero means there is no linear relationship; the closer the absolute value is to 1, the stronger the relationship. The correlation between true values X and predictions Y is defined as follows:

$$CC_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

The CC measure is completely insensitive to the mean and scale, and less sensitive to outliers. It is able to capture the relative ordering, which makes it useful for ranking tasks such as document relevance and gene expression.
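To tie the three regression measures together, here is a small plain-Java sketch that computes MSE, MAE, and the correlation coefficient for a pair of arrays; the true values and predictions are invented sample numbers:

public class RegressionMetrics {
    public static void main(String[] args) {
        // Invented true values and predictions
        double[] y    = {21.0, 18.5, 25.0, 30.2, 27.1};
        double[] pred = {20.5, 19.0, 24.0, 31.0, 26.0};
        int n = y.length;

        double mse = 0, mae = 0, meanY = 0, meanP = 0;
        for (int i = 0; i < n; i++) {
            double diff = pred[i] - y[i];
            mse += diff * diff;
            mae += Math.abs(diff);
            meanY += y[i];
            meanP += pred[i];
        }
        mse /= n; mae /= n; meanY /= n; meanP /= n;

        // Correlation coefficient between true values and predictions
        double cov = 0, varY = 0, varP = 0;
        for (int i = 0; i < n; i++) {
            cov  += (y[i] - meanY) * (pred[i] - meanP);
            varY += (y[i] - meanY) * (y[i] - meanY);
            varP += (pred[i] - meanP) * (pred[i] - meanP);
        }
        double cc = cov / Math.sqrt(varY * varP);

        System.out.printf("MSE=%.3f MAE=%.3f CC=%.3f%n", mse, mae, cc);
    }
}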

Generalization and evaluation

Once the model is built, how do we know it will perform on new data? Is this model any good? To answer these questions, we'll first look into model generalization and then see how to get an estimate of the model performance on new data.

Underfitting and overfitting

Predictor training can lead to models that are too complex or too simple. A model with low complexity (the leftmost models in the figure) can be as simple as predicting the most frequent or mean class value, while a model with high complexity (the rightmost models) can memorize the training instances. Too rigid models, shown on the left-hand side, cannot capture complex patterns, while too flexible models, shown on the right-hand side, fit the noise in the training data. The main challenge is to select the appropriate learning algorithm and its parameters, so that the learned model will perform well on new data (for example, the middle column):


The following figure shows how the error on the training set decreases with model complexity. Simple rigid models underfit the data and have large errors. As model complexity increases, the model describes the underlying structure of the training data better and, consequently, the error decreases. If the model is too complex, it overfits the training data and its prediction error on new data increases again:

Depending on the task complexity and data availability, we want to tune our classifiers toward less or more complex structures. Most learning algorithms allow such tuning, as follows:

• Regression: This is the order of the polynomial
• Naive Bayes: This is the number of the attributes
• Decision trees: This is the number of nodes in the tree, pruning confidence
• k-nearest neighbors: This is the number of neighbors, distance-based neighbor weights
• SVM: This is the kernel type, cost parameter
• Neural network: This is the number of neurons and hidden layers

With tuning, we want to minimize the generalization error, that is, how well the classifier performs on future data. Unfortunately, we can never compute the true generalization error; however, we can estimate it. Nevertheless, if the model performs well on the training data but performance is much worse on the test data, the model most likely overfits.

Train and test sets

To estimate the generalization error, we split our data into two parts: training data and testing data. A general rule of thumb is to split them in a training:testing ratio of 70:30. We first train the predictor on the training data, then predict the values for the test data, and finally compute the error, that is, the difference between the predicted and the true values. This gives us an estimate of the true generalization error.

The estimation is based on the following two assumptions: first, we assume that the test set is an unbiased sample from our dataset; and second, we assume that the actual new data will resemble the distribution of our training and testing examples. The first assumption can be mitigated by cross-validation and stratification. Also, if data is scarce, one can't afford to leave out a considerable amount of data for a separate test set, as learning algorithms do not perform well if they don't receive enough data. In such cases, cross-validation is used instead.

Cross-validation

Cross-validation splits the dataset into k sets of approximately the same size, for example, into five sets as shown in the following figure. First, we use sets 2-5 for learning and set 1 for testing. We then repeat the procedure five times, leaving out one set at a time for testing, and average the error over the five repetitions.

This way, we used all the data for learning and testing as well, while we avoided using the same data to train and test a model.
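A minimal sketch of this procedure using the Weka API (which is introduced in the following chapters) is shown below; the dataset file, the J48 classifier, and the choice of five folds are assumptions made only for illustration:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ManualCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("some-dataset.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // assume the last attribute is the class
        data.randomize(new Random(1));

        int folds = 5;
        double errorSum = 0;
        for (int fold = 0; fold < folds; fold++) {
            Instances train = data.trainCV(folds, fold);
            Instances test  = data.testCV(folds, fold);

            Classifier model = new J48();
            model.buildClassifier(train);

            // Count misclassified test instances in this fold
            int wrong = 0;
            for (int i = 0; i < test.numInstances(); i++) {
                double predicted = model.classifyInstance(test.instance(i));
                if (predicted != test.instance(i).classValue()) wrong++;
            }
            errorSum += (double) wrong / test.numInstances();
        }
        System.out.println("Average error: " + errorSum / folds);
    }
}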

Leave-one-out validation

An extreme example of cross-validation is the leave-one-out validation. In this case, the number of folds is equal to the number of instances; we learn on all but one instance, and then test the model on the omitted instance. We repeat this for all instances, so that each instance is used exactly once for the validation. This approach is recommended when we have a limited set of learning examples, for example, less than 50.

Stratification

Stratification is a procedure to select a subset of instances in such a way that each fold roughly contains the same proportion of class values. When a class is continuous, the folds are selected so that the mean response value is approximately equal in all the folds. Stratification can be applied along with cross-validation or separate training and test sets.

Summary

In this chapter, we refreshed the machine learning basics. We revisited the workflow of applied machine learning and clarified the main tasks, methods, and algorithms. In the next chapter, we will review the kind of Java libraries that are available and the kind of tasks they can perform.


Java Libraries and Platforms for Machine Learning

Implementing machine learning algorithms by yourself is probably the best way to learn machine learning, but you can progress much faster if you stand on the shoulders of giants and leverage one of the existing open source libraries.

This chapter reviews various libraries and platforms for machine learning in Java. The goal is to understand what each library brings to the table and what kinds of problems it is able to solve.

In this chapter, we will cover the following topics:

• The requirement of Java to implement a machine learning app
• Weka, a general purpose machine learning platform
• Java machine learning library, a collection of machine learning algorithms
• Apache Mahout, a scalable machine learning platform
• Apache Spark, a distributed machine learning library
• Deeplearning4j, a deep learning library
• MALLET, a text mining library

We'll also discuss how to architect the complete machine learning app stack for both single-machine and big data apps using these libraries with other components.

The need for Java

New machine learning algorithms are often first scripted at university labs, gluing together several languages such as shell scripting, Python, R, MATLAB, Java, Scala, or C++ to prove a new concept and theoretically analyze its properties. An algorithm might then take a long path of refactoring before it lands in a library with standardized input/output and interfaces. While Python, R, and MATLAB are quite popular, they are mainly used for scripting, research, and experimenting. Java, on the other hand, is the de-facto enterprise language, which could be attributed to static typing, robust IDE support, good maintainability, as well as a decent threading model and high-performance concurrent data structure libraries. Moreover, there are already many Java libraries available for machine learning, which makes it really convenient to interface with them in existing Java applications and leverage powerful machine learning capabilities.

Machine learning libraries

There are over 70 Java-based open source machine learning projects listed on the MLOSS.org website and probably many more unlisted projects live at university servers, GitHub, or Bitbucket. In this section, we will review the major libraries and platforms, the kind of problems they can solve, the algorithms they support, and the kind of data they can work with.

Weka

Weka, which is short for Waikato Environment for Knowledge Analysis, is a machine learning library developed at the University of Waikato, New Zealand, and is probably the most well-known Java library. It is a general-purpose library that is able to solve a wide variety of machine learning tasks, such as classification, regression, and clustering. It features a rich graphical user interface, command-line interface, and Java API. You can check out Weka at http://www.cs.waikato.ac.nz/ml/weka/.


At the time of writing this book, Weka contains 267 algorithms in total: data pre-processing (82), attribute selection (33), classification and regression (133), clustering (12), and association rules mining (7). Graphical interfaces are well-suited for exploring your data, while the Java API allows you to develop new machine learning schemes and use the algorithms in your applications.

Weka is distributed under GNU General Public License (GNU GPL), which means that you can copy, distribute, and modify it as long as you track changes in source files and keep it under GNU GPL. You can even distribute it commercially, but you must disclose the source code or obtain a commercial license.

In addition to several supported file formats, Weka features its own default data format, ARFF, which describes data by attribute-data pairs. It consists of two parts. The first part contains a header, which specifies all the attributes (that is, features) and their type, for instance, nominal, numeric, date, or string. The second part contains the data, where each line corresponds to an instance. The last attribute in the header is implicitly considered as the target variable, and missing data are marked with a question mark. For example, returning to the example from Chapter 1, Applied Machine Learning Quick Start, the Bob instance written in the ARFF file format would be as follows:

@RELATION person_dataset

@ATTRIBUTE `Name` STRING
@ATTRIBUTE `Height` NUMERIC
@ATTRIBUTE `Eye color` {blue, brown, green}
@ATTRIBUTE `Hobbies` STRING

@DATA
'Bob', 185.0, blue, 'climbing, sky diving'
'Anna', 163.0, brown, 'reading'
'Jane', 168.0, ?, ?


The file consists of three sections. The first section starts with the @RELATION keyword, specifying the dataset name. The next section starts with the @ATTRIBUTE keyword, followed by the attribute name and type. The available types are STRING, NUMERIC, DATE, and a set of categorical values. The last attribute is implicitly assumed to be the target variable that we want to predict. The last section starts with the @DATA keyword, followed by one instance per line. Instance values are separated by commas and must follow the same order as the attributes in the second section.
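As a small illustration of working with ARFF programmatically, the following hedged sketch loads an ARFF file and writes it back out using Weka's converter classes; the file names are placeholders:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;

public class ArffRoundTrip {
    public static void main(String[] args) throws Exception {
        // Load an existing ARFF file (the file name is a placeholder)
        Instances data = new DataSource("person_dataset.arff").getDataSet();
        System.out.println(data.numInstances() + " instances loaded.");

        // Write the same dataset back out in ARFF format
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("person_dataset_copy.arff"));
        saver.writeBatch();
    }
}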

More Weka examples will be demonstrated in Chapter 3, Basic Algorithms – Classification, Regression, and Clustering, and Chapter 4, Customer Relationship Prediction with Ensembles.

To learn more about Weka, pick up a quick-start book, Weka How-to by Kaluza (Packt Publishing) to start coding or look into Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations by Witten and Frank (Morgan Kaufmann Publishers) for theoretical background and in-depth explanations.

Weka's Java API is organized in the following top-level packages:

• weka.associations: These are data structures and algorithms for association rules learning, including Apriori, predictive apriori, FilteredAssociator, FP-Growth, Generalized Sequential Patterns (GSP), Hotspot, and Tertius.
• weka.classifiers: These are supervised learning algorithms, evaluators, and data structures. The package is further split into the following components:
    °° weka.classifiers.bayes: This implements Bayesian methods, including naive Bayes, Bayes net, Bayesian logistic regression, and so on
    °° weka.classifiers.evaluation: These are supervised evaluation algorithms for nominal and numerical prediction, such as evaluation statistics, confusion matrix, ROC curve, and so on
    °° weka.classifiers.functions: These are regression algorithms, including linear regression, isotonic regression, Gaussian processes, support vector machine, multilayer perceptron, voted perceptron, and others
    °° weka.classifiers.lazy: These are instance-based algorithms such as k-nearest neighbors, K*, and lazy Bayesian rules

[ 32 ] Chapter 2

    °° weka.classifiers.meta: These are supervised learning meta-algorithms, including AdaBoost, bagging, additive regression, random committee, and so on
    °° weka.classifiers.mi: These are multiple-instance learning algorithms, such as citation k-nn, diverse density, MI AdaBoost, and others
    °° weka.classifiers.rules: These are decision tables and decision rules based on the separate-and-conquer approach, Ripper, Part, Prism, and so on
    °° weka.classifiers.trees: These are various decision tree algorithms, including ID3, C4.5, M5, functional tree, logistic tree, random forest, and so on

• weka.clusterers: These are clustering algorithms, including k-means, Clope, Cobweb, DBSCAN, and FarthestFirst.
• weka.core: These are various utility classes, data presentations, configuration files, and so on.
• weka.datagenerators: These are data generators for classification, regression, and clustering algorithms.
• weka.estimators: These are various data distribution estimators for discrete/nominal domains, conditional probability estimations, and so on.
• weka.experiment: These are a set of classes supporting the necessary configuration, datasets, model setups, and statistics to run experiments.
• weka.filters: These are attribute-based and instance-based selection algorithms for both supervised and unsupervised data preprocessing.
• weka.gui: These are graphical interfaces implementing the Explorer, Experimenter, and Knowledge Flow applications. The Explorer allows you to investigate datasets and algorithms, as well as their parameters, and visualize datasets with scatter plots and other visualizations. The Experimenter is used to design batches of experiments, but it can only be used for classification and regression problems. The Knowledge Flow implements a visual drag-and-drop user interface to build data flows, for example, load data, apply a filter, build a classifier, and evaluate it.

Java machine learning

Java machine learning library, or Java-ML, is a collection of machine learning algorithms with a common interface for algorithms of the same type. It only features a Java API and is therefore primarily aimed at software engineers and programmers. Java-ML contains algorithms for data preprocessing, feature selection, classification, and clustering. In addition, it features several Weka bridges to access Weka's algorithms directly through the Java-ML API. It can be downloaded from http://java-ml.sourceforge.net; at the time of writing this book, the latest release was from 2012.

Java-ML is also a general-purpose machine learning library. Compared to Weka, it offers more consistent interfaces and implementations of recent algorithms that are not present in other packages, such as an extensive set of state-of-the-art similarity measures and feature-selection techniques, for example, dynamic time warping, random forest attribute evaluation, and so on. Java-ML is also available under the GNU GPL license.

Java-ML supports any type of file as long as it contains one data sample per line and the features are separated by a symbol such as a comma, semicolon, or tab.
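To give a feel for the API, the following is a minimal sketch that loads such a file and clusters it; the file name, the class column index, and the choice of three clusters are illustrative assumptions, not prescribed by the library:

import java.io.File;
import net.sf.javaml.clustering.Clusterer;
import net.sf.javaml.clustering.KMeans;
import net.sf.javaml.core.Dataset;
import net.sf.javaml.tools.data.FileHandler;

public class JavaMlSketch {
    public static void main(String[] args) throws Exception {
        // Load a comma-separated file; the class label is assumed to be in column 4
        Dataset data = FileHandler.loadDataset(new File("iris.data"), 4, ",");
        System.out.println(data.size() + " samples loaded.");

        // Cluster the samples into an assumed k = 3 groups
        Clusterer km = new KMeans(3);
        Dataset[] clusters = km.cluster(data);
        for (int i = 0; i < clusters.length; i++) {
            System.out.println("Cluster " + i + ": " + clusters[i].size() + " samples");
        }
    }
}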

The library is organized around the following top-level packages:

• net.sf.javaml.classification: These are classification algorithms, including naive Bayes, random forests, bagging, self-organizing maps, k-nearest neighbors, and so on
• net.sf.javaml.clustering: These are clustering algorithms such as k-means, self-organizing maps, spatial clustering, Cobweb, AQBC, and others
• net.sf.javaml.core: These are classes representing instances and datasets
• net.sf.javaml.distance: These are algorithms that measure instance distance and similarity, for example, Chebyshev distance, cosine distance/similarity, Euclidean distance, Jaccard distance/similarity, Mahalanobis distance, Manhattan distance, Minkowski distance, Pearson correlation coefficient, Spearman's footrule distance, dynamic time warping (DTW), and so on


• net.sf.javaml.featureselection: These are algorithms for feature evaluation, scoring, selection, and ranking, for instance, gain ratio, ReliefF, Kullback-Leibler divergence, symmetrical uncertainty, and so on
• net.sf.javaml.filter: These are methods for manipulating instances by filtering, removing attributes, setting classes or attribute values, and so on
• net.sf.javaml.matrix: This implements in-memory or file-based arrays
• net.sf.javaml.sampling: This implements sampling algorithms to select a subset of a dataset
• net.sf.javaml.tools: These are utility methods for dataset and instance manipulation, serialization, the Weka API interface, and so on
• net.sf.javaml.utils: These are utility methods for algorithms, for example, statistics, math methods, contingency tables, and others

Apache Mahout

The Apache Mahout project aims to build a scalable machine learning library. It is built atop scalable, distributed architectures, such as Hadoop, using the MapReduce paradigm, which is an approach for processing and generating large datasets with a parallel, distributed algorithm using a cluster of servers.

Mahout features a console interface and a Java API for scalable algorithms for clustering, classification, and collaborative filtering. It is able to solve three business problems: item recommendation, for example, recommending items such as people who liked this movie also liked…; clustering, for example, of text documents into groups of topically-related documents; and classification, for example, learning which topic to assign to an unlabeled document.

Mahout is distributed under a commercially-friendly Apache License, which means that you can use it as long as you keep the Apache license included and display it in your program's copyright notice.


Mahout features the following libraries:

• org.apache.mahout.cf.taste: These are collaborative filtering algorithms based on user-based and item-based collaborative filtering and matrix factorization with ALS
• org.apache.mahout.classifier: These are in-memory and distributed implementations, including logistic regression, naive Bayes, random forest, hidden Markov models (HMM), and multilayer perceptron
• org.apache.mahout.clustering: These are clustering algorithms such as canopy clustering, k-means, fuzzy k-means, streaming k-means, and spectral clustering
• org.apache.mahout.common: These are utility methods for algorithms, including distances, MapReduce operations, iterators, and so on
• org.apache.mahout.driver: This implements a general-purpose driver to run main methods of other classes
• org.apache.mahout.ep: This is the evolutionary optimization using the recorded-step mutation
• org.apache.mahout.math: These are various math utility methods and implementations in Hadoop
• org.apache.mahout.vectorizer: These are classes for data presentation, manipulation, and MapReduce jobs
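To make the collaborative filtering part concrete, here is a hedged sketch of a user-based recommender built with the org.apache.mahout.cf.taste package; the ratings file name, its userID,itemID,preference format, and the neighborhood size of 10 are assumptions made purely for illustration:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteSketch {
    public static void main(String[] args) throws Exception {
        // Ratings file in userID,itemID,preference format (the file name is a placeholder)
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // User-based collaborative filtering with an assumed neighborhood of 10 users
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top-3 recommendations for user 1
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}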

Apache Spark

Apache Spark, or simply Spark, is a platform for large-scale data processing built atop Hadoop, but, in contrast to Mahout, it is not tied to the MapReduce paradigm. Instead, it uses in-memory caches to extract a working set of data, process it, and repeat the query. This is reported to be up to ten times as fast as a Mahout implementation that works directly with data stored on disk. It can be grabbed from https://spark.apache.org.

There are many modules built atop Spark, for instance, GraphX for graph processing, Spark Streaming for processing real-time data streams, and MLlib, a machine learning library featuring classification, regression, collaborative filtering, clustering, dimensionality reduction, and optimization.


Spark's MLlib can use a Hadoop-based data source, for example, Hadoop Distributed File System (HDFS) or HBase, as well as local files. The supported data types include the following:

• Local vector is stored on a single machine. Dense vectors are presented as an array of double-typed values, for example, (2.0, 0.0, 1.0, 0.0), while a sparse vector is presented by the size of the vector, an array of indices, and an array of values, for example, [4, (0, 2), (2.0, 1.0)].
• Labeled point is used for supervised learning algorithms and consists of a local vector labeled with a double-typed class value. The label can be a class index, a binary outcome, or a list of multiple class indices (multiclass classification). For example, a labeled dense vector is presented as [1.0, (2.0, 0.0, 1.0, 0.0)].
• Local matrix stores a dense matrix on a single machine. It is defined by the matrix dimensions and a single double array arranged in column-major order.
• Distributed matrix operates on data stored in Spark's Resilient Distributed Dataset (RDD), which represents a collection of elements that can be operated on in parallel. There are three presentations: row matrix, where each row is a local vector that can be stored on a single machine and row indices are meaningless; indexed row matrix, which is similar to row matrix, but the row indices are meaningful, that is, rows can be identified and joins can be executed; and coordinate matrix, which is used when a row cannot be stored on a single machine and the matrix is very sparse.
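The following short sketch shows how the first two data types could be constructed in Java; it is an illustration only and assumes the spark-mllib dependency is on the classpath:

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

public class MllibDataTypes {
    public static void main(String[] args) {
        // Dense local vector (2.0, 0.0, 1.0, 0.0)
        Vector dense = Vectors.dense(2.0, 0.0, 1.0, 0.0);

        // The same vector in sparse form: size 4, non-zero values at indices 0 and 2
        Vector sparse = Vectors.sparse(4, new int[]{0, 2}, new double[]{2.0, 1.0});

        // Labeled point for supervised learning: label 1.0 with a dense feature vector
        LabeledPoint point = new LabeledPoint(1.0, dense);

        System.out.println(dense);
        System.out.println(sparse);
        System.out.println(point);
    }
}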

Spark's MLlib API library provides interfaces to various learning algorithms and utilities as outlined in the following list:

• org.apache.spark.mllib.classification: These are binary and multiclass classification algorithms, including linear SVMs, logistic regression, decision trees, and naive Bayes
• org.apache.spark.mllib.clustering: This is k-means clustering
• org.apache.spark.mllib.linalg: These are data presentations, including dense vectors, sparse vectors, and matrices
• org.apache.spark.mllib.optimization: These are the various optimization algorithms used as low-level primitives in MLlib, including gradient descent, stochastic gradient descent, update schemes for distributed SGD, and limited-memory BFGS
• org.apache.spark.mllib.recommendation: This is model-based collaborative filtering implemented with alternating least squares matrix factorization


• org.apache.spark.mllib.regression: These are regression learning algorithms, such as linear least squares, decision trees, Lasso, and Ridge regression
• org.apache.spark.mllib.stat: These are statistical functions for samples in sparse or dense vector format to compute the mean, variance, minimum, maximum, counts, and nonzero counts
• org.apache.spark.mllib.tree: This implements classification and regression decision tree-learning algorithms
• org.apache.spark.mllib.util: These are a collection of methods to load, save, preprocess, generate, and validate the data

Deeplearning4j

Deeplearning4j, or DL4J, is a deep-learning library written in Java. It features a distributed as well as a single-machine deep-learning framework that includes and supports various neural network structures such as feedforward neural networks, RBM, convolutional neural nets, deep belief networks, autoencoders, and others. DL4J can solve distinct problems, such as identifying faces, voices, spam, or e-commerce fraud.

Deeplearning4j is also distributed under Apache 2.0 license and can be downloaded from http://deeplearning4j.org. The library is organized as follows:

• org.deeplearning4j.base: These are loading classes
• org.deeplearning4j.berkeley: These are math utility methods
• org.deeplearning4j.clustering: This is the implementation of k-means clustering
• org.deeplearning4j.datasets: This is dataset manipulation, including import, creation, iterating, and so on
• org.deeplearning4j.distributions: These are utility methods for distributions
• org.deeplearning4j.eval: These are evaluation classes, including the confusion matrix
• org.deeplearning4j.exceptions: This implements exception handlers
• org.deeplearning4j.models: These are supervised learning algorithms, including deep belief network, stacked autoencoder, stacked denoising autoencoder, and RBM


• org.deeplearning4j.nn: These are the implementations of components and algorithms based on neural networks, such as neural network, multi-layer network, convolutional multi-layer network, and so on
• org.deeplearning4j.optimize: These are neural net optimization algorithms, including back propagation, multi-layer optimization, output layer optimization, and so on
• org.deeplearning4j.plot: These are various methods for rendering data
• org.deeplearning4j.rng: This is a random data generator
• org.deeplearning4j.util: These are helper and utility methods

MALLET

Machine Learning for Language Toolkit (MALLET) is a large library of natural language processing algorithms and utilities. It can be used in a variety of tasks such as document classification, document clustering, information extraction, and topic modeling. It features a command-line interface as well as a Java API for several algorithms such as naive Bayes, HMM, Latent Dirichlet topic models, logistic regression, and conditional random fields.

MALLET is available under the Common Public License 1.0, which means that you can even use it in commercial applications. It can be downloaded from http://mallet.cs.umass.edu. A MALLET instance is represented by a name, label, data, and source. There are two methods to import data into the MALLET format, as shown in the following list:

• Instance per file: Each file, that is, document, corresponds to an instance and MALLET accepts the directory name for the input.
• Instance per line: Each line corresponds to an instance, where the following format is assumed: instance_name label token. The data will be a feature vector, consisting of distinct words that appear as tokens and their occurrence count.


The library comprises the following packages:

• cc.mallet.classify: These are algorithms for training and classifying instances, including AdaBoost, bagging, C4.5, as well as other decision tree models, multivariate logistic regression, naive Bayes, and Winnow2.
• cc.mallet.cluster: These are unsupervised clustering algorithms, including greedy agglomerative, hill climbing, k-best, and k-means clustering.
• cc.mallet.extract: This implements tokenizers, document extractors, document viewers, cleaners, and so on.
• cc.mallet.fst: This implements sequence models, including conditional random fields, HMM, maximum entropy Markov models, and corresponding algorithms and evaluators.
• cc.mallet.grmm: This implements graphical models and factor graphs such as inference algorithms, learning, and testing, for example, loopy belief propagation, Gibbs sampling, and so on.
• cc.mallet.optimize: These are optimization algorithms for finding the maximum of a function, such as gradient ascent, limited-memory BFGS, stochastic meta ascent, and so on.
• cc.mallet.pipe: These are methods used as pipelines to process data into MALLET instances.
• cc.mallet.topics: These are topic modeling algorithms, such as Latent Dirichlet allocation, four-level pachinko allocation, hierarchical PAM, DMRT, and so on.
• cc.mallet.types: This implements fundamental data types such as dataset, feature vector, instance, and label.
• cc.mallet.util: These are miscellaneous utility functions such as command-line processing, search, math, test, and so on.

Comparing libraries

The following table summarizes all the presented libraries. The table is, by no means, exhaustive—there are many more libraries covering specific problem domains. This review should serve as an overview of the big names in the Java machine learning world:

Library | Problem domains | License | Architecture | Algorithms
Weka | General purpose | GNU GPL | Single machine | Decision trees, naive Bayes, neural network, random forest, AdaBoost, hierarchical clustering, and so on
Java-ML | General purpose | GNU GPL | Single machine | k-means clustering, self-organizing maps, Markov chain clustering, Cobweb, random forest, decision trees, bagging, distance measures, and so on
Mahout | Classification, recommendation, and clustering | Apache 2.0 License | Distributed, single machine | Logistic regression, naive Bayes, random forest, HMM, multilayer perceptron, k-means clustering, and so on
Spark | General purpose | Apache 2.0 License | Distributed | SVM, logistic regression, decision trees, naive Bayes, k-means clustering, linear least squares, LASSO, ridge regression, and so on
DL4J | Deep learning | Apache 2.0 License | Distributed, single machine | RBM, deep belief networks, deep autoencoders, recursive neural tensor networks, convolutional neural network, and stacked denoising autoencoders
MALLET | Text mining | Common Public License 1.0 | Single machine | Naive Bayes, decision trees, maximum entropy, hidden Markov models, and conditional random fields

Building a machine learning application

Machine learning applications, especially those focused on classification, usually follow the same high-level workflow as shown in the following diagram. The workflow comprises two phases: training the classifier and classification of new instances. Both phases share common steps as you can see in the following diagram:

First, we use a set of Training data, select a representative subset as the training set, preprocess missing data, and extract features. A selected supervised learning algorithm is used to train a model, which is deployed in the second phase. The second phase puts a new data instance through the same Pre-processing and Feature extraction procedure and applies the learned model to obtain the instance label. If you are able to collect new labeled data, periodically rerun the learning phase to retrain the model, and replace the old one with the retrained one in the classification phase.

Traditional machine learning architecture

Structured data, such as transactional, customer, analytical, and market data, usually resides within a local relational database. Given a query language, such as SQL, we can query the data used for processing, as shown in the workflow in the previous diagram. Usually, all the data can be stored in memory and further processed with a machine learning library such as Weka, Java-ML, or MALLET.

A common practice in architecture design is to create data pipelines, where different steps in the workflow are split. For instance, in order to create a client data record, we might have to scrape the data from different data sources. The record can then be saved in an intermediate database for further processing.

To understand how the high-level aspects of big data architecture differ, let's first clarify when data is considered big.

Dealing with big data

Big data existed long before the phrase was invented. For instance, banks and stock exchanges have been processing billions of transactions daily for years, and airline companies have worldwide real-time infrastructure for operational management of passenger bookings, and so on. So what is big data really? Doug Laney (2001) suggested that big data is defined by three Vs: volume, velocity, and variety. Therefore, to answer the question of whether your data is big, we can translate this into the following three subquestions:

• Volume: Can you store your data in memory?
• Velocity: Can you process new incoming data with a single machine?
• Variety: Is your data from a single source?

If you answered all of these questions with yes, then your data is probably not big. Do not worry, you have just simplified your application architecture.

If your answer to all the questions was no, then your data is big! However, if you have mixed answers, then it's complicated. Some may argue that one V is the most important; others may emphasize the other Vs. From the machine learning point of view, there is a fundamental difference in algorithm implementation between processing the data in memory and processing it from distributed storage. Therefore, a rule of thumb is as follows: if you cannot store your data in memory, then you should look into a big data machine learning library.

The exact answer depends on the problem that you are trying to solve. If you're starting a new project, I'd suggest you start off with a single-machine library and prototype your algorithm, possibly with a subset of your data if the entire data does not fit into the memory. Once you've established good initial results, consider moving to something more heavy duty such as Mahout or Spark.

Big data application architecture

Big data, such as documents, weblogs, social networks, sensor data, and others, are stored in a NoSQL database, such as MongoDB, or a distributed filesystem, such as HDFS. In case we deal with structured data, we can deploy database capabilities using systems such as Cassandra or HBase built atop Hadoop. Data processing follows the MapReduce paradigm, which breaks data processing problems into smaller subproblems and distributes tasks across processing nodes. Machine learning models are finally trained with machine learning libraries such as Mahout and Spark.


MongoDB is a NoSQL database, which stores documents in a JSON-like format. Read more about it at https://www.mongodb.org.

Hadoop is a framework for distributed processing of large datasets across a cluster of computers. It includes its own filesystem format, HDFS, the job scheduling framework YARN, and implements the MapReduce approach for parallel data processing. More about Hadoop is available at http://hadoop.apache.org/.

Cassandra is a distributed database management system built to provide fault-tolerant, scalable, and decentralized storage. More information is available at http://cassandra.apache.org/.

HBase is another database that focuses on random read/write access to distributed storage. More information is available at https://hbase.apache.org/.

Summary

Selecting a machine learning library has an important impact on your application architecture. The key is to consider your project requirements. What kind of data do you have? What kind of problem are you trying to solve? Is your data big? Do you need distributed storage? What kind of algorithm are you planning to use? Once you figure out what you need to solve your problem, pick a library that best fits your needs.

In the next chapter, we will cover how to complete basic machine learning tasks such as classification, regression, and clustering using some of the presented libraries.

Basic Algorithms – Classification, Regression, and Clustering

In the previous chapter, we reviewed the key Java libraries for machine learning and what they bring to the table. In this chapter, we will finally get our hands dirty. We will take a closer look at the basic machine learning tasks such as classification, regression, and clustering. Each of the topics will introduce basic algorithms for classification, regression, and clustering. The example datasets will be small, simple, and easy to understand.

The following is the list of topics that will be covered in this chapter:

• Loading data
• Filtering attributes
• Building classification, regression, and clustering models
• Evaluating models

Before you start

Download the latest version of Weka 3.6 from http://www.cs.waikato.ac.nz/ml/weka/downloading.html.

There are multiple download options available. We'll want to use Weka as a library in our source code, so make sure you skip the self-extracting executables and pick the ZIP archive as shown in the following image. Unzip the archive and locate weka.jar within the extracted archive:

We'll use the Eclipse IDE to show examples, as follows:

1. Start a new Java project.
2. Right-click on the project properties, select Java Build Path, click on the Libraries tab, and select Add External JARs.
3. Navigate to the extracted Weka archive and select the weka.jar file.

That's it, we are ready to implement the basic machine learning techniques!

Classification

We will start with the most commonly used machine learning technique, that is, classification. As we reviewed in the first chapter, the main idea is to automatically build a mapping between the input variables and the outcome. In the following sections, we will look at how to load the data, select features, implement a basic classifier in Weka, and evaluate the classifier performance.

Data

For this task, we will have a look at the ZOO database [ref]. The database contains 101 data entries describing animals with 18 attributes, as shown in the following table:

animal, hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, legs, tail, domestic, catsize, type

An example entry in the dataset is a lion with the following attributes:

• animal: lion
• hair: true
• feathers: false
• eggs: false
• milk: true
• airborne: false
• aquatic: false
• predator: true
• toothed: true
• backbone: true
• breathes: true
• venomous: false
• fins: false
• legs: 4
• tail: true
• domestic: false
• catsize: true
• type: mammal

Our task will be to build a model that predicts the target variable, type (the class of the animal, for example, mammal), given the other attributes as input.

Loading data

Before we start with the analysis, we will load the data in Weka's ARFF format and print the total number of loaded instances. Each data sample is held within an Instance object, while the complete dataset, accompanied by meta-information, is handled by the Instances object.

To load the input data, we will use the DataSource object that accepts a variety of file formats and converts them to Instances:

DataSource source = new DataSource(args[0]);
Instances data = source.getDataSet();
System.out.println(data.numInstances() + " instances loaded.");
// System.out.println(data.toString());

This outputs the number of loaded instances, as follows:

101 instances loaded.

We can also print the complete dataset by calling the data.toString() method.

Our task is to learn a model that is able to predict the type attribute for future examples for which we know the other attributes but do not know the class label. The animal attribute holds the animal's name, which uniquely identifies each instance and carries no predictive value, so we remove it from the training set. We accomplish this by filtering out the animal attribute using the Remove filter.

First, we set a string array of parameters, specifying that the first attribute must be removed. The remaining attributes are used as our dataset for training a classifier:

Remove remove = new Remove();
String[] opts = new String[]{ "-R", "1" };

Finally, we call the Filter.useFilter(Instances, Filter) static method to apply the filter on the selected dataset:

remove.setOptions(opts);
remove.setInputFormat(data);
data = Filter.useFilter(data, remove);
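One detail worth noting, not shown in the listing above: DataSource.getDataSet() does not mark any attribute as the class, and the classifiers used later in this chapter require a class attribute. Assuming that the last attribute (type) is the target, it can be declared explicitly, as in the following sketch:

// Assumption: the last attribute (type) is the target variable.
if (data.classIndex() == -1) {
    data.setClassIndex(data.numAttributes() - 1);
}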

Feature selection

As introduced in Chapter 1, Applied Machine Learning Quick Start, one of the pre-processing steps is focused on feature selection, also known as attribute selection. The goal is to select a subset of relevant attributes that will be used in a learned model. Why is feature selection important? A smaller set of attributes simplifies the models and makes them easier for users to interpret; this usually means shorter training and reduced overfitting.

Attribute selection can take into account the class value or not. In the first case, an attribute selection algorithm evaluates the different subsets of features and calculates a score that indicates the quality of selected attributes. We can use different searching algorithms such as exhaustive search, best first search, and different quality scores such as information gain, Gini index, and so on.

Weka supports this process with an AttributeSelection object, which requires two additional parameters: an evaluator, which computes how informative an attribute is, and a ranker, which sorts the attributes according to the score assigned by the evaluator.

In this example, we will use information gain as an evaluator and rank the features by their information gain score:

InfoGainAttributeEval eval = new InfoGainAttributeEval();
Ranker search = new Ranker();

Next, we initialize an AttributeSelection object and set the evaluator, ranker, and data:

AttributeSelection attSelect = new AttributeSelection();
attSelect.setEvaluator(eval);
attSelect.setSearch(search);
attSelect.SelectAttributes(data);

Finally, we can print an ordered list of attribute indices, as follows:

int[] indices = attSelect.selectedAttributes();
System.out.println(Utils.arrayToString(indices));

The method outputs the following result:

12,3,7,2,0,1,8,9,13,4,11,5,15,10,6,14,16

The most informative attributes are 12 (fins), 3 (eggs), 7 (aquatic), 2 (hair), and so on. Based on this list, we can remove additional, non-informative features in order to help learning algorithms achieve more accurate and faster learning models.


What would make the final decision about the number of attributes to keep? There's no rule of thumb related to an exact number—the number of attributes depends on the data and the problem. The purpose of attribute selection is choosing attributes that serve your model better, so it is better to focus on whether the attributes are improving the model.
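Purely as an illustration, the following hedged sketch continues from the previous listings (it reuses the data and indices variables) and keeps only the five top-ranked attributes plus the class attribute, removing the rest with the same Remove filter used earlier; the cut-off of five is an arbitrary assumption:

int keep = 5; // arbitrary cut-off chosen for illustration
int[] selected = new int[keep + 1];
System.arraycopy(indices, 0, selected, 0, keep);
selected[keep] = data.classIndex(); // always keep the class attribute

Remove keepTop = new Remove();
keepTop.setAttributeIndicesArray(selected);
keepTop.setInvertSelection(true); // keep the listed attributes, drop the rest
keepTop.setInputFormat(data);
Instances reduced = Filter.useFilter(data, keepTop);
System.out.println(reduced.numAttributes() + " attributes kept.");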

Learning algorithms

We have loaded our data, selected the best features, and are ready to learn some classification models. Let's begin with the basic decision trees.

Decision tree in Weka is implemented within the J48 class, which is a re-implementation of Quinlan's famous C4.5 decision tree learner [Quinlan, 1993].

First, we initialize a new J48 decision tree learner. We can pass additional parameters with a string array, for instance, tree pruning that controls the model complexity (refer to Chapter 1, Applied Machine Learning Quick Start). In our case, we will build an un-pruned tree, hence we will pass a single -U parameter:

J48 tree = new J48();
String[] options = new String[1];
options[0] = "-U";

tree.setOptions(options);

Next, we call the buildClassifier(Instances) method to initialize the learning process:

tree.buildClassifier(data);

The built model is now stored in a tree object. We can output the entire J48 unpruned tree calling the toString() method:

System.out.println(tree);

The output is as follows:

J48 unpruned tree
------------------

feathers = false
|   milk = false
|   |   backbone = false
|   |   |   airborne = false
|   |   |   |   predator = false


|   |   |   |   |   legs <= 2: invertebrate (2.0)
|   |   |   |   |   legs > 2: insect (2.0)
|   |   |   |   predator = true: invertebrate (8.0)
|   |   |   airborne = true: insect (6.0)
|   |   backbone = true
|   |   |   fins = false
|   |   |   |   tail = false: amphibian (3.0)
|   |   |   |   tail = true: reptile (6.0/1.0)
|   |   |   fins = true: fish (13.0)
|   milk = true: mammal (41.0)
feathers = true: bird (20.0)

Number of Leaves : 9

Size of the tree : 17

The output tree has 17 nodes in total; 9 of them are terminal nodes (leaves).

Another way to present the tree is to leverage the built-in TreeVisualizer tree viewer, as follows:

TreeVisualizer tv = new TreeVisualizer(null, tree.graph(), new PlaceNode2());
JFrame frame = new javax.swing.JFrame("Tree Visualizer");
frame.setSize(800, 500);
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
frame.getContentPane().add(tv);
frame.setVisible(true);
tv.fitToScreen();


The code results in the following frame:

The decision process starts at the top node, also known as the root node. The node label specifies the attribute whose value will be checked. In our example, we first check the value of the feathers attribute. If feathers are present, we follow the right-hand branch, which leads us to the leaf labeled bird, indicating that there are 20 examples supporting this outcome. If feathers are not present, we follow the left-hand branch, which leads us to the next attribute, milk. We check the value of the attribute again and follow the branch that matches the attribute value. We repeat the process until we reach a leaf node.

We can build other classifiers by following the same steps: initialize a classifier, pass the parameters controlling the model complexity, and call the buildClassifier(Instances) method.

In the next section, we will learn how to use a trained model to assign a class label to a new example whose label is unknown.

Classify new data

Suppose we record attributes for an animal whose label we do not know; we can then predict its label with the learned classification model:

We first construct a feature vector describing the new specimen, as follows:

double[] vals = new double[data.numAttributes()];
vals[0] = 1.0;  //hair {false, true}
vals[1] = 0.0;  //feathers {false, true}
vals[2] = 0.0;  //eggs {false, true}
vals[3] = 1.0;  //milk {false, true}
vals[4] = 0.0;  //airborne {false, true}
vals[5] = 0.0;  //aquatic {false, true}
vals[6] = 0.0;  //predator {false, true}
vals[7] = 1.0;  //toothed {false, true}
vals[8] = 1.0;  //backbone {false, true}
vals[9] = 1.0;  //breathes {false, true}
vals[10] = 1.0; //venomous {false, true}
vals[11] = 0.0; //fins {false, true}
vals[12] = 4.0; //legs INTEGER [0,9]
vals[13] = 1.0; //tail {false, true}
vals[14] = 1.0; //domestic {false, true}
vals[15] = 0.0; //catsize {false, true}
Instance myUnicorn = new Instance(1.0, vals);

Finally, we call the classifyInstance(Instance) method on the model to obtain the class value. The method returns the label index, as follows:

double result = tree.classifyInstance(myUnicorn);
System.out.println(data.classAttribute().value((int) result));

This outputs the mammal class label.

Evaluation and prediction error metrics

We built a model, but we do not know if it can be trusted. To estimate its performance, we can apply a cross-validation technique explained in Chapter 1, Applied Machine Learning Quick Start.

Weka offers an Evaluation class implementing cross validation. We pass the model, data, number of folds, and an initial random seed, as follows:

Classifier cl = new J48();
Evaluation eval_roc = new Evaluation(data);
eval_roc.crossValidateModel(cl, data, 10, new Random(1), new Object[] {});
System.out.println(eval_roc.toSummaryString());

The evaluation results are stored in the Evaluation object.

A mix of the most common metrics can be printed by calling the toSummaryString() method. Note that the output does not differentiate between regression and classification, so pay attention to the metrics that make sense, as follows:

Correctly Classified Instances          93               92.0792 %
Incorrectly Classified Instances         8                7.9208 %
Kappa statistic                          0.8955
Mean absolute error                      0.0225
Root mean squared error                  0.14
Relative absolute error                 10.2478 %
Root relative squared error             42.4398 %
Coverage of cases (0.95 level)          96.0396 %
Mean rel. region size (0.95 level)      15.4173 %
Total Number of Instances              101

In the classification, we are interested in the number of correctly/incorrectly classified instances.

Confusion matrix

Furthermore, we can inspect where a particular misclassification has been made by examining the confusion matrix. The confusion matrix shows how a specific class value was predicted:

double[][] confusionMatrix = eval_roc.confusionMatrix();
System.out.println(eval_roc.toMatrixString());

[ 54 ] Chapter 3

The resulting confusion matrix is as follows:

=== Confusion Matrix ===

  a  b  c  d  e  f  g   <-- classified as
 41  0  0  0  0  0  0 |  a = mammal
  0 20  0  0  0  0  0 |  b = bird
  0  0  3  1  0  1  0 |  c = reptile
  0  0  0 13  0  0  0 |  d = fish
  0  0  1  0  3  0  0 |  e = amphibian
  0  0  0  0  0  5  3 |  f = insect
  0  0  0  0  0  2  8 |  g = invertebrate

The column names in the first row correspond to labels assigned by the classification model. Each additional row then corresponds to an actual true class value. For instance, the second row corresponds to instances with the mammal true class label. In that row, we read that all mammals were correctly classified as mammals. In the fourth row, reptiles, we notice that three were correctly classified as reptiles, while one was classified as fish and one as an insect. The confusion matrix, hence, gives us an insight into the kinds of errors that our classification model makes.
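If per-class numbers are needed, which is exactly what the earlier discussion of unbalanced classes calls for, the same Evaluation object can report precision, recall, F-measure, and ROC area for each class. The following hedged sketch continues from the previous listing and reuses the data and eval_roc variables:

// Per-class measures from the same Evaluation object
for (int c = 0; c < data.numClasses(); c++) {
    System.out.printf("%-15s precision=%.3f recall=%.3f F=%.3f AUC=%.3f%n",
        data.classAttribute().value(c),
        eval_roc.precision(c),
        eval_roc.recall(c),
        eval_roc.fMeasure(c),
        eval_roc.areaUnderROC(c));
}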

Choosing a classification algorithm

Naive Bayes is one of the most simple, efficient, and effective inductive algorithms in machine learning. When features are independent, which is rarely true in the real world, it is theoretically optimal, and even with dependent features, its performance is amazingly competitive (Zhang, 2004). The main disadvantage is that it cannot learn how features interact with each other; for example, despite the fact that you like your tea with lemon or milk, you hate tea that has both of them at the same time.

A decision tree's main advantage is that the model, that is, a tree, is easy to interpret and explain, as we studied in our example. It can handle both nominal and numeric features, and you don't have to worry about whether the data is linearly separable.

Some other examples of classification algorithms are as follows:

• weka.classifiers.rules.ZeroR: This predicts the majority class and is considered as a baseline; that is, if your classifier's performance is worse than the average value predictor, it is not worth considering.
• weka.classifiers.trees.RandomTree: This constructs a tree that considers K randomly chosen attributes at each node.


• weka.classifiers.trees.RandomForest: This constructs a set (that is, a forest) of random trees and uses majority voting to classify a new instance.
• weka.classifiers.lazy.IBk: This is the k-nearest neighbors classifier that is able to select an appropriate number of neighbors based on cross-validation.
• weka.classifiers.functions.MultilayerPerceptron: This is a classifier based on neural networks that uses backpropagation to classify instances. The network can be built by hand, created by an algorithm, or both.
• weka.classifiers.bayes.NaiveBayes: This is a naive Bayes classifier that uses estimator classes, where numeric estimator precision values are chosen based on the analysis of the training data.
• weka.classifiers.meta.AdaBoostM1: This is the class for boosting a nominal class classifier using the AdaBoost M1 method. Only nominal class problems can be tackled. This often dramatically improves the performance, but sometimes it overfits.
• weka.classifiers.meta.Bagging: This is the class for bagging a classifier to reduce the variance. This can perform classification and regression, depending on the base learner.

Any of these classifiers can be dropped into the same build-and-evaluate workflow we used for J48, as shown in the sketch that follows.
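The following is a minimal sketch, assuming the same data object as in the J48 example; RandomForest is used here only as an illustration, and any classifier from the preceding list could be substituted:

import weka.classifiers.trees.RandomForest;
...
Classifier rf = new RandomForest();
Evaluation evalRf = new Evaluation(data);
evalRf.crossValidateModel(rf, data, 10, new Random(1), new Object[]{});
System.out.println(evalRf.toSummaryString());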

Regression

We will explore basic regression algorithms through an analysis of the energy efficiency dataset (Tsanas and Xifara, 2012). We will investigate the heating and cooling load requirements of buildings based on their construction characteristics, such as surface, wall, and roof area, height, glazing area, and compactness. The researchers used a simulator to design 12 different house configurations while varying 18 building characteristics. In total, 768 different buildings were simulated.

Our first goal is to systematically analyze the impact each building characteristic has on the target variable, that is, heating or cooling load. The second goal is to compare the performance of a classical linear regression model against other methods, such as SVM regression, random forests, and neural networks. For this task, we will use the Weka library.

Loading the data Download the energy efficiency dataset from https://archive.ics.uci.edu/ml/datasets/Energy+efficiency.


The dataset is in Excel's XLSX format, which cannot be read by Weka. We can convert it to Comma Separated Values (CSV) format by clicking File | Save As… and picking CSV in the save dialog, as shown in the following screenshot. Confirm that you want to save only the active sheet (since all the others are empty) and that you accept losing some formatting features. Now, the file is ready to be loaded by Weka:

Open the file in a text editor and check whether the file was indeed correctly transformed. There might be some minor issues that could cause problems later. For instance, in my export, each line ended with a double semicolon, as follows:

X1;X2;X3;X4;X5;X6;X7;X8;Y1;Y2;;
0,98;514,50;294,00;110,25;7,00;2;0,00;0;15,55;21,33;;
0,98;514,50;294,00;110,25;7,00;3;0,00;0;15,55;21,33;;

To remove the doubled semicolon, we can use the Find and Replace function: find ";;" and replace it with ";".

The second problem was that my file had a long list of empty lines at the end of the document, which can be simply deleted:

0,62;808,50;367,50;220,50;3,50;5;0,40;5;16,64;16,03;;
;;;;;;;;;;;
;;;;;;;;;;;


Now, we are ready to load the data. Let's open a new file and write a simple data import function using Weka's converter for reading files in CSV format:

import weka.core.Instances;
import weka.core.converters.CSVLoader;
import java.io.File;
import java.io.IOException;

public class EnergyLoad {

public static void main(String[] args) throws IOException {

// load CSV
CSVLoader loader = new CSVLoader();
loader.setSource(new File(args[0]));
Instances data = loader.getDataSet();

System.out.println(data);
    }
}

The data is loaded. Let's move on.

Analyzing attributes

Before we analyze the attributes, let's first try to understand what we are dealing with. In total, there are eight attributes describing building characteristics and two target variables, heating and cooling load, as shown in the following table:

Attribute    Attribute name
X1           Relative compactness
X2           Surface area
X3           Wall area
X4           Roof area
X5           Overall height
X6           Orientation
X7           Glazing area
X8           Glazing area distribution
Y1           Heating load
Y2           Cooling load

Building and evaluating regression model

We will start by learning a model for the heating load, setting the class attribute to the position of the Y1 feature:

data.setClassIndex(data.numAttributes() - 2);

The second target variable—cooling load—can be now removed:

import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
...
//remove last attribute Y2
Remove remove = new Remove();
remove.setOptions(new String[]{"-R", data.numAttributes()+""});
remove.setInputFormat(data);
data = Filter.useFilter(data, remove);

Linear regression

We will start with a basic linear regression model implemented with the LinearRegression class. Similarly to the classification example, we will initialize a new model instance, pass the parameters and data, and invoke the buildClassifier(Instances) method, as follows:

import weka.classifiers.functions.LinearRegression; ...

data.setClassIndex(data.numAttributes() - 2);
LinearRegression model = new LinearRegression();
model.buildClassifier(data);
System.out.println(model);

The learned model, which is stored in the object, can be outputted by calling the toString() method, as follows:

Y1 =

    -64.774  * X1 +
     -0.0428 * X2 +
      0.0163 * X3 +
     -0.089  * X4 +
      4.1699 * X5 +
     19.9327 * X7 +
      0.2038 * X8 +
     83.9329


The linear regression model constructed a function that linearly combines the input variables to estimate the heating load. The coefficient in front of each feature explains that feature's impact on the target variable: the sign corresponds to a positive or negative impact, while the magnitude corresponds to its significance. For instance, the feature X1, relative compactness, is negatively correlated with heating load, while glazing area is positively correlated. These two features also significantly impact the final heating load estimate.
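If the coefficients are needed programmatically, for example, to rank features by the magnitude of their impact, they can be read from the model with the coefficients() method. This is a rough sketch; note that the layout of the returned array (which positions correspond to which attributes and where the intercept is stored) should be checked against the Weka version in use:

// print the raw coefficient array of the learned linear model
double[] coef = model.coefficients();
for (int i = 0; i < coef.length; i++) {
    System.out.println("coefficient " + i + ": " + coef[i]);
}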

The model performance can be similarly evaluated with cross-validation technique.

The 10-fold cross-validation is as follows:

Evaluation eval = new Evaluation(data);
eval.crossValidateModel(model, data, 10, new Random(1), new String[]{});
System.out.println(eval.toSummaryString());

We can output the common evaluation metrics including correlation, mean absolute error, relative absolute error, and so on, as follows:

Correlation coefficient                  0.956
Mean absolute error                      2.0923
Root mean squared error                  2.9569
Relative absolute error                 22.8555 %
Root relative squared error             29.282  %
Total Number of Instances              768

Regression trees

Another approach is to construct a set of regression models, each on its own part of the data. The following diagram shows the main difference between a regression model and a regression tree. A regression model constructs a single model that best fits all the data. A regression tree, on the other hand, constructs a set of regression models, each modeling a part of the data, as shown on the right-hand side. Compared to the regression model, the regression tree can better fit the data, but the function is piecewise linear, with jumps between the modeled regions:


A regression tree in Weka is implemented in the M5P class. Model construction follows the same paradigm: initialize the model, pass the parameters and data, and invoke the buildClassifier(Instances) method.

import weka.classifiers.trees.M5P;
...
M5P md5 = new M5P();
md5.setOptions(new String[]{""});
md5.buildClassifier(data);
System.out.println(md5);

The induced model is a tree with equations in the leaf nodes, as follows:

M5 pruned model tree: (using smoothed linear models)

X1 <= 0.75 :
|   X7 <= 0.175 :
|   |   X1 <= 0.65 : LM1 (48/12.841%)
|   |   X1 >  0.65 : LM2 (96/3.201%)
|   X7 >  0.175 :
|   |   X1 <= 0.65 : LM3 (80/3.652%)
|   |   X1 >  0.65 : LM4 (160/3.502%)
X1 >  0.75 :
|   X1 <= 0.805 : LM5 (128/13.302%)
|   X1 >  0.805 :
|   |   X7 <= 0.175 :
|   |   |   X8 <= 1.5 : LM6 (32/20.992%)
|   |   |   X8 >  1.5 :
|   |   |   |   X1 <= 0.94 : LM7 (48/5.693%)
|   |   |   |   X1 >  0.94 : LM8 (16/1.119%)
|   |   X7 >  0.175 :
|   |   |   X1 <= 0.84 :
|   |   |   |   X7 <= 0.325 : LM9 (20/5.451%)
|   |   |   |   X7 >  0.325 : LM10 (20/5.632%)
|   |   |   X1 >  0.84 :
|   |   |   |   X7 <= 0.325 : LM11 (60/4.548%)
|   |   |   |   X7 >  0.325 :
|   |   |   |   |   X3 <= 306.25 : LM12 (40/4.504%)
|   |   |   |   |   X3 >  306.25 : LM13 (20/6.934%)

[ 61 ] Basic Algorithms – Classification, Regression, and Clustering

LM num: 1
Y1 = 72.2602 * X1 + 0.0053 * X3 + 11.1924 * X7 + 0.429 * X8 - 36.2224

...

LM num: 13
Y1 = 5.8829 * X1 + 0.0761 * X3 + 9.5464 * X7 - 0.0805 * X8 + 2.1492

Number of Rules : 13

The tree has 13 leaves, each corresponding to a linear equation. The preceding output is visualized in the following diagram:


The tree can be read similarly to a classification tree. The most important features are at the top of the tree. A terminal node, or leaf, contains a linear regression model explaining the data that reaches this part of the tree.

Evaluation outputs the following results:

Correlation coefficient                  0.9943
Mean absolute error                      0.7446
Root mean squared error                  1.0804
Relative absolute error                  8.1342 %
Root relative squared error             10.6995 %
Total Number of Instances              768

Tips to avoid common regression problems

First, use prior studies and domain knowledge to figure out which features to include in the regression. Check the literature, reports, and previous studies for the kinds of features that work and for reasonable variables for modeling your problem. If you have a large set of features and random data, it is highly likely that several features will appear correlated with the target variable purely by chance (even though the data is random).

Keep the model simple to avoid overfitting. The Occam's razor principle states that you should select a model that best explains your data with the fewest assumptions. In practice, the model can be as simple as 2-4 predictor features.

Clustering

Compared to supervised classification, the goal of clustering is to identify intrinsic groups in a set of unlabeled data. It can be applied to identify representative examples of homogeneous groups, to find useful and suitable groupings, or to find unusual examples, such as outliers.

We'll demonstrate how to implement clustering by analyzing the Bank dataset. The dataset consists of 11 attributes, describing 600 instances with age, sex, region, income, marital status, children, car ownership status, savings account, current account, mortgage status, and PEP. In our analysis, we will try to identify the common groups of clients by applying Expectation Maximization (EM) clustering.


EM works as follows: given a set of clusters, EM first assigns each instance with a probability distribution of belonging to a particular cluster. For example, if we start with three clusters—A, B, and C—an instance might get the probability distribution of 0.70, 0.10, and 0.20, belonging to the A, B, and C clusters, respectively. In the second step, EM re-estimates the parameter vector of the probability distribution of each class. The algorithm iterates these two steps until the parameters converge or the maximum number of iterations is reached.

The number of clusters to be used in EM can be set either manually or automatically by cross validation. Another approach to determining the number of clusters in a dataset includes the elbow method. The method looks at the percentage of variance that is explained with a specific number of clusters. The method suggests increasing the number of clusters until the additional cluster does not add much information, that is, explains little additional variance.
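Weka does not implement the elbow method directly, but it is easy to approximate it. The following sketch is our own illustration, not part of the Bank analysis below: it uses SimpleKMeans and watches how the within-cluster squared error shrinks as clusters are added, assuming an Instances object named data has already been loaded:

import weka.clusterers.SimpleKMeans;
...
// approximate elbow method: look for the k where the error stops dropping sharply
for (int k = 1; k <= 10; k++) {
    SimpleKMeans km = new SimpleKMeans();
    km.setNumClusters(k);
    km.buildClusterer(data);
    System.out.println(k + " clusters: squared error = " + km.getSquaredError());
}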

Clustering algorithms The process of building a cluster model is quite similar to the process of building a classification model, that is, load the data and build a model. Clustering algorithms are implemented in the weka.clusterers package, as follows:

import java.io.BufferedReader;
import java.io.FileReader;

import weka.core.Instances;
import weka.clusterers.EM;

public class Clustering {

public static void main(String args[]) throws Exception{

//load data
Instances data = new Instances(new BufferedReader(new FileReader(args[0])));

// new instance of clusterer
EM model = new EM();
// build the clusterer
model.buildClusterer(data);
System.out.println(model);

} }


The model identified the following six clusters:

EM
==

Number of clusters selected by cross validation: 6

                  Cluster
Attribute           0        1        2        3        4        5
                  (0.1)   (0.13)   (0.26)   (0.25)   (0.12)   (0.14)
====================================================================
age
  0_34           10.0535  51.8472 122.2815  12.6207   3.1023   1.0948
  35_51          38.6282  24.4056  29.6252  89.4447  34.5208   3.3755
  52_max         13.4293   6.693    6.3459  50.8984  37.861   81.7724
  [total]        62.1111  82.9457 158.2526 152.9638  75.4841  86.2428
sex
  FEMALE         27.1812  32.2338  77.9304  83.5129  40.3199  44.8218
  MALE           33.9299  49.7119  79.3222  68.4509  34.1642  40.421
  [total]        61.1111  81.9457 157.2526 151.9638  74.4841  85.2428
region
  INNER_CITY     26.1651  46.7431  73.874   60.1973  33.3759  34.6445
  TOWN           24.6991  13.0716  48.4446  53.1731  21.617   17.9946
...

The table can be read as follows: the first line indicates six clusters, while the first column shows the attributes and their ranges. For example, the attribute age is split into three ranges: 0-34, 35-51, and 52-max. The numbers in the cluster columns indicate how many instances fall into the specific range in each cluster; for example, clients in the 0-34 age group are mostly in cluster #2 (122 instances).
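Once the model is built, we can also ask it where an individual client belongs. A short sketch, assuming the model and data objects from the preceding code; clusterInstance() returns the most likely cluster, while distributionForInstance() returns the full membership probabilities:

// inspect cluster assignments for the first few clients
for (int i = 0; i < 5; i++) {
    int cluster = model.clusterInstance(data.instance(i));
    double[] dist = model.distributionForInstance(data.instance(i));
    System.out.println("client " + i + " -> cluster " + cluster
        + ", membership: " + java.util.Arrays.toString(dist));
}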

Evaluation

A clustering algorithm's quality can be estimated using the log likelihood measure, which measures how consistent the identified clusters are. The dataset is split into multiple folds and clustering is run with each fold. The motivation here is that if the clustering algorithm assigns a high probability to similar data that wasn't used to fit the parameters, then it has probably done a good job of capturing the data structure. Weka offers the ClusterEvaluation class to estimate it, as follows:

double logLikelihood = ClusterEvaluation.crossValidateModel(model, data, 10, new Random(1));
System.out.println(logLikelihood);

It has the following output:

-8.773410259774291

Summary

In this chapter, you learned how to implement basic machine learning tasks with Weka: classification, regression, and clustering. We briefly discussed the attribute selection process, trained models, and evaluated their performance.

The next chapter will focus on how to apply these techniques to solve real-life problems, such as customer retention.

Customer Relationship Prediction with Ensembles

Any type of company offering a service, product, or experience needs a solid understanding of its relationship with its customers; therefore, Customer Relationship Management (CRM) is a key element of modern marketing strategies. One of the biggest challenges that businesses face is the need to understand exactly what causes a customer to buy new products.

In this chapter, we will work on a real-world marketing database provided by the French telecom company, Orange. The task will be to estimate the following likelihoods for customer actions:

• Switch provider (churn)
• Buy new products or services (appetency)
• Buy upgrades or add-ons proposed to them to make the sale more profitable (upselling)

We will tackle the Knowledge Discovery and Data Mining (KDD) Cup 2009 challenge (KDD Cup, 2009) and show the steps to process the data using Weka. First, we will parse and load the data and implement the basic baseline models. Later, we will address advanced modeling techniques, including data pre-processing, attribute selection, model selection, and evaluation.


KDD Cup is the leading data mining competition in the world. It is organized annually by ACM Special Interest Group on Knowledge Discovery and Data Mining. The winners are announced at the Conference on Knowledge Discovery and Data Mining, which is usually held in August. Yearly archives, including all the corresponding datasets, are available here: http://www.kdd.org/kdd-cup.

Customer relationship database

The most practical way to build knowledge on customer behavior is to produce scores that explain a target variable such as churn, appetency, or upselling. The score is computed by a model using input variables describing customers, for example, current subscription, purchased devices, consumed minutes, and so on. The scores are then used by the information system, for example, to provide relevant personalized marketing actions.

In 2009, the conference on KDD organized a machine learning challenge on customer-relationship prediction (KDD Cup, 2009).

Challenge

Given a large set of customer attributes, the task was to estimate the following three target variables (KDD Cup, 2009):

• Churn probability, in our context, is the likelihood that a customer will switch providers:

Churn rate is also sometimes called attrition rate. It is one of the two primary factors that determine the steady-state level of customers a business will support. In its broadest sense, churn rate is a measure of the number of individuals or items moving into or out of a collection over a specific period of time. The term is used in many contexts, but is most widely applied in business with respect to a contractual customer base. For instance, it is an important factor for any business with a subscriber-based service model, including mobile telephone networks and pay TV operators. The term is also used to refer to participant turnover in peer-to-peer networks.


• Appetency probability, in our context, is the propensity to buy a service or product
• Upselling probability is the likelihood that a customer will buy an add-on or upgrade:

Upselling is a sales technique whereby a salesman attempts to have the customer purchase more expensive items, upgrades, or other add-ons in an attempt to make a more profitable sale. Upselling usually involves marketing more profitable services or products, but upselling can also be simply exposing the customer to other options he or she may not have considered previously. Upselling can imply selling something additional, or selling something that is more profitable or otherwise preferable for the seller instead of the original sale.

The challenge was to beat the in-house system developed by Orange Labs. This was an opportunity for the participants to prove that they could handle a large database, including heterogeneous noisy data and unbalanced class distributions.

Dataset

For the challenge, the company Orange released a large dataset of customer data, containing about one million customers, described in ten tables with hundreds of fields. In the first step, they resampled the data to select a less unbalanced subset containing 100,000 customers. In the second step, they used an automatic feature construction tool that generated 20,000 features describing customers, which was then narrowed down to 15,000 features. In the third step, the dataset was anonymized by randomizing the order of features, discarding attribute names, replacing nominal variables with randomly generated strings, and multiplying continuous attributes by a random factor. Finally, all the instances were split randomly into a train and test dataset.

The KDD Cup provided two sets of data: large set and small set, corresponding to fast and slow challenge, respectively. They are described at the KDD Cup site as follows:

Both training and test sets contain 50,000 examples. The data are split similarly for the small and large versions, but the samples are ordered differently within the training and within the test sets. Both small and large datasets have numerical and categorical variables. For the large dataset, the first 14,740 variables are numerical and the last 260 are categorical. For the small dataset, the first 190 variables are numerical and the last 40 are categorical.


In this chapter, we will work with the small dataset consisting of 50,000 instances, each described with 230 variables. Each of the 50,000 rows of data corresponds to a client and is associated with three binary outcomes, one for each of the three challenges (upselling, churn, and appetency).

To make this clearer, the following image illustrates the dataset in a table format:

The table depicts the first 25 instances, that is, customers, each described with 230 attributes. For this example, only a selected subset of 10 attributes is shown. The dataset contains many missing values and even empty or constant attributes. The last three columns of the table correspond to the three distinct class labels corresponding to the ground truth, that is, whether the customer indeed switched the provider (churn), bought a service (appetency), or bought an upgrade (upselling). However, note that the labels are provided separately from the data in three distinct files; hence, it is essential to retain the order of the instances and the corresponding class labels to ensure proper correspondence.

Evaluation

The submissions were evaluated according to the arithmetic mean of the area under the ROC curve (AUC) for the three tasks, that is, churn, appetency, and upselling. The ROC curve shows the performance of a model as a curve obtained by plotting sensitivity against 1 - specificity for the various threshold values used to determine the classification result (refer to Chapter 1, Applied Machine Learning Quick Start, section ROC curves). The AUC is the area under this curve: the larger the area, the better the classifier. Most toolboxes, including Weka, provide an API to calculate the AUC score.

Basic naive Bayes classifier baseline

As per the rules of the challenge, the participants had to outperform the basic naive Bayes classifier, which makes the assumption that features are independent (refer to Chapter 1, Applied Machine Learning Quick Start), to qualify for prizes.

The KDD Cup organizers ran the vanilla naive Bayes classifier, without any feature selection or hyperparameter adjustment. For the large dataset, the overall scores of the naive Bayes classifier on the test set were as follows:

• Churn problem: AUC = 0.6468
• Appetency problem: AUC = 0.6453
• Upselling problem: AUC = 0.7211

Note that the baseline results are reported for the large dataset only. Moreover, while both the training and test datasets are provided at the KDD Cup site, the actual true labels for the test set are not. Therefore, when we process the data with our models, there is no way to know how well the models will perform on the test set. What we will do is use only the training data and evaluate our models with cross validation. The results will not be directly comparable, but we will nevertheless have an idea of what a reasonable magnitude of the AUC score is.

Getting the data

At the KDD Cup web page (http://kdd.org/kdd-cup/view/kdd-cup-2009/Data), you should see a page that looks like the following screenshot. First, under the Small version (230 var.) header, download orange_small_train.data.zip. Next, download the three sets of true labels associated with this training data. The following files are found under the Real binary targets (small) header:

• orange_small_train_appetency.labels
• orange_small_train_churn.labels
• orange_small_train_upselling.labels

Save and unzip all the files marked in the red boxes, as shown in the following screenshot:


In the following sections, we will first load the data into Weka and apply basic modeling with the naive Bayes to obtain our own baseline AUC scores. Later, we will look into more advanced modeling techniques and tricks.

Loading the data

We will load the data into Weka directly from the CSV format. For this purpose, we will write a function that accepts the path to the data file and the path to the true labels file. The function will load and merge both datasets and remove empty attributes:

public static Instances loadData(String pathData, String pathLabels) throws Exception {

First, we load the data using the CSVLoader() class. Additionally, we specify the \t tab as a field separator and force the last 40 attributes to be parsed as nominal:

// Load data
CSVLoader loader = new CSVLoader();
loader.setFieldSeparator("\t");
loader.setNominalAttributes("191-last");
loader.setSource(new File(pathData));
Instances data = loader.getDataSet();

The CSVLoader class accepts many additional parameters specifying the column separator, string enclosures, whether a header row is present, and so on. Complete documentation is available here: http://weka.sourceforge.net/doc.dev/weka/core/converters/CSVLoader.html

Next, some of the attributes do not contain a single value, and Weka automatically recognizes them as String attributes. We do not actually need them, so we can safely remove them using the RemoveType filter. Additionally, we specify the -T parameter, which means remove attributes of a specific type, along with the attribute type that we want to remove:

// remove empty attributes identified as String attributes
RemoveType removeString = new RemoveType();
removeString.setOptions(new String[]{"-T", "string"});
removeString.setInputFormat(data);
Instances filteredData = Filter.useFilter(data, removeString);

Alternatively, we could use the void deleteStringAttributes() method implemented in the Instances class, which has the same effect, for example, data.deleteStringAttributes().


Now, we will load and assign class labels to the data. We will again utilize CSVLoader, where we specify that the file does not have any header line, that is, setNoHeaderRowPresent(true):

// Load labels
loader = new CSVLoader();
loader.setFieldSeparator("\t");
loader.setNoHeaderRowPresent(true);
loader.setNominalAttributes("first-last");
loader.setSource(new File(pathLabels));
Instances labels = loader.getDataSet();

Once we have loaded both files, we can merge them by calling the Instances.mergeInstances(Instances, Instances) static method. The method returns a new dataset that has all the attributes from the first dataset plus the attributes from the second. Note that the number of instances in both datasets must be the same:

// Append label as class value
Instances labeledData = Instances.mergeInstances(filteredData, labels);

Finally, we set the last attribute, that is, the label attribute that we have just added, as a target variable and return the resulting dataset:

// set the label attribute as class
labeledData.setClassIndex(labeledData.numAttributes() - 1);

System.out.println(labeledData.toSummaryString());
return labeledData;
}

The function outputs a summary as shown in the following and returns the labeled dataset:

Relation Name:  orange_small_train.data-weka.filters.unsupervised.attribute.RemoveType-Tstring_orange_small_train_churn.labels.txt
Num Instances:  50000
Num Attributes: 215

     Name   Type  Nom  Int  Real      Missing      Unique  Dist
1    Var1   Num    0%   1%    0%  49298 / 99%     8 /  0%    18
2    Var2   Num    0%   2%    0%  48759 / 98%     1 /  0%     2
3    Var3   Num    0%   2%    0%  48760 / 98%   104 /  0%   146
4    Var4   Num    0%   3%    0%  48421 / 97%     1 /  0%     4
...

Basic modeling

In this section, we will implement our own baseline model by following the approach that the KDD Cup organizers took. However, before we get to the model, let's first implement the evaluation engine that will return the AUC on all three problems.

Evaluating models

Now, let's take a closer look at the evaluation function. The evaluation function accepts an initialized model, cross-validates the model on all three problems, and reports the results as an area under the ROC curve (AUC), as follows:

public static double[] evaluate(Classifier model) throws Exception {

double results[] = new double[4];

String[] labelFiles = new String[]{ "churn", "appetency", "upselling"};

double overallScore = 0.0;
for (int i = 0; i < labelFiles.length; i++) {

First, we call the Instances loadData(String, String) function that we implemented earlier to load the training data and merge it with the selected labels:

// Load data
Instances train_data = loadData(
    path + "orange_small_train.data",
    path + "orange_small_train_" + labelFiles[i] + ".labels.txt");

Next, we initialize the weka.classifiers.Evaluation class and pass our dataset (the dataset is used only to extract data properties, the actual data are not considered). We call the void crossValidateModel(Classifier, Instances, int, Random) method to begin cross validation and select to create five folds. As validation is done on random subsets of the data, we need to pass a random seed as well:

// cross-validate the data
Evaluation eval = new Evaluation(train_data);
eval.crossValidateModel(model, train_data, 5, new Random(1));


After the evaluation completes, we read the results by calling the double areaUnderROC(int) method. As the metric depends on the target value that we are interested in, the method expects a class value index, which can be extracted by searching for the index of the "1" value in the class attribute:

// Save results
results[i] = eval.areaUnderROC(
    train_data.classAttribute().indexOfValue("1"));
overallScore += results[i];
}

Finally, the results are averaged and returned:

// Get average results over all three problems
results[3] = overallScore / 3;
return results;
}

Implementing naive Bayes baseline

Now that we have all the ingredients, we can replicate the naive Bayes approach that we are expected to outperform. This approach will not include any additional data pre-processing, attribute selection, or model selection. As we do not have the true labels for the test data, we will apply five-fold cross validation to evaluate the model on the small dataset.

First, we initialize a naive Bayes classifier, as follows:

Classifier baselineNB = new NaiveBayes();

Next, we pass the classifier to our evaluation function, which loads the data and applies cross validation. The function returns an area under the ROC curve score for all three problems and overall results:

double resNB[] = evaluate(baselineNB);
System.out.println("Naive Bayes\n" +
    "\tchurn:     " + resNB[0] + "\n" +
    "\tappetency: " + resNB[1] + "\n" +
    "\tup-sell:   " + resNB[2] + "\n" +
    "\toverall:   " + resNB[3] + "\n");

In our case, the model achieves the following results:

Naive Bayes
    churn:     0.5897891153549814
    appetency: 0.630778394752436


    up-sell:   0.6686116692438094
    overall:   0.6297263931170756

These results will serve as a baseline when we tackle the challenge with more advanced modeling. If we process the data with significantly more sophisticated, time-consuming, and complex techniques, we expect the results to be much better. Otherwise, we are simply wasting the resources. In general, when solving machine learning problems, it is always a good idea to create a simple baseline classifier that serves us as an orientation point.

Advanced modeling with ensembles

In the previous section, we implemented an orientation baseline, so let's now focus on the heavy machinery. We will follow the approach taken by the KDD Cup 2009 winning solution developed by the IBM Research team (Niculescu-Mizil and others, 2009).

Their strategy to address the challenge was using the Ensemble Selection algorithm (Caruana and Niculescu-Mizil, 2004). This is an ensemble method, which means it constructs a series of models and combines their output in a specific way to provide the final classification. It has several desirable properties as shown in the following list that make it a good fit for this challenge:

• It was proven to be robust, yielding excellent performance
• It can be optimized for a specific performance metric, including AUC
• It allows different classifiers to be added to the library
• It is an anytime method, meaning that, if we run out of time, we have a solution available

In this section, we will loosely follow the steps as described in their report. Note, this is not an exact implementation of their approach, but rather a solution overview that will include the necessary steps to dive deeper.

The general overview of steps is as follows:

1. First, we will preprocess the data by removing attributes that clearly do not bring any value, for example, all missing or constant values; fixing missing values in order to help machine learning algorithms that cannot deal with them; and converting categorical attributes to numerical ones.
2. Next, we will run an attribute selection algorithm to select only a subset of attributes that can help with the prediction task.
3. In the third step, we will instantiate the Ensemble Selection algorithm with a wide variety of models and, finally, evaluate its performance.

Before we start

For this task, we will need an additional Weka package, ensembleLibrary. Weka 3.7.2 and higher versions support external packages developed mainly by the academic community. A list of Weka packages is available at http://weka.sourceforge.net/packageMetaData, as shown in the following screenshot:

Find and download the latest available version of the ensembleLibrary package at http://prdownloads.sourceforge.net/weka/ensembleLibrary1.0.5.zip?download.

After you unzip the package, locate ensembleLibrary.jar and import it to your code, as follows:

import weka.classifiers.meta.EnsembleSelection;

Data pre-processing

First, we will utilize Weka's built-in weka.filters.unsupervised.attribute.RemoveUseless filter, which works exactly as its name suggests. It removes attributes that do not vary much, for instance, all constant attributes, as well as attributes that vary too much, almost at random. The maximum variance, which is applied only to nominal attributes, is specified with the -M parameter. The default parameter is 99%, which means that if more than 99% of all instances have unique attribute values, the attribute is removed, as follows:

RemoveUseless removeUseless = new RemoveUseless();
removeUseless.setOptions(new String[] { "-M", "99" }); // threshold
removeUseless.setInputFormat(data);
data = Filter.useFilter(data, removeUseless);

Next, we will replace all the missing values in the dataset with the modes (for nominal attributes) and means (for numeric attributes) from the training data, using the weka.filters.unsupervised.attribute.ReplaceMissingValues filter. In general, missing value replacement should be approached with caution, taking into consideration the meaning and context of the attributes:

ReplaceMissingValues fixMissing = new ReplaceMissingValues();
fixMissing.setInputFormat(data);
data = Filter.useFilter(data, fixMissing);

Finally, we will discretize the numeric attributes, that is, transform them into intervals, using the weka.filters.unsupervised.attribute.Discretize filter. With the -B option, we split the numeric attributes into four intervals, and the -R option specifies the range of attributes (only numeric attributes will be discretized):

Discretize discretizeNumeric = new Discretize();
discretizeNumeric.setOptions(new String[] {
    "-B", "4",            // no of bins
    "-R", "first-last"}); // range of attributes
discretizeNumeric.setInputFormat(data);
data = Filter.useFilter(data, discretizeNumeric);

Attribute selection

In the next step, we will select only the informative attributes, that is, attributes that are more likely to help with prediction. A standard approach to this problem is to check the information gain carried by each attribute. We will use the weka.attributeSelection.AttributeSelection filter, which requires two additional components: an evaluator, that is, how attribute usefulness is calculated, and a search algorithm, that is, how a subset of attributes is selected.

In our case, we first initialize weka.attributeSelection.InfoGainAttributeEval, which implements the calculation of information gain:

InfoGainAttributeEval eval = new InfoGainAttributeEval();
Ranker search = new Ranker();

To select only the top attributes above some threshold, we initialize weka.attributeSelection.Ranker to rank the attributes with information gain above a specific threshold. We specify this with the -T parameter, keeping the value of the threshold low so that we keep attributes with at least some information:

search.setOptions(new String[] { "-T", "0.001" });

The general rule for setting this threshold is to sort the attributes by information gain and pick the threshold where the information gain drops to a negligible value.

Next, we can initialize the AttributeSelection class, set the evaluator and ranker, and apply the attribute selection to our dataset:

AttributeSelection attSelect = new AttributeSelection();
attSelect.setEvaluator(eval);
attSelect.setSearch(search);

// apply attribute selection
attSelect.SelectAttributes(data);


Finally, we remove the attributes that were not selected in the last run by calling the reduceDimensionality(Instances) method.

// remove the attributes not selected in the last run data = attSelect.reduceDimensionality(data);

At the end, we are left with 214 out of 230 attributes.
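To sanity-check the 0.001 cut-off, the ranked information gains can also be printed and inspected, following the rule of thumb mentioned above. A short sketch, assuming the attSelect object after SelectAttributes(data) has been called:

// each row holds {attribute index, information gain}, sorted best first
double[][] ranked = attSelect.rankedAttributes();
for (double[] entry : ranked) {
    System.out.println("attribute " + (int) entry[0] + ": info gain = " + entry[1]);
}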

Model selection

Over the years, practitioners in the field of machine learning have developed a wide variety of learning algorithms and improvements to existing ones. There are so many unique supervised learning methods that it is challenging to keep track of all of them. As the characteristics of datasets vary, no one method is the best in all cases, but different algorithms are able to take advantage of different characteristics and relationships of a given dataset. This is the property that the Ensemble Selection algorithm tries to leverage (Jung, 2005):

Intuitively, the goal of ensemble selection algorithm is to automatically detect and combine the strengths of these unique algorithms to create a sum that is greater than the parts. This is accomplished by creating a library that is intended to be as diverse as possible to capitalize on a large number of unique learning approaches. This paradigm of overproducing a huge number of models is very different from more traditional ensemble approaches. Thus far, our results have been very encouraging.

First, we need to create the model library by initializing the weka.classifiers.EnsembleLibrary class, which will help us define the models:

EnsembleLibrary ensembleLib = new EnsembleLibrary();

Next, we add the models and their parameters to the library as strings; for example, we can add decision tree learners with different parameters, as follows:

ensembleLib.addModel("weka.classifiers.trees.J48 -S -C 0.25 -B -M 2");
ensembleLib.addModel("weka.classifiers.trees.J48 -S -C 0.25 -B -M 2 -A");


If you are familiar with the Weka graphical interface, you can also explore the algorithms and their configurations there and copy the configuration as shown in the following screenshot: right-click on the algorithm name and navigate to Edit configuration | Copy configuration string:

To complete the example, we added the following algorithms and their parameters:

• Naive Bayes that was used as default baseline: ensembleLib.addModel("weka.classifiers.bayes.NaiveBayes");

• k-nearest neighbors from the lazy models package: ensembleLib.addModel("weka.classifiers.lazy.IBk");

• Logistic regression as simple logistic with default parameters: ensembleLib.addModel("weka.classifiers.functions.SimpleLogistic");

• Support vector machines with default parameters: ensembleLib.addModel("weka.classifiers.functions.SMO");

• AdaBoost, which is an ensemble method itself: ensembleLib.addModel("weka.classifiers.meta.AdaBoostM1");


• LogitBoost, an ensemble method based on logistic regression: ensembleLib.addModel("weka.classifiers.meta.LogitBoost");

• Decision stump, a one-level decision tree that is often used as a base learner in ensemble methods:

ensembleLib.addModel("classifiers.trees.DecisionStump");

As the EnsembleLibrary implementation is primarily focused on GUI and console users, we have to save the models into a file by calling the saveLibrary(File, EnsembleLibrary, JComponent) method, as follows:

EnsembleLibrary.saveLibrary(new File(path + "ensembleLib.model.xml"), ensembleLib, null);
System.out.println(ensembleLib.getModels());

Next, we can initialize the Ensemble Selection algorithm by instantiating the weka.classifiers.meta.EnsembleSelection class. Let's first review the following method options:

• -L: This specifies the modelLibrary file, containing the list of all models.
• -W: This specifies the working directory, where all the models will be stored.
• -B: This sets the number of bags, that is, the number of iterations to run the Ensemble Selection algorithm.
• -E: This sets the ratio of library models that will be randomly chosen to populate each bag of models.
• -V: This sets the ratio of the training dataset that will be reserved for validation.
• -H: This sets the number of hill climbing iterations to be performed on each model bag.
• -I: This sets the ratio of the ensemble library that the sort initialization algorithm will be able to choose from while initializing the ensemble for each model bag.
• -X: This sets the number of cross validation folds.
• -P: This specifies the metric that will be used for model selection during the hill climbing algorithm. Valid metrics are accuracy, rmse, roc, precision, recall, fscore, and all.


• -A: This specifies the algorithm to be used for ensemble selection. Valid algorithms are forward (default) for forward selection, backward for backward elimination, both for both forward and backward elimination, best to simply print the top performer from the ensemble library, and library to only train the models in the ensemble library.
• -R: This flags whether or not models can be selected more than once for an ensemble.
• -G: This states whether the sort initialization greedily stops adding models when the performance degrades.
• -O: This is a flag for verbose output. This prints the performance of all the selected models.
• -S: This is the random number seed (default 1).
• -D: If set, the classifier is run in debug mode and may output additional information to the console.

We initialize the algorithm with the following initial parameters, where we specified optimizing the ROC metric:

EnsembleSelection ensambleSel = new EnsembleSelection();
ensambleSel.setOptions(new String[]{
    "-L", path + "ensembleLib.model.xml", // model library file
    "-W", path + "esTmp",  // working directory
    "-B", "10",            // number of model bags
    "-E", "1.0",           // ratio of library models per bag
    "-V", "0.25",          // ratio of training data reserved for validation
    "-H", "100",           // number of hill climbing iterations
    "-I", "1.0",           // sort initialization ratio
    "-X", "2",             // number of cross validation folds
    "-P", "roc",           // metric to optimize
    "-A", "forward",       // forward selection algorithm
    "-R", "true",          // allow models to be selected more than once
    "-G", "true",          // stop adding models when performance degrades
    "-O", "true",          // verbose output
    "-S", "1",             // random number seed
    "-D", "true"           // run in debug mode
});

Performance evaluation

The evaluation is heavy both computationally and memory-wise, so make sure that you initialize the JVM with extra heap space (for instance, java -Xmx16g). The computation can take a couple of hours or days, depending on the number of algorithms you include in the model library. This example took 4 hours and 22 minutes on a 12-core Intel Xeon E5-2420 CPU with 32 GB of memory, utilizing 10% CPU and 6 GB of memory on average.

We call our evaluation method and output the results, as follows:

double resES[] = evaluate(ensambleSel);
System.out.println("Ensemble Selection\n" +
    "\tchurn:     " + resES[0] + "\n" +
    "\tappetency: " + resES[1] + "\n" +
    "\tup-sell:   " + resES[2] + "\n" +
    "\toverall:   " + resES[3] + "\n");

The specific set of classifiers in the model library achieved the following result:

Ensemble Selection
    churn:     0.7109874158176481
    appetency: 0.786325687118347
    up-sell:   0.8521363243575182
    overall:   0.7831498090978378

Overall, the approach has brought us to a significant improvement of more than 15 percentage points compared to the initial baseline that we designed at the beginning of the chapter. While it is hard to give a definite answer, the improvement was mainly due to three factors: data pre-processing and attribute selection, exploration of a large variety of learning methods, and use of an ensemble-building technique that is able to take advantage of the variety of base classifiers without overfitting. However, the improvement requires a significant increase in processing time, as well as working memory.

Summary

In this chapter, we tackled the KDD Cup 2009 challenge on customer relationship prediction. We implemented the data pre-processing steps, addressing missing values and redundant attributes, and followed the winning KDD Cup solution, studying how to leverage ensemble methods with a basket of learning algorithms, which can significantly boost classification performance.

In the next chapter, we will tackle another problem addressing the customer behavior, that is, the analysis of purchasing behavior, where you will learn how to use algorithms that detect frequently occurring patterns.

Affinity Analysis

Affinity analysis is at the heart of market basket analysis (MBA). It can discover co-occurrence relationships among activities performed by specific users or groups. In retail, affinity analysis can help you understand the purchasing behavior of customers. These insights can drive revenue through smart cross-selling and upselling strategies and can assist you in developing loyalty programs, sales promotions, and discount plans.

In this chapter, we will look into the following topics:

• Market basket analysis • Association rule learning • Other applications in various domains

First, we will revise the core association rule learning concepts and algorithms, such as support, lift, Apriori algorithm, and FP-growth algorithm. Next, we will use Weka to perform our first affinity analysis on supermarket dataset and study how to interpret the resulting rules. We will conclude the chapter by analyzing how association rule learning can be applied in other domains, such as IT Operations Analytics, medicine, and others.

Market basket analysis

Since the introduction of electronic point of sale, retailers have been collecting an incredible amount of data. To leverage this data in order to produce business value, they first developed a way to consolidate and aggregate the data to understand the basics of the business. What are they selling? How many units are moving? What is the sales amount?


Recently, the focus shifted to the lowest level of granularity—the market basket transaction. At this level of detail, the retailers have direct visibility into the market basket of each customer who shopped at their store, understanding not only the quantity of the purchased items in that particular basket, but also how these items were bought in conjunction with each other. This can be used to drive decisions about how to differentiate store assortment and merchandise, as well as effectively combine offers of multiple products, within and across categories, to drive higher sales and profits. These decisions can be implemented across an entire retail chain, by channel, at the local store level, and even for a specific customer with so-called personalized marketing, where a unique product offering is made for each customer.

Questions in a shopping cart: in this shopping basket, the shopper placed a quart of orange juice, some bananas, dish detergent, some window cleaner, and a six-pack of soda. Is soda typically purchased with bananas? Does the brand of soda make a difference? How do the demographics of the neighborhood affect what customers buy? What should be in the basket but is not? Are window cleaning products purchased when detergent and orange juice are bought together?

MBA covers a wide variety of analyses:

• Item affinity: This defines the likelihood of two (or more) items being purchased together
• Identification of driver items: This enables the identification of the items that drive people to the store and always need to be in stock
• Trip classification: This analyzes the content of the basket and classifies the shopping trip into a category: weekly grocery trip, special occasion, and so on


• Store-to-store comparison: Understanding the number of baskets allows any metric to be divided by the total number of baskets, effectively creating a convenient and easy way to compare stores with different characteristics (units sold per customer, revenue per transaction, number of items per basket, and so on)
• Revenue optimization: This helps in determining the magic price points for this store, increasing the size and value of the market basket
• Marketing: This helps in identifying more profitable advertising and promotions, targeting offers more precisely in order to improve ROI, generating better loyalty card promotions with longitudinal analysis, and attracting more traffic to the store
• Operations optimization: This helps in matching the inventory to the requirement by customizing the store and assortment to trade area demographics, optimizing store layout

Predictive models help retailers to direct the right offer to the right customer segments/profiles, as well as gain understanding of what is valid for which customer, predict the probability score of customers responding to this offer, and understand the customer value gain from the offer acceptance.

Affinity analysis

Affinity analysis is used to determine the likelihood that a set of items will be bought together. In retail, there are natural product affinities; for example, it is very typical for people who buy hamburger patties to buy hamburger rolls, along with ketchup, mustard, tomatoes, and other items that make up the burger experience.

While some product affinities might seem trivial, there are affinities that are not very obvious. A classic example is toothpaste and tuna. It seems that people who eat tuna are more prone to brush their teeth right after finishing their meal. So, why is it important for retailers to get a good grasp of product affinities? This information is critical to appropriately plan promotions, as reducing the price of some items may cause a spike in related high-affinity items without the need to further promote these related items.

In the following section, we'll look into the algorithms for association rule learning: Apriori and FP-growth.

Association rule learning

Association rule learning has been a popular approach for discovering interesting relationships between items in large databases. It is most commonly applied in retail to reveal regularities between products.

Association rule learning approaches find patterns in the form of interesting strong rules in the database, using different measures of interestingness. For example, the following rule would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat: {onions, potatoes} -> {burger}

Another classic story probably told in every machine learning class is the beer and diaper story. An analysis of supermarket shoppers' behavior showed that customers, presumably young men, who buy diapers tend also to buy beer. It immediately became a popular example of how an unexpected association rule might be found from everyday data; however, there are varying opinions as to how much of the story is true. Daniel Powers says (DSS News, 2002):

"In 1992, Thomas Blischok, manager of a retail consulting group at Teradata, and his staff prepared an analysis of 1.2 million market baskets from about 25 Osco Drug stores. Database queries were developed to identify affinities. The analysis "did discover that between 5:00 and 7:00 p.m. consumers bought beer and diapers". Osco managers did NOT exploit the beer and diapers relationship by moving the products closer together on the shelves."

In addition to the preceding example from MBA, association rules are today employed in many application areas, including web usage mining, intrusion detection, continuous production, and bioinformatics. We'll take a closer look at these areas later in this chapter.

Basic concepts

Before we dive into the algorithms, let's first review the basic concepts.

Database of transactions

In association rule mining, the dataset is structured a bit differently from the approach presented in the first chapter. First, there is no class value, as this is not required for learning association rules. Next, the dataset is presented as a transactional table, where each supermarket item corresponds to a binary attribute. Hence, the feature vector can be extremely large.


Consider the following example. Suppose we have four receipts as shown in the following image. Each receipt corresponds to a purchasing transaction:

To write these receipts in the form of a transactional database, we first identify all the possible items that appear in the receipts. These items are onions, potatoes, burger, beer, and diapers. Each purchase, that is, transaction, is presented in a row, and there is a 1 if an item was purchased within the transaction and 0 otherwise, as shown in the following table:

Transaction ID   Onions   Potatoes   Burger   Beer   Diapers
1                0        1          1        0      0
2                1        1          1        1      0
3                0        0          0        1      1
4                1        0          1        1      0

This example is really small. In practical applications, the dataset often contains thousands or millions of transactions, which allows the learning algorithm to discover statistically significant patterns.

Itemset and rule

An itemset is simply a set of items, for example, {onions, potatoes, burger}. A rule consists of two itemsets, X and Y, in the following format: X -> Y.

This indicates a pattern that when the X itemset is observed, Y is also observed. To select interesting rules, various measures of significance can be used.

Support

Support, for an itemset, is defined as the proportion of transactions that contain the itemset. The {potatoes, burger} itemset in the previous table has support 0.5, as it occurs in 50% of the transactions (2 out of 4): supp({potatoes, burger}) = 2/4 = 0.5.

Intuitively, it indicates the share of transactions that support the pattern.

Confidence

The confidence of a rule indicates its accuracy. It is defined as conf(X -> Y) = supp(X ∪ Y) / supp(X). For example, the {onions, burger} -> {beer} rule has the confidence 0.5/0.5 = 1.0 in the previous table, which means that 100% of the time when onions and burger are bought together, beer is bought as well.
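Both measures are simple enough to compute by hand. The following standalone sketch is our own illustration (it does not use the Weka API); it recomputes the two numbers above directly from the four-transaction table:

public class SupportConfidence {
    // rows are transactions; columns: onions, potatoes, burger, beer, diapers
    static int[][] T = {
        {0, 1, 1, 0, 0},
        {1, 1, 1, 1, 0},
        {0, 0, 0, 1, 1},
        {1, 0, 1, 1, 0}
    };

    // proportion of transactions that contain every item in the itemset
    static double support(int... items) {
        int count = 0;
        for (int[] t : T) {
            boolean containsAll = true;
            for (int item : items) {
                if (t[item] == 0) { containsAll = false; break; }
            }
            if (containsAll) count++;
        }
        return (double) count / T.length;
    }

    public static void main(String[] args) {
        // supp({potatoes, burger}) = 2/4 = 0.5
        System.out.println(support(1, 2));
        // conf({onions, burger} -> {beer}) = supp({onions, burger, beer}) / supp({onions, burger}) = 1.0
        System.out.println(support(0, 2, 3) / support(0, 2));
    }
}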

Apriori algorithm

Apriori is a classic algorithm used for frequent pattern mining and association rule learning over transactional databases. By identifying the frequent individual items in a database and extending them to larger itemsets, Apriori can determine the association rules, which highlight general trends in the database.

Apriori algorithm constructs a set of itemsets, for example, itemset1= {Item A, Item B}, and calculates support, which counts the number of occurrences in the database. Apriori then uses a bottom-up approach, where frequent itemsets are extended, one item at a time, and it works by eliminating the largest sets as candidates by first looking at the smaller sets and recognizing that a large set cannot be frequent unless all its subsets are. The algorithm terminates when no further successful extensions are found.
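The observation that a large itemset cannot be frequent unless all of its subsets are frequent is what makes this pruning work. The following toy sketch is our own illustration of that single check, not Weka code:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AprioriPrune {
    // a candidate k-itemset can only be frequent if every (k-1)-subset
    // of it was already found frequent in the previous pass
    static boolean allSubsetsFrequent(Set<String> candidate, List<Set<String>> frequentPrev) {
        for (String item : candidate) {
            Set<String> subset = new HashSet<>(candidate);
            subset.remove(item);
            if (!frequentPrev.contains(subset)) {
                return false; // prune this candidate
            }
        }
        return true;
    }
}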

Although the Apriori algorithm is an important milestone in machine learning, it suffers from a number of inefficiencies and tradeoffs. In the following section, we'll look into the more recent FP-growth technique.

FP-growth algorithm

FP-growth, where FP stands for frequent pattern, represents the transaction database as a prefix tree. First, the algorithm counts the occurrences of items in the dataset. In the second pass, it builds a prefix tree, an ordered tree data structure commonly used to store strings. An example of a prefix tree based on the previous example is shown in the following diagram:

If many transactions share the most frequent items, the prefix tree provides high compression close to the tree root. Large itemsets are grown directly, instead of generating candidate items and testing them against the entire database. Growth starts at the bottom of the tree, by finding all the itemsets matching minimal support and confidence. Once the recursive process has completed, all large itemsets with minimum coverage have been found and association rule creation begins.

The FP-growth algorithm has several advantages. First, it constructs an FP-tree, which encodes the original dataset in a substantially more compact representation. Second, it efficiently builds frequent itemsets, leveraging the FP-tree structure and a divide-and-conquer strategy.

The supermarket dataset

The supermarket dataset, located in datasets/chap5/supermarket.arff, describes the shopping habits of supermarket customers. Most of the attributes stand for a particular item group, for example, dairy foods, beef, or potatoes; or a department, for example, department 79, department 81, and so on. The value is t if the customer bought an item and missing otherwise. There is one instance per customer. The dataset contains no class attribute, as this is not required to learn association rules. A sample of the data is shown in the following table:

Discover patterns

To discover shopping patterns, we will use the two algorithms that we looked at earlier, Apriori and FP-growth.

Apriori

We will use the Apriori algorithm as implemented in Weka. It iteratively reduces the minimum support until it finds the required number of rules with the given minimum confidence:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;
import weka.associations.Apriori;


First, we will load the supermarket dataset:

Instances data = new Instances(
    new BufferedReader(
        new FileReader("datasets/chap5/supermarket.arff")));

Next, we will initialize an Apriori instance and call the buildAssociations(Instances) function to start frequent pattern mining, as follows:

Apriori model = new Apriori();
model.buildAssociations(data);

Finally, we can output the discovered itemsets and rules, as shown in the following code:

System.out.println(model);

The output is as follows:

Apriori
======

Minimum support: 0.15 (694 instances)
Minimum metric : 0.9
Number of cycles performed: 17

Generated sets of large itemsets:
Size of set of large itemsets L(1): 44
Size of set of large itemsets L(2): 380
Size of set of large itemsets L(3): 910
Size of set of large itemsets L(4): 633
Size of set of large itemsets L(5): 105
Size of set of large itemsets L(6): 1

Best rules found:

1. biscuits=t frozen foods=t fruit=t total=high 788 ==> bread and cake=t 723    lift:(1.27) lev:(0.03) [155] conv:(3.35)
2. baking needs=t biscuits=t fruit=t total=high 760 ==> bread and cake=t 696    lift:(1.27) lev:(0.03) [149] conv:(3.28)
3. baking needs=t frozen foods=t fruit=t total=high 770 ==> bread and cake=t 705    lift:(1.27) lev:(0.03) [150] conv:(3.27)
...


The algorithm outputs the ten best rules according to confidence. Let's look at the first rule and interpret the output, as follows:

biscuits=t frozen foods=t fruit=t total=high 788 ==> bread and cake=t 723    lift:(1.27) lev:(0.03) [155] conv:(3.35)

It says that when biscuits, frozen foods, and fruit are bought together and the total purchase price is high, it is also very likely that bread and cake are purchased as well. The {biscuits, frozen foods, fruit, total high} itemset appears in 788 transactions, and bread and cake appear together with it in 723 of those transactions. The confidence of this rule is 0.92 (723/788), meaning that the rule holds true in 92% of transactions where the {biscuits, frozen foods, fruit, total high} itemset is present.

The output also reports additional measures such as lift, leverage, and conviction, which estimate how strong a rule is compared to our initial assumptions. For example, the 3.35 conviction value indicates that the rule would be incorrect 3.35 times as often if the association were purely random chance. Lift measures how many times more often X and Y occur together than would be expected if they were statistically independent (lift = 1 means independence). For example, the 1.27 lift in the first rule means that the antecedent and consequent items appear together 1.27 times more often than expected by chance.
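If you need more or stricter rules, the Apriori defaults can be adjusted before mining. The setter names in the following sketch are believed to match Weka's weka.associations.Apriori API, but verify them against the Javadoc of the Weka version you are using:

// A sketch of tuning Apriori before calling buildAssociations(); 'data' is the
// Instances object loaded earlier. Check these setters against your Weka version.
Apriori tunedModel = new Apriori();
tunedModel.setNumRules(20);               // report 20 rules instead of the default 10
tunedModel.setMinMetric(0.8);             // lower the minimum confidence to 0.8
tunedModel.setLowerBoundMinSupport(0.1);  // stop reducing support at 10%
tunedModel.buildAssociations(data);
System.out.println(tunedModel);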

FP-growth

Now, let's try to get the same results with the more efficient FP-growth algorithm. FP-growth is also implemented in the weka.associations package:

import weka.associations.FPGrowth;

FP-growth is initialized similarly to Apriori:

FPGrowth fpgModel = new FPGrowth();
fpgModel.buildAssociations(data);
System.out.println(fpgModel);

The output reveals that FP-growth discovered 16 rules:

FPGrowth found 16 rules (displaying top 10)

1. [fruit=t, frozen foods=t, biscuits=t, total=high]: 788 ==> [bread and cake=t]: 723    lift:(1.27) lev:(0.03) conv:(3.35)
2. [fruit=t, baking needs=t, biscuits=t, total=high]: 760 ==> [bread and cake=t]: 696    lift:(1.27) lev:(0.03) conv:(3.28)
...

We can observe that FP-growth found the same set of rules as Apriori; however, the time required to process larger datasets can be significantly shorter.
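To verify the speed difference on your own data, a simple wall-clock comparison such as the following sketch is usually enough; absolute timings will, of course, depend on the dataset size and hardware:

// Rough wall-clock comparison of the two algorithms on the same Instances object
long start = System.currentTimeMillis();
Apriori apriori = new Apriori();
apriori.buildAssociations(data);
System.out.println("Apriori:   " + (System.currentTimeMillis() - start) + " ms");

start = System.currentTimeMillis();
FPGrowth fpGrowth = new FPGrowth();
fpGrowth.buildAssociations(data);
System.out.println("FP-growth: " + (System.currentTimeMillis() - start) + " ms");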

Other applications in various areas

We looked into affinity analysis to demystify shopping behavior patterns in supermarkets. Although the roots of association rule learning are in analyzing point-of-sale transactions, the technique can be applied outside the retail industry to find relationships among other types of baskets. The notion of a basket can easily be extended to services and products, for example, to analyze items purchased using a credit card, such as rental cars and hotel rooms, or to analyze value-added services purchased by telecom customers (call waiting, call forwarding, DSL, speed call, and so on), which can help operators determine how to improve their bundling of service packages.

Additionally, we will look into the following examples of potential cross-industry applications:

• Medical diagnosis
• Protein sequences
• Census data
• Customer relationship management
• IT Operations Analytics

Medical diagnosis

Association rules can be applied in medical diagnosis to assist physicians in treating patients. The general problem of inducing reliable diagnostic rules is hard, as, theoretically, no induction process can guarantee the correctness of induced hypotheses by itself. In practice, diagnosis is not an easy process, as it involves unreliable diagnostic tests and the presence of noise in training examples.

Nevertheless, association rules can be used to identify likely symptoms appearing together. A transaction, in this case, corresponds to a medical case, while symptoms correspond to items. When a patient is treated, a list of symptoms is recorded as one transaction.

Protein sequences

A lot of research has gone into understanding the composition and nature of proteins; yet many things remain to be understood satisfactorily. It is now generally believed that amino-acid sequences of proteins are not random.


With association rules, it is possible to identify associations between different amino acids that are present in a protein. A protein is a sequence made up of 20 types of amino acids. Each protein has a unique three-dimensional structure, which depends on the amino-acid sequence; a slight change in the sequence may change the functioning of the protein. To apply association rules, a protein corresponds to a transaction, while amino acids and their structure correspond to the items.

Such association rules are desirable for enhancing our understanding of protein composition and hold the potential to give clues regarding the global interactions amongst some particular sets of amino acids occurring in the proteins. Knowledge of these association rules or constraints is highly desirable for synthesis of artificial proteins.

Census data

Censuses make a huge variety of general statistical information about society available to both researchers and the general public. The information related to population and economic censuses can be used in planning public services (education, health, transport, and funds) as well as in business (for setting up new factories, shopping malls, or banks, and even for marketing particular products).

To discover frequent patterns, each statistical area (for example, municipality, city, and neighborhood) corresponds to a transaction, and the collected indicators correspond to the items.

Customer relationship management

Customer relationship management (CRM), as we briefly discussed in the previous chapters, is a rich source of data through which companies hope to identify the preferences of different customer groups for products and services in order to enhance the cohesion between their products and services and their customers.

Association rules can reinforce the knowledge management process and allow the marketing personnel to know their customers well in order to provide better quality services. For example, association rules can be applied to detect a change of customer behavior at different time snapshots from customer profiles and sales data. The basic idea is to discover changes from two datasets and generate rules from each dataset to carry out rule matching.

IT Operations Analytics

Because it is based on records of a large number of transactions, association rule learning is well suited to the data that is routinely collected in day-to-day IT operations, enabling IT Operations Analytics tools to detect frequent patterns and identify critical changes. IT specialists need to see the big picture and understand, for example, how a problem on a database could impact an application server.

For a specific day, IT operations may take in a variety of alerts, presenting them in a transactional database. Using an association rule learning algorithm, IT Operations Analytics tools can correlate and detect the frequent patterns of alerts appearing together. This can lead to a better understanding about how a component impacts another.

With identified alert patterns, it is possible to apply predictive analytics. For example, suppose a particular database server hosts a web application and an alert about the database is suddenly triggered. If the frequent patterns identified by an association rule learning algorithm show that such database alerts are typically followed by application alerts, the IT staff knows to take action before the web application is impacted.

Association rule learning can also discover alert events originating from the same IT event. For example, every time a new user is added, six changes in the Windows operating system are detected. Next, in Application Portfolio Management (APM), IT may face multiple alerts showing that the transaction time in a database is high. If all these issues originate from the same source (such as getting hundreds of alerts about changes that are all due to a Windows update), frequent pattern mining can help to quickly cut through the mass of alerts, allowing the IT operators to focus on truly critical changes.

Summary

In this chapter, you learned how to leverage association rule learning on transactional datasets to gain insight into frequent patterns. We performed an affinity analysis in Weka and learned that the hard work lies in the analysis of results: careful attention is required when interpreting rules, as association (that is, correlation) is not the same as causation.

In the next chapter, we'll look at how to take the problem of item recommendation to the next level using a scalable machine learning library, Apache Mahout, which is able to handle big data.


Recommendation Engine with Apache Mahout

Recommendation engines are probably one of the most applied data science approaches in startups today. There are two principal techniques for building a recommendation system: content-based filtering and collaborative filtering. The content-based algorithm uses the properties of the items to find items with similar properties. Collaborative filtering algorithms take user ratings or other user behavior and make recommendations based on what users with similar behavior liked or purchased.

This chapter will first explain the basic concepts required to understand recommendation engine principles and then demonstrate how to utilize Apache Mahout's implementation of various algorithms to quickly get a scalable recommendation engine. This chapter will cover the following topics:

• How to build a recommendation engine
• Getting Apache Mahout ready
• Content-based approach
• Collaborative filtering approach

By the end of the chapter, you will know which kind of recommendation engine is appropriate for a particular problem and how to quickly implement one.

Basic concepts

Recommendation engines aim to show users items of interest. What makes them different from search engines is that the relevant content usually appears on a website without being requested, and users don't have to build queries, as recommendation engines observe users' actions and construct the queries for them.


Arguably, the most well-known example of a recommendation engine is www.amazon.com, which provides personalized recommendations in a number of ways. The following image shows an example of Customers Who Bought This Item Also Bought. As we will see later, this is an example of collaborative item-based recommendation, where items similar to a particular item are recommended:

An example of recommendation engine from www.amazon.com.

In this section, we will introduce key concepts related to understanding and building recommendation engines.

Key concepts

A recommendation engine requires the following four inputs to make recommendations:

• Item information described with attributes
• User profile, such as age range, gender, location, friends, and so on
• User interactions in the form of ratings, browsing, tagging, comparing, saving, and emailing
• Context where the items will be displayed, for example, item category and the item's geographical location

These inputs are then combined together by the recommendation engine to help us answer the following questions:

• Users who bought, watched, viewed, or bookmarked this item also bought, watched, viewed, or bookmarked…
• Items similar to this item…


• Other users you may know…
• Other users who are similar to you…

Now let's have a closer look at how this combining works.

User-based and item-based analysis Building a recommendation engine depends on whether the engine searches for related items or users when trying to recommend a particular item.

In item-based analysis, the engine focuses on identifying items that are similar to a particular item; while in user-based analysis, users similar to the particular user are first determined. For example, users with the same profile information (age, gender, and so on) or actions history (bought, watched, viewed, and so on) are determined and then the same items are recommended to other similar users.

Both approaches require us to compute a similarity matrix, depending on whether we're analyzing item attributes or user actions. Let's take a deeper look at how this is done.

Approaches to calculate similarity There are three fundamental approaches to calculate similarity, as follows:

• Collaborative filtering algorithms take user ratings or other user behavior and make recommendations based on what users with similar behavior liked or purchased
• The content-based algorithm uses the properties of the items to find items with similar properties
• A hybrid approach combining collaborative and content-based filtering

Let's take a look at each approach in detail.

Collaborative filtering Collaborative filtering is based solely on user ratings or other user behavior, making recommendations based on what users with similar behavior liked or purchased.

A key advantage of collaborative filtering is that it does not rely on item content, and therefore, it is capable of accurately recommending complex items such as movies, without understanding the item itself. The underlying assumption is that people who agreed in the past will agree in the future, and that they will like similar kinds of items as they liked in the past.


A major disadvantage of this approach is the so-called cold start, meaning that if we want to build an accurate collaborative filtering system, the algorithm often needs a large amount of user ratings. This usually takes collaborative filtering out of the first version of the product and it is introduced later when a decent amount of data is collected.

Content-based filtering

Content-based filtering, on the other hand, is based on a description of items and a profile of the user's preferences, combined as follows. First, the items are described with attributes, and to find similar items, we measure the distance between items using a distance measure such as cosine distance or the Pearson coefficient (more about distance measures can be found in Chapter 1, Applied Machine Learning Quick Start). Now, the user profile enters the equation. Given feedback about the kind of items the user likes, we can introduce weights specifying the importance of a specific item attribute. For instance, the Pandora Radio streaming service applies content-based filtering to create stations using more than 400 attributes. A user initially picks a song with specific attributes, and by providing feedback, important song attributes are emphasized.

This approach initially needs very little information on user feedback, thus it effectively avoids the cold-start issue.
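As a concrete illustration of such a distance measure, the following self-contained sketch computes the cosine similarity between two item attribute vectors; the vectors and their interpretation as genre weights are made up for illustration:

public class CosineSimilarityExample {

    // Cosine similarity between two attribute vectors: dot(a, b) / (|a| * |b|)
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical attribute vectors of two items, e.g. genre weights of two songs
        double[] item1 = {1.0, 0.0, 0.5, 0.2};
        double[] item2 = {0.9, 0.1, 0.4, 0.0};
        System.out.println("similarity = " + cosine(item1, item2)); // close to 1.0
    }
}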

Hybrid approach

So, should you choose collaborative or content-based filtering? Collaborative filtering is able to learn user preferences from users' actions regarding one content source and use them across other content types. Content-based filtering is limited to recommending content of the same type that the user is already using. This matters in certain use cases; for example, recommending news articles based on news browsing is useful, but it is much more useful if different sources, such as books and movies, can be recommended based on news browsing.

Collaborative filtering and content-based filtering are not mutually exclusive; they can be combined to be more effective in some cases. For example, Netflix uses collaborative filtering to analyze searching and watching patterns of similar users, as well as content-based filtering to offer movies that share characteristics with films that the user has rated highly.


There is a wide variety of hybridization techniques such as weighted, switching, mixed, feature combination, feature augmentation, cascade, meta-level, and so on. Recommendation systems are an active area in machine learning and data mining community with special tracks on data science conferences. A good overview of techniques is summarized in the paper Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions by Adomavicius and Tuzhilin (2005), where the authors discuss different approaches and underlying algorithms and provide references to further papers. To get more technical and understand all the tiny details when a particular approach makes sense, you should look at the book edited by Ricci et al. (2010) Recommender Systems Handbook (1st ed.), Springer-Verlag New York.

Exploitation versus exploration

In recommendation systems, there is always a tradeoff between recommending items that fall into the user's sweet spot based on what we already know about the user (exploitation) and recommending items outside the user's sweet spot with the aim of exposing the user to some novelties (exploration). Recommendation systems with little exploration will only recommend items consistent with the previous user ratings, preventing the user from seeing items outside their current bubble. In practice, the serendipity of getting new items out of the user's sweet spot is often desirable, leading to pleasant surprises and the potential discovery of new sweet spots.

In this section, we discussed the essential concepts required to start building recommendation engines. Now, let's take a look at how to actually build one with Apache Mahout.

Getting Apache Mahout Mahout was introduced in Chapter 2, Java Tools and Libraries for Machine Learning, as a scalable machine learning library. It provides a rich set of components with which you can construct a customized recommendation system from a selection of algorithms. The creators of Mahout say it is designed to be enterprise-ready; it's designed for performance, scalability, and flexibility.

Mahout can be configured to run in two flavors: with or without Hadoop for a single machine and distributed processing, correspondingly. We will focus on configuring Mahout without Hadoop. For more advanced configurations and further uses of Mahout, I would recommend two recent books: Learning Apache Mahout (Tiwary, 2015) and Learning Apache Mahout Classification (Gupta, 2015).


As Apache Mahout's build and release system is based on Maven, we will need to learn how to install it. We will look at the most convenient approach using Eclipse with Maven plugin.

Configuring Mahout in Eclipse with the Maven plugin We will need a recent version of Eclipse, which can be downloaded from its home page. We use Eclipse Luna in this book. Open Eclipse and start a new Maven Project with default settings as shown in the following screenshot:

The New Maven project screen will appear as shown in the following image:


Now, we need to tell the project to add Mahout jar and its dependencies to the project. Locate the pom.xml file and open it with the text editor (left click on Open With | Text Editor), as shown in the following screenshot:


Locate the <dependencies> section and add the following code inside it:

<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-mr</artifactId>
  <version>0.10.0</version>
</dependency>

That's it, Mahout is added and we are ready to begin now.

Building a recommendation engine To demonstrate both the content-based filtering and collaborative filtering approaches, we'll build a book-recommendation engine.

Book ratings dataset In this chapter, we will work with book ratings dataset (Ziegler et al, 2005) collected in a four-week crawl. It contains data on 278,858 members of the Book-Crossing website and 1,157,112 ratings, both implicit and explicit, referring to 271,379 distinct ISBNs. User data is anonymized, but with demographic information. The dataset is available at: http://www2.informatik.uni-freiburg.de/~cziegler/BX/.

The Book-Crossing dataset comprises three files described at their website as follows:

• BX-Users: This contains the users. Note that user IDs (User-ID) have been anonymized and mapped to integers. Demographic data is provided (Location and Age) if available. Otherwise, these fields contain NULL values.
• BX-Books: Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, and Publisher), obtained from Amazon Web Services. Note that in the case of several authors, only the first author is provided. URLs linking to cover images are also given, appearing in three different flavors (Image-URL-S, Image-URL-M, and Image-URL-L), that is, small, medium, and large. These URLs point to the Amazon website.
• BX-Book-Ratings: This contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale of 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.

Loading the data

There are two approaches to loading the data, depending on where it is stored: in a file or in a database. First, we will take a detailed look at how to load the data from a file, including how to deal with custom formats. At the end, we will quickly look at how to load the data from a database.

Loading data from file

Loading data from a file can be achieved with the FileDataModel class, which expects a comma-delimited file, where each line contains a userID, an itemID, an optional preference value, and an optional timestamp, in this order, as follows:

userID,itemID[,preference[,timestamp]]

The optional preference value accommodates applications with binary preferences, that is, the user either expresses a preference for an item or not, without a degree of preference, for example, with like/dislike.

A line that begins with hash, #, or an empty line will be ignored. It is also acceptable for the lines to contain additional fields, which will be ignored.

The DataModel class assumes the following types:

• userID, itemID can be parsed as long
• preference value can be parsed as double
• timestamp can be parsed as long

If you are able to provide the dataset in the preceding format, you can simply use the following line to load the data:

DataModel model = new FileDataModel(new File(path));

This class is not intended to be used for very large amounts of data, for example, tens of millions of rows. For that, a JDBC-backed DataModel and a database are more appropriate.

In the real world, however, we cannot always ensure that the input data supplied to us contains only integer values for userID and itemID. For example, in our case, itemID corresponds to ISBN book numbers that uniquely identify items, but these are not integers, so the FileDataModel default won't be suitable to process our data.


Now, let's consider how to deal with the case where our itemID is a string. We will define our custom data model by extending FileDataModel and overriding the long readItemIDFromString(String) method in order to read the itemID as a string, convert it into a unique long value, and return that value. To convert a String to a unique long, we'll extend another Mahout helper class, AbstractIDMigrator, which is designed exactly for this task.

Now, let's first look at how FileDataModel is extended:

class StringItemIdFileDataModel extends FileDataModel {

    // initialize the migrator to convert String to unique long
    public ItemMemIDMigrator memIdMigtr;

    public StringItemIdFileDataModel(File dataFile, String regex)
            throws IOException {
        super(dataFile, regex);
    }

    @Override
    protected long readItemIDFromString(String value) {
        if (memIdMigtr == null) {
            memIdMigtr = new ItemMemIDMigrator();
        }
        // convert to long
        long retValue = memIdMigtr.toLongID(value);
        // store it to the cache
        if (null == memIdMigtr.toStringID(retValue)) {
            try {
                memIdMigtr.singleInit(value);
            } catch (TasteException e) {
                e.printStackTrace();
            }
        }
        return retValue;
    }

    // convert long back to String
    String getItemIDAsString(long itemId) {
        return memIdMigtr.toStringID(itemId);
    }
}


Other useful methods that can be overridden are as follows:

• readUserIDFromString(String value) if user IDs are not numeric
• readTimestampFromString(String value) to change how the timestamp is parsed (see the sketch below)
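For instance, if the input file stored timestamps as dates rather than as long values, readTimestampFromString could be overridden inside our custom data model along the following lines; the yyyy-MM-dd format is an assumption made purely for illustration:

// A sketch of overriding readTimestampFromString in StringItemIdFileDataModel,
// assuming the file stores timestamps as dates in the yyyy-MM-dd format
@Override
protected long readTimestampFromString(String dateString) {
    try {
        return new java.text.SimpleDateFormat("yyyy-MM-dd")
            .parse(dateString).getTime();
    } catch (java.text.ParseException e) {
        throw new IllegalArgumentException("Cannot parse timestamp: " + dateString, e);
    }
}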

Now, let's take a look at how AbstractIDMigrator is extended:

class ItemMemIDMigrator extends AbstractIDMigrator {

    private FastByIDMap<String> longToString;

    public ItemMemIDMigrator() {
        this.longToString = new FastByIDMap<String>(10000);
    }

    public void storeMapping(long longID, String stringID) {
        longToString.put(longID, stringID);
    }

    public void singleInit(String stringID) throws TasteException {
        storeMapping(toLongID(stringID), stringID);
    }

    public String toStringID(long longID) {
        return longToString.get(longID);
    }
}

Now, we have everything in place and we can load our dataset with the following code:

StringItemIdFileDataModel model = new StringItemIdFileDataModel(
    new File("datasets/chap6/BX-Book-Ratings.csv"), ";");
System.out.println(
    "Total items: " + model.getNumItems() +
    "\nTotal users: " + model.getNumUsers());

This outputs the total number of users and items:

Total items: 340556
Total users: 105283

We are ready to move on and start making recommendations.

Loading data from database

Alternatively, we can also load the data from a database using one of the JDBC data models. In this chapter, we will not dive into detailed instructions about how to set up the database, the connection, and so on, but just give a sketch of how this can be done.

Database connectors have been moved to a separate package, mahout-integration, hence we first have to add the package to our dependency list. Open the pom.xml file and add the following dependency:

<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-integration</artifactId>
  <version>0.7</version>
</dependency>

Consider that we want to connect to a MySQL database. In this case, we will also need a package that handles database connections. Add the following to the pom.xml file:

<dependency>
  <groupId>mysql</groupId>
  <artifactId>mysql-connector-java</artifactId>
  <version>5.1.35</version>
</dependency>

Now, we have all the packages, so we can create a connection. First, let's initialize a DataSource class with connection details, as follows:

MysqlDataSource dbsource = new MysqlDataSource();
dbsource.setUser("user");
dbsource.setPassword("pass");
dbsource.setServerName("hostname.com");
dbsource.setDatabaseName("db");

Mahout integration implements JDBCDataModel for various databases that can be accessed via JDBC. By default, this class assumes that there is a DataSource available under the JNDI name jdbc/taste, which gives access to a database with a taste_preferences table with the following schema:

CREATE TABLE taste_preferences (
  user_id BIGINT NOT NULL,
  item_id BIGINT NOT NULL,
  preference REAL NOT NULL,
  PRIMARY KEY (user_id, item_id)
)


CREATE INDEX taste_preferences_user_id_index ON taste_preferences (user_id);
CREATE INDEX taste_preferences_item_id_index ON taste_preferences (item_id);

A database-backed data model is initialized as follows. In addition to the DB connection object, we can also specify the custom table name and table column names, as follows:

DataModel dataModel = new MySQLJDBCDataModel(dbsource, "taste_preferences",
    "user_id", "item_id", "preference", "timestamp");

In-memory database

Last but not least, the data model can be created on the fly and held in memory. It can be created from an array of preferences holding user ratings for a set of items.

We can proceed as follows. First, we create a FastByIDMap hash map of preference arrays, PreferenceArray, which stores an array of preferences:

FastByIDMap<PreferenceArray> preferences = new FastByIDMap<PreferenceArray>();

Next, we can create a new preference array for a user that will hold their ratings. The array must be initialized with a size parameter that reserves that many slots in memory:

PreferenceArray prefsForUser1 = new GenericUserPreferenceArray (10);

Next, we set user ID for current preference at position 0. This will actually set the user ID for all preferences:

prefsForUser1.setUserID (0, 1L);

Set item ID for current preference at position 0:

prefsForUser1.setItemID (0, 101L);

Set preference value for preference at 0:

prefsForUser1.setValue (0, 3.0f);


Continue for other item ratings:

prefsForUser1.setItemID (1, 102L); prefsForUser1.setValue (1, 4.5F);

Finally, add user preferences to the hash map:

preferences.put (1L, prefsForUser1); // use userID as the key

The preference hash map can be now used to initialize GenericDataModel:

DataModel dataModel = new GenericDataModel(preferences);

This code demonstrates how to add two preferences for a single user; in a practical application, you'll want to add multiple preferences for multiple users, as sketched below.
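Continuing the preceding snippet, adding a couple more users could look like the following sketch; the user IDs, item IDs, and ratings are made up for illustration:

// Add a second and a third user to the same 'preferences' map as before
PreferenceArray prefsForUser2 = new GenericUserPreferenceArray(2);
prefsForUser2.setUserID(0, 2L);
prefsForUser2.setItemID(0, 101L);
prefsForUser2.setValue(0, 4.0f);
prefsForUser2.setItemID(1, 103L);
prefsForUser2.setValue(1, 2.0f);
preferences.put(2L, prefsForUser2);

PreferenceArray prefsForUser3 = new GenericUserPreferenceArray(1);
prefsForUser3.setUserID(0, 3L);
prefsForUser3.setItemID(0, 102L);
prefsForUser3.setValue(0, 5.0f);
preferences.put(3L, prefsForUser3);

// The resulting model now holds ratings from three users
DataModel inMemoryModel = new GenericDataModel(preferences);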

Collaborative filtering Recommendation engines in Mahout can be built with the org.apache.mahout.cf.taste package, which was formerly a separate project called Taste and has continued development in Mahout.

A Mahout-based collaborative filtering engine takes the users' preferences for items (tastes) and returns the estimated preferences for other items. For example, a site that sells books or CDs could easily use Mahout to figure out the CDs that a customer might be interested in listening to with the help of the previous purchase data.

Top-level packages define the Mahout interfaces to the following key abstractions:

• DataModel: This represents a repository of information about users and their preferences for items
• UserSimilarity: This defines a notion of similarity between two users
• ItemSimilarity: This defines a notion of similarity between two items
• UserNeighborhood: This computes neighborhood users for a given user
• Recommender: This recommends items for a user


A general structure of the concepts is shown in the following diagram:

User-based filtering The most basic user-based collaborative filtering can be implemented by initializing the previously described components as follows.

First, load the data model:

StringItemIdFileDataModel model = new StringItemIdFileDataModel(
    new File("datasets/chap6/BX-Book-Ratings.csv"), ";");


Next, define how to calculate how the users are correlated, for example, using Pearson correlation:

UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

Next, define how to tell which users are similar, that is, users that are close to each other according to their ratings:

UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model);

Now, we can initialize a GenericUserBasedRecommender default engine with the data model, neighborhood, and similarity objects, as follows:

UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

That's it. Our first basic recommendation engine is ready. Let's discuss how to invoke recommendations. First, let's print the items that the user already rated along with ten recommendations for this user:

long userID = 80683;
int noItems = 10;

List<RecommendedItem> recommendations = recommender.recommend(
    userID, noItems);

System.out.println("Rated items by user:");
for (Preference preference : model.getPreferencesFromUser(userID)) {
    // convert the long itemID back to ISBN
    String itemISBN = model.getItemIDAsString(preference.getItemID());
    System.out.println("Item: " + books.get(itemISBN) + " | Item id: "
        + itemISBN + " | Value: " + preference.getValue());
}

System.out.println("\nRecommended items:");
for (RecommendedItem item : recommendations) {
    String itemISBN = model.getItemIDAsString(item.getItemID());
    System.out.println("Item: " + books.get(itemISBN) + " | Item id: "
        + itemISBN + " | Value: " + item.getValue());
}


This outputs the following recommendations along with their scores:

Rated items:
Item: The Handmaid's Tale | Item id: 0395404258 | Value: 0.0
Item: Get Clark Smart : The Ultimate Guide for the Savvy Consumer | Item id: 1563526298 | Value: 9.0
Item: Plum Island | Item id: 0446605409 | Value: 0.0
Item: Blessings | Item id: 0440206529 | Value: 0.0
Item: Edgar Cayce on the Akashic Records: The Book of Life | Item id: 0876044011 | Value: 0.0
Item: Winter Moon | Item id: 0345386108 | Value: 6.0
Item: Sarah Bishop | Item id: 059032120X | Value: 0.0
Item: Case of Lucy Bending | Item id: 0425060772 | Value: 0.0
Item: A Desert of Pure Feeling (Vintage Contemporaries) | Item id: 0679752714 | Value: 0.0
Item: White Abacus | Item id: 0380796155 | Value: 5.0
Item: The Land of Laughs : A Novel | Item id: 0312873115 | Value: 0.0
Item: Nobody's Son | Item id: 0152022597 | Value: 0.0
Item: Mirror Image | Item id: 0446353957 | Value: 0.0
Item: All I Really Need to Know | Item id: 080410526X | Value: 0.0
Item: Dreamcatcher | Item id: 0743211383 | Value: 7.0
Item: Perplexing Lateral Thinking Puzzles: Scholastic Edition | Item id: 0806917695 | Value: 5.0
Item: Obsidian Butterfly | Item id: 0441007813 | Value: 0.0

Recommended items:
Item: Keeper of the Heart | Item id: 0380774933 | Value: 10.0
Item: Bleachers | Item id: 0385511612 | Value: 10.0
Item: Salem's Lot | Item id: 0451125452 | Value: 10.0
Item: The Girl Who Loved Tom Gordon | Item id: 0671042858 | Value: 10.0
Item: Mind Prey | Item id: 0425152898 | Value: 10.0
Item: It Came From The Far Side | Item id: 0836220730 | Value: 10.0
Item: Faith of the Fallen (Sword of Truth, Book 6) | Item id: 081257639X | Value: 10.0
Item: The Talisman | Item id: 0345444884 | Value: 9.86375
Item: Hamlet | Item id: 067172262X | Value: 9.708363
Item: Untamed | Item id: 0380769530 | Value: 9.708363

Item-based filtering

The ItemSimilarity component is the most important point to discuss here. Item-based recommenders are useful as they can take advantage of something very fast: they base their computations on item similarity, not user similarity, and item similarity is relatively static. It can be precomputed, instead of recomputed in real time.

Thus, it's strongly recommended that you use GenericItemSimilarity with precomputed similarities if you're going to use this class. You can use PearsonCorrelationSimilarity too, which computes similarities in real time, but you will probably find this painfully slow for large amounts of data:

StringItemIdFileDataModel model = new StringItemIdFileDataModel( new File("datasets/chap6/BX-Book-Ratings.csv"), ";");

ItemSimilarity itemSimilarity = new PearsonCorrelationSimilarity(model);

ItemBasedRecommender recommender = new GenericItemBasedRecommender(model, itemSimilarity);

String itemISBN = "0395272238";
long itemID = model.readItemIDFromString(itemISBN);
int noItems = 10;
List<RecommendedItem> recommendations =
    recommender.mostSimilarItems(itemID, noItems);

System.out.println("Recommendations for item: "+books.get(itemISBN));

System.out.println("\nMost similar items:");
for (RecommendedItem item : recommendations) {
    itemISBN = model.getItemIDAsString(item.getItemID());
    System.out.println("Item: " + books.get(itemISBN) + " | Item id: "
        + itemISBN + " | Value: " + item.getValue());
}

Recommendations for item: Close to the Bone

Most similar items:
Item: Private Screening | Item id: 0345311396 | Value: 1.0
Item: Heartstone | Item id: 0553569783 | Value: 1.0
Item: Clockers / Movie Tie In | Item id: 0380720817 | Value: 1.0
Item: Rules of Prey | Item id: 0425121631 | Value: 1.0
Item: The Next President | Item id: 0553576666 | Value: 1.0
Item: Orchid Beach (Holly Barker Novels (Paperback)) | Item id: 0061013412 | Value: 1.0


Item: Winter Prey | Item id: 0425141233 | Value: 1.0
Item: Night Prey | Item id: 0425146413 | Value: 1.0
Item: Presumed Innocent | Item id: 0446359866 | Value: 1.0
Item: Dirty Work (Stone Barrington Novels (Paperback)) | Item id: 0451210158 | Value: 1.0

The resulting list returns a set of items similar to the particular item that we selected.
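To avoid the slow, real-time Pearson computation mentioned above, the item-item similarities can be precomputed once and cached. The GenericItemSimilarity constructor used in the following sketch is believed to accept a source similarity, the data model, and the number of similar items to keep per item; check the Mahout Javadoc of your version before relying on it:

// A sketch of precomputing item-item similarities and reusing them
ItemSimilarity slowSimilarity = new PearsonCorrelationSimilarity(model);
// Precompute and keep the 50 most similar items per item (assumed constructor)
ItemSimilarity precomputed = new GenericItemSimilarity(slowSimilarity, model, 50);

ItemBasedRecommender fastRecommender =
    new GenericItemBasedRecommender(model, precomputed);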

Adding custom rules to recommendations

It often happens that some business rules require us to boost the score of selected items. In the book dataset, for example, if a book is recent, we want to give it a higher score. That's possible by implementing the IDRescorer interface, which provides the following methods:

• rescore(long, double), which takes the itemId and the original score as arguments and returns a modified score
• isFiltered(long), which may return true to exclude a specific item from the recommendation or false otherwise

Our example could be implemented as follows:

class MyRescorer implements IDRescorer {

    public double rescore(long itemId, double originalScore) {
        double newScore = originalScore;
        if (bookIsNew(itemId)) {   // bookIsNew() is a helper defined elsewhere
            newScore *= 1.3;       // boost recent books by 30%
        }
        return newScore;
    }

    public boolean isFiltered(long arg0) {
        return false;
    }
}

An instance of IDRescorer is provided when invoking recommender.recommend:

IDRescorer rescorer = new MyRescorer();
List<RecommendedItem> recommendations =
    recommender.recommend(userID, noItems, rescorer);

Evaluation

You might wonder how to make sure that the returned recommendations make any sense. The only way to be really sure about how effective recommendations are is to use A/B testing in a live system with real users. For example, the A group receives a random item as a recommendation, while the B group receives an item recommended by our engine.

As this is neither always possible nor practical, we can get an estimate with offline statistical evaluation. One way to proceed is to use the k-fold cross validation introduced in Chapter 1, Applied Machine Learning Quick Start. We partition the dataset into multiple sets; some are used to train our recommendation engine and the rest are used to test how well it recommends items to unknown users.

Mahout provides the RecommenderEvaluator abstraction, which splits a dataset in two parts. The first part, 90% by default, is used to produce recommendations, while the rest of the data is compared against estimated preference values to test the match. The evaluator does not accept a recommender object directly; you need to build a class implementing the RecommenderBuilder interface instead, which builds a recommender object for a given DataModel object that is then used for testing. Let's take a look at how this is implemented.

First, we create a class that implements the RecommenderBuilder interface. We need to implement the buildRecommender method, which will return a recommender, as follows:

public class BookRecommender implements RecommenderBuilder {
    public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        UserSimilarity similarity =
            new PearsonCorrelationSimilarity(dataModel);
        UserNeighborhood neighborhood =
            new ThresholdUserNeighborhood(0.1, similarity, dataModel);
        return new GenericUserBasedRecommender(
            dataModel, neighborhood, similarity);
    }
}

Now that we have a class that returns a recommender object, we can initialize a RecommenderEvaluator instance. The default implementation is the AverageAbsoluteDifferenceRecommenderEvaluator class, which computes the average absolute difference between the predicted and actual ratings for users. The following code shows how to put the pieces together and run a hold-out test.


First, load a data model:

DataModel dataModel = new FileDataModel( new File("/path/to/dataset.csv"));

Next, initialize an evaluator instance, as follows:

RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();

Initialize the BookRecommender object, implementing the RecommenderBuilder interface:

RecommenderBuilder builder = new BookRecommender();

Finally, call the evaluate() method, which accepts the following parameters:

• RecommenderBuilder: This is the object implementing RecommenderBuilder that can build the recommender to test
• DataModelBuilder: The DataModelBuilder to use; if null, a default DataModel implementation will be used
• DataModel: This is the dataset that will be used for testing
• trainingPercentage: This indicates the percentage of each user's preferences to use to produce recommendations; the rest are compared to estimated preference values to evaluate the recommender performance
• evaluationPercentage: This is the percentage of users to be used in evaluation

The method is called as follows:

double result = evaluator.evaluate(builder, null, model, 0.9, 1.0); System.out.println(result);

The method returns a double, where 0 represents the best possible evaluation, meaning that the recommender perfectly matches user preferences. In general, the lower the value, the better the match.
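Besides the average absolute difference, Mahout also ships an information-retrieval style evaluator that reports precision and recall at a given list length. The following sketch shows how it is typically invoked; the exact class name, evaluate() signature, and CHOOSE_THRESHOLD constant should be verified against the Mahout version you are using:

// Precision and recall at 10 for the recommender built by 'builder' (a sketch;
// verify GenericRecommenderIRStatsEvaluator against your Mahout Javadoc)
RecommenderIRStatsEvaluator irEvaluator = new GenericRecommenderIRStatsEvaluator();
IRStatistics stats = irEvaluator.evaluate(
    builder,      // the RecommenderBuilder from above
    null,         // default DataModelBuilder
    model,        // the data model to evaluate on
    null,         // no rescorer
    10,           // recommend 10 items per user
    GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, // pick the relevance threshold automatically
    1.0);         // evaluate on all users
System.out.println("Precision: " + stats.getPrecision()
    + ", recall: " + stats.getRecall());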

Online learning engine

What about the online aspect? The approach above works great for existing users, but what about new users who register with the service? We certainly want to provide some reasonable recommendations for them as well. Creating a recommender instance is expensive (it definitely takes longer than a usual network request), so we can't just create a new one on each request.


Luckily, Mahout has a possibility of adding temporary users to a data model. The general set up is as follows:

• Periodically recreate the whole recommender using current data (for example, each day or hour, depending on how long it takes)
• When doing a recommendation, check whether the user exists in the system
• If yes, complete the recommendation as always
• If no, create a temporary user, fill in the preferences, and do the recommendation

The first part (periodically recreating the recommender) may actually be quite tricky if you are limited on memory: while creating the new recommender, you need to hold two copies of the data in memory (in order to still be able to serve requests from the old one). However, as this doesn't really have anything to do with recommendations, I won't go into details here.
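One common way to keep serving requests while the new recommender is being built is to publish it through an atomic reference once it is ready. This is a general Java pattern rather than anything Mahout-specific; initialRecommender and buildRecommenderFromLatestData() in the following sketch are hypothetical placeholders:

import java.util.concurrent.atomic.AtomicReference;

// Requests always read the current recommender; a background rebuild thread
// swaps in a fresh instance atomically when it has finished.
AtomicReference<Recommender> current = new AtomicReference<>(initialRecommender);

Runnable rebuild = () -> {
    Recommender fresh = buildRecommenderFromLatestData(); // hypothetical helper
    current.set(fresh); // atomic swap; the old copy becomes garbage once unused
};

// Serving side:
// List<RecommendedItem> items = current.get().recommend(userID, 10);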

As for the temporary users, we can wrap our data model with a PlusAnonymousConcurrentUserDataModel instance. This class allows us to obtain a temporary user ID; the ID must be later released so that it can be reused (there's a limited number of such IDs). After obtaining the ID, we have to fill in the preferences, and then, we can proceed with the recommendation as always:

class OnlineRecommendation {

    Recommender recommender;
    int concurrentUsers = 100;
    int noItems = 10;

    public OnlineRecommendation() throws IOException {
        DataModel model = new StringItemIdFileDataModel(
            new File("datasets/chap6/BX-Book-Ratings.csv"), ";");
        PlusAnonymousConcurrentUserDataModel plusModel =
            new PlusAnonymousConcurrentUserDataModel(model, concurrentUsers);
        recommender = ...; // build a recommender on top of plusModel as shown before
    }

    public List<RecommendedItem> recommend(long userId,
            PreferenceArray preferences) throws TasteException {

        if (userExistsInDataModel(userId)) { // hypothetical helper
            return recommender.recommend(userId, noItems);
        } else {
            PlusAnonymousConcurrentUserDataModel plusModel =
                (PlusAnonymousConcurrentUserDataModel) recommender.getDataModel();

            // Take an available anonymous user from the pool
            Long anonymousUserID = plusModel.takeAvailableUser();

            // Set temporary preferences for the anonymous user
            PreferenceArray tempPrefs = preferences;
            tempPrefs.setUserID(0, anonymousUserID);
            plusModel.setTempPrefs(tempPrefs, anonymousUserID);

            List<RecommendedItem> results =
                recommender.recommend(anonymousUserID, noItems);

            // Release the user back to the pool
            plusModel.releaseUser(anonymousUserID);

            return results;
        }
    }
}

Content-based filtering

Content-based filtering is out of scope of the Mahout framework, mainly because it is up to you to decide how to define similar items. If we want to do content-based item-item similarity, we need to implement our own ItemSimilarity. For instance, in our book dataset, we might want to make up the following rule for book similarity:

• If genres are the same, add 0.15 to similarity
• If author is the same, add 0.50 to similarity


We could now implement our own similarity measure as follows:

class MyItemSimilarity implements ItemSimilarity {
    ...
    public double itemSimilarity(long itemID1, long itemID2) {
        MyBook book1 = lookupMyBook(itemID1);
        MyBook book2 = lookupMyBook(itemID2);
        double similarity = 0.0;
        if (book1.getGenre().equals(book2.getGenre())) {
            similarity += 0.15;
        }
        if (book1.getAuthor().equals(book2.getAuthor())) {
            similarity += 0.50;
        }
        return similarity;
    }
    ...
}

We then use this ItemSimilarity instead of something like LogLikelihoodSimilarity or other implementations with a GenericItemBasedRecommender. That's about it. This is as far as we have to go to perform content-based recommendation in the Mahout framework.

What we saw here is one of the simplest forms of content-based recommendation. Another approach could be to create a content-based profile of users, based on a weighted vector of item features. The weights denote the importance of each feature to the user and can be computed from individually-rated content vectors.
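A minimal, self-contained sketch of that idea follows: the user profile is the rating-weighted average of the feature vectors of items the user has rated, and a candidate item is scored by a simple dot product with that profile. The feature vectors, ratings, and class name are made up for illustration:

public class ContentProfileSketch {

    // Profile = rating-weighted average of the rated items' feature vectors
    static double[] buildProfile(double[][] ratedItemFeatures, double[] ratings) {
        double[] profile = new double[ratedItemFeatures[0].length];
        double totalWeight = 0;
        for (int i = 0; i < ratedItemFeatures.length; i++) {
            for (int j = 0; j < profile.length; j++) {
                profile[j] += ratings[i] * ratedItemFeatures[i][j];
            }
            totalWeight += ratings[i];
        }
        for (int j = 0; j < profile.length; j++) {
            profile[j] /= totalWeight;
        }
        return profile;
    }

    // Score a candidate item by the dot product with the user profile
    static double score(double[] profile, double[] itemFeatures) {
        double s = 0;
        for (int j = 0; j < profile.length; j++) {
            s += profile[j] * itemFeatures[j];
        }
        return s;
    }

    public static void main(String[] args) {
        // Hypothetical binary features, e.g. {thriller, romance, bestseller}
        double[][] rated = {{1, 0, 1}, {0, 1, 1}};
        double[] ratings = {9, 3};
        double[] profile = buildProfile(rated, ratings);
        System.out.println("score = " + score(profile, new double[]{1, 0, 0}));
    }
}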

Summary In this chapter, you learned the basic concepts of recommendation engines, the difference between collaborative and content-based filtering, and how to use Apache Mahout, which is a great basis to create recommenders as it is very configurable and provides many extension points. We looked at how to pick the right configuration parameter values, set up rescoring, and evaluate the recommendation results.

With this chapter, we completed data science techniques to analyze customer behavior that started with customer-relationship prediction in Chapter 4, Customer Relationship Prediction with Ensembles, and continued with affinity analytics in Chapter 5, Affinity Analysis. In the next chapter, we will move on to other topics, such as fraud and anomaly detection.

Fraud and Anomaly Detection

Outlier detection is used to identify exceptions, rare events, or other anomalous situations. Such anomalies may be hard-to-find needles in a haystack, but their consequences may nonetheless be quite dramatic, for instance, in credit card fraud detection, identifying network intrusion, faults in manufacturing processes, clinical trials, voting activities, and criminal activities in e-commerce. Discovered anomalies therefore represent high value when they are found, or high costs if they are not. Applying machine learning to outlier detection problems brings new insight and better detection of outlier events. Machine learning can take into account many disparate sources of data and find correlations that are too obscure for human analysis to identify.

Take the example of e-commerce fraud detection. With a machine learning algorithm in place, the purchaser's online behavior, that is, website browsing history, becomes a part of the fraud detection algorithm rather than simply considering the history of purchases made by the cardholder. This involves analyzing a variety of data sources, but it is also a far more robust approach to e-commerce fraud detection.

In this chapter, we will cover the following topics:

• Problems and challenges
• Suspicious pattern detection
• Anomalous pattern detection
• Working with unbalanced datasets
• Anomaly detection in time series

Suspicious and anomalous behavior detection

The problem of learning patterns from sensor data arises in many applications, including e-commerce, smart environments, video surveillance, network analysis, human-robot interaction, ambient assisted living, and so on. We focus on detecting patterns that deviate from regular behavior and might represent a security risk, a health problem, or any other abnormal behavior contingency.

In other words, deviant behavior is a data pattern that either does not conform to the expected behavior (anomalous behavior) or matches a previously defined unwanted behavior (suspicious behavior). Deviant behavior patterns are also referred to as outliers, exceptions, peculiarities, surprises, misuse, and so on. Such patterns occur relatively infrequently; however, when they do occur, their consequences can be quite dramatic, and often negatively so. Typical examples include credit card fraud detection, cyber-intrusions, and industrial damage. In e-commerce, fraud is estimated to cost merchants more than $200 billion a year; in healthcare, fraud is estimated to cost taxpayers $60 billion a year; for banks, the cost is over $12 billion.

Unknown-unknowns When Donald Rumsfeld, US Secretary of Defense, had a news briefing on February 12, 2002, about the lack of evidence linking the government of Iraq to the supply of weapons of mass destruction to terrorist groups, it immediately became a subject of much commentary. Rumsfeld stated (DoD News, 2012):

"Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don't know we don't know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones."

The statement might seem confusing at first, but the idea of unknown unknowns was well studied among scholars dealing with risk, as well as at the NSA and other intelligence agencies. What the statement basically says is the following:

• Known-knowns: These are well-known problems or issues; we know how to recognize them and how to deal with them
• Known-unknowns: These are expected or foreseeable problems, which can be reasonably anticipated, but have not occurred before


• Unknown-unknowns: These are unexpected and unforeseeable problems, which pose significant risk as they cannot be anticipated, based on previous experience

In the following sections, we will look into two fundamental approaches dealing with the first two types of knowns and unknowns: suspicious pattern detection dealing with known-knowns and anomalous pattern detection targeting known-unknowns.

Suspicious pattern detection

The first approach assumes a library of behaviors that encodes negative patterns, shown as red minus signs in the following image; recognizing that an observed behavior is suspicious then corresponds to finding a match in the library. If a new pattern (blue circle) can be matched against the negative patterns, it is considered suspicious:

For example, when you visit a doctor, she inspects various health symptoms (body temperature, pain levels, affected areas, and so on) and matches the symptoms to a known disease. In machine learning terms, the doctor collects attributes and performs classifications.

An advantage of this approach is that we immediately know what is wrong; for example, assuming we know the disease, we can select appropriate treatment procedure.

A major disadvantage of this approach is that it can detect only suspicious patterns that are known in advance. If a pattern is not inserted into a negative pattern library, then we will not be able to recognize it. This approach is, therefore, appropriate for modeling known-knowns.

Anomalous pattern detection

The second approach uses the pattern library in an inverse fashion, meaning that the library encodes only positive patterns, marked with green plus signs in the following image. When an observed behavior (blue circle) cannot be matched against the library, it is considered anomalous:

This approach requires us to model only what we have seen in the past, that is, normal patterns. If we return to the doctor example, the main reason we visited the doctor in the first place was because we did not feel fine. Our perceived state of feelings (for example, headache, sore skin) did not match our usual feelings, and therefore, we decided to seek a doctor. We don't know which disease caused this state, nor do we know the treatment, but we were able to observe that it doesn't match the usual state.

A major advantage of this approach is that it does not require us to say anything about non-normal patterns; hence, it is appropriate for modeling known-unknowns and unknown-unknowns. On the other hand, it does not tell us what exactly is wrong.

Analysis types Several approaches have been proposed to tackle the problem either way. We broadly classify anomalous and suspicious behavior detection in the following three categories: pattern analysis, transaction analysis, and plan recognition. In the following sections, we will quickly look into some real-life applications.

Pattern analysis An active area of anomalous and suspicious behavior detection from patterns is based on visual modalities such as camera. Zhang et al (2007) proposed a system for a visual human motion analysis from a video sequence, which recognizes unusual behavior based on walking trajectories; Lin et al (2009) described a video surveillance system based on color features, distance features, and a count feature, where evolutionary techniques are used to measure observation similarity. The system tracks each person and classifies their behavior by analyzing their trajectory patterns. The system extracts a set of visual low-level features in different parts of the image, and performs a classification with SVMs to detect aggressive, cheerful, intoxicated, nervous, neutral, and tired behavior.

Transaction analysis

Transaction analysis assumes discrete states/transactions, in contrast to continuous observations. A major research area is Intrusion Detection (ID), which aims at detecting attacks against information systems in general. There are two types of ID systems, signature-based and anomaly-based, that broadly follow the suspicious and anomalous pattern detection described in the previous sections. A comprehensive review of ID approaches was published by Gyanchandani et al (2012).

Furthermore, applications in ambient-assisted living that are based on wearable sensors also fit to transaction analysis as sensing is typically event-based. Lymberopoulos et al (2008) proposed a system for automatic extraction of the users' spatio-temporal patterns encoded as sensor activations from the sensor network deployed inside their home. The proposed method, based on location, time, and duration, was able to extract frequent patterns using the Apriori algorithm and encode the most frequent patterns in the form of a Markov chain. Another area of related work includes Hidden Markov Models (HMMs) (Rabiner, 1989) that are widely used in traditional activity recognition for modeling a sequence of actions, but these topics are already out of scope of this book.

Plan recognition

Plan recognition focuses on a mechanism for recognizing the unobservable state of an agent, given observations of its interaction with its environment (Avrahami-Zilberbrand, 2009). Most existing investigations assume discrete observations in the form of activities. To perform anomalous and suspicious behavior detection, plan recognition algorithms may use a hybrid approach: a symbolic plan recognizer is used to filter consistent hypotheses, passing them to an evaluation engine, which focuses on ranking.

These were advanced approaches applied to various real-life scenarios targeted at discovering anomalies. In the following sections, we'll dive into more basic approaches for suspicious and anomalous pattern detection.

Fraud detection of insurance claims

First, we'll take a look at suspicious behavior detection, where the goal is to learn known patterns of fraud, which corresponds to modeling known-knowns.

Dataset

We'll work with a dataset describing insurance transactions publicly available at Oracle Database Online Documentation (2015), as follows:
http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/anomalies.htm

The dataset describes insurance vehicle incident claims for an undisclosed insurance company. It contains 15,430 claims; each claim comprises 33 attributes describing the following components:

• Customer demographic details (Age, Sex, MaritalStatus, and so on)
• Purchased policy (PolicyType, VehicleCategory, number of supplements, agent type, and so on)
• Claim circumstances (day/month/week claimed, policy report filed, witness present, past days between incident-policy report, incident-claim, and so on)
• Other customer data (number of cars, previous claims, DriverRating, and so on)
• Fraud found (yes and no)

A sample of the database shown in the following screenshot depicts the data loaded into Weka:


Now the task is to create a model that will be able to identify suspicious claims in the future. The challenging thing about this task is the fact that only 6% of the claims are suspicious. If we create a dummy classifier saying that no claim is suspicious, it will be accurate in 94% of cases. Therefore, in this task, we will use different accuracy measures: precision and recall.

Recall the outcome table from Chapter 1, Applied Machine Learning Quick Start, where there are four possible outcomes denoted as true positive, false positive, false negative, and true negative:

                   Classified as fraud     Classified as no fraud
Actual fraud       TP—true positive        FN—false negative
Actual no fraud    FP—false positive       TN—true negative

Precision and recall are defined as follows:

• Precision is equal to the proportion of correctly raised alarms, as follows:

Pr = TP / (TP + FP)

• Recall is equal to the proportion of deviant signatures, which are correctly identified as such:

Re = TP / (TP + FN)

• With these measures, our dummy classifier scores Pr = 0 and Re = 0, as it never marks any instance as fraud (TP = 0). In practice, we want to compare classifiers by both numbers; hence, we use the F-measure. This is a de facto measure that calculates the harmonic mean between precision and recall, as follows:

F-measure = 2 * Pr * Re / (Pr + Re)
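To make these measures concrete, consider a hypothetical classifier (the counts below are made up for illustration and are not from the claims dataset) that raises 100 alarms, 40 of which are real frauds, while missing 60 other frauds:

// hypothetical confusion-matrix counts, for illustration only
int TP = 40; // frauds correctly flagged
int FP = 60; // legitimate claims incorrectly flagged
int FN = 60; // frauds that were missed

double precision = (double) TP / (TP + FP); // 0.40
double recall = (double) TP / (TP + FN); // 0.40
double fMeasure = 2 * precision * recall / (precision + recall); // 0.40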

Now let's move on to designing a real classifier.

Modeling suspicious patterns

To design a classifier, we can follow the standard supervised learning steps as described in Chapter 1, Applied Machine Learning Quick Start. In this recipe, we will include some additional steps to handle the unbalanced dataset and evaluate classifiers based on precision and recall. The plan is as follows:

• Load the data in the .csv format
• Assign the class attribute
• Convert all the attributes from numeric to nominal in order to make sure there are no incorrectly loaded numerical values
• Experiment 1: Evaluate models with k-fold cross validation
• Experiment 2: Rebalance the dataset to a more balanced class distribution and manually perform cross validation
• Compare classifiers by recall, precision, and F-measure

First, let's load the data using the CSVLoader class, as follows:

String filePath = "/Users/bostjan/Dropbox/ML Java Book/book/datasets/chap07/claims.csv";

CSVLoader loader = new CSVLoader();
loader.setFieldSeparator(",");
loader.setSource(new File(filePath));
Instances data = loader.getDataSet();

Next, we need to make sure all the attributes are nominal. During the data import, Weka applies some heuristics to guess the most probable attribute type, that is, numeric, nominal, string, or date. As heuristics cannot always guess the correct type, we can set types manually, as follows:

NumericToNominal toNominal = new NumericToNominal();
toNominal.setInputFormat(data);
data = Filter.useFilter(data, toNominal);

Before we continue, we need to specify the attribute that we will try to predict. We can achieve this by calling the setClassIndex(int) function:

int CLASS_INDEX = 15;
data.setClassIndex(CLASS_INDEX);


Next, we need to remove an attribute describing the policy number as it has no predictive value. We simply apply the Remove filter, as follows:

Remove remove = new Remove();
remove.setOptions(new String[]{"-R", ""+POLICY_INDEX}); // set the options before the input format so they take effect
remove.setInputFormat(data);
data = Filter.useFilter(data, remove);

Now we are ready to start modeling.

Vanilla approach

The vanilla approach is to directly apply the lesson as demonstrated in Chapter 3, Basic Algorithms – Classification, Regression, Clustering, without any pre-processing and without taking dataset specifics into account. To demonstrate the drawbacks of the vanilla approach, we will simply build a model with default parameters and apply k-fold cross validation.

First, let's define some classifiers that we want to test:

ArrayList<Classifier> models = new ArrayList<Classifier>();
models.add(new J48());
models.add(new RandomForest());
models.add(new NaiveBayes());
models.add(new AdaBoostM1());
models.add(new Logistic());

Next, we create an Evaluation object and perform k-fold cross validation by calling the crossValidateModel(Classifier, Instances, int, Random, Object[]) method, outputting the precision, recall, and F-measure:

int FOLDS = 3;
Evaluation eval = new Evaluation(data);

// FRAUD holds the index of the fraud class value (defined elsewhere in the example)
for(Classifier model : models){
  eval.crossValidateModel(model, data, FOLDS, new Random(1), new String[] {});
  System.out.println(model.getClass().getName() + "\n"+
    "\tRecall:    "+eval.recall(FRAUD) + "\n"+
    "\tPrecision: "+eval.precision(FRAUD) + "\n"+
    "\tF-measure: "+eval.fMeasure(FRAUD));
}


The evaluation outputs the following scores:

weka.classifiers.trees.J48
  Recall:    0.03358613217768147
  Precision: 0.9117647058823529
  F-measure: 0.06478578892371996
...
weka.classifiers.functions.Logistic
  Recall:    0.037486457204767065
  Precision: 0.2521865889212828
  F-measure: 0.06527070364082249

We can see that the results are not very promising. Recall, that is, the share of discovered frauds among all frauds, is only 1-3%, meaning that only 1 to 3 out of 100 frauds are detected. On the other hand, precision, that is, the accuracy of the alarms, is 91%, meaning that in 9 out of 10 cases, when a claim is marked as fraud, the model is correct.

Dataset rebalancing

As the number of negative examples, that is, frauds, is very small compared to positive examples, the learning algorithms struggle with induction. We can help them by giving them a dataset where the share of positive and negative examples is comparable. This can be achieved with dataset rebalancing.

Weka has a built-in filter, Resample, which produces a random subsample of a dataset using either sampling with replacement or without replacement. The filter can also bias distribution towards a uniform class distribution.

We will proceed by manually implementing k-fold cross validation. First, we will split the dataset into k equal folds. Fold k will be used for testing, while the other folds will be used for learning. To split dataset into folds, we'll use the StratifiedRemoveFolds filter, which maintains the class distribution within the folds, as follows:

StratifiedRemoveFolds kFold = new StratifiedRemoveFolds();
kFold.setInputFormat(data);

double measures[][] = new double[models.size()][3];

for(int k = 1; k <= FOLDS; k++){

  // split the data into test and train folds
  kFold.setOptions(new String[]{
    "-N", ""+FOLDS, "-F", ""+k, "-S", "1"});
  Instances test = Filter.useFilter(data, kFold);

  kFold.setOptions(new String[]{
    "-N", ""+FOLDS, "-F", ""+k, "-S", "1", "-V"}); // "-V" selects the inverse, that is, the remaining folds
  Instances train = Filter.useFilter(data, kFold);

Next, we can rebalance the training dataset, where the -Z parameter specifies the percentage of the dataset to be resampled and -B biases the class distribution towards a uniform distribution:

  Resample resample = new Resample();
  resample.setInputFormat(data);
  resample.setOptions(new String[]{"-Z", "100", "-B", "1"}); // sample with replacement
  Instances balancedTrain = Filter.useFilter(train, resample);

Next, we can build the classifiers and perform the evaluation:

  for(ListIterator<Classifier> it = models.listIterator(); it.hasNext();){
    Classifier model = it.next();
    model.buildClassifier(balancedTrain);
    eval = new Evaluation(balancedTrain);
    eval.evaluateModel(model, test);

    // save the results for averaging
    measures[it.previousIndex()][0] += eval.recall(FRAUD);
    measures[it.previousIndex()][1] += eval.precision(FRAUD);
    measures[it.previousIndex()][2] += eval.fMeasure(FRAUD);
  }
} // end of the fold loop

Finally, we calculate the average and output the best model:

// calculate the averages
for(int i = 0; i < models.size(); i++){
  measures[i][0] /= 1.0 * FOLDS;
  measures[i][1] /= 1.0 * FOLDS;
  measures[i][2] /= 1.0 * FOLDS;
}

// output the results and select the best model
Classifier bestModel = null;
double bestScore = -1;
for(ListIterator<Classifier> it = models.listIterator(); it.hasNext();){
  Classifier model = it.next();
  double fMeasure = measures[it.previousIndex()][2];
  System.out.println(
    model.getClass().getName() + "\n"+
    "\tRecall:    "+measures[it.previousIndex()][0] + "\n"+
    "\tPrecision: "+measures[it.previousIndex()][1] + "\n"+
    "\tF-measure: "+fMeasure);
  if(fMeasure > bestScore){
    bestScore = fMeasure;
    bestModel = model;
  }
}
System.out.println("Best model: "+bestModel.getClass().getName());

Now the performance of the models has significantly improved, as follows:

weka.classifiers.trees.J48
  Recall:    0.44204845100610574
  Precision: 0.14570766048577555
  F-measure: 0.21912423640160392
...
weka.classifiers.functions.Logistic
  Recall:    0.7670657247204478
  Precision: 0.13507459756495374
  F-measure: 0.22969038530557626
Best model: weka.classifiers.functions.Logistic

We can see that all the models have scored significantly better; for instance, the best model, logistic regression, correctly discovers 76% of frauds, while producing a considerable number of false alarms: only 13% of claims marked as fraud are indeed fraudulent. If an undetected fraud is significantly more expensive than the investigation of false alarms, then it makes sense to accept an increased number of false alarms.

The overall performance most likely still has some room for improvement; we could perform attribute selection and feature generation and apply the more complex model learning that we discussed in Chapter 3, Basic Algorithms – Classification, Regression, Clustering.

Anomaly detection in website traffic

In the second example, we'll focus on modeling the opposite of the previous example. Instead of modeling known fraud cases, we'll model the normal, expected behavior of the system. If something cannot be matched against our expected model, it will be considered anomalous.

Dataset

We'll work with a publicly available dataset released by Yahoo Labs that is useful for discussing how to detect anomalies in time series data. For Yahoo, the main use case is in detecting unusual traffic on Yahoo servers.

Even though Yahoo announced that their data is publicly available, you have to apply to use it, and it takes about 24 hours before the approval is granted. The dataset is available here: http://webscope.sandbox.yahoo.com/catalog.php?datatype=s&did=70

The dataset comprises real traffic to Yahoo services, along with some synthetic data. In total, the dataset contains 367 time series, each of which contains between 741 and 1680 observations recorded at regular intervals. Each series is written in its own file, one observation per line. Each observation is accompanied by a second indicator column with a one if the observation was an anomaly and zero otherwise. The anomalies in the real data were determined by human judgment, while those in the synthetic data were generated algorithmically. A snippet of the synthetic time series data is shown in the following table:


In the following section, we'll learn how to transform time series data into an attribute representation that allows us to apply machine learning algorithms.

Anomaly detection in time series data

Detecting anomalies in raw, streaming time series data requires some data transformations. The most obvious way is to select a time window and sample the time series with a fixed length. In the next step, we want to compare a new time series to our previously collected set to detect if something is out of the ordinary.

The comparison can be done with various techniques, as follows:

• Forecasting the most probable following value, as well as confidence intervals (for example, Holt-Winters exponential smoothing). If a new value is outside the forecasted confidence interval, it is considered anomalous.
• Cross-correlation compares a new sample to a library of positive samples, looking for an exact match. If a match is not found, it is marked as anomalous.
• Dynamic time warping is similar to cross-correlation, but allows signal distortion in the comparison.
• Discretizing the signal into bands, where each band corresponds to a letter. For example, A=[min, mean/3], B=[mean/3, mean*2/3], and C=[mean*2/3, max] transforms the signal into a sequence of letters such as aAABAACAABBA…. This approach reduces the storage and allows us to apply text-mining algorithms that we will discuss in Chapter 10, Text Mining with Mallet – Topic Modeling and Spam Detection (a small sketch of this discretization follows this list).
• Distribution-based approaches estimate the distribution of values in a specific time window. When we observe a new sample, we can compare whether its distribution matches the previously observed one.
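The following is a minimal, illustrative sketch of the band-discretization idea from the list above; the three bands and the use of the signal mean are assumptions made for this example, not code from this book:

// discretize a signal into three bands, A, B, and C, relative to its mean
public static String discretize(double[] signal) {
  double mean = 0;
  for (double v : signal) {
    mean += v;
  }
  mean /= signal.length;

  StringBuilder letters = new StringBuilder();
  for (double v : signal) {
    if (v < mean / 3) {
      letters.append('A');
    } else if (v < mean * 2 / 3) {
      letters.append('B');
    } else {
      letters.append('C');
    }
  }
  return letters.toString();
}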

This list is, by no means, exhaustive. Different approaches are focused on detecting different anomalies (for example, in value, frequency, and distribution). We will focus on a version of distribution-based approaches.

Histogram-based anomaly detection

In histogram-based anomaly detection, we split signals by some selected time window, as shown in the following image.

For each window, we calculate the histogram, that is, for a selected number of buckets, we count how many values fall into each bucket. The histogram captures basic distribution of values in a selected time window as shown at the center of the diagram.


Histograms can then be directly presented as instances, where each bin corresponds to an attribute. Further, we can reduce the number of attributes by applying a dimensionality-reduction technique such as Principal Component Analysis (PCA), which allows us to visualize the reduced-dimension histograms in a plot, as shown at the bottom-right of the diagram, where each dot corresponds to a histogram.

In our example, the idea is to observe website traffic for a couple of days and then to create histograms over, for example, four-hour time windows to build a library of positive behavior. If a new time-window histogram cannot be matched against the positive library, we can mark it as an anomaly:


For comparing a new histogram to a set of existing histograms, we will use a density-based k-nearest neighbor algorithm, the Local Outlier Factor (LOF) (Breunig et al, 2000). The algorithm is able to handle clusters with different densities, as shown in the following image. For example, the upper-right cluster is large and widespread compared to the bottom-left cluster, which is smaller and denser:

Let's get started!

Loading the data

In the first step, we'll need to load the data from the text files into a Java object. The files are stored in a folder; each file contains one time series with one value per line. We'll load them into a list of Double lists:

String filePath = "chap07/ydata/A1Benchmark/real";
List<List<Double>> rawData = new ArrayList<List<Double>>();


We will need the min and max value for histogram normalization, so let's collect them in this data pass:

double max = Double.MIN_VALUE;
double min = Double.MAX_VALUE;

for(int i = 1; i <= 67; i++){
  List<Double> sample = new ArrayList<Double>();
  BufferedReader reader = new BufferedReader(
    new FileReader(filePath + i + ".csv"));

  boolean isAnomaly = false;
  reader.readLine(); // skip the header line
  while(reader.ready()){
    String line[] = reader.readLine().split(",");
    double value = Double.parseDouble(line[1]);
    sample.add(value);

    max = Math.max(max, value);
    min = Math.min(min, value);

    if(line[2].equals("1")) // compare string content, not references
      isAnomaly = true;
  }
  System.out.println(isAnomaly);
  reader.close();

  rawData.add(sample);
}

The data is loaded, now let's move on to histograms.

Creating histograms

We will create a histogram for a selected time window of width WIN_SIZE. The histogram will hold HIST_BINS value buckets. The histograms, each consisting of an array of double values, will be stored in an array list:

int WIN_SIZE = 500;
int HIST_BINS = 20;
int current = 0;

List<double[]> dataHist = new ArrayList<double[]>();
for(List<Double> sample : rawData){
  double[] histogram = new double[HIST_BINS];
  for(double value : sample){
    int bin = toBin(normalize(value, min, max), HIST_BINS);
    histogram[bin]++;
    current++;
    if(current == WIN_SIZE){
      current = 0;
      dataHist.add(histogram);
      histogram = new double[HIST_BINS];
    }
  }
  dataHist.add(histogram);
}
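The normalize() and toBin() helpers used above are not shown in this excerpt; a minimal sketch, assuming normalize() maps a value into the [0, 1] range and toBin() maps that range onto a bucket index, could look as follows:

// scale a value into [0, 1] given the global min and max collected earlier
static double normalize(double value, double min, double max) {
  return (value - min) / (max - min);
}

// map a normalized value in [0, 1] to a bucket index in [0, bins - 1]
static int toBin(double normalized, int bins) {
  int bin = (int) (normalized * bins);
  return Math.min(bin, bins - 1); // keep the maximum value in the last bucket
}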

Histograms are now completed. The last step is to transform them into Weka's Instance objects. Each histogram value will correspond to one Weka attribute, as follows:

ArrayList<Attribute> attributes = new ArrayList<Attribute>();
for(int i = 0; i < HIST_BINS; i++){
  attributes.add(new Attribute("Hist-" + i)); // one numeric attribute per histogram bin
}
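The remainder of this listing appears truncated in this excerpt. A minimal sketch of how the histograms could be wrapped into a Weka Instances object (the dataset name and the use of DenseInstance are assumptions, not necessarily the book's exact code) is as follows:

// wrap each histogram into a Weka instance; the dataset name is arbitrary
Instances dataset = new Instances("histograms", attributes, dataHist.size());
for (double[] histogram : dataHist) {
  dataset.add(new DenseInstance(1.0, histogram));
}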

The dataset is now loaded and ready to be plugged into an anomaly-detection algorithm.

Density-based k-nearest neighbors

To demonstrate how LOF calculates scores, we'll first split the dataset into a training and a testing set using the testCV(int, int) function. The first parameter specifies the number of folds, while the second parameter specifies which fold to return.

// split the data into train and test sets
Instances trainData = dataset.testCV(2, 0);
Instances testData = dataset.testCV(2, 1);

The LOF algorithm is not a part of the default Weka distribution, but it can be downloaded through Weka's package manager:
http://weka.sourceforge.net/packageMetaData/localOutlierFactor/index.html


The LOF algorithm provides two interfaces: an unsupervised filter that calculates LOF values (known-unknowns) and a supervised k-nn classifier (known-knowns). In our case, we want to calculate the outlier-ness factor; therefore, we'll use the unsupervised filter interface:

import weka.filters.unsupervised.attribute.LOF;

The filter is initialized in the same way as a usual filter. We can specify the k number of neighbors, for example, k=3, with the -min and -max parameters. LOF allows us to specify two different k parameters, which are used internally as the lower and upper bound on the number of neighbors when computing the LOF values:

LOF lof = new LOF();
lof.setInputFormat(trainData);
lof.setOptions(new String[]{"-min", "3", "-max", "3"});

Next, we load the training instances into the filter, which will serve as a positive example library. After we complete the loading, we call the batchFinished() method to initialize internal calculations:

for(Instance inst : trainData){
  lof.input(inst);
}
lof.batchFinished();

Finally, we can apply the filter to the test data. The filter will process the instances and append an additional attribute at the end containing the LOF score. We can simply output the score on the console:

Instances testDataLofScore = Filter.useFilter(testData, lof);

for(Instance inst : testDataLofScore){
  System.out.println(inst.value(inst.numAttributes()-1));
}

The LOF score of the first couple of test instances is as follows:

1.306740014927325
1.318239332210458
1.0294812291949587
1.1715039094530768


To understand the LOF values, we need some background on the LOF algorithm. It compares the density of an instance to the density of its nearest neighbors. The two scores are divided, producing the LOF score. A LOF score around 1 indicates that the density is approximately equal, while higher LOF values indicate that the density of the instance is substantially lower than the density of its neighbors. In such cases, the instance can be marked as anomalous.
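If we want to turn these scores into a binary anomaly flag, one simple option is to threshold them; the cutoff below is an assumption chosen for illustration and would normally be tuned on validation data:

double threshold = 1.5; // illustrative cutoff, tune on validation data
for (Instance inst : testDataLofScore) {
  double lofScore = inst.value(inst.numAttributes() - 1);
  if (lofScore > threshold) {
    System.out.println("Anomalous window, LOF = " + lofScore);
  }
}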

Summary

In this chapter, we looked into detecting anomalous and suspicious patterns. We discussed the two fundamental approaches, focusing on a library that encodes either positive or negative patterns. Next, we got our hands on two real-life datasets, where we discussed how to deal with unbalanced class distributions and how to perform anomaly detection in time series data.

In the next chapter, we'll dive deeper into patterns and more advanced approaches to building pattern-based classifiers, discussing how to automatically assign labels to images with deep learning.

Image Recognition with Deeplearning4j

Images have become ubiquitous in web services, social networks, and web stores. In contrast to humans, computers have great difficulty understanding what is in an image and what it represents. In this chapter, we'll first look at the challenges of teaching computers how to understand images, and then focus on an approach based on deep learning. We'll look at the high-level theory required to configure a deep learning model and discuss how to implement a model that is able to classify images using a Java library, Deeplearning4j.

This chapter will cover the following topics:

• Introducing image recognition
• Discussing deep learning fundamentals
• Building an image recognition model

Introducing image recognition

A typical goal of image recognition is to detect and identify an object in a digital image. Image recognition is applied in factory automation to monitor product quality; in surveillance systems to identify potentially risky activities, such as moving persons or vehicles; in security applications to provide biometric identification through fingerprints, iris, or facial features; in autonomous vehicles to reconstruct conditions on the road and in the environment; and so on.


Digital images are not presented in a structured way with attribute-based descriptions; instead, they are encoded as the amount of color in different channels, for instance, black-white or red-green-blue channels. The learning goal is then to identify patterns associated with a particular object. The traditional approach to image recognition consists of transforming an image into different forms, for instance, identifying object corners, edges, same-color blobs, and basic shapes. Such patterns are then used to train a learner to distinguish between objects. Some notable examples of traditional algorithms are as follows:

• Edge detection finds boundaries of objects within an image
• Corner detection identifies intersections of two edges or other interesting points, such as line endings, curvature maxima/minima, and so on
• Blob detection identifies regions that differ in a property, such as brightness or color, compared to their surrounding regions
• Ridge detection identifies additional interesting points in the image using smooth functions
• Scale Invariant Feature Transform (SIFT) is a robust algorithm that can match objects even if their scale or orientation differs from the representative samples in the database
• Hough transform identifies particular patterns in the image

A more recent approach is based on deep learning. Deep learning is a form of neural network that mimics how the brain processes information. The main advantage of deep learning is that it is possible to design neural networks that can automatically extract relevant patterns, which, in turn, can be used to train a learner. With recent advances in neural networks, image recognition accuracy has been boosted significantly. For instance, the ImageNet challenge (ImageNet, 2016), where competitors are provided with more than 1.2 million images from 1,000 different object categories, reports that the error rate of the best algorithm was reduced from 28% in 2010, using SVMs, to only 7% in 2014, using a deep neural network.

In this chapter, we'll take a quick look at neural networks, starting from the basic building block—perceptron—and gradually introducing more complex structures.

Neural networks

The first neural networks, introduced in the sixties, were inspired by biological neural networks. Recent advances in neural networks have shown that deep neural networks fit very well to pattern recognition tasks, as they are able to automatically extract interesting features and learn the underlying representation. In this section, we'll refresh the fundamental structures and components, from a single perceptron to deep networks.

Perceptron

The perceptron is a basic neural network building block and one of the earliest supervised algorithms. It is defined as a weighted sum of the input features plus a bias. The function that sums all of this together is called the sum transfer function, and it is fed into an activation function. If the binary step activation function reaches a threshold, the output is 1, and otherwise 0, which gives us a binary classifier. A schematic illustration is shown in the following diagram:

Training involves a fairly simple learning algorithm that calculates the errors between the calculated output values and the correct training output values, and uses them to adjust the weights, thus implementing a form of gradient descent. This algorithm is usually called the delta rule. A single-layer perceptron is not very advanced, and nonlinearly separable functions, such as XOR, cannot be modeled using it. To address this issue, a structure with multiple perceptrons was introduced, called the multilayer perceptron, also known as a feedforward neural network.
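To make the perceptron and the delta rule more concrete, here is a minimal, self-contained sketch in plain Java (an illustration only, not tied to the libraries used later in this chapter):

// a minimal perceptron with a binary step activation and delta-rule updates
public class Perceptron {
  private final double[] weights;
  private double bias;

  public Perceptron(int numFeatures) {
    weights = new double[numFeatures];
  }

  // binary step activation over the weighted sum plus bias
  public int predict(double[] x) {
    double sum = bias;
    for (int i = 0; i < weights.length; i++) {
      sum += weights[i] * x[i];
    }
    return sum >= 0 ? 1 : 0;
  }

  // delta rule: adjust the weights in proportion to the prediction error
  public void train(double[] x, int label, double learningRate) {
    int error = label - predict(x);
    for (int i = 0; i < weights.length; i++) {
      weights[i] += learningRate * error * x[i];
    }
    bias += learningRate * error;
  }
}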

Feedforward neural networks

A feedforward neural network is an artificial neural network that consists of several perceptrons organized in layers, as shown in the following diagram: an input layer, an output layer, and one or more hidden layers. Each perceptron in a layer, also known as a neuron, has direct connections to the perceptrons in the next layer, and each connection between two neurons carries a weight similar to the perceptron weights. The diagram shows a network with a four-unit Input layer, corresponding to a feature vector of length 4, a four-unit Hidden layer, and a two-unit Output layer, where each unit corresponds to one class value:

The most popular approach to training multilayer networks is backpropagation. In backpropagation, the calculated output values are compared with the correct values in the same way as in the delta rule. The error is then fed back through the network by various techniques, adjusting the weights of each connection in order to reduce the value of the error. The process is repeated for a sufficiently large number of training cycles, until the error falls under a certain threshold.

A feedforward neural network can have more than one hidden layer, where each additional hidden layer builds a new abstraction atop the previous layers. This often leads to more accurate models; however, increasing the number of hidden layers leads to the following two known issues:

• Vanishing gradients problem: With more hidden layers, the training with backpropagation becomes less and less useful for passing information to the front layers, causing these layers to train very slowly
• Overfitting: The model fits the training data too well and performs poorly on real examples

Let's look at some other network structures that address these issues.

Autoencoder

An autoencoder is a feedforward neural network that aims to learn how to compress the original dataset. Therefore, instead of mapping features to the input layer and labels to the output layer, we map the features to both the input and output layers. The number of units in the hidden layers is usually different from the number of units in the input layer, which forces the network to either expand or reduce the number of original features. This way, the network learns the important features, while effectively applying dimensionality reduction.

An example network is shown below. The three-unit input layer is first expanded into a four-unit layer and then compressed into a single-unit layer. The other side of the network restores the single-unit layer back to a four-unit layer, and then to the original three-unit input layer:

Once the network is trained, we can take the left-hand side to extract image features as we would with traditional image processing.


Autoencoders can also be combined into stacked autoencoders, as shown in the following image. First, we train the hidden layer in a basic autoencoder, as described previously. Then, we take the learned hidden layer (green circles) and repeat the procedure, which, in effect, learns a more abstract representation. We can repeat the procedure multiple times, transforming the original features into increasingly reduced dimensions. At the end, we take all the hidden layers and stack them into a regular feedforward network, as shown at the top-right part of the diagram:

Restricted Boltzmann machine

A Restricted Boltzmann machine is an undirected neural network, also denoted as a Generative Stochastic Network (GSN), which can learn a probability distribution over its set of inputs. As the name suggests, it originates from the Boltzmann machine, a type of network introduced in the eighties. Restricted means that the neurons must form two fully connected layers—an input layer and a hidden layer—as shown in the following diagram:

Unlike feedforward networks, the connections between the visible and hidden layers are undirected, hence the values can be propagated in both visible-to-hidden and hidden-to-visible directions.


Training Restricted Boltzmann machines is based on the Contrastive Divergence algorithm, which uses a gradient descent procedure, similar to backpropagation, to update the weights, while Gibbs sampling is applied on the Markov chain to estimate the gradient—the direction in which to change the weights.

Restricted Boltzmann machines can also be stacked to create a class of networks known as Deep Belief Networks (DBNs). In this case, the hidden layer of one RBM acts as the visible layer for the next RBM, as shown in the following diagram:

The training, in this case, is incremental; training layer by layer.

Deep convolutional networks

A network structure that has recently achieved very good results on image recognition benchmarks is the Convolutional Neural Network (CNN) or ConvNet. CNNs are a type of feedforward neural network structured in such a way that they emulate the behavior of the visual cortex, exploiting the 2D structure of an input image, that is, patterns that exhibit spatially local correlation.

A CNN consists of a number of convolutional and subsampling layers, optionally followed by fully connected layers. An example is shown in the following image. The input layer reads all the pixels of an image, and then we apply multiple filters. In the following image, four different filters are applied. Each filter is applied to the original image; for example, with a 6 x 6 filter, one output pixel is calculated as the weighted sum of a 6 x 6 square of input pixels and the corresponding 6 x 6 filter weights. This effectively introduces filters similar to those used in standard image processing, such as smoothing, correlation, edge detection, and so on. The resulting image is called a feature map. In the example in the image, we have four feature maps, one for each filter.


The next layer is the subsampling layer, which reduces the size of the input. Each feature map is subsampled, typically with mean or max pooling, over a contiguous region of 2 x 2 (up to 5 x 5 for large images). For example, if the feature map size is 16 x 16 and the subsampling region is 2 x 2, the reduced feature map size is 8 x 8, where 4 pixels (a 2 x 2 square) are combined into a single pixel by calculating the max, min, mean, or some other function:
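As a standalone illustration of the 2 x 2 max-pooling step described above (a sketch only, not Deeplearning4j code, which we will use later in this chapter):

// reduce a feature map by taking the max over non-overlapping 2 x 2 regions
static double[][] maxPool2x2(double[][] featureMap) {
  int rows = featureMap.length / 2;
  int cols = featureMap[0].length / 2;
  double[][] pooled = new double[rows][cols];
  for (int r = 0; r < rows; r++) {
    for (int c = 0; c < cols; c++) {
      pooled[r][c] = Math.max(
        Math.max(featureMap[2 * r][2 * c], featureMap[2 * r][2 * c + 1]),
        Math.max(featureMap[2 * r + 1][2 * c], featureMap[2 * r + 1][2 * c + 1]));
    }
  }
  return pooled; // a 16 x 16 map becomes 8 x 8
}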

The network may contain several consecutive convolution and subsampling layers, as shown in the preceding diagram. A particular feature map is connected to the next reduced/convoluted feature map, while feature maps at the same layer are not connected to each other.

After the last subsampling or convolutional layer, there is usually a fully connected layer, identical to the layers in a standard multilayer neural network, which represents the target data.

CNN is trained using a modified backpropagation algorithm that takes the subsampling layers into account and updates the convolutional filter weights based on all the values where this filter is applied.

Some good CNN designs can be found on the ImageNet competition results page: http://www.image-net.org/ An example is AlexNet, described in the ImageNet Classification with Deep Convolutional Neural Networks paper by A. Krizhevsky et al.

This concludes our review of main neural network structures. In the following section, we'll move to the actual implementation.

Image classification

In this section, we will discuss how to implement some of the neural network structures with the deeplearning4j library. Let's start.

Deeplearning4j As we discussed in Chapter 2, Java Libraries and Platforms for Machine Learning, deeplearning4j is an open source, distributed deep learning project in Java and Scala. Deeplearning4j relies on Spark and Hadoop for MapReduce, trains models in parallel, and iteratively averages the parameters they produce in a central model. A detailed library summary is presented in Chapter 2, Java Libraries and Platforms for Machine Learning.

Getting DL4J

The most convenient way to get deeplearning4j is through the Maven repository:

1. Start a new Eclipse project and pick Maven Project, as shown in the following screenshot:


2. Open the pom.xml file and add the following dependencies under the <dependencies> section:

   <dependency>
     <groupId>org.deeplearning4j</groupId>
     <artifactId>deeplearning4j-nlp</artifactId>
     <version>${dl4j.version}</version>
   </dependency>

   <dependency>
     <groupId>org.deeplearning4j</groupId>
     <artifactId>deeplearning4j-core</artifactId>
     <version>${dl4j.version}</version>
   </dependency>

3. Finally, right-click on Project, select Maven, and pick Update project.

MNIST dataset

One of the most famous datasets is the MNIST dataset, which consists of handwritten digits, as shown in the following image. The dataset comprises 60,000 training and 10,000 testing images:

The dataset is commonly used in image recognition problems to benchmark algorithms. The worst recorded error rate is 12%, with no preprocessing and using an SVM or a one-layer neural network. Currently, as of 2016, the lowest error rate is only 0.21%, using the DropConnect neural network, followed by a deep convolutional network at 0.23% and a deep feedforward network at 0.35%. Now, let's see how to load the dataset.

Loading the data

Deeplearning4j provides the MNIST dataset loader out of the box. The loader is initialized as DataSetIterator. Let's first import the DataSetIterator class and all the supported datasets that are part of the impl package, for example, Iris, MNIST, and others:

import org.deeplearning4j.datasets.iterator.DataSetIterator;
import org.deeplearning4j.datasets.iterator.impl.*;


Next, we'll define some constants, for instance, the images consist of 28 x 28 pixels and there are 10 target classes and 60,000 samples. Initialize a new MnistDataSetIterator class that will download the dataset and its labels. The parameters are iteration batch size, total number of examples, and whether the datasets should be binarized or not:

int numRows = 28;
int numColumns = 28;
int outputNum = 10;
int numSamples = 60000;
int batchSize = 100;
DataSetIterator iter = new MnistDataSetIterator(batchSize, numSamples, true);

Having an already-implemented data importer is really convenient, but it won't work on your own data. Let's take a quick look at how it is implemented and what needs to be modified to support your dataset. If you're eager to start implementing neural networks, you can safely skip the rest of this section and return to it when you need to import your own data.

To load custom data, you'll need to implement two classes: DataSetIterator, which holds all the information about the dataset, and BaseDataFetcher, which actually pulls the data either from a file, a database, or the web. Sample implementations are available on GitHub at https://github.com/deeplearning4j/deeplearning4j/tree/master/deeplearning4j-core/src/main/java/org/deeplearning4j/datasets/iterator/impl. Another option is to use the Canova library, which is developed by the same authors, at http://deeplearning4j.org/canovadoc/.

Building models

In this section, we'll discuss how to build an actual neural network model. We'll start with a basic single-layer neural network to establish a benchmark and discuss the basic operations. Later, we'll improve this initial result with a DBN and a multilayer convolutional network.

Building a single-layer regression model

Let's start by building a single-layer regression model based on the softmax activation function, as shown in the following diagram. As we have a single layer, the input to the neural network will be all the image pixels, that is, 28 x 28 = 784 neurons. The number of output neurons is 10, one for each digit. The network layers are fully connected, as shown in the following diagram:

A neural network is defined through a NeuralNetConfiguration.Builder object, as follows:

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()

We will define the parameters for gradient search in order to perform iterations with the conjugate gradient optimization algorithm. The momentum parameter determines how fast the optimization algorithm converges to a local optimum—the higher the momentum, the faster the training; but a higher speed can lower the model's accuracy, as follows:

.seed(seed)
.gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)
.gradientNormalizationThreshold(1.0)
.iterations(iterations)
.momentum(0.5)
.momentumAfter(Collections.singletonMap(3, 0.9))
.optimizationAlgo(OptimizationAlgorithm.CONJUGATE_GRADIENT)


Next, we will specify that the network will have one layer and define the error function (NEGATIVELOGLIKELIHOOD), the internal perceptron activation function (softmax), and the numbers of input and output neurons, which correspond to the total number of image pixels and the number of target variables:

.list(1)
.layer(0, new OutputLayer.Builder(LossFunction.NEGATIVELOGLIKELIHOOD)
  .activation("softmax")
  .nIn(numRows*numColumns).nOut(outputNum).build())

Finally, we will set the network to pretrain, disable backpropagation, and actually build the untrained network structure:

.pretrain(true).backprop(false) .build();

Once the network structure is defined, we can use it to initialize a new MultiLayerNetwork, as follows:

MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();

Next, we will attach a score listener to monitor the training progress by calling the setListeners method, as follows:

model.setListeners(Collections.singletonList((IterationListener) new ScoreIterationListener(listenerFreq)));

We will also call the fit() method on the dataset iterator to trigger an end-to-end network training:

model.fit(iter);

To evaluate the model, we will initialize a new Evaluation object that will store batch results:

Evaluation eval = new Evaluation(outputNum);

We can then iterate over the dataset in batches in order to keep the memory consumption at a reasonable rate and store the results in an eval object:

DataSetIterator testIter = new MnistDataSetIterator(100, 10000);
while(testIter.hasNext()) {
  DataSet testMnist = testIter.next();
  INDArray predict2 = model.output(testMnist.getFeatureMatrix());
  eval.eval(testMnist.getLabels(), predict2);
}


Finally, we can get the results by calling the stats() function:

log.info(eval.stats());

A basic one-layer model achieves the following accuracy:

Accuracy:  0.8945
Precision: 0.8985
Recall:    0.8922
F1 Score:  0.8953

Getting around 89% accuracy, that is, roughly an 11% error rate, on the MNIST dataset is quite bad. We'll improve on that by going from a simple one-layer network to a moderately sophisticated deep belief network using Restricted Boltzmann machines, and then to a multilayer convolutional network.

Building a deep belief network

In this section, we'll build a deep belief network based on Restricted Boltzmann machines, as shown in the following diagram. The network consists of four layers: the first layer reduces the 784 inputs to 500 neurons, then to 250, followed by 200, and finally to the last 10 target values:


As the code is the same as in the previous example, let's take a look at how to configure such a network:

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()

We defined the gradient optimization algorithm, as shown in the following code:

.seed(seed)
.gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)
.gradientNormalizationThreshold(1.0)
.iterations(iterations)
.momentum(0.5)
.momentumAfter(Collections.singletonMap(3, 0.9))
.optimizationAlgo(OptimizationAlgorithm.CONJUGATE_GRADIENT)

We will also specify that our network will have four layers:

.list(4)

The input to the first layer will be 784 neurons and the output will be 500 neurons. We'll use root mean squared error cross entropy as the loss function and the Xavier algorithm to initialize the weights, which automatically determines the scale of initialization based on the number of input and output neurons, as follows:

.layer(0, new RBM.Builder()
  .nIn(numRows*numColumns)
  .nOut(500)
  .weightInit(WeightInit.XAVIER)
  .lossFunction(LossFunction.RMSE_XENT)
  .visibleUnit(RBM.VisibleUnit.BINARY)
  .hiddenUnit(RBM.HiddenUnit.BINARY)
  .build())

The next two layers will have the same parameters, except the number of input and output neurons:

.layer(1, new RBM.Builder()
  .nIn(500)
  .nOut(250)
  .weightInit(WeightInit.XAVIER)
  .lossFunction(LossFunction.RMSE_XENT)
  .visibleUnit(RBM.VisibleUnit.BINARY)
  .hiddenUnit(RBM.HiddenUnit.BINARY)
  .build())


.layer(2, new RBM.Builder()
  .nIn(250)
  .nOut(200)
  .weightInit(WeightInit.XAVIER)
  .lossFunction(LossFunction.RMSE_XENT)
  .visibleUnit(RBM.VisibleUnit.BINARY)
  .hiddenUnit(RBM.HiddenUnit.BINARY)
  .build())

Now the last layer will map the neurons to outputs, where we'll use the softmax activation function, as follows:

.layer(3, new OutputLayer.Builder()
  .nIn(200)
  .nOut(outputNum)
  .lossFunction(LossFunction.NEGATIVELOGLIKELIHOOD)
  .activation("softmax")
  .build())
.pretrain(true).backprop(false)
.build();

The rest of the training and evaluation is the same as in the single-layer network example. Note that training a deep network might take significantly more time compared to a single-layer network. The accuracy should be around 93%.

Now let's take a look at another deep network.

Building a multilayer convolutional network

In the final example, we'll discuss how to build a convolutional network, as shown in the following diagram. The network will consist of seven layers: first, we'll repeat two pairs of convolutional and subsampling layers with max pooling. The last subsampling layer is then connected to a densely connected feedforward neural network, comprising 120 neurons, 84 neurons, and 10 neurons in the last three layers, respectively. Such a network effectively forms the complete image recognition pipeline, where the first four layers correspond to feature extraction and the last three layers correspond to the learning model:


Network configuration is initialized as we did earlier:

MultiLayerConfiguration.Builder conf = new NeuralNetConfiguration.Builder()

We will specify the gradient descent algorithm and its parameters, as follows:

.seed(seed)
.iterations(iterations)
.activation("sigmoid")
.weightInit(WeightInit.DISTRIBUTION)
.dist(new NormalDistribution(0.0, 0.01))
.learningRate(1e-3)
.learningRateScoreBasedDecayRate(1e-1)
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)

We will also specify the seven network layers, as follows:

.list(7)

The input to the first convolutional layer is the complete image, while the output is six feature maps. The convolutional layer will apply a 5 x 5 filter with a 1 x 1 stride:

.layer(0, new ConvolutionLayer.Builder(
    new int[]{5, 5}, new int[]{1, 1})
  .name("cnn1")
  .nIn(numRows*numColumns)
  .nOut(6)
  .build())


The second layer is a subsampling layer that takes the max over 2 x 2 regions, moving in 2 x 2 steps:

.layer(1, new SubsamplingLayer.Builder(
    SubsamplingLayer.PoolingType.MAX,
    new int[]{2, 2}, new int[]{2, 2})
  .name("maxpool1")
  .build())

The next two layers will repeat the previous two layers:

.layer(2, new ConvolutionLayer.Builder(new int[]{5, 5}, new int[]{1, 1})
  .name("cnn2")
  .nOut(16)
  .biasInit(1)
  .build())
.layer(3, new SubsamplingLayer.Builder(
    SubsamplingLayer.PoolingType.MAX,
    new int[]{2, 2}, new int[]{2, 2})
  .name("maxpool2")
  .build())

Now we will wire the output of the subsampling layer into a dense feedforward network, consisting of 120 neurons, and then through another layer, into 84 neurons, as follows:

.layer(4, new DenseLayer.Builder()
  .name("ffn1")
  .nOut(120)
  .build())
.layer(5, new DenseLayer.Builder()
  .name("ffn2")
  .nOut(84)
  .build())

The final layer connects the 84 neurons with the 10 output neurons:

.layer(6, new OutputLayer.Builder(
    LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
  .name("output")
  .nOut(outputNum)
  .activation("softmax") // radial basis function required
  .build())
.backprop(true)
.pretrain(false)
.cnnInputSize(numRows, numColumns, 1);


To train this structure, we can reuse the code that we developed in the previous two examples. Again, the training might take some time. The network accuracy should be around 98%.

As model training relies heavily on linear algebra, it can be sped up by an order of magnitude by using a Graphics Processing Unit (GPU). As the GPU backend is, at the time of writing, undergoing a rewrite, please check the latest documentation at http://deeplearning4j.org/documentation

As we saw in the different examples, increasingly complex neural networks allow us to extract relevant features automatically, thus completely avoiding traditional image processing. However, the price we pay for this is increased processing time and the need for a large number of learning examples to make this approach effective.

Summary

In this chapter, we discussed how to recognize patterns in images in order to distinguish between different classes, covering the fundamental principles of deep learning and discussing how to implement them with the deeplearning4j library. We started by refreshing the basic neural network structures and discussed how to implement them to solve the handwritten digit recognition problem.

In the next chapter, we'll look further into patterns; however, instead of patterns in images, we'll tackle patterns with temporal dependencies that can be found in sensor data.


Activity Recognition with Mobile Phone Sensors

While the previous chapter focused on pattern recognition in images, this chapter is all about recognizing patterns in sensor data, which, in contrast to images, has temporal dependencies. We will discuss how to recognize granular daily activities such as walking, sitting, and running using mobile phone inertial sensors. The chapter also provides references to related research and emphasizes best practices in the activity recognition community.

The topics covered in this chapter will include the following:

• Introducing activity recognition, covering mobile phone sensors and activity recognition pipeline
• Collecting sensor data from mobile devices
• Discussing activity classification and model evaluation
• Deploying activity recognition model

Introducing activity recognition

Activity recognition is an underpinning step in behavior analysis, addressing healthy lifestyle, fitness tracking, remote assistance, security applications, elderly care, and so on. Activity recognition transforms low-level sensor data from sensors such as accelerometers, gyroscopes, pressure sensors, and GPS location into a higher-level description of behavior primitives. In most cases, these are basic activities, for example, walking, sitting, lying, jumping, and so on, as shown in the following image, or they could be more complex behaviors, such as going to work, preparing breakfast, shopping, and so on:

In this chapter, we will discuss how to add activity recognition functionality to a mobile application. We will first look at what an activity recognition problem looks like, what kind of data we need to collect, what the main challenges are, and how to address them.

Later, we will follow an example to see how to actually implement activity recognition in an Android application, including data collection, data transformation, and building a classifier.

Let's start!

Mobile phone sensors

Let's first review what kinds of mobile phone sensors there are and what they report. Most smart devices are now equipped with several built-in sensors that measure the motion, position, orientation, and conditions of the ambient environment. As sensors provide measurements with high precision, frequency, and accuracy, it is possible to reconstruct complex user motions, gestures, and movements. Sensors are often incorporated into various applications; for example, gyroscope readings are used to steer an object in a game, GPS data is used to locate the user, and accelerometer data is used to infer the activity that the user is performing, for example, cycling, running, or walking.


The next image shows a couple of examples of the kinds of interactions the sensors are able to detect:

Mobile phone sensors can be classified into the following three broad categories:

• Motion sensors measure acceleration and rotational forces along the three perpendicular axes. Examples of sensors in this category include accelerometers, gravity sensors, and gyroscopes.
• Environmental sensors measure a variety of environmental parameters, such as illumination, air temperature, pressure, and humidity. This category includes barometers, photometers, and thermometers.
• Position sensors measure the physical position of a device. This category includes orientation sensors and magnetometers.

More detailed descriptions for different mobile platforms are available at the following links:
• Android sensors framework: http://developer.android.com/guide/topics/sensors/sensors_overview.html
• iOS Core Motion framework: https://developer.apple.com/library/ios/documentation/CoreMotion/Reference/CoreMotion_Reference/
• Windows Phone: https://msdn.microsoft.com/en-us/library/windows/apps/hh202968(v=vs.105).aspx

In this chapter, we will work only with Android's sensors framework.

Activity recognition pipeline

Classifying multidimensional time-series sensor data is inherently more complex than classifying the traditional, nominal data we saw in the previous chapters. First, each observation is temporally connected to the previous and following observations, making it very difficult to apply a straightforward classification of a single set of observations only. Second, the data obtained by sensors at different time points is stochastic, that is, unpredictable due to the influence of sensor noise, environmental disturbances, and many other reasons. Moreover, an activity can comprise various sub-activities executed in different manners, and each person performs the activity a bit differently, which results in high intraclass differences. Finally, all these reasons make an activity recognition model imprecise, resulting in new data often being misclassified. One of the highly desirable properties of an activity recognition classifier is to ensure continuity and consistency in the recognized activity sequence.

To deal with these challenges, activity recognition is applied to a pipeline as shown in the following:

In the first step, we attenuate as much noise as we can, for example, by reducing sensor sampling rate, removing outliers, applying high/low-pass filters, and so on. In the next phase, we construct a feature vector, for instance, we convert sensor data from time domain to frequency domain by applying Discrete Fourier Transform (DFT). DFT is a method that takes a list of samples as an input and returns a list of sinusoid coefficients ordered by their frequencies. They represent a combination of frequencies that are present in the original list of samples.
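As a rough illustration of what the DFT computes, the following is a naive O(n^2) sketch; a practical implementation, such as the FFT.java class used later in this chapter, would use the fast Fourier transform instead:

// naive DFT: returns the magnitude of each frequency component of the samples
static double[] dftMagnitudes(double[] samples) {
  int n = samples.length;
  double[] magnitudes = new double[n];
  for (int k = 0; k < n; k++) {
    double re = 0, im = 0;
    for (int t = 0; t < n; t++) {
      double angle = 2 * Math.PI * k * t / n;
      re += samples[t] * Math.cos(angle);
      im -= samples[t] * Math.sin(angle);
    }
    magnitudes[k] = Math.sqrt(re * re + im * im);
  }
  return magnitudes;
}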

A gentle introduction to the Fourier transform is written by Pete Bevelacqua at http://www.thefouriertransform.com/. If you want to get more technical and theoretical background on the Fourier transform, take a look at the eighth and ninth lectures in the class by Robert Gallager and Lizhong Zheng in the MIT open course: http://theopenacademy.com/content/principles-digital-communication


Next, based on the feature vector and a set of training data, we can build an activity recognition model that assigns an atomic action to each observation. Therefore, for each new sensor reading, the model will output the most probable activity label. However, models make mistakes. Hence, the last phase smooths the transitions between activities by removing transitions that cannot occur in reality; for example, it is not physically feasible for the transition lying-standing-lying to occur in less than half a second, hence such a transition between activities is smoothed as lying-lying-lying.
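One simple way to implement such smoothing (an illustrative sketch, not necessarily the filter we will build later in this chapter) is a majority vote over a short sliding window of predicted labels:

import java.util.HashMap;
import java.util.Map;

// replace each predicted label with the most frequent label in a window around it
static String[] smooth(String[] labels, int windowSize) {
  String[] smoothed = new String[labels.length];
  for (int i = 0; i < labels.length; i++) {
    int from = Math.max(0, i - windowSize / 2);
    int to = Math.min(labels.length - 1, i + windowSize / 2);
    // count label occurrences inside the window
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (int j = from; j <= to; j++) {
      Integer c = counts.get(labels[j]);
      counts.put(labels[j], c == null ? 1 : c + 1);
    }
    // pick the most frequent label
    String best = labels[i];
    int bestCount = 0;
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() > bestCount) {
        best = e.getKey();
        bestCount = e.getValue();
      }
    }
    smoothed[i] = best;
  }
  return smoothed;
}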

The activity recognition model is constructed with a supervised learning approach, which consists of training and classification steps. In the training step, a set of labeled data is provided to train the model. The second step is used to assign a label to the new unseen data by the trained model. The data in both phases must be pre-processed with the same set of tools, such as filtering and feature-vector computation.

The post-processing phase, that is, spurious activity removal, can also be a model itself and, hence, also requires a learning step. In this case, the pre-processing step also includes activity recognition, which makes such arrangement of classifiers a meta-learning problem. To avoid overfitting, it is important that the dataset used for training the post-processing phase is not the same as that used for training the activity recognition model.

We will roughly follow a lecture on smartphone programming by professor Andrew T. Campbell from Dartmouth University and leverage data collection mobile app that they developed in the class (Campbell, 2011).

The plan

The plan consists of a training phase and a deployment phase. The training phase, shown in the following image, boils down to the following steps:

1. Install Android Studio and import MyRunsDataCollector.zip.
2. Load the application onto your Android phone.
3. Collect your data, for example, standing, walking, and running, and transform the data into a feature vector comprising FFT transforms. Don't panic, low-level signal processing functions such as FFT will not be written from scratch, as we will use existing code to do that. The data will be saved on your phone in a file called features.arff.
4. Create and evaluate an activity recognition classifier using the exported data and implement a filter for spurious activity transition removal.
5. Plug the classifier back into the mobile application.


If you don't have an Android phone, or if you want to skip all the steps related to the mobile application, just grab the already-collected dataset located in data/features.arff and jump directly to the Building a classifier section.

Collecting data from a mobile phone

This section describes the first three steps of the plan. If you want to work directly with the data, you can skip this section and continue with the following Building a classifier section. There are many open source mobile apps for sensor data collection, including the app by Prof. Campbell that we will use in this chapter. The application implements the essentials needed to collect sensor data for different activity classes, for example, standing, walking, running, and others.

Let's start by preparing the Android development environment. If you have already installed it, jump to the Loading the data collector section.

Installing Android Studio

Android Studio is a development environment for the Android platform. We will quickly review the installation steps and the basic configuration required to start the app on a mobile phone. For a more detailed introduction to Android development, I would recommend an introductory book, Android 5 Programming by Example by Kyle Mew.

Grab the latest Android Studio for developers at http://developer.android.com/sdk/installing/index.html?pkg=studio and follow the installation instructions. The installation will take over 10 minutes, occupying approximately 0.5 GB of space:



Loading the data collector

First, grab the source code of MyRunsDataCollector from http://www.cs.dartmouth.edu/~campbell/cs65/code/myrunsdatacollector.zip. Once Android Studio is installed, choose to open an existing Android Studio project, as shown in the following image, and select the MyRunsDataCollector folder. This will import the project into Android Studio:

After the project import is completed, you should be able to see the project file structure, as shown in the following image. The collector consists of CollectorActivity.java, Globals.java, and SensorsService.java. The project also includes FFT.java, which implements the low-level signal processing:


The main myrunscollector package contains the following classes:

• Globals.java: This defines global constants such as activity labels and IDs, data filenames, and so on
• CollectorActivity.java: This implements user interface actions, that is, what happens when a specific button is pressed
• SensorsService.java: This implements a service that collects data, calculates the feature vector as we will discuss in the following sections, and stores the data into a file on the phone

The next question that we will address is how to design features.

Feature extraction

Finding an appropriate representation of the person's activities is probably the most challenging part of activity recognition. The behavior needs to be represented with simple and general features so that the model using these features will also be general and work well on behaviors different from those in the learning set.

In fact, it is not difficult to design features specific to the captured observations in a training set; such features would work well on them. However, as the training set captures only a part of the whole range of human behavior, overly specific features would likely fail on general behavior:


Let's see how it is implemented in MyRunsDataCollector. When the application is started, a method called onSensorChanged() gets a triple of accelerometer sensor readings (x, y, and z) with a specific time stamp and calculates the magnitude from the sensor readings. The method buffers up to 64 consecutive magnitudes before computing the FFT coefficients (Campbell, 2015):

"As shown in the upper left of the diagram, FFT transforms a time series of amplitude over time to magnitude (some representation of amplitude) across frequency; the example shows some oscillating system where the dominant frequency is between 4-8 cycles/second called Hertz (H) – imagine a ball attached to an elastic band that this stretched and oscillates for a short period of time, or your gait while walking, running -- one could look at these systems in the time and frequency domains. The x,y,z accelerometer readings and the magnitude are time domain variables. We transform these time domain data into the frequency domain because the can represent the distribution in a nice compact manner that the classifier will use to build a decision tree model. For example, the rate of the amplitude transposed to the frequency domain may look something like the figure bottom plot -- the top plot is time domain and the bottom plot a transformation of the time to the frequency domain.

The training phase also stores the maximum (MAX) magnitude of the (m0..m63) and the user supplied label (e.g., walking) using the collector. The individual features are computed as magnitudes (f0..f63), the MAX magnitude and the class label."
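The magnitude computation and buffering described earlier are easy to sketch in plain Java. The following is a minimal illustration, not the collector's actual code; the FFT helper call and the window size of 64 are assumptions taken from the description above:

// Minimal sketch of the magnitude buffering described above (not the collector's code)
public class MagnitudeBuffer {

  private static final int WINDOW = 64;           // magnitudes per feature vector (assumption)
  private final double[] magnitudes = new double[WINDOW];
  private int count = 0;

  // Called for each accelerometer reading (x, y, z)
  public void onSensorReading(double x, double y, double z) {
    // Magnitude of the acceleration vector
    magnitudes[count++] = Math.sqrt(x * x + y * y + z * z);

    if (count == WINDOW) {
      // Hand the 64 magnitudes to an FFT routine, such as the FFT.java class mentioned
      // earlier, to obtain the frequency-domain features (hypothetical call):
      // double[] fftCoefficients = FFT.transform(magnitudes);
      count = 0; // start filling the next window
    }
  }
}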

Now let's move on to the actual data collection.

Collecting training data

We can now use the collector to collect training data for activity recognition. The collector supports three activities by default: standing, walking, and running, as shown in the application screenshot in the following figure.

You can select an activity, that is, the target class value, and start recording the data by clicking the START COLLECTING button. Make sure that each activity is recorded for at least three minutes; for example, if the Walking activity is selected, press START COLLECTING and walk around for at least three minutes. At the end of the activity, press STOP COLLECTING. Repeat this for each of the activities.


You could also collect different scenarios involving these activities, for example, walking in the kitchen, walking outside, walking in a line, and so on. By doing so, you will have more data for each activity class and a better classifier. Makes sense, right? The more data, the less confused the classifier will be. If you only have a little data, overfitting will occur and the classifier will confuse classes—standing with walking, walking with running. However, the more data, the less it gets confused. You might collect less than three minutes per class when you are debugging, but for your final polished product, the more data, the better it is. Multiple recording instances will simply be accumulated in the same file.

Note that the Delete button removes the data stored in a file on the phone. If you want to start over again, hit Delete before collecting; otherwise, the newly collected data will be appended at the end of the file:


The collector implements the diagram discussed in the previous sections: it collects accelerometer samples, computes the magnitudes, uses the FFT.java class to compute the coefficients, and produces the feature vectors. The data is then stored in a Weka-formatted features.arff file. The number of feature vectors depends on how much data you collect: the longer you collect, the more feature vectors are accumulated.

Once you stop collecting training data with the collector tool, you need to grab the data to carry on with the workflow. You can use the file explorer in Android Device Monitor to upload the features.arff file from the phone and store it on your computer. You can access Android Device Monitor by clicking on the Android robot icon, as shown in the following image:

By selecting your device on the left, your phone storage content will be shown on the right-hand side. Navigate through mnt/shell/emulated/Android/data/edu.dartmouth.cs.myrunscollector/files/features.arff:


To upload this file to your computer, you need to select the file (it is highlighted) and click Upload. Now we are ready to build a classifier.

Building a classifier

Once sensor samples are represented as feature vectors having the class assigned, it is possible to apply standard techniques for supervised classification, including feature selection, feature discretization, model learning, k-fold cross validation, and so on. The chapter will not delve into the details of the machine learning algorithms. Any algorithm that supports numerical features can be applied, including SVMs, random forest, AdaBoost, decision trees, neural networks, multi-layer perceptrons, and others.

Therefore, let's start with a basic one, decision trees: load the dataset, set the class attribute, build a decision tree model, and output the model:

String databasePath = "/Users/bostjan/Dropbox/ML Java Book/book/datasets/chap9/features.arff";

// Load the data in ARFF format
Instances data = new Instances(new BufferedReader(new FileReader(databasePath)));

// Set the last attribute as the class attribute
data.setClassIndex(data.numAttributes() - 1);

// Build a basic decision tree model
String[] options = new String[]{};
J48 model = new J48();
model.setOptions(options);
model.buildClassifier(data);

// Output decision tree
System.out.println("Decision tree model:\n" + model);

The algorithm first outputs the model, as follows:

Decision tree model:
J48 pruned tree
------------------

max <= 10.353474
|   fft_coef_0000 <= 38.193106: standing (46.0)
|   fft_coef_0000 > 38.193106
|   |   fft_coef_0012 <= 1.817792: walking (77.0/1.0)
|   |   fft_coef_0012 > 1.817792
|   |   |   max <= 4.573082: running (4.0/1.0)
|   |   |   max > 4.573082: walking (24.0/2.0)
max > 10.353474: running (93.0)

Number of Leaves : 5

Size of the tree : 9

The tree is quite simplistic and seemingly accurate as majority class distributions in the terminal nodes are quite high. Let's run a basic classifier evaluation to validate the results:

// Check accuracy of model using 10-fold cross-validation
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(model, data, 10, new Random(1), new String[]{});
System.out.println("Model performance:\n" + eval.toSummaryString());

This outputs the following model performance:

Correctly Classified Instances         226       92.623  %
Incorrectly Classified Instances        18        7.377  %
Kappa statistic                          0.8839
Mean absolute error                      0.0421
Root mean squared error                  0.1897
Relative absolute error                 13.1828 %
Root relative squared error             47.519  %
Coverage of cases (0.95 level)          93.0328 %
Mean rel. region size (0.95 level)      27.8689 %
Total Number of Instances              244

The classification accuracy scores very high, 92.62%, which is an amazing result. One important reason why the result is so good lies in our evaluation design. What I mean here is the following: sequential instances are very similar to each other, so if we split them randomly during 10-fold cross-validation, there is a high chance that we use almost identical instances for both training and testing; hence, straightforward k-fold cross-validation produces an optimistic estimate of model performance.


A better approach is to use folds that correspond to different sets of measurements or even different people. For example, we can use the application to collect learning data of five people. Then, it makes sense to run k-person cross validation, where the model is trained on four people and tested on the fifth person. The procedure is repeated for each person and the results are averaged. This will give us a much more realistic estimate of the model performance.
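A minimal sketch of such a leave-one-person-out evaluation with Weka is shown below. It assumes the data has already been split into one Instances object per person (for example, by keeping each person's recordings in a separate ARFF file); the method and variable names are illustrative only:

import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;

// Sketch: leave-one-person-out cross-validation, assuming one Instances object per person
// (all with the same attributes and the class index already set).
public static double leaveOnePersonOut(Instances[] personData) throws Exception {
  double correct = 0, total = 0;
  for (int test = 0; test < personData.length; test++) {
    // Train on everyone except the held-out person
    Instances train = new Instances(personData[0], 0); // empty copy with the same header
    for (int p = 0; p < personData.length; p++) {
      if (p == test) {
        continue;
      }
      for (int i = 0; i < personData[p].numInstances(); i++) {
        train.add(personData[p].instance(i));
      }
    }
    J48 model = new J48();
    model.buildClassifier(train);

    // Test on the held-out person
    for (int i = 0; i < personData[test].numInstances(); i++) {
      Instance instance = personData[test].instance(i);
      if (model.classifyInstance(instance) == instance.classValue()) {
        correct++;
      }
      total++;
    }
  }
  return correct / total; // accuracy averaged over all held-out people
}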

Leaving the evaluation comments aside, let's look at how to deal with classifier errors.

Reducing spurious transitions

At the end of the activity recognition pipeline, we want to make sure that the classifications are not too volatile, that is, we don't want activities to change every millisecond. A basic approach is to design a filter that ignores quick changes in the activity sequence.

We build a filter that remembers the last window activities and returns the most frequent one. If there are multiple activities with the same score, it returns the most recent one.

First, we create a new SpuriousActivityRemoval class that will hold a list of activities and the window parameter:

class SpuriousActivityRemoval {

  List<Object> last;
  int window;

  public SpuriousActivityRemoval(int window) {
    this.last = new ArrayList<Object>();
    this.window = window;
  }

Next, we create the Object filter(Object) method that will take an activity and return a filtered activity. The method first checks whether we have enough observations. If not, it simply stores the observation and returns the same value, as shown in the following code:

  public Object filter(Object obj) {
    if (last.size() < window) {
      last.add(obj);
      return obj;
    }


If we already collected window observations, we simply return the most frequent observation, remove the oldest observation, and insert the new observation:

    Object o = getMostFrequentElement(last);
    last.add(obj);
    last.remove(0);
    return o;
  }

What is missing here is a function that returns the most frequent element from a list of objects. We implement this with a hash map, as follows:

  private Object getMostFrequentElement(List<Object> list) {

    HashMap<String, Integer> objectCounts = new HashMap<String, Integer>();
    Integer frequentCount = 0;
    Object frequentObject = null;

Now, we iterate over all the elements in the list, insert each unique element into a hash map, or update its counter if it is already in the hash map. At the end of the loop, we store the most frequent element that we found so far, as follows:

    for (Object obj : list) {
      String key = obj.toString();
      Integer count = objectCounts.get(key);
      if (count == null) {
        count = 0;
      }
      objectCounts.put(key, ++count);

      if (count >= frequentCount) {
        frequentCount = count;
        frequentObject = obj;
      }
    }

    return frequentObject;
  }

}

Let's run a simple example:

String[] activities = new String[]{"Walk", "Walk", "Walk", "Run", "Walk", "Run", "Run", "Sit", "Sit", "Sit"};
SpuriousActivityRemoval dlpFilter = new SpuriousActivityRemoval(3);


for (String str : activities) {
  System.out.println(str + " -> " + dlpFilter.filter(str));
}

The example outputs the following activities:

Walk -> Walk
Walk -> Walk
Walk -> Walk
Run -> Walk
Walk -> Walk
Run -> Walk
Run -> Run
Sit -> Run
Sit -> Run
Sit -> Sit

The result is a continuous sequence of activities, that is, we do not have quick changes. This adds some delay, but unless this is absolutely critical for the application, it is acceptable.

Activity recognition may be enhanced by appending n previous activities as recognized by the classifier to the feature vector. The danger of appending previous activities is that the machine learning algorithm may learn that the current activity is always the same as the previous one, as this will often be the case. The problem may be solved by having two classifiers, A and B: the classifier B's attribute vector contains n previous activities as recognized by the classifier A. The classifier A's attribute vector does not contain any previous activities. This way, even if B gives a lot of weight to the previous activities, the previous activities as recognized by A will change as A is not burdened with B's inertia.
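A minimal sketch of this two-classifier arrangement is shown below. It is not taken from the book's code; the Recognizer interface, the numeric encoding of labels, and the history window are illustrative assumptions:

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical interface standing in for any trained classifier (for example, a Weka model wrapper)
interface Recognizer {
  String recognize(double[] features);
}

class ChainedActivityRecognizer {

  private final Recognizer classifierA; // sees sensor features only
  private final Recognizer classifierB; // sees sensor features plus n previous activities from A
  private final Deque<String> previous = new ArrayDeque<String>();
  private final int n;

  ChainedActivityRecognizer(Recognizer a, Recognizer b, int n) {
    this.classifierA = a;
    this.classifierB = b;
    this.n = n;
  }

  String recognize(double[] sensorFeatures) {
    // Classifier A uses only the sensor features
    String labelA = classifierA.recognize(sensorFeatures);

    // Classifier B additionally receives the last n activities recognized by A
    // (slots stay at zero until the history fills up)
    double[] extended = new double[sensorFeatures.length + n];
    System.arraycopy(sensorFeatures, 0, extended, 0, sensorFeatures.length);
    int i = sensorFeatures.length;
    for (String prev : previous) {
      extended[i++] = encode(prev);
    }
    String labelB = classifierB.recognize(extended);

    // The history is updated with A's output, so B's inertia does not feed back into it
    previous.addLast(labelA);
    if (previous.size() > n) {
      previous.removeFirst();
    }
    return labelB;
  }

  private double encode(String activity) {
    // Placeholder encoding for illustration only; a real implementation would map
    // each activity label to a fixed nominal index
    return activity.hashCode();
  }
}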

All that remains to do is to embed the classifier and filter into our mobile application.

Plugging the classifier into a mobile app

There are two ways to incorporate a classifier into a mobile application. The first one involves exporting a model in the Weka format, using the Weka library as a dependency in our mobile application, loading the model, and so on. The procedure is identical to the example we saw in Chapter 3, Basic Algorithms – Classification, Regression, and Clustering. The second approach is more lightweight; we export the model as source code, for example, we create a class implementing the decision tree classifier. Then we can simply copy and paste the source code into our mobile app, without even importing any Weka dependencies.


Fortunately, some Weka models can be easily exported to source code by the toSource(String) function:

// Output source code implementing the decision tree
System.out.println("Source code:\n" + model.toSource("ActivityRecognitionEngine"));

This outputs an ActivityRecognitionEngine class that corresponds to our model. Now, let's take a closer look at the outputted code:

class ActivityRecognitionEngine {

public static double classify(Object[] i) throws Exception {

    double p = Double.NaN;
    p = ActivityRecognitionEngine.N17a7cec20(i);
    return p;
  }

  static double N17a7cec20(Object[] i) {
    double p = Double.NaN;
    if (i[64] == null) {
      p = 1;
    } else if (((Double) i[64]).doubleValue() <= 10.353474) {
      p = ActivityRecognitionEngine.N65b3120a1(i);
    } else if (((Double) i[64]).doubleValue() > 10.353474) {
      p = 2;
    }
    return p;
  }
  ...

The outputted ActivityRecognitionEngine class implements the decision tree that we discussed earlier. The machine-generated function names, such as N17a7cec20(Object []), correspond to decision tree nodes. The classifier can be called by the classify(Object[]) method, where we should pass a feature vector obtained by the same procedure as we discussed in the previous sections. As usual, it returns a double, indicating a class label index.
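A hedged example of how such a call might look in the mobile app follows. The vector layout (64 FFT coefficients followed by the maximum magnitude at index 64) mirrors the feature description given earlier, and the label order is an assumption that should be checked against features.arff:

// Sketch: calling the generated classifier with a single feature vector
static String recognize(double[] fftCoefficients, double maxMagnitude) throws Exception {
  // Assumed layout: f0..f63 (FFT coefficients) followed by the MAX magnitude at index 64
  Object[] features = new Object[65];
  for (int i = 0; i < 64; i++) {
    features[i] = fftCoefficients[i];
  }
  features[64] = maxMagnitude;

  double classIndex = ActivityRecognitionEngine.classify(features);

  // Map the index back to a label; the order is assumed to match the labels in features.arff
  String[] labels = {"standing", "walking", "running"};
  return labels[(int) classIndex];
}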

Summary

In this chapter, we discussed how to implement an activity recognition model for mobile applications. We looked into the complete process, including data collection, feature extraction, model building, evaluation, and model deployment.

In the next chapter, we will move on to another Java library targeted at text analysis—Mallet.

Text Mining with Mallet – Topic Modeling and Spam Detection

In this chapter, we will first discuss what text mining is, what kind of analysis it is able to offer, and why you might want to use it in your application. We will then discuss how to work with Mallet, a Java library for natural language processing, covering data import and text pre-processing. Afterwards, we will look into two text mining applications: topic modeling, where we will discuss how text mining can be used to identify topics found in text documents without reading them individually; and spam detection, where we will discuss how to automatically classify text documents into categories.

This chapter will cover the following topics:

• Introducing text mining
• Installing and working with Mallet
• Topic modeling
• Spam detection

Introducing text mining

Text mining, or text analytics, refers to the process of automatically extracting high-quality information from text documents, most often written in natural language, where high-quality information is considered to be relevant, novel, and interesting.


While a typical text-analytics application is to scan a set of documents to generate a search index, text mining can be used in many other applications, including text categorization into specific domains; text clustering to automatically organize a set of documents; sentiment analysis to identify and extract subjective information in documents; concept/entity extraction that is capable of identifying people, places, organizations, and other entities from documents; document summarization to automatically provide the most important points in the original document; and learning relations between named entities.

The process based on statistical pattern mining usually involves the following steps:

1. Information retrieval and extraction.
2. Transforming unstructured text data into structured data; for example, parsing, removing noisy words, lexical analysis, calculating word frequencies, deriving linguistic features, and so on.
3. Discovery of patterns from structured data and tagging/annotation.
4. Evaluation and interpretation of the results.

Later in this chapter, we will look at two application areas: topic modeling and text categorization. Let's examine what they bring to the table.

Topic modeling

Topic modeling is an unsupervised technique and might be useful if you need to analyze a large archive of text documents and wish to understand what the archive contains, without necessarily reading every single document by yourself. A text document can be a blog post, e-mail, tweet, document, book chapter, diary entry, and so on. Topic modeling looks for patterns in a corpus of text; more precisely, it identifies topics as lists of words that appear in a statistically meaningful way. The most well-known algorithm is Latent Dirichlet Allocation (Blei et al, 2003), which assumes that an author composed a piece of text by selecting words from possible baskets of words, where each basket corresponds to a topic. Using this assumption, it becomes possible to mathematically decompose the text into the most likely baskets from which the words first came. The algorithm then iterates over this process until it converges to the most likely distribution of words into baskets, which we call topics.

For example, if we use topic modeling on a series of news articles, the algorithm would return a list of topics and keywords that most likely comprise of these topics. Using the example of news articles, the list might look similar to the following:

• Winner, goal, football, score, first place
• Company, stocks, bank, credit, business
• Election, opponent, president, debate, upcoming


By looking at the keywords, we can recognize that the news articles were concerned with sports, business, upcoming election, and so on. Later in this chapter, we will learn how to implement topic modeling using the news article example.

Text classification

In text classification, or text categorization, the goal is to assign a text document according to its content to one or more classes or categories, which tend to be a more general subject area such as vehicles or pets. Such general classes are referred to as topics, and the classification task is then called text classification, text categorization, topic classification, or topic spotting. While documents can be categorized according to other attributes such as document type, author, printing year, and so on, the focus in this chapter will be on the document content only. Examples of text classification include the following:

• Spam detection in e-mail messages, user comments, webpages, and so on
• Detection of sexually-explicit content
• Sentiment detection, which automatically classifies a product/service review as positive or negative
• E-mail sorting according to e-mail content
• Topic-specific search, where search engines restrict searches to a particular topic or genre, thus providing more accurate results

These examples show how important text classification is in information retrieval systems, hence most modern information retrieval systems use some kind of text classifier. The classification task that we will use as an example in this book is text classification for detecting e-mail spam.

We continue this chapter with an introduction to Mallet, a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. We will then cover two text-analytics applications, namely, topic modeling and spam detection as text classification.

Installing Mallet

Mallet is available for download at the UMass Amherst website at http://mallet.cs.umass.edu/download.php. Navigate to the Download section, as shown in the following image, and select the latest stable release (2.0.8, at the time of writing this book):


Download the ZIP file and extract the content. In the extracted directory, you should find a folder named dist with two JAR files: mallet.jar and mallet-deps.jar. The first one contains all the packaged Mallet classes, while the second one packs all the dependencies. Include both JARs in your project as referenced libraries, as shown in the following image:


If you are using Eclipse, right click on Project, select Properties, and pick Java Build Path. Select the Libraries tab and click Add External JARs. Now, select the two JARs and confirm, as shown in the following screenshot:

Now we are ready to start using Mallet.

Working with text data

One of the main challenges in text mining is transforming unstructured written natural language into structured attribute-based instances. The process involves many steps, as shown in the following image:


First, we extract some text from the Internet, existing documents, or databases. At the end of the first step, the text could still be presented in XML or some other proprietary format. The next step is, therefore, to extract the actual text only and segment it into parts of the document, for example, title, headline, abstract, body, and so on. The third step involves normalizing the text encoding to ensure the characters are presented the same way; for example, documents encoded in formats such as ASCII, ISO 8859-1, and Windows-1250 are transformed into Unicode encoding. Next, tokenization splits the document into particular words, while the following step removes frequent words that usually have low predictive power, for example, the, a, I, we, and so on.

The part-of-speech (POS) tagging and lemmatization step could be included to transform each token (that is, word) to its basic form, which is known as lemma, by removing word endings and modifiers. For example, running becomes run, better becomes good, and so on. A simplified approach is stemming, which operates on a single word without any context of how the particular word is used, and therefore, cannot distinguish between words having different meaning, depending on the part of speech, for example, axes as plural of axe as well as axis.

The last step transforms tokens into a feature space. Most often feature space is a bag-of-words (BoW) presentation. In this presentation, a set of all words appearing in the dataset is created, that is, a bag of words. Each document is then presented as a vector that counts how many times a particular word appears in the document.

Consider the following example with two sentences:

• Jacob likes table tennis. Emma likes table tennis too.
• Jacob also likes basketball.

The bag of words in this case consists of {Jacob, likes, table, tennis, Emma, too, also, basketball}, which has eight distinct words. The two sentences could now be presented as vectors using the indexes of the list, indicating how many times a word at a particular index appears in the document, as follows:

• [1, 2, 2, 2, 1, 1, 0, 0]
• [1, 1, 0, 0, 0, 0, 1, 1]


Such vectors finally become instances for further learning.
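To make the counting step concrete, here is a minimal, self-contained sketch of how such count vectors could be produced in plain Java; it is for illustration only and is not how Mallet represents instances internally:

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class BagOfWords {

  public static void main(String[] args) {
    String[] documents = {
        "Jacob likes table tennis. Emma likes table tennis too.",
        "Jacob also likes basketball."
    };

    // Build the vocabulary: the set of all distinct words across the documents
    Set<String> vocabulary = new LinkedHashSet<String>();
    for (String doc : documents) {
      for (String token : tokenize(doc)) {
        vocabulary.add(token);
      }
    }
    List<String> words = new ArrayList<String>(vocabulary);

    // Represent each document as a vector of word counts
    for (String doc : documents) {
      int[] vector = new int[words.size()];
      for (String token : tokenize(doc)) {
        vector[words.indexOf(token)]++;
      }
      System.out.println(java.util.Arrays.toString(vector));
    }
  }

  private static String[] tokenize(String text) {
    // Lowercase and split on anything that is not a letter
    return text.toLowerCase().split("[^\\p{L}]+");
  }
}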

Another very powerful presentation based on the BoW model is word2vec. Word2vec was introduced in 2013 by a team of researchers led by Tomas Mikolov at Google. Word2vec is a neural network that learns distributed representations for words. An interesting property of this presentation is that words appear in clusters, such that some word relationships, such as analogies, can be reproduced using vector math. A famous example shows that king - man + woman returns queen. Further details and implementation are available at the following link: https://code.google.com/archive/p/word2vec/

Importing data

In this chapter, we will not look into how to scrape a set of documents from a website or extract them from a database. Instead, we will assume that we have already collected them as a set of documents and stored them in the .txt file format. Now let's look at two options for loading them. The first option addresses the situation where each document is stored in its own .txt file. The second option addresses the situation where all the documents are stored in a single file, one per line.

Importing from directory

Mallet supports reading from a directory with the cc.mallet.pipe.iterator.FileIterator class. The file iterator is constructed with the following three parameters:

• A list of File[] directories with text files
• A file filter that specifies which files to select within a directory
• A pattern that is applied to a filename to produce a class label

Consider the data structured into folders as shown in the following image. We have documents organized into five topics by folders (tech, entertainment, politics, sport, and business). Each folder contains documents on the particular topic, as shown in the following image:


In this case, we initialize iterator as follows:

FileIterator iterator = new FileIterator(
    new File[]{new File("path-to-my-dataset")},
    new TxtFilter(),
    FileIterator.LAST_DIRECTORY);

The first parameter specifies the path to our root folder, the second parameter limits the iterator to the .txt files only, while the last parameter asks the method to use the last directory name in the path as class label.

Importing from file

Another option to load the documents is through cc.mallet.pipe.iterator.CsvIterator(Reader, Pattern, int, int, int), which assumes all the documents are in a single file and returns one instance per line, extracted by a regular expression. The class is initialized with the following components:

• Reader: This is the object that specifies how to read from a file
• Pattern: This is a regular expression, extracting three groups: data, target label, and document name
• int, int, int: These are the indexes of the data, target, and name groups as they appear in the regular expression


Consider a text document in the following format, specifying document name, category and content:

AP881218 local-news A 16-year-old student at a private Baptist...
AP880224 business The Bechtel Group Inc. offered in 1985 to...
AP881017 local-news A gunman took a 74-year-old woman hostage...
AP900117 entertainment Cupid has a new message for lovers this...
AP880405 politics The Reagan administration is weighing w...

To parse a line into three groups, we can use the following regular expression:

^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$

There are three groups that appear in parentheses, (), where the third group contains the data, the second group contains the target class, and the first group contains the document ID. The iterator is initialized as follows:

CsvIterator iterator = new CsvIterator(
    fileReader,
    Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
    3, 2, 1);

Here the regular expression extracts the three groups separated by an empty space and their order is 3, 2, 1.

Now let's move on to the data pre-processing pipeline.

Pre-processing text data

Once we have initialized an iterator that will go through the data, we need to pass the data through a sequence of transformations, as described at the beginning of this section. Mallet supports this process through a pipeline and a wide variety of steps that can be included in a pipeline, which are collected in the cc.mallet.pipe package. Some examples are as follows:

• Input2CharSequence: This is a pipe that can read from various kinds of text sources (either URI, File, or Reader) into CharSequence
• CharSequenceRemoveHTML: This pipe removes HTML from CharSequence
• MakeAmpersandXMLFriendly: This converts & to &amp; in tokens of a token sequence
• TokenSequenceLowercase: This converts the text in each token in the token sequence in the data field to lower case


• TokenSequence2FeatureSequence: This converts the token sequence in the data field of each instance to a feature sequence
• TokenSequenceNGrams: This converts the token sequence in the data field to a token sequence of ngrams, that is, combination of two or more words

The full list of processing steps is available in the following Mallet documentation: http://mallet.cs.umass.edu/api/index.html?cc/mallet/pipe/iterator/package-tree.html

Now we are ready to build a class that will import our data.

First, let's build a pipeline, where each processing step is denoted as a pipe in Mallet. The pipes can be wired together in a serial fashion by collecting them in an ArrayList of Pipe objects:

ArrayList<Pipe> pipeList = new ArrayList<Pipe>();

Begin by reading data from a file object and converting all the characters into lower case:

pipeList.add(new Input2CharSequence("UTF-8"));
pipeList.add(new CharSequenceLowercase());

Next, tokenize raw strings with a regular expression. The following pattern includes Unicode letters and numbers and the underscore character:

Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");

pipeList.add(new CharSequence2TokenSequence(tokenPattern));

Remove stop words, that is, frequent words with no predictive power, using a standard English stop list. Two additional parameters indicate whether stop word removal should be case-sensitive and mark deletions instead of just deleting the words. We'll set both of them to false:

pipeList.add(new TokenSequenceRemoveStopwords(false, false));

Instead of storing the actual words, we can convert them into integers, indicating a word index in the bag of words:

pipeList.add(new TokenSequence2FeatureSequence());


We'll do the same for the class label; instead of the label string, we'll use an integer, indicating the position of the label in our bag of words:

pipeList.add(new Target2Label());

We could also print the features and the labels by invoking the PrintInputAndTarget pipe:

pipeList.add(new PrintInputAndTarget());

Finally, we store the list of pipes in a SerialPipes class that will convert an instance through a sequence of pipes:

SerialPipes pipeline = new SerialPipes(pipeList);

Now let's take a look at how to apply this in a text mining application!

Topic modeling for BBC news

As discussed earlier, the goal of topic modeling is to identify patterns in a text corpus that correspond to document topics. In this example, we will use a dataset originating from BBC news. This dataset is one of the standard benchmarks in machine learning research, and is available for non-commercial and research purposes.

The goal is to build a classifier that is able to assign a topic to an uncategorized document.

BBC dataset

Greene and Cunningham (2006) collected the BBC dataset to study a particular document-clustering challenge using support vector machines. The dataset consists of 2,225 documents from the BBC News website from 2004 to 2005, corresponding to the stories collected from five topical areas: business, entertainment, politics, sport, and tech. The dataset can be grabbed from the following website: http://mlg.ucd.ie/datasets/bbc.html


Download the raw text files under the Dataset: BBC section. You will also notice that the website contains an already processed dataset, but for this example, we want to process the dataset ourselves. The ZIP contains five folders, one per topic. The actual documents are placed in the corresponding topic folder, as shown in the following screenshot:

Now, let's build a topic classifier.

Modeling

Start by importing the dataset and processing the text:

import cc.mallet.types.*;
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.*;
import cc.mallet.topics.*;

import java.util.*;
import java.util.regex.*;
import java.io.*;


public class TopicModeling {

public static void main(String[] args) throws Exception {

String dataFolderPath = args[0];
String stopListFilePath = args[1];

Create a default pipeline as previously described:

ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
pipeList.add(new Input2CharSequence("UTF-8"));
Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
pipeList.add(new CharSequence2TokenSequence(tokenPattern));
pipeList.add(new TokenSequenceLowercase());
pipeList.add(new TokenSequenceRemoveStopwords(new File(stopListFilePath), "utf-8", false, false, false));
pipeList.add(new TokenSequence2FeatureSequence());
pipeList.add(new Target2Label());
SerialPipes pipeline = new SerialPipes(pipeList);

Next, initialize folderIterator:

FileIterator folderIterator = new FileIterator(
    new File[] {new File(dataFolderPath)},
    new TxtFilter(),
    FileIterator.LAST_DIRECTORY);

Construct a new instance list with the pipeline that we want to use to process the text:

InstanceList instances = new InstanceList(pipeline);

Finally, process each instance provided by the iterator:

instances.addThruPipe(folderIterator);

Now let's create a model with five topics using the cc.mallet.topics.ParallelTopicModel class that implements a simple threaded Latent Dirichlet Allocation (LDA) model. LDA is a common method for topic modeling that uses a Dirichlet distribution to estimate the probability that a selected topic generates a particular document. We will not dive deep into the details in this chapter; the reader is referred to the original paper by D. Blei et al. (2003). Note that there is another classification algorithm in machine learning with the same acronym that refers to Linear Discriminant Analysis (LDA). Besides the common acronym, it has nothing in common with the LDA model.


The class is instantiated with parameters alpha and beta, which can be broadly interpreted, as follows:

• A high alpha value means that each document is likely to contain a mixture of most of the topics, and not any single topic specifically. A low alpha value puts less of such constraints on documents, and this means that it is more likely that a document may contain a mixture of just a few, or even only one, of the topics.
• A high beta value means that each topic is likely to contain a mixture of most of the words, and not any word specifically; while a low value means that a topic may contain a mixture of just a few of the words.

In our case, we initially keep both parameters low (alpha_t = 0.01, beta_w = 0.01) as we assume topics in our dataset are not mixed much and there are many words for each of the topics:

int numTopics = 5;
ParallelTopicModel model = new ParallelTopicModel(numTopics, 0.01, 0.01);

Next, add instances to the model, and as we are using a parallel implementation, specify the number of threads that will run in parallel, as follows:

model.addInstances(instances);
model.setNumThreads(4);

Run the model for a selected number of iterations. Each iteration is used for better estimation of internal LDA parameters. For testing, we can use a small number of iterations, for example, 50; while in real applications, use 1000 or 2000 iterations. Finally, call the void estimate() method that will actually build an LDA model:

model.setNumIterations(1000);
model.estimate();

The model outputs the following result:

0  0,06654  game england year time win world 6
1  0,0863   year 1 company market growth economy firm
2  0,05981  people technology mobile mr games users music
3  0,05744  film year music show awards award won
4  0,11395  mr government people labour election party blair

[beta: 0,11328] <1000> LL/token: -8,63377

Total time: 45 seconds


LL/token indicates the model's log-likelihood, divided by the total number of tokens, indicating how likely the data is given the model. Increasing values mean the model is improving.

The output also shows the top words describing each topic. The words correspond to initial topics really well:

• Topic 0: game, England, year, time, win, world, 6 → sport
• Topic 1: year, 1, company, market, growth, economy, firm → finance
• Topic 2: people, technology, mobile, mr, games, users, music → tech
• Topic 3: film, year, music, show, awards, award, won → entertainment
• Topic 4: mr, government, people, labour, election, party, blair → politics

There are still some words that don't make much sense, for instance, mr, 1, and 6. We could include them in the stop word list. Also, some words appear twice, for example, award and awards. This happened because we didn't apply any stemmer or lemmatization pipe.

In the next section, we'll take a look at how to check whether the model is any good.

Evaluating a model

As statistical topic modeling has an unsupervised nature, model selection is difficult. For some applications, there may be some extrinsic tasks at hand, such as information retrieval or document classification, for which performance can be evaluated. However, in general, we want to estimate the model's ability to generalize topics regardless of the task.

Wallach et al. (2009) introduced an approach that measures the quality of a model by computing the log probability of held-out documents under the model. Likelihood of unseen documents can be used to compare models—higher likelihood implies a better model.

First, let's split the documents into training and testing set (that is, held-out documents), where we use 90% for training and 10% for testing:

// Split dataset
InstanceList[] instanceSplit = instances.split(new Randoms(), new double[] {0.9, 0.1, 0.0});

Now, let's rebuild our model using only 90% of our documents:

// Use the first 90% for training
model.addInstances(instanceSplit[0]);


model.setNumThreads(4);
model.setNumIterations(50);
model.estimate();

Next, initialize an estimator that implements Wallach's log probability of held-out documents, MarginalProbEstimator:

// Get estimator
MarginalProbEstimator estimator = model.getProbEstimator();

An intuitive description of LDA is summarized by Annalyn Ng in her blog at https://annalyzin.wordpress.com/2015/06/21/laymans-explanation-of-topic-modeling-with-lda-2/. To get deeper insight into the LDA algorithm, its components, and its workings, take a look at the original LDA paper by David Blei et al. (2003) at http://jmlr.csail.mit.edu/papers/v3/blei03a.html or at the summarized presentation by D. Santhanam of Brown University at http://www.cs.brown.edu/courses/csci2950-p/spring2010/lectures/2010-03-03_santhanam.pdf.

The class implements many estimators that require quite deep theoretical knowledge of how the LDA method works. We'll pick the left-to-right evaluator, which is appropriate for a wide range of applications, including text mining, speech recognition, and others. The left-to-right evaluator is implemented as the double evaluateLeftToRight method, accepting the following components:

• Instances heldOutDocuments: These are the test instances
• int numParticles: This algorithm parameter indicates the number of left-to-right tokens, where the default value is 10
• boolean useResampling: This states whether to resample topics in left-to-right evaluation; resampling is more accurate, but leads to quadratic scaling in the length of documents
• PrintStream docProbabilityStream: This is the file or stdout in which we write the inferred log probabilities per document

Let's run the estimator, as follows:

double loglike = estimator.evaluateLeftToRight(instanceSplit[1], 10, false, null);
System.out.println("Total log likelihood: " + loglike);


In our particular case, the estimator outputs the following log likelihood, which makes sense when it is compared to other models that are either constructed with different parameters, pipelines, or data—the higher the log likelihood, the better the model is:

Total time: 3 seconds
Topic Evaluator: 5 topics, 3 topic bits, 111 topic mask
Total log likelihood: -360849.4240795393

Now let's take a look at how to make use of this model.

Reusing a model

As we are usually not building models on the fly, it often makes sense to train a model once and use it repeatedly to classify new data.

Note that if you'd like to classify new documents, they need to go through the same pipeline as the other documents—the pipe needs to be the same for both training and classification. During training, the pipe's data alphabet is updated with each training instance. If you create a new pipe with the same steps, you don't produce the same pipeline, as its data alphabet is empty. Therefore, to use the model on new data, save/load the pipe along with the model and use this pipe to add new instances.

Saving a model

Mallet supports a standard method for saving and restoring objects based on serialization. We simply create a new instance of the ObjectOutputStream class and write the object into a file, as follows:

String modelPath = "myTopicModel";

// Save model
ObjectOutputStream oos = new ObjectOutputStream(
    new FileOutputStream(new File(modelPath + ".model")));
oos.writeObject(model);
oos.close();

// Save pipeline
oos = new ObjectOutputStream(
    new FileOutputStream(new File(modelPath + ".pipeline")));
oos.writeObject(pipeline);
oos.close();

Restoring a model

Restoring a model saved through serialization is simply an inverse operation using the ObjectInputStream class:

String modelPath = "myTopicModel";

// Load model
ObjectInputStream ois = new ObjectInputStream(
    new FileInputStream(new File(modelPath + ".model")));
ParallelTopicModel model = (ParallelTopicModel) ois.readObject();
ois.close();

// Load pipeline
ois = new ObjectInputStream(
    new FileInputStream(new File(modelPath + ".pipeline")));
SerialPipes pipeline = (SerialPipes) ois.readObject();
ois.close();
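With the pipeline and model restored, a new document can be pushed through the same pipe and handed to the model's inferencer to obtain its topic distribution. The following is a minimal sketch; the document text and the sampling parameters (100 iterations, thinning of 10, burn-in of 10) are illustrative values only:

// Classify a new, unseen document with the restored pipeline and model
InstanceList newInstances = new InstanceList(pipeline);
newInstances.addThruPipe(new Instance(
    "The company reported record quarterly profits.", // data
    null, "new-doc", null));                           // target, name, source

// Infer the topic distribution of the new document
TopicInferencer inferencer = model.getInferencer();
double[] topicProbabilities = inferencer.getSampledDistribution(
    newInstances.get(0), 100, 10, 10); // iterations, thinning, burn-in

for (int topic = 0; topic < topicProbabilities.length; topic++) {
  System.out.println("Topic " + topic + ": " + topicProbabilities[topic]);
}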

We discussed how to build an LDA model to automatically classify documents into topics. In the next example, we'll look into another text mining problem—text classification.

E-mail spam detection

Spam or electronic spam refers to unsolicited messages, typically carrying advertising content, infected attachments, links to phishing or malware sites, and so on. While the most widely recognized form of spam is e-mail spam, spam abuses appear in other media as well: website comments, instant messaging, Internet forums, blogs, online ads, and so on.

In this chapter, we will discuss how to build a naive Bayesian spam filter, using the bag-of-words representation to identify spam e-mails. Naive Bayes spam filtering is one of the basic techniques implemented in the first commercial spam filters; for instance, the Mozilla Thunderbird mail client uses a native implementation of such filtering. While the example in this chapter will use e-mail spam, the underlying methodology can be applied to other types of text-based spam as well.

E-mail spam dataset

Androutsopoulos et al. (2000) collected one of the first e-mail spam datasets to benchmark spam-filtering algorithms. They studied how the naive Bayes classifier can be used to detect spam, and whether additional pipes such as a stop list, stemmer, and lemmatization contribute to better performance. The dataset was reorganized by Andrew Ng for OpenClassroom's machine learning class, available for download at http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html.

Select and download the second option, ex6DataEmails.zip, as shown in the following image:

The ZIP contains four folders (Ng, 2015):

• The nonspam-train and spam-train folders contain the pre-processed e-mails that you will use for training. They have 350 e-mails each.
• The nonspam-test and spam-test folders constitute the test set, containing 130 spam and 130 nonspam e-mails. These are the documents that you will make predictions on. Notice that even though separate folders tell you the correct labeling, you should make your predictions on all the test documents without this knowledge. After you make your predictions, you can use the correct labeling to check whether your classifications were correct.


To leverage Mallet's folder iterator, let's reorganize the folder structure as follows. Create two folders, train and test, and put the spam/nonspam folders under the corresponding folders. The initial folder structure is as shown in the following image:

The final folder structure will be as shown in the following image:

The next step is to transform e-mail messages to feature vectors.

Feature generation

Create a default pipeline as described previously:

ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
pipeList.add(new Input2CharSequence("UTF-8"));
Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
pipeList.add(new CharSequence2TokenSequence(tokenPattern));
pipeList.add(new TokenSequenceLowercase());
pipeList.add(new TokenSequenceRemoveStopwords(new File(stopListFilePath), "utf-8", false, false, false));
pipeList.add(new TokenSequence2FeatureSequence());
pipeList.add(new FeatureSequence2FeatureVector());
pipeList.add(new Target2Label());
SerialPipes pipeline = new SerialPipes(pipeList);


Note that we added an additional FeatureSequence2FeatureVector pipe that transforms a feature sequence into a feature vector. When we have data in a feature vector, we can use any classification algorithm as we saw in the previous chapters. We'll continue our example in Mallet to demonstrate how to build a classification model.

Next, initialize a folder iterator to load our examples in the train folder comprising e-mail examples in the spam and nonspam subfolders, which will be used as example labels:

FileIterator folderIterator = new FileIterator(
    new File[] {new File(dataFolderPath)},
    new TxtFilter(),
    FileIterator.LAST_DIRECTORY);

Construct a new instance list with the pipeline that we want to use to process the text:

InstanceList instances = new InstanceList(pipeline);

Finally, process each instance provided by the iterator:

instances.addThruPipe(folderIterator);

We have now loaded the data and transformed it into feature vectors. Let's train our model on the training set and predict the spam/nonspam classification on the test set.

Training and testing

Mallet implements a set of classifiers in the cc.mallet.classify package, including decision trees, naive Bayes, AdaBoost, bagging, boosting, and many others. We'll start with a basic classifier, that is, a naive Bayes classifier. A classifier is initialized by the ClassifierTrainer class, which returns a classifier when we invoke its train(InstanceList) method:

ClassifierTrainer classifierTrainer = new NaiveBayesTrainer();
Classifier classifier = classifierTrainer.train(instances);

Now let's see how this classifier works and evaluate its performance on a separate dataset.

Model performance

To evaluate the classifier on a separate dataset, let's start by importing the e-mails located in our test folder:

InstanceList testInstances = new InstanceList(classifier.getInstancePipe());
folderIterator = new FileIterator(
    new File[] {new File(testFolderPath)},
    new TxtFilter(),
    FileIterator.LAST_DIRECTORY);

We will pass the data through the same pipeline that we initialized during training:

testInstances.addThruPipe(folderIterator);

To evaluate classifier performance, we'll use the cc.mallet.classify.Trial class, which is initialized with a classifier and a set of test instances:

Trial trial = new Trial(classifier, testInstances);

The evaluation is performed immediately at initialization. We can then simply take out the measures that we care about. In our example, we'd like to check the precision and recall on classifying spam e-mail messages, or F-measure, which returns a harmonic mean of both values, as follows:

System.out.println( "F1 for class 'spam': " + trial.getF1("spam")); System.out.println( "Precision:" + trial.getPrecision(1)); System.out.println( "Recall:" + trial.getRecall(1));

The evaluation object outputs the following results:

F1 for class 'spam': 0.9731800766283524
Precision: 0.9694656488549618
Recall: 0.9769230769230769

The results show that the model correctly discovers 97.69% of spam messages (recall), and when it marks an e-mail as spam, it is correct in 96.94% cases. In other words, it misses approximately 2 per 100 spam messages and marks 3 per 100 valid messages as spam. Not really perfect, but it is more than a good start!

Summary

In this chapter, we discussed how text mining is different from traditional attribute-based learning, requiring a lot of pre-processing steps in order to transform written natural language into feature vectors. Further, we discussed how to leverage Mallet, a Java-based library for natural language processing, by applying it to two real-life problems. First, we modeled topics in a news corpus using the LDA model to build a model that is able to assign a topic to a new document. We also discussed how to build a naive Bayesian spam-filtering classifier using the bag-of-words representation.

This chapter concludes the technical demonstrations of how to apply various libraries to solve machine learning tasks. As we were not able to cover more interesting applications and give further details at many points, the next chapter gives some further pointers on how to continue learning and dive deeper into particular topics.

What is Next?

This chapter brings us to the end of our journey of reviewing machine learning Java libraries and discussing how to leverage them to solve real-life problems. However, this should not be the end of your journey by any means. This chapter will give you some practical advice on how to start deploying your models in the real world, what the catches are, and where to go to deepen your knowledge. It also gives you further pointers about where to find additional resources, materials, venues, and technologies to dive deeper into machine learning.

This chapter will cover the following topics:

• Important aspects of machine learning in real life
• Standards and markup languages
• Machine learning in the cloud
• Web resources and competitions

Machine learning in real life

Papers, conference presentations, and talks often don't discuss how the models were actually deployed and maintained in a production environment. In this section, we'll look into some aspects that should be taken into consideration.

Noisy data

In practice, data typically contains errors and imperfections due to various reasons such as measurement errors, human mistakes, errors of expert judgment in classifying training examples, and so on. We refer to all of these as noise. Noise can also come from the treatment of missing values when an example with an unknown attribute value is replaced by a set of weighted examples corresponding to the probability distribution of the missing value. The typical consequences of noise in learning data are low prediction accuracy of the learned model on new data and complex models that are hard to interpret and understand.

Class unbalance

Class unbalance is a problem we came across in Chapter 7, Fraud and Anomaly Detection, where the goal was to detect fraudulent insurance claims. The challenge is that a very large part of the dataset, usually more than 90%, describes normal activities and only a small fraction of the dataset contains fraudulent examples. In such a case, if the model always predicts normal, then it is correct 90% of the time. This problem is extremely common in practice and can be observed in various applications, including fraud detection, anomaly detection, medical diagnosis, oil spillage detection, facial recognition, and so on.

Now that we know what the class unbalance problem is and why it is a problem, let's take a look at how to deal with it. The first approach is to focus on measures other than classification accuracy, such as recall, precision, and F-measure. Such measures focus on how accurate a model is at predicting the minority class (recall) and what the share of false alarms is (precision). The other approach is based on resampling, where the main idea is to reduce the number of overrepresented examples in such a way that the new set contains a balanced ratio of both classes.
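For instance, with Weka, which we used in the earlier chapters, the majority class can be downsampled with the SpreadSubsample filter. The following is a minimal sketch, assuming data is an Instances object with the class attribute already set:

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.instance.SpreadSubsample;

// Downsample the majority class so that the class ratio is at most 1:1
SpreadSubsample subsample = new SpreadSubsample();
subsample.setDistributionSpread(1.0); // maximum ratio between the largest and smallest class
subsample.setInputFormat(data);       // data: an Instances object with the class attribute set
Instances balanced = Filter.useFilter(data, subsample);

System.out.println("Before: " + data.numInstances() + ", after: " + balanced.numInstances());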

Feature selection is hard

Feature selection is arguably the most challenging part of modeling that requires domain knowledge and good insights into the problem at hand. Nevertheless, properties of well-behaved features are as follows:

• Reusability: Features should be available for reuse in different models, applications, and teams
• Transformability: You should be able to transform a feature with an operation, for example, log(), max(), or combine multiple features together with a custom calculation


• Reliability: Features should be easy to monitor and appropriate unit tests should exist to minimize bugs/issues
• Interpretability: In order to perform any of the previous actions, you need to be able to understand the meaning of features and interpret their values

The better you are able to capture the features, the more accurate your results will be.

Model chaining

Some models might produce an output, which is used as a feature in another model. Moreover, we can use multiple models—ensembles—turning any model into a feature. This is a great way to get better results, but it can lead to problems too. Care must be taken that the output of your model is ready to accept dependencies. Also, try to avoid feedback loops, as they can create dependencies and bottlenecks in the pipeline.

Importance of evaluation
Another important aspect is model evaluation. Unless you apply your models to actual new data and measure a business objective, you're not doing predictive analytics. Evaluation techniques such as cross-validation and separate train/test sets simply split the data you already have, which can only give you an estimate of how your model will perform. Life often doesn't hand you a training dataset with all the cases defined, so there is a lot of creativity involved in defining these two sets on a real-world dataset.
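For reference, this is roughly what such an offline estimate looks like with Weka's cross-validation; the dataset and the choice of J48 are placeholders, and the numbers it prints are only an estimate, not a measurement of the business objective:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OfflineEstimate {
    public static void main(String[] args) throws Exception {
        // Placeholder file name; the class attribute is assumed to be last
        Instances data = DataSource.read("dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation only estimates offline performance; the
        // business impact still has to be measured on live data
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}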

At the end of the day, we want to improve a business objective, such as improving the ad conversion rate or getting more clicks on recommended items. To measure the improvement, execute A/B tests, measuring differences in metrics across statistically identical populations that each experience a different algorithm. Decisions on the product are always data-driven.

A/B testing is a method for a randomized experiment with two variants: A, which corresponds to the original version and acts as the control, and B, which corresponds to a variation. The method can be used to determine whether the variation outperforms the original version. It can be used to test everything from website changes to sales e-mails to search ads. Udacity offers a free course covering the design and analysis of A/B tests at https://www.udacity.com/course/ab-testing--ud257.
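A common way to decide whether variant B beats variant A on a conversion metric is a two-proportion z-test. The following self-contained sketch uses made-up counts purely for illustration:

public class ABTest {

    public static void main(String[] args) {
        // Made-up counts for illustration: conversions and visitors per variant
        double z = zScore(4_320, 100_000,   // variant A (control)
                          4_570, 100_000);  // variant B (variation)
        System.out.printf("z = %.3f, significant at the 5%% level: %b%n",
                z, Math.abs(z) > 1.96);
    }

    // Two-proportion z-test with a pooled conversion rate under the null
    // hypothesis that both variants convert equally well
    static double zScore(long convA, long visitsA, long convB, long visitsB) {
        double pA = (double) convA / visitsA;
        double pB = (double) convB / visitsB;
        double pPool = (double) (convA + convB) / (visitsA + visitsB);
        double se = Math.sqrt(pPool * (1 - pPool) * (1.0 / visitsA + 1.0 / visitsB));
        return (pB - pA) / se;
    }
}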

Getting models into production
The path from building an accurate model in a lab to deploying it in a product involves collaboration between data science and engineering, as shown in the following three steps and diagram:

1. Data research and hypothesis building involves modeling the problem and executing the initial evaluation.
2. Solution building and implementation is where your model finds its way into the product flow by rewriting it into more efficient, stable, and scalable code.
3. Online evaluation is the last stage, where the model is evaluated with live data using A/B testing on business objectives.

Model maintenance
Another aspect that we need to address is how the model will be maintained. Is this a model that will not change over time? Is it modeling a dynamic phenomenon, requiring the model to adjust its predictions over time?

The model is usually built in an offline batch-training process and then used on live data to serve predictions, as shown in the following figure. If we are able to receive feedback on model predictions, for instance, whether the stock went up as the model predicted, whether the candidate responded to the campaign, and so on, the feedback should be used to improve the initial model.


The feedback can be really useful for improving the initial model, but make sure to pay attention to the data you are sampling. For instance, if you have a model that predicts who will respond to a campaign, you will initially use a set of randomly contacted clients with a specific responded/not-responded distribution and feature properties. The model will then focus only on the subset of clients that are most likely to respond, and your feedback will return only the clients from that subset who responded. By including this data, the model becomes more accurate on a specific subgroup, but might completely miss some other group. We call this problem exploration versus exploitation. Some approaches to address this problem can be found in Osugi et al. (2005) and Bondu et al. (2010).
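One simple, hypothetical way to keep exploring is an epsilon-greedy policy: most clients are selected by the model, but a small random fraction is contacted regardless of the score so that feedback also covers groups the model would otherwise never see. A minimal sketch follows; the client list and the scoring function are assumptions:

import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

public class EpsilonGreedySelector {

    private final Random random = new Random();
    private final double epsilon; // fraction of selections used for exploration

    public EpsilonGreedySelector(double epsilon) {
        this.epsilon = epsilon;
    }

    // With probability epsilon, contact a random client (exploration) so the
    // feedback also covers groups the current model would never pick; otherwise
    // contact the client with the highest model score (exploitation)
    public String selectClient(List<String> clients, ToDoubleFunction<String> modelScore) {
        if (random.nextDouble() < epsilon) {
            return clients.get(random.nextInt(clients.size()));
        }
        return clients.stream()
                .max(Comparator.comparingDouble(modelScore))
                .orElseThrow(IllegalStateException::new);
    }
}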

Standards and markup languages
As predictive models become more pervasive, the need to share models and to standardize the modeling process has led to a formalization of the development process and to interchangeable formats. In this section, we'll review two de facto standards, one covering data science processes and the other specifying an interchangeable format for sharing models between applications.

CRISP-DM
Cross Industry Standard Process for Data Mining (CRISP-DM) describes a data mining process commonly used by data scientists in industry. CRISP-DM breaks the data mining process into the following six major phases:

• Business understanding
• Data understanding


• Data preparation
• Modeling
• Evaluation
• Deployment

In the following diagram, the arrows indicate the process flow, which can move back and forth through the phases. Also, the process doesn't stop with model deployment. The outer arrow indicates the cyclic nature of data science. Lessons learned during the process can trigger new questions and repeat the process while improving previous results:

SEMMA methodology
Another methodology is Sample, Explore, Modify, Model, and Assess (SEMMA). SEMMA describes the main modeling tasks in data science, while leaving aside business aspects such as data understanding and deployment. SEMMA was developed by the SAS Institute, one of the largest vendors of statistical software, aiming to help the users of its software carry out the core tasks of data mining.

Predictive Model Markup Language
Predictive Model Markup Language (PMML) is an XML-based interchange format that allows machine learning models to be easily shared between applications and systems. Supported models include logistic regression, neural networks, decision trees, naïve Bayes, regression models, and many others. A typical PMML file consists of the following sections:

• Header containing general information
• Data dictionary, describing data types
• Data transformations, specifying steps for normalization, discretization, aggregations, or custom functions
• Model definition, including parameters
• Mining schema, listing attributes used by the model
• Targets, allowing post-processing of the predicted results
• Output, listing the fields to be outputted and other post-processing steps

The generated PMML files can be imported into any PMML-consuming application, such as the Zementis Adaptive Decision and Predictive Analytics (ADAPA) and Universal PMML Plug-in (UPPI) scoring engines; Weka, which has built-in support for the regression, general regression, neural network, TreeModel, RuleSetModel, and Support Vector Machine (SVM) models; Spark, which can export k-means clustering, linear regression, ridge regression, lasso, binary logistic, and SVM models; and Cascading, which can transform PMML files into an application on Apache Hadoop.
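As a rough sketch of consuming a PMML file in Weka, the PMMLFactory class can load a model that was exported elsewhere; the file names below are placeholders, and only model types that Weka supports will load as a PMMLClassifier:

import java.io.File;

import weka.classifiers.pmml.consumer.PMMLClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.pmml.PMMLFactory;
import weka.core.pmml.PMMLModel;

public class ScoreWithPMML {
    public static void main(String[] args) throws Exception {
        // Placeholder file names for a PMML model exported by another tool
        // and for the data that should be scored with it
        PMMLModel pmml = PMMLFactory.getPMMLModel(new File("model.pmml"));

        if (pmml instanceof PMMLClassifier) {
            PMMLClassifier scorer = (PMMLClassifier) pmml;
            Instances newData = DataSource.read("new-data.arff");
            newData.setClassIndex(newData.numAttributes() - 1);
            System.out.println("Prediction for the first instance: "
                    + scorer.classifyInstance(newData.firstInstance()));
        }
    }
}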

The next generation of PMML is an emerging format called Portable Format for Analytics (PFA), providing a common interface to deploy complete workflows across environments.

Machine learning in the cloud
Setting up a complete machine learning stack that is able to scale with an increasing amount of data can be challenging. The recent wave of the Software as a Service (SaaS) and Infrastructure as a Service (IaaS) paradigms has spilled over into the machine learning domain as well. The trend today is to move the actual data preprocessing, modeling, and prediction to cloud environments and to focus on the modeling task only.

In this section, we'll review some of the promising services offering algorithms, predictive models already trained for specific domains, and environments empowering collaborative workflows in data science teams.

Machine learning as a service
The first category is algorithms as a service, where you are provided with an API or even a graphical user interface to connect the pre-programmed components of a data science pipeline together:

• Google Prediction API was one of the first prediction services offered through a web API. The service is integrated with Google Cloud Storage, which serves as data storage. The user can build a model and call an API to get predictions.
• BigML implements a user-friendly graphical interface, supports many storage providers (for instance, Amazon S3), and offers a wide variety of data processing tools, algorithms, and powerful visualizations.
• Microsoft Azure Machine Learning provides a large library of machine learning algorithms and data processing functions, as well as a graphical user interface to connect these components into an application. Additionally, it offers a fully-managed service that you can use to deploy your predictive models as ready-to-consume web services.
• Amazon Machine Learning entered the market quite late. Its main strength is seamless integration with other Amazon services, while the number of algorithms and the user interface need further improvement.
• IBM Watson Analytics focuses on providing models that are already hand-crafted for a particular domain, such as speech recognition, machine translation, and anomaly detection. It targets a wide range of industries by solving specific use cases.
• Prediction.IO is a self-hosted open source platform, providing the full stack from data storage to modeling to serving the predictions. Prediction.IO can talk to Apache Spark to leverage its learning algorithms. In addition, it is shipped with a wide variety of models targeting specific domains, for instance, recommender systems, churn prediction, and others.

Predictive APIs are an emerging field, so these are just some of the well-known examples; KDnuggets compiled a list of 50 machine learning APIs at http://www.kdnuggets.com/2015/12/machine-learning-data-science-apis.html.

To learn more about the field, you can visit PAPI, the International Conference on Predictive APIs and Apps, at http://www.papi.io, or take a look at the book by Louis Dorard, Bootstrapping Machine Learning (L. Dorard, 2014).

Web resources and competitions
In this section, we'll review where to find additional resources for learning, discussing, presenting, or sharpening our data science skills.

Datasets
One of the most well-known repositories of machine learning datasets is hosted by the University of California, Irvine. The UCI repository contains over 300 datasets covering a wide variety of challenges, including poker, movies, wine quality, activity recognition, stocks, taxi service trajectories, advertisements, and many others. Each dataset is usually accompanied by a research paper where the dataset was used, which can give you a hint on how to start and what the prediction baseline is.

The UCI machine learning repository can be accessed at https://archive.ics.uci.edu.
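For instance, after downloading a dataset from the repository, a ZeroR baseline can be computed in a few lines of Weka code; the file name below is a placeholder, and the baseline is the accuracy any real model should beat:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class UciBaseline {
    public static void main(String[] args) throws Exception {
        // Placeholder path to a dataset downloaded from the UCI repository
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // ZeroR always predicts the majority class; any real model should beat it
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new ZeroR(), data, 10, new Random(1));
        System.out.println("Baseline accuracy: " + eval.pctCorrect() + "%");
    }
}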


Another well-maintained collection by Xiaming Chen is hosted on GitHub: https://github.com/caesar0301/awesome-public-datasets

The Awesome Public Datasets repository maintains links to more than 400 data sources from a variety of domains, ranging from agriculture and biology to economics, psychology, museums, and transportation. Datasets specifically targeting machine learning are collected under the image processing, machine learning, and data challenges sections.

Online courses
Learning how to become a data scientist has become much more accessible due to the availability of online courses. The following is a list of free resources to learn different skills online:

• Online courses for learning Java:
°° Udemy: Learn Java Programming From Scratch at https://www.udemy.com/learn-java-programming-from-scratch
°° Udemy: Java Tutorial for Complete Beginners at https://www.udemy.com/java-tutorial
°° LearnJavaOnline.org: Interactive Java tutorial at http://www.learnjavaonline.org/

• Online courses to learn more about machine learning:
°° Coursera: Machine Learning (Stanford) by Andrew Ng: This teaches you the math behind many of the machine learning algorithms, explains how they work, and explores why they make sense at https://www.coursera.org/learn/machine-learning.
°° Statistics 110 (Harvard) by Joe Blitzstein: This course lets you discover the probability-related terms that you will hear many times in your data science journey. Lectures are available on YouTube at http://projects.iq.harvard.edu/stat110/youtube.
°° Data Science CS109 (Harvard): This is a hands-on course where you'll learn about Python libraries for data science, as well as how to handle machine learning algorithms, at http://cs109.github.io/2015/.

Competitions
The best way to sharpen your knowledge is to work on real problems, and if you want to build a proven portfolio of your projects, machine learning competitions are a viable place to start:

• Kaggle: This is the number one competition platform, hosting a wide variety of challenges with large prizes, a strong data science community, and lots of helpful resources. You can check it out at https://www.kaggle.com/.
• CrowdANALYTIX: This is a crowdsourced data analytics service that is focused on the life sciences and financial services industries at https://www.crowdanalytix.com.
• DrivenData: This hosts data science competitions for social good at http://www.drivendata.org/.

Websites and blogs
In addition to online courses and competitions, there are numerous websites and blogs publishing the latest developments in the data science community, their authors' experiences in attacking different problems, or just best practices. Some good starting points are as follows:

• KDnuggets: This is the de facto portal for data mining, analytics, big data, and data science, covering the latest news, stories, events, and other relevant issues at http://www.kdnuggets.com/.
• Machine Learning Mastery: This is an introductory-level blog with practical advice and pointers on where to start. Check it out at http://machinelearningmastery.com/.
• Data Science Central: This consists of practical community articles on a variety of topics, algorithms, caches, and business cases at http://www.datasciencecentral.com/.
• Data Mining Research by Sandro Saitta at http://www.dataminingblog.com/.
• Data Mining: Text Mining, Visualization and Social Media by Matthew Hurst, covering interesting text and web mining topics, frequently with applications to Bing and Microsoft, at http://datamining.typepad.com/data_mining/.
• Geeking with Greg by Greg Linden, inventor of the Amazon recommendation engine and Internet entrepreneur. You can check it out at http://glinden.blogspot.si/.
• DSGuide: This is a collection of over 150 data science blogs at http://dsguide.biz/reader/sources.

Venues and conferences
The following are a few top-tier academic conferences presenting the latest algorithms:

• Knowledge Discovery in Databases (KDD)
• Computer Vision and Pattern Recognition (CVPR)
• Annual Conference on Neural Information Processing Systems (NIPS)
• International Conference on Machine Learning (ICML)
• IEEE International Conference on Data Mining (ICDM)
• International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)
• International Joint Conference on Artificial Intelligence (IJCAI)

Some business conferences are as follows:

• O'Reilly Strata Conference
• The Strata + Hadoop World Conferences
• Predictive Analytics World
• MLconf

You can also check local meetup groups.

Summary
In this chapter, we concluded the book by discussing some aspects of model deployment. We also looked into standards for the data science process and the interchangeable predictive model format, PMML. Finally, we reviewed online courses, competitions, web resources, and conferences that could help you in your journey towards mastering the art of machine learning.

I hope this book has inspired you to dive deeper into data science and has motivated you to get your hands dirty, experiment with various libraries, and get a grasp of how different problems can be attacked. Remember, all the source code and additional resources are available at the supplementary website http://www.machine-learning-in-java.com.

References

The following are the references for all the citations throughout the book:

• Adomavicius, G. and Tuzhilin, A. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734-749. 2005.
• Bengio, Y. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1-127. 2009. Retrieved from http://www.iro.umontreal.ca/~bengioy/papers/ftml.pdf.
• Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022. 2003. Retrieved from http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf.
• Bondu, A., Lemaire, V., and Boulle, M. Exploration vs. exploitation in active learning: A Bayesian approach. The 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona, Spain. 2010.
• Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. LOF: Identifying Density-based Local Outliers. Proceedings from the 2000 ACM SIGMOD International Conference on Management of Data, 29(2), 93–104. 2000.
• Campbell, A. T. (n.d.). Lecture 21 - Activity Recognition. Retrieved from http://www.cs.dartmouth.edu/~campbell/cs65/lecture22/lecture22.html.
• Chandra, N. S. Unraveling the Customer Mind. 2012. Retrieved from http://www.cognizant.com/InsightsWhitepapers/Unraveling-the-Customer-Mind.pdf.


• Dror, G., Boullé, M., Guyon, I., Lemaire, V., and Vogel, D. The 2009 Knowledge Discovery in Data Competition (KDD Cup 2009). Volume 3, Challenges in Machine Learning, Massachusetts, US. Microtome Publishing. 2009.
• Gelman, A. and Nolan, D. Teaching Statistics: A Bag of Tricks. Cambridge, MA. Oxford University Press. 2002.
• Goshtasby, A. A. Image Registration: Principles, Tools and Methods. London, Springer. 2012.
• Greene, D. and Cunningham, P. Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering. Proceedings from the 23rd International Conference on Machine Learning, Pittsburgh, PA. 2006. Retrieved from http://www.autonlab.org/icml_documents/camera-ready/048_Practical_Solutions.pdf.
• Gupta, A. Learning Apache Mahout Classification. Birmingham, UK. Packt Publishing. 2015.
• Gutierrez, N. Demystifying Market Basket Analysis. 2006. Retrieved from http://www.information-management.com/specialreports/20061031/1067598-1.html.
• Hand, D., Mannila, H., and Smyth, P. Principles of Data Mining. USA. MIT Press. 2001. Retrieved from ftp://gamma.sbin.org/pub/doc/books/Principles_of_Data_Mining.pdf.
• Intel. What Happens in an Internet Minute?. 2013. Retrieved from http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html.
• Kaluža, B. Instant Weka How-To. Birmingham. Packt Publishing. 2013.
• Karkera, K. R. Building Probabilistic Graphical Models with Python. Birmingham, UK. Packt Publishing. 2014.
• KDD (n.d.). KDD Cup 2009: Customer relationship prediction. Retrieved from http://www.kdd.org/kdd-cup/view/kdd-cup-2009.
• Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. Cambridge, MA. MIT Press. 2012.
• Kurucz, M., Siklósi, D., Bíró, I., Csizsek, P., Fekete, Z., Iwatt, R., Kiss, T., and Szabó, A. KDD Cup 2009 @ Budapest: feature partitioning and boosting. JMLR W&CP 7, 65–75. 2009.
• Laptev, N., Amizadeh, S., and Billawala, Y. (n.d.). A Benchmark Dataset for Time Series Anomaly Detection. Retrieved from http://yahoolabs.tumblr.com/post/114590420346/a-benchmark-dataset-for-time-series-anomaly.


• Kurgan, L. A. and Musilek, P. A survey of Knowledge Discovery and Data Mining process models. The Knowledge Engineering Review, 21(1), 1–24. 2006.
• Lo, H.-Y., Chang, K.-W., Chen, S.-T., Chiang, T.-H., Ferng, C.-S., Hsieh, C.-J., Ko, Y.-K., Kuo, T.-T., Lai, H.-C., Lin, K.-Y., Wang, C.-H., Yu, H.-F., Lin, C.-J., Lin, H.-T., and Lin, S.-de. An Ensemble of Three Classifiers for KDD Cup 2009: Expanded Linear Model, Heterogeneous Boosting, and Selective Naive Bayes. JMLR W&CP 7, 57–64. 2009.
• Magalhães, P. Incorrect information provided by your website. 2010. Retrieved from http://www.best.eu.org/aboutBEST/helpdeskRequest.jsp?req=f5wpxc8&auth=Paulo.
• Mariscal, G., Marban, O., and Fernandez, C. A survey of data mining and knowledge discovery process models and methodologies. The Knowledge Engineering Review, 25(2), 137–166. 2010.
• Mew, K. Android 5 Programming by Example. Birmingham, UK. Packt Publishing. 2015.
• Miller, H., Clarke, S., Lane, S., Lonie, A., Lazaridis, D., Petrovski, S., and Jones, O. Predicting customer behavior: The University of Melbourne's KDD Cup report. JMLR W&CP 7, 45–55. 2009.
• Niculescu-Mizil, A., Perlich, C., Swirszcz, G., Sindhwani, V., Liu, Y., Melville, P., Wang, D., Xiao, J., Hu, J., Singh, M., Shang, W. X., and Zhu, Y. F. Winning the KDD Cup Orange Challenge with Ensemble Selection. JMLR W&CP 7, 23–34. 2009. Retrieved from http://jmlr.org/proceedings/papers/v7/niculescu09/niculescu09.pdf.
• Oracle (n.d.). Anomaly Detection. Retrieved from http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/anomalies.htm.
• Osugi, T., Deng, K., and Scott, S. Balancing exploration and exploitation: a new algorithm for active machine learning. Fifth IEEE International Conference on Data Mining, Houston, Texas. 2005.
• Power, D. J. (ed.). DSS News. DSSResources.com, 3(23). 2002. Retrieved from http://www.dssresources.com/newsletters/66.php.
• Quinlan, J. R. C4.5: Programs for Machine Learning. San Francisco, CA. Morgan Kaufmann Publishers. 1993.
• Rajak, A. Association Rule Mining: Applications in Various Areas. 2008. Retrieved from https://www.researchgate.net/publication/238525379_Association_rule_mining-_Applications_in_various_areas.
• Ricci, F., Rokach, L., Shapira, B., and Kantor, P. B. (eds.). Recommender Systems Handbook. New York, Springer. 2010.


• Rumsfeld, D. H. and Myers, G. DoD News Briefing – Secretary Rumsfeld and Gen. Myers. 2002. Retrieved from http://archive.defense.gov/transcripts/transcript.aspx?transcriptid=2636.
• Stevens, S. S. On the Theory of Scales of Measurement. Science, 103(2684), 677–680. 1946.
• Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. Cambridge, MA. MIT Press. 1998.
• Tiwary, C. Learning Apache Mahout. Birmingham, UK. Packt Publishing. 2015.
• Tsai, J., Kaminka, G., Epstein, S., Zilka, A., Rika, I., Wang, X., Ogden, A., Brown, M., Fridman, N., Taylor, M., Bowring, E., Marsella, S., Tambe, M., and Sheel, A. ESCAPES - Evacuation Simulation with Children, Authorities, Parents, Emotions, and Social comparison. Proceedings from the 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2011), 2(6), 457–464. 2011. Retrieved from http://www.aamas-conference.org/Proceedings/aamas2011/papers/D3_G57.pdf.
• Tsanas, A. and Xifara, A. Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings, 49, 560-567. 2012.
• Utts, J. What Educated Citizens Should Know About Statistics and Probability. The American Statistician, 57(2), 74-79. 2003.
• Wallach, H. M., Murray, I., Salakhutdinov, R., and Mimno, D. Evaluation Methods for Topic Models. Proceedings from the 26th International Conference on Machine Learning, Montreal, Canada. 2009. Retrieved from http://mimno.infosci.cornell.edu/papers/wallach09evaluation.pdf.
• Witten, I. H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. USA. Morgan Kaufmann Publishers. 2000.
• Xie, J., Rojkova, V., Pal, S., and Coggeshall, S. A Combination of Boosting and Bagging for KDD Cup 2009. JMLR W&CP 7, 35–43. 2009.
• Zhang, H. The Optimality of Naive Bayes. Proceedings from the FLAIRS 2004 conference. 2004. Retrieved from http://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf.
• Ziegler, C.-N., McNee, S. M., Konstan, J. A., and Lausen, G. Improving Recommendation Lists Through Topic Diversification. Proceedings from the 14th International World Wide Web Conference (WWW '05), Chiba, Japan. 2005. Retrieved from http://www2.informatik.uni-freiburg.de/~cziegler/papers/WWW-05-CR.pdf.
