AI Programming with Java

Total Pages: 16

File Type: PDF, Size: 1020 KB

AI Programming with Java
Molei Hong

AI PROGRAMMING WITH JAVA
Final Thesis
CENTRIA UNIVERSITY OF APPLIED SCIENCES
Information Technology
December 2020

Centria University of Applied Sciences
Date: 1/9/2021
Author: Molei Hong
Degree programme: Information Technology
Name of thesis: AI PROGRAMMING WITH JAVA
Instructor: Kauko Kolehmainen
Pages: 42
Supervisor: Kauko Kolehmainen

The thesis was about an artificial intelligence project in Java. It started with a theoretical part giving a basic presentation of the Java programming language and introducing its history, importance and features. A description of the Java Virtual Machine, the central component of the Java platform, was also included. The definition of AI, the types of AI and the different AI levels, which include machine learning and deep learning, were presented as well. In the practical part, Weka, a Java library of machine learning algorithms, was introduced, and two data mining and classification experiments using Weka were demonstrated. After that, a Java project using Weka's Naïve Bayesian algorithm illustrated the principle of email classification. In the project, the necessary elements such as the Weka packages, the datasets and the important methods were presented in detail.

Key words: Java programming, Java Virtual Machine, Artificial Intelligence, Weka, Naïve Bayesian, spam filter, data classification.

ABSTRACT
CONCEPT DEFINITIONS

CONTENTS
1 INTRODUCTION
2 JAVA PROGRAMMING LANGUAGE
  2.1 Java platform Standard Edition
  2.2 Java Virtual Machine
3 ARTIFICIAL INTELLIGENCE
  3.1 The definitions of AI
  3.2 The categories of AI
    3.2.1 Machine Learning
    3.2.2 Deep Learning
  3.3 Machine learning tools and libraries in Java
4 WEKA
  4.1 The Explorer experiment
  4.2 The class structure of Weka
    4.2.1 The weka.core package
    4.2.2 The weka.classifiers package
5 PROJECT EXPLANATION
  5.1 Naïve Bayesian algorithmic reasoning
  5.2 Selection of datasets
  5.3 Importing Weka packages
  5.4 Training data
  5.5 Testing data
  5.6 Evaluation result
6 CONCLUSION
REFERENCES
SOURCE OF FIGURES

LIST OF FIGURES
Figure 1. Benefits of programming AI in Java
Figure 2. The interface of Java Platform Manager (NetBeans IDE 8.2)
Figure 3. The internal architecture of the JVM
Figure 4. The structure of Java 2 Platform SE 5.0
Figure 5. Four categories organized for the definitions of AI
Figure 6. Autonomous driving as the start of the Artificial Intelligence level
Figure 7. Three learning methods of machine learning algorithms
Figure 8. Weka GUI Chooser (version 3.8.3)
Figure 9. Problems of ARFF format in the UCI dataset
Figure 10. Reading in the diabetes data
Figure 11. Classifier output of the diabetes dataset
Figure 12. Images of Rhino and Brontosaurus
Figure 13. Equal frequency, 43 examples of each in the dataset
Figure 14. Additional features added to the dataset
Figure 15. Classifier output
Figure 16. Package Manager in Weka
Figure 17. Method of the Copyable interface
Figure 18. Method distributionForInstance()
Figure 19. Training dataset and testing dataset
Figure 20. TrainingData.arff opened as a txt file
Figure 21. Libraries including weka.jar and JDK 1.8
Figure 22. weka.core packages used in the source code
Figure 23. weka.classifiers package and the StringToWordVector class imported in the source code
Figure 24. Defined instances, attributes of training data and locations of training dataset files
Figure 25. Attributes "label" and "text" added, with "label" assigned values
Figure 26. Loading and saving dataset content
Figure 27. Converting text to a feature vector
Figure 28. Using a tokenizer to split the string
Figure 29. Building the classifier
Figure 30. Evaluating the classifier model
Figure 31. Main method for running the classifier
Figure 32. Output after running

1 INTRODUCTION

In recent years, machine learning, data science and artificial intelligence have been the most talked-about technologies, and for good reason. These advancements in technology have brought automation and business processes to a new level. Organizations of all sizes have invested millions of dollars in research and personnel to develop these extremely powerful data-driven applications. There are many different programming languages that can be used to develop machine learning and data science applications. Although Python and R have become the first choice for developing these programs, many organizations are turning to Java development to meet their
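The project summarized above classifies emails with Weka's Naïve Bayes implementation. The following is a minimal, hedged Java sketch of that workflow, not the thesis author's actual source code; the file names (TrainingData.arff, TestingData.arff) and the attribute names "label" and "text" are taken from the figure captions and should be treated as assumptions (Weka 3.7+ API).

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class SpamFilterSketch {
        public static void main(String[] args) throws Exception {
            // Load the training and testing sets (ARFF files with a nominal "label"
            // attribute and a string "text" attribute, as in Figures 19-25).
            Instances train = DataSource.read("TrainingData.arff");
            Instances test  = DataSource.read("TestingData.arff");
            train.setClassIndex(train.attribute("label").index());
            test.setClassIndex(test.attribute("label").index());

            // Convert the raw email text into a bag-of-words feature vector (Figure 27).
            StringToWordVector toVector = new StringToWordVector();
            toVector.setInputFormat(train);
            Instances trainVec = Filter.useFilter(train, toVector);
            Instances testVec  = Filter.useFilter(test, toVector);
            trainVec.setClassIndex(trainVec.attribute("label").index());
            testVec.setClassIndex(testVec.attribute("label").index());

            // Build the Naive Bayes classifier on the training data (Figure 29).
            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(trainVec);

            // Evaluate on the held-out test data (Figure 30).
            Evaluation eval = new Evaluation(trainVec);
            eval.evaluateModel(nb, testVec);
            System.out.println(eval.toSummaryString("\nSpam filter results\n", false));
        }
    }

Applying the same StringToWordVector filter to both sets in batch mode keeps the word dictionaries compatible, which the evaluation step requires.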
Recommended publications
  • Data Mining – Intro
    Data Warehouse & Data Mining, UNIT-3 Syllabus
    UNIT 3 – Classification: Introduction; decision tree; tree induction algorithm – split algorithm based on information theory, split algorithm based on Gini index; naïve Bayes method; estimating predictive accuracy of a classification method; classification software; software for association rule mining; case study: KDD Insurance Risk Assessment.

    What is Data Mining? Data mining is:
    (1) The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets.
    (2) The analysis step of the "knowledge discovery in databases" (KDD) process.
    (3) The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

    Knowledge Discovery
    Examples of large datasets:
    • Government: IRS, NGA, …
    • Large corporations – WALMART: 20M transactions per day; MOBIL: 100 TB geological databases; AT&T: 300M calls per day
    • Credit card companies
    • Scientific – NASA, EOS project: 50 GB per hour; environmental datasets

    KDD: The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages (1) Selection, (2) Pre-processing, (3) Transformation, (4) Data Mining, (5) Interpretation/Evaluation.

    Data mining methods:
    1. Decision tree classifiers: used for modeling, classification
    2. Association rules: used to find associations between sets of attributes
    3. Sequential patterns: used to find temporal associations in time series
    4. Hierarchical
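    The syllabus above lists split criteria based on information theory and on the Gini index. The self-contained Java sketch below (not part of the course material) computes both measures for a hypothetical candidate split, to show how a tree-induction algorithm scores splits; the example counts are invented.

        public class SplitCriteria {

            // Gini impurity of a node: 1 - sum(p_c^2) over class proportions p_c.
            static double gini(double[] classCounts) {
                double total = 0, sumSq = 0;
                for (double c : classCounts) total += c;
                if (total == 0) return 0;
                for (double c : classCounts) {
                    double p = c / total;
                    sumSq += p * p;
                }
                return 1.0 - sumSq;
            }

            // Entropy of a node: -sum(p_c * log2 p_c).
            static double entropy(double[] classCounts) {
                double total = 0, h = 0;
                for (double c : classCounts) total += c;
                if (total == 0) return 0;
                for (double c : classCounts) {
                    if (c == 0) continue;
                    double p = c / total;
                    h -= p * (Math.log(p) / Math.log(2));
                }
                return h;
            }

            // Weighted impurity of a split into several child nodes.
            static double splitScore(double[][] children, boolean useGini) {
                double total = 0;
                for (double[] child : children) for (double c : child) total += c;
                double score = 0;
                for (double[] child : children) {
                    double size = 0;
                    for (double c : child) size += c;
                    score += (size / total) * (useGini ? gini(child) : entropy(child));
                }
                return score;
            }

            public static void main(String[] args) {
                // Hypothetical split: the parent node has 10 positive and 10 negative examples,
                // the candidate test sends (8 pos, 2 neg) left and (2 pos, 8 neg) right.
                double[] parent = {10, 10};
                double[][] children = {{8, 2}, {2, 8}};
                double infoGain = entropy(parent) - splitScore(children, false);
                double giniGain = gini(parent) - splitScore(children, true);
                System.out.printf("Information gain = %.3f, Gini decrease = %.3f%n", infoGain, giniGain);
            }
        }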
  • Towards a Fully Automated Extraction and Interpretation of Tabular Data Using Machine Learning
    UPTEC F 19050, Examensarbete 30 hp, August 2019
    Towards a fully automated extraction and interpretation of tabular data using machine learning
    Per Hedbrant
    Master Thesis in Engineering Physics, Department of Engineering Sciences, Uppsala University, Sweden

    Abstract
    Motivation: A challenge for researchers at CBCS is the ability to efficiently manage the different data formats that frequently are changed. A significant amount of time is spent on manual pre-processing, converting from one format to another. There are currently no solutions that use pattern recognition to locate and automatically recognise data structures in a spreadsheet.
    Problem definition: The desired solution is to build a self-learning Software-as-a-Service (SaaS) for automated recognition and loading of data stored in arbitrary formats. The aim of this study is threefold: A) Investigate if unsupervised machine learning methods can be used to label different types of cells in spreadsheets. B) Investigate if a hypothesis-generating algorithm can be used to label different types of cells in spreadsheets. C) Advise on choices of architecture and technologies for the SaaS solution.
    Method: A pre-processing framework is built that can read and pre-process any type of spreadsheet into a feature matrix. Different datasets are read and clustered. An investigation on the usefulness of reducing the dimensionality is also done.
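    The abstract describes a pre-processing framework that turns any spreadsheet into a feature matrix before clustering the cells. As a rough Java analogue only (the thesis itself does not specify these features), the sketch below reads a CSV export and derives a few simple per-cell features of the kind such a framework might use; the feature set and the file name sheet.csv are assumptions.

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Paths;
        import java.util.ArrayList;
        import java.util.List;

        public class CellFeatureExtractor {

            // Map each cell to a small numeric feature vector:
            // [isNumeric, isEmpty, textLength, upperCaseRatio].
            static double[] cellFeatures(String cell) {
                String s = cell.trim();
                double isNumeric = s.matches("-?\\d+(\\.\\d+)?") ? 1 : 0;
                double isEmpty = s.isEmpty() ? 1 : 0;
                double length = s.length();
                long upper = s.chars().filter(Character::isUpperCase).count();
                double upperRatio = s.isEmpty() ? 0 : (double) upper / s.length();
                return new double[] {isNumeric, isEmpty, length, upperRatio};
            }

            // Read a comma-separated spreadsheet export into a per-cell feature matrix.
            static List<double[]> featureMatrix(String csvPath) throws IOException {
                List<double[]> cells = new ArrayList<>();
                for (String line : Files.readAllLines(Paths.get(csvPath))) {
                    for (String cell : line.split(",", -1)) {
                        cells.add(cellFeatures(cell));
                    }
                }
                return cells; // one feature vector per cell, ready for clustering
            }

            public static void main(String[] args) throws IOException {
                // "sheet.csv" is a placeholder path.
                System.out.println(featureMatrix("sheet.csv").size() + " cells featurized");
            }
        }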
  • Advanced Data Mining with Weka (Class 5)
    Advanced Data Mining with Weka – Class 5, Lesson 1: Invoking Python from Weka
    Peter Reutemann, Department of Computer Science, University of Waikato, New Zealand – weka.waikato.ac.nz

    Course outline:
    • Class 1 – Time series forecasting
    • Class 2 – Data stream mining in Weka and MOA
    • Class 3 – Interfacing to R and other data mining packages
    • Class 4 – Distributed processing with Apache Spark
    • Class 5 – Scripting Weka in Python (Lesson 5.1 Invoking Python from Weka; Lesson 5.2 Building models; Lesson 5.3 Visualization; Lesson 5.4 Invoking Weka from Python; Lesson 5.5 A challenge, and some Groovy; Lesson 5.6 Course summary)

    Lesson 5.1: Invoking Python from Weka – Scripting
    Pros: a script captures preprocessing, modeling, evaluation, etc.; write the script once, run it multiple times; easy to create variants to test theories; no compilation involved, unlike with Java.
    Cons: programming is involved; you need to familiarize yourself with the APIs of libraries; writing code is slower than clicking in the GUI.

    Scripting languages:
    • Jython (https://docs.python.org/2/tutorial/) – pure-Java implementation of Python 2.7; runs in the JVM; access to all Java libraries on the CLASSPATH; only pure-Python libraries can be used.
    • Python – invoking Weka from Python 2.7; access to the full Python library ecosystem.
    • Groovy, briefly (http://www.groovy-lang.org/documentation.html) – Java-like syntax; runs in the JVM; access to all Java libraries on the CLASSPATH.

    Java vs Python
    Java:
      public class Blah {
        public static void main(String[]
    Output:
      1: Hello WekaMOOC!
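    The slide contrasts Java's verbosity with Python scripting; because Jython runs on the JVM, such a script ultimately calls the ordinary Weka Java API. The hedged Java sketch below (not taken from the MOOC material) shows the kind of load-and-evaluate workflow a script typically captures; the dataset path data/iris.arff is a placeholder.

        import java.util.Random;
        import weka.classifiers.Evaluation;
        import weka.classifiers.trees.J48;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;

        public class ScriptedEvaluation {
            public static void main(String[] args) throws Exception {
                // Load a dataset; the last attribute is assumed to be the class.
                Instances data = DataSource.read("data/iris.arff"); // placeholder path
                data.setClassIndex(data.numAttributes() - 1);

                // Build and cross-validate a J48 decision tree, as a script would.
                J48 tree = new J48();
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(tree, data, 10, new Random(1));
                System.out.println(eval.toSummaryString("10-fold cross-validation of J48", false));
            }
        }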
  • Introduction to Weka
    Introduction to Weka
    Overview: What is Weka? Where to find Weka? Command line vs GUI. Datasets in Weka. ARFF files. Classifiers in Weka. Filters.

    What is Weka? Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

    Where to find Weka:
    • Weka website (latest version 3.6): http://www.cs.waikato.ac.nz/ml/weka/
    • Weka manual: http://transact.dl.sourceforge.net/sourceforge/weka/WekaManual-3.6.0.pdf

    CLI vs GUI:
    • CLI – recommended for in-depth usage; offers some functionality not available via the GUI.
    • GUI – Explorer, Experimenter, Knowledge Flow.

    Datasets in Weka: each entry in a dataset is an instance of the Java class weka.core.Instance. Each instance consists of a number of attributes. Attribute types:
    • Nominal: one of a predefined list of values, e.g. red, green, blue
    • Numeric: a real or integer number
    • String: enclosed in "double quotes"
    • Date
    • Relational

    ARFF files: the external representation of an Instances class. An ARFF file consists of a header (describing the attribute types) and a data section (a comma-separated list of data). An example file shows the dataset name, comments, attributes, the target/class variable and the data values.
    Assignment ARFF files: Credit-g, Heart-c, Hepatitis, Vowel, Zoo – http://www.cs.auckland.ac.nz/~pat/weka/
    Basic statistics and validation can be obtained by running:
      java weka.core.Instances data/soybean.arff

    Classifiers in Weka: learning algorithms in Weka are derived from the abstract class weka.classifiers.Classifier. A simple classifier is ZeroR: it just determines the most common class (or the median in the case of numeric values), tests how well the class can be predicted without considering other attributes, and can be used as a lower bound on performance.
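    To make the ARFF and classifier concepts above concrete, here is a hedged sketch (not from the slides) that builds a tiny dataset in memory, the programmatic equivalent of an ARFF file, and runs ZeroR as the baseline the slides describe. The attribute names and values are invented, and the code assumes the Weka 3.7+ API.

        import java.util.ArrayList;
        import weka.classifiers.rules.ZeroR;
        import weka.core.Attribute;
        import weka.core.DenseInstance;
        import weka.core.Instances;

        public class ZeroRBaseline {
            public static void main(String[] args) throws Exception {
                // Define two attributes: a numeric feature and a nominal class (yes/no).
                ArrayList<String> classValues = new ArrayList<>();
                classValues.add("yes");
                classValues.add("no");
                ArrayList<Attribute> attrs = new ArrayList<>();
                attrs.add(new Attribute("temperature"));        // numeric attribute
                attrs.add(new Attribute("play", classValues));  // nominal class attribute

                // Create the dataset (the in-memory equivalent of an ARFF file).
                Instances data = new Instances("toy", attrs, 4);
                data.setClassIndex(1);
                double[][] rows = { {30, 0}, {25, 0}, {18, 1}, {28, 0} }; // 0 = yes, 1 = no
                for (double[] row : rows) {
                    data.add(new DenseInstance(1.0, row));
                }

                // ZeroR predicts the most common class, a lower bound on performance.
                ZeroR baseline = new ZeroR();
                baseline.buildClassifier(data);
                double prediction = baseline.classifyInstance(data.instance(0));
                System.out.println("ZeroR predicts: " + data.classAttribute().value((int) prediction));
            }
        }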
  • Mathematica Document
    Mathematica Project: Exploratory Data Analysis on 'Data Scientists'
    A big-picture view of the state of data scientists and machine learning engineers. [A table of survey fields appears here in the original notebook, but its characters did not survive extraction.]
    In this Mathematica project, we will explore the capabilities of Mathematica to better understand the state of data science enthusiasts. The dataset, consisting of more than 10,000 rows, is obtained from Kaggle and is the result of the 'Kaggle Survey 2017'. We will explore various capabilities of Mathematica in data analysis and data visualization. Further, we will utilize machine learning techniques to train models and classify features with several algorithms, such as Nearest Neighbors and Random Forest. Dataset: https://www.kaggle.com/kaggle/kaggle
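    The project classifies survey respondents with algorithms such as Nearest Neighbors and Random Forest. The original work uses Mathematica, so the Java/Weka sketch below is only an illustrative analogue; the file name kaggle_survey_2017.arff and the choice of the last attribute as the class are assumptions.

        import java.util.Random;
        import weka.classifiers.Evaluation;
        import weka.classifiers.lazy.IBk;
        import weka.classifiers.trees.RandomForest;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;

        public class SurveyClassifiers {
            public static void main(String[] args) throws Exception {
                // Placeholder dataset: survey responses with a nominal class as the last attribute.
                Instances data = DataSource.read("kaggle_survey_2017.arff");
                data.setClassIndex(data.numAttributes() - 1);

                // k-nearest neighbours (IBk) with k = 5.
                IBk knn = new IBk(5);
                Evaluation knnEval = new Evaluation(data);
                knnEval.crossValidateModel(knn, data, 10, new Random(1));

                // Random forest with default settings.
                RandomForest forest = new RandomForest();
                Evaluation rfEval = new Evaluation(data);
                rfEval.crossValidateModel(forest, data, 10, new Random(1));

                System.out.printf("IBk accuracy: %.2f%%%n", knnEval.pctCorrect());
                System.out.printf("RandomForest accuracy: %.2f%%%n", rfEval.pctCorrect());
            }
        }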
  • WEKA: the Waikato Environment for Knowledge Analysis
    WEKA: The Waikato Environment for Knowledge Analysis
    Stephen R. Garner, Department of Computer Science, University of Waikato, Hamilton.

    ABSTRACT: WEKA is a workbench designed to aid in the application of machine learning technology to real world data sets, in particular, data sets from New Zealand's agricultural sector. In order to do this a range of machine learning techniques are presented to the user in such a way as to hide the idiosyncrasies of input and output formats, as well as allow an exploratory approach in applying the technology. The system presented is a component based one that also has application in machine learning research and education.

    1. Introduction
    The WEKA machine learning workbench has grown out of the need to be able to apply machine learning to real world data sets in a way that promotes a "what if?…" or exploratory approach. Each machine learning algorithm implementation requires the data to be present in its own format, and has its own way of specifying parameters and output. The WEKA system was designed to bring a range of machine learning techniques or schemes under a common interface so that they may be easily applied to this data in a consistent method. This interface should be flexible enough to encourage the addition of new schemes, and simple enough that users need only concern themselves with the selection of features in the data for analysis and what the output means, rather than how to use a machine learning scheme. The WEKA system has been used over the past year to work with a variety of agricultural data sets in order to try to answer a range of questions.
  • Artificial Intelligence for the Prediction of Exhaust Back Pressure Effect on the Performance of Diesel Engines
    applied sciences – Article
    Artificial Intelligence for the Prediction of Exhaust Back Pressure Effect on the Performance of Diesel Engines
    Vlad Fernoaga 1, Venetia Sandu 2,* and Titus Balan 1
    1 Faculty of Electrical Engineering and Computer Science, Transilvania University, Brasov 500036, Romania; [email protected] (V.F.); [email protected] (T.B.)
    2 Faculty of Mechanical Engineering, Transilvania University, Brasov 500036, Romania
    * Correspondence: [email protected]; Tel.: +40-268-474761
    Received: 11 September 2020; Accepted: 15 October 2020; Published: 21 October 2020

    Featured Application: This paper presents solutions for smart devices with embedded artificial intelligence dedicated to vehicular applications. Virtual sensors based on neural networks or regressors provide real-time predictions of the diesel engine power loss caused by increased exhaust back pressure.

    Abstract: The actual trade-off between engine emissions and performance requires detailed investigations into exhaust system configurations. Correlations among engine data acquired by sensors are susceptible to artificial intelligence (AI)-driven performance assessment. The influence of exhaust back pressure (EBP) on engine performance, mainly on effective power, was investigated on a turbocharged diesel engine tested on an instrumented dynamometric test-bench. The EBP was externally applied at steady-state operation modes defined by speed and load. A complete dataset was collected to supply the statistical analysis and machine learning phases—the training and testing of all the AI solutions developed in order to predict the effective power. By extending the cloud-/edge-computing model with the cloud AI/edge AI paradigm, comprehensive research was conducted on the algorithms and software frameworks most suited to vehicular smart devices.
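    The paper builds virtual sensors (neural networks or regressors) that predict effective power from engine operating data. The excerpt does not name the authors' toolchain, so the Weka-based Java sketch below is only an illustration of that idea; the dataset engine_ebp.arff, its attribute names and the network settings are assumptions.

        import java.util.Random;
        import weka.classifiers.Evaluation;
        import weka.classifiers.functions.LinearRegression;
        import weka.classifiers.functions.MultilayerPerceptron;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;

        public class PowerLossRegression {
            public static void main(String[] args) throws Exception {
                // Placeholder ARFF: numeric attributes speed, load, ebp and the target effective_power.
                Instances data = DataSource.read("engine_ebp.arff");
                data.setClassIndex(data.attribute("effective_power").index());

                // Baseline: ordinary linear regression.
                LinearRegression lr = new LinearRegression();
                Evaluation lrEval = new Evaluation(data);
                lrEval.crossValidateModel(lr, data, 10, new Random(1));

                // Neural-network "virtual sensor": a small multilayer perceptron.
                MultilayerPerceptron mlp = new MultilayerPerceptron();
                mlp.setHiddenLayers("5");   // one hidden layer with 5 neurons (assumed setting)
                mlp.setTrainingTime(500);   // number of training epochs
                Evaluation mlpEval = new Evaluation(data);
                mlpEval.crossValidateModel(mlp, data, 10, new Random(1));

                System.out.printf("LinearRegression RMSE: %.3f%n", lrEval.rootMeanSquaredError());
                System.out.printf("MultilayerPerceptron RMSE: %.3f%n", mlpEval.rootMeanSquaredError());
            }
        }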
  • Statistical Software
    Statistical Software
    A. Grant Schissler, Hung Nguyen, Tin Nguyen, Juli Petereit, Vincent Gardeux

    Keywords: statistical software, open source, Big Data, visualization, parallel computing, machine learning, Bayesian modeling

    Abstract: This article discusses selected statistical software, aiming to help readers find the right tool for their needs. We categorize software into three classes: Statistical Packages, Analysis Packages with Statistical Libraries, and General Programming Languages with Statistical Libraries. Statistical and analysis packages often provide interactive, easy-to-use workflows while general programming languages are built for speed and optimization. We emphasize each software's defining characteristics and discuss trends in popularity. The concluding sections muse on the future of statistical software including the impact of Big Data and the advantages of open-source languages.

    This article discusses selected statistical software, aiming to help readers find the right tool for their needs (not provide an exhaustive list). Also, we acknowledge our experiences bias the discussion towards software employed in scholarly work. Throughout, we emphasize the software's capacity to analyze large, complex data sets ("Big Data"). The concluding sections muse on the future of statistical software. To aid in the discussion, we classify software into three groups: (1) Statistical Packages, (2) Analysis Packages with Statistical Libraries, and (3) General Programming Languages with Statistical Libraries. This structure
  • Frequent Itemset Mining Using INC_MINE in Massive Online Analysis Framework
    Available online at www.sciencedirect.com – ScienceDirect, Procedia Computer Science 45 (2015) 133–142
    International Conference on Advanced Computing Technologies and Applications (ICACTA-2015)
    Frequent Itemset Mining using INC_MINE in the Massive Online Analysis Framework
    Prof. Dr. P.K. Srimani (Former Chairman and Director, R & D, Bangalore University, Karnataka, India)
    Mrs. Malini M. Patil* (Assistant Professor, Dept. of ISE, J.S.S. Academy of Technical Education, Bangalore-560060, Karnataka, India; Research Scholar, Bharathiar University, Coimbatore, Tamil Nadu)

    Abstract: Frequent pattern mining is one of the major data mining techniques, which has been exhaustively studied in the past decade. Technological advancements have resulted in huge data generation with an increased rate of data distribution. The generated data is called a 'data stream'. Data streams can be mined only by using sophisticated techniques. The paper aims at carrying out frequent pattern mining on data streams. Stream mining has great challenges due to high memory usage and computational costs. The Massive Online Analysis framework is the software environment used to perform frequent pattern mining with the INC_MINE algorithm. The algorithm uses the method of closed frequent mining. The datasets used in the analysis are the Electricity dataset and the Airline dataset. The authors also generated their own dataset, OUR-GENERATOR, for the purpose of analysis, and the results are found interesting. In the experiments five samples of instance sizes (10000, 15000, 25000, 35000, 50000) are used with varying minimum support and window sizes for determining closed frequent itemsets and semi-frequent closed itemsets respectively. The present work establishes that association rule mining can be performed even in the case of data stream mining by the INC_MINE algorithm by generating closed frequent itemsets, which is the first of its kind in the literature.
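    The abstract describes mining closed frequent itemsets from transaction streams under varying minimum support and window sizes. The INC_MINE interface itself is not shown in this excerpt, so the self-contained Java sketch below only illustrates the underlying idea of support counting over a sliding window of transactions (for single items, the first step of any frequent-itemset miner); all class, item and threshold names are invented.

        import java.util.ArrayDeque;
        import java.util.Deque;
        import java.util.HashMap;
        import java.util.Map;
        import java.util.Set;
        import java.util.TreeSet;

        public class WindowedSupportCounter {
            private final int windowSize;
            private final Deque<Set<String>> window = new ArrayDeque<>();
            private final Map<String, Integer> itemCounts = new HashMap<>();

            public WindowedSupportCounter(int windowSize) {
                this.windowSize = windowSize;
            }

            // Add a transaction to the window, evicting the oldest one if the window is full.
            public void addTransaction(Set<String> items) {
                window.addLast(items);
                for (String item : items) itemCounts.merge(item, 1, Integer::sum);
                if (window.size() > windowSize) {
                    for (String item : window.removeFirst()) itemCounts.merge(item, -1, Integer::sum);
                }
            }

            // Items whose support in the current window meets the minimum support threshold.
            public Set<String> frequentItems(double minSupport) {
                Set<String> frequent = new TreeSet<>();
                for (Map.Entry<String, Integer> e : itemCounts.entrySet()) {
                    if ((double) e.getValue() / window.size() >= minSupport) frequent.add(e.getKey());
                }
                return frequent;
            }

            public static void main(String[] args) {
                WindowedSupportCounter counter = new WindowedSupportCounter(3);
                counter.addTransaction(Set.of("bread", "milk"));
                counter.addTransaction(Set.of("bread", "butter"));
                counter.addTransaction(Set.of("milk", "butter", "bread"));
                // With minimum support 0.6, only items in at least 2 of the 3 transactions qualify.
                System.out.println(counter.frequentItems(0.6));
            }
        }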
  • MLJ: a Julia Package for Composable Machine Learning
    MLJ: A Julia package for composable machine learning
    Anthony D. Blaom 1,2,3, Franz Kiraly 3,4, Thibaut Lienart 3, Yiannis Simillides 7, Diego Arenas 6, and Sebastian J. Vollmer 3,5
    1 University of Auckland, New Zealand; 2 New Zealand eScience Infrastructure, New Zealand; 3 Alan Turing Institute, London, United Kingdom; 4 University College London, United Kingdom; 5 University of Warwick, United Kingdom; 6 University of St Andrews, St Andrews, United Kingdom; 7 Imperial College London, United Kingdom
    DOI: 10.21105/joss.02704. Submitted: 21 July 2020; published: 07 November 2020. Editor: Yuan Tang; reviewers: @degleris1, @henrykironde. License: authors of papers retain copyright and release the work.

    Introduction
    Statistical modeling, and the building of complex modeling pipelines, is a cornerstone of modern data science. Most experienced data scientists rely on high-level open source modeling toolboxes - such as scikit-learn (Buitinck et al., 2013; Pedregosa et al., 2011) (Python); Weka (Holmes et al., 1994) (Java); mlr (Bischl et al., 2016) and caret (Kuhn, 2008) (R) - for quick blueprinting, testing, and creation of deployment-ready models. They do this by providing a common interface to atomic components, from an ever-growing model zoo, and by providing the means to incorporate these into complex work-flows. Practitioners want to build increasingly sophisticated composite models, as exemplified in the strategies of top contestants in machine learning competitions such as Kaggle.
    MLJ (Machine Learning in Julia) (A. Blaom, 2020b) is a toolbox written in Julia that provides a common interface and meta-algorithms for selecting, tuning, evaluating, composing and comparing machine learning model implementations written in Julia and other languages.
  • Performance Analysis of Hoeffding Trees in Data Streams by Using Massive Online Analysis Framework
    Int. J. Data Mining, Modelling and Management, Vol. 7, No. 4, 2015, 293
    Performance analysis of Hoeffding trees in data streams by using massive online analysis framework
    P.K. Srimani, R & D Division, Bangalore University, Jnana Bharathi, Mysore Road, Bangalore-560056, Karnataka, India. Email: [email protected]
    Malini M. Patil*, Department of Information Science and Engineering, J.S.S. Academy of Technical Education, Uttaralli-Kengeri Main Road, Mylasandra, Bangalore-560060, Karnataka, India. Email: [email protected] (*corresponding author)

    Abstract: The present work is mainly concerned with understanding the problem of classification, from the data stream perspective, on evolving streams using the massive online analysis framework with regard to different Hoeffding trees. Advancement of technology in both hardware and software has led to the rapid storage of data in huge volumes. Such data is referred to as a data stream. Traditional data mining methods are not capable of handling data streams because of the ubiquitous nature of data streams. The challenging task is how to store, analyse and visualise such large volumes of data. Massive data mining is a solution for these challenges. In the present analysis five different Hoeffding trees are used on the available eight dataset generators of the massive online analysis framework, and the results predict that the STAGGER generator happens to be the best performer for the different classifiers.

    Keywords: data mining; data streams; static streams; evolving streams; Hoeffding trees; classification; supervised learning; massive online analysis; MOA; framework; massive data mining; MDM; dataset generators.
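    The study trains Hoeffding trees on instances produced by MOA's synthetic stream generators. The sketch below follows the pattern of the public MOA tutorials for a prequential (test-then-train) loop, but it is only an assumed illustration: the exact class and method names, in particular nextInstance().getData(), vary between MOA releases and should be checked against the version in use.

        import com.yahoo.labs.samoa.instances.Instance;
        import moa.classifiers.trees.HoeffdingTree;
        import moa.streams.generators.RandomRBFGenerator;

        public class HoeffdingTreeOnStream {
            public static void main(String[] args) {
                // Synthetic stream generator (one of MOA's built-in generators).
                RandomRBFGenerator stream = new RandomRBFGenerator();
                stream.prepareForUse();

                // Hoeffding tree learner, told what the stream's attributes look like.
                HoeffdingTree learner = new HoeffdingTree();
                learner.setModelContext(stream.getHeader());
                learner.prepareForUse();

                // Prequential evaluation: test on each instance, then train on it.
                int numInstances = 100_000;
                int correct = 0;
                for (int i = 0; i < numInstances; i++) {
                    Instance inst = stream.nextInstance().getData(); // older MOA returns the Instance directly
                    if (learner.correctlyClassifies(inst)) {
                        correct++;
                    }
                    learner.trainOnInstance(inst);
                }
                System.out.printf("Prequential accuracy: %.2f%%%n", 100.0 * correct / numInstances);
            }
        }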
  • Massive Online Analysis, a Framework for Stream Classification and Clustering
    MOA: Massive Online Analysis, a Framework for Stream Classification and Clustering
    Albert Bifet 1, Geoff Holmes 1, Bernhard Pfahringer 1, Philipp Kranen 2, Hardy Kremer 2, Timm Jansen 2, and Thomas Seidl 2
    1 Department of Computer Science, University of Waikato, Hamilton, New Zealand
    2 Data Management and Exploration Group, RWTH Aachen University, Germany

    Abstract. In today's applications, massive, evolving data streams are ubiquitous. Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA is designed to deal with the challenging problems of scaling up the implementation of state-of-the-art algorithms to real-world dataset sizes and of making algorithms comparable in benchmark streaming settings. It contains a collection of offline and online algorithms for both classification and clustering as well as tools for evaluation. Researchers benefit from MOA by getting insights into the workings and problems of different approaches, while practitioners can easily compare several algorithms and apply them to real-world data sets and settings. MOA supports bi-directional interaction with WEKA, the Waikato Environment for Knowledge Analysis, and is released under the GNU GPL license. Besides providing algorithms and measures for evaluation and comparison, MOA is easily extensible with new contributions and allows the creation of benchmark scenarios through storing and sharing setting files.

    1 Introduction
    Nowadays data is generated at an increasing rate from sensor applications, measurements in network monitoring and traffic management, log records or clickstreams in web exploring, manufacturing processes, call detail records, email, blogging, twitter posts and others.