Supervised Machine Learning
Total Page:16
File Type:pdf, Size:1020Kb
Tikaram Acharya Supervised Machine Learning Accuracy Comparison of Different Algorithms on Sample Data Metropolia University of Applied Sciences Bachelor of Engineering Information Technology Bachelor’s Thesis 11 January 2021 Author Tikaram Acharya Title Supervised Machine Learning: Accuracy Comparison of Differ- ent Algorithms on Sample Data Number of Pages 39 pages Date 11 January 2021 Degree Bachelor of Engineering Degree Programme Information Technology Professional Major Software Engineering Instructors Janne Salonen, Head of School The purpose of the final year project was to discuss the theoretical concepts of the analyt- ical features of supervised machine learning. The concepts were illustrated with sample data selected from the open-source platform. The primary objectives of the thesis include explaining the concept of machine learning, presenting supportive libraries, and demon- strating their application or utilization on real data. Finally, the aim was to use machine learning to determine whether a person would pay back their loan or not. The problem was taken as a classification problem. To accomplish the goals, data was collected from an open data source community called Kaggle. The data was downloaded and processed with Python. The complete practical part or machine learning workflow was done in Python and Python’s libraries such as NumPy, Pandas, Matplotlib, Seaborn, and SciPy. For the given problem on sample data, five predictive models were constructed and tested with machine learning with different techniques and combinations. The results were highly accurately matched with the classification problem after evaluating their f1-score, confu- sion matrix, and Jaccard similarity score. Indeed, it is clear that the project achieved the main defined objectives, and the necessary architecture has been set up to apply various analytical abilities into the project. Keywords Python, NumPy, Scikit-Learn, Matplotlib, Predictive Analysis Contents List of Abbreviations 1 Introduction 1 2 Introduction Machine Learning 3 2.1 Concepts and Models 3 2.2 Business Understanding 4 2.3 Importance 4 3 Theory of Machine Learning 6 3.1 History of Machine Learning 6 3.2 Supervised and Unsupervised Learning 8 3.3 Classification and Regression 8 3.4 Types of Models 10 3.5 Machine Learning Algorithms 12 3.5.1 Support Vector Machine 12 3.5.2 K-nearest Neighbor 12 3.5.3 Random Forest Classifier 13 3.5.4 Decision Tree 14 3.5.5 Logistic Regression 14 4 Methods and Materials 15 4.1 Tools and Technologies 15 4.1.1 Python 15 4.1.2 Python Libraries 16 4.2 Data and Data Source 17 4.3 Data Implementation 20 4.3.1 Data Preparation 20 4.3.2 Loading the data 20 4.3.3 Descriptive Analysis 21 4.3.4 Categorical Features 22 4.3.5 Feature Matrix and Target Array 24 4.3.6 Data Splitting 26 4.4 Model Development 26 4.4.1 Hyperparameter Selection 27 4.4.2 Creating the Models 28 4.4.3 Fitting the Models 29 4.4.4 Model Prediction 30 4.5 Model Evaluation 30 5 Results and Discussion 34 6 Conclusion 36 References 37 List of Abbreviations FDIC Federal Deposit Insurance KNN K Nearest Neighbors IPO Initial Public Offering CSV Comma-separated Value PC Personal Computer UCI Unique Client Identifier SVM Support Vector Machine 1 1 Introduction LendingClub is an American club, the headquarters being located in San Francisco, Cal- ifornia. LendingClub connects people who needed money (borrower) with people who want to give money as a loan (investor). LendingClub provides services to borrowers to create personal unsecured loans between $1,000 and $40,000. Loan lenders can find loan listings on lendinclub.com and set loan plans to invest based on supplied infor- mation about the loan taker, the purpose of the loan, credit score of borrowers, the in- come of the loan taker (monthly or yearly), and the duration of the loan (number of years). During the transaction or loan completion period, the lending club collects money by charging borrowers an organization fee and investors a service fee. On the other hand, LendingClub also decides the customer's loans for automobile transactions through WebBank, an FDIC-insured and state-chartered industrial bank. The loans are not granted by investors but are assigned to other financial institutions. [1.] LendingClub is the most promising company in the United States and the club became the largest technology IPO of 2014 in the US [1.]. In some cases, it may be a risk to lend money to people without reviewing the history of security during the transaction comple- tion and clearance period. A sample dataset of LendingClub’s transaction period from 2007-2010 was taken from an open-source platform called Kaggle. The practical part of the final project completely focuses on the potentiality of people and whether they pay back their loan fully or not. Investors want to give money to people who will most likely pay back their loans. The final year project was carried out to show how machine learning algorithms and methods can be applied in the case of LendingClub and a real-life scenario. The Python language was used in the project. The goal of the project was to build predictive machine learning models, which are described in section 3.4. Section 2 details the concept and discusses the importance of machine learning. The section also explains the business approach linked to machine learning. Section 3 is very important as it thoroughly pre- sents machine learning from a theoretical viewpoint. The section also illustrates the types of machine learning that exist and the types of models that are used to describe the accuracy of the predictive models made. In contrast, section 4 illustrates the practical 2 part of a machine learning workflow. Moreover, chapter 4 also explains the data and data source, data implementation, data handling, data preparation, model formation, and model evaluation with a proper theoretical and practical explanation. Finally, in the last two sections, the results are discussed in detail, and conclusions are drawn in sequential order. 3 2 Introduction Machine Learning Machine learning is a subset of Artificial Intelligence related to computational statistics. Machine learning does the same task as humans do. For example, earlier some emails had to be marked as spam but nowadays some email is directly marked as spam. The reason behind the case is that the application of machine learning has already classified some emails as spam based on their content structure. Machine learning consists of algorithms and methods to teach computers to understand the features of objects. Then computers build predictive models, prediction, and degree of performance. In another way, machine learning is one of the most popular fields of computer science which sup- ports computer programs to analyze large amounts of data using different machine learn- ing models to generate a prediction on given problems. Machine learning has also been implemented in large companies such as Netflix. Machine learning is powerful and is easily available to many users to use. [2.] In any datasets, machine learning plays a vital role in analyzing and interpreting the pat- terns and structures to enable user learning, reasoning, and decision making without conducting any human interaction. Machine learning allows the user to feed algorithms on a large amount of data to start computer analysis and make decisions or recommen- dations based on input. The algorithm incorporates the information to improve its results in case of a correction is identified. [2.] The thesis will showcase three simple methods on implementing when it comes to utiliz- ing machine learning. This section describes the process and how more businesses be- sides technological companies could gain potential benefits, supporting the growth of a business, the concept of models, and importance. 2.1 Concepts and Models Machine learning has a big impact on today’s world. As it is known that, machine learning is a field of artificial intelligence. It deals with large numbers of methods that discover 4 relationships between data by identifying and classifying them into different sets of cat- egories. For example, it is possible to find the suitable price of a certain house depending on the price of other houses with various values affecting features such as number of rooms, area of rooms, number of floors, and location by applying machine learning con- cepts. Machine learning concepts mainly focus on the prediction of certain possible events by using appropriate algorithms. Once data is fed on algorithms, then it is useful for prediction as a mathematical model. Certain parts of data known as training data are used to make mathematical models and the remaining parts of data known as testing data are used for the predictions. Meanwhile, the possible results of the specific problem can be found. [3.] 2.2 Business Understanding Machine learning algorithms find the meaningful hidden insight or pattern from the given data to solve complex business problems. Machine learning algorithms take data as in- put and make it possible to find various kinds of patterns or ideas without using a pro- gramming language. Machine learning is evolving rapidly and being run by new compu- ting tools and technologies. Machine learning has given so many advantages to the business field. For example, machine learning helps to increase business scalability and business operation across the globe. Artificial intelligence and machine learning have huge popularity in the field of the business analytics community. Business sectors may have different factors affecting their business such as growing volumes of a certain product, easy accessibility of data, cheaper and faster computational processing, and affordable data storage. With the full understanding of the business use of machine learning, any organization can implement it in its process.