Machine Learning Models to Predict House Prices Based on Home Features
Venkat Shiva Pandiri
CALIFORNIA STATE UNIVERSITY SAN MARCOS

PROJECT SIGNATURE PAGE

PROJECT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE MASTER OF SCIENCE IN COMPUTER SCIENCE

PROJECT TITLE: MACHINE LEARNING MODELS TO PREDICT HOUSE PRICES BASED ON HOME FEATURES

AUTHOR: Venkat Shiva Pandiri

DATE OF SUCCESSFUL DEFENSE: 07/21/2017

THE PROJECT HAS BEEN ACCEPTED BY THE PROJECT COMMITTEE IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE.

Dr. Xiaoyu Zhang, PROJECT COMMITTEE CHAIR
Dr. Xin Ye, PROJECT COMMITTEE MEMBER

Contents

1 Introduction
1.1 Dataset
2 Related Work
2.1 The applications for real estate with Machine Learning Technology
2.1.1 The fraud finding methodology in Zillow is as follows:
2.2 Property Analysts and Economists
2.3 Machine Learning Algorithms
2.3.1 Multiple linear regression
2.3.2 Random forest Regression:
2.3.3 Polynomial Regression
3 Data Preprocessing
3.1 Importing the libraries
3.2 Getting the dataset
3.3 Handling Missing data
3.4 Encoding categorical Data
4 Methods and implementations
4.1 Steps for building model
4.1.1 All in (not a technical word):
4.1.2 Feature Selection:
4.1.3 Pearson Correlation Test
4.1.4 Outliers
4.1.5 Multicollinearity test
4.1.6 Backward elimination:
4.2 K-Fold cross validation.
Problems with splitting dataset into training and testing sets
5 Results
5.1 Multiple linear regression data Analysis and results:
5.1.1 Analysis of the Best Regression Equations
5.1.2 Conclusion for MLR
5.2 Random forest regression
5.2.1 Using 1 tree with all variables
5.2.2 Using 100 trees with all variables:
5.2.3 Using 500 trees with all variables
5.2.4 After backward elimination process
5.2.5 1 tree with statistically significant variables:
5.2.6 100 tree with statistically significant variables:
5.2.7 500 trees with statistically significant variables:
5.2.8 Conclusion for Random forest regression:
5.3 Polynomial regression data Analysis and results:
5.3.1 Conclusion for Polynomial regression:
5.4 Comparison between the Algorithms:
5.5 Comparison of models with other competitors:
6 Conclusion
7 References:
8 Appendices

Abstract

This project carried out a systematic investigation to predict the final price of each home using machine learning techniques. Multiple linear regression (the base model), random forest regression, and polynomial regression were applied to the dataset and their results compared. The data describes the sale of individual properties in Ames, Iowa, from 2006 to 2010, along with various features and details of each home. The dataset comprises 80 explanatory variables: 23 nominal, 23 ordinal, 14 discrete, and 20 continuous. The programs were implemented in Python using core libraries such as pandas, scikit-learn, and NumPy. A backward elimination algorithm was applied to build an optimal model and select features from over 270 independent variables with approximately 791,320 observations. K-fold cross validation was used to measure the performance of all the models. High R-squared values with low variance were recorded for the linear models. To select a good prediction model, all the regression models were explored and compared with each other. Results from K-fold cross validation indicate high R-squared values for MLR and random forest, suggesting a high level of performance when applied to an actual test set.
Each model was evaluated with the Kaggle score checker, where a lower score is better. My random forest model achieved a score of 0.14696, which is better than both my base model, multiple linear regression (Kaggle score 0.16854), and polynomial regression (Kaggle score 0.24399).

1 Introduction

If you ask a random home buyer to describe their dream house, chances are high that the description will not begin with aspects such as the height of the basement ceiling or the proximity to a commercial building. Thousands of people seek to place their homes on the market at a reasonable price. Generally, assessors apply their experience and common knowledge to appraise a home based on characteristics such as its location, amenities, and dimensions. Regression analysis offers another approach, one that can provide better price estimates with reliable predictions. Better still, assessor experience can help guide the modeling process to fine-tune a final predictive model. Such a model therefore helps both home buyers and home sellers. The data for this project comes from an ongoing competition hosted by Kaggle.com [1]. The competition dataset furnishes a good deal of information on factors that influence price negotiations beyond the obvious features of a home. This dataset also supports advanced machine learning techniques such as random forests and gradient boosting.

1.1 Dataset

The dataset comprises 80 explanatory variables that comprehensively describe the features of residential homes in Ames, Iowa, from 2006 to 2010 [2]. The final goal of the project is to predict the final price of each home through thorough analysis of the dataset. The dataset comprises 2920 observations and a wide range of explanatory