CALIFORNIA STATE UNIVERSITY SAN MARCOS

PROJECT SIGNATURE PAGE

PROJECT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE

MASTER OF SCIENCE

IN

COMPUTER SCIENCE

PROJECT TITLE: MACHINE LEARNING MODELS TO PREDICT HOUSE PRICES BASED ON HOME FEATURES

AUTHOR: Venkat Shiva Pandiri

DATE OF SUCCESSFUL DEFENSE: 07/21/2017

THE PROJECT HAS BEEN ACCEPTED BY THE PROJECT COMMITTEE IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE.

Dr. Xiaoyu Zhang, PROJECT COMMITTEE CHAIR (SIGNATURE, DATE)

Dr. Xin Ye, PROJECT COMMITTEE MEMBER (SIGNATURE, DATE)

Machine Learning Models to Predict House Prices based on Home Features Venkat Shiva Pandiri

California State University San Marcos


Contents

1 Introduction ...... 5

1.1 Dataset ...... 5

2 Related Work ...... 7

2.1 The applications for real estate with Machine Learning Technology ...... 7

2.1.1 The fraud finding methodology in Zillow is as follows: ...... 8

2.2 Property Analysts and Economists ...... 9

2.3 Machine Learning Algorithms ...... 10

2.3.1 Multiple linear regression ...... 10

2.3.2 Random forest Regression: ...... 11

2.3.3 Polynomial Regression ...... 12

3 Data Preprocessing...... 12

3.1 Importing the libraries...... 13

3.2 Getting the dataset...... 13

3.3 Handling Missing data ...... 14

3.4 Encoding categorical Data ...... 15

4 Methods and implementations ...... 16

4.1 Steps for building model ...... 16

4.1.1 All in (not a technical word): ...... 16

4.1.2 Feature Selection: ...... 17

4.1.3 Pearson Correlation Test ...... 17

4.1.4 Outliers ...... 19

4.1.5 Multicollinearity test ...... 20

4.1.6 Backward elimination: ...... 23

4.2 K-Fold cross validation ...... 24

Problems with splitting dataset into training and testing sets ...... 24

5 Results ...... 25

5.1 Multiple linear regression data Analysis and results: ...... 25

5.1.1 Analysis of the Best Regression Equations ...... 26

5.1.2 Conclusion for MLR ...... 27

5.2 Random forest regression ...... 28

5.2.1 Using 1 tree with all variables ...... 28

5.2.2 Using 100 trees with all variables: ...... 28

5.2.3 Using 500 trees with all variables ...... 29

5.2.4 After backward elimination process ...... 30

5.2.5 1 tree with statistically significant variables: ...... 30

5.2.6 100 trees with statistically significant variables: ...... 31

5.2.7 500 trees with statistically significant variables: ...... 32

5.2.8 Conclusion for Random forest regression: ...... 33

5.3 Polynomial regression data Analysis and results: ...... 34

5.3.1 Conclusion for Polynomial regression: ...... 36

5.4 Comparison between the Algorithms: ...... 37

5.5 Comparison of models with other competitors: ...... 38

6 Conclusion ...... 43

7 References: ...... 44

8 Appendices ...... 45


Abstract

This project carried out a systematic investigation to predict the final price of each home using machine learning techniques. Several techniques, multiple linear regression (the base model), random forest regression, and polynomial regression, were applied to the dataset and their results compared. The data describe the sale of individual residential properties in Ames, Iowa from 2006 to 2010, with various features and details of each home. The dataset comprises 80 explanatory variables: 23 nominal, 23 ordinal, 14 discrete, and 20 continuous. The programs were implemented in Python using core libraries such as pandas, scikit-learn, and NumPy. A backward elimination algorithm is applied to build the optimal model and select features from over 270 independent variables with approximately 791,320 observations. The K-fold cross validation technique is used to measure the performance of all the models. High R-squared values with low variance are recorded for the linear models. In order to select a good prediction model, all the regression models are explored and compared with each other. Results from K-fold cross validation indicate high R-squared values for MLR and random forest, suggesting a high level of performance when applied to an actual test set. Each model is evaluated with the Kaggle score checker.

My random forest model achieved a score of 0.14696, which is better than my base multiple linear regression model (Kaggle score 0.16854) and the polynomial regression model (Kaggle score 0.24399).

1 Introduction

Ask any random home buyer to describe their dream house, and chances are high that the description will not begin with aspects such as the height of the basement ceiling or the proximity to a commercial building.

Thousands of people seek to place their home on the market with the goal of arriving at a reasonable price. Generally, assessors apply their experience and common knowledge to gauge a home based on characteristics such as its location, amenities, and dimensions. Regression analysis offers another approach that can produce more reliable home price predictions. Better still, assessor experience can help guide the modeling process to fine-tune a final predictive model. Such a model therefore helps both home buyers and home sellers.

The required dataset comes from an ongoing competition hosted by Kaggle.com [1]. The competition dataset furnishes a large amount of information that bears on price negotiations beyond the obvious features of a home, and it lends itself to advanced machine learning techniques such as random forests and gradient boosting.

1.1 Dataset

The dataset comprises 80 explanatory variables that comprehensively describe residential homes sold in Ames, Iowa from 2006 to 2010 [2]. The final goal of the project is to predict the final price of each home through careful analysis of this data. The data set comprises 2920 observations; the train set contains 1461 observations and the test set contains 1460 observations. Of the explanatory variables, 23 are nominal, 23 ordinal, 14 discrete, and 20 continuous, all used for evaluating home values [2].

The variables in the data set focus primarily on the physical characteristics of the residence. The majority of the variables represent exactly the type of information a typical home buyer would want about a prospective home. For example, the buyer will have questions such as: When was it built? How big is the lot?

The 20 continuous variables relate to various area dimensions of each observation [2]. Beyond the lot size and total dwelling square footage found in most common housing listings, the data set quantifies more specific measurements such as linear feet of street connected to the property, total square feet of basement area, and the size of the garage.

The 14 discrete variables typically count the number of items occurring within the house, most of them targeting the number of kitchens, bedrooms, and bathrooms [3]. They also cover the total rooms above grade (which does not include bathrooms), the number of fireplaces, and the size of the garage in car capacity. The dataset also records the month and year in which each property was sold.

The data set contains a large number of categorical variables: 23 nominal and 23 ordinal. They range from 2 to 28 classes, with STREET being the smallest and NEIGHBORHOOD the largest [3]. The nominal variables also capture the type and condition of sale, the location of the garage, central air conditioning details, the type of foundation (brick and tile, cinder block, slab, stone, or wood), and miscellaneous features not covered in other categories [3].

In this project, I performed regression analysis on the dataset using various machine learning techniques. The main goal is to create and compare several machine learning models, built with algorithms such as multiple linear regression, polynomial regression, and random forest, to analyze the relationship between SalePrice and the other 270 independent variables [2].

Data preprocessing is the first step in cleaning the data; it plays a key role in extracting the information the regression analysis needs to give better predictions. After data preprocessing, various regression analyses are performed on the dataset to predict the SalePrice of each home. In this project, I used Python as the programming language, Jupyter and Spyder as IDEs, and Microsoft Power BI for data visualization.

In this project, I first applied multiple linear regression to construct a base model for this problem. Then I developed two additional models using random forest and polynomial regression and compared their results to the base MLR model. The K-fold cross validation technique is used to measure the performance of all the models. Multiple linear regression and random forest regression performed well, whereas polynomial regression gave poor results.

The primary objective of this project is to assess the sale price of houses based on their features. After developing the machine learning models, each model is scored with R-squared: multiple linear regression scored an R-squared of 93.6%, and 98.4% was recorded for random forest regression. Each model is then evaluated with the Kaggle score checker; my random forest model achieved a score of 0.14696, which is better than my base multiple linear regression model and the polynomial regression model. Results are evaluated on the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price [1]. After observing other competitors' kernels in the competition, I noticed that most competitors used XGBoost and lasso algorithms, and I also noted the various data preprocessing techniques they used to build their models.
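To make the evaluation metric concrete, here is a minimal sketch of how that logarithmic RMSE can be computed with NumPy; the function name and the example prices are illustrative, not part of the competition code.

```python
import numpy as np

def log_rmse(predicted_prices, observed_prices):
    """RMSE between the logarithm of predicted and observed sale prices."""
    log_diff = np.log(predicted_prices) - np.log(observed_prices)
    return np.sqrt(np.mean(log_diff ** 2))

# Illustrative check with made-up prices
print(log_rmse(np.array([200000.0, 150000.0]), np.array([210000.0, 140000.0])))
```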

The following sections of this paper first discuss related research on machine learning technologies used in the real estate business; this section also provides background on the machine learning algorithms used in this work. The data preprocessing section is a major part of this report; it describes data cleaning, i.e., how the raw data is transformed into a useful, understandable format, covering importing the necessary libraries, dealing with missing values, encoding categorical variables, and related methodologies. The methods and implementations section then focuses on improving the model using various techniques: the data is explored further, unwanted variables are eliminated using backward elimination, the Pearson correlation test, and multicollinearity tests, and the k-fold cross validation technique is used to evaluate the performance of the models. The results section contains the analysis part of the project, with a brief conclusion for every algorithm used; it provides a good understanding of the results and the performance of each model in predicting the sale price. Finally, the conclusion is discussed, followed by future work and the data distribution.

2 Related Work

2.1 The applications for real estate with Machine Learning Technology

Machine learning technologies have brought a scientific revolution to business. They have unlocked unique capabilities in banking, software, medicine, real estate, and many other industries, helping them grow their business and satisfy their clients.

Many top real estate websites use machine learning technologies to accurately predict the value of every piece of real estate property and delight their customers. Adopting and integrating machine learning technologies has improved the customer home-buying experience and helped sellers prepare and optimize their homes for sale.

Zillow is the largest real estate site on the web, with information about more than 110 million U.S. homes; 4 out of 5 U.S. homes have been viewed on Zillow [14]. Zillow's senior director of data science and engineering says the company saw rapid growth in its business after introducing 'Zestimate', one of the ways Zillow uses machine learning [14]. 'Zestimate', their first home valuation model, comprises hundreds of machine learning models and predicts the valuation of every single home in the country; linear models, decision trees, deep learning, and more are used in building it [14].

According to a presentation by Zillow's senior data scientist Nick McClure, 20 TB of data is stored in Zestimate's database [14]. The company tracks 103 attributes for each property going back 220 months. A variety of powerful tools are used in building Zestimate through a series of processes, including Python, R, pandas, and scikit-learn.

Zillow also makes use of machine learning for various other purposes, mainly to improve the accuracy of error and fraud detection. Zillow's data science team uses a combination of scikit-learn, a collection of Python-based data mining and machine learning tools, and Dato's GraphLab [14] to flush out thieves. Figure 1 shows a fraudulent listing posted on Zillow. Fraud detection with 96.9% accuracy is reported [14].

2.1.1 The fraud finding methodology in Zillow is as follows:

• They gather lots of information about each property: attributes (bedrooms, bathrooms, kitchen, and so on), address, pricing data, transactional data, account information, and unstructured text descriptions.

• Features are created based on this information.

• A random forest is trained on these features using known fraudulent and non-fraudulent listings.

• The output is scored (fraud = P(fraud) > 0.5) as actual fraud or not.

• The scored data is fed back into the fraud model weekly for training.

Figure 1 Using Machine learning for fraud detection

2.2 Property Analysts and Economists

Predicting the UK housing market usually falls within the province of property agents such as Foxtons and Savills and of economics departments at leading universities [21]. Predictions are made from miscellaneous factors, including:

• Market sentiment

• Market data

• Economic performance

• Surveyor valuations

• Inflation, and many others

They usually cover the entire city, and the forecast values are broken out for districts and major cities. These predictions are generally published in the news or through various publications made available to property investors. Figure 2-1 shows one such example of a publication on UK housing predictions by Savills, forecasting UK housing prices from 2015 to 2019 [15]. However, these forecasts are often non-specific and averaged across different property types and locations within a given area.

Future expectations for a semi-detached house in a desirable suburb are likely to differ from those for a flat in a less desirable suburb. Knowing the general forecast of London housing prices therefore does not help a buyer decide the location or the property type he or she should consider buying in London. Furthermore, the fact that these predictions are usually delivered through mass media does not allow for customizability or interactivity for the users.

Figure 2-1 Predicting house prices in London from 2015 to 2019

2.3 Machine Learning Algorithms

2.3.1 Multiple linear regression

Multiple linear regression is a machine learning technique that models the relationship between multiple explanatory variables and a response variable in order to predict the outcome of the response variable [9].

In a simple linear regression model, a single response measurement Y is related to a single predictor (covariate, regressor) X for each observation.

• X is denoted as the predictor or independent variable.

• Y is denoted as the response or dependent variable.

The equation for simple linear regression is E(Y | X) = α + βX.


In most problems, more than one predictor variable will be available. This leads to the following "multiple regression" mean function:

E(Y | X) = β0 + β1X1 + β2X2 + ... + βnXn

For example, if Y is the final sale price of a home, X1 is Bedroom (number of bedrooms), X2 is LotArea, and X3 is Kitchen (number of kitchens), then the population mean function is:

E(SalePrice | X) = β0 + β1 Bedroom + β2 LotArea + β3 Kitchen

Based on this mean function, we can determine the expected home price for any home as long as we know its lot area, number of bedrooms, and number of kitchens.

2.3.2 Random forest Regression:

Random forest is an ensemble learning method used for machine learning tasks such as classification and regression [10].

To predict the price of each home, we first construct decision trees using the random forest technique. Each decision tree is built from a random subset of the training data. After training the forest, we pass each test row through it to output a prediction. This technique can give accurate predictions, runs efficiently on large data, and consists of the following steps.

1. Pick K data points at random from the training set.

2. Build the decision tree associated with these K data points.

3. Choose the number Ntree of trees to be built and repeat steps 1 and 2.

4. For a new data point, have each of the Ntree trees predict the value of Y for that point, and assign the new data point the average of all the predicted Y values.

Figure 2-2 Random forest trees

2.3.3 Polynomial Regression

Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial in x [8]. The polynomial regression technique gives good results when there is a large number of variables. This technique is used here to predict the final price of each home by finding the relationship between the SalePrice variable and the other variables. The polynomial regression model for a single predictor of degree m, given n observations, is:

y_i = β0 + β1 x_i + β2 x_i^2 + ... + βm x_i^m + ε_i, for i = 1, ..., n
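As a minimal sketch (not the exact code used in this project), polynomial regression can be set up in scikit-learn by expanding the predictors with PolynomialFeatures and fitting a linear model on the expanded terms; the toy data and the degree below are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data: one predictor with a curved relationship to the target
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 4.1, 9.3, 15.8, 25.2])

# degree=2 is only an illustrative choice; the degree is a tuning parameter
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print(poly_model.predict(np.array([[6.0]])))  # prediction for a new x value
```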

3 Data Preprocessing

This section discusses the steps taken to prepare the original data into a format that can be used for building the machine learning models.

3.1 Importing the libraries

In this project, I used Python's powerful libraries to make the machine learning models efficient. Three essential libraries, NumPy, pandas, and scikit-learn, were used in all the machine learning models.

NumPy is a powerful library for scientific computing in Python. NumPy's most important object is the homogeneous multidimensional array [16]. NumPy saves us from writing inefficient and tiresome calculation code and provides a far more elegant solution for mathematical calculations in Python. It offers an alternative to regular Python lists: a NumPy array is similar to a regular Python list with one additional feature, calculations can be performed over entire arrays easily and very quickly.

Pandas is a flexible, open-source Python library with high-performance, expressive data structures. Pandas works well with relational and labeled data. Although Python itself is great for data munging and preparation, it lags in practical, real-world data analysis and modeling [17]; pandas helps fill this gap and is often called the most powerful tool for data analysis and data manipulation.

Scikit-learn is a great open-source package providing a wide range of supervised and unsupervised algorithms [18]. Scikit-learn is built on top of scientific Python (SciPy) and is primarily focused on modeling data. A few popular capabilities of scikit-learn are clustering, cross validation, ensemble methods, feature extraction, and feature selection [18].
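A small sketch of how these three libraries are typically imported at the top of a script; the aliases and the particular scikit-learn classes shown are conventional choices, not a listing of the project's actual code.

```python
import numpy as np                                    # fast vectorized numerical arrays
import pandas as pd                                   # DataFrames for loading and cleaning data
from sklearn.linear_model import LinearRegression     # regression models
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score   # k-fold cross validation scoring
```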

3.2 Getting the dataset

This section discusses how the dataset is loaded. In this project, the pandas library was used to load all the dataset files. Pandas is powerful and very efficient at analyzing data and also enables us to read data in different formats. I chose the CSV format because it makes it very easy to transfer large datasets between programs.

The pandas read_csv function is used to read the data. This function assumes that fields are comma-separated by default. When a CSV file is loaded, we get an object called a DataFrame, which is made up of rows and columns. Part of a DataFrame is shown in Figure 4 below.

Figure 4 Data Frame
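A minimal sketch of loading the data with pandas; the file names train.csv and test.csv follow the Kaggle competition convention and are assumed here.

```python
import pandas as pd

# read_csv assumes comma-separated fields by default and returns a DataFrame
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print(train.shape)   # (number of rows, number of columns)
print(train.head())  # first few rows, similar to the view in Figure 4
```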

3.3 Handling Missing data

An important and often difficult part of data preprocessing is handling missing values in the dataset. Data scientists must manage missing values because they can adversely affect the operation of machine learning models. Data can be imputed, that is, missing values can be filled in based on the other observations.

Techniques involved in imputing unknown or missing observations include:

1. Deleting the whole rows or columns with unknown or missing observations.

2. Missing values can be inferred by averaging techniques like mean, median, mode.

3. Imputing missing observations with the most frequent values.

4. Imputing missing observations by exploring correlations.

5. Imputing missing observations by exploring similarities between cases.


Figure 5 Missing values in the dataset

Missing values are usually represented by 'nan', 'NA', or 'null' (refer to Figure 5). Below is the list of variables with missing values in the train dataset.

• BsmtFinType2, BsmtFinType1, BsmtExposure, MasVnrType, Alley, FireplaceQu, GarageType, GarageFinish, GarageCond, and Fence are the categorical variables with missing values. These missing values are replaced by the keyword 'VARIOUS'.

Numerical missing values are replaced by the mean of the respective column. Variables containing more than 15% missing values are not good for the model; 'Alley', 'Fence', and 'FireplaceQu' are eliminated from the model since more than 15% of their data is missing.
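A hedged sketch of the imputation steps described above; the file name is assumed from the competition, and the exact code used in the project may differ.

```python
import pandas as pd

train = pd.read_csv('train.csv')  # assumed competition file name

# Drop variables with more than 15% missing values (e.g., Alley, Fence, FireplaceQu)
missing_ratio = train.isnull().mean()
train = train.drop(columns=missing_ratio[missing_ratio > 0.15].index)

# Replace missing values in categorical columns with the keyword 'VARIOUS'
categorical_cols = train.select_dtypes(include='object').columns
train[categorical_cols] = train[categorical_cols].fillna('VARIOUS')

# Replace missing values in numerical columns with the mean of each column
numeric_cols = train.select_dtypes(exclude='object').columns
train[numeric_cols] = train[numeric_cols].fillna(train[numeric_cols].mean())
```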

3.4 Encoding categorical Data

Regression analysis can be done only with numerical variables, yet categorical variables in the dataset also provide crucial information. Focusing on numerical values alone pulls down the performance of the model and does not give good predictions; the categorical variables are equally important. Since we cannot model with categorical variables directly, the problem is solved by converting them into numerical dummy variables. Each category is represented by a 0/1 indicator: 1 if the observation belongs to that category and 0 otherwise. The number of dummy columns equals the number of categories.

In this project, I used the pandas library to deal with the categorical variables. 'get_dummies' is the function used to convert a categorical variable into dummy variables (numerical values).

Figure 6 Changing Characters to dummy variables
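A minimal illustration of pd.get_dummies on one of the dataset's categorical columns; the tiny DataFrame here is fabricated for the example.

```python
import pandas as pd

df = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
dummies = pd.get_dummies(df, columns=['Street'])
print(dummies)
# Each category becomes its own 0/1 column, e.g. Street_Grvl and Street_Pave
```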

4 Methods and implementations

4.1 Steps for building model

4.1.1 All in (not a technical word):

In this approach, all the variables are thrown into the model and all are treated as true predictors. First, I created a predictive model with all 270 predictors for SalePrice and noticed many unwanted variables in the model that weaken its prediction power. As this project contains many potential predictors for the dependent variable, it is very tough to choose among them. Often some of the variables are not useful for building the model, and these useless variables need to be removed to build an efficient model. Just as throwing a lot of stuff into the trash can is not healthy for a home, throwing a lot of variables into the model makes the model poor and unreliable.

It is tough to understand and explain the variables if the dataset contains too many of them; it is simply not practical to try to explain a model built on hundreds of variables. Only the variables that actually predict the behavior of the dependent variable should be taken into the model.

4.1.2 Feature Selection:

Feature selection is also called variable selection or attribute selection. It is the automatic selection of the attributes in the dataset that are most relevant to the predictive modeling problem. Feature selection methods help create an accurate predictive model by choosing features that give the same or better accuracy while requiring less data.

Feature selection methods can be used to identify and remove unneeded, irrelevant, and redundant attributes that do not contribute to the accuracy of a predictive model or may even decrease it. A few variables in the train set are irrelevant to the test set variables; these are removed to improve accuracy. In this project, the Pearson correlation test and backward elimination are used to select the statistically significant variables.

4.1.3 Pearson Correlation Test

The Pearson correlation test was used in this project to understand the collinearity between the numerical variables in the predictive model. The test produced the correlation coefficients between SalePrice and the other independent variables; Figure 7 shows these values. Values above the threshold line (threshold value = 0.3) in Figure 7 indicate the strongest correlations between SalePrice and the independent variables.

Figure 7 Pearson correlation test between SalePrice and other independent numerical variables.
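A sketch of how the correlations in Figure 7 and the 0.3 threshold could be computed with pandas; the file name is assumed and the exact project code may differ.

```python
import pandas as pd

train = pd.read_csv('train.csv')  # assumed competition file name

# Pearson correlation of every numerical variable with SalePrice
correlations = train.corr(numeric_only=True)['SalePrice'].drop('SalePrice')

# Variables above the 0.3 threshold used in Figure 7
strong = correlations[correlations.abs() > 0.3].sort_values(ascending=False)
print(strong)
```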


Figure 9 Graphs between SalePrice and strongly correlated variables

OverallQual, GrLivArea, GarageCars, GarageArea, GarageYrBlt, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, and YearRemodAdd show strong correlation with SalePrice. Figure 9 shows the independent variables and the sale price moving together; these variables may play a key role in predicting home prices.

4.1.4 Outliers

An outlier is a point that falls away from the cloud of points. I ran the analysis both with and without outliers and found that the outliers changed the results and affected the model assumptions; therefore, I dropped the outliers from the model. Figure 10 shows the variables with the outliers, and Figure 11 shows the same variables after the outliers are eliminated.

Figure 10 variables with outliers

Figure 11 variables after eliminating outliers

4.1.5 Multicollinearity test

Furthermore, in order to understand the multicollinearity in the predictive model for SalePrice, a heat map is used to detect multicollinearity between the variables. A heatmap is a great way to explore the relationships between variables.

In regression, "multicollinearity" refers to predictor variables that are correlated with other predictor variables. Multicollinearity occurs when the model contains multiple features that are correlated not only with the dependent variable but also with each other.

Problems with multicollinearity:

1. The standard errors of the coefficients increase.

2. Some significant variables can be made to appear statistically insignificant.

3. The results of the model are undermined.

To avoid multicollinearity:

1. Completely remove highly correlated features.

2. Make a new feature by combining the correlated features.

3. Use PCA, which reduces the feature set to a small number of non-collinear features.


Figure 12 Correlation Heatmap

Figure 12 shows the correlation results between SalePrice, OverallQual, GrLivArea, GarageCars, GarageArea, GarageYrBlt, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, and YearRemodAdd. Black squares indicate highly correlated pairs of variables. From Figure 12, 'OverallQual', 'GrLivArea', and 'TotalBsmtSF' are strongly correlated with SalePrice. Based on the heat map, multicollinearity exists between the following features:

• GarageArea and GarageCars, with correlation 0.89.

• TotalBsmtSF and 1stFlrSF, with correlation 0.80.

• TotRmsAbvGrd and GrLivArea, with correlation 0.83.

• YearBuilt and GarageYrBlt, with correlation 0.82.

GarageArea and GarageCars, TotalBsmtSF and 1stFlrSF, TotRmsAbvGrd and GrLivArea, and YearBuilt and GarageYrBlt are strongly correlated pairs. Since multicollinearity exists between these variables, we need to choose one variable from each pair for the analysis. I chose the four features GarageArea, TotalBsmtSF, GrLivArea, and YearBuilt, since each has a higher correlation with SalePrice than the variable it was paired with.
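A hedged sketch of drawing such a correlation heatmap and then dropping the redundant member of each correlated pair; seaborn is assumed here for the plot and was not named in the report.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.read_csv('train.csv')  # assumed competition file name
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea',
        'GarageYrBlt', 'TotalBsmtSF', '1stFlrSF', 'FullBath', 'TotRmsAbvGrd',
        'YearBuilt', 'YearRemodAdd']

# Correlation heatmap of the strongly correlated variables (compare Figure 12)
sns.heatmap(train[cols].corr(), annot=True, fmt='.2f')
plt.show()

# Keep one variable from each highly correlated pair
train = train.drop(columns=['GarageCars', '1stFlrSF', 'TotRmsAbvGrd', 'GarageYrBlt'])
```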

4.1.6 Backward elimination:

To improve the model further, the backward elimination technique is used to select the highly statistically significant numerical variables.

Steps in Backward elimination:

1. Select a significance level to stay in the model (e.g., SL = 0.05).

2. Fit the full model with all possible predictors.

3. Consider the predictor with the highest p-value. If P > SL, go to step 4; otherwise, finish (the model is ready).

4. Remove that predictor.

5. Fit the model without this variable and return to step 3.

The following 17 variables do not fit the model, as their corresponding p-values are above the significance level of 0.05: 'OverallCond', 'KitchenAbvGr', 'Id', 'YearRemodAdd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'LowQualFinSF', 'HalfBath', 'GarageYrBlt', 'GarageArea', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'PoolArea', 'MiscVal', and 'MoSold'.
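A simplified sketch of backward elimination driven by OLS p-values; statsmodels is assumed here (it was not named in the report), X is expected to be a DataFrame of numerical predictors, and y the SalePrice series.

```python
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    """Repeatedly drop the predictor with the highest p-value above the threshold."""
    X = sm.add_constant(X)                      # add an intercept term named 'const'
    while True:
        model = sm.OLS(y, X).fit()
        p_values = model.pvalues.drop('const')  # ignore the intercept's p-value
        worst = p_values.idxmax()
        if p_values[worst] > significance_level:
            X = X.drop(columns=[worst])         # remove the least significant predictor
        else:
            return model, X                     # all remaining predictors are significant

# Usage (names assumed): model, kept = backward_elimination(train[numeric_cols], train['SalePrice'])
```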


4.2 K-Fold cross validation.

Problems with splitting dataset into training and testing sets:

Cross validation is a model evaluation method that is better than examining residuals alone. This idea underlies a whole class of model evaluation methods called cross validation, of which the holdout method is the simplest.

In the holdout cross validation technique, the dataset is divided into a train set and a test set (for example, 75% of the data for training and the rest for validation; refer to fig 13). But there is an inherent trade-off: every data point taken out of the training set and placed into the test set is lost to training (refer to fig 13). To overcome this problem I used the K-fold cross validation technique, which makes the most of the available data points both for training, to get the best learning results, and for validation.

Figure 14 K fold bins

K-fold cross validation divides the dataset into K random subsets (refer to fig 14). Each of the K subsets is used in turn as a validation set while the remaining data is used as a training set to fit the model. In total, K models are fit and K validation statistics are obtained. The model giving the best validation statistic is chosen as the final model.

In this project, I used R-squared as the K-fold cross validation score for all the models. R-squared is defined as one minus the ratio of residual variability.

“R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression” [20].

It is also called the coefficient of determination and is the statistical measure used to evaluate model fit. R-squared values typically range between 0 and 1.
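A minimal sketch of scoring a model with 10-fold cross validation and the R-squared metric in scikit-learn; the synthetic data below merely stands in for the preprocessed home features and SalePrice.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Placeholder data; in the project X and y come from the preprocessed train set
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring='r2')
print(scores)                        # one R-squared value per bin (fold)
print(scores.mean(), scores.std())   # overall fit and its variance across bins
```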

5 Results

5.1 Multiple linear regression data Analysis and results:

The first predictive model was a regression analysis with all 270 predictors for SalePrice.

The multiple coefficient of determination R-squared (86.6%) is large, so the data fit of this predictive model for SalePrice is quite good. The Pearson correlations of OverallQual, GrLivArea, GarageCars, GarageArea, GarageYrBlt, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, and YearRemodAdd with SalePrice were larger than those of the remaining numerical variables. My next target was to eliminate multicollinearity: the heatmap indicates that multicollinearity exists in the model, so GarageCars, 1stFlrSF, TotRmsAbvGrd, and GarageYrBlt are eliminated from the model as they are strongly correlated with other variables.

Using backward elimination, I found that 17 numerical variables are not statistically significant, so I eliminated them and kept the remaining variables for the regression model. Id, YearRemodAdd, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, LowQualFinSF, HalfBath, GarageYrBlt, GarageArea, OpenPorchSF, EnclosedPorch, 3SsnPorch, PoolArea, MiscVal, and MoSold are among the non-statistically significant variables.

The second predictive model was a regression analysis with 226 predictors for SalePrice. After finding the best variables for regression, I used ordinary least squares to analyze this second predictive model, with the non-statistically significant predictor variables excluded. For this regression equation, the regressor reported an R-squared of 93.6%, which looks very good, and no multicollinearity issue remained in the new predictive model.

5.1.1 Analysis of the Best Regression Equations

From the Pearson correlation test and the heat map, GarageArea and GarageCars, TotalBsmtSF and 1stFlrSF, TotRmsAbvGrd and GrLivArea, and YearBuilt and GarageYrBlt are strongly correlated with each other, so I considered leaving out one variable of each pair. I used the regression analysis to figure out which one to exclude from my equation, and left out GarageCars, 1stFlrSF, TotRmsAbvGrd, and GarageYrBlt. I found the R-squared value to be 93.6%, a better result than the original model with all the predictors. Since the multiple coefficient of determination (R-squared) is close to 1, the model presents a very good fit.

K-fold cross validation is used to measure the performance of the multiple linear regression. The train dataset is divided into 10 bins, and high R-squared values with low variance are noticed for all the bins.


Figure 15 R values of final regression model

7 out of the 10 bins show high R-squared values (refer to fig 15), approximately equal to 0.9 (90%) and thus close to 1. This indicates that the model performs well in predicting house prices. Finally, I compared my model with other competitors in the Kaggle competition; my MLR model scored 0.16854. Results are evaluated on the Root-Mean-Squared-Error (RMSE) between the logarithm of the competitors' predicted values and the logarithm of the observed sales price [1].

5.1.2 Conclusion for MLR

From my analysis, utilizing multiple data processing techniques, I determined an acceptable multiple linear regression model for the data. First, I created a model with all variables and noticed an R-squared value of 86% and a Durbin-Watson statistic of 2.02. Since multicollinearity exists between features, I picked the features that are strongly correlated with SalePrice for the prediction model. The second model includes 226 variables, with an R-squared of 93.6% and a Durbin-Watson statistic of 1.936. Applying regression analysis, backward elimination, the Pearson correlation test, and the k-fold cross validation technique, I obtained the optimal linear regression prediction functions. Later I submitted my model to the Kaggle competition and obtained a score of 0.16854.


5.2 Random forest regression

5.2.1 Using 1 tree with all variables

The first predictive model is built with all predictors for SalePrice: I created a regression model using 1 tree with all the variables, with the same data preprocessing for random forest as for multiple linear regression. Using the k-fold cross validation technique, I noticed the following R-squared values:

Figure 16 Rsquared values of 1 tree

Comparing this model with my previous multiple linear regression model, I found the results to be very poor. To analyze further, I increased the number of estimators (trees) while still using all the variables.
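A hedged sketch of the random forest experiments in this section: the same 10-fold scoring is repeated while the number of trees is varied; the synthetic data is only a stand-in for the preprocessed home features.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Placeholder data; in the project these are the encoded home features and SalePrice
X, y = make_regression(n_samples=300, n_features=20, noise=15.0, random_state=0)

for n_trees in (1, 100, 500):
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(forest, X, y, cv=10, scoring='r2')
    print(n_trees, round(scores.mean(), 3), round(scores.std(), 3))
```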

5.2.2 Using 100 trees with all variables:

I considered taking 100 trees with all the variables. Using the K-fold cross validation technique, I noticed a slight improvement compared with my previous random forest model.

Figure 17 Rsquared values of 100 decision trees

I noticed good R-squared values with low variance, but the results were still not satisfactory compared with the base multiple linear regression model. I tried to improve the model further by training more trees.

5.2.3 Using 500 trees with all variables

I considered taking 500 trees with all the variables. Using the K-fold cross validation technique, I noticed a greatly improved model compared with my previous random forest models.


Figure 17 Rsquared values of 500 decision trees

Though the R-squared values are good, multicollinearity still exists between the variables; the backward elimination technique can be used to select the best features and improve the model further to get the best predictions of home price.

5.2.4 After backward elimination process

5.2.5 1 tree with statistically significant variables:

First, I created a regression model using 1 tree with statistically significant variables. Using k fold cross validation technique, I noticed the following R square values:


Figure 18 Rsquared values of 1 decision tree

I found the results to be very poor. I increased the number of estimators (trees) to analyze further using these variables.

5.2.6 100 trees with statistically significant variables:

I considered taking 100 trees with the statistically significant variables. Using the K-fold cross validation technique, I noticed a slight improvement compared with my previous random forest models.

Figure 19 Rsquared values of 100 decision trees

6 out of 10 bins show high R-squared values, approximately equal to 0.9 (90%) and thus close to 1; this indicates that the model performs well in predicting house prices. I still tried to improve the model by training a few more trees.

5.2.7 500 trees with statistically significant variables:

I considered taking 500 trees with the statistically significant variables. Using the K-fold cross validation technique, I noticed a greatly improved model compared with my previous random forest models.


Figure 20 Rsquared values of 500 decision trees

I noticed good R-squared values with low variance, but the results were still not as good as those of the random forest model with 100 trees; the prediction power decreases as the number of trees is increased further.

5.2.8 Conclusion for Random forest regression:

Figure 21 Comparison between the 1, 10, and 100 tree models

From my analysis utilizing multiple configurations of random forest trees, I determined an acceptable random forest regression model for the data. First, I created a model with all 270 variables and 1 tree and noticed poor R-squared values. The results got better as I increased the number of estimators. After backward elimination and the Pearson correlation test, I created a model with 100 estimators and, using the k-fold cross validation technique, obtained an optimal random forest prediction model with good R-squared values (refer to fig 21). Finally, I compared my model with other competitors in the Kaggle competition; my random forest model scored 0.14696.

5.3 Polynomial regression data Analysis and results:

The first predictive model used polynomial regression with all 270 predictors for SalePrice.

I applied the same data preprocessing for polynomial regression as for the previous models. Using the k-fold cross validation technique, I noticed the following R-squared values:

Figure 22 R values of polynomial regression before backward elimination using k fold

I noticed very poor R-squared values compared to my previous models. I used Pearson correlation to find the collinearity between the variables; after analyzing the Pearson correlation test and finding the highly correlated variables, I considered leaving out one variable of each pair. I eliminated GarageCars, 1stFlrSF, TotRmsAbvGrd, and GarageYrBlt as they are highly correlated with other variables. To improve the model further I used the backward elimination process and eliminated the statistically insignificant variables. I found that 226 variables are statistically significant, so I considered these 226 variables for the regression model and eliminated the rest.

K-fold cross validation is used to measure the performance of the polynomial regression. The train dataset is divided into 10 bins, and poor R-squared values with high variance are noticed across the bins.

Figure 23 R values of final polynomial regression using k fold


5.3.1 Conclusion for Polynomial regression:

Figure 24 Comparison between before data preprocessing and after data preprocessing

From my analysis using polynomial regression, I fit a polynomial regression model to the data. First, I created a model with all 270 variables and noticed poor R-squared values (refer to figure 24). To improve the model, I used backward elimination, the Pearson correlation test, and the k-fold cross validation technique to obtain the best polynomial prediction functions I could. Compared to my previous models, multiple linear regression and random forest regression, polynomial regression gave poor predictions of SalePrice and also took longer to process. Polynomial regression scored negative R-squared values because it does not suit this dataset, since the relationship between the predictors and the sale price is linear; R-squared is negative only when the chosen model does not follow the trend of the data and therefore fits worse than a horizontal line [19]. Polynomial regression provides good predictions for non-linear relationships. Finally, I compared my model with other competitors in the Kaggle competition; my polynomial regression model scored 0.24399.


5.4 Comparison between the Algorithms:

Figure 25 comparison of K-fold between algorithms

The line graph (figure 25) compares the K-fold R-squared values of the three algorithms: MLR, random forest, and polynomial regression. From the graph we can deduce that MLR and random forest attained better R-squared values than polynomial regression, and MLR follows a similar trend to random forest. 7 out of 10 bins of MLR show R values higher than 0.9 with low variance, indicating a good model fit. Similarly, 6 out of 10 bins of random forest show R values higher than 0.9 with low variance, indicating that this model also fits well. The graph shows a different trend for polynomial regression: strong fluctuations appear as dramatic rises and falls of the R-squared values, which represents poor variance between them. None of its bins attained a good R-squared value, with the 9th bin being the highest at 0.85, the 2nd bin the lowest at 0.5, and the rest of the bins below 0.8, which represents poor R values. A good R-squared indicates accurate predictions, so poor R values suggest a higher chance of false predictions. The results from K-fold cross validation indicate high R-squared values for MLR and random forest, suggesting a high level of performance when applied to an actual test set. Each model is evaluated with the Kaggle score checker. My random forest model achieved a score of 0.14696, which is better than my base multiple linear regression model (Kaggle score 0.16854) and polynomial regression (Kaggle score 0.24399).

Figure 26 Houses vs corresponding predicted house prices

The scatter plot above (fig 26) shows the predicted house prices. Random forest and MLR show similar trends on the graph. The highest price predicted by MLR for a home is $700,000; random forest predicted the highest price to be $620,000, while polynomial regression predicts $1,000,000. Since random forest was concluded to be the best prediction model, the prices predicted by random forest regression are taken as the final prices.

5.5 Comparison of models with other competitors:

Through model comparisons with fellow competitors, the lasso regression algorithm proved to perform better than the other algorithms. Since there is a lot of multicollinearity in the dataset, I found that regression regularization methods (Lasso, Ridge, and ElasticNet) bring better results. The comparison presented here covers the models of co-competitors Pedro Marcelino Boris, Eliot, and Zubair. By analyzing their work, I also discovered that my models required better data preprocessing and regularization.

One of the reasons I concluded that the LASSO model is better is that LASSO gave good predictions for the sale price because it provides effective shrinkage and variable selection for linear regression. The LASSO shrinkage process improved Boris's and Paipu's models and made them easier to interpret: it identified the 110 variables strongly associated with the sale price and eliminated the remaining 178 variables from the model. LASSO performs better than random forest, MLR, and polynomial regression here because it obtains the subset of predictors that minimizes prediction error. When the dataset contains a large number of observations and the relationship between the sale price and the other predictors is linear, OLS regression parameter estimates have low bias and low variance; when the dataset has a small number of observations, the variance of the OLS parameter estimates is higher, and LASSO regression can then increase model interpretability. Left to itself, OLS also keeps variables that are not associated with the sale price, adding complication to the model. In Boris's LASSO regression model, 178 insignificant predictors were eliminated, leaving 110 predictors significant to the sale price. A tuning parameter labeled lambda was tried by Boris with values of 0.001, 0.0001, 0.005, and 0.0005; with 0.0005, the model got the better score on the public leaderboard. A lambda of 0.0004 was chosen by Mayumi and attained a score of 0.12102 on the public leaderboard. As lambda increases, more coefficients are reduced to zero, narrowing the set of relevant predictors; lambda controls the strength of the penalty applied to the regression model.
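A small sketch of fitting LASSO in scikit-learn, where the tuning parameter lambda is exposed as alpha; the alphas and synthetic data below are illustrative and are not taken from the competitors' code.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Placeholder data standing in for the encoded home features and (log) sale price
X, y = make_regression(n_samples=300, n_features=100, noise=10.0, random_state=0)

# alpha plays the role of lambda: larger values shrink more coefficients to zero
for alpha in (0.01, 0.1, 1.0, 10.0):
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    kept = int((lasso.coef_ != 0).sum())
    print(f"alpha={alpha}: {kept} predictors kept out of {X.shape[1]}")
```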

I also compared my random forest model with my co-competitor Zubair's random forest model. My model performed better in this comparison, since his model has some multicollinearity problems and its data is pre-processed less thoroughly. His model attained a score of 0.14765 on the public leaderboard, where mine scored 0.14696.

Figure 27 LASSO variable selection

The bar chart (figure 27) illustrates the important features discovered by LASSO in Paipu's model. Variables pointing in the positive direction are strongly associated with the sale price, whereas variables pointing in the negative direction are not associated with it and are eliminated from the model. Since the dataset contains a huge number of observations, LASSO's ability to select the important variables needed for the model is a great feature. This shows that LASSO saves a lot of processing time and manual effort in selecting features when compared to the other algorithms.


Table 1 Kaggle scores of different machine learning models

Figure 28 distribution of kaggle scores of various algorithms

This bar chart (figure 28) illustrates the scores of the various machine learning models. My polynomial regression scored 0.24399, a poor score compared to the remaining algorithms; polynomial regression scored poorly because it does not suit this dataset, since the relationship between the predictors and the sale price is linear, and polynomial regression provides good predictions for non-linear relationships. The graph shows Boris's lasso regression scoring first and Mayumi's second, at 0.11720 and 0.12102 respectively. The scores of the other models fluctuate when compared to the LASSO models. Looking at the detail between my random forest model and Zubair's random forest model, the two scores are nearly identical, and my MLR score is decent compared to the random forest models. In conclusion, the LASSO models attained better scores than the other models, indicating better performance on this dataset and stronger predictions of accurate house prices.

Figure 29 Venkat's models vs. other competitors' models

The scatter plot above (fig 29) shows the predicted house prices for my models and for Mayumi's, Zubair's, and Boris's models. My random forest and MLR models show similar trends on the graph. The highest price predicted by MLR for a home is $700,000, while random forest predicted the highest price to be $620,000 and polynomial regression predicts $1,000,000. The highest price predicted by Mayumi's model is approximately $640,000, while Boris's model predicted the highest home price to be nearly $1,400,000. From the graph, we can see that my random forest model and Zubair's random forest show similar trends; the highest price predicted by Zubair's model for a home is $540,000.

6 Conclusion

Machine learning technologies have brought a scientific revolution to business. Many of the top real estate websites use machine learning technologies to accurately predict the value of every piece of real estate property and delight their customers. Adopting and integrating machine learning technologies has improved the customer home-buying experience and helped sellers prepare and optimize their homes for sale.

In this paper, I presented machine learning regression models to predict home prices, which helps people buy or sell their properties without the help of assessors. Using various regression techniques, I was able to predict the prices of homes from 270 home features. Through backward elimination and the Pearson correlation test, I optimized the feature selection process to build accurate models. From my analysis, I created acceptable multiple linear regression, random forest regression, and polynomial regression models, and I measured the performance of all of them using the K-fold cross validation technique. After comparing all my models with other competitors' in the Kaggle competition, random forest regression and multiple linear regression performed better, whereas polynomial regression gave poor results. Applying regression analysis, backward elimination, the Pearson correlation test, and the k-fold cross validation technique, I obtained the optimal regression prediction functions. I would like to work on more machine learning business problems in various industries, which will help me set up a great platform to showcase my skills.


7 References:

[1] Kaggle competition: Sold! How do home features add up to its price tag? https://www.kaggle.com/c/house-prices-advanced-regression-techniques

[2] Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. https://ww2.amstat.org/publications/jse/v19n3/decock.pdf

[3] Documentation: https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt

[4] Beacon (2011). Local Government GIS for the Web. Retrieved March 24, 2011 from http://beacon.schneidercorp.com

[5] City of Ames Iowa (2002). Ames City Assessor Homepage. Retrieved March 24, 2011 from http://www.cityofames.org/assessor/

[6] Aiken, Leona S., Stephen G. West, and Steven C. Pitts. "Multiple linear regression." Handbook of psychology (2003).

[7] Liaw, Andy, and Matthew Wiener. "Classification and regression by randomForest." R news 2.3 (2002): 18-22.

[8] Polynomial regression: https://www.statsdirect.com/help/regression_and_correlation/polynomial.htm

[9] The Advantages & Disadvantages of a Multiple Regression Model: http://sciencing.com/advantages-disadvantages-multiple-regression-model-12070171.html

[10] The Advantages & Disadvantages of a Random Forest Model: https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/

[11] Polynomial regression: https://en.wikipedia.org/wiki/Polynomial_regression

[12] Blum, Avrim, Adam Kalai, and John Langford. "Beating the hold-out: Bounds for k-fold and progressive cross-validation." Proceedings of the twelfth annual conference on Computational learning theory. ACM, 1999.

[13] Rodriguez, Juan D., Aritz Perez, and Jose A. Lozano. "Sensitivity analysis of k-fold cross validation in prediction error estimation." IEEE transactions on pattern analysis and machine intelligence 32.3 (2010): 569-575.

[14] Zestimate: https://www.zillow.com/how-much-is-my-home-worth/?sem=true&semAdgid=48910144228&semMaTy=e&semAdid=183953814392&semKwid=kwd-2230666391&k_clickid=9a8e7e9a-822e-46cf-8b6b-74756b73a7cd&gclid=CMW855WayNMCFdSXfgod57gHdw

[15] Savills: http://www.savills.co.uk/blog/article/209620/residential-property/interactive-map-5-year-forecast-of-uk-house-prices.aspx

[16] NumPy: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

[17] Pandas: http://pandas.pydata.org/

[18] Scikit-learn: http://scikit-learn.org/stable/supervised_learning.html#supervised-learning

[19] Polynomial regression: https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative

[20] R-square: http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit

8 Appendices

List of 80 explanatory variables:

Features and datatypes:

Id: int64
MSSubClass: int64
MSZoning: object
LotFrontage: float64
LotArea: int64
Street: object
Alley: object
LotShape: object
LandContour: object
Utilities: object
LotConfig: object
LandSlope: object
Neighborhood: object
Condition1: object
Condition2: object
BldgType: object
HouseStyle: object
OverallQual: int64
OverallCond: int64
YearBuilt: int64
YearRemodAdd: int64
RoofStyle: object
RoofMatl: object
Exterior1st: object
Exterior2nd: object
MasVnrType: object
MasVnrArea: float64
ExterQual: object
ExterCond: object
Foundation: object
BedroomAbvGr: int64
KitchenAbvGr: int64
KitchenQual: object
TotRmsAbvGrd: int64
Functional: object
Fireplaces: int64
FireplaceQu: object
GarageType: object
GarageYrBlt: float64
GarageFinish: object
GarageCars: int64
GarageArea: int64
GarageQual: object
GarageCond: object
PavedDrive: object
WoodDeckSF: int64
OpenPorchSF: int64
EnclosedPorch: int64
3SsnPorch: int64
ScreenPorch: int64
PoolArea: int64
PoolQC: object
Fence: object
MiscFeature: object
MiscVal: int64
MoSold: int64
YrSold: int64
SaleType: object
SaleCondition: object
SalePrice: int64

Table 2: List of all variables with data types

Individual variable descriptions are given in the documentation file: http://www.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt

Missing variables in Train and Test set:

Train set

Feature: No. of missing values
PoolQC: 1453
Fence: 1178
MiscFeature: 1406
GarageQual: 81
GarageCond: 81
FireplaceQu: 690
GarageType: 81
GarageYrBlt: 81
GarageFinish: 81
MasVnrType: 8
MasVnrArea: 8
Alley: 1369
LotFrontage: 259

Test set

Feature: No. of missing values
MSZoning: 4
LotFrontage: 227
PoolQC: 1456
Fence: 1169
MiscFeature: 1408
SaleType: 1
FireplaceQu: 730
GarageType: 76
GarageYrBlt: 78
GarageFinish: 78
GarageCars: 1
GarageArea: 1
GarageQual: 78
GarageCond: 78
KitchenQual: 1
Functional: 2
Exterior1st: 1
Exterior2nd: 1
MasVnrType: 16
MasVnrArea: 15

Table 3: Missing variables in the dataset