
An Empirical Study Identifying Bias in Yelp Dataset

by Seri Choi

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY February 2021 © Massachusetts Institute of Technology 2021. All rights reserved.

Author...... Department of Electrical Engineering and Computer Science Jan 25, 2021

Certified by...... Alex Pentland Professor of Media Arts and Sciences Thesis Supervisor

Accepted by ...... Katrina LaCurts Chair, Master of Engineering Thesis Committee

Submitted to the Department of Electrical Engineering and Computer Science on Jan 25, 2021, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

Online review platforms have become an essential element of the business industry, providing users with in-depth information on businesses and on other users’ experiences. The purpose of this study is to examine possible bias or discriminatory behavior in users’ rating habits in the Yelp dataset. The Surprise recommender system is utilized to produce expected ratings for the test set, training the model on 75% of the original dataset to learn rating trends. Then, ordinary least squares (OLS) linear regression is applied to identify which factors affect the percent change between actual and predicted ratings, and which categories or locations show more bias than others. This paper provides insight into ways that bias can manifest within a dataset due to non-experimental factors such as social psychology; future research can take such non-experimental factors, including the discriminatory bias found in Yelp reviews, into consideration in order to reduce bias when utilizing machine learning algorithms.

Thesis Supervisor: Alex Pentland Title: Professor of Media Arts and Sciences

Acknowledgments

I would like to acknowledge the dedication and effort made by my advisors, Professor Pentland and Yan Leng, towards expanding my understanding and the research process.

I would also like to acknowledge my family for unconditional love and support.

Contents

1 Introduction

2 Related Work
2.1 Recommender Systems
2.2 Bias in Social Psychology
2.3 Bias in Recommender Systems

3 Background
3.1 Surprise Recommender System
3.2 Linear Regression
3.2.1 Ordinary Least Squares
3.2.2 Multicollinearity

4 Empirical Results
4.1 Data Descriptions
4.1.1 Business Data
4.1.2 User Data
4.1.3 Review Data
4.2 Hypothesis
4.3 Experimental Settings
4.3.1 Train and Test Distribution
4.3.2 Metrics for Evaluation
4.4 OLS Regression Results
4.4.1 Specification 1: Effects from Year
4.4.2 Specification 2: Effects from Users and Businesses
4.4.3 Specification 3: Extending with Ethnic Groups
4.4.4 Specification 4: Including States, by State Color
4.4.5 Specification 5: Including States, by State Region
4.5 Discussion

5 Conclusion

A Figures

List of Figures

2-1 The Life Cycle of Recommender System Biases [11]

3-1 Workflow of a Recommender System [26]

4-1 8 Components of Business Data
4-2 Division by State Colors
4-3 Division by State Regions
4-4 Proportion of Google Searches With the "N-word" [10]
4-5 Simplified Correlation Matrix
4-6 Regression Results: Effects from Year
4-7 Regression Results: Effects from Users and Businesses
4-8 Regression Results: Extending with Ethnic Groups
4-9 Regression Results: Including States, by State Color
4-10 Regression Results: Including States, by State Region

A-1 A Heatmap of Independent Variables Correlation Matrix

List of Tables

4.1 Category Group Classification
4.2 Summary Statistics of Data Dividing By State Color
4.3 Summary Statistics of Data Dividing By State Region

Chapter 1

Introduction

As information technology has rapidly grown over the years, communication, mobile, and networking have become imperative factors for many businesses [25]. Both the scale and usage of digital platforms have also increased in society. Customers use information on those platforms to determine which businesses, such as restaurants or salons, can provide their next best experience. In fact, such exposure prior to visiting a business can trigger users to build some bias, whether positive or negative. Furthermore, some users might compare their previous experiences or apply their prejudice against the new businesses they visit. Discrimination is a topic that has been pervasive throughout history; it has even affected the digital space. For the online rental marketplace Airbnb, research showed that non-black hosts are able to charge approximately 12% more than black hosts, given equal location and quality of the properties [14].

Much research has utilized online review metrics and recommender systems to suggest similar products to consumers or to forecast the performance or revenue of businesses [30]. In this paper, however, a recommender system is applied in a different context, where predicted values denote the standardized value for the population. Using a dataset from the online review platform Yelp, the goal of this paper is to discern whether discriminatory ratings disproportionately impact the businesses of certain nationality groups. Given two restaurants that have the same quality, table service, and ambience, with the only notable difference being the type of cuisine served or the ethnic identification of the restaurant, would users rate these restaurants differently due to other biases they hold? By analyzing users’ behaviors in how they rate businesses on an online review platform, this paper aims to help determine the prevalence of biased or discriminatory reviews. This will allow a better understanding of how users’ biased perceptions affect even the most seemingly straightforward inputs, such as their ratings of an establishment’s product or service, as well as provide awareness of possibly biased datasets that can affect the results of machine learning algorithms.

Chapter 2

Related Work

2.1 Recommender Systems

Personalized recommendations have become widely available due to the widespread adoption of recommender systems. Especially with the availability of extensive data collected on both users and businesses, the design of recommender systems has placed greater weight on incorporating more information, such as user-generated content, to improve the quality of recommendations [22]. For example, Ansari et al. integrate user reviews, user-generated tags, and product covariates into a probabilistic topic model framework, which can characterize products in terms of latent topics and specify user preferences [6].

Other studies have focused on using geographical data to enhance recommendations. Hu et al. utilized Yelp business review data to study business rating prediction. They observed a weak positive correlation between a business’s average rating and the average rating of its neighbors, regardless of business category [17]. A study conducted by Cosley et al. utilizes the MovieLens recommender system, which uses collaborative filtering to make recommendations, to identify whether exposing users to predicted ratings affects the way they rate movies. While their experiments conclude that users’ opinions are influenced when re-rating a movie they have already watched, it was not identifiable how users would behave for movies they have not yet seen [12]. Similarly, there can be other factors that influence users’ opinions, which will be reflected in the dataset as well as in machine learning model results. It is important that recommender system designers and researchers focus on delivering better recommendations for users, so it is essential that the input data to the model be as accurate as possible.

Therefore, this paper aims to discover any non-experimental factors that can alter the analysis, indicating bias in a dataset.

2.2 Bias in Social Psychology

Brewer explains in detail the social psychology of intergroup relations. The social world can be divided into two groups: ingroups and outgroups. This partition is caused by basic cognitive categorization processes, emerging from group-based attitudes, perceptions, and behavior. Ingroup favoritism, caused by attachment to and preference within the group, gives rise to intergroup discrimination, irrespective of attitudes toward specific outgroups [8].

The Census Bureau estimated the 2020 population to be 76.3% White, 13.4% Black, 18.5% Hispanic, and 5.9% Asian [4]. Attitudes and emotions toward specific outgroups reflect appraisals of the nature of the relationships between ingroup and outgroup that have implications for the maintenance or enhancement of ingroup resources, values, and well-being [8]. A division into ingroups and outgroups can easily occur based on the size difference between the groups. Although the racial demographics of Yelp users are not publicly available, it would be interesting to find out which ethnic group is most affected and would be considered an outgroup.

Donalson et al. suggest that different social and personal aspects can influence how someone responds, with bias, to questions posed by organizational researchers. Some factors include agreement among co-workers, propensity to give socially desirable responses, and fear of reprisal [13]. This paper seeks to find out if this type of social-psychological bias is also reflected in the online business industry, focusing especially on racial aspects.

2.3 Bias in Recommender Systems

Many machine learning algorithms were initially introduced as innovative mediums that can learn and make predictions without explicit commands being given to a computer. However, researchers have started to notice evidence of algorithmic bias over the years. For example, Sweeney shows that a background check service’s ads were more likely to appear in a paid search ad displayed after a search for names traditionally associated with African-Americans [29]. Buolamwini et al. experimented on three different facial-analysis programs and found that the error rates in determining the gender of light-skinned men were never worse than 0.8%, while for darker-skinned women, the error rates increased up to 34.7%. These results were caused by the unbalanced phenotypical datasets used for the classification models. The study brings up the importance of building inclusive training sets and benchmarks in order to avoid bias and misclassification [9]. Another study researched how women saw fewer job opportunity ads in the Science, Technology, Engineering and Math (STEM) fields compared to men, due to a disproportionate gender environment. Their results demonstrate that even when decisions are made by algorithms and human biases are removed, the outcome may still disadvantage one group relative to another [20].

Figure 2-1: The Life Cycle of Recommender System Biases [11]

Research conducted by Chen et al. provides a systematic overview of existing work on recommender system biases. Figure 2-1 demonstrates three stages of the life cycle of recommendation in a feedback loop: User, Data, and Model. One of the key types of bias relevant to this paper is unfairness. Unfairness can be defined as a situation in which "the system systematically and unfairly discriminates against certain individuals or groups of individuals in favor of others" [15]. Unfairness has been an overbearing issue deeply entrenched within American society, and it also influences recommender systems to act similarly. This is caused by unequally represented user groups in the dataset, based on attributes like race, gender, age, educational level, or wealth. When trained on such unbalanced data, models can produce systematic discrimination and reduced representation for certain groups, reinforcing racial stereotypes [11]. Using this definition, the goal of this paper is to identify unfairness in the Yelp data that might have arisen from social psychology.

Chapter 3

Background

3.1 Surprise Recommender System

Figure 3-1: Workflow of a Recommender System [26]

A recommender system is a machine that provides personalized information by being trained on users’ previous ratings of businesses, through two main approaches: collaborative filtering and content-based filtering. Collaborative filtering recommends products based on the interests of users with similar ratings, while content-based filtering recommends products similar to other products in which the user has expressed interest [22]. In 2017, Nicolas Hug created Surprise, a Python scikit package that supports multiple recommendation system algorithms [26]. One of these algorithms is Singular Value Decomposition++ (SVDpp), an extension of SVD that takes implicit ratings into account [18]. SVD-based algorithms can produce high-quality recommendations in a short time, and their results indicate better performance than a traditional collaborative filtering algorithm on a MovieLens dataset [27]. In order to identify bias within the dataset, there needs to exist a baseline norm to determine whether a user rated a given business lower or higher than expected. Implicit feedback used in an SVDpp algorithm refers to a user’s past history of rating behaviors, which can help infer the user’s preferences [19]. Therefore, the SVDpp algorithm of the Surprise recommendation system is the best model for this paper to predict and set the expected ratings for businesses based on all users’ past rating behaviors.

3.2 Linear Regression

Linear regression is a procedure that estimates the coefficients of a linear equation, consisting of one or more independent variables, that best predicts the value of the quantitative dependent variable [5].

3.2.1 Ordinary Least Squares

A linear regression model is typically based on ordinary least squares (OLS) estimates, which are calculated by minimizing the sum of squared differences between the dependent variable observed in the data and the values predicted by the linear function. There are two important properties of the OLS estimator: unbiasedness and efficiency. Unbiasedness means that the expected value of the estimator equals the true parameter value; relatedly, a consistent estimator becomes closer to the true value as the sample size grows. Efficiency denotes that OLS is the most accurate estimator given the available sample and the assumptions made [31].
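To make the least-squares idea concrete, the following short sketch (illustrative only, not from the thesis) fits a line with NumPy; `np.linalg.lstsq` directly minimizes the sum of squared residuals described above:

```python
import numpy as np

# Toy data generated exactly by y = 1 + 2x, so OLS should recover [1, 2].
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix with an intercept column; lstsq minimizes ||A @ beta - y||^2.
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # ~ [1. 2.]
```

Because the toy data are exactly linear, the residual sum of squares is zero and the estimated coefficients match the true ones.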

One of the assumptions when using the OLS estimator is that there is no multicollinearity, meaning there should be no exact linear relationship among the independent variables. Under these assumptions, the Gauss-Markov theorem states that the OLS estimator has the lowest sampling variance within the class of linear unbiased estimators [31].

3.2.2 Multicollinearity

When a full regression model is run, multicollinearity among independent variables can occur, which can inflate the variance of the coefficient estimates. One way to identify multicollinearity in a linear regression model is the variance inflation factor (VIF) [24], which is calculated as follows:

$$VIF_x = \frac{1}{1 - R_x^2}$$

where $R_x^2$ represents the R-squared value obtained when variable $x$ is regressed on the rest of the variables in the data. The VIF value ranges from 1 to infinity, but it should not exceed 10, as that indicates multicollinearity [7].
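The VIF formula can be computed directly: regress each column on the others, take the R-squared, and apply $1/(1-R^2)$. Here is a minimal sketch using plain NumPy (the function name and test data are hypothetical, not from the thesis):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the data matrix X (n x k)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on all the other columns (plus an intercept).
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)              # independent of x0
x2 = x0 + 0.01 * rng.normal(size=200)  # nearly collinear with x0
vifs = vif(np.column_stack([x0, x1, x2]))
# vifs[1] stays near 1, while vifs[0] and vifs[2] blow up far past 10
```

The near-duplicate column drives its VIF well above the threshold of 10, flagging the multicollinearity that the text describes.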

Chapter 4

Empirical Results

This section introduces the main dataset, Yelp, used in this paper for empirical analysis. After the experimental setup is described, linear regression models based on ordinary least squares (OLS) estimates are analyzed to understand which factors affect users’ rating behaviors.

4.1 Data Descriptions

This section introduces the dataset provided by Yelp,1 an online review platform where users may rate and write reviews on businesses, such as restaurants, grocery stores, and department stores. Yelp collects business information by gathering it from the owners themselves or by crowdsourcing from users’ experiences. The Yelp dataset is very rich; it contains over 8 million reviews and information on roughly 2 million users and 200,000 businesses.

4.1.1 Business Data

Figure 4-1 summarizes 8 different components from the business data that is used for the analysis. Each will be explained in more detail in this section.

1The data can be obtained at https://www.yelp.com/dataset/.

Figure 4-1: 8 Components of Business Data

Price Range

There are over 1.4 million business attributes like hours, parking, availability, and ambience to provide more high-dimensional information on each business. However, there are only 39 unique types of attributes. Among those, the only attribute type that is used for this paper is the price range of the business. The price range spans from 1 to 4, where a higher number denotes a more expensive product or service. Those that do not have a price range as part of their attributes get a 0 as a default.
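Extracting this attribute from a raw business record might look like the sketch below. The key name `RestaurantsPriceRange2` is the one commonly seen in Yelp dataset dumps but should be treated as an assumption here; the default of 0 mirrors the text above:

```python
def price_range(attributes):
    """Return the business price range 1-4, or 0 when the attribute is absent.

    `attributes` is the (possibly None) attribute dict of a business record;
    'RestaurantsPriceRange2' is assumed to be the Yelp key for price range.
    """
    if not attributes:
        return 0
    try:
        return int(attributes.get("RestaurantsPriceRange2"))
    except (TypeError, ValueError):
        return 0

price_range({"RestaurantsPriceRange2": "2"})  # -> 2
price_range(None)                             # -> 0
```

The try/except also covers records where the attribute is stored as a non-numeric string, which then fall back to the default of 0.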

Categories + Category Groups

Within the business data, there is also a list of categories that represent each business. For example, for a taco place, the list would include Restaurants, Mexican, and Tacos. As shown, each business may belong to multiple categories. Because this research is focused on identifying bias towards businesses of certain ethnic groups, categories that represent nationalities are manually chosen from the list of all categories in the dataset. As a result, there are 91 filtered categories, which include American (Traditional), Korean, Brazilian, and Ethiopian. While Yelp also includes businesses other than restaurants, after filtering the dataset based on the chosen categories, 100% of the businesses included “Restaurants” as one of their categories. Although these categories many times likely do correlate with the ethnicities of the business owners, in order to avoid making assumptions, e.g. assuming that a taco place is run by a Mexican-American owner, the categories represent the cuisines of the restaurants rather than ethnic groups. These categories are then classified into 10 different groups, as shown in Table 4.1. As a categorical value, all category groups applicable to a restaurant are indicated using one-hot encoding for each review.

Group              Categories
African            South African, Ethiopian, Moroccan
North American     American, Southern, Canadian, Tex-Mex
Latin American     Cuban, Colombian, Puerto Rican, Haitian
West Asian         Afghan, Indian, Pakistani, Arabian
East Asian         Chinese, Japanese, Korean, Taiwanese
Southeast Asian    Indonesian, Malaysian, Thai, Vietnamese
Western European   Italian, French, German, Belgian, Irish
Eastern European   Hungarian, Ukrainian, Czech/Slovakian
Mediterranean      Persian/Iranian, Lebanese, Syrian
Middle Eastern     Greek, Egyptian, Portuguese, Spanish

Table 4.1: Category Group Classification

State Color + State Region

With regards to location information, it should be noted that, although the Yelp dataset includes data from businesses in many countries, this paper only takes into consideration businesses from the dataset that are located within the United States. There are two ways, for the purpose of this study, that these businesses can be broken down by state:

• by color: red, blue, or purple

25 • by region: northeast, midwest, south, or west

Figure 4-2: Division by State Colors

Figure 4-3: Division by State Regions

The state color classification is based on the results of the 2008, 2012, 2016, and 2020 presidential elections by state, as shown in Figure 4-2. Blue shades represent states carried by the Democrats in three or four of the four elections, while red shades represent states that were, conversely, carried by the Republicans. The purple shade represents states carried by each party twice in the four elections [3]. Figure 4-3 shows the four statistical regions defined by the United States Census Bureau [2]. These classifications help the model analyze the United States in two different contexts. Similar to category group, the state color and state region variables are considered categorical, so they are converted numerically using one-hot encoding. For example, for a restaurant in Minnesota, classified as a blue state in the midwest region, the data would appear as follows:

[red, blue, purple] = [0, 1, 0]

[northeast, midwest, south, west] = [0, 1, 0, 0]
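The encodings above can be produced with a small helper like the following (an illustrative sketch; a real pipeline would typically use a library encoder):

```python
def one_hot(value, levels):
    """Encode a categorical value as a 0/1 indicator list over `levels`."""
    return [1 if value == level else 0 for level in levels]

# A Minnesota restaurant: blue state, midwest region.
print(one_hot("blue", ["red", "blue", "purple"]))                     # [0, 1, 0]
print(one_hot("midwest", ["northeast", "midwest", "south", "west"]))  # [0, 1, 0, 0]
```

Each categorical variable thus expands into one binary column per level, matching the bracketed vectors shown above.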

A further analysis is done by summarizing statistics of the data when the locations of businesses are divided by state colors and regions.

                        Red             Blue            Purple
Rating count            312,886         367,226         48,308
User count              167,006         215,093         27,174
Business count          8,904           8,178           3,034
Average rating (std.)   3.782 (1.400)   3.818 (1.370)   3.687 (1.401)

Table 4.2: Summary Statistics of Data Dividing By State Color

                        Northeast       Midwest         South           West
Rating count            45,761          72,603          67,242          542,814
User count              25,309          41,141          36,654          302,071
Business count          2,283           4,338           2,827           10,668
Average rating (std.)   3.726 (1.346)   3.686 (1.379)   3.702 (1.389)   3.826 (1.387)

Table 4.3: Summary Statistics of Data Dividing By State Region

City + Review Counts + Ratings

The information provided in the dataset regarding cities is also used, mapping each city to the total number of restaurants in that city, to control for the concentration of businesses considered. This is necessary because, when more restaurants are available in an area, each restaurant has a higher chance of being visited and reviewed. Finally, business review counts and ratings are used in the model to identify whether higher review counts or a higher rating affect users’ perspectives.

4.1.2 User Data

The user data includes the number of reviews the user has written, a list of the user’s friends, the number of votes sent by the user and received from other users, and more. There are two important attributes to consider within the user data: review count and the average rating across all of the user’s reviews. Using this information, for each review, the user’s average rating excluding the corresponding review is calculated as follows:

$$\frac{rc_i \times avgr_i - r_j}{rc_i - 1} \quad \text{for } rc_i > 1$$

where $rc_i$ represents the review count for user $i$, $avgr_i$ the average rating for user $i$, and $r_j$ the rating of review $j$. This variable allows the model to control for the user’s previous rating habits, as one user may rate mostly 2s while another may rate mostly 5s. This feature helps ensure that the results are not skewed by the tendency of specific users to rate businesses conservatively or liberally overall.
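The leave-one-out average above can be written directly as a small function (a sketch; the names are illustrative):

```python
def avg_excluding(rc_i, avgr_i, r_j):
    """User i's average rating with review j removed (requires rc_i > 1)."""
    if rc_i <= 1:
        raise ValueError("need more than one review to exclude one")
    return (rc_i * avgr_i - r_j) / (rc_i - 1)

# A user with 4 reviews averaging 3.0; dropping a 1-star review
# raises the remaining average to (4 * 3.0 - 1) / 3 = 11/3 ≈ 3.67.
print(round(avg_excluding(4, 3.0, 1), 2))  # 3.67
```

Recovering the excluded-review average this way avoids re-scanning the user's full review history for every row.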

4.1.3 Review Data

The review data is the main numerical data used in this research to determine the existence of bias in the dataset. Yelp review data includes the user ID, business ID, and review rating. Using the IDs, an $n \times m$ user-business matrix is created, where $n$ represents the number of users and $m$ the number of businesses. Each entry $r_{ij}$ in the matrix denotes a rating $r$, ranging from 1 to 5, given by user $i$ to business $j$. For example, if there are 2 users and 3 businesses, the matrix could be as follows:

$$M = \begin{bmatrix} 3 & 0 & 4 \\ 2 & 5 & 0 \end{bmatrix}$$

This matrix shows that $u_1$ rated 3 and 4 for $b_1$ and $b_3$, respectively, while $u_2$ rated 2 and 5 for $b_1$ and $b_2$, respectively. This matrix is then used by the Surprise recommendation system, explained in the Background chapter, to train on users’ rating behaviors, allowing the recommender system to predict expected rating values for other users and businesses.
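The toy matrix above can be assembled from (user, business, rating) triples as follows. Note this is purely illustrative: in practice Surprise consumes the rating triples directly rather than a dense matrix, and 0 here only marks "no rating observed":

```python
import numpy as np

reviews = [("u1", "b1", 3), ("u1", "b3", 4), ("u2", "b1", 2), ("u2", "b2", 5)]
users = sorted({u for u, _, _ in reviews})       # ['u1', 'u2']
businesses = sorted({b for _, b, _ in reviews})  # ['b1', 'b2', 'b3']

# n x m user-business matrix; 0 marks an unobserved rating.
M = np.zeros((len(users), len(businesses)), dtype=int)
for u, b, r in reviews:
    M[users.index(u), businesses.index(b)] = r
print(M)  # [[3 0 4]
          #  [2 5 0]]
```

The sparsity of this matrix (most users rate very few businesses) is exactly why latent-factor methods such as SVDpp are used to fill in the expected ratings.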

4.2 Hypothesis

Based on social norms and political views, the following hypotheses were made prior to running the model:

1. Certain businesses under minority groups, e.g. African, would have more bias against them compared to other groups, such as North American and Western European.

28 2. Republican (red) states would have more bias than the other states.

3. The Southern region of the United States would have more bias than the other regions.

The hypotheses were informed by other studies of racism based on ethnic groups, locations, and political beliefs. Using the Pew Research Center’s 2016 Racial Attitudes in America Survey dataset, a study conducted by Lee et al. compared the experiences of discrimination of minorities (Black, Hispanic, and Asian) and Whites. Results showed that 63.10% of minorities experience racial discrimination, compared to 29.61% of Whites. Broken down by ethnic group, 69.45% of Blacks, 56.59% of Asians, and 45.01% of Hispanics reported the prevalence of discrimination in their experiences [21].

Figure 4-4: Proportion of Google Searches With the "N-word" [10]

Pasek et al. conducted surveys in 2012 to measure implicit racism among Democrats and Republicans by assessing how respondents express affect toward different racial groups. While 55% of Democrats expressed implicit anti-black attitudes, the percentage was higher for Republicans, at 64% [23]. Chae et al. studied area racism by utilizing the proportion of Google search queries containing the "N-word" in 196 designated market areas from 2004 to 2007; as illustrated in Figure 4-4, the rural Northeast and South regions of the United States showed high concentrations [10]. Using this information, the regression analysis seeks to determine whether the hypotheses hold and whether these trends are reflected in rating and review behavior on the online review platform Yelp.

4.3 Experimental Settings

4.3.1 Train and Test Distribution

Considering how people’s perspectives might change over the years, and since the Yelp data ranges from 2004 to 2019, the dataset is first divided by year before using Surprise to predict ratings. Each year’s data is then split into random train and test subsets of 75% and 25% of that year’s data, respectively. The model is not trained on a random three-quarters of the whole dataset across all years because each year’s dataset varies vastly in size. For example, the sum of reviews from 2004 to 2007 is 12,093, while the numbers of reviews in 2008 and 2019 alone are 23,906 and 432,275, respectively. Thus, for the results of this research, only reviews dated from 2008 to 2019 are analyzed. Additionally, by separating the datasets by year for training and testing, one can ensure that the same percentage of each year’s data is selected for the test set. As a result, the train and test dataset sizes are 2,185,237 and 728,420 reviews, respectively.
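The per-year 75/25 split described above can be sketched as follows (illustrative only; the thesis does not specify its exact implementation, and the `year` key and seed are assumptions):

```python
import random

def split_by_year(reviews, test_frac=0.25, seed=0):
    """Split review dicts (each with a 'year' key) 75/25 within every year."""
    by_year = {}
    for review in reviews:
        by_year.setdefault(review["year"], []).append(review)
    rng = random.Random(seed)
    train, test = [], []
    for year in sorted(by_year):
        rows = list(by_year[year])
        rng.shuffle(rows)                       # randomize within the year
        cut = round(len(rows) * (1 - test_frac))
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test

# 40 reviews from 2008 and 60 from 2019 -> 30/45 in train, 10/15 in test.
data = [{"year": 2008, "id": i} for i in range(40)] + \
       [{"year": 2019, "id": i} for i in range(60)]
train, test = split_by_year(data)
print(len(train), len(test))  # 75 25
```

Stratifying by year this way guarantees every year contributes the same 25% share to the test set, regardless of how uneven the yearly volumes are.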

4.3.2 Metrics for Evaluation

Dependent Variable: Percent Change

The dependent variable used in the linear regression is percent change from the actual to predicted ratings, which is calculated as follows:

$$\text{percent change} = \frac{pr_{ij} - ar_{ij}}{ar_{ij}} \times 100$$

where $pr_{ij}$ denotes the predicted rating and $ar_{ij}$ the actual rating for user $i$ on business $j$. Predicted ratings calculated from the Surprise recommender system indicate what the ratings should have been based on previous user rating behaviors. Thus, on a spectrum of bias ranging from negative to positive perspectives towards a business, predicted ratings would lie in the middle. If the actual rating falls on the left side of the spectrum, this suggests that a user had some bias against the given business, while the right side represents that the user viewed it more favorably.
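As a direct transcription of the metric (a sketch only):

```python
def percent_change(predicted, actual):
    """Percent change from the actual rating to the predicted rating."""
    return (predicted - actual) / actual * 100

# A predicted rating of 3.5 against an actual 1-star review:
print(percent_change(3.5, 1))  # 250.0
```

Positive values mean the actual rating came in below the expectation (possible bias against the business); negative values mean it came in above.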

Independent Variables

In a multivariable regression model, no independent variable should depend too strongly on the other independent variables if the result is to be meaningful. When independent variables are mutually dependent, they make no additional contribution toward explaining the dependent variable [28]. Thus, to analyze the relationships between independent variables, a correlation matrix is calculated. A correlation matrix using all independent variables can be found as a heatmap in Figure A-1; a simplified version showing only the high correlations appears in Figure 4-5. The list of pairs with a correlation magnitude greater than 0.5 is as follows:

• city count: state color red & state color blue

• state color blue: state color red

• state color purple: state region midwest

• state region west: state region midwest & state region south

Because these independent variables are categorical, the regression coefficients must be interpreted with reference to the numerical encoding of these variables [28]. Also, based on the above correlation coefficients, state colors and state regions cannot both be in the model together. The following subsections therefore explain two different regression models using state color and state region separately; they also specify which subgroup of the independent variables is selected as the reference variable. This allows a comparison to be drawn between the effects of the other subgroups on the dependent variable, relative to the reference variable.

Figure 4-5: Simplified Correlation Matrix
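Flagging the pairs with |correlation| above 0.5, as in the list above, is straightforward with NumPy (an illustrative sketch with made-up data):

```python
import numpy as np

def high_corr_pairs(X, names, threshold=0.5):
    """Return (name_i, name_j, corr) for column pairs with |corr| > threshold."""
    C = np.corrcoef(X, rowvar=False)  # correlation matrix over columns
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(C[i, j]) > threshold:
                pairs.append((names[i], names[j], round(float(C[i, j]), 3)))
    return pairs

a = np.array([1.0, 2.0, 3.0, 4.0])
b = 2.0 * a                           # perfectly correlated with a
c = np.array([1.0, -1.0, 2.0, -2.0])  # only weakly correlated with a and b
pairs = high_corr_pairs(np.column_stack([a, b, c]), ["a", "b", "c"])
print(pairs)  # [('a', 'b', 1.0)]
```

Any pair returned by such a filter signals variables that should not enter the same regression together, which is exactly how the state color and state region variables are handled below.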

4.4 OLS Regression Results

In this section, five different linear regression specifications are presented, introducing more variables each time and discussing the meaning behind the numbers shown in the results. To summarize the variables used: the analysis starts with a simple bivariate regression relating percent change to year, to see how percent change has altered over the years (specification 1). Next, users’ rating habits, price range, and business ratings and review counts are added to the model (specification 2). The model is again extended by including cuisines as categorical values (specification 3). Then, state color and state region variables are included separately to identify how percent change is affected by geographical factors (specifications 4 and 5).

Coefficients Analysis

A positive coefficient on a binary variable in the results denotes that, compared to the reference variable, that variable showed more bias. The reason is that the numerator of the percent change is calculated by subtracting the actual rating from the predicted rating; thus, a positive result shows that the actual rating was lower than predicted, representing bias. Regression analysis not only describes how changes in an independent variable affect the value of the dependent variable, it also statistically controls for every variable in the model. This means that regression analysis isolates each variable and estimates its effect on the dependent variable. In a regression with multiple independent variables in particular, a coefficient denotes how much the dependent variable is expected to increase when that independent variable increases by one, while holding all other variables constant [1, 16].

4.4.1 Specification 1: Effects from Year

Figure 4-6: Regression Results: Effects from Year

The estimated coefficient for year is 1.404. This indicates that with each passing year, the percent change increases by 1.4% on average (note that percent change is already multiplied by 100 to be in percentage form). This interpretation does not control for other factors that can affect the percent change; it simply estimates the trend in percent change over the years. The results could mean that users tend to show more bias in their rating behaviors over the years, which might be caused by increasing expectations as users are exposed to more varied and better businesses.

33 4.4.2 Specification 2: Effects from Users and Businesses

Figure 4-7: Regression Results: Effects from Users and Businesses

This model includes more variables that can affect how people review businesses after their experiences. As shown in Figure 4-7, notice how the coefficient for year has increased by an extra 1% compared to the last specification. When the new variables are introduced and controlled for, those factors cause the percent change to increase more over the years. The user average excluded variable and business ratings have very large, negative coefficients. In words, this means that as the business rating increases by 1, the percent change decreases by 29%. The large magnitude might result from how percent change is calculated. For example, suppose a user who rates 4s on average visits a restaurant with 4 stars, has a bad experience during the visit, and decides to leave only 1 star in the review. Using previous rating history and applying the Surprise recommender system, assume the expected rating comes out to be 3.5. Then, because the actual rating is in the denominator when calculating percent change, the lower it is, the greater the percent change becomes. In this example, the percent change would be calculated as (3.5 − 1)/1 × 100 = 250%. However, it makes sense that the user average excluded and business rating variables have big impacts on users’ rating behaviors. As the user average rating increases, it shows how users who generally give high ratings to other restaurants will continue to leave good ratings on the review platform. As business ratings become higher, users place more trust in those businesses and prioritize them for their next experiences.

While the number of business reviews did not affect the percent change, the price range variable shows a small effect of 0.24%, indicating that the more expensive the business is, the more likely customers are to express negative rather than positive reviews. City count, which measures the total number of businesses in the corresponding city, also affects the percent change by only 0.08% on average. However, including this variable lets the model control for the business density of the area in which each business is located. As described in the Background section, a variance inflation factor (VIF) higher than 10 indicates multicollinearity [7]. Because none of the variables exceeds the threshold, as shown in the last column of Figure 4-7, and no variables have high correlations within the group, this model does not have a multicollinearity issue.
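The VIF check described here can be sketched in plain NumPy; this is a generic implementation of the definition (VIF_j = 1/(1 − R_j²), with R_j² from regressing column j on the remaining predictors), not the thesis code:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of design matrix X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns (plus an intercept).
    """
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other predictors
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)
```

Independent columns yield VIFs near 1, while a nearly collinear column pushes the VIFs of the involved variables far past the threshold of 10.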

4.4.3 Specification 3: Extending with Ethnic Groups

Figure 4-8: Regression Results: Extending with Ethnic Groups

This model extends specification 2 with categorical variables for the ethnic groups of businesses, so that, while user and business attributes and time (year) are all controlled, the model describes which cuisines affect the percent change. This model can then answer the following question raised in the Introduction:

Given two similar restaurants with the only notable difference being the type of cuisine served or the ethnic identification of the restaurant, would users rate these restaurants differently due to some other biases they are holding? Using Middle Eastern as the reference variable, the three cuisines customers appear to hold bias against are African, West Asian, and Western European, each associated with at least a 1% increase in the percent change. The highest coefficient, on the African variable, means that when a business is categorized as African, the percent change increases by 3.44%; that is, the actual rating falls that much below the expected rating. On the contrary, both the North American and Southeast Asian categories decrease the percent change by around 1% on average, indicating that users express more favor toward these cuisines. With respect to the first hypothesis, while it holds true that under-represented ethnic groups such as African and West Asian face bias against their categories, it is surprising that Western European is part of the group, as it is one of the cuisines that dominates the United States. On the other hand, the categories rated more favorably are North American and Southeast Asian businesses, which is understandable, as American, Thai, and Vietnamese cuisines are familiar to users in the United States. Usually, a p-value below the significance level (typically 0.05) indicates that a variable is statistically significant. However, a high p-value does not prove that the variable has no effect in the model; it only suggests there is not enough evidence that such an effect exists in the population. An effect might exist, but it is possible that the effect size is too small, the sample size is too small, or there is too much variability for the hypothesis test to detect it [16]. Because the model shows low VIF values for all independent variables and low correlations have already been checked, the model still describes meaningful results.
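The reference-variable setup described above corresponds to standard dummy coding with one level dropped; a minimal sketch (the category names follow the text, the helper function is ours):

```python
def dummy_encode(categories, reference):
    """One-hot encode a categorical variable, dropping the reference level.

    Each remaining dummy's coefficient is then read relative to the
    reference category (here, Middle Eastern)."""
    levels = sorted(set(categories) - {reference})
    rows = [[1 if c == lvl else 0 for lvl in levels] for c in categories]
    return rows, levels

rows, levels = dummy_encode(
    ["African", "Middle Eastern", "North American"], reference="Middle Eastern"
)
print(levels)  # ['African', 'North American']
print(rows)    # [[1, 0], [0, 0], [0, 1]]
```

The reference category is encoded as all zeros, which is why its effect is absorbed into the intercept rather than getting its own coefficient.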

4.4.4 Specification 4: Including States, by State Color

Based on the correlations between independent variables and the OLS estimator assumptions, it is evident that the city count variable cannot coexist with the state color variables in the linear regression model. Similarly, the blue and red state color variables should not be in the model together. Even with one of them as a reference variable, the state region Midwest and the state color purple remain so highly correlated that state color and state region had to be examined separately. Ensuring that all independent variables are independent of one another is extremely important under the Gauss-Markov assumptions [31].
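The pairwise-correlation screening implied here can be sketched as follows (a generic helper, not the thesis code; the 0.8 threshold is an illustrative choice):

```python
import numpy as np

def high_corr_pairs(X, names, threshold=0.8):
    """Flag pairs of independent variables whose absolute Pearson
    correlation exceeds the threshold (candidates for exclusion)."""
    C = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(C[i, j]) > threshold:
                pairs.append((names[i], names[j], C[i, j]))
    return pairs
```

Any variable pair flagged by such a check (e.g. city count with state colors, or Midwest with purple) would be kept in separate specifications.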

Figure 4-9: Regression Results: Including States, by State Color

Figure 4-9 shows the result of the OLS regression using state color variables. In this model, the reference variables are the Middle Eastern cuisine and the red state color; note that the city count variable is omitted. Compared to the previous specification, most variables have similar coefficients except East Asian, whose coefficient increased from -0.162% to 0.092%. This change is subtle, suggesting that the East Asian category, in fact, does not have a large effect on the percent change. This interpretation follows from how the African category also shows a 0.2% increase and the Eastern European and Mediterranean categories show about a 0.1% increase. These differences show that the restaurant categories suspected to receive negative bias show this bias to an even higher degree when the state colors are introduced and controlled.

Analyzing this regression model, which controls for state color, the second hypothesis holds true: the results indicate that businesses in both blue and purple states tend to be rated higher than those in red states. This result thus supports the survey finding that Republicans tend to express more implicit anti-black attitudes than Democrats [23], which is also reflected in how users rate businesses on the online review platform Yelp. Similar to the previous specification, all variables have VIF values between 1 and 2.5, well below the multicollinearity threshold of 10. No variables have high correlations within the group and, with these low VIF values, this model does not have a multicollinearity issue.

4.4.5 Specification 5: Including States, by State Region

Because the city count does not show high correlation with the state regions, the variable is added back into this regression model. The region west variable is used as the reference variable, as it showed high correlations with two of the other region variables. Therefore, the other regions are analyzed relative to the West as their reference category. Figure 4-10 shows the result of the OLS regression using state region variables.

Figure 4-10: Regression Results: Including States, by State Region

Because this model extends specification 2, when the results are compared it is notable that Latin had the largest change, its coefficient increasing from -0.3% to 0.04%. This is similar to the East Asian cuisine in specification 4, where the coefficient changed from negative to positive. The rest of the cuisine coefficients generally became more "extreme": the positive ones became more positive and the negative ones more negative. The newly added city count variable has very little effect on the percent change, similar to the number of business reviews. However, compared to specification 3, the coefficient decreased from 0.083% to 0.004%. This indicates that once the state regions are included as independent variables, counting the number of businesses at the city level becomes less significant. The other new variables, the state regions, show in the OLS results that the Midwest, Northeast, and South regions all generally give higher ratings compared to the West. This contradicts the third hypothesis that the South region would express more biased behaviors. Compared to Figure 4-4, where the Northeast and South regions had a higher number of Google searches with the "N-word," the regression results reveal that those regions tend to rate all businesses higher than expected. As in the previous specifications, this model also shows low VIF values, and no independent variable has a high correlation with another, indicating there is no multicollinearity.
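The four-region grouping can be represented as a simple lookup table (the state assignments below are an illustrative subset of the standard Census regions, not the thesis's exact list):

```python
# Illustrative subset of a state-to-region lookup; the thesis groups all
# states into four regions (West, Midwest, Northeast, South).
REGION_BY_STATE = {
    "CA": "West", "OR": "West", "AZ": "West",
    "OH": "Midwest", "IL": "Midwest", "WI": "Midwest",
    "MA": "Northeast", "NY": "Northeast", "PA": "Northeast",
    "TX": "South", "GA": "South", "NC": "South",
}

def region_of(state, lookup=REGION_BY_STATE):
    """Map a two-letter state code to its region label."""
    return lookup.get(state, "Unknown")
```

Each business's state is mapped to a region this way, and the regions are then dummy-encoded with West as the dropped reference level.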

4.5 Discussion

One factor suggesting that the regression models used in the analysis are trustworthy is that, despite the different variables included and excluded, most of the coefficients in each specification generally stayed the same. This ensures that when new variables are introduced, the previous variables remain fully controlled, making the analysis of each specification independent of the others. For example, when specification 5 is analyzed, it is already known from specification 3 how each cuisine affects the percent change, so when the state region variables are introduced, we can isolate the location effect and determine its relationship to the dependent variable, percent change.

From the OLS regression results, the hypotheses are tested and the results are as follows:

1. Certain businesses under minority groups, e.g. African, were found to have more bias against them, but some groups, such as Western European cuisines, surprisingly also had more negative reviews.

2. Republican (red) states had more bias than the other states (blue and purple).

3. The Northeastern and Southern regions were found to rate favorably more often than the other two regions.

From this analysis, bias similar to that documented in social psychology regarding ingroups and outgroups was found on the online review platform Yelp. Although not all results follow the well-known biased perspectives people usually think of, it is important to note that these types of bias can exist in a dataset.

Chapter 5

Conclusion

This paper analyzed a dataset from the online review platform Yelp to determine the prevalence of discriminatory behaviors in users' review and rating habits. The expected rating of a business is calculated using the SVDpp algorithm from the Surprise recommender system, which uses implicit feedback from users' past ratings to make predictions. With the percent change from the actual to the expected rating as the dependent variable, OLS linear regression models are established. Using different variables for each specification, the regression model results are analyzed in depth to find the true relationship from each variable that leads to bias in rating habits. Independent variable selection started by determining variables from the dataset that would affect the percent change. Those variables were year, the user's average ratings, the business's average ratings and review counts, and the price range of the business. These variables were then used as control variables when new independent variables were introduced. The results indicated that users tend to express more bias against businesses categorized as African, West Asian, and Western European. On the other hand, restaurants categorized as North American and Southeast Asian were more favored in the reviews. The models also demonstrated the effect on the percent change of the state in which a business is located. The United States was divided in two different ways: by state color, based on political views, and by four regions. The results indicated that Republican (red) states tend to show more bias against businesses than Democratic (blue) states.
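The expected-rating step can be illustrated with a bare-bones matrix factorization trained by SGD. This is the core idea behind SVD-style recommenders such as Surprise's SVDpp (which additionally models user/item biases and implicit feedback); it is a sketch of the technique, not the thesis's actual pipeline:

```python
import numpy as np

def factorize(R, observed, k=2, steps=500, lr=0.02, reg=0.02, seed=0):
    """Fit user/item latent factors to observed ratings by SGD.

    R: (users x items) rating matrix; observed: boolean mask of known
    entries. The predicted rating for (u, i) is the dot product P[u] @ Q[i]."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = 0.1 * rng.normal(size=(n_users, k))
    Q = 0.1 * rng.normal(size=(n_items, k))
    us, its = np.nonzero(observed)
    for _ in range(steps):
        for u, i in zip(us, its):
            err = R[u, i] - P[u] @ Q[i]       # residual on one observed rating
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q

# Tiny example: 3 users x 3 businesses, with zeros marking unknown ratings
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0]])
observed = R > 0
P, Q = factorize(R, observed)
pred = P @ Q.T  # "expected ratings", including the unobserved cells
```

The fitted `pred` matrix plays the role of the expected ratings: comparing it against the actual ratings left by users yields the percent-change measure analyzed in this thesis.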

It also showed that the West generally rates with more biased behavior than the rest of the regions. Not only does this analysis examine user rating behavior, showing how users' rating habits or how highly rated a business is can affect their future choices, but it also delves into how cultural and political perspectives can affect users' preferences. This information will help businesses be aware of these biases so that they can account for them when representing data. For future work, other researchers can be aware that a non-experimental factor, such as social psychology, can exist in a dataset, and can take that into consideration when using the dataset with machine learning algorithms. Many studies have shown that imbalanced or biased datasets produce skewed model outputs. Thus, it is crucial that the input data is fully examined and that possible obstacles or biases are removed as needed. This research can be extended by incorporating new information, such as the demographics of the users or the business owners. With this new information, similar studies would be able to identify more detailed, direct relationships between customers' demographic biases and their ratings of certain category groups. Further studies may also find it fruitful to differentiate chain restaurants from small businesses, to see whether user ratings are harsher on one type or the other.

Appendix A

Figures

Figure A-1: A Heatmap of Independent Variables Correlation Matrix

Bibliography

[1] Interpreting regression output.

[2] List of regions of the United States.

[3] Red states and blue states.

[4] U.S. Census Bureau QuickFacts: United States.

[5] E. C. Alexopoulos. Introduction to multivariate regression analysis, Dec 2010.

[6] Asim Ansari, Yang Li, and Jonathan Z. Zhang. Probabilistic topic model for hybrid recommender systems: A stochastic variational bayesian approach. Marketing Science, 37(6):987–1008, 2018.

[7] David A Belsley, Edwin Kuh, and Roy E Welsch. Regression diagnostics: Identifying influential data and sources of collinearity, volume 571. John Wiley & Sons, 2005.

[8] Marilynn B Brewer. The social psychology of intergroup relations: Social categorization, ingroup bias, and outgroup prejudice. 2007.

[9] Joy Buolamwini. Gender shades : intersectional phenotypic and demographic evaluation of face datasets and gender classifiers. PhD thesis, 01 2017.

[10] David H. Chae, Sean Clouston, Mark L. Hatzenbuehler, Michael R. Kramer, Hannah L. F. Cooper, Sacoby M. Wilson, Seth I. Stephens-Davidowitz, Robert S. Gold, and Bruce G. Link. Association between an internet-based measure of area racism and black mortality. PLOS ONE, 10(4):1–12, 04 2015.

[11] Jiawei Chen, Hande Dong, Xiang Wang, Fuli Feng, Meng Wang, and Xiangnan He. Bias and debias in recommender system: A survey and future directions, 2020.

[12] Dan Cosley, Shyong K. Lam, Istvan Albert, Joseph A. Konstan, and John Riedl. Is seeing believing? how recommender system interfaces affect users’ opinions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’03, page 585–592, New York, NY, USA, 2003. Association for Computing Machinery.

[13] Stewart I. Donaldson and Elisa J. Grant-Vallone. Understanding self-report bias in organizational behavior research, Dec 2002.

[14] Benjamin G. Edelman and Michael Luca. Digital discrimination: The case of airbnb.com. SSRN Electronic Journal, 2014.

[15] Batya Friedman and Helen Nissenbaum. Bias in computer systems. ACM Trans. Inf. Syst., 14(3):330–347, July 1996.

[16] Jim Frost. Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models. Statistics By Jim Publishing, 2019.

[17] Longke Hu, Aixin Sun, and Yong Liu. Your neighbors affect your ratings: On geographical neighborhood influence to rating prediction. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’14, page 345–354, New York, NY, USA, 2014. Association for Computing Machinery.

[18] Nicolas Hug. Surprise: A python library for recommender systems. Journal of Open Source Software, 5(52):2174, 2020.

[19] Rajeev Kumar, B. K. Verma, and Shyam Sunder Rastogi. Social popularity based svd++ recommender system. International Journal of Computer Applications, 87(14), 2014.

[20] Anja Lambrecht and Catherine Tucker. Algorithmic bias? an empirical study of apparent gender-based discrimination in the display of stem career ads. Management Science, 65(7):2966–2981, 2019.

[21] Randy T. Lee, Amanda D. Perez, C. Malik Boykin, and Rodolfo Mendoza-Denton. On the prevalence of racial discrimination in the united states. PLOS ONE, 14(1):1–16, 01 2019.

[22] Yan Leng, Rodrigo Ruiz, Xiaowen Dong, and Alex ‘Sandy’ Pentland. Interpretable recommender system with heterogeneous information: A geometric deep learning perspective. SSRN Electronic Journal, 2020.

[23] Josh Pasek, Jon A. Krosnick, and Trevor Tompson. The impact of anti-black racism on approval of Barack Obama’s job performance and on voting in the 2012 presidential election, Oct 2012.

[24] Cecil Robinson and Randall E Schumacker. Interaction effects: centering, variance inflation factor, and interpretation issues. Multiple linear regression viewpoints, 35(1):6–11, 2009.

[25] Roland Rust and Ming-Hui Huang. The service revolution and the transformation of marketing science. Marketing Science, 33:206–221, 03 2014.

[26] Ananth G. S. and Raghuveer K. A movie and book recommender system using surprise recommendation. SSRN Electronic Journal, 2020.

[27] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Incremental singular value decomposition algorithms for highly scalable recommender systems. In Fifth international conference on computer and information science, volume 1, pages 27–8. Citeseer, 2002.

[28] Astrid Schneider, Gerhard Hommel, and Maria Blettner. Linear regression analysis: Part 14 of a series on evaluation of scientific publications, Nov 2010.

[29] Latanya Sweeney. Discrimination in online ad delivery, 2013.

[30] Pei-Ju Lucy Ting, Szu-Ling Chen, Hsiang Chen, and Wen-Chang Fang. Using big data and text analytics to understand how customer experiences posted on yelp.com impact the hospitality industry. Contemporary Management Research, 13(2):107–130, 2017.

[31] Marno Verbeek. Using linear regression to establish empirical relationships. IZA World of Labor, 01 2017.
