
An Empirical Study Identifying Bias in Yelp Dataset

by Seri Choi

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY February 2021 © Massachusetts Institute of Technology 2021. All rights reserved.

Author...... Department of Electrical Engineering and Computer Science Jan 25, 2021

Certified by...... Alex Pentland Professor of Media Arts and Sciences Thesis Supervisor

Accepted by ...... Katrina LaCurts Chair, Master of Engineering Thesis Committee

Submitted to the Department of Electrical Engineering and Computer Science on Jan 25, 2021, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

Online review platforms have become an essential element of the business industry, providing users with in-depth information on businesses and on other users’ experiences. The purpose of this study is to examine possible bias or discriminatory behavior in users’ rating habits in the Yelp dataset. The Surprise recommender system is utilized to produce expected ratings for the test set, training the model on 75% of the original dataset to learn rating trends. Then, ordinary least squares (OLS) linear regression is applied to identify which factors affect the percent change between actual and predicted ratings, and which categories or locations show more bias than others. This paper provides insight into ways that bias can manifest within a dataset due to non-experimental factors such as social psychology; future research can take such non-experimental factors, including the discriminatory bias found in Yelp reviews, into consideration in order to reduce bias when utilizing machine learning algorithms.

Thesis Supervisor: Alex Pentland Title: Professor of Media Arts and Sciences

Acknowledgments

I would like to acknowledge the dedication and effort made by my advisors, Professor Pentland and Yan Leng, towards expanding my understanding and the research process.

I would also like to acknowledge my family for unconditional love and support.

Contents

1 Introduction

2 Related Work
2.1 Recommender Systems
2.2 Bias in Social Psychology
2.3 Bias in Recommender Systems

3 Background
3.1 Surprise Recommender System
3.2 Linear Regression
3.2.1 Ordinary Least Squares
3.2.2 Multicollinearity

4 Empirical Results
4.1 Data Descriptions
4.1.1 Business Data
4.1.2 User Data
4.1.3 Review Data
4.2 Hypothesis
4.3 Experimental Settings
4.3.1 Train and Test Distribution
4.3.2 Metrics for Evaluation
4.4 OLS Regression Results
4.4.1 Specification 1: Effects from Year
4.4.2 Specification 2: Effects from Users and Businesses
4.4.3 Specification 3: Extending with Ethnic Groups
4.4.4 Specification 4: Including States, by State Color
4.4.5 Specification 5: Including States, by State Region
4.5 Discussion

5 Conclusion

A Figures

List of Figures

2-1 The Life Cycle of Recommender System Biases [11]

3-1 Workflow of a Recommender System [26]

4-1 8 Components of Business Data
4-2 Division by State Colors
4-3 Division by State Regions
4-4 Proportion of Google Searches With the "N-word" [10]
4-5 Simplified Correlation Matrix
4-6 Regression Results: Effects from Year
4-7 Regression Results: Effects from Users and Businesses
4-8 Regression Results: Extending with Ethnic Groups
4-9 Regression Results: Including States, by State Color
4-10 Regression Results: Including States, by State Region

A-1 A Heatmap of Independent Variables Correlation Matrix

List of Tables

4.1 Category Group Classification
4.2 Summary Statistics of Data Dividing By State Color
4.3 Summary Statistics of Data Dividing By State Region

Chapter 1

Introduction

As information technology has rapidly grown over the years, communication, mobile, and networking have become imperative factors for many businesses [25]. Both the scale and usage of digital platforms have also increased in society. Customers use information on those platforms to determine which businesses, such as restaurants or salons, can provide their next best experience. In fact, such exposure prior to visiting a business can trigger users to build some bias, whether positive or negative. Furthermore, some users might compare their previous experiences or apply their prejudice against the new businesses they visit. Discrimination is a topic that has been pervasive throughout history; it has even affected the digital space. For the online rental marketplace Airbnb, research showed that non-black hosts are able to charge approximately 12% more than black hosts, given equal location and quality of the properties [14].

Much research has utilized online review metrics and recommender systems to suggest similar products to consumers or to forecast the performance or revenue of businesses [30]. In this paper, however, a recommender system is applied in a different context, where predicted values denote the standardized value for the population. Using a dataset from the online review platform Yelp, the goal of this paper is to discern whether discriminatory ratings disproportionately impact the businesses of certain nationality groups. Given two restaurants that have the same quality, table service, and ambience, with the only notable difference being the type of cuisine served or the ethnic identification of the restaurant, would users rate these restaurants differently due to other biases they hold? By analyzing users’ behaviors in how they rate businesses on an online review platform, this paper aims to help determine the prevalence of biased or discriminatory reviews. This will allow a better understanding of how users’ biased perceptions affect even the most seemingly straightforward inputs, such as their ratings of an establishment’s product or service, as well as provide awareness of possibly biased datasets that can affect the results of machine learning algorithms.

Chapter 2

Related Work

2.1 Recommender Systems

Personalized recommendations have become widely available due to the widespread adoption of recommender systems. Especially with the availability of extensive data collected on both users and businesses, the design of recommender systems has placed greater weight on incorporating more information, such as user-generated content, to improve the quality of recommendations [22]. For example, Ansari et al. integrate user reviews, user-generated tags, and product covariates into a probabilistic topic model framework, which can characterize products in terms of latent topics and specify user preferences [6].

Other studies have focused on using geographical data to enhance recommendations. Hu et al. utilized Yelp business review data to study business rating prediction. They observed a weak positive correlation between a business’s average rating and the average rating of its neighbors, regardless of business category [17]. A study conducted by Cosley et al. utilizes the MovieLens recommender system, which uses collaborative filtering to make recommendations, to identify whether exposing users to predicted ratings affects the way they rate movies. While their experiments conclude that users’ opinions are influenced when re-rating a movie they have already watched, it was not identifiable how users would behave for movies they have not yet seen [12]. Similarly, there can be other factors that influence users’ opinions, which will be reflected in the dataset as well as in machine learning model results. It is important that recommender system designers and researchers focus on delivering better recommendations for users, so it is essential that the input data to the model be as accurate as possible.

Therefore, this paper aims to discover any non-experimental factors that can alter the analysis, indicating bias in a dataset.

2.2 Bias in Social Psychology

Brewer explains in detail the social psychology of intergroup relations. The social world can be divided into two groups: ingroups and outgroups. This partition is caused by basic cognitive categorization processes, emerging from group-based attitudes, perceptions, and behavior. Ingroup favoritism, caused by attachment to and preference within the group, gives rise to intergroup discrimination, irrespective of attitudes toward specific outgroups [8].

The Census Bureau estimated the 2020 population to be 76.3% White, 13.4% Black, 18.5% Hispanic, and 5.9% Asian [4]. Attitudes and emotions toward specific outgroups reflect appraisals of the nature of the relationships between ingroup and outgroup that have implications for the maintenance or enhancement of ingroup resources, values, and well-being [8]. A division into ingroups and outgroups can easily occur based on the size difference between the groups. Although the racial demographics of Yelp users are not publicly available, it would be interesting to find out which ethnic group is most affected and would be considered an outgroup.

Donalson et al. suggest that different social and personal aspects can influence how someone responds, with bias, to questions posed by organizational researchers. Some factors include agreement among co-workers, propensity to give socially desirable responses, and fear of reprisal [13]. This paper seeks to find out if this type of social-psychological bias is also reflected in the online business industry, focusing especially on racial aspects.

2.3 Bias in Recommender Systems

Many machine learning algorithms were initially introduced as innovative mediums that can learn and make predictions without explicit commands being given to a computer. However, researchers have started to notice evidence of algorithmic bias over the years. For example, Sweeney shows that a background check service’s ads were more likely to appear in a paid search ad displayed after a search for names traditionally associated with African-Americans [29]. Buolamwini et al. experimented on three different facial-analysis programs and found that the error rates in determining the gender of light-skinned men were never worse than 0.8%, while for darker-skinned women, the error rates increased up to 34.7%. These results were caused by the unbalanced phenotypical datasets used for the classification models. The study brings up the importance of building inclusive training sets and benchmarks in order to avoid bias and misclassification [9]. Another study researched how women saw fewer job opportunity ads in the Science, Technology, Engineering and Math (STEM) fields compared to men, due to a disproportionate gender environment. Their results demonstrate that even when decisions are made by algorithms and human biases are removed, the outcome may still disadvantage one group relative to another [20].

Figure 2-1: The Life Cycle of Recommender System Biases [11]

Research conducted by Chen et al. provides a systematic overview of existing work on recommender system biases. Figure 2-1 demonstrates three stages of the life cycle of recommendation in a feedback loop: User, Data, and Model. One of the key types of bias relevant to this paper is unfairness. Unfairness can be defined as a situation in which "the system systematically and unfairly discriminates against certain individuals or groups of individuals in favor of others" [15]. Unfairness has been an overbearing issue deeply entrenched within American society, and it also influences recommender systems to act similarly. This is caused by unequally represented user groups in the dataset, based on attributes like race, gender, age, educational level, or wealth. When trained on such unbalanced data, models can produce systematic discrimination and reduced representation for certain groups, reinforcing racial stereotypes [11]. Using this definition, the goal of this paper is to identify unfairness in the Yelp data that might have arisen from social psychology.

Chapter 3

Background

3.1 Surprise Recommender System

Figure 3-1: Workflow of a Recommender System [26]

A recommender system is a machine that provides personalized information by being trained on users’ previous ratings of businesses, through two main approaches: collaborative filtering and content-based filtering. Collaborative filtering recommends products based on the interests of users with similar ratings, while content-based filtering recommends products similar to other products in which the user has expressed interest [22]. In 2017, Nicolas Hug created Surprise, a Python scikit package that supports multiple recommendation system algorithms [26]. One of these algorithms is Singular Value Decomposition++ (SVDpp), an extension of SVD that takes implicit ratings into account [18]. SVD-based algorithms can produce high-quality recommendations in a short time, and their results indicate better performance than a traditional collaborative filtering algorithm on a MovieLens dataset [27]. In order to identify bias within the dataset, there needs to exist a baseline norm to determine whether a user rated a given business lower or higher than expected. Implicit feedback used in an SVDpp algorithm refers to a user’s past history of rating behaviors, which can help infer the user’s preferences [19]. Therefore, the SVDpp algorithm of the Surprise recommendation system is the best model for this paper to predict and set the expected ratings for businesses based on all users’ past rating behaviors.

3.2 Linear Regression

Linear regression is a procedure that estimates the coefficients of a linear equation, consisting of one or more independent variables, that best predicts the value of the quantitative dependent variable [5].

3.2.1 Ordinary Least Squares

A linear regression model is typically based on ordinary least squares (OLS) estimates, which are calculated by minimizing the sum of squared differences between the dependent variable observed in the data and the values predicted by the linear function. There are two important properties of the OLS estimator: unbiasedness and efficiency. Unbiasedness means that the expected value of the estimator equals the true parameter value; relatedly, a consistent estimator becomes closer to the true value as the sample size grows. Efficiency denotes that OLS is the most accurate estimator given the available sample and the assumptions made [31].
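To make the least-squares idea concrete, the following short sketch (illustrative only, not from the thesis) fits a line with NumPy; `np.linalg.lstsq` directly minimizes the sum of squared residuals described above:

```python
import numpy as np

# Toy data generated exactly by y = 1 + 2x, so OLS should recover [1, 2].
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix with an intercept column; lstsq minimizes ||A @ beta - y||^2.
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # ~ [1. 2.]
```

Because the toy data are exactly linear, the residual sum of squares is zero and the estimated coefficients match the true ones.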

One of the assumptions when using the OLS estimator is that there is no multicollinearity, meaning there should be no exact linear relationship among the independent variables. Under these assumptions, the Gauss-Markov theorem states that the OLS estimator has the lowest sampling variance within the class of linear unbiased estimators [31].

3.2.2 Multicollinearity

When a full regression model is run, multicollinearity among independent variables can occur, which can inflate the variance of the coefficient estimates. One way to identify multicollinearity in a linear regression model is the variance inflation factor (VIF) [24], which is calculated as follows:

$$VIF_x = \frac{1}{1 - R_x^2}$$

where $R_x^2$ represents the R-squared value obtained when variable $x$ is regressed on the rest of the variables in the data. The VIF value ranges from 1 to infinity, but it should not exceed 10, as that indicates multicollinearity [7].
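The VIF formula can be computed directly: regress each column on the others, take the R-squared, and apply $1/(1-R^2)$. Here is a minimal sketch using plain NumPy (the function name and test data are hypothetical, not from the thesis):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the data matrix X (n x k)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on all the other columns (plus an intercept).
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)              # independent of x0
x2 = x0 + 0.01 * rng.normal(size=200)  # nearly collinear with x0
vifs = vif(np.column_stack([x0, x1, x2]))
# vifs[1] stays near 1, while vifs[0] and vifs[2] blow up far past 10
```

The near-duplicate column drives its VIF well above the threshold of 10, flagging the multicollinearity that the text describes.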

Chapter 4

Empirical Results

This section introduces the main dataset, Yelp, used in this paper for empirical analysis. After the experimental setup is described, linear regression models based on ordinary least squares (OLS) estimates are analyzed to understand which factors affect users’ rating behaviors.

4.1 Data Descriptions

This section introduces the dataset provided by Yelp,1 an online review platform where users may rate and write reviews on businesses, such as restaurants, grocery stores, and department stores. Yelp collects business information by gathering it from the owners themselves or by crowdsourcing from users’ experiences. The Yelp dataset is very rich; it contains over 8 million reviews and information on roughly 2 million users and 200,000 businesses.

4.1.1 Business Data

Figure 4-1 summarizes 8 different components from the business data that is used for the analysis. Each will be explained in more detail in this section.

1The data can be obtained at https://www.yelp.com/dataset/.

Figure 4-1: 8 Components of Business Data

Price Range

There are over 1.4 million business attributes like hours, parking, availability, and ambience to provide more high-dimensional information on each business. However, there are only 39 unique types of attributes. Among those, the only attribute type that is used for this paper is the price range of the business. The price range spans from 1 to 4, where a higher number denotes a more expensive product or service. Those that do not have a price range as part of their attributes get a 0 as a default.
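Extracting this attribute from a raw business record might look like the sketch below. The key name `RestaurantsPriceRange2` is the one commonly seen in Yelp dataset dumps but should be treated as an assumption here; the default of 0 mirrors the text above:

```python
def price_range(attributes):
    """Return the business price range 1-4, or 0 when the attribute is absent.

    `attributes` is the (possibly None) attribute dict of a business record;
    'RestaurantsPriceRange2' is assumed to be the Yelp key for price range.
    """
    if not attributes:
        return 0
    try:
        return int(attributes.get("RestaurantsPriceRange2"))
    except (TypeError, ValueError):
        return 0

price_range({"RestaurantsPriceRange2": "2"})  # -> 2
price_range(None)                             # -> 0
```

The try/except also covers records where the attribute is stored as a non-numeric string, which then fall back to the default of 0.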

Categories + Category Groups

Within the business data, there is also a list of categories that represent each business. For example, for a taco place, the list would include Restaurants, Mexican, and Tacos. As shown, each business may belong to multiple categories. Because this research is focused on identifying bias towards businesses of certain ethnic groups, categories that represent nationalities are manually chosen from the list of all categories in the dataset. As a result, there are 91 filtered categories, which include American (Traditional), Korean, Brazilian, and Ethiopian. While Yelp also includes businesses other than restaurants, after filtering the dataset based on the chosen categories, 100% of the businesses included “Restaurants” as one of their categories. Although these categories many times likely do correlate with the ethnicities of the business owners, in order to avoid making assumptions, e.g. assuming that a taco place is run by a Mexican-American owner, the categories represent the cuisines of the restaurants rather than ethnic groups. These categories are then classified into 10 different groups, as shown in Table 4.1. As a categorical value, all category groups applicable to a restaurant are indicated using one-hot encoding for each review.

Group              Categories
African            South African, Ethiopian, Moroccan
North American     American, Southern, Canadian, Tex-Mex
Latin American     Cuban, Colombian, Puerto Rican, Haitian
West Asian         Afghan, Indian, Pakistani, Arabian
East Asian         Chinese, Japanese, Korean, Taiwanese
Southeast Asian    Indonesian, Malaysian, Thai, Vietnamese
Western European   Italian, French, German, Belgian, Irish
Eastern European   Hungarian, Ukrainian, Czech/Slovakian
Mediterranean      Persian/Iranian, Lebanese, Syrian
Middle Eastern     Greek, Egyptian, Portuguese, Spanish

Table 4.1: Category Group Classification

State Color + State Region

With regards to location information, it should be noted that, although the Yelp dataset includes data from businesses in many countries, this paper only takes into consideration businesses from the dataset that are located within the United States. There are two ways, for the purpose of this study, that these businesses can be broken down by state:

• by color: red, blue, or purple

25 • by region: northeast, midwest, south, or west

Figure 4-2: Division by State Colors

Figure 4-3: Division by State Regions

The state color classification is based on the results of the 2008, 2012, 2016, and 2020 presidential elections by state, as shown in Figure 4-2. Blue shades represent states carried by the Democrats in three or four of the four elections, while red shades represent states that were, conversely, carried by the Republicans. The purple shade represents states carried by each party twice in the four elections [3]. Figure 4-3 shows the four statistical regions defined by the United States Census Bureau [2]. These classifications help the model analyze the United States in two different contexts. Similar to category group, the state color and state region variables are considered categorical, so they are converted numerically using one-hot encoding. For example, for a restaurant in Minnesota, classified as a blue state in the midwest region, the data would appear as follows:

[red, blue, purple] = [0, 1, 0]

[northeast, midwest, south, west] = [0, 1, 0, 0]
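The encodings above can be produced with a small helper like the following (an illustrative sketch; a real pipeline would typically use a library encoder):

```python
def one_hot(value, levels):
    """Encode a categorical value as a 0/1 indicator list over `levels`."""
    return [1 if value == level else 0 for level in levels]

# A Minnesota restaurant: blue state, midwest region.
print(one_hot("blue", ["red", "blue", "purple"]))                     # [0, 1, 0]
print(one_hot("midwest", ["northeast", "midwest", "south", "west"]))  # [0, 1, 0, 0]
```

Each categorical variable thus expands into one binary column per level, matching the bracketed vectors shown above.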

A further analysis is done by summarizing statistics of the data when the locations of businesses are divided by state colors and regions.

                        Red             Blue            Purple
Rating count            312,886         367,226         48,308
User count              167,006         215,093         27,174
Business count          8,904           8,178           3,034
Average rating (std.)   3.782 (1.400)   3.818 (1.370)   3.687 (1.401)

Table 4.2: Summary Statistics of Data Dividing By State Color

                        Northeast       Midwest         South           West
Rating count            45,761          72,603          67,242          542,814
User count              25,309          41,141          36,654          302,071
Business count          2,283           4,338           2,827           10,668
Average rating (std.)   3.726 (1.346)   3.686 (1.379)   3.702 (1.389)   3.826 (1.387)

Table 4.3: Summary Statistics of Data Dividing By State Region

City + Review Counts + Ratings

The information provided in the dataset regarding cities is also used, mapping each city to the total number of restaurants in that city, to control for the concentration of businesses considered. This is necessary because, when more restaurants are available in an area, each restaurant has a higher chance of being visited and reviewed. Finally, business review counts and ratings are used in the model to identify whether higher review counts or a higher rating affect users’ perspectives.

4.1.2 User Data

The user data includes the number of reviews the user has written, a list of the user’s friends, the number of votes sent by the user and received from other users, and more. There are two important attributes to consider within the user data: review count and the average rating across all of the user’s reviews. Using this information, for each review, the user’s average rating excluding the corresponding review is calculated as follows:

$$\frac{rc_i \times avgr_i - r_j}{rc_i - 1} \quad \text{for } rc_i > 1$$

where $rc_i$ represents the review count for user $i$, $avgr_i$ the average rating for user $i$, and $r_j$ the rating of review $j$. This variable allows the model to control for the user’s previous rating habits, as one user may rate mostly 2s while another may rate mostly 5s. This feature helps ensure that the results are not skewed by the tendency of specific users to rate businesses conservatively or liberally overall.
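The leave-one-out average above can be written directly as a small function (a sketch; the names are illustrative):

```python
def avg_excluding(rc_i, avgr_i, r_j):
    """User i's average rating with review j removed (requires rc_i > 1)."""
    if rc_i <= 1:
        raise ValueError("need more than one review to exclude one")
    return (rc_i * avgr_i - r_j) / (rc_i - 1)

# A user with 4 reviews averaging 3.0; dropping a 1-star review
# raises the remaining average to (4 * 3.0 - 1) / 3 = 11/3 ≈ 3.67.
print(round(avg_excluding(4, 3.0, 1), 2))  # 3.67
```

Recovering the excluded-review average this way avoids re-scanning the user's full review history for every row.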

4.1.3 Review Data

The review data is the main numerical data used in this research to determine the existence of bias in the dataset. Yelp review data includes the user ID, business ID, and review rating. Using the IDs, an $n \times m$ user-business matrix is created, where $n$ represents the number of users and $m$ the number of businesses. Each entry $r_{ij}$ in the matrix denotes a rating $r$, ranging from 1 to 5, given by user $i$ to business $j$. For example, if there are 2 users and 3 businesses, the matrix could be as follows:

$$M = \begin{bmatrix} 3 & 0 & 4 \\ 2 & 5 & 0 \end{bmatrix}$$

This matrix shows that $u_1$ rated 3 and 4 for $b_1$ and $b_3$, respectively, while $u_2$ rated 2 and 5 for $b_1$ and $b_2$, respectively. This matrix is then used by the Surprise recommendation system, explained in the Background chapter, to train on users’ rating behaviors, allowing the recommender system to predict expected rating values for other users and businesses.
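The toy matrix above can be assembled from (user, business, rating) triples as follows. Note this is purely illustrative: in practice Surprise consumes the rating triples directly rather than a dense matrix, and 0 here only marks "no rating observed":

```python
import numpy as np

reviews = [("u1", "b1", 3), ("u1", "b3", 4), ("u2", "b1", 2), ("u2", "b2", 5)]
users = sorted({u for u, _, _ in reviews})       # ['u1', 'u2']
businesses = sorted({b for _, b, _ in reviews})  # ['b1', 'b2', 'b3']

# n x m user-business matrix; 0 marks an unobserved rating.
M = np.zeros((len(users), len(businesses)), dtype=int)
for u, b, r in reviews:
    M[users.index(u), businesses.index(b)] = r
print(M)  # [[3 0 4]
          #  [2 5 0]]
```

The sparsity of this matrix (most users rate very few businesses) is exactly why latent-factor methods such as SVDpp are used to fill in the expected ratings.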

4.2 Hypothesis

Based on social norms and political views, the following hypotheses were made prior to running the model:

1. Certain businesses under minority groups, e.g. African, would have more bias against them compared to other groups, such as North American and Western European.

28 2. Republican (red) states would have more bias than the other states.

3. The Southern region of the United States would have more bias than the other regions.

The hypotheses were informed by other studies of racism based on ethnic groups, locations, and political beliefs. Using the Pew Research Center’s 2016 Racial Attitudes in America Survey dataset, a study conducted by Lee et al. compared the experiences of discrimination of minorities (Black, Hispanic, and Asian) and Whites. Results showed that 63.10% of minorities experience racial discrimination, compared to 29.61% of Whites. Broken down by ethnic group, 69.45% of Blacks, 56.59% of Asians, and 45.01% of Hispanics reported the prevalence of discrimination in their experiences [21].

Figure 4-4: Proportion of Google Searches With the "N-word" [10]

Pasek et al. conducted surveys in 2012 to measure implicit racism among Democrats and Republicans by assessing how respondents express affect toward different racial groups. While 55% of Democrats expressed implicit anti-black attitudes, the percentage was higher for Republicans, at 64% [23]. Chae et al. studied area racism by utilizing the proportion of Google search queries containing the "N-word" in 196 designated market areas from 2004 to 2007; as illustrated in Figure 4-4, the rural Northeast and South regions of the United States showed high concentrations [10]. Using this information, the regression analysis seeks to determine whether the hypotheses hold and whether these trends are reflected in rating and review behavior on the online review platform Yelp.

4.3 Experimental Settings

4.3.1 Train and Test Distribution

Considering how people’s perspectives might change over the years, and since the Yelp data ranges from 2004 to 2019, the dataset is first divided by year before using Surprise to predict ratings. Each year’s data is then split into random train and test subsets of 75% and 25% of that year’s data, respectively. The model is not trained on a random three-quarters of the whole dataset across all years because each year’s dataset varies vastly in size. For example, the sum of reviews from 2004 to 2007 is 12,093, while the numbers of reviews in 2008 and 2019 alone are 23,906 and 432,275, respectively. Thus, for the results of this research, only reviews dated from 2008 to 2019 are analyzed. Additionally, by separating the datasets by year for training and testing, one can ensure that the same percentage of each year’s data is selected for the test set. As a result, the train and test dataset sizes are 2,185,237 and 728,420 reviews, respectively.
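The per-year 75/25 split described above can be sketched as follows (illustrative only; the thesis does not specify its exact implementation, and the `year` key and seed are assumptions):

```python
import random

def split_by_year(reviews, test_frac=0.25, seed=0):
    """Split review dicts (each with a 'year' key) 75/25 within every year."""
    by_year = {}
    for review in reviews:
        by_year.setdefault(review["year"], []).append(review)
    rng = random.Random(seed)
    train, test = [], []
    for year in sorted(by_year):
        rows = list(by_year[year])
        rng.shuffle(rows)                       # randomize within the year
        cut = round(len(rows) * (1 - test_frac))
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test

# 40 reviews from 2008 and 60 from 2019 -> 30/45 in train, 10/15 in test.
data = [{"year": 2008, "id": i} for i in range(40)] + \
       [{"year": 2019, "id": i} for i in range(60)]
train, test = split_by_year(data)
print(len(train), len(test))  # 75 25
```

Stratifying by year this way guarantees every year contributes the same 25% share to the test set, regardless of how uneven the yearly volumes are.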

4.3.2 Metrics for Evaluation

Dependent Variable: Percent Change

The dependent variable used in the linear regression is percent change from the actual to predicted ratings, which is calculated as follows:

$$\text{percent change} = \frac{pr_{ij} - ar_{ij}}{ar_{ij}} \times 100$$

where $pr_{ij}$ denotes the predicted rating and $ar_{ij}$ the actual rating for user $i$ on business $j$. Predicted ratings calculated from the Surprise recommender system indicate what the ratings should have been based on previous user rating behaviors. Thus, on a spectrum of bias ranging from negative to positive perspectives towards a business, predicted ratings would lie in the middle. If the actual rating falls on the left side of the spectrum, this suggests that a user had some bias against the given business, while the right side represents that the user viewed it more favorably.
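As a direct transcription of the metric (a sketch only):

```python
def percent_change(predicted, actual):
    """Percent change from the actual rating to the predicted rating."""
    return (predicted - actual) / actual * 100

# A predicted rating of 3.5 against an actual 1-star review:
print(percent_change(3.5, 1))  # 250.0
```

Positive values mean the actual rating came in below the expectation (possible bias against the business); negative values mean it came in above.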

Independent Variables

In a multivariable regression model, no independent variable should depend too strongly on the other independent variables if the result is to be meaningful. When independent variables are mutually dependent, they make no additional contribution toward explaining the dependent variable [28]. Thus, to analyze the relationships between independent variables, a correlation matrix is calculated. A correlation matrix using all independent variables can be found as a heatmap in Figure A-1; a simplified version showing only the high correlations appears in Figure 4-5. The list of pairs with a correlation magnitude greater than 0.5 is as follows:

• city count: state color red & state color blue

• state color blue: state color red

• state color purple: state region midwest

• state region west: state region midwest & state region south

Because these independent variables are categorical, the regression coefficients must be interpreted with reference to the numerical encoding of these variables [28]. Also, based on the above correlation coefficients, state colors and state regions cannot both be in the model together. The following subsections therefore explain two different regression models using state color and state region separately; they also specify which subgroup of the independent variables is selected as the reference variable. This allows a comparison to be drawn between the effects of the other subgroups on the dependent variable, relative to the reference variable.

Figure 4-5: Simplified Correlation Matrix
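Flagging the pairs with |correlation| above 0.5, as in the list above, is straightforward with NumPy (an illustrative sketch with made-up data):

```python
import numpy as np

def high_corr_pairs(X, names, threshold=0.5):
    """Return (name_i, name_j, corr) for column pairs with |corr| > threshold."""
    C = np.corrcoef(X, rowvar=False)  # correlation matrix over columns
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(C[i, j]) > threshold:
                pairs.append((names[i], names[j], round(float(C[i, j]), 3)))
    return pairs

a = np.array([1.0, 2.0, 3.0, 4.0])
b = 2.0 * a                           # perfectly correlated with a
c = np.array([1.0, -1.0, 2.0, -2.0])  # only weakly correlated with a and b
pairs = high_corr_pairs(np.column_stack([a, b, c]), ["a", "b", "c"])
print(pairs)  # [('a', 'b', 1.0)]
```

Any pair returned by such a filter signals variables that should not enter the same regression together, which is exactly how the state color and state region variables are handled below.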

4.4 OLS Regression Results

In this section, five different linear regression specifications are presented, introducing more variables each time and discussing the meaning behind the numbers shown in the results. To summarize the variables used: the analysis starts with a simple bivariate regression relating percent change to year, to see how percent change has altered over the years (specification 1). Next, users’ rating habits, price range, and business ratings and review counts are added to the model (specification 2). The model is again extended by including cuisines as categorical values (specification 3). Then, state color and state region variables are included separately to identify how percent change is affected by geographical factors (specifications 4 and 5).

Coefficients Analysis

A positive coefficient on a binary variable in the results denotes that, compared to the reference variable, that variable showed more bias. The reason is that the numerator of the percent change is calculated by subtracting the actual rating from the predicted rating; thus, a positive result shows that the actual rating was lower than predicted, representing bias. Regression analysis not only describes how changes in an independent variable affect the value of the dependent variable, it also statistically controls for every variable in the model. This means that regression analysis isolates each variable and estimates its effect on the dependent variable. In a regression with multiple independent variables in particular, a coefficient denotes how much the dependent variable is expected to increase when that independent variable increases by one, while holding all other variables constant [1, 16].

4.4.1 Specification 1: Effects from Year

Figure 4-6: Regression Results: Effects from Year

The estimated coefficient for year is 1.404. This indicates that with each passing year, the percent change increases by 1.4% on average (note that percent change is already multiplied by 100 to be in percentage form). This interpretation does not control for other factors that can affect the percent change; it simply estimates the trend in percent change over the years. The results could mean that users tend to show more bias in their rating behaviors over the years, which might be caused by increasing expectations as users are exposed to more varied and better businesses.

33 4.4.2 Specification 2: Effects from Users and Businesses

Figure 4-7: Regression Results: Effects from Users and Businesses

This model includes more variables that can affect how people review businesses after their experiences. As shown in Figure 4-7, notice how the coefficient for year has increased by an extra 1% compared to the last specification. When the new variables are introduced and controlled for, those factors cause the percent change to increase more over the years. The user average excluded variable and business ratings have very large, negative coefficients. In words, this means that as the business rating increases by 1, the percent change decreases by 29%. The large magnitude might result from how percent change is calculated. For example, suppose a user who rates 4s on average visits a restaurant with 4 stars, has a bad experience during the visit, and decides to leave only 1 star in the review. Using previous rating history and applying the Surprise recommender system, assume the expected rating comes out to be 3.5. Then, because the actual rating is in the denominator when calculating percent change, the lower it is, the greater the percent change becomes. In this example, the percent change would be calculated as (3.5 − 1)/1 × 100 = 250%. However, it makes sense that the user average excluded and business rating variables have big impacts on users’ rating behaviors. As the user average rating increases, it shows how users who generally give high ratings to other restaurants will continue to leave good ratings on the review platform. As business ratings become higher, users place more trust in those businesses and prioritize them for their next experiences.

While the number of business reviews did not affect the percent change, the price range variable shows a small effect of 0.24%, indicating that the more expensive the business is, the more likely customers are to express negative rather than positive reviews. City count, which measures the total number of businesses in the corresponding city, also affects the percent change by only 0.08% on average. However, including this variable lets the model control for the business density of the area in which each business is located. As described in the Background section, a variance inflation factor (VIF) higher than 10 indicates multicollinearity [7]. Because none of the variables exceeds the threshold, as shown in the last column of Figure 4-7, and no variables have high correlations within the group, this model does not have a multicollinearity issue.
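The VIF check described here can be sketched in plain NumPy; this is a generic implementation of the definition (VIF_j = 1/(1 − R_j²), with R_j² from regressing column j on the remaining predictors), not the thesis code:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of design matrix X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns (plus an intercept).
    """
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other predictors
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)
```

Independent columns yield VIFs near 1, while a nearly collinear column pushes the VIFs of the involved variables far past the threshold of 10.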

4.4.3 Specification 3: Extending with Ethnic Groups

Figure 4-8: Regression Results: Extending with Ethnic Groups

This model extends specification 2 with categorical variables for the ethnic groups of businesses, so that, while user and business attributes and time (year) are all controlled, the model describes which cuisines affect the percent change. This model can then answer the following question raised in the Introduction:

Given two similar restaurants with the only notable difference being the type of cuisine served or the ethnic identification of the restaurant, would users rate these restaurants differently due to some other biases they are holding? Using Middle Eastern as the reference variable, the three cuisines customers appear to hold bias against are African, West Asian, and Western European, each associated with at least a 1% increase in the percent change. The highest coefficient, on the African variable, means that when a business is categorized as African, the percent change increases by 3.44%; that is, the actual rating falls that much below the expected rating. On the contrary, both the North American and Southeast Asian categories decrease the percent change by around 1% on average, indicating that users express more favor toward these cuisines. With respect to the first hypothesis, while it holds true that under-represented ethnic groups such as African and West Asian face bias against their categories, it is surprising that Western European is part of the group, as it is one of the cuisines that dominates the United States. On the other hand, the categories rated more favorably are North American and Southeast Asian businesses, which is understandable, as American, Thai, and Vietnamese cuisines are familiar to users in the United States. Usually, a p-value below the significance level (typically 0.05) indicates that a variable is statistically significant. However, a high p-value does not prove that the variable has no effect in the model; it only suggests there is not enough evidence that such an effect exists in the population. An effect might exist, but it is possible that the effect size is too small, the sample size is too small, or there is too much variability for the hypothesis test to detect it [16]. Because the model shows low VIF values for all independent variables and low correlations have already been checked, the model still describes meaningful results.
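The reference-variable setup described above corresponds to standard dummy coding with one level dropped; a minimal sketch (the category names follow the text, the helper function is ours):

```python
def dummy_encode(categories, reference):
    """One-hot encode a categorical variable, dropping the reference level.

    Each remaining dummy's coefficient is then read relative to the
    reference category (here, Middle Eastern)."""
    levels = sorted(set(categories) - {reference})
    rows = [[1 if c == lvl else 0 for lvl in levels] for c in categories]
    return rows, levels

rows, levels = dummy_encode(
    ["African", "Middle Eastern", "North American"], reference="Middle Eastern"
)
print(levels)  # ['African', 'North American']
print(rows)    # [[1, 0], [0, 0], [0, 1]]
```

The reference category is encoded as all zeros, which is why its effect is absorbed into the intercept rather than getting its own coefficient.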

4.4.4 Specification 4: Including States, by State Color

Based on the correlations between independent variables and the OLS estimator assumptions, it is evident that the city count variable cannot coexist with the state color variables in the linear regression model. Similarly, the blue and red state color variables should not be in the model together. Even with one of them as a reference variable, the state region Midwest and the state color purple remain so highly correlated that state color and state region had to be examined separately. Ensuring that all independent variables are independent of one another is extremely important under the Gauss-Markov assumptions [31].
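The pairwise-correlation screening implied here can be sketched as follows (a generic helper, not the thesis code; the 0.8 threshold is an illustrative choice):

```python
import numpy as np

def high_corr_pairs(X, names, threshold=0.8):
    """Flag pairs of independent variables whose absolute Pearson
    correlation exceeds the threshold (candidates for exclusion)."""
    C = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(C[i, j]) > threshold:
                pairs.append((names[i], names[j], C[i, j]))
    return pairs
```

Any variable pair flagged by such a check (e.g. city count with state colors, or Midwest with purple) would be kept in separate specifications.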

Figure 4-9: Regression Results: Including States, by State Color

Figure 4-9 shows the result of the OLS regression using state color variables. In this model, the reference variables are the Middle Eastern cuisine and the red state color; note that the city count variable is omitted. Compared to the previous specification, most variables have similar coefficients except East Asian, whose coefficient increased from -0.162% to 0.092%. This change is subtle, suggesting that the East Asian category, in fact, does not have a large effect on the percent change. This interpretation follows from how the African category also shows a 0.2% increase and the Eastern European and Mediterranean categories show about a 0.1% increase. These differences show that the restaurant categories suspected to receive negative bias show this bias to an even higher degree when the state colors are introduced and controlled.

Analyzing this regression model, which controls for state color, the second hypothesis holds true: the results indicate that businesses in both blue and purple states tend to be rated higher than those in red states. This result thus supports the survey finding that Republicans tend to express more implicit anti-black attitudes than Democrats [23], which is also reflected in how users rate businesses on the online review platform Yelp. Similar to the previous specification, all variables have VIF values between 1 and 2.5, well below the multicollinearity threshold of 10. No variables have high correlations within the group and, with these low VIF values, this model does not have a multicollinearity issue.

4.4.5 Specification 5: Including States, by State Region

Because the city count does not show high correlation with the state regions, the variable is added back into this regression model. The region west variable is used as the reference variable, as it showed high correlations with two of the other region variables. Therefore, the other regions are analyzed relative to the West as their reference category. Figure 4-10 shows the result of the OLS regression using state region variables.

Figure 4-10: Regression Results: Including States, by State Region

Because this model extends specification 2, when the results are compared it is notable that Latin had the largest change, its coefficient increasing from -0.3% to 0.04%. This is similar to the East Asian cuisine in specification 4, where the coefficient changed from negative to positive. The rest of the cuisine coefficients generally became more "extreme": the positive ones became more positive and the negative ones more negative. The newly added city count variable has very little effect on the percent change, similar to the number of business reviews. However, compared to specification 3, the coefficient decreased from 0.083% to 0.004%. This indicates that once the state regions are included as independent variables, counting the number of businesses at the city level becomes less significant. The other new variables, the state regions, show in the OLS results that the Midwest, Northeast, and South regions all generally give higher ratings compared to the West. This contradicts the third hypothesis that the South region would express more biased behaviors. Compared to Figure 4-4, where the Northeast and South regions had a higher number of Google searches with the "N-word," the regression results reveal that those regions tend to rate all businesses higher than expected. As in the previous specifications, this model also shows low VIF values, and no independent variable has a high correlation with another, indicating there is no multicollinearity.
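The four-region grouping can be represented as a simple lookup table (the state assignments below are an illustrative subset of the standard Census regions, not the thesis's exact list):

```python
# Illustrative subset of a state-to-region lookup; the thesis groups all
# states into four regions (West, Midwest, Northeast, South).
REGION_BY_STATE = {
    "CA": "West", "OR": "West", "AZ": "West",
    "OH": "Midwest", "IL": "Midwest", "WI": "Midwest",
    "MA": "Northeast", "NY": "Northeast", "PA": "Northeast",
    "TX": "South", "GA": "South", "NC": "South",
}

def region_of(state, lookup=REGION_BY_STATE):
    """Map a two-letter state code to its region label."""
    return lookup.get(state, "Unknown")
```

Each business's state is mapped to a region this way, and the regions are then dummy-encoded with West as the dropped reference level.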

4.5 Discussion

One factor suggesting that the regression models used in the analysis are trustworthy is that, despite the different variables included and excluded, most of the coefficients in each specification generally stayed the same. This ensures that when new variables are introduced, the previous variables remain fully controlled, making the analysis of each specification independent of the others. For example, when specification 5 is analyzed, it is already known from specification 3 how each cuisine affects the percent change, so when the state region variables are introduced, we can isolate the location effect and determine its relationship to the dependent variable, percent change.

From the OLS regression results, the hypotheses are tested and the results are as follows:

1. Certain businesses under minority groups, e.g. African, were found to have more bias against them, but some groups, such as Western European cuisines, surprisingly also had more negative reviews.

2. Republican (red) states had more bias than the other states (blue and purple).

3. The Northeastern and Southern regions were found to rate favorably more often than the other two regions.

From this analysis, bias similar to that documented in social psychology regarding ingroups and outgroups was found on the online review platform Yelp. Although not all results follow the well-known biased perspectives people usually think of, it is important to note that these types of bias can exist in a dataset.

Chapter 5

Conclusion

This paper analyzed a dataset from the online review platform Yelp to determine the prevalence of discriminatory behaviors in users' review and rating habits. The expected rating of a business is calculated using the SVDpp algorithm from the Surprise recommender system, which uses implicit feedback from users' past ratings to make predictions. With the percent change from the actual to the expected rating as the dependent variable, OLS linear regression models are established. Using different variables for each specification, the regression model results are analyzed in depth to find the true relationship from each variable that leads to bias in rating habits. Independent variable selection started by determining variables from the dataset that would affect the percent change. Those variables were year, the user's average ratings, the business's average ratings and review counts, and the price range of the business. These variables were then used as control variables when new independent variables were introduced. The results indicated that users tend to express more bias against businesses categorized as African, West Asian, and Western European. On the other hand, restaurants categorized as North American and Southeast Asian were more favored in the reviews. The models also demonstrated the effect on the percent change of the state in which a business is located. The United States was divided in two different ways: by state color, based on political views, and by four regions. The results indicated that Republican (red) states tend to show more bias against businesses than Democratic (blue) states.
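The expected-rating step can be illustrated with a bare-bones matrix factorization trained by SGD. This is the core idea behind SVD-style recommenders such as Surprise's SVDpp (which additionally models user/item biases and implicit feedback); it is a sketch of the technique, not the thesis's actual pipeline:

```python
import numpy as np

def factorize(R, observed, k=2, steps=500, lr=0.02, reg=0.02, seed=0):
    """Fit user/item latent factors to observed ratings by SGD.

    R: (users x items) rating matrix; observed: boolean mask of known
    entries. The predicted rating for (u, i) is the dot product P[u] @ Q[i]."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = 0.1 * rng.normal(size=(n_users, k))
    Q = 0.1 * rng.normal(size=(n_items, k))
    us, its = np.nonzero(observed)
    for _ in range(steps):
        for u, i in zip(us, its):
            err = R[u, i] - P[u] @ Q[i]       # residual on one observed rating
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q

# Tiny example: 3 users x 3 businesses, with zeros marking unknown ratings
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0]])
observed = R > 0
P, Q = factorize(R, observed)
pred = P @ Q.T  # "expected ratings", including the unobserved cells
```

The fitted `pred` matrix plays the role of the expected ratings: comparing it against the actual ratings left by users yields the percent-change measure analyzed in this thesis.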

It also showed that the West generally rates with more biased behavior than the rest of the regions. Not only does this analysis examine user rating behavior, showing how users' rating habits or how highly rated a business is can affect their future choices, but it also delves into how cultural and political perspectives can affect users' preferences. This information will help businesses be aware of these biases so that they can account for them when representing data. For future work, other researchers can be aware that a non-experimental factor, such as social psychology, can exist in a dataset, and can take that into consideration when using the dataset with machine learning algorithms. Many studies have shown that imbalanced or biased datasets produce skewed model outputs. Thus, it is crucial that the input data is fully examined and that possible obstacles or biases are removed as needed. This research can be extended by incorporating new information, such as the demographics of the users or the business owners. With this new information, similar studies would be able to identify more detailed, direct relationships between customers' demographic biases and their ratings of certain category groups. Further studies may also find it fruitful to differentiate chain restaurants from small businesses, to see whether user ratings are harsher on one type or the other.

Appendix A

Figures

Figure A-1: A Heatmap of Independent Variables Correlation Matrix

Bibliography

[1] Interpreting regression output.

[2] List of regions of the United States.

[3] Red states and blue states.

[4] U.S. Census Bureau QuickFacts: United States.

[5] E. C. Alexopoulos. Introduction to multivariate regression analysis, Dec 2010.

[6] Asim Ansari, Yang Li, and Jonathan Z. Zhang. Probabilistic topic model for hybrid recommender systems: A stochastic variational bayesian approach. Marketing Science, 37(6):987–1008, 2018.

[7] David A Belsley, Edwin Kuh, and Roy E Welsch. Regression diagnostics: Identifying influential data and sources of collinearity, volume 571. John Wiley & Sons, 2005.

[8] Marilynn B Brewer. The social psychology of intergroup relations: Social categorization, ingroup bias, and outgroup prejudice. 2007.

[9] Joy Buolamwini. Gender shades : intersectional phenotypic and demographic evaluation of face datasets and gender classifiers. PhD thesis, 01 2017.

[10] David H. Chae, Sean Clouston, Mark L. Hatzenbuehler, Michael R. Kramer, Hannah L. F. Cooper, Sacoby M. Wilson, Seth I. Stephens-Davidowitz, Robert S. Gold, and Bruce G. Link. Association between an internet-based measure of area racism and black mortality. PLOS ONE, 10(4):1–12, 04 2015.

[11] Jiawei Chen, Hande Dong, Xiang Wang, Fuli Feng, Meng Wang, and Xiangnan He. Bias and debias in recommender system: A survey and future directions, 2020.

[12] Dan Cosley, Shyong K. Lam, Istvan Albert, Joseph A. Konstan, and John Riedl. Is seeing believing? how recommender system interfaces affect users’ opinions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’03, page 585–592, New York, NY, USA, 2003. Association for Computing Machinery.

[13] Stewart I. Donaldson and Elisa J. Grant-Vallone. Understanding self-report bias in organizational behavior research, Dec 2002.

[14] Benjamin G. Edelman and Michael Luca. Digital discrimination: The case of airbnb.com. SSRN Electronic Journal, 2014.

[15] Batya Friedman and Helen Nissenbaum. Bias in computer systems. ACM Trans. Inf. Syst., 14(3):330–347, July 1996.

[16] Jim Frost. Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models. Statistics By Jim Publishing, 2019.

[17] Longke Hu, Aixin Sun, and Yong Liu. Your neighbors affect your ratings: On geographical neighborhood influence to rating prediction. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’14, page 345–354, New York, NY, USA, 2014. Association for Computing Machinery.

[18] Nicolas Hug. Surprise: A python library for recommender systems. Journal of Open Source Software, 5(52):2174, 2020.

[19] Rajeev Kumar, B. K. Verma, and Shyam Sunder Rastogi. Social popularity based svd++ recommender system. International Journal of Computer Applications, 87(14), 2014.

[20] Anja Lambrecht and Catherine Tucker. Algorithmic bias? an empirical study of apparent gender-based discrimination in the display of stem career ads. Management Science, 65(7):2966–2981, 2019.

[21] Randy T. Lee, Amanda D. Perez, C. Malik Boykin, and Rodolfo Mendoza-Denton. On the prevalence of racial discrimination in the united states. PLOS ONE, 14(1):1–16, 01 2019.

[22] Yan Leng, Rodrigo Ruiz, Xiaowen Dong, and Alex ‘Sandy’ Pentland. Interpretable recommender system with heterogeneous information: A geometric deep learning perspective. SSRN Electronic Journal, 2020.

[23] Josh Pasek, Jon A. Krosnick, and Trevor Tompson. The impact of anti-black racism on approval of Barack Obama’s job performance and on voting in the 2012 presidential election, Oct 2012.

[24] Cecil Robinson and Randall E Schumacker. Interaction effects: centering, variance inflation factor, and interpretation issues. Multiple linear regression viewpoints, 35(1):6–11, 2009.

[25] Roland Rust and Ming-Hui Huang. The service revolution and the transformation of marketing science. Marketing Science, 33:206–221, 03 2014.

[26] Ananth G. S. and Raghuveer K. A movie and book recommender system using surprise recommendation. SSRN Electronic Journal, 2020.

[27] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Incremental singular value decomposition algorithms for highly scalable recommender systems. In Fifth international conference on computer and information science, volume 1, pages 27–8. Citeseer, 2002.

[28] Astrid Schneider, Gerhard Hommel, and Maria Blettner. Linear regression analysis: Part 14 of a series on evaluation of scientific publications, Nov 2010.

[29] Latanya Sweeney. Discrimination in online ad delivery, 2013.

[30] Pei-Ju Lucy Ting, Szu-Ling Chen, Hsiang Chen, and Wen-Chang Fang. Using big data and text analytics to understand how customer experiences posted on yelp.com impact the hospitality industry. Contemporary Management Research, 13(2):107–130, 2017.

[31] Marno Verbeek. Using linear regression to establish empirical relationships. IZA World of Labor, 01 2017.
