CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

Eat-Smart: A Restaurant Recommendation Web Application

Using Machine Learning and Yelp Dataset

A graduate thesis project submitted in partial fulfillment of the requirements

For the degree of Master of Science in Computer Science

By

Frank Fan Cao

May 2018

The thesis of Frank Fan Cao is approved:

______

Dr. Adam Kaplan Date

______

Dr. Robert McIlhenny Date

______

Dr. Jeff Wiegley, Chair Date

California State University, Northridge


Acknowledgements

I would like to thank my committee chair, Dr. Jeff Wiegley, for providing support, motivation, and expert knowledge throughout the entire thesis process.

I would like to express my gratitude to Dr. Adam Kaplan and Dr. Robert

McIlhenny for their help and guidance throughout the program.

Lastly, I would like to thank the professors of all my coursework. I am grateful for the knowledge you have shared with me.


Table of Contents

Signature Page ...... ii

Acknowledgements ...... iii

List of Tables ...... v

List of Figures ...... vi

Abstract ...... vii

1. Introduction ...... 1

2. Data Collection and Preprocessing ...... 4

3. Recommendation Model ...... 11

4. Model Testing and Optimization ...... 15

5. System Design ...... 18

6. Recommendation Web Service Implementation ...... 22

7. Web Application Implementation Using Laravel ...... 27

8. Web Application Features...... 34

9. Conclusion and Future Work ...... 40

References ...... 41


List of Tables

Table 1 Contents of Yelp Data Files ...... 4
Table 2 Business Data Distribution ...... 6
Table 3 Collaborative Filtering Algorithm Analysis ...... 13
Table 4 ALS Parameters ...... 15
Table 5 Grid Search Results for Rank and Maximum Iterations ...... 16
Table 6 RMSE Results from Grid Search in Regularization Parameters ...... 17
Table 7 List of Views in Eat-Smart ...... 30
Table 8 List of Controller Methods ...... 32


List of Figures

Figure 1 Machine Learning Workflow ...... 2
Figure 2 Yelp Dataset Overview ...... 4
Figure 3 MySQL Database Schema ...... 5
Figure 4 Python Script for Filtering Business Data ...... 8
Figure 5 Python Script for Filtering Review Data ...... 9
Figure 6 Comparison of Storage File Format ...... 10
Figure 7 Collaborative Filtering Example ...... 11
Figure 8 Python Script for Spark Model Creation ...... 14
Figure 9 System Architecture Diagram ...... 19
Figure 10 Web Service Workflow ...... 22
Figure 11 App.py Content ...... 25
Figure 12 Laravel MVC ...... 27
Figure 13 Eloquent Model Definition ...... 28
Figure 14 Master Blade Template ...... 29
Figure 15 Child Template ...... 30
Figure 16 Laravel Controller Example ...... 31
Figure 17 Laravel Routing Example ...... 32
Figure 18 Web Application Routes ...... 33
Figure 19 Home Page for Guests ...... 35
Figure 20 Home Page for Authenticated Users ...... 36
Figure 21 User Reviews Page ...... 36
Figure 22 Business Detail Page ...... 37
Figure 23 Add Review Page ...... 38
Figure 24 Confirmation Page for Successful Review Submission ...... 38
Figure 25 Top Recommendation Page ...... 39


Abstract

Eat-Smart: A Restaurant Recommendation Web Application

using Machine Learning and Yelp Data

By

Frank Fan Cao

Master of Science in Computer Science

People are increasingly relying on reviews from other people to decide which items to buy, which movies to watch, which books to read and where to eat.

Traditionally, these questions were answered through peer recommendation (word of mouth, blog posts, and reviews) or expert advice (columnists, librarians). Crowd-sourced business review platforms are now ubiquitous. Apps such as Yelp, TripAdvisor, and Google provide a wide range of information about local businesses, especially restaurants. The seemingly unlimited options for food and services contribute to information overload, as users often struggle to make informed choices that cater to their individual wants and needs. One solution to this problem is a recommender system that provides accurate and personalized recommendations, greatly reducing the effort and time needed to discover new restaurants.


The core of the application is a recommendation engine with an appropriate prediction model. The prediction models are built using data from the Yelp Challenge Dataset, which contains detailed information on 1.1 million businesses, 4.1 million reviews, and 947,000 tips by 1 million users. Various machine learning algorithms, such as K-Means, SVM, and collaborative filtering, are investigated and benchmarked. Ultimately, collaborative filtering using Alternating Least Squares is chosen because it offers a good balance of accuracy and performance.

The second part of the application is a web application built around the recommendation engine using the newest web technology and framework available. The application provides personalized and relevant recommendations to users with high prediction accuracy.


1. Introduction

1.1 Background

A recommender system filters through large observation and information spaces to provide predictions for regions of the information space where users do not yet have any observations. In simpler words, a recommender system predicts ratings for items that users have not rated yet. Recommender systems have long existed in various fields and disciplines. The online retailer Amazon uses a recommender system to suggest new products after users purchase or search for certain items. Social networking applications such as LinkedIn use recommender systems to suggest new connections. Streaming services like Netflix use similar systems to recommend movies based on a user's previous choices and search history. Following this trend, this project focuses on creating a recommender system that accurately predicts potential food choices for Yelp users based on their previous experiences.

1.2 Development Overview

The project development adheres to the typical machine learning workflow. The workflow consists of four stages: data collection, data preparation, model training, and model integration, as illustrated in Figure 1 below.


Figure 1 Machine Learning Workflow

Data collection is a trivial task in the scope of this project. Because of its popularity and many years of operation, Yelp has acquired a vast amount of data regarding user preferences and behaviors. The wealth of information in these data could lead to creative and insightful observations. Part of the data is available to developers and data scientists, courtesy of Yelp. Raw data files containing semi-structured data are downloaded from Yelp's developer site to local servers awaiting analysis and processing.

After the data is obtained from Yelp, it is cleaned, matched, and analyzed. First, the data is stored in a relational database for ease of access. Upon initial analysis of the datasets using SQL queries and other data analytics tools, subsets of the datasets are created. Data fields and headers that are irrelevant or outside the scope of this project are removed. The data is then transformed and restructured to match the input requirements of various machine learning algorithms.

Once the datasets are properly cleaned, they are used to construct models using a variety of machine learning algorithms such as K-Means, collaborative filtering, and content-based filtering. After the models are fitted properly, they are tested for accuracy. Depending on the results, improvements can be made by fine-tuning parameters or revising the algorithms. The best models are serialized and stored on disk for later access.

Finally, the models with the best results are integrated into a web service so that external applications can use them. A simple and intuitive web application that recommends new restaurants to users based on food preferences and previous dining experiences is built around the web service. The application collects the user's food preferences, geographic location, and other user data and feeds them into the recommendation engine web service. After the application receives a response from the recommendation engine, the results are displayed in the front-end UI in a visually pleasing format.

In the following sections, the machine learning workflow, the algorithms used, the testing methodology, and the system architecture are explained in detail.


2. Data Collection and Preprocessing

2.1 Data Source

The data comes from the Yelp Dataset Challenge [1]. The challenge aims to gain new and innovative insights into user behaviors, user preferences, and food trends using the provided data. Figure 2 below gives an overview of the data in the dataset.

Figure 2 Yelp Dataset Overview

The Yelp dataset consists of five JSON files. These data files contain information about businesses, users, check-ins, tips, and reviews. Table 1 below shows the contents of each JSON file.

Table 1 Contents of Yelp Data Files

Business.json: Address, GPS coordinates, star rating, hours, categories (e.g. "Asian", "Chinese"), attributes (e.g. "take out", "pets", "WIFI")

Review.json: Rating, review text, date, votes received from other users (useful, funny, or cool)

User.json: Name, review count, friends, compliments received

Check-in.json: Number of check-ins for every hour and day of the week

Tip.json: Tip text, compliments received (e.g. "likes")


2.2 Data Storage

Before selecting a suitable algorithm and building models, the data needs to be prepared and transformed. The raw data files contain millions of rows of data. To get a better understanding of the data, a relational database is set up.

Figure 3 MySQL Database Schema


To simplify data migration, the tables follow the same structure as the original data files. Note that the tables are not normalized because the database is purely for exploring the dataset, and performance is not a major concern.

2.3 Initial Data Analysis

One of the most important factors in machine learning is data density and distribution. The data in the dataset spans 12 metropolitan areas. However, the Yelp dataset has over 99% sparsity [2] because some areas contain very sparse data. Many machine learning algorithms tend to perform poorly when the input data has high sparsity. Therefore, better results can be achieved by reducing the sparsity of the data.

Also, during the later stages of development, performance became a major problem due to the large size of the dataset. Therefore, a decision was made to use only a subset of the original dataset.

To determine how to filter the dataset, initial analysis on data sparsity is performed using SQL database queries. Table 2 below lists the top ten cities ranked by number of businesses and average reviews per business.

Table 2 Business Data Distribution

City          Avg # Reviews per Business   # Businesses
Las Vegas     56.59                        22892
Toronto       24.05                        14540
              31.56                        14468
Scottsdale    35.44                        6917
Charlotte     26.54                        6912
Pittsburgh    27.13                        5275
Montréal      20.17                        4785
Mesa          20.76                        4714
Henderson     33.46                        3788

In general, machine learning algorithms produce more accurate models when the sparsity of the data is lower (higher data density). Also, the dataset should be sufficiently large. From Table 2, it is obvious that Las Vegas not only has the highest average number of reviews per business, it also contains roughly one tenth of the businesses in the original dataset. In addition, its restaurants and users are closely grouped together geographically. Thus, it is a good place to investigate.

The new dataset is much easier to work with. The average query time for the recommendation engine is reduced by a factor of 10. Initially, there were concerns about recommendation accuracy as a result of the reduced dataset size, but tests showed a similar margin of error when building models with both datasets.

2.4 Business Data Transformation

All data are processed and transformed using Apache Spark [6]. Apache Spark is a fast and powerful open-source engine for large-scale data processing. Spark uses dataframes, a type of tabular data object, for transient data storage. Since Spark tasks operate entirely in memory, data is loaded into dataframes and cached in memory.

To filter the business data, all business data are first converted to a dataframe named "businessDF". Spark offers a variety of APIs on dataframes that cover most common database operations, such as "filter" and "format". The business data is filtered by geographic location. A search grid of approximately 10 x 8 miles centered around Las Vegas, NV is used to filter for businesses in Las Vegas. To achieve this, we first obtain the GPS coordinates of the Las Vegas city center. The longitude and latitude offsets are then set to 0.075 degrees, which roughly covers the search grid. Chained filter functions select the data by latitude and longitude.

# Filter for Las Vegas businesses based on location
vegas_lat = 36.114647
vegas_lon = -115.172813
lat_range = 0.075
lon_range = 0.075
businessDF_vegas = businessDF.filter('latitude between {} and {}' \
        .format(vegas_lat - lat_range, vegas_lat + lat_range)) \
    .filter('longitude between {} and {}' \
        .format(vegas_lon - lon_range, vegas_lon + lon_range)).cache()

Figure 4 Python Script for Filtering Business Data

After running the above code, the resulting dataframe "businessDF_vegas" contains all businesses in Las Vegas. Since the application is primarily for recommending restaurants and food, the data is further filtered using the category data contained in the "businessDF_vegas" dataframe. All entries containing Food or Restaurant in their category field are saved to a new dataframe named "businessDF_vegas_food". This further reduces the size of the input to the machine learning algorithms.
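As a quick sanity check on the offsets, the 0.075 degree latitude and longitude ranges can be converted to miles with the common rough approximation of 69 miles per degree of latitude (the constant and rounding below are illustrative, not part of the project's code); the result agrees with the roughly 10 x 8 mile grid described above:

```python
import math

# Approximate conversion of the +/-0.075 degree offsets to miles around
# Las Vegas (latitude 36.1146); 69 miles per degree is an approximation.
MILES_PER_DEG_LAT = 69.0
vegas_lat, offset = 36.114647, 0.075

height = 2 * offset * MILES_PER_DEG_LAT                       # north-south span
width = 2 * offset * MILES_PER_DEG_LAT * math.cos(math.radians(vegas_lat))

print(round(height, 1), round(width, 1))  # roughly 10.3 by 8.4 miles
```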

2.5 Review Data Transformation

To further clean the data, we decide to include only the reviews for Las Vegas businesses, since all other reviews are irrelevant to a prediction model for Las Vegas users.

The review data is loaded from the JSON file into a dataframe called "reviewDF". A right join is performed from "reviewDF" to "businessDF_vegas" on the common field "business_id". The resulting dataframe "reviewDF_vegas" contains only reviews for Las Vegas. Next, a user-business association matrix is obtained by selecting the user id, business id, and rating from the "reviewDF_vegas" dataframe. This data will be the input to the machine learning algorithms used to build the recommendation model. All of the above operations are performed using the Python script in Figure 5.

# Perform a right join between reviewDF and businessDF_vegas
reviewDF_vegas = reviewDF.select('business_id', 'user_id',
        col('stars').alias('stars_long')) \
    .join(businessDF_vegas, 'business_id', 'right').cache()

# Select the relevant columns for the ALS algorithm
reviewDF_vegas_clean = reviewDF_vegas.select('user_id', 'business_id', 'stars_long')

Figure 5 Python Script for Filtering Review Data

2.6 Data Storage for Processed Data

Since data operations such as joins are very expensive in terms of computation, data must be serialized and saved to disk to avoid recomputation as much as possible. Neither the original JSON format nor the MySQL database is suitable for high-volume data retrieval. After some research, we decided to use Apache Parquet [7] as the data format of choice. Apache Parquet is a columnar storage format for any project in the Hadoop ecosystem [8]. Parquet is built to support very efficient compression and encoding schemes, with enhanced performance for handling complex data in large quantities. Parquet has much faster read and write performance than its JSON counterpart. Additionally, Parquet files are much smaller than JSON files. In our case, read and write performance increased by 300% and the storage size was reduced to one third of the original JSON format. All datasets are stored in an AWS S3 bucket for convenient data retrieval and updates. Figure 6 below illustrates the performance of various data storage formats.

Figure 6 Comparison of Storage File Format


3. Recommendation Model

3.1 Collaborative Filtering

Collaborative filtering is widely adopted in recommender systems. Netflix and Amazon both use collaborative filtering to create customized experiences for their users. This technique predicts the missing entries of a user-item association matrix. In general, collaborative filtering relies on the following intuitions:

▪ Personal tastes correlate.

▪ Users who agreed with each other before are more likely to agree in the future.

▪ To predict a user's rating, use the ratings of users with similar tastes.

Figure 7 shows how the algorithm predicts missing entries in the User-Item matrix.

Figure 7 Collaborative Filtering Example


The underlying assumption is that if A agrees with B on some issues, A is likely to share B's opinions on other issues. To predict a user's rating for an unrated item, the collaborative filtering algorithm looks at the ratings from other users with similar rating histories (users with green rows). Following this logic, the field marked with "?" should be updated to a thumbs-down.
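To make this intuition concrete, here is a small illustrative sketch (with made-up users and ratings, not Yelp data): similarity is scored by agreement on co-rated items, and a missing rating is predicted from positively similar peers.

```python
# Hypothetical ratings; user 'B' has not rated 'tacos' yet.
ratings = {
    'A': {'pizza': 5, 'sushi': 1, 'tacos': 5},
    'B': {'pizza': 5, 'sushi': 1},
    'C': {'pizza': 1, 'sushi': 5, 'tacos': 1},
}

def similarity(u, v):
    """+1 for each co-rated item the users agree on, -1 for each disagreement."""
    common = set(ratings[u]) & set(ratings[v])
    return sum(1 if ratings[u][i] == ratings[v][i] else -1 for i in common)

def predict(user, item):
    """Average the item's ratings from positively similar users, weighted by similarity."""
    peers = [(similarity(user, other), ratings[other][item])
             for other in ratings
             if other != user and item in ratings[other]]
    peers = [(s, r) for s, r in peers if s > 0]   # keep only similar users
    total = sum(s for s, _ in peers)
    return sum(s * r for s, r in peers) / total if total else None

print(predict('B', 'tacos'))  # A agrees with B, C disagrees -> predicts 5.0
```

Since A agrees with B on every co-rated item while C disagrees, only A's rating of 'tacos' counts, and B's predicted rating is 5.0.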

Mathematically, the underlying mechanism of collaborative filtering is matrix factorization combined with cost minimization. The user-item matrix R is defined as follows, where r is the actual rating value:

$$R_{ui} = \begin{cases} r, & \text{if user } u \text{ rated item } i \\ 0, & \text{if user } u \text{ did not rate item } i \end{cases}$$

The user-item matrix R is approximated by the product U x M, where U is a matrix of latent factors describing each user and M is a matrix of latent factors describing each item. Since we know neither matrix, alternating least squares (ALS) with regularization is used to approximate R. ALS first estimates the item factor matrix using the user factor matrix, and vice versa. After enough iterations, a convergence point is reached where neither matrix changes much. To ensure U x M approximates R, the following cost function is minimized:

$$f(U, M) = \sum_{i,j} W_{i,j}\,(r_{i,j} - v_i \times m_j)^2 + \lambda \left( \sum_i n_{v_i} \lVert v_i \rVert^2 + \sum_j n_{m_j} \lVert m_j \rVert^2 \right)$$
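The alternating scheme can be sketched in a few lines of NumPy on a toy matrix (the data, rank, and regularization below are illustrative, not the project's, and the simple regularizer omits the per-user/per-item counts from the formula above): each pass fixes one factor matrix and solves a regularized least-squares problem for the other.

```python
import numpy as np

np.random.seed(0)
# Toy user-item rating matrix; zeros mark unrated entries (W_ij = 0).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
W = (R > 0).astype(float)            # observation weights
k, lam = 2, 0.1                      # latent factors and regularization

U = np.random.rand(R.shape[0], k)    # user factors (v_i rows)
M = np.random.rand(R.shape[1], k)    # item factors (m_j rows)

for _ in range(20):
    for i in range(R.shape[0]):      # fix M, solve for user i's factors
        Wi = np.diag(W[i])
        U[i] = np.linalg.solve(M.T @ Wi @ M + lam * np.eye(k), M.T @ Wi @ R[i])
    for j in range(R.shape[1]):      # fix U, solve for item j's factors
        Wj = np.diag(W[:, j])
        M[j] = np.linalg.solve(U.T @ Wj @ U + lam * np.eye(k), U.T @ Wj @ R[:, j])

pred = U @ M.T                       # predictions, including unrated cells
rmse = np.sqrt((W * (R - pred) ** 2).sum() / W.sum())
```

After a handful of alternations the observed entries are reproduced closely, and the zero cells of R now hold predicted ratings.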

3.2 Collaborative Filtering Analysis

While collaborative filtering generally produces good results, it has both advantages and disadvantages. Table 3 lists some of the key advantages and disadvantages of the collaborative filtering algorithm in general.


Table 3 Collaborative Filtering Algorithm Analysis

Advantages:
▪ No need to know about item content.
▪ Takes account of changes in user interests over time.
▪ Can capture subtle patterns through the use of latent factors.

Disadvantages:
▪ Sparsity of user preferences makes it hard to find similar users.
▪ Difficult to scale, because large matrices require more computation for predictions.
▪ Some users might not consistently agree with other users; users might be irrational at times.

One key advantage of CF is its ability to capture subtle patterns in the dataset. The latent factors can sometimes connect users that seem unrelated to content-based machine learning algorithms. Content-based machine learning focuses on the items themselves: each item is categorized, tagged, and labeled in some way in order to find similar items. However, this leads to a cold-start problem. If some business data is incomplete or missing, the algorithm will not produce meaningful results. Collaborative filtering avoids this problem because, as long as some users have reviewed a business, the algorithm still provides reasonable predictions for that business even if the business has incomplete information.

3.3 Building the Model

For this project, we use model-based collaborative filtering from Spark MLlib [9]. MLlib is a machine learning framework specific to Apache Spark. Once the Spark engine is initiated, an instance of the ALS algorithm is created with the associated parameters. The prepared dataset is fed into the ALS instance. The result is a prediction model object that can be used to estimate and transform the user-item matrix.


Figure 8 shows how an ALS algorithm instance is created in Spark. The constructor takes in algorithm-related parameters and returns an ALS algorithm instance object. The ALS instance is then fitted with the preprocessed data to obtain a model. Lastly, the model data is persisted on disk for later use.

als = ALS(rank=8, maxIter=20, regParam=0.25,
          userCol="user_id_int", itemCol="business_id_int",
          ratingCol="stars_long", coldStartStrategy="drop")
model = als.fit(training_set)
model_save = os.path.join('dataset', 'als_model_vegas.parquet')
model.save(model_save)  # persist the trained model

Figure 8 Python Script for Spark Model Creation


4. Model Testing and Optimization

4.1 Testing Methodology

To test the accuracy of the prediction model, the root-mean-square error (RMSE) is used. RMSE represents the sample standard deviation of the differences between predicted values and observed values, and is often used to quantify the magnitude of errors.

To test the effectiveness of the ALS algorithm, the dataset is randomly divided into two subsets, with the first consisting of 80% of the data and the second consisting of the remaining 20%. The former is used to build the prediction model, while the latter is used for testing. Using the model built from the first set, a prediction is generated for each entry in the second. By comparing the predicted values and the actual values, the RMSE is obtained.
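The split-and-score procedure is straightforward to express; the sketch below uses synthetic ratings in place of the Yelp data, so the triples and sizes are illustrative only:

```python
import math
import random

random.seed(42)
# Synthetic (user_id, business_id, rating) triples standing in for reviews.
ratings = [(u, b, random.randint(1, 5)) for u in range(50) for b in range(20)]

random.shuffle(ratings)
split = int(0.8 * len(ratings))               # 80/20 train/test split
training_set, test_set = ratings[:split], ratings[split:]

def rmse(predicted, actual):
    """Root-mean-square error between equal-length rating sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))
```

In the real pipeline the model fitted on `training_set` would produce the `predicted` sequence, which is then compared against the held-out ratings.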

4.2 Model Training and Optimization

The ALS algorithm implementation in Spark MLlib has the parameters shown in Table 4 below.

Table 4 ALS Parameters

numBlocks: The number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
rank: The number of latent factors in the model (defaults to 10).
maxIter: The maximum number of iterations to run (defaults to 10).
regParam: The regularization parameter in ALS (defaults to 0.1).
implicitPrefs: Whether to use the explicit-feedback ALS variant or one adapted for implicit feedback data (defaults to false, meaning explicit feedback).

A model built using the default ALS parameters yields an average RMSE of 1.7331 over ten iterations, which is far too high for accurate prediction of user preferences. A reasonable margin of error should be much closer to 1.0. According to the official Spark documentation [4], the rank, maximum iterations, and regularization parameter have the most influence on the margin of error. Ranks of [8, 10, 12] and maximum iterations of [12, 14, 16] are used to run the first series of grid searches. The results suggest that these two parameters have a somewhat limited effect on RMSE: there is no noticeable reduction beyond a rank of 12 and maximum iterations of 16, while the cost of computation increases significantly.

Table 5 Grid Search Results for Rank and Maximum Iterations

Iteration   Rank   Max Iterations   Regularization Parameter   RMSE
1           8      10               0.1                        1.7331
2           8      12               0.1                        1.7121
3           8      14               0.1                        1.7056
4           8      16               0.1                        1.7024
5           10     12               0.1                        1.7090
6           10     14               0.1                        1.7056
7           10     16               0.1                        1.7023
8           12     12               0.1                        1.6991
9           12     14               0.1                        1.6983
10          12     16               0.1                        1.6974

Next, regularization parameters of [0.1, 0.15, 0.20, 0.25, 0.30, 0.35] are tested with a rank of 12 and maximum iterations of 16. The best result is obtained with the regularization parameter set to 0.25, with no further improvement beyond that value.

Table 6 RMSE Results from Grid Search in Regularization Parameters

Iteration   Rank   Max Iterations   Regularization Parameter   RMSE
1           12     16               0.1                        1.7231
2           12     16               0.15                       1.4531
3           12     16               0.20                       1.2345
4           12     16               0.25                       1.1312
5           12     16               0.30                       1.1312
6           12     16               0.35                       1.1312
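The two grid searches above amount to an exhaustive sweep over parameter combinations, keeping the lowest RMSE. A generic sketch of that loop is shown below; the `evaluate` function is a stand-in dummy surface for illustration, since the real pipeline would train an ALS model and score it on the held-out 20%:

```python
from itertools import product

def evaluate(rank, max_iter, reg_param):
    """Stand-in for fitting an ALS model and measuring test-set RMSE.
    A dummy surface whose minimum sits at rank=12, maxIter=16, regParam=0.25."""
    return (abs(reg_param - 0.25)
            + 0.001 * (16 - max_iter)
            + 0.0001 * (12 - rank))

# Sweep every combination and keep the one with the lowest score.
grid = product([8, 10, 12], [12, 14, 16], [0.1, 0.15, 0.20, 0.25])
best_params = min(grid, key=lambda p: evaluate(*p))
print(best_params)  # -> (12, 16, 0.25)
```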


5. System Design

5.1 System Design Overview

The system consists of two components: the main web server and the recommendation web service. Each component is hosted on a different server, and the two communicate over HTTP. There are many benefits to decoupling server roles across multiple servers. First, the application is easier to evolve and maintain, because changes in one component do not affect the other as long as the API remains intact. Each service can be maintained independently, and each component only needs to understand what an API does, not how it is implemented.

Secondly, the application is easier to scale with this design. New web servers can be added to the system quickly. There are two ways of scaling an application: horizontally and vertically. Vertical scaling means the capacity of an individual server is increased. However, vertical scaling is sometimes expensive and inconvenient. On the other hand, horizontal scaling is done by adding more server instances to the system. Horizontal scaling requires no data migration and can often be implemented with little downtime.

Other considerations for the design are performance related. Apache Spark, the backbone of the recommendation web service, processes data entirely in memory. During data processing, memory usage sometimes exceeds 10 GB. In a typical testing environment, where 16 GB of memory is the norm, there is not enough memory to host the web server and database instances without experiencing slowdowns.

In addition, decoupling solves the problem of a single point of failure. If one component goes down, the other components still maintain their functions. For instance, when the recommendation web service is down, the web server can still handle user requests to search and review nearby businesses. Decoupling also makes debugging and troubleshooting easier, because problems can be isolated quickly.

Figure 9 System Architecture Diagram

Figure 9 above is a high-level diagram describing the system architecture. The typical workflow is as follows. The client accesses the web front-end using an internet browser. After the user provides information on the front-end, requests are sent to the back-end API. The back-end parses and sanitizes the request data, then requests recommendations from the recommendation web service using that data. In the recommendation web service, each request triggers a series of function calls within the recommendation engine. Once data processing is complete, a response containing the recommendation result is returned to the web server back-end, which consumes the response and renders the result on the web page.

5.2 Recommendation Web Service Development

The most important consideration for the recommendation web service is that it be lightweight, because it serves a very specific role and requires high performance. Most web service development stacks include an impressive array of features, but most of those features go unused while incurring significant overhead. The ideal web stack here is a barebones one with only the essential functions required by the specification.

Python Flask [10] is used to implement the recommendation web service because of its simplicity and modular design. Flask is a microframework written in Python that aims to keep the core simple but extensible. Little boilerplate code is required to get an application up and running. Developers can implement various functionalities via extensions, and changes are easy and clean. There are numerous extensions providing functionality such as database integration, form validation, and authentication. The simple and extensible nature of Flask is ideal for this project.

In addition, CherryPy [11] is used to provide the HTTP layer on top of the Flask application. CherryPy is a reliable, HTTP/1.1-compliant, WSGI thread-pooled web server written in Python. It makes it easy to run multiple HTTP servers on multiple ports at once, and it provides a powerful configuration system and flexible built-in tools that support a high level of customization. Compared to Flask's built-in development server, CherryPy is much more suitable for production because it offers features such as reliable server logging and debugging tools.

5.3 Web Application Development

The web application is developed using the MVC design pattern. The MVC pattern consists of three separate components: Model, View, and Controller. The goal of MVC is to separate the View (UI) from the Model (data). In the scope of this project, the model is the application database, the view renders the user interface elements (such as displaying restaurant choices), and the controller responds to user interface actions such as initiating a new query. The benefit of MVC is the loose coupling between the UI and the business logic: both can be developed independently of each other. It also increases flexibility and adaptability.

For the reasons above, Laravel [12] was chosen as the web development framework. Laravel is a free, open-source PHP web framework with a simple, expressive syntax. Laravel follows the MVC design closely, ensuring clarity between logic and presentation. This architecture helps improve performance and encourages well-organized, reusable, and maintainable code. Laravel also simplifies tasks such as routing, authentication, caching, sessions, and database migrations. Lastly, Laravel offers some of the best authentication features, including HTTP authentication, user authentication, and login throttling. All of the features above make Laravel a solid choice for developing content-rich websites.


6. Recommendation Web Service Implementation

The recommendation web service provides recommendations to external applications via RESTful APIs. The figure below illustrates how the modules communicate within the web service. The following sections explain the function of each module and the interactions between modules.

Figure 10 Web Service Workflow

The web service is implemented using Python Flask [10] and CherryPy [11]. All modules are defined in the following three Python files:

• Engine.py defines the recommendation engine, a wrapper class for the Apache Spark session and data operations.

• App.py is a Python Flask web application wrapped around the recommendation engine. It provides a RESTful web API on top of the recommendation engine so that external applications can access it.

• Server.py is a CherryPy web server that maintains connections between the application and the Apache Spark session. The server provides HTTP functions on top of the Flask application.

6.1 Engine Module

The core of the recommendation web service is the engine module. The Engine.py file defines the recommendation engine class, with a constructor and a list of methods related to Apache Spark data operations. The engine has four main components: class initialization, model training, adding ratings, and recommendation.

Upon initialization of a recommendation engine instance, the constructor is automatically called to load a previously trained, persisted model from storage. Dataframes containing preprocessed data are also loaded and cached in memory. By default, the recommendation engine instance does not retrain the model every time the server restarts, which speeds up the process. Instead, a method called train_model() handles requests for model retraining.

The train_model() method handles model retraining. It loads newly preprocessed data from storage, instantiates a new ALS algorithm instance, and builds a new recommendation model. The model is then persisted to storage.

The add_ratings() method takes in a rating object, which contains a user id, business id, rating, and other related information. The rating object is parsed and sanitized first. Then, the processed data is persisted to a staging area to be inserted into the database later.

The predict_rating() method takes in a user-item matrix object that contains empty ratings. The user-item matrix is input into the recommendation model, which transforms it by filling all the cells in the matrix with predicted ratings.
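The four components described above give the engine class the following rough shape. This is a structural sketch based on the description only; the attribute names and stubbed bodies are assumptions, not the thesis's actual code:

```python
class RecommendationEngine:
    """Wrapper around the Spark session, the ALS model, and data operations."""

    def __init__(self, spark_session, dataset_path):
        self.spark = spark_session
        self.dataset_path = dataset_path
        self.model = None            # a persisted ALS model would be loaded here
        self.staged_ratings = []     # staging area for ratings awaiting DB insert

    def train_model(self):
        """Reload preprocessed data, fit a new ALS model, and persist it."""
        raise NotImplementedError

    def add_ratings(self, ratings):
        """Parse/sanitize incoming ratings and stage them for later insertion."""
        self.staged_ratings.extend(ratings)

    def predict_rating(self, user_item_pairs):
        """Fill empty cells of a user-item matrix with predicted ratings."""
        raise NotImplementedError

# Hypothetical usage: stage a new rating without touching Spark.
engine = RecommendationEngine(spark_session=None, dataset_path='dataset')
engine.add_ratings([('user_1', 'business_9', 5)])
```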

6.2 Application Module

The application class wraps around the engine class and provides a list of routes that corresponds with methods in the engine class. Figure 11 below shows the entire content of the App.py. The primary function of the App module is to define API routes. There are three API routes definition that correspond with methods in the recommendation engine class:

• GET /ratings/{user_id}/top/{count} obtains the top N recommendations for a single user.

• GET /ratings/{user_id}/top/{business_id} obtains the predicted rating of a business for a single user.

• POST /rating/{user_id} adds a new rating from a user to the main datasets.
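For reference, these routes can be addressed from any HTTP client. The helper below only assembles the request URLs; the host and port are assumptions, and the plural /ratings prefix on the two GET routes mirrors the route definitions in Figure 11.

```python
BASE_URL = "http://localhost:5000"  # assumed host/port for the web service

def top_recommendations_url(user_id, count):
    # GET: top N recommendations for one user
    return f"{BASE_URL}/ratings/{user_id}/top/{count}"

def business_rating_url(user_id, business_id):
    # GET: predicted rating of one business for one user
    return f"{BASE_URL}/ratings/{user_id}/top/{business_id}"

def add_rating_url(user_id):
    # POST: submit a new rating for one user
    return f"{BASE_URL}/rating/{user_id}"
```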


main = Blueprint('main', __name__)

@main.route("/ratings/<user_id>/top/<int:count>", methods=["GET"])
def top_recommendation(user_id, count):
    top_ratings = recommendationengine.get_top_ratings(user_id, count)
    return json.dumps(top_ratings)

@main.route("/ratings/<user_id>/top/<business_id>", methods=["GET"])
def business_rating(user_id, business_id):
    # The <int:count> converter above routes numeric values to
    # top_recommendation; non-numeric business ids fall through to here.
    rating = recommendationengine.get_individual_rating(user_id, business_id)
    return json.dumps(rating)

@main.route("/rating/<user_id>", methods=["POST"])
def add_ratings(user_id):
    ratings_json = request.get_json()
    recommendationengine.add_ratings(ratings_json)
    return json.dumps(ratings_json)

def create_app(spark_session, dataset):
    global recommendationengine
    recommendationengine = RecommendationEngine(spark_session, dataset)
    app = Flask(__name__)
    CORS(app)
    app.register_blueprint(main)
    return app

Figure 11 App.py content

The create_app () method takes the Spark session and the dataset path as input. It creates a new engine instance using these parameters, then configures the Flask application itself and returns the application instance to the server. The create_app () method is always called from the server module when the server starts.


6.3 Server Module

The server class uses the CherryPy framework. The server class contains two methods: loadSpark () and run_server (). The loadSpark () method initializes and configures a new Spark session and returns it. The run_server () method takes in a Flask application object as input and hosts the application on the web server.

Server.py is the entry point of the web service. When it is submitted to Apache Spark, it invokes the loadSpark () method to establish a connection between the server instance and Spark. Next, the create_app () method from App.py is called to initialize a new Flask instance. Finally, the run_server () method takes the Flask application instance and hosts it on the server.
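This startup sequence can be sketched as follows. The sketch is a stand-in, not the actual Server.py: the real module builds a SparkSession and hosts the Flask application under CherryPy, whereas here a stub application and Python's built-in wsgiref server illustrate the same three-step wiring.

```python
from wsgiref.simple_server import make_server

def load_spark():
    # Stand-in for loadSpark(): the real method builds and returns a
    # configured SparkSession; this dict of settings is an assumption.
    return {"app_name": "eat-smart-server", "master": "local[*]"}

def create_app(spark_session, dataset_path):
    # Stand-in for App.create_app(): returns a minimal WSGI application.
    def app(environ, start_response):
        start_response("200 OK", [("Content-Type", "application/json")])
        return [b'{"status": "ok"}']
    return app

def run_server(app, port=5000):
    # Stand-in for run_server(): hosts the application on a web server
    # (CherryPy in the real module, wsgiref here).
    with make_server("", port, app) as httpd:
        httpd.serve_forever()

def main():
    # Mirrors Server.py's entry point:
    spark = load_spark()                    # 1. connect to Spark
    flask_app = create_app(spark, "data/")  # 2. initialize the application
    run_server(flask_app)                   # 3. host it on the server
```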


7. Web Application Implementation Using Laravel

This section explains the MVC design and how it relates to the Laravel framework. The workflow of a Laravel application will then be explored. The figure below shows a typical workflow of a Laravel application.

Figure 12 Laravel MVC

7.1 Model

In the context of MVC, the model is the representation of data. In this project, the model is the SQL database that contains the user and system data. One of Laravel's unique features is the Eloquent model. Eloquent models provide an elegant way to represent database tables and related data operations as PHP objects. All database tables and related operations can be encapsulated in Eloquent models. Long SQL queries are replaced by short, expressive function calls, much like objects behave in object-oriented languages such as C#.

By convention, each Eloquent model is mapped to a table that has a "snake case" name, which is the plural of the model name in lower case. For example, the User model is mapped to a database table named "users". Like classes in object-oriented languages, an Eloquent model contains properties and methods. To define an object relation, a method is placed in the model class that specifies which model is related to the current model and what the relationship is. There are four common model relation methods used in Laravel: hasMany (), hasOne (), belongsTo (), and belongsToMany (). Relationships such as one-to-one, one-to-many, and many-to-many can be defined using these methods.

class User extends Model
{
    public function roles()
    {
        return $this->belongsToMany('App\Role', 'role_user'); // user can have many roles
    }

    public function reviews()
    {
        return $this->hasMany('App\Review'); // user can have many reviews
    }
}

Figure 13 Eloquent Model Definition

There are four Eloquent models implemented in this project:

• User Model contains the user data and has a many-to-many relationship with the Role Model and a one-to-many relationship with the Review Model.

• Role Model contains user role data and has a many-to-many relationship with the User Model.

• Review Model contains rating data for businesses and has a many-to-one relationship with the User Model and a many-to-one relationship with the Restaurant Model.

• Restaurant Model contains detailed information about restaurants and has a one-to-many relationship with the Review Model.

7.2 View

The view contains all the front-end elements of the application. The view in Laravel consists of HTML, JavaScript, and CSS. By default, Laravel uses Bootstrap, Vue.js, and SASS to render highly responsive and dynamic web content. To combine these elements, Laravel uses the Blade template engine.

Two primary benefits of using the Blade template engine are template inheritance and sections. Most web applications maintain a general layout across the entire website to achieve a uniform user experience and visual style. Laravel has a "master" page layout defined in the app.blade.php file. The figure below shows the default master template.

<!DOCTYPE html>
<html>
    <head>
        <title>App Name - @yield('title')</title>
    </head>
    <body>
        @yield('content')
    </body>
</html>

Figure 14 Master Blade Template


The master template contains boilerplate code such as the site background, navigation bar, header, and footer. Sections are blocks of code that can be defined in any Blade file and rendered using the yield function. In practice, all views in Laravel inherit from the master template. Any yield function in the master template can be overridden by its children. In the figure below, the child inherits from the master template using the extends () function and overrides the section named "content" that corresponds with the yield () function in the master template.

@extends('layouts.app')

@section('content')

….

@endsection

Figure 15 Child Template

Table 7 below lists all views implemented in Eat-Smart and their functionalities.

Table 7 List of Views in Eat-Smart

View Description

GuestHome.blade.php The default view for guest users.

Home.blade.php View for authenticated users

UserReview.blade.php View that displays a user’s past reviews

Recommendation.blade.php Makes recommendation to user

BusinessDetail.blade.php Displays the detail information of a business

AddReview.blade.php Used by users to add new reviews.


7.3 Controller

The controller, as its name suggests, controls the interactions between the Views and the Models. The controllers in Laravel contain the business logic for the application.

Features are represented as functions inside the controller class. A controller function is called when the Laravel routing engine receives a URI request that maps to that function. When a controller function is invoked, it obtains data from the Model and the View and performs the business logic. The controller method then returns an object, usually a view or a redirect, along with the processed data. To make the data available to front-end logic in the view, the data is "compacted" into the return object in the controller. In the example below, the function userReview () is mapped to a GET API call. The function takes a user id as its input parameter and uses it to find the user data in the User Model. After obtaining the data, the controller function returns a view called "userReview" that is defined in the view folder, along with the processed data.

public function userReview($id) {
    $user = User::where('id', $id)->first();
    return view('userReview', compact('user'));
}

Figure 16 Laravel Controller Example

Each view in this project corresponds with at least one controller method. A single controller named "HomeController" is implemented in this project. Below is a list of controller functions and their related views.


Table 8 List of Controller Methods

Index ()
Renders the home page based on user authentication status.
Related views: GuestHome.blade.php, Home.blade.php

userReview ($id)
Queries the Review model using the provided $id and returns the view along with the review data.
Related view: UserReview.blade.php

addReview (Request $request)
Parses the POST data and adds the review data to the Review Model.
Related view: AddReview.blade.php

businessDetail ($id)
Queries the Restaurant model using the provided $id and returns the view along with the restaurant data.
Related view: BusinessDetail.blade.php

Recommendation ($id)
Requests recommendations from the recommendation web service using the provided $id and returns the view along with the recommendation data.
Related view: Recommendation.blade.php

7.4 Routing

All routes in Laravel applications are registered in the app/routes/web.php file. This file tells Laravel which controller function to call when the user requests a certain URI. Below is an example of a route definition. Each route is defined using the Route Facade, which provides a static interface to routing in Laravel.

Route::get('/home', 'HomeController@home');

Figure 17 Laravel Routing Example


In each route definition, the HTTP method (GET, POST, PUT) is specified first. The URI is then mapped to a function in a controller. When a user requests the "/home" URI, a function named "home" in the "HomeController" class is called. Figure 18 below lists all the routes implemented in Eat-Smart.

Route::get('/', 'HomeController@index');
Route::get('/home', 'HomeController@home');
Route::get('/restaurant', 'HomeController@restaurant');
Route::get('/recommendation/{id}/{count}', 'HomeController@recommendation');
Route::get('/businessDetail/{id}', 'HomeController@businessDetail');
Route::get('/restaurantDetail/{id}', 'HomeController@restaurantDetail');

/*****Reviewer related routes*******/
Route::get('/userReview/{id}', 'HomeController@userReview')->middleware('auth');
Route::get('/userProfile/{id}', 'HomeController@userProfile')->middleware('auth');
Route::get('/addReview/{id}', 'HomeController@addReviewForm')->middleware('auth');
Route::post('/addReview/{id}', 'HomeController@addReview')->middleware('auth');
Route::get('/changePassword/{id}', 'HomeController@PasswordForm')->middleware('auth');
Route::post('/changePassword/{id}', 'HomeController@changePassword')->middleware('auth');

Figure 18 Web Application Routes


8. Web Application Features

8.1 Overview

Eat-Smart allows users to search, view, and rate restaurants, while collecting user preferences and experiences to make personalized recommendations. The primary feature is the search function, which finds restaurant and food options in a specific location. Users can review and bookmark businesses, and this data is collected by the application. Once sufficient data is obtained from a user, recommendations are generated using the recommendation model. The following sections demonstrate the features of all pages in the web application.

8.2 Home Page for Guest

When a user visits the web application, the first thing they see is a portal page where they can search and view nearby restaurants, register for an account, or log in. The navigation menu contains a list of items that redirect to different site features. All visitors have access to these features. This view is rendered by the controller when the user requests the web page using the URI "/home".

The search bar is located at the top of the page. Users can enter search terms in the first text field and the search location in the second text field. Users can also add categorical filters by clicking the checkboxes next to each category. When the user clicks the search button, a request containing the search parameters is sent to the back-end. Once the back-end API responds to the request, the search results are rendered instantly without a page refresh thanks to AJAX [13], a technology for displaying web content dynamically.


Each item in the search results contains a featured picture of the business with categorical tags. The name of the business, its star rating, address, and contact information are also displayed. Users can view additional business information by clicking the View Business Detail button.

Figure 19 Home Page for Guest

8.3 Home Page for Authenticated User

This view is rendered after the user logs in from any page in the application. The home page for authenticated users contains all the features of the guest home page plus additional features such as viewing past reviews, managing profiles, and getting recommendations. The additional features are grouped together on the right side of the navigation bar. By grouping related features in proximity, users can easily find the features they want. To log out, users can simply click the logout item in the collapsible user menu, and they will be redirected to the guest home page.


Figure 20 Home Page for Authenticated Users

8.4 User Reviews Page

On this page, users can view their past reviews. A list of the user's reviews is rendered, containing the restaurant name, review title, review timestamp, rating, and review body.

The user can update each review by clicking on the Update Review button.

Figure 21 User Reviews Page


8.5 Business Detail Page

The Business Detail page displays information about a single restaurant, such as its overall rating, review count, location, and hours. Users can add a new review for the business by clicking the Add Review button, which redirects the user to the Add Review page.

Figure 22 Business Detail Page

8.6 Add Review Page

The Add Review page allows the user to rate a business with the following fields: stars, a review tagline, and a review body.


Figure 23 Add Review Page

When a review is added successfully, the user is redirected to the Restaurant Detail page, and a confirmation message acknowledging the user's action is displayed at the top of the page.

Figure 24 Confirmation Page for Successful Review Submission

8.7 Top Recommendation Page

The Top Recommendation page allows users to get recommendations from the recommendation engine. After the user clicks the button at the top of the page, a request containing data such as the user id, location, and other information is sent to the back-end. Laravel then makes a request to the recommendation web service. Once the web service responds, the results are displayed in a list. Each item in the list contains brief information about the business and a prediction value. If the predicted rating is greater than 4.0, a "highly recommended" tag is displayed. If the predicted rating is greater than 3.0, a "recommended" tag is displayed. By default, businesses with a predicted rating below 3.0 are not recommended.
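This tagging rule amounts to a simple threshold function. The sketch below uses the thresholds described in this section; the function name is illustrative rather than taken from the code base.

```python
def recommendation_tag(predicted_rating):
    # Map a predicted rating to the tag shown in the recommendation list.
    if predicted_rating > 4.0:
        return "highly recommended"
    if predicted_rating > 3.0:
        return "recommended"
    return None  # at or below the 3.0 cutoff: not recommended by default
```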

Figure 25 Top Recommendation Page


9. Conclusion and Future Work

Eat-Smart is a user-friendly application built around a robust and efficient recommendation engine. The application provides quick responses and makes it easier for users to find nearby restaurants that they will enjoy. The average response time for each user request is less than 10 seconds for result sizes of up to 500 items.

The recommendation engine provides personalized recommendations based on the user's previous dining experiences and trends. The application is a better alternative to the traditional methods of peer recommendation or expert advice. Users will spend less time exploring and contemplating different food options, which leads to a more satisfying user experience. The application could also mine user data such as reviews, demographic information, or search history, resulting in more accurate and fine-tuned recommendations beyond what traditional methods could offer.

While the application in its current state satisfies the original design requirements for this project, there is room for improvement. With more time, I would like to implement the following features.

1. Improve the recommendation engine using other machine learning algorithms, such as NLP or clustering, and combine their results for better prediction accuracy.

2. Build a better and more efficient data pipeline to handle larger datasets while maintaining fast response times, and expand the application to include more metropolitan areas.

3. Implement more sophisticated user data collection to capture implicit feedback (time spent on a page, number of times a user views a business, etc.).

4. Allow users to personalize their experience in the web UI.


Reference

[1] Yelp, Yelp Dataset Challenge, https://www.yelp.com/dataset/challenge

[2] A. Ihler et al., Recommender Systems Designed for Yelp.com, http://www.math.uci.edu/icamp/summer/research/student_research/recommender_systems_slides.pdf

[3] Zan Huang, Hsinchun Chen, and Daniel Zeng, Applying Associative Retrieval Techniques to Alleviate the Sparsity Problem in Collaborative Filtering, http://hdl.handle.net/10150/105493

[4] Charumathi Lakshmanan, Prem Nagarajan, Unbiased Restaurant Recommendation using Yelp review data, https://cseweb.ucsd.edu/classes/wi17/cse258-a/reports/a078.pdf

[5] Rahmtin Rotabi, A Preference-Based Restaurant Recommendation System for Individuals and Groups, https://www.cs.cornell.edu/~rahmtin/Files/YelpClassProject.pdf

[6] Apache Spark, https://spark.apache.org/

[7] Apache Parquet, https://parquet.apache.org/

[8] Hadoop, http://hadoop.apache.org/

[9] Apache MLlib, https://spark.apache.org/mllib/

[10] Flask, http://flask.pocoo.org/

[11] CherryPy, https://cherrypy.org/

[12] Laravel, https://laravel.com/


[13] AJAX, https://developer.mozilla.org/en-US/docs/Web/Guide/AJAX

[14] JSON, https://www.json.org/
