Use of Online Reviews as Quality Metrics: A Comparison of Airline Reviews on Twitter and Skytrax

Lin Lu (Fairfield University), Amit Mitra (Auburn University), Yu Wang (Auburn University), Pei Xu (Auburn University)

Abstract

Online reviews serve as a key data source in the service industry because they can be accessed in real time and reflect customers’ changing focus on service aspects. The current study analyzes reviews of major U.S. airlines on Skytrax and Twitter from 2014 to 2019 and examines whether the online review-derived weighted service quality (WSQ) score is a good indicator of airline service quality. We find that the topic distributions and the WSQ scores vary significantly between the non-social media and social media platforms, and that the WSQ score from social media is a better predictor of the objective measure, the airline quality rating (AQR).

Keywords: Social Media, Service Quality, Topic Model, Sentiment Analysis, Panel Regression

1. Introduction

Service quality is essential for retaining customer patronage (Chen & Hu, 2013; Ostrowski et al., 1993) and market share (Aksoy et al., 2003). This is especially true for airlines, as multiple service encounter stages can affect passengers’ satisfaction with partial or overall aspects of service (Brochado et al., 2019). The two dominant approaches to measuring service quality are perceptual surveys and reported operational metrics. Although surveys can build on standardized research designs (Lim & Lee, 2020), they have typical drawbacks: the time needed to collect complete datasets (Kothari, 2004), restricted research expandability (Lee & Bradlow, 2011), and sample size limits (Kotrlik & Higgins, 2001). Researchers have also urged caution in the use of operational metrics, as it is perception rather than operational performance that drives customers’ attitudes (Tiernan et al., 2008).
It is possible that the perception is worse than the operational metrics suggest, so that airlines need to address more than specific areas of objective service failure. As review websites and social media have become potent channels for consumers to post service-related issues and rate their satisfaction, online reviews are now a vital source for collecting and monitoring customer feedback (Siering et al., 2018). Compared to surveys and operational metrics, online reviews are publicly accessible, can be collected in real time, and can reflect customers’ changing focus on service.

The overarching goal of this study is to examine whether metrics derived from online reviews can serve as a measure of airline service quality. To achieve this goal, we set out to (a) identify the service aspects and aspect-specific sentiment in online airline reviews over the past years; (b) test whether the metrics derived from online reviews differ between social media and non-social media platforms; and (c) examine the relationships between weighted service quality (WSQ) scores and the industry standard airline quality rating (AQR).

The remainder of this paper is organized as follows. The following section provides the theoretical background for hypothesis formation. Section 3 details the methods we use for data acquisition, data preprocessing, sentiment analysis, topic modeling, and hypothesis testing. In Section 4, we discuss our experimental results in evaluating the utility of online reviews for measuring service quality.

2. Hypotheses Development

The theoretical base framing this research comes from Importance-Performance Analysis (IPA) (Sever, 2015; Martilla & James, 1977). The IPA framework treats importance as a weight on a service aspect’s performance in order to identify areas that need attention. We extend this idea in constructing the online review metrics for airline service quality.
In terms of the importance dimension, it is of interest to know whether customers put different emphases on service aspects depending on the type of platform they use. Twitter, a social media website with a simple design and a 140-character limit per post, has been one of the world’s most popular social media channels, appealing to both airlines and customers for sharing information and interacting in real time to address issues before, during, and after service. On non-social media websites such as Skytrax, a leading airline review site, customers instead typically write long reviews after service, and the text entry can reach a maximum of 3,500 characters. Skytrax also asks for ratings on aspects such as ground service, seat comfort, and cabin staff service. Thus, we propose the following hypothesis.

H1: The distributions of service topics are different between social media and non-social media online platforms.

Further consideration of the performance dimension leads us to ask whether customers’ perceptions of airline quality differ between the two platforms. Since it is the service quality, or the overall impact, that matters, we combine the importance and performance of service aspects using the Weighted Service Quality (WSQ) score (Table 1 describes the WSQ construction). We compare this aggregated metric between Skytrax and Twitter and, as the conservative null assumption, take the WSQs to be the same between the two platforms.

H2: The WSQ scores of an airline are different between the two types of online platforms.
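The WSQ construction (importance as a topic's share of sentences, performance as the topic's mean sentiment, and WSQ as the importance-weighted sum of performance) can be illustrated with a minimal sketch. The function name, the input layout of (topic, sentiment) pairs, and the example values are assumptions made for illustration and are not taken from the study.

```python
from collections import defaultdict


def weighted_service_quality(sentences):
    """Return (wsq, per_topic) for an iterable of (topic, sentiment) pairs.

    importance_i  = share of all sentences assigned to topic i
    performance_i = mean sentiment of the sentences in topic i
    WSQ           = sum over topics of importance_i * performance_i
    """
    scores_by_topic = defaultdict(list)
    for topic, sentiment in sentences:
        scores_by_topic[topic].append(sentiment)

    total = sum(len(scores) for scores in scores_by_topic.values())
    if total == 0:
        raise ValueError("at least one sentence is required")

    per_topic = {}
    for topic, scores in scores_by_topic.items():
        importance = len(scores) / total
        performance = sum(scores) / len(scores)
        per_topic[topic] = (importance, performance)

    wsq = sum(imp * perf for imp, perf in per_topic.values())
    return wsq, per_topic


# Hypothetical labeled sentences: (topic label, sentiment score in [-1, 1]).
example = [("delay", -0.5), ("delay", -0.3), ("crew", 0.8), ("crew", 0.4)]
wsq, per_topic = weighted_service_quality(example)
```

In this toy input each topic holds half of the sentences, so the score is the average of the two topic-level mean sentiments. In practice such a score would be computed per airline and per period before any comparison across platforms.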
Table 1: Service Quality Metrics
Metrics | Descriptions | Symbols
Service aspect | Topic cluster |
Importance of a service aspect | Percentage of sentences in topic i among all k topics |
Performance of a service aspect | Mean sentiment of the topic |
Weighted service quality (WSQ) score | Summation of importance * performance for all topics |

To validate the online review-derived metrics as measures of airline service quality, we examine the relationship between WSQs and the operational metric, the airline quality rating (AQR). If WSQ scores significantly predict AQR, this adds evidence for using online reviews, and the constructed WSQ measure, to analyze airline service quality. We propose the third hypothesis.

H3: The weighted service quality (WSQ) score from social media platforms is a better indicator of AQR than that from non-social media platforms.

3. Methods

We utilize a data-driven approach that consists of four main phases, as shown in Figure 1. In phase I, reviews posted between January 1, 2014 and June 30, 2019 are collected from both Skytrax and Twitter for 10 U.S. airlines. The exact review start date for each company varies based on when the first comment occurred. Tweets that mention (@) the official airline accounts of interest (@AlaskaAir, @Allegiant, @AmericanAir, @Delta, @FlyFrontier, @HawaiianAir, @JetBlue, @SouthwestAir, @SpiritAirlines, @united) are crawled using a Python module (“GitHub - taspinar/twitterscraper: Scrape Twitter for Tweets”). The Airline Quality Rating (AQR) data are extracted from the annual airline reports at airlinequalityrating.com as the industry standard operational metric.

Figure 1. Flowchart of Data Analysis
I. Data Acquisition
• Skytrax
• Twitter
• AQR
II. Data Preprocessing
• Data filtering and sampling
• Sentence tokenization
• Sentiment analysis
• Subjectivity analysis
• Word tokenization and lemmatization
• Stop words removal
• Data integration
III. Topic Modeling
• word2vec modeling
• Non-negative matrix factorization
• Visualization
IV. Hypotheses Testing
• Topic distribution comparisons
• Comparison of the online review derived metrics between platforms
• Validity of the online review derived metrics

Data preprocessing is conducted in phase II to remove noise from the online reviews. Only tweets originating from customers are included for further analysis; initial postings from official accounts that do not directly relate to service are screened out. Stratified random down-sampling by airline, year, and month is conducted to enable efficient exploration of the tweets. Regardless of platform, each review is split into sentences, and a sentiment score is then calculated for each sentence. Subjective words within each sentence are also identified and removed. We propose this removal of subjective words to avoid topics that are characterized by polarity rather than by entities (e.g., negative sentences forming a cluster of their own). Its effectiveness is verified by deriving and comparing topics both with and without the removal of subjective words. The sentence splitting, sentiment analysis, and subjective word detection described above are carried out with the TextBlob package in Python. Word lemmatization by part-of-speech tag is conducted with the WordNet lemmatizer. Finally, stop words and symbols are removed to obtain the clean preprocessed text. To reflect these changes, both the original and the preprocessed review sentences are preserved in the resulting dataset.

We use non-negative matrix factorization (NMF) to discover service aspects based on term frequency-inverse document frequency (TF-IDF) features.
Before topic modeling, the optimal number of clusters is determined by topic coherence, which is calculated using our trained word2vec model (Mikolov et al., 2013). Two authors with a background in business reviewed the featured words and sentences of each topic cluster. A subjective label is assigned if the two reach consensus; otherwise, the third author comments until a consistent label is reached. To test whether the online review topic distributions vary significantly between the two platforms, a chi-square test is applied to the cross-tabulation of sentence counts. The paired
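The chi-square comparison of topic distributions across platforms can be sketched in plain Python as follows. The function name and the sentence counts are hypothetical and serve only to show how the statistic and degrees of freedom are obtained from a topic-by-platform cross-tabulation.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a crosstab.

    `table` is a list of rows of observed counts, e.g. one row per topic
    and one column per platform.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of topic and platform.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical sentence counts: rows are topics, columns are (Skytrax, Twitter).
counts = [
    [120, 300],
    [200, 150],
    [80, 250],
]
stat, dof = chi_square_statistic(counts)
```

A p-value follows from the chi-square distribution with `dof` degrees of freedom, for example via `scipy.stats.chi2.sf(stat, dof)` if SciPy is available.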
