Sentiments and Opinions from Twitter About Peruvian Touristic Places Using Correspondence Analysis
Total Page:16
File Type:pdf, Size:1020Kb
Sentiments and Opinions From Twitter About Peruvian Touristic Places Using Correspondence Analysis Luis Cajachahua Indira Burga Universidad Nacional de Ingeniería Universidad de Ingeniería y Tecnología [email protected] [email protected] Abstract very valuable, because it allows to discover and develop the main attributes of touristic places in Tourism in Perú has become very im- Perú, considering not only the place itself, but all portant, since there is a growing number of the ecosystem: hotels, touristic agencies, handi- tourists arriving each year. This paper fo- craft stores, transportation, public services, etc. cus in understand what do speaking- english tourists have in consideration For that reason, we downloaded 192,525 when they visit Perú. We obtained all the tweets, published during 2016, considering only tweets published in english during the year english language. After a careful data preparation 2016, filtered by touristic places visited. In and feature extraction, we applied some sentiment total, more than 192 thousand tweets were analysis techniques to tag each tweet using the collected. We performed different analysis following eight sentiments: anger, fear, sadness, to describe the data, including correspond- disgust, surprise, anticipation, trust and joy, pro- ence analysis, a statistical technique which posed by Plutchik (Mohammad and Turney, is normally applied to categorical data. 2010). Using correspondence analysis, we discov- The goal was to understand the sentiments ered some associations between touristic places and opinions expressed in those tweets. visited and sentiments expressed in the extracted Keywords: Twitter, Tourism, Sentiment tweets. Analysis, Text Mining, Correspondence In addition, we used other text mining tech- Analysis, Perú. niques to determine some general concepts men- tioned on tweets and identify the association be- 1 Introduction tween the places visited and these concepts. Tourism is an important source of economic 2 Background growth in Perú. In 2015, 3.5 million of tourists visited Perú, leaving more than USD 4,151 mil- Text mining has become a useful tool to deal lion that correspond to the 3.75% of the gross do- with the data deluge. With few and simple tools, mestic product (Promperú, 2016). For this reason, we can now extract valuable insights from a big all the government tourism bureaus, specially amount of unstructured data available today in the Promperú, are working hard to keep those num- internet. For example, one of the first studies of bers growing on Figure 1. sentiment analysis in Twitter was developed in 2011 by Agarwal et al. They built a formula to calculate the polarity of a comment, if the tweet is positive or negative. They developed a tree kernel approach to improve the classification (Agarwal, et al., 2011). In the same year, Dodds et al analyzed a huge list of tweets, around 4.6 billion, posted by over 63 million individual users and extracted in a 33- month span (Dodds, et al., 2011). The main goal Figure 1: Tourist arrivals to Perú in the last 10 was to develop a dictionary optimized to measure years (Promperú, 2016). the level of happiness of a given text, e.g. another tweet. They named their solution as “The Hedom- In this context, every initiative oriented to un- eter” derstand the tourist preferences and sentiments is 178 One of the first studies about tourism using ñ Being Twitter an internet social network, all tweets was developed by Antoniadis. This study the users must be registered and logged to was focused in verifying the correlation between use it. But not every person or tourist in the Twitter performance and tourism competitive in- world has a Twitter account. For this reason, dex for 38 destination management organizations results are not generalizable/representative (DMO’s). They found that Twitter use was in ac- for the tourists’ population. cordance with countries’ tourism performance (Antoniadis, et al., 2014). ñ We acquired all the tweets directly from Oku et al, analyzed 4.5 million tweets to find Twitter. We didn’t used streaming or scrap- the location where a tourist posted a photo. They ing methods to download the tweets. But, used only geotagging information from the pic- we used many filters to select tweets re- tures, leaving text data out of the analysis (Oku, et ferred to touristic places. In consequence, we could have selected a non-representative al., 2015). sample of tweets. In addition, Bassolas performed an interesting analysis about the 20 most interesting touristic ñ We considered only original posted tweets. places around the world, including Machu Picchu. We are not considering retweets. But, there They tried to measure the “attractiveness” of each are many platforms that post news or copies place, considering two metrics: a) Radius, defined of tweets, e.g. IFTTT. Therefore, we could as the average distance between the places of resi- find the exact same tweet published by dence and the touristic site and b) Coverage, the many different users. area covered by the users’ places of residence computed as the number of distinct zones (or ñ Some Tweets (10%) include the location of countries) of residence (Bassolas, et al., 2016). the user but this information is not always All these studies were considered to develop a real which makes classification by origin different approach, focused on finding the emo- difficult. Besides, the georeferenced infor- tions expressed by the tourists in the visited plac- mation is not representative. es. 3 Methodology 2.1 Objectives Having reviewed the references and settled the This research was focused in five goals: scope of the study, we selected some tools and ñ Understand the sentiment expressed in the techniques available to analyze the information. tweet (positive or negative) about the place. 3.1 Scope ñ Tag each tweet using the sentiments that the The population considered for this study was user is expressing in the tweet formed by 192,525 tweets downloaded directly from Twitter. The distribution of tweets is shown ñ Identify the main qualitative attributes men- in the Figure 2. tioned in the tweets, related to the touristic place. ñ Analyze the possible relationships between the places and the identified attributes. ñ Explore the general activity of the tourists which use twitter (frequent, eventual). Having those objectives in mind, we deter- mined the limits of our efforts, given the nature of the information source and the tools used. 2.2 Limitations Figure 2: Tweets per month during 2016. There are various limitations to be considered: 179 3.2 Text Mining a) Drop Retweets, Hashtags and mentions. Text Mining is the methodological process of b) Drop strange symbols information extraction, where researchers deal c) Drop webpages with collections of documents, using specialized d) Drop punctuation signs analysis methods (Balbi, et al., 2012). e) Drop numbers The complete process we have followed is f) Drop tabs or multiple spaces shown in Figure 3. We collected tweets from 12 We frequently found spelling mistakes in the touristic places from Perú, that were suggested by tweets. For that reason, we used a spelling correc- Promperú, using words related to those places. We tor. E.g. we replaced ‘speling’ by ‘spelling’ with used Python with GNIP PowerTrack® API to re- an algorithm in Python (Dean and Bill 2007), for trieve and download the information. the Stemming we used SAS Text Miner and The sintax used to obtain tweets was: (lang:en) NLTK (Bird, et al., 2009). Finally, we removed (machupicchu OR picchu OR sacsayhuaman OR the most common stopwords such as ‘a’, ‘the’, camino inca OR cusco OR cuzco OR plaza Armas ‘an’, etc. from the datasets. cusco OR plaza armas cuzco OR ollataytambo After the cleaning process, we tried to under- OR salinas maras OR qorikancha OR coricancha stand the dataset, analyzing the comments, using OR ccoricancha OR pisac OR aguas calientes OR different R packages (qdap, koRpus, tm). SAS moray OR valle sagrado OR nasca lineas OR Text Miner was helpful for Topic Modeling and nazca lineas OR paracas OR islas ballestas OR Term Relationship diagrams. Finally, R was used manglares tumbes OR puerto pizarro OR parque for Sentiment Classification (library syuzhet) and reserva lima OR larcomar OR plaza armas lima Correspondence Analysis (function corresp). OR parque kennedy OR catedral lima OR pacha- We have pre-defined some “concepts”, using camac OR monasterio santa catalina OR cañon the Text Profile Node of SAS Text Mining (SAS colca OR colca OR uros OR taquile OR sillustani Institute Inc. 2013). It performs a Hierarchical OR cathedral cusco OR cathedral cuzco OR main Bayesian Model which predict the concepts asso- square lima OR nazca lines OR nasca lines OR ciated to each touristic place. We have grouped ballestas island OR main square cusco OR square some terms in concepts, after lemmatization, as- kennedy OR santa catalina monastery OR colca sociated to positive and negative characteristics of canyon OR sacred valley OR inca trail OR inKa the attractions visited. trail OR cordillera blanca OR huaraz OR llanga- In our case, we also identified, all these words, nuco OR amazon river peru OR mancora OR vi- considering verbs, to represent the activities that chayito OR madre dios OR manu park OR iquitos tourists do in the places and adjectives, to repre- OR tarapoto OR kuelap OR moche OR titicaca sent the opinions or value perceived by the tourist. OR lake 69 OR laguna 69 OR señor sipan OR lord sipan OR pullana OR misti volcan OR misti - Wordclouds - Correspondence Analysis volcano OR yanahuara OR huascaran OR churup - Perceptual Maps Characterization OR chavin OR mochica OR pacaya samiria OR - Term relationships - Topic Modeling tucume OR cumbemayo OR baños inca OR wari - Tagging/Classification Tagging and Classification OR punta sal OR amotape OR catacaos OR ba- - Ortographical Review huaja sonene OR cotahuasi OR tingo maria) - Terms filtering - Cleaning Pre-processing During initial pre-processing of the datasets, all - Download tweets tweets where labelled with the cities and touristic - Format - Preparation places, by matching words related to the places.