Sharknado Social Media Analysis with SAP HANA and Predictive Analysis
Mining social media data for customer feedback is perhaps one of the greatest untapped opportunities for customer analysis in many organizations today. Social media data is freely available and allows organizations to personally identify and interact directly with customers to resolve any potential dissatisfaction. In today's blog post, I'll discuss using SAP Data Services, SAP HANA, and SAP Predictive Analysis to collect, process, visualize, and analyze social media data related to the recent social media phenomenon Sharknado.

Collecting Social Media Data with SAP Data Services

While I'll be focusing primarily on the analysis of social media data in this blog post, social media data can be collected from any source with an open API by using Python scripting within a User-Defined Transform. In this example, I've collected Twitter data using the basic outline provided by SAP in the Data Services Text Data Processing Blueprints available on the SAP Community Network, updated for version 1.1 of the Twitter REST API. This process consists of two dataflows, which track search terms and then construct (Get_Search_Tasks transform) and execute (Search_Twitter transform) a Twitter search query to store the data pictured below. In addition to the raw text of the tweet, some metadata is available, including user name, time, and location information (if the user has made it publicly available).

Once the raw tweet data has been collected, I can use either the Text Data Processing transform in SAP Data Services or the Voice of Customer text analysis process in SAP HANA. While both processes give the same result, SAP Data Services is also able to perform preliminary summarization and transformations on the parsed data within the same dataflow. In this case, I will run text analysis in SAP HANA by running the command below in SAP HANA Studio.

CREATE FULLTEXT INDEX "VOC" ON <table name>(<tweet text column name>)
TEXT ANALYSIS ON CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER';

This results in a table called $TA_VOC in the same schema as the source table, as shown below. In this table, TA_TOKEN (called SOURCE_FORM in SAP Data Services TDP) is the extracted entity or element from the tweet (for example, an identifiable person, place, topic, organization, or sentiment), while TA_TYPE (called TYPE in SAP Data Services TDP) is the category the entity falls under. These are the two main text analysis elements used to extract information from Twitter data. For a more in-depth explanation of Text Data Processing and social media analysis using SAP Data Services, refer to the Decision First Summer EIM Expert Series webinar on Twitter data collection and social media sentiment analysis by Nicholas Hohman.

Once the Twitter data was loaded into SAP HANA and text analysis had been performed, I created an Analytic View and several Calculation Views to allow for visualization and analysis. In the first Analytic View, pictured above, I've cleaned up the TYPE categories a bit further to consolidate them into top-level categories (for example, combining all types of Organizations into one single Organization category) and assigned a numeric sentiment value to each sentiment-type entity, ranging from 0 (strong negative sentiment) to 1 (strong positive sentiment), as shown in the table below.
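The sentiment value table itself is an image in the original post, but the scoring logic is easy to sketch. Below is a minimal Python illustration, assuming the standard Voice of Customer sentiment type names; the category names and exact numeric values here are assumptions, and in the project this mapping was implemented within the Analytic View rather than in code.

# Illustrative mapping from Voice of Customer sentiment TA_TYPE values to a
# 0-1 sentiment score. The category names and numeric values are assumptions
# for illustration; in the project this logic lives in the Analytic View.
SENTIMENT_SCORES = {
    "StrongNegativeSentiment": 0.00,
    "WeakNegativeSentiment": 0.25,
    "NeutralSentiment": 0.50,
    "WeakPositiveSentiment": 0.75,
    "StrongPositiveSentiment": 1.00,
}

def sentiment_score(ta_type):
    """Return the 0-1 score for a sentiment-type entity, or None for other types."""
    return SENTIMENT_SCORES.get(ta_type)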
I then created a Calculation View that aggregates data to the tweet level and calculates tweet-level flags for analysis, including flags to indicate whether key types of entities are found in each tweet (location, topic, Twitter hashtag, retweet, sentiment, etc.). This view also calculates the average sentiment based on any sentiments found within the tweet. I'll use these aggregated metrics later for visualization and predictive analysis of the Twitter data.

The final outputs of the SAP HANA modeling process are two analysis sets:

1.) A tweet-level analysis set with aggregated flags and values summarizing the tweet, including tweet length, the number of extracted entities within the tweet, and the metadata collected with the tweet, such as location, time, and user information.

2.) An entity-level analysis set with tweet-level metadata joined back to the individual entities to allow analysis at the entity level.

While these analysis sets could be created using an SAP Data Services ETL process, the SAP HANA Information Views have the advantage of being calculated on the fly rather than as a batch process, so if we are continuously monitoring and collecting Twitter data, users will have real-time access to social media trends and insights without having to wait for an overnight or batch process to finish.

Visualization and Analysis of #Sharknado Data

For this analysis, I collected over 33,000 tweets related to the topic "sharknado" over a period of days. After text analysis was performed, over 200,000 individual entities were extracted from these tweets. A natural first step is generating descriptive charts to explain the nature of these extracted entities and tweets. The figure below shows an area chart of all the entities extracted from the tweets by category. Twitter hashtags were the most commonly identified entities, followed by sentiments, Twitter users, topics, and organizations. The depth of color indicates the tweet-level average sentiment; tweets with topic entities have the highest (most positive) overall sentiment, while tweets with hashtags are much less positive.

A few other fast facts on the Sharknado tweets:

38% of the tweets collected include a retweet from another user
41% of tweets have a topic entity extracted from the text
7.5% of tweets have a location entity within the tweet text
45% of tweets have a sentiment entity identified in the text
54.5% of tweets have 5 or more entities extracted from the text

The chart below shows a histogram of tweets by the length of the tweet text. Tweets are most commonly right around the 140-character limit, with about 25% of tweets at 135 characters or more.

Now we can start to examine the individual entities extracted from the tweets and the sentiments associated with each entity. For example, we can pull the Person entities identified by the text analysis into a word cloud, shown below. This word cloud shows the most common entities (larger size) and the sentiment associated with each person entity (depth of color). Tara Reid, Cary Grant, Tatiana Maslany, Ian Ziering, and Steve Sanders were the most commonly identified person entities, with Tatiana Maslany and Tara Reid appearing in tweets with higher average sentiments.
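For reference, the aggregation behind a word cloud like this is straightforward once the entity-level analysis set is available. The Python sketch below assumes the entity-level data has been exported to a CSV with columns TA_TOKEN, TA_TYPE, and a hypothetical AVG_SENTIMENT column (the tweet-level average sentiment joined back to each entity); the file name and that column are illustrative, not the actual view output.

import pandas as pd

# Hypothetical export of the entity-level analysis set; the file name and the
# AVG_SENTIMENT column are assumptions for illustration.
entities = pd.read_csv("sharknado_entities.csv")

# Count mentions and average the tweet-level sentiment for each person entity.
persons = (
    entities.loc[entities["TA_TYPE"] == "PERSON"]
    .groupby("TA_TOKEN")["AVG_SENTIMENT"]
    .agg(mentions="size", avg_sentiment="mean")
    .sort_values("mentions", ascending=False)
)

# 'mentions' drives the word size in the cloud; 'avg_sentiment' drives the
# depth of color.
print(persons.head(10))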
Tara Reid and Ian Ziering are actors who appeared in Sharknado, and Steve Sanders was Ian Ziering's character in Beverly Hills, 90210, but I was confused by the appearance of Cary Grant, whom Wikipedia identifies as an English actor with a "debonair demeanor" who died in 1986, and Tatiana Maslany, a lesser-known Canadian actress, neither of whom appeared in Sharknado. Further filtering the tweet text for these particular entities, I found an extremely high retweet frequency for two influential tweets:

@TVMcGee: #Sharknado is even more impressive when you realize Tatiana Maslany played all the different sharks.

@RichardDreyfuss: People don't talk about it much in Hollywood (omertà and everything) but Cary Grant actually died in a #sharknado

The entity "impressive" was strongly positive for Tatiana Maslany, while "n't talk" was considered a minor problem for the Cary Grant tweet. Further analysis could identify popular characters and portions of the movie, which the Sharknado filmmakers can mine for the characters, plots, or topics to revisit in the already-approved sequel to Sharknado (coming Summer 2014).

Similarly, investigating the location entities shown in the word cloud below, we can see that the most common references are to Texas and Hollywood, with tweets about Texas being more positive than tweets about Hollywood.

Organizations identified by text analysis show that SyFy (the channel that brought you Sharknado) and the phrase Public Service Announcement, as well as Lego and Nova, were common in tweets, as shown in the word cloud below. The SyFy and Public Service Announcement phrases were found in a frequently retweeted tweet about a re-airing of the movie:

@Syfy: Public Service Announcement: #Sharknado will be rebroadcast on Thurs, July 18, at 7pm. Please retweet this important information.

Nova was a character in the movie who may have met an untimely end, which apparently did not elicit positive sentiments. The Lego topic/organization appeared in a commonly retweeted tweet of a picture of a sharknado made of Legos:

@Syfy: OMG OMG OMG someone made #Sharknado out of LEGOs!!! http://t.co/0ORVv6w2uf http://t.co/lbjJ6DDvzU

Predictive Analysis on #Sharknado Data

After summarizing and visualizing the data, I can leverage SAP Predictive Analysis's Predict pane to analyze the data using predictive algorithms. We can further summarize the tweet data across multiple numeric characteristics using a clustering algorithm. Clustering is an unsupervised learning technique and one of the most popular segmentation methods; it creates groups of similar observations based on numeric characteristics. In this case, the numeric characteristics available are the length of the tweet, the number of entities extracted from the tweet, and the presence of a topic flag and a sentiment flag. While binary flag variables are not technically ideal inputs for a distance-based clustering model, I've included them here to add more dimensions to the segmentation and make the results more interesting.

The clustering model results show three groups of tweets, roughly separated by size, with Cluster 3 containing the short tweets, Cluster 1 the longer tweets, and Cluster 2 falling in between. The model also shows that longer tweets were more likely to have more entities identified by the text analysis and were more likely to have both a sentiment and a topic within the tweet.
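For readers without SAP Predictive Analysis, the same kind of segmentation can be sketched in Python with scikit-learn. This is not the algorithm configuration used in the Predict pane above, just an illustrative k-means run on the four tweet-level characteristics described in this post, with hypothetical column and file names for the exported analysis set.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical export of the tweet-level analysis set; the file and column
# names are assumptions for illustration.
tweets = pd.read_csv("sharknado_tweets.csv")
features = tweets[["TWEET_LENGTH", "ENTITY_COUNT", "TOPIC_FLAG", "SENTIMENT_FLAG"]]

# Scale the inputs so tweet length does not dominate the binary flags.
scaled = StandardScaler().fit_transform(features)

# Three clusters, matching the segmentation described above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
tweets["CLUSTER"] = kmeans.fit_predict(scaled)

# Compare cluster profiles: average length, entity count, and flag rates.
print(tweets.groupby("CLUSTER")[features.columns.tolist()].mean())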