
Sharknado Social Media Analysis with SAP HANA and Predictive Analysis

Mining social media data for customer feedback is perhaps one of the greatest untapped opportunities for customer analysis in many organizations today. Social media data is freely available and allows organizations to personally identify and interact directly with customers to resolve any potential dissatisfaction. In today’s blog post, I’ll discuss using SAP Data Services, SAP HANA, and SAP Predictive Analysis to collect, process, visualize, and analyze social media data related to the recent social media phenomenon Sharknado.

Collecting Social Media Data with SAP Data Services

While I’ll be focusing primarily on the analysis of social media data in this blog post, social media data can be collected from any source with an open API by using Python scripting within a User-Defined Transform. In this example, I’ve collected data using the basic outline provided by SAP in the Data Services Text Data Processing Blueprints available on the SAP Community Network, updated for version 1.1 of the Twitter REST API. This process consists of 2 dataflows. The first tracks search terms, then constructs (Get_Search_Tasks transform) and executes (Search_Twitter transform) a Twitter search query to store the data pictured below. In addition to the raw text of the tweet, some metadata is available, including user name, time, and location information (if the user has made it publicly available).

Once the raw tweet data has been collected, I can use either the Text Data Processing transform in SAP Data Services or the Voice of Customer text analysis process in SAP HANA. While both processes give the same result, SAP Data Services is also able to perform preliminary summarization and transformations on the parsed data within the same dataflow. In this case, I will run text analysis in SAP HANA by running the command below in SAP HANA Studio.

Create FullText Index "VOC" On

() TEXT ANALYSIS ON CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER';

This results in a table called $TA_VOC in the same schema as the source table, as shown below.

In this table, TA_TOKEN (called SOURCE_FORM in SAP Data Services TDP) is the extracted entity or element from the tweet (for example, an identifiable person, place, topic, organization, or sentiment), and TA_TYPE (called TYPE in SAP Data Services TDP) is the category the entity falls under. These are the two main text analysis elements used to extract information from Twitter data.

For a more in-depth explanation on Text Data Processing and social media analysis using SAP Data Services, refer to the Decision First Summer EIM Expert Series webinar on Twitter data collection and social media sentiment analysis by Nicholas Hohman.

Once the Twitter data was loaded into SAP HANA and text analysis had been performed, I created an Analytic View and several Calculation Views to allow for visualization and analysis.

In the first Analytic View pictured above, I’ve cleaned up the TYPE categories a bit further to consolidate them into top-level categories (for example, combining all types of Organizations into one single Organization category) and assigned a numeric sentiment value to each sentiment-type entity as shown in the table below, ranging from 0 (strong negative sentiment) to 1 (strong positive sentiment).
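For readers who want to try the same idea outside SAP HANA, here is a minimal Python sketch of the two clean-up steps. The type names follow the Voice of Customer entity types as I understand them, and every numeric score below is an illustrative assumption on the 0 to 1 scale, not a value taken from my table.

```python
# Hypothetical scores on a 0-1 scale (assumed for illustration only).
SENTIMENT_SCORES = {
    "StrongNegativeSentiment": 0.0,
    "MajorProblem": 0.1,
    "WeakNegativeSentiment": 0.25,
    "MinorProblem": 0.35,
    "NeutralSentiment": 0.5,
    "WeakPositiveSentiment": 0.75,
    "StrongPositiveSentiment": 1.0,
}


def top_level_category(ta_type):
    """Collapse subtypes such as 'ORGANIZATION/COMMERCIAL' to 'ORGANIZATION'."""
    return ta_type.split("/", 1)[0]


def sentiment_score(ta_type):
    """Return the numeric score for a sentiment type, or None for other entities."""
    return SENTIMENT_SCORES.get(ta_type)
```

Non-sentiment entities return None rather than 0, so they cannot drag down an average later on.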

I then created a calculation view that aggregates data to the tweet-level and calculates tweet-level flags for analysis, including flags to indicate whether key types of entities are found in each tweet (location, topic, Twitter hashtag, retweet, sentiment, etc). This also aggregates the average sentiment based on any sentiments found within the tweet. I’ll use these aggregated metrics later for visualization and predictive analysis of the Twitter data.
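The tweet-level roll-up can be sketched in plain Python as follows. This is not the calculation view itself, only an illustration of the same logic; the input layout (tweet id, category, score) and the field names are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_tweets(entities):
    """entities: iterable of (tweet_id, category, sentiment_score_or_None)."""
    grouped = defaultdict(list)
    for tweet_id, category, score in entities:
        grouped[tweet_id].append((category, score))

    summary = {}
    for tweet_id, rows in grouped.items():
        categories = {category for category, _ in rows}
        # Only sentiment entities carry a score; everything else is None.
        scores = [score for _, score in rows if score is not None]
        summary[tweet_id] = {
            "entity_count": len(rows),
            "has_topic": "TOPIC" in categories,
            "has_location": "LOCATION" in categories,
            "has_sentiment": bool(scores),
            "avg_sentiment": sum(scores) / len(scores) if scores else None,
        }
    return summary


# Toy entity rows (made up): tweet 1 has a topic and two sentiments.
rows = [
    (1, "TOPIC", None),
    (1, "SENTIMENT", 1.0),
    (1, "SENTIMENT", 0.5),
    (2, "LOCATION", None),
]
tweet_summary = aggregate_tweets(rows)
```

Tweets with no sentiment entity keep an average of None, which makes it easy to filter them out before any sentiment modeling.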

The final outputs of the SAP HANA modeling process are 2 analysis sets:

1.) A tweet-level analysis set with aggregated flags and values summarizing the tweet, including tweet length, number of extracted entities within the tweet, and the metadata collected with the tweet, such as location, time, and the user information.

2.) An entity-level analysis set with tweet-level metadata joined back to the individual entities to allow analysis at the entity level.

While these analysis sets could be created using an SAP Data Services ETL process, the SAP HANA Information Views have the advantage of being calculated on the fly rather than as a batch process. If we are continuously monitoring and collecting Twitter data, users will have real-time access to social media trends and insights without having to wait for an overnight or batch process to finish.

Visualization and Analysis of #Sharknado Data

For this analysis, I collected over 33,000 tweets related to the topic “sharknado” over a period of days. After Text Analysis was performed, over 200,000 individual entities were extracted from these tweets. A natural first step is generating descriptive charts to explain the nature of these extracted entities and tweets. The figure below shows an area chart of all the entities extracted from the tweets by category. Twitter hashtags were the most commonly identified entities, followed by sentiments, Twitter users, topics, and organizations. The depth of color indicates the tweet-level average sentiment. This shows that tweets with topic entities have the highest (most positive) overall sentiment, while tweets with hashtags are much less positive.

A few other fast facts on the Sharknado tweets:

• 38% of the tweets collected include a retweet from another user
• 41% of tweets have a topic entity extracted from the text
• 7.5% of tweets have a location entity within the tweet text
• 45% of tweets have a sentiment entity identified in the text
• 54.5% of tweets have 5 or more entities extracted from the text

The chart below shows a histogram of tweets by the length of the tweet text. Tweets are most commonly right around the 140 character limit, with about 25% of tweets at 135 characters and above.

Now, we can start to examine the individual entities extracted from the tweets and sentiments associated with each entity. For example, we can pull the Person entities identified by the text analysis in a word cloud, shown below. This word cloud shows the most common entities (larger size) and the sentiment associated with the person entities (depth of color).

This shows that Tara Reid, Cary Grant, Tatiana Maslany, Ian Ziering, and Steve Sanders were the most commonly identified person entities, with Tatiana Maslany and Tara Reid appearing in tweets with higher average sentiments. Tara Reid and Ian Ziering are actors who appeared in Sharknado, and Steve Sanders was Ian Ziering’s character in Beverly Hills, 90210, but I was confused by the appearance of Cary Grant, whom Wikipedia identifies as an English actor with “debonair demeanor” who died in 1986, and Tatiana Maslany, a lesser-known Canadian actress, neither of whom appeared in Sharknado. Further filtering the tweet text for these particular entities, I find an extremely high retweet frequency for 2 influential tweets:

@TVMcGee: #Sharknado is even more impressive when you realize Tatiana Maslany played all the different sharks.

@RichardDreyfuss: People don't talk about it much in Hollywood (omertà and everything) but Cary Grant actually died in a #sharknado

The entity “impressive” was strongly positive for Tatiana Maslany, while “n’t talk” was considered a minor problem for the Cary Grant tweet. Further analysis can be done to identify popular characters and portions of the movie, which the Sharknado filmmakers can mine to identify the characters, plots, or topics to revisit in the already-approved sequel to Sharknado (coming Summer 2014).

Similarly, investigating location entities shown in the word cloud below, we can see the most common references are to Texas and Hollywood, with tweets about Texas being more positive than Hollywood.

Organizations identified by Text Analysis show that SyFy (the channel that brought you Sharknado) and the phrase Public Service Announcement, as well as Lego and Nova, were common in tweets, as shown in the word cloud below.

The SyFy and public service announcement phrases were found in a frequently retweeted tweet about a re-airing of the movie:

@Syfy: Public Service Announcement: #Sharknado will be rebroadcast on Thurs, July 18, at 7pm. Please retweet this important information.

Nova was a character in the movie who may have met an untimely end, which apparently did not elicit positive sentiments. The Lego topic/organization was also in a commonly re-tweeted tweet of a picture of a sharknado made of Legos.

@Syfy: OMG OMG OMG someone made #Sharknado out of LEGOs!!! http://t.co/0ORVv6w2uf http://t.co/lbjJ6DDvzU

Predictive Analysis on #Sharknado Data

After summarizing and visualizing the data, I can leverage SAP Predictive Analysis’s Predict pane to evaluate the models using predictive algorithms. We can further summarize tweet data across multiple numeric characteristics using a clustering algorithm. Clustering is an unsupervised learning algorithm and one of the most popular segmentation methods; it creates groups of similar observations based on numeric characteristics. In this case, the numeric characteristics available are: length of tweet, # of entities extracted from the tweet, and the presence of a topic or a sentiment flag. While binary variables are not technically appropriate to use in a clustering model, we’re including them here to increase the complexity of our model and make the results more interesting.

The clustering model results show 3 groups of tweets, roughly separated by size, with Cluster 3 being the short tweets, Cluster 1 the longer tweets, and Cluster 2 between 3 and 1. This clustering model does show us that longer tweets were more likely to have more entities identified by the text analysis and were more likely to have a sentiment and a topic within the tweet.

While this is an extremely simple example, with additional descriptive statistics we could cluster tweets according to sentiment and occurrences of key phrases or words; if the organization could link these tweet segments to customer satisfaction or other key metrics (such as referrals generated through social media buzz or calls to a customer service center), monitoring the frequency of tweets by segment would be a great, nearly real-time leading indicator of viral buzz, customer complaints, or referral business.
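To make the clustering step concrete, here is a small pure-Python sketch. I am assuming k-means here (this post does not name the algorithm behind the clustering results), the nine tweets are made-up toy data, and the features are left unscaled, so tweet length dominates the distance in this example.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    # Deterministic start: take k points spread evenly through the input order.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Toy rows (made up): (length, entity_count, has_topic, has_sentiment)
tweets = [
    (30, 1, 0, 0), (35, 2, 0, 0), (40, 2, 0, 1),
    (80, 4, 1, 0), (85, 4, 0, 1), (90, 5, 1, 1),
    (130, 7, 1, 1), (135, 8, 1, 1), (140, 8, 1, 1),
]
labels, centroids = kmeans(tweets, k=3)
```

In practice the inputs would usually be standardized first so that the binary flags and entity counts are not swamped by tweet length.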

Another potential application for predictive models would be attempting to estimate the impact of tweet characteristic on the sentiment value of the tweet. In this case, I’ve arbitrarily determined that a tweet with an average sentiment of 0.4 or higher is “Positive”. I can then use the R-CNR Decision Tree algorithm or a custom R function for Logistic Regression (see this previous blog on Custom R Modules) to predict which elements are most indicative of positive tweets. In order to compare these models, I use a filter transform to filter out tweets without sentiments. Then, I configure the Logistic Regression and R-CNR Tree modules to include all my descriptive data, including tweet length, number of entities extracted, and presence of location and topic entities.
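The filter and labeling steps described above can be sketched as follows. The 0.4 cut-off comes from this post; the dictionary keys and the sample tweets are assumptions made for the illustration, and the actual modeling was done in SAP Predictive Analysis rather than in Python.

```python
POSITIVE_THRESHOLD = 0.4  # the arbitrary cut-off chosen in this post


def build_training_rows(tweets):
    """Drop tweets with no sentiment and attach a binary 'positive' label."""
    rows = []
    for tweet in tweets:
        avg = tweet["avg_sentiment"]
        if avg is None:  # the filter step: no sentiment, nothing to predict
            continue
        features = [
            tweet["length"],
            tweet["entity_count"],
            int(tweet["has_location"]),
            int(tweet["has_topic"]),
        ]
        rows.append((features, int(avg >= POSITIVE_THRESHOLD)))
    return rows


# Made-up sample tweets using hypothetical field names.
sample = [
    {"length": 120, "entity_count": 6, "has_location": True,
     "has_topic": True, "avg_sentiment": 0.75},
    {"length": 40, "entity_count": 2, "has_location": False,
     "has_topic": False, "avg_sentiment": 0.25},
    {"length": 90, "entity_count": 3, "has_location": False,
     "has_topic": True, "avg_sentiment": None},
]
training_rows = build_training_rows(sample)
```

The same rows can feed both a decision tree and a logistic regression, which keeps the comparison between the two models fair.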

Once this predictive workflow has been run, I can review the results for the logistic regression and the decision tree.

Logistic Regression results

These model output charts show that the logistic regression model is not terribly predictive, showing an AUC (area under the ROC curve) of only 0.598 (AUC varies from 0 to 1 with a baseline of 0.5, and values closer to 1 indicate more accurate predictions).

This chart shows that there is a slight increase in predicted average sentiment (red line) across the actual average tweet sentiment (x axis). Blue bars represent tweet volume for each level of average sentiment. Ideally, the red line would be approximately diagonal from bottom left to top right.
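For anyone unfamiliar with AUC, one way to read it is as the share of (positive, negative) tweet pairs in which the model scores the positive tweet higher. The sketch below computes it that way in plain Python on made-up scores; it is meant to illustrate the metric, not to reproduce the 0.598 figure.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


perfect = auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])      # every pair ordered correctly
uninformed = auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])   # all ties: the 0.5 baseline
partial = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])     # 3 of 4 pairs correct
```

A model that gives every tweet the same score lands exactly on the 0.5 baseline, which is why a value just under 0.6 counts as only slightly better than guessing.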

Decision Tree results

The decision tree shows that the model is able to identify large pockets of tweets that are much more likely to be positive.

Pockets of highly positive tweets

In summary, the models show potential to distinguish tweet positivity based on tweet content characteristics. These models could be further tuned for accuracy with more Sharknado-related characteristics, such as whether the tweet mentioned specific plot points, emotions, or characters. In these preliminary models, results suggest that having a location entity, longer tweet length, and presence of a retweet contribute to positive sentiments. Perhaps this suggests that people are more likely to retweet positive tweets than negative?

The presence of key terms like “chainsaw” or “shark”, or of specific character names, could be added as input predictors, and we would be able to see the impact of those specific terms on sentiment positivity. Developers of the Sharknado sequel could determine which specific aspects of the film were most positively and negatively received by the audience and incorporate these concepts into the sequel.

Tips for Social Media Data Collection and Analysis

Based on this experiment, I have a few recommendations for approaching a similar problem going forward.

• Implement custom data dictionaries and custom categorizations: Using custom data dictionaries, we could have the text data processing step immediately identify key terms that are related to our particular topic. In this case, we could have created a custom dictionary with character names, plot points, or key terms like “chainsaw” or “shark”. These terms might not be recognized by the “standard” text analysis dictionaries, but they will help us automatically pull out and identify entities that are important in our particular scenario.

• Scrape profanity and irrelevant tweets immediately: One thing I noticed when pulling in Sharknado-related tweets was an abundance of profanity and Twitter spam. Scraping out profanity is important if the tweet data is going to be included in Business Intelligence reports or shared with others within the organization. Similarly, setting up policies to eliminate or avoid spam-related Twitter accounts may help keep the feedback data more pure. I noticed accounts that would tweet a message like “Get 500 followers free” and include the top 5 hashtags trending on Twitter at the time. These tweets made up a huge portion of the data I collected, and should have been immediately discarded based on the repetitive text so as not to influence frequency and sentiment analysis.

• Construct descriptive attributes: Probably the most important part of this process is constructing descriptive attributes for each of the tweets. These may include flags to indicate whether the tweet included a key entity or category, length fields, or perhaps user information that can be collected about the poster. These attributes might be related to the custom data dictionaries relevant to the topic.

• Identify and treat retweets differently: While the re-tweeted data is valuable in gauging influence and frequency of the social media buzz, it can bias the sentiment analysis by overwhelming the average sentiment with copies of the same information. Therefore, flagging tweets that contain retweeted information and excluding those from some sentiment analysis might eliminate sentiment bias of a single opinion or phrase that was retweeted many, many times.
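Two of the tips above, discarding repetitive spam and flagging retweets, can be sketched in a few lines of Python. The “RT @” prefix test and the repeat threshold of 3 are assumptions chosen for the illustration, and the sample tweets other than the @Syfy one are made up.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase and strip hashtags, mentions, and URLs so near-copies compare equal."""
    text = re.sub(r"https?://\S+|[#@]\w+", " ", text.lower())
    return " ".join(text.split())


def is_retweet(text):
    """Assumed convention: retweets start with 'RT @user'."""
    return text.startswith("RT @")


def flag_tweets(texts, spam_threshold=3):
    """Flag retweets and any normalized text repeated at least spam_threshold times."""
    # Count only non-retweets, so popular retweets are not mistaken for spam.
    counts = Counter(normalize(t) for t in texts if not is_retweet(t))
    return [
        {
            "text": t,
            "is_retweet": is_retweet(t),
            "is_spam": not is_retweet(t) and counts[normalize(t)] >= spam_threshold,
        }
        for t in texts
    ]


texts = [
    "Get 500 followers free #sharknado #a",
    "Get 500 followers free #royalbaby #b",
    "Get 500 followers free #sharknado http://t.co/x",
    "RT @Syfy: Public Service Announcement: #Sharknado will be rebroadcast",
    "Loved the chainsaw scene in #sharknado",
]
flags = flag_tweets(texts)
```

Stripping hashtags before comparing is what catches the “top 5 trending hashtags” pattern, since the hashtags change from tweet to tweet while the message does not.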

Implementation of Sentiment Analysis Data

While the Sharknado example is a fun pop culture phenomenon, how does this become relevant to a real-world organization? Collecting Twitter data relevant to an organization could provide nearly free focus group-like feedback directly from customers who are most likely to influence their peers. For example, a hotel chain could collect Twitter data not only from users that mention its brand name, but also from users mentioning competitors’ names or just talking about hotels in the general sense. They can get an idea of what contributes to positive and negative sentiments about hotels. Do negative sentiments most commonly accompany comments about cleanliness? Noise? Wait to check in? Staff? Do positive sentiments stem from amenities like the pool or gym? What is the general sentiment for customers of your hotel chain versus competitors? And are there particularly negative sentiments for users of one particular location that might indicate a serious problem?

Furthermore, having this type of feedback available in a nearly real-time environment allows organizations to monitor, respond to, and leverage social media buzz to increase audience or revenue for the organization. For example, when SyFy executives saw the volume of social media posts and response to the initial Sharknado airing, SyFy was able to quickly schedule subsequent showings, commit to a sequel, and arrange for the film to make its theatrical debut, then disperse this information via Twitter while the topic was still trending. This equates to increasing awareness and future audience at a very low cost. If SyFy had missed this window, it would have had to expend significant marketing funds to regenerate this level of buzz. In fact, by leveraging this strong social media buzz around the initial airing of Sharknado, SyFy actually garnered higher viewership with the re-airing than it experienced during the initial premiere.

This type of feedback can give insight not only into what users might think about your organization’s brand overall, but also into the importance that specific product aspects hold in a user’s experience. Understanding how the consumer values these factors could guide investment decisions or marketing strategies by highlighting the features that customers care about and those that are not meaningful.

Hillary Bliss, Analytics Practice Lead, Decision First Technologies
Twitter: @HillaryBlissDFT

Hillary Bliss is the Analytics Practice Lead at Decision First Technologies, and specializes in data warehouse design, ETL development, statistical analysis, and predictive modeling. She works with clients and vendors to integrate business analysis and predictive modeling solutions into the organizational data warehouse and business intelligence environments based on their specific operational and strategic business needs. She has a master’s degree in statistics and an MBA from Georgia Tech.