<<

Some Challenging Problems in

Huan Liu Joint work with

Shamanth Kumar Ali Abbasi Reza Zafarani Fred Morstaer Jiliang Tang

Arizona State University and Lab Some Challenging Problems in Mining Social Media April 22, 2014 1 Social Media Mining by Cambridge University Press

hp://dmml.asu.edu/smm/ Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 22 Tradional Media and Data

Broadcast Media One-to-Many

Communicaon Media One-to-One Tradional Data

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 3 Social Media: Many-to-Many

• Everyone can be a media outlet or producer • Disappearing communicaon barrier • Disnct characteriscs – User generated content: Massive, dynamic, extensive, instant, and noisy – Rich user interacons: Linked data – Collaborave environment, and wisdom of the crowd – Many small groups (the long tail phenomenon) – Aenon is expensive

Arizona State University 4 Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 4 Unique Features of Social Media

• Novel phenomena observed from people’s interacons in social media • Unprecedented opportunies for interdisciplinary and collaborave research – How to use social media to study human behavior? • It’s rich, noisy, free-form, and definitely BIG – With so much data, how can we make sense of it? • Pung “bricks” into a useful (meaningful) “edifice” • Developing new methods/tools for social media mining

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 5 Some Challenges in Mining Social Media • Evaluaon Dilemma – How to evaluate without convenonal test data? • Sampling Bias – Oen we get a small sample of (sll big) data. How can we ensure if the data can lead to credible findings? • Noise-Removal Fallacy – How do we remove noise without losing too much? • Studying Distrust in Social Media – Is distrust simply the negaon of ? Where to find distrust informaon with “one-way” relaons?

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 6 Evaluaon Dilemma

• Evaluaon is important in data mining – Tradional test data is oen not available in social media mining • Can we evaluate our findings without ground truth? • A case study of Migraon in Social Media – Users are a primary source of revenue – New social media sites need to aract users – Exisng sites need to retain their users – Compeon for precious aenon

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 77 Migraon in Social Media

• What is migraon? – Migraon can be described as the movement of users away from one locaon toward another, either due to necessity, or aracon to the new environment. • Two types of migraon Site 2 Aer Site 2 Site 3 – me t Site migraon Site 1 Site 3

Site 2 Site 2 – Aenon migraon Aer me t Site 1 Site 3 Site 1 Site 3

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 8 Obtaining User Migraon Paerns

• Goal: Idenfying trends of aenon migraon of users across the two phases of the collected data. • Process

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 9 Paerns from Observaon

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 10 Facing an Evaluaon Dilemma

• Important to know if they are some meaningful paerns – If yes, we invesgate further how we use the paerns for prevenon or promoon – If not, why not? And what can we do? • We would like to evaluate migraon paerns, but without ground truth • How? – User study or AMT?

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 11 Evaluang Paerns’ Validity

• One way is to verify if these paerns are fortuitous • Null Hypothesis: Migraon in social media is a random process – Generang another similar dataset for comparison • Potenal migrang populaon includes overlapping users from Phase 1 and Phase 2 • Shuffled datasets are generated by picking random acve users from the potenal migrang populaon • The number of random users selected for each dataset is the same as the real migrang populaon

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 12 A Significance Test

Coefficients Shuffled dataset Logisc Regression of user aributes

Comparing and Chi Square Stasc Sig. Test

Observed Coefficients migraon Logisc Regression of user dataset aributes

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 13 Evaluaon Results

• Significant differences observed in StumbleUpon, Twier, and YouTube • Paerns from other sites are not stascally significant. Potenal cause: – Insufficient Data?

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 14 Summary

• Migang or promong migraon by targeng high net-worth individuals – Idenfying users with high value to the network, e.g., high network acvity, user acvity, and external exposure • Social media migraon is first studied in this work • Alternave evaluaon approaches can help address the evaluaon dilemma

Understanding User Migraon Paerns in Social Media, S. Kumar, R. Zafarani, and H. Liu, AAAI’2010

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 15 Some Challenges in Mining Social Media

• Evaluaon Dilemma

• Sampling Bias

• Noise-Removal Fallacy

• Studying Distrust in Social Media

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 16 Sampling Bias in Social Media Data

• Twier provides two main outlets for researchers to access tweets in real me: – Streaming API (~1% of all public tweets, free) – Firehose (100% of all public tweets, costly) • Streaming API data is oen used to by researchers to validate hypotheses. • How well does the sampled Streaming API data measure the true acvity on Twier?

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 1717 Facets of Twier Data

• Compare the data along different facets • Selected facets commonly used in social media mining: – Top Hashtags – Topic Extracon – Network Measures – Geographic Distribuons

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 18 Preliminary Results

Top Hashtags Topic Extracon

• No clear correlaon between • Topics are close to those Streaming and Firehose data. found in the Firehose.

Network Measures Geographic Distribuons • Found ~50% of the top • Streaming data gets >90% of tweeters by different the geotagged tweets. centrality measures. • Consequently, the distribuon • Graph-level measures give of tweets by connent is very similar results between the similar. two datasets.

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 19 How are These Results?

• Accuracy of streaming API can vary with analysis to be performed • These results are about single cases of streaming API • Are these findings significant, or just an arfact of random sampling? • How do we verify that our results indicate sampling bias or not?

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 20 Histogram of JS Distances in Topic Comparison

• This is just one streaming dataset against Firehose • Are we confident about this set of results? • Can we leverage another streaming dataset? • Unfortunately, we cannot rewind as we have only one streaming dataset

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 21 Verificaon

• Created 100 of our own “Streaming API” results by sampling the Firehose data.

Genera2ng&Random&Samples& 60"

50"

40"

30"

Numer&of&tweets&(k)& 20"

10"

0" Firehose" Streaming" Random"1" Random"2" …" Random"100"

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 22 Comparison with Random Samples

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 23 Summary

• Streaming API data could be biased in some facets • Our results were obtained with the help of Firehose • Without Firehose data, it’s challenging to figure out which facets might have bias, and how to compensate them in search of credible mining results

F. Morstaer, J. Pfeffer, H. Liu, and K. Carley. Is the Sample Good Enough? Comparing Data from Twier’s Streaming API and Data from Twier’s Firehose. ICWSM, 2013.

Fred Morstaer, Jürgen Pfeffer, Huan Liu. When is it Biased? Assessing the Representaveness of Twier's Streaming API, WWW Web Science 2014.

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 24 Some Challenges in Mining Social Media

• Evaluaon Dilemma

• Sampling Bias

• Noise-Removal Fallacy

• Studying Distrust in Social Media

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 25 Noise Removal Fallacy

• We oen learn that: “99% Twier data is useless.” – “Had eggs, sunny-side-up, this morning” – Can we remove noise as we usually do in DM? • What is le aer noise removal? – Twier data can be rendered useless aer convenonal noise removal • As we are certain there is noise in data, how can we remove it?

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 2626 Social Media Data

• Massive and high-dimensional social media data poses unique challenges to data mining tasks – Scalability

– Curse of dimensionality • Social media data is inherently linked – A key difference between social media data and aribute-value data

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 27 Feature Selecon of Social Data

• Feature selecon has been widely used to prepare large-scale, high-dimensional data for effecve data mining • Tradional feature selecon deal with only “flat" data (aribute-value data). – Independent and Idencally Distributed (i.i.d.) • We need to take advantage of linked data for feature selecon

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 28

Representaon for Social Media Data

​�↓� ​�↓� …. …. …. …. ​�↓1 ​�↓1 ​�↓2 ​�↓1 ​�↓2 ​�↓3 ​�↓4

​�↓2 ​�↓4 1 ​�↓5 1 1 1 ​�↓6 1 ​�↓3 ​�↓7 1 1 ​�↓4 ​�↓8

User-post relaons

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 29 Representaon for Social Media Data

​�↓� ​�↓� …. …. …. …. ​�↓1 ​�↓1 ​�↓2 ​�↓1 ​�↓2 ​�↓3 ​�↓4

​�↓2 ​�↓4 1 ​�↓5 1 1 1 ​�↓6 1 ​�↓3 ​�↓7 1 1 ​�↓4 ​�↓8

User-user relaons

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 30 Representaon for Social Media Data

​�↓� ​�↓� …. …. …. …. ​�↓1 ​�↓1 ​�↓2 ​�↓1 ​�↓2 ​�↓3 ​�↓4

​�↓2 ​�↓4 1 ​�↓5 1 1 1 ​�↓6 1 ​�↓3 ​�↓7 1 1 ​�↓4 ​�↓8

Social Context

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 31 Problem Statement

• Given labeled data X and its label indicator matrix Y, the dataset F, its social context including user-user following relaonships S and user-post relaonships P, • Select k most relevant features from m features on dataset F with its social context S and P

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 32 How to Use Link Informaon

• The new queson is how to proceed with addional informaon for feature selecon • Two basic technical problems – Relaon extracon: What are disncve relaons that can be extracted from linked data – Mathemacal representaon: How to use these relaons in feature selecon formulaon • Do we have theories to guide us?

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 33 Relaon Extracon

​�↓4 ​�↓8

​�↓7

​�↓1 ​�↓3 ​�↓6

​�↓2 ​�↓1 ​�↓2 1. CoPost 2. CoFollowing ​�↓5 p3 ​�↓4 3. CoFollowed 4. Following

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 34 Relaons, Social Theories, Hypotheses

• Social correlaon theories suggest that the four relaons may affect the relaonships between posts • Social correlaon theories – : People with similar interests are more likely to be linked – Influence: People who are linked are more likely to have similar interests • Thus, four relaons lead to four hypotheses for verificaon

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 35 Modeling CoFollowing Relaon

• Two co-following users have similar topics of interests

Users' topic interests

T T( fi ) W fi ^ ∑ ∑ fi∈Fk fi∈Fk T(uk)= = | Fk | | Fk |

^ ^ T 2 2 min || X W − Y ||F +α || W ||2,1 +β || T(ui ) −T(u j ) ||2 W ∑ ∑ u ui ,u j∈Nu

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 36 Evaluaon Results on Digg

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 37 Evaluaon Results on Digg

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 38 Summary

• LinkedFS is evaluated under varied circumstances to understand how it works. – Link informaon can help feature selecon for social media data. • Unlabeled data is more oen in social media, is more sensible, but also more challenging.

Jiliang Tang and Huan Liu. `` Unsupervised Feature Selecon for Linked Social Media Data'', the Eighteenth ACM SIGKDD Internaonal Conference on Knowledge Discovery and Data Mining , 2012.

Jiliang Tang, Huan Liu. ``Feature Selecon with Linked Data in Social Media'', SIAM Internaonal Conference on Data Mining, 2012.

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 39 Some Challenges in Mining Social Media

• Evaluaon Dilemma

• Sampling Bias

• Noise-Removal Fallacy

• Studying Distrust in Social Media

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 40 Studying Distrust in Social Media

Introducon

Represenng Summary Trust

Trust in Social Compung

Measuring Incorporang Trust Distrust WWW2014 Tutorial on Trust in Social Compung Applying Trust Seoul, South Korea. 4/7/14 hp://www.public.asu.edu/~jtang20/tTrust.htm Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 4141 Distrust in Social Sciences

• Distrust can be as important as trust

• Both trust and distrust help a decision maker reduce the uncertainty and vulnerability associated with decision consequences

• Distrust may play an equally important, if not more, crical role as trust in consumer decisions

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 42 Understandings of Distrust from Social Sciences

• Distrust is the negaon of trust ─ Low trust is equivalent to high distrust ─ The absence of distrust means high trust ─ Lack of the studying of distrust maers lile

• Distrust is a new dimension of trust ─ Trust and distrust are two separate concepts ─ Trust and distrust can co-exist ─ A study ignoring distrust would yield an incomplete esmate of the effect of trust

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 43 Distrust in Social Media • Distrust is rarely studied in social media • Challenge 1: Lack of computaonal understanding of distrust with social media data – Social media data is based on passive observaons – Lack of some informaon social sciences use to study distrust • Challenge 2: Distrust informaon is usually not publicly available – Trust is a desired property while distrust is an unwanted one for an online social community

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 44 Computaonal Understanding of Distrust • Design computaonal tasks to help understand distrust with passively observed social media data § Task 1: Is distrust the negaon of trust? – If distrust is the negaon of trust, distrust should be predictable from only trust

§ Task 2: Can we predict trust beer with distrust? – If distrust is a new dimension of trust, distrust should have added value on trust and can improve trust predicon • The first step to understand distrust is to make distrust computable by incorporang distrust in

Arizona State University Some Challenging Problems in Mining Social Media trust models Data Mining and Machine Learning Lab April 22, 2014 45 Distrust in Trust Representaons There are three major ways to incorporate distrust in trust representaon – Considering low trust as distrust – Adding signs to trust values – Adding a dimension in trust representaons

Trust Trust Trust 1 1 1

0 Distrust 0 0 Distrust -1

-1 Distrust

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 46 An Illustraon of Distrust in Trust Representaons 0.8 • Considering low trust as distrust ─ Weighted unsigned network 1 1

0

• Extending negave values 0.8 ─ Weighted signed network 1 1 -1 (0.8,0)

(1,0) • Adding another dimension (1,0) (0,1) ─ Two-dimensional unsigned network

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 47 Task 1: Is Distrust the Negaon of Trust? • If distrust is the negaon of trust, low trust is equivalent to distrust and distrust should be predictable from trust

IF Distrust ≡ Low Trust

Predicng Predicng

THEN Distrust ≡ Low Trust

• Given the transivity of trust, we resort to trust predicon algorithms to compute trust scores for pairs of users in the same trust network

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 48 Evaluaon of Task 1 § The performance of using low trust to predict distrust is consistently worse than randomly guessing § Task 1 fails to predict distrust with only trust; and distrust is not the negaon of trust

dTP: It uses trust propagaon to calculate trust scores for pairs of users dMF: It uses the matrix factorizaon based predictor to compute trust scores for pairs of users dTP-MF: It is the combinaon of dTP and dMF using OR

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 49 Task 2: Can we predict Trust beer with Distrust

§ If distrust is not the negaon of trust, distrust should provide addional informaon about users, and could have added value beyond trust

§ We seek answer to whether using both trust and distrust informaon can help achieve beer performance than using only trust informaon

§ We can add distrust propagaon in trust propagaon to incorporate distrust

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 50 Evaluaon of Trust and Distrust Propagaon § Incorporang distrust propagaon into trust propagaon can improve the performance of trust measurement § One step distrust propagaon usually outperforms mulple step distrust propagaon

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 51 Concluding Remarks

• Evaluaon Dilemma • Sampling Bias in Social Media Data • Noise Removal Fallacy • Studying Distrust in Social Media

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 5252 THANKS to …

• Organizers for this wonderful opportunity to share our research work • Acknowledgments – Grants from NSF, ONR, ARO – DMML members and project leaders – Collaborators

Shamanth Kumar Ali Abbasi Reza Zafarani Fred Morstaer Jiliang Tang

Arizona State University Data Mining and Machine Learning Lab Some Challenging Problems in Mining Social Media April 22, 2014 5353