What to Mine from Big Data?

Hang Li
Noah’s Ark Lab, Huawei Technologies

Big Data

Value

Two Main Issues in Big Data Mining

Agenda

• Four Principles for “What to Mine”
• Stories regarding the Principles
  – Search and Browse Log Mining as Example
• Our Work on Big Data Mining
  – Mining Query Subtopics from Search Log Data
• Summary

Four Principles for “What to Mine”

1. Identifying scenarios of mining as much as possible
2. Logging as much data as possible
3. Integrating as much data as possible
4. ‘Understanding’ data as much as possible

Identifying scenarios of mining as much as possible

Immanuel Kant

“The world as we know it is our interpretation of the observable facts in the light of theories that we ourselves invent.”

Example of Bad Design of Toolbar

• A toolbar developed at a
• It recorded the user’s search behavior data
• However,
  – It did not record the time at which the user closed the browser
  – No indication of end of session

Logging as much data as possible

Examples of Useful Log Information

• User moves mouse on screen (user may unconsciously put mouse on focused area)
  – may infer user’s interest in the page
• User uses mouse to scroll up and down
  – may infer whether user is serious about page content (more scrolling suggests more seriousness)
• User clicks on next page
  – may infer user’s current focus
• User closes browser window/tab
  – may infer user’s current focus
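
A minimal sketch of how such log signals might be recorded, in Python. The event names, the `LogEvent` fields, and both helper functions are hypothetical and do not describe the schema of any real toolbar. The sketch only illustrates that logging a close event with a timestamp makes the end of a session observable, which the toolbar in the earlier example did not allow.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    MOUSE_MOVE = "mouse_move"      # may hint at the area the user focuses on
    SCROLL = "scroll"              # more scrolling may suggest more serious reading
    NEXT_PAGE_CLICK = "next_page"  # may hint at the user's current focus
    CLOSE = "close"                # window/tab closed; a possible end of session


@dataclass(frozen=True)
class LogEvent:
    """One logged user action (hypothetical record layout)."""
    user_id: str
    timestamp: float  # seconds since some epoch (assumed unit)
    event_type: EventType
    url: str


def scroll_count(events, url):
    """Count scroll events on one page as a rough engagement signal."""
    return sum(
        1 for e in events
        if e.event_type is EventType.SCROLL and e.url == url
    )


def session_end_time(events):
    """Return the timestamp of the last CLOSE event, or None if none was logged."""
    closes = [e.timestamp for e in events if e.event_type is EventType.CLOSE]
    return max(closes) if closes else None
```

If no close event was ever logged, `session_end_time` returns `None`, which is the situation described for the badly designed toolbar.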

Integrating as much data as possible

Model of User Search Behavior

• Data needs to be collected from different sources (toolbar, search engine log)
• E.g., toolbar usually does not record search results
• Often challenging to integrate data

Understanding Data as Much as Possible

AOL Search Data Leak (2006)

• AOL search data release (20M queries, 650K users, 3 months)
• New York Times article “A Face Is Exposed for AOL Searcher No. 4417749”
• Queries
  – “landscapers in Lilburn, Ga”
  – several people with the last name Arnold
  – “homes sold in shadow lake subdivision gwinnett county georgia”
  – “dog that urinates on everything”
  – “60 single men”
• Identified searcher is Thelma Arnold, a widow living in Georgia

Mining Query Subtopics from Search Log Data

Yunhua Hu, Yanan Qian¹, Hang Li, Daxin Jiang, Jian Pei², and Qinghua Zheng¹
Research Asia, Beijing
¹ SPKLSTN Lab, Xi'an Jiaotong University, China
² Simon Fraser University, Burnaby, BC, Canada

Outline

• Introduction
• Our Method
• Experiments
• Conclusion

Demo

Mined Subtopics

Subtopics of Query

• Most queries are ambiguous or multifaceted in web search
  – “Harry Shum”: Harry Shum Microsoft, Harry Shum Jr
  – “XBox”: XBox games, XBox homepage, XBox marketplace
• Major senses and facets of query (subtopics)

Our Work = Automatically Mining Subtopics of Queries from Search Log Data

Phenomenon 1: One Subtopic per Search (OSS)

Query: "Harry Shum"

• Multi-clicked URLs (multi-click), frequency 50
  – http://research.microsoft.com/en-us/people/hshum
  – http://en.wikipedia.org/wiki/Harry_Shum
  – http://www.microsoft.com/presspass/exec/Shum/
• Multi-clicked URLs (multi-click), frequency 95
  – http://en.wikipedia.org/wiki/Harry_Shum,_Jr
  – http://www.washingtonpost.com/.../VI2011022701183.html

Jointly clicked URLs in the same searches tend to represent the same subtopics

Phenomenon 2: Subtopic Clarification by Additional Keyword (SCAK)

• Query "Harry Shum", clicked URLs
  – http://research.microsoft.com/en-us/people/hshum
  – http://en.wikipedia.org/wiki/Harry_Shum,_Jr
  – http://en.wikipedia.org/wiki/Harry_Shum
  – http://www.washingtonpost.com/.../VI2011022701183.html
  – http://www.microsoft.com/presspass/exec/Shum/
• Query "Microsoft Harry Shum", clicked URLs
  – http://research.microsoft.com/en-us/people/hshum
  – http://en.wikipedia.org/wiki/Harry_Shum
  – http://www.microsoft.com/presspass/exec/Shum/
• Query "Harry Shum Jr", clicked URLs
  – http://en.wikipedia.org/wiki/Harry_Shum,_Jr
  – http://www.washingtonpost.com/.../VI2011022701183.html
• Query "Harry Shum Glee", clicked URLs
  – http://en.wikipedia.org/wiki/Harry_Shum,_Jr
  – http://www.washingtonpost.com/.../VI2011022701183.html

URLs clicked in searches of the query and its expanded queries tend to represent the same subtopics.
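
The two phenomena correspond to two simple aggregations over a click log, sketched here in Python. The record layout (one `(query, clicked_urls)` pair per search), the lowercase whitespace-separated queries, and both function names are assumptions made for illustration; this is not the authors' implementation.

```python
from collections import Counter, defaultdict


def multi_click_counts(searches, query):
    """OSS: count how often each set of 2+ URLs was clicked together in one search."""
    counts = Counter()
    for q, urls in searches:
        if q == query and len(set(urls)) >= 2:
            counts[frozenset(urls)] += 1
    return counts


def expanded_query_clicks(searches, query):
    """SCAK: map each additional keyword (from 'Q W' or 'W Q') to its clicked URLs."""
    clicks = defaultdict(set)
    for q, urls in searches:
        if q.startswith(query + " "):
            keyword = q[len(query) + 1:]   # query followed by keyword
        elif q.endswith(" " + query):
            keyword = q[:-len(query) - 1]  # keyword followed by query
        else:
            continue                       # not an expansion of the query
        clicks[keyword].update(urls)
    return dict(clicks)
```

For the query "harry shum", a search for "harry shum jr" would contribute its clicked URLs under the keyword "jr", and a search for "microsoft harry shum" under "microsoft".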

Our Approach

• Mining subtopics of queries by leveraging the two phenomena
• Subtopics of query are represented by
  – URLs
  – Keywords in expanded queries
• Example of subtopics
  – Subtopic 1
    Expanded queries: “harry shum microsoft”, “harry shum bing”, “microsoft harry shum” (keywords: microsoft, bing)
    URLs: “http://en.wikipedia.org/wiki/Harry_Shum”, “http://research.microsoft.com/en-us/people/hshum/”, “http://www.microsoft.com/presspass/exec/Shum/”
  – Subtopic 2
    Expanded queries: “harry shum jr”, “harry shum glee”, “harry shum junior” (keywords: jr, glee, junior)
    URLs: “http://en.wikipedia.org/wiki/Harry_Shum,_Jr.”, “http://harryshumjr.com/”, “http://www.imdb.com/name/nm1484270/”

Flow of Clustering Method

Preprocessing

• Tree structure to index queries, so that the expanded queries ‘Q+W’ and ‘W+Q’ can be found for a query ‘Q’

• Pruning: only keep expanded queries whose clicked URLs overlap with those of the original query

Similarity Calculation between URLs

• S1: Similarity based on OSS
• S2: Similarity based on SCAK
• S3: Similarity between URL tokens

Multi-click counts for S1:

URLs                                              Multi-Click1  Multi-Click2  Multi-Click3
"http://en.wikipedia.org/wiki/Harry_Shum"              4             3             0
"http://www.microsoft.com/presspass/exec/Shum/"        4             0             3
…                                                      …             …             …

Similarity Matrix of S1: 0.64 for the URL pair above (other entries shown as N/A or “…”)

Keyword counts for S2:

URLs                                              “Jr”  “Glee”  “Microsoft”
"http://en.wikipedia.org/wiki/Harry_Shum,_Jr"       3      4         0
"http://www.imdb.com/name/nm1484270/"               4      3         0

Similarity Matrix of S2: 0.96 for the URL pair above (other entries shown as N/A)

Clustering Algorithm

• Agglomerative clustering algorithm
  – Two URLs are similar if the similarity is larger than a threshold
  – Each maximal connected subgraph (a group of URLs) represents a subtopic
• Algorithm is efficient and easy to implement
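
The similarity and clustering steps can be sketched together in Python as below. The slides do not name the similarity function; cosine similarity over the count vectors reproduces the 0.64 and 0.96 entries in the similarity matrices, so it is used here as an assumption. The threshold value, the way S1, S2, and S3 would be combined, and the function names are likewise not given in the source and are placeholders.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_urls(vectors, threshold):
    """Link URLs whose similarity exceeds the threshold; return connected components.

    vectors: dict mapping each URL to its count vector.
    """
    parent = {url: url for url in vectors}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) > threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for url in vectors:
        groups.setdefault(find(url), set()).add(url)
    return list(groups.values())
```

Taking the connected components of the thresholded similarity graph yields the same groups as single-linkage agglomerative clustering stopped at that threshold, which is one reason the method is cheap to implement.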


Data Set and Parameter Setting

• One open dataset + two proprietary datasets

• Evaluation metric: B-cubed precision, recall, and F1
• Manually tune the parameters on 1/3 of DataSetA
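
For reference, B-cubed scores for a hard clustering can be computed as in the Python sketch below. It assumes each item (URL) has exactly one predicted cluster and one gold subtopic, and it takes F1 as the harmonic mean of the averaged precision and recall. The slides do not state which variant was used, so both choices are assumptions, as is the function name.

```python
from collections import Counter


def b_cubed(predicted, gold):
    """B-cubed precision, recall, and F1 for hard clusterings.

    predicted: dict mapping each item to its predicted cluster label.
    gold:      dict mapping each item to its gold class label.
    """
    items = list(gold)
    # Number of items sharing both the predicted cluster and the gold class.
    pair_sizes = Counter((predicted[i], gold[i]) for i in items)
    cluster_sizes = Counter(predicted[i] for i in items)
    class_sizes = Counter(gold[i] for i in items)

    # Per-item precision and recall, averaged over all items.
    precision = sum(
        pair_sizes[predicted[i], gold[i]] / cluster_sizes[predicted[i]]
        for i in items
    ) / len(items)
    recall = sum(
        pair_sizes[predicted[i], gold[i]] / class_sizes[gold[i]]
        for i in items
    ) / len(items)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Each item counts itself, so precision and recall are always positive and the F1 division is safe for any non-empty input.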

Evaluation of Subtopic Mining

• Evaluation on different similarity functions

• Evaluation on different types of queries

Application in Search Result Clustering (1)

• Search result clustering approaches
  – Baseline: Wang and Zhai’s work in SIGIR 07
  – Our approach: "subtopics of query as seed clusters" + traditional URL clustering
• Evaluation on TREC and DataSetA

Application in Search Result Clustering (2)

• Manual evaluation on DataSetB from various perspectives

• Side-by-side evaluation on DataSetB

Application in Search Results Re-ranking (1)

Application in Search Results Re-ranking (2)


Conclusion

• Discovered two phenomena in search log data to represent query subtopics

• Developed a clustering method for subtopic mining

• Applied the mined subtopics to two tasks: search result clustering and re-ranking

Strength and Limitation of Big Data Mining

• Big data really creates big value
• Importance of insight
• Long tail challenges
• Mining needs knowledge


Summary

• Two Major Issues: What to Mine and How to Mine
• Four Principles for “What to Mine”
• Stories regarding the Principles
  – Search and Browse Log Mining as Example
• Our Work on Big Data Mining
  – Mining Query Subtopics from Search Log Data

Thanks!
