What to Mine from Big Data?
Hang Li
Noah’s Ark Lab, Huawei Technologies

Big Data Value

Two Main Issues in Big Data Mining

Agenda
• Four Principles for “What to Mine”
• Stories regarding the Principles
  – Search and Browse Log Mining as Example
• Our Work on Big Data Mining
  – Mining Query Subtopics from Search Log Data
• Summary

Four Principles for “What to Mine”
1. Identifying scenarios of mining as much as possible
2. Logging as much data as possible
3. Integrating as much data as possible
4. ‘Understanding’ data as much as possible

Identifying scenarios of mining as much as possible
Immanuel Kant: “The world as we know it is our interpretation of the observable facts in the light of theories that we ourselves invent”

Example of Bad Design of Toolbar
• A toolbar developed at a search engine
• It recorded the user’s search behavior data
• However,
  – It did not record the time at which the user closed the browser
  – No indication of end of session

Logging as much data as possible
Examples of Useful Log Information
• User moves mouse on screen (user may unconsciously put mouse on focused area)
  – may infer user’s interest in the page
• User uses mouse to scroll up and down
  – may infer whether user is serious about page content (more scrolling suggests more seriousness)
• User clicks on next page
  – may infer user’s current focus
• User closes browser window/tab
  – may infer user’s current focus

Integrating as much data as possible
Model of User Search Behavior
• Data needs to be collected from different sources (toolbar, search engine log)
• E.g., toolbar usually does not record search results
• Often challenging to integrate data

Understanding Data as Much as Possible
AOL Search Data Leak (2006)
• AOL search data release (20M queries, 650K users, 3 months)
• New York Times article “A Face Is Exposed for AOL Searcher No.
4417749”
• Queries
  – “landscapers in Lilburn, Ga”
  – several people with the last name Arnold
  – “homes sold in shadow lake subdivision gwinnett county georgia”
  – “dog that urinates on everything”
  – “60 single men”
• Identified searcher is Thelma Arnold, a widow living in Georgia

Mining Query Subtopics from Search Log Data
Yunhua Hu, Yanan Qian (1), Hang Li, Daxin Jiang, Jian Pei (2), and Qinghua Zheng (1)
Microsoft Research Asia, Beijing, China
(1) SPKLSTN Lab, Xi'an Jiaotong University, China
(2) Simon Fraser University, Burnaby, BC, Canada

Outline
• Introduction
• Our Method
• Experiments
• Conclusion

Demo: Mined Subtopics

Subtopics of Query
• Most queries are ambiguous or multifaceted in web search
  – “XBox”: XBox games, XBox homepage, XBox marketplace
  – “Harry Shum”: Harry Shum Microsoft, Harry Shum Jr
• Major senses and facets of query (subtopics)

Our Work = Automatically Mining Subtopics of Queries from Search Log Data

Phenomenon 1: One Subtopic per Search (OSS)
Query: "Harry Shum"
• Multi-clicked URLs (multi-click), frequency 50:
  – "http://research.microsoft.com/en-us/people/hshum"
  – "http://en.wikipedia.org/wiki/Harry_Shum"
  – "http://www.microsoft.com/presspass/exec/Shum/"
• Multi-clicked URLs (multi-click), frequency 95:
  – "http://en.wikipedia.org/wiki/Harry_Shum,_Jr"
  – "http://www.washingtonpost.com/.../VI2011022701183.html"
URLs clicked jointly in the same search tend to represent the same subtopic.

Phenomenon 2: Subtopic Clarification by Additional Keyword (SCAK)
• Query "Harry Shum", clicked URLs:
  – "http://research.microsoft.com/en-us/people/hshum"
  – "http://en.wikipedia.org/wiki/Harry_Shum,_Jr"
  – "http://en.wikipedia.org/wiki/Harry_Shum"
  – "http://www.washingtonpost.com/.../VI2011022701183.html"
  – "http://www.microsoft.com/presspass/exec/Shum/"
• Query "Microsoft Harry Shum", clicked URLs:
  – "http://research.microsoft.com/en-us/people/hshum"
  – "http://en.wikipedia.org/wiki/Harry_Shum"
  – "http://www.microsoft.com/presspass/exec/Shum/"
• Query "Harry Shum Jr", clicked URLs:
  – "http://en.wikipedia.org/wiki/Harry_Shum,_Jr"
  – "http://www.washingtonpost.com/.../VI2011022701183.html"
• Query "Harry Shum Glee", clicked URLs:
  – "http://en.wikipedia.org/wiki/Harry_Shum,_Jr"
  – "http://www.washingtonpost.com/.../VI2011022701183.html"
URLs clicked in searches of the query and its expanded queries tend to represent the same subtopics.

Outline
• Introduction
• Our Method
• Experiments
• Conclusion

Our Approach
• Mining subtopics of queries by leveraging the two phenomena
• Subtopics of query are represented by
  – URLs
  – Keywords in expanded queries
• Example of subtopics
  – Subtopic 1
    Expanded queries containing the keywords: "harry shum microsoft", "harry shum bing", "microsoft harry shum"
    URLs: "http://en.wikipedia.org/wiki/Harry_Shum", "http://research.microsoft.com/en-us/people/hshum/", "http://www.microsoft.com/presspass/exec/Shum/"
  – Subtopic 2
    Expanded queries containing the keywords: "harry shum jr", "harry shum glee", "harry shum junior"
    URLs: "http://en.wikipedia.org/wiki/Harry_Shum,_Jr.", "http://harryshumjr.com/", "http://www.imdb.com/name/nm1484270/"

Flow of Clustering Method

Preprocessing
• Tree structure to index queries (‘Q+W’ and ‘W+Q’ for ‘Q’)
• Pruning: only keep expanded queries with URL overlap

Similarity Calculation between URLs
• S1: Similarity based on OSS
• S2: Similarity based on SCAK
• S3: Similarity between URL tokens

Click counts per multi-click (input to S1):
  URL                                               Multi-Click1  Multi-Click2  Multi-Click3
  "http://en.wikipedia.org/wiki/Harry_Shum"         4             3             0
  "http://www.microsoft.com/presspass/exec/Shum/"   4             0             3
  …                                                 …             …             …
Similarity Matrix of S1 (excerpt): 0.64; other entries N/A

Click counts per keyword (input to S2):
  URL                                               "Jr"  "Glee"  "Microsoft"
  "http://en.wikipedia.org/wiki/Harry_Shum,_Jr"     3     4       0
  "http://www.imdb.com/name/nm1484270/"             4     3       0
Similarity Matrix of S2 (excerpt): 0.96; other entries N/A

Clustering Algorithm
• Agglomerative clustering algorithm
  – Two URLs are similar if the similarity is larger than a threshold
  – Each maximal connected subgraph (a group of URLs) represents a subtopic
• Algorithm is efficient and easy to implement

Outline
• Introduction
• Our Method
• Experiments
• Conclusion

Data Set and Parameter Setting
• One open dataset + two proprietary datasets
• Evaluation metric: B-cubed precision, recall, and F1
• Manually tune the
parameters in 1/3 of DataSetA

Evaluation of Subtopic Mining
• Evaluation on different similarity functions
• Evaluation on different types of queries

Application in Search Result Clustering (1)
• Search result clustering approaches
  – Baseline: Wang and Zhai’s work in SIGIR 07
  – Our approach: “subtopics of query as seed clusters” + traditional URL clustering
• Evaluation on TREC and DataSetA

Application in Search Result Clustering (2)
• Manual evaluation on DataSetB from various perspectives
• Side-by-side evaluation on DataSetB

Application in Search Results Re-ranking (1)

Application in Search Results Re-ranking (2)

Outline
• Introduction
• Our Method
• Experiments
• Conclusion

Conclusion
• Discovered two phenomena in search log data to represent query subtopics
• Developed a clustering method for subtopic mining
• Applied the mined subtopics to two tasks: search result clustering and re-ranking

Strengths and Limitations of Big Data Mining
• Big data really creates big value
• Importance of insight
• Long tail challenges
• Mining needs knowledge

Summary
• Two Major Issues: What to Mine and How to Mine
• Four Principles for “What to Mine”
• Stories regarding the Principles
  – Search and Browse Log Mining as Example
• Our Work on Big Data Mining
  – Mining Query Subtopics from Search Log Data

Thanks!
[email protected]
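
Sketch: URL similarity and clustering
As a rough illustration of the “Similarity Calculation between URLs” and “Clustering Algorithm” slides, the Python sketch below links two URLs when their similarity exceeds a threshold and reads subtopics off the connected components. The slides do not name the similarity function; cosine similarity is an assumption here, chosen because it reproduces the 0.64 and 0.96 values from the count vectors shown on the slide. The threshold of 0.5, the count vector for the Microsoft press page in the usage example, and the function names are hypothetical, and the sketch uses a single similarity rather than a combination of S1, S2, and S3.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_urls(vectors, threshold):
    """Group URLs into connected components of the 'similarity > threshold' graph.

    vectors: dict mapping URL -> count vector.
    Returns a sorted list of sorted URL groups; each group is one candidate subtopic.
    """
    urls = list(vectors)
    parent = {u: u for u in urls}  # union-find forest

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for i, a in enumerate(urls):
        for b in urls[i + 1:]:
            if cosine(vectors[a], vectors[b]) > threshold:
                parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for u in urls:
        groups.setdefault(find(u), []).append(u)
    return sorted(sorted(g) for g in groups.values())


# Click counts per keyword ("Jr", "Glee", "Microsoft").
# The first two vectors come from the slide; the third is hypothetical.
keyword_clicks = {
    "http://en.wikipedia.org/wiki/Harry_Shum,_Jr": (3, 4, 0),
    "http://www.imdb.com/name/nm1484270/": (4, 3, 0),
    "http://www.microsoft.com/presspass/exec/Shum/": (0, 0, 5),
}
subtopics = cluster_urls(keyword_clicks, threshold=0.5)
```

With these inputs the two Harry Shum Jr URLs fall into one group and the Microsoft press page into another. Union-find keeps the component step short; a breadth-first search over the thresholded graph would give the same groups.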
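
Sketch: B-cubed evaluation
The “Data Set and Parameter Setting” slide names B-cubed precision, recall, and F1 as the evaluation metric without spelling it out. The Python sketch below follows the usual item-level definition of B-cubed for non-overlapping clusters; whether the reported experiments used exactly this variant is an assumption, and the function name and toy labels are hypothetical.

```python
def bcubed(predicted, gold):
    """B-cubed precision, recall, and F1 for non-overlapping clusterings.

    predicted, gold: dicts mapping each item (e.g. a URL) to a cluster label.
    Both dicts must cover the same items.
    """
    items = list(gold)
    precision_sum = 0.0
    recall_sum = 0.0
    for i in items:
        same_cluster = {j for j in items if predicted[j] == predicted[i]}
        same_class = {j for j in items if gold[j] == gold[i]}
        correct = len(same_cluster & same_class)
        precision_sum += correct / len(same_cluster)  # per-item precision
        recall_sum += correct / len(same_class)       # per-item recall
    precision = precision_sum / len(items)
    recall = recall_sum / len(items)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: four URLs, two gold subtopics, one URL placed in the wrong cluster.
gold = {"u1": "A", "u2": "A", "u3": "B", "u4": "B"}
predicted = {"u1": "x", "u2": "x", "u3": "x", "u4": "y"}
p, r, f1 = bcubed(predicted, gold)
```

In the toy example, precision is 2/3 and recall is 3/4, since merging u3 into the first cluster lowers precision for three items and splits the second gold subtopic.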