Finding People and Their Utterances in Social Media
Total Page:16
File Type:pdf, Size:1020Kb
Finding People and their Utterances in Social Media Wouter Weerkamp Finding People and their Utterances in Social Media ACADEMISCH PROEFSCHRIFT ter verkrijging van de graad van doctor aan de Universiteit van Amsterdam op gezag van de Rector Magnificus prof.dr. D.C. van den Boom ten overstaan van een door het college voor promoties ingestelde commissie, in het openbaar te verdedigen in de Agnietenkapel op dinsdag 18 oktober 2011, te 12:00 uur door Wouter Weerkamp geboren te Den Helder, Nederland Promotiecommissie Promotor: Prof. dr. M. de Rijke Overige leden: Prof. dr. H. L. Hardman Dr. C. Monz Prof. dr. D. W. Oard Prof. dr. A. P. de Vries Faculteit der Natuurwetenschappen, Wiskunde en Informatica SIKS Dissertation Series No. 2011-23 The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems. The research was supported by the Center for Creation, Content and Technology (CCCT) and under COMMIT project Infiniti. Copyright c 2011 Wouter Weerkamp, Amsterdam, The Netherlands Cover by Mark Assies Printed by: Off Page, Amsterdam ISBN: 978-94-6182-023-5 Het boek is af! En zonder deze mensen zou het een stuk moeilijker geweest zijn. Henk en Wieke Voor alle steun en vrijheid tijdens mijn studie en promotie. Marijke, Mark, Marnix, Matthijs, Sietse en vooral Hanneke Voor de nodige afleiding. Maarten Voor de ideeen,¨ motivatie en sturing. Edgar, Manos, and Simon For all idea-generating sessions in Polder. And all other (former) ILPSers For working together and enjoying the coffee breaks. Finally, I would like to thank the committee members For their valuable input and comments. Contents 1 Introduction 1 1.1 Information in Social Media . 2 1.2 Research Outline and Questions . 4 1.3 Main Contributions . 9 1.4 Thesis Overview . 9 1.5 Origins . 11 2 Background 13 2.1 Information Retrieval . 13 2.2 Query Log Analysis . 15 2.2.1 Queries . 16 2.2.2 Sessions . 17 2.2.3 Users . 17 2.3 Blogger Finding . 18 2.4 Credibility in Web Settings . 19 2.4.1 Credibility in social media . 20 2.5 Query Modeling . 21 2.5.1 External query expansion . 21 2.6 Email Search . 22 3 Experimental Methodology 25 3.1 Test Collections and Tasks . 25 3.1.1 Blog post collection . 26 3.1.2 Email collection . 27 3.2 Evaluation . 28 3.2.1 Evaluation metrics . 28 3.2.2 Significance testing . 29 3.3 Baseline Retrieval Model . 30 4 Searching for People 31 4.1 Transaction Objects . 32 4.2 Search System and Data . 33 4.2.1 Query logs . 34 4.2.2 Query characteristics . 35 4.2.3 Session characteristics . 37 4.2.4 User characteristics . 38 4.2.5 Out click characteristics . 39 4.3 Object Classifications . 41 4.3.1 Queries . 42 4.3.2 Sessions . 47 4.3.3 Users . 49 4.4 Discussion and Implications . 49 4.5 Summary and Conclusions . 53 v CONTENTS 5 Finding Bloggers 57 5.1 Probabilistic Models for Blog Feed Search . 59 5.1.1 Blogger model . 60 5.1.2 Posting model . 61 5.1.3 A two-stage model . 61 5.2 Experimental Setup . 62 5.2.1 Topic sets . 63 5.2.2 Inverted indexes . 64 5.2.3 Smoothing . 65 5.3 Baseline Results . 65 5.3.1 Language detection . 66 5.3.2 Short blogs . 66 5.3.3 Baseline results . 67 5.3.4 Analysis . 67 5.3.5 Intermediate conclusions . 69 5.4 A Two-Stage Model for Blog Feed Search . 70 5.4.1 Motivation . 71 5.4.2 Estimating post importance . 72 5.4.3 Pruning the single stage models . 73 5.4.4 Evaluating the two-stage model . 75 5.4.5 A further reduction . 76 5.4.6 Per-topic analysis of the two-stage model . 77 5.4.7 Intermediate conclusions . 79 5.5 Analysis and Discussion . 80 5.5.1 Efficiency vs. effectiveness . 81 5.5.2 Very high early precision . 81 5.5.3 Smoothing parameter . 82 5.6 Summary and Conclusions . 83 6 Credibility-Inpsired Ranking for Blog Post Retrieval 85 6.1 Credibility Framework . 88 6.2 Credibility-Inspired Indicators . 90 6.2.1 Post-level indicators . 90 6.2.2 Blog-level indicators . 93 6.3 Experimental Setup . 95 6.4 Results . 96 6.4.1 Baseline and spam filtering . 96 6.4.2 Credibility-inspired reranking . 97 6.4.3 Combined reranking . 99 6.5 Analysis and Discussion . 101 6.5.1 Spam classification . 101 6.5.2 Changes in ranking . 101 6.5.3 Per topic analysis . 107 6.5.4 Impact of parameters on precision . 111 6.5.5 Credibility-inspired ranking vs. relevance ranking . 112 6.6 Summary and Conclusions . 114 vi CONTENTS 7 Exploiting the Environment in Blog Post Retrieval 117 7.1 Query Modeling using External Collections . 119 7.1.1 Instantiating the External Expansion Model . 121 7.2 Estimating Model Components . 122 7.2.1 Prior collection probability . 122 7.2.2 Document relevance . 123 7.2.3 Collection relevance . 123 7.2.4 Document importance . 123 7.2.5 Term probability . 124 7.3 Experimental Setup . 124 7.3.1 External collections . 124 7.3.2 Parameters . 125 7.4 Results . 125 7.4.1 Individual collections . 126 7.4.2 Combination of collections . 126 7.5 Analysis and Discussion . 127 7.5.1 Per-topic analysis . 128 7.5.2 Influence of (query-dependent) collection importance . 134 7.5.3 Impact of parameter settings . 137 7.6 Summary and Conclusions . 137 8 Using Contextual Information for Email Finding 141 8.1 Baseline Retrieval Approach . 142 8.2 Email Contexts . 142 8.2.1 Query modeling from contexts . 144 8.2.2 Parameter estimation . 144 8.3 Results of Incorporating Contexts . 144 8.4 Credibility-Inspired Ranking in Email Search . 146 8.5 Results of Credibility-Inspired Ranking . 148 8.6 Analysis and Discussion . 150 8.6.1 Query models from context . 150 8.6.2 Credibility-inspired ranking . 152 8.7 Summary and Conclusions . 153 9 Conclusions 155 9.1 Main Findings . 155 9.2 Future Research Directions . 158 Bibliography 161 Samenvatting 173 vii 1 Introduction The initial explosive growth of the web, now often referred to as Web 1.0 [145], led to a huge increase in information available online: companies created their own web presence, newspapers began offering news articles to readers online, governments started to inform their citizens using websites, and many more organizations allowed online users to find at least the most basic information online. Two main characteristics of this initial information boom are (i) the content creators (webmasters or online editors) were specialized positions within organizations and (ii) the involvement of web users was mainly restricted to consuming information. Starting in the twenty first century, the web experienced another phase of explosive growth and.