Geographic Information Retrieval: Classification, Disambiguation
Total Page:16
File Type:pdf, Size:1020Kb
Imperial College London Department of Computing Geographic Information Retrieval: Classification, Disambiguation and Modelling Simon E Overell Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of the University of London and the Diploma of Imperial College, July 2009 1 2 Abstract This thesis aims to augment the Geographic Information Retrieval process with information extracted from world knowledge. This aim is approached from three directions: classifying world knowledge, disambiguating placenames and modelling users. Geographic information is becoming ubiquitous across the Internet, with a significant proportion of web documents and web searches containing geographic entities, and the proliferation of Internet enabled mobile devices. Traditional information retrieval treats these geographic entities in the same way as any other textual data. In this thesis I augment the retrieval process with geographic information, and show how methods built upon world knowledge outperform methods based on heuristic rules. The source of world knowledge used in this thesis is Wikipedia. Wikipedia has become a phenomenon of the Internet age and needs little introduction. As a linked corpus of semi-structured data, it is unsurpassed. Two approaches to mining information from Wikipedia are rigorously explored: initially I classify Wikipedia articles into broad categories; this is followed by much finer classification where Wikipedia articles are disambiguated as specific locations. The thesis concludes with the proposal of the Steinberg hypothesis: By analysing a range of wikipedias in different languages I demonstrate that a localised view of the world is ubiquitous and inherently part of human nature. All people perceive closer places as larger and more important than distant ones. The core contributions of this thesis are in the areas of extracting information from Wikipedia, supervised placename disambiguation, and providing a quantitative model for how people view the world. The findings clearly have a direct impact for applications such as geographically aware search engines, but in a broader context documents can be automatically annotated with machine readable meta-data and dialogue enhanced with a model of how people view the world. This will reduce ambiguity and confusion in dialogue between people or computers. 3 4 Acknowledgements In the same way that far more is produced during a PhD than a thesis, it is a journey taken by many more people than just the author. I would like to take this opportunity to thank the people that have accompanied me on this journey. Firstly I would like to thank my supervisor, Stefan R¨uger, for his help, advice and support over • the past three years. My examiners, Mark Sanderson and Julie McCann, for agreeing to scrutinise my work. • EPSRC for funding my work and helping me remain a student for three more years. • My colleagues in the Multimedia and Information Systems group, with whom I shared many a cup • of tea and lazy afternoon: Jo˜ao, Alexei, Peter, Paul, Ed and Daniel from Imperial College London, and Ainhoa, Haiming, Jianhan, Adam and Rui from the Open University. Roelof van Zwol who gave me the opportunity to spend a summer in Barcelona where I failed • spectacularly to learn any Spanish. I would also like to thank B¨okur, Georgina, Llu´ıs and my colleagues at Yahoo! Research Barcelona, • and my flat mates Claudia and Gleb (who, to my knowledge, also failed to learn Spanish). As well as Barcelona, I have had the opportunity to travel both nationally and internationally as • part of my PhD. I would like to thank the people who have made this possible: Ali Azimi Bolourian for hosting my visit to Glasgow, Cathal Gurrin for hosting my visit to Dublin, Jo˜ao Magalh˜aes and Jos´eIria for hosting my visit to Sheffield, MMKM for partially funding my trips to Seattle and Alicante, and Google for partially funding my return visit to Barcelona. A chance meeting with Leif Azzopardi at ECIR led me to joining the BCS-IRSG committee. In • turn, this led me to a series of fascinating talks and introduced me to many members of the IR community. I would like to thank Leif, Ali, Andy and the rest of the BCS-IRSG committee. For the past three years I have lived in a cozy corner of room 433 of Imperial College London’s • Huxley building. A series of people have shared this room with me and joined in often quite heated debates. This includes Uri, Sergio, Georgia, Alok, Mohammed, Mark, Adam, Jonathan, Clemens, Daniel and Nick. Before terms such as Credit Crunch, Down Turn and Recession were common place, I was ap- • proached by Evgeny Shadchnev to found a company. I learnt a huge amount during the next six months and I would like to thank Evgeny for the much needed distraction from my thesis. Unfor- tunately by the time we were approaching investors, beta-release and business-plan in hand, the credit bubble had burst and the venture-capital ships had sailed. 5 PhD, or in full Philosophiæ Doctor, litteral translation is “teacher of philosophy”. This brings me • to the next great distraction from my thesis: teaching the third-year Robotics course. I would like to thank Keith and Andy, who gave me the opportunity to be paid to teach undergraduates how to play with Lego for three months a year. If a PhD is a journey, the person who gave me a map was George Mallen. When I worked at SSL • he showed me what industrial research was and introduced me to Information Retrieval. I would like to thank the many friends that have surrounded me for the past three years. They • have bought me drinks and listened to tales of placename disambiguation. They include Dave, Dom, Steve, Tim, Lewis, Kurt, Alec, Sacha, Mark, Olly, Dan, Ash and far too many more to list. During the final three months of my write up, I developed crippling back pain. I would like to • thank Stewart and Melanie for helping me walk tall and sit up straight. Without their help I would not have finished this thesis. My parents, John and Yvonne, have always supported me to pursue whatever avenue has interested • me. I would like to thank them. Despite wanting to live in a farmhouse in the country, my girlfriend Jancy has shared a flat with • me, slightly too small to be cozy, in Fulham for the past three years. I thank her from the bottom of my heart for her endless love and support. 6 Contents Abstract 3 Acknowledgements 5 Contents 7 1 Introduction 15 1.1 GIR,IRandGIS..................................... 15 1.2 Placenames and locations . 16 1.3 The need for geographic indexing . 16 1.4 The need for placename disambiguation . 17 1.5 GIRsystems ....................................... 18 1.6 Wikipedia......................................... 19 1.7 Scope ............................................. 20 1.8 Contributions...................................... ..... 21 1.9 Publications........................................ 22 1.10Implementations .................................... ..... 23 1.11Organisation ..................................... ...... 24 1.12Roadmap ......................................... 25 2 GIR and Wikipedia 27 2.1 Introduction....................................... ..... 27 2.2 FromIRtoGIR ..................................... 27 2.2.1 Relevance......................................... 27 2.2.2 Phrases and named entities . 28 2.3 GIR ............................................. 29 2.3.1 GIRsystems ..................................... 30 2.3.2 Placename disambiguation . 30 2.3.3 Geographic indexing . 33 2.3.4 Geographic relevance . 35 2.3.5 Combining text and geographic relevance . 37 2.4 Geographic resources . 39 2.5 Wikipedia......................................... 40 2.5.1 Wikipediaarticles ................................. 41 2.5.2 Wikipedians...................................... 41 2.5.3 Accuracy ......................................... 42 7 2.5.4 Mining semantic information . ..... 43 2.6 Discussion........................................ ..... 44 3 Evaluation and metrics 45 3.1 Introduction....................................... ..... 45 3.2 EvaluatingIRsystems ................................ ...... 45 3.2.1 Evaluationforums ................................ 45 3.3 Evaluationmeasures .................................. ..... 47 3.3.1 Binary classification and unordered retrieval . 47 3.3.2 Scored classification and ranked retrieval . 48 3.4 Statisticaltesting.................................. ....... 49 3.4.1 TheSigntest...................................... 50 3.4.2 The Wilcoxon Signed-Rank test . 50 3.4.3 TheFriedmantest ................................... 51 3.4.4 The Student’s t-test . 51 3.4.5 One-tailed vs. two-tailed . 52 3.5 Discussion........................................ ..... 52 4 Classifying Wikipedia articles 55 4.1 Introduction....................................... ..... 55 4.2 Classification and disambiguation of Wikipedia articles . ........... 55 4.3 Classifying Wikipedia articles . ......... 57 4.3.1 Groundtruth...................................... 58 4.3.2 Sparsityofdata................................... 58 4.3.3 Removingnoise.................................... 60 4.3.4 Systemoptimisation ............................... 60 4.4 Evaluation and comparison to existing methods . ....... 62 4.4.1 Experimentalsetup................................. 63 4.4.2 Results ......................................... 64 4.5 Acasestudy:Flickr .................................. ..... 65 4.5.1 Atagclassificationsystem. ..... 67 4.5.2 Coverage ....................................... 69 4.5.3 Summary