Improving the Utility of Social Media with Natural Language Processing

Total Page:16

File Type:pdf, Size:1020Kb

Improving the Utility of Social Media with Natural Language Processing View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by University of Melbourne Institutional Repository Improving the Utility of Social Media with Natural Language Processing A thesis presented by Bo Han to The Department of Computing and Information Systems in total fulfillment of the requirements for the degree of Doctor of Philosophy The University of Melbourne Melbourne, Australia February 2014 Produced on archival quality paper Declaration This is to certify that: (i) the thesis comprises only my original work towards the PhD except where indi- cated in the Preface; (ii) due acknowledgement has been made in the text to all other material used; (iii) the thesis is fewer than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices. Signed: Date: ⃝c 2014 - Bo Han All rights reserved. Thesis advisor(s) Author Timothy Baldwin Bo Han Paul Cook Improving the Utility of Social Media with Natural Language Processing Abstract Social media has been an attractive target for many natural language processing (NLP) tasks and applications in recent years. However, the unprecedented volume of data and the non-standard language register cause problems for off-the-shelf NLP tools. This thesis investigates the broad question of how NLP-based text processing can improve the utility (i.e., the effectiveness and efficiency) of social media data. In particular, text normalisation and geolocation prediction are closely examined in the context of Twitter text processing. Text normalisation is the task of restoring non-standard words to their standard forms. For instance, earthquick and 2morrw should be transformed into \earthquake" and \tomorrow", respectively. Non-standard words often cause problems for existing tools trained on edited text sources such as newswire text. By applying text nor- malisation to reduce unknown non-standard words, the accuracy of NLP tools and downstream applications is expected to increase. In this thesis, I explore and develop lexical normalisation methods for Twitter text. I shift the focus of text normalisation from a cascaded token-based approach to a type-based approach using a combined lexicon, based on the analysis of existing and developed text normalisation methods. The type-based method achieved the state-of-the-art end-to-end normalisation accu- racy at the time of publication, i.e., 0.847 precision and 0.630 recall on a benchmark dataset. Furthermore, it is simple, lightweight and easily integrable which is particu- larly well suited to large-scale data processing. Additionally, the effectiveness of the proposed normalisation method is shown in non-English text normalisation and other NLP tasks and applications. Geolocation prediction estimates a user's primary location based on the text of their posts. It enables location-based data partitioning, which is crucial to a range of tasks and applications such as local event detection. The partitioned location data can improve both the efficiency and the effectiveness of NLP tools and applications. In this thesis, I identify and explore several factors that affect the accuracy of text- based geolocation prediction in a unified framework. In particular, an extensive range of feature selection methods is compared to determine the optimised feature set for the geolocation prediction model. The results suggest feature selection is an effective method for improving the prediction accuracy regardless of geolocation model and location partitioning. Additionally, I examine the influence of other factors includ- Abstract iv ing non-geotagged data, user metadata, tweeting language, temporal influence, user geolocatability, and geolocation prediction confidence. The proposed stacking-based prediction model achieved 40.6% city-level accuracy and 40km median error distance for English Twitter users on a recent benchmark dataset. These investigations pro- vide practical insights into the design of a text-based normalisation system, as well as the basis for further research on this task. Overall, the exploration of these two text processing tasks enhances the utility of social media data for relevant NLP tasks and downstream applications. The devel- oped method and experimental results have immediate impact on future social media research. Contents Title Page .................................... i Abstract ..................................... iii Table of Contents ................................ v List of Figures .................................. viii List of Tables .................................. ix Citations to Previously Published Work ................... xii Acknowledgments ................................ xiii 1 Introduction 1 1.1 Background and Motivation ....................... 1 1.2 Aim and Scope .............................. 3 1.3 Contributions ............................... 6 1.4 Thesis Outline ............................... 9 2 Literature Review 13 2.1 Social Media ................................ 13 2.1.1 Twitter Data ........................... 16 2.1.2 Summary ............................. 21 2.2 Text Normalisation ............................ 21 2.2.1 Non-standard Words ....................... 22 2.2.2 Normalisation Task and Scope .................. 30 2.2.3 Methodologies ........................... 34 2.2.4 Recent Normalisation Approaches ................ 44 2.2.5 Summary ............................. 48 2.3 Geolocation Prediction .......................... 50 2.3.1 Background ............................ 50 2.3.2 Network-based Geolocation Prediction ............. 55 2.3.3 Text-based Geolocation Prediction ............... 57 2.3.4 Hybrid Methods .......................... 64 2.3.5 Summary ............................. 66 2.4 Literature Summary ........................... 67 v Contents vi 3 Text Normalisation 68 3.1 Normalisation Scope ........................... 68 3.2 Token-based Lexical Normalisation ................... 70 3.2.1 A Pilot Study on OOV Words .................. 70 3.2.2 Datasets and Evaluation Metrics ................ 73 3.2.3 Token-based Normalisation Approach .............. 74 3.2.4 Baselines and Benchmarks .................... 78 3.2.5 Results Analysis and Discussion ................. 79 3.2.6 Lexical Variant Detection .................... 83 3.2.7 Summary ............................. 88 3.3 Type-based Lexical Normalisation .................... 88 3.3.1 Motivation and Feasibility Analysis ............... 89 3.3.2 Word Type Normalisation .................... 90 3.3.3 Contextually Similar Pair Generation .............. 91 3.3.4 Pair Re-ranking by String Similarity .............. 95 3.3.5 Intrinsic Evaluation of Type-based Normalisation ....... 97 3.3.6 Error Analysis and Discussion .................. 103 3.4 Extrinsic Evaluation of Lexical Normalisation ............. 105 3.5 Non-English Text Normalisation ..................... 109 3.5.1 A Comparative Study on Spanish Text Normalisation ..... 110 3.5.2 Adapted Normalisation Approach ................ 111 3.5.3 Results and Discussion ...................... 114 3.6 Recent Progress on Text Normalisation ................. 117 3.6.1 Recent Normalisation Approaches ................ 117 3.6.2 Impact of Normalisation in Recent Research .......... 118 3.7 Summary ................................. 120 4 Geolocation Prediction 122 4.1 Introduction ................................ 122 4.2 Geolocation Prediction Framework ................... 126 4.2.1 Representation: Earth Grid vs. City .............. 126 4.2.2 Geolocation Prediction Models .................. 128 4.2.3 Feature Set ............................ 129 4.2.4 Data ................................ 131 4.2.5 Evaluation Measures ....................... 134 4.3 Finding Location Indicative Words ................... 135 4.3.1 Literature on Feature Selection ................. 136 4.3.2 Location Indicative Words .................... 137 4.3.3 Statistical-based Methods .................... 138 4.3.4 Information Theory-based Methods ............... 142 4.3.5 Heuristic-based Methods ..................... 144 4.4 Benchmarking Experiments on NA ................... 147 Contents vii 4.4.1 Comparison of Feature Selection Methods ........... 148 4.4.2 Comparison with Benchmarks .................. 150 4.5 Experiments on WORLD ......................... 155 4.6 Exploiting Non-geotagged Tweets .................... 157 4.7 Language Influence on Geolocation Predication ............ 160 4.8 Incorporating User Meta Data ...................... 167 4.8.1 Unlocking the Potential of User-declared Metadata ...... 168 4.8.2 Results of Metadata-based Classifiers .............. 169 4.8.3 Ensemble Learning on Text-based Classifiers .......... 171 4.9 Temporal Influence on Geolocation Model ............... 174 4.10 User Tweeting Behaviour ......................... 177 4.11 Prediction Confidence .......................... 180 4.12 Summary ................................. 183 5 Conclusion 185 5.1 Summary of Findings ........................... 185 5.2 Limitations and Future Work ...................... 192 5.3 A Tweet-length Summary of Thesis ................... 198 A Appendix 219 List of Figures 1.1 Examples of social media-based applications. .............. 2 1.2 A demonstration of geolocation prediction input and output for a pub- lic Twitter user. The red dot icon denotes the predicted city centre. 5 2.1 Word categorisations in text normalisation. ............... 31 2.2 From n-grams to a bipartite graph G = (W; C; E). ........... 44 3.1 Out-of-vocabulary word distribution in English Gigaword (NYT), Twit- ter and SMS data.
Recommended publications
  • Out of the Ordinary: Finding Hidden Threats by Analyzing Unusual
    RAND-INITIATED RESEARCH CHILD POLICY This PDF document was made available CIVIL JUSTICE from www.rand.org as a public service of EDUCATION the RAND Corporation. ENERGY AND ENVIRONMENT HEALTH AND HEALTH CARE Jump down to document6 INTERNATIONAL AFFAIRS NATIONAL SECURITY The RAND Corporation is a nonprofit POPULATION AND AGING research organization providing PUBLIC SAFETY SCIENCE AND TECHNOLOGY objective analysis and effective SUBSTANCE ABUSE solutions that address the challenges TERRORISM AND facing the public and private sectors HOMELAND SECURITY TRANSPORTATION AND around the world. INFRASTRUCTURE Support RAND Purchase this document Browse Books & Publications Make a charitable contribution For More Information Visit RAND at www.rand.org Explore RAND-Initiated Research View document details Limited Electronic Distribution Rights This document and trademark(s) contained herein are protected by law as indicated in a notice appearing later in this work. This electronic representation of RAND intellectual property is provided for non- commercial use only. Permission is required from RAND to reproduce, or reuse in another form, any of our research documents. This product is part of the RAND Corporation monograph series. RAND monographs present major research findings that address the challenges facing the public and private sectors. All RAND mono- graphs undergo rigorous peer review to ensure high standards for research quality and objectivity. Out of the Ordinary Finding Hidden Threats by Analyzing Unusual Behavior JOHN HOLLYWOOD, DIANE SNYDER, KENNETH McKAY, JOHN BOON Approved for public release, distribution unlimited This research in the public interest was supported by RAND, using discretionary funds made possible by the generosity of RAND's donors, the fees earned on client-funded research, and independent research and development (IR&D) funds provided by the Department of Defense.
    [Show full text]
  • Private Companiescompanies
    PrivatePrivate CompaniesCompanies Private Company Profiles @Ventures Kepware Technologies Access 360 Media LifeYield Acronis LogiXML Acumentrics Magnolia Solar Advent Solar Mariah Power Agion Technologies MetaCarta Akorri mindSHIFT Technologies alfabet Motionbox Arbor Networks Norbury Financial Asempra Technologies NumeriX Asset Control OpenConnect Atlas Venture Panasas Autonomic Networks Perimeter eSecurity Azaleos Permabit Technology Azimuth Systems PermissionTV Black Duck Software PlumChoice Online PC Services Brainshark Polaris Venture Partners BroadSoft PriceMetrix BzzAgent Reva Systems Cedar Point Communications Revolabs Ceres SafeNet Certeon Sandbridge Technologies Certica Solutions Security Innovation cMarket Silver Peak Systems Code:Red SIMtone ConsumerPowerline SkyeTek CorporateRewards SoloHealth Courion Sonicbids Crossbeam Systems StyleFeeder Cyber-Ark Software TAGSYS DAFCA Tatara Systems Demandware Tradeware Global Desktone Tutor.com Epic Advertising U4EA Technologies ExtendMedia Ubiquisys Fidelis Security Systems UltraCell Flagship Ventures Vanu Fortisphere Versata Enterprises GENBAND Visible Assets General Catalyst Partners VKernel Hearst Interactive Media VPIsystems Highland Capital Partners Ze-gen HSMC ZoomInfo Invention Machine @Ventures Address: 187 Ballardvale Street, Suite A260 800 Menlo Ave, Suite 120 Wilmington, MA 01887 Menlo Park, CA 94025 Phone #: (978) 658-8980 (650) 322-3246 Fax #: ND ND Website: www.ventures.com www.ventures.com Business Overview @Ventures provides venture capital and growth assistance to early stage clean technology companies. Established in 1995, @Ventures has funded more than 75 companies across a broad set of technology sectors. The exclusive focus of the firm's fifth fund, formed in 2004, is on investments in the cleantech sector, including alternative energy, energy storage and efficiency, advanced materials, and water technologies. Speaker Bio: Marc Poirier, Managing Director Marc Poirier has been a General Partner with @Ventures since 1998 and operates out the firm’s Boston-area office.
    [Show full text]
  • Proceedings 1 AGILE Phd School
    REGIONA Regionales Innovationszentrum für nachhaltiges Wirtschaften und Umwelt-/Geoinformation Band 2 Hardy Pundt, Lars Bernard (eds.) Proceedings 1st AGILE PhD School March 13-14, 2012 Wernigerode, Germany Shaker Verlag Aachen 2012 i Preface The Association of Geographic Information Laboratories in Europe (AGILE, http://www.agile- online.org/) seeks to ensure that the views of the geographic information teaching and research community are fully represented in the discussions that take place on future European research agendas. AGILE also provides a permanent scientific forum where geographic information researchers can meet and exchange ideas and experiences at the European level. The activities of AGILE are managed by an eight person council elected by the member organizations. Its main tasks are to develop an organizational structure to release the goals of AGILE, to further develop with the help of the members a European research agenda, to instigate and stimulate AGILE initiatives, to network with other organizations working in the field of Geoinformation and concerning technologies and to organize the yearly AGILE conference on Geographic Information Science. During one of the last meetings, the Council discussed to strengthen the focus on young researchers who are doing challenging scientific work within the more than 80 member laboratories. Therefore, the Council decided to implement a new initiative that is focused specifically on PhD candidates. This decision was the advent of the AGILE PhD School. The first of its kind has now been carried out at the University of Applied Sciences in Wernigerode, Germany. PhD students and senior scientists met in Wernigerode to present their subjects, exchange ideas and to discuss critically experiences, results and challenges.
    [Show full text]
  • GEOINT 2021 Prospectus
    Discovery and Connections October 5-8, 2021 St. Louis, Missouri PROSPECTUS Sponsor and Exhibitor Dates October 6-8 EXHIBIT HALL FEATURING • Aerial Imaging • Location-Based Services • Artificial Intelligence • Machine/Deep Learning • Big Data Analytics • Mapping Tools • Cloud Computing • Mobile Applications • Commercial Satellite Imaging • Mobile Platforms • Consulting • Open Source Intelligence • Content Managment • Remote Sensing • Cyber Security • Sensors • Geographic Information Systems • Simulation • Global Positioning Hardware & • Small Sats Software • UAVs • IC ITE • Visualization Software • Image Processing • And Much More! • Integration Services and Software GEOINT2021.com ATTRACTING HIGHLY QUALIFIED BUYERS Acquisition Directorate Architect Sensing Portfolio Director AND KEY DECISION MAKERS Admiral Chief Systems Engineer President Analyst Engineer Executive Director Principal Architect Assistant Executive Chief Technology Executive Vice Principal Data Officer Director President Scientist Collection Manager Associate Vice Founder Principal Engineer President Colonel Functional Principal Intelligence Board Member Combat Development Management Analyst Analyst Executive Branch Chief Professor Commandant General Branch Head Program Director Commander General Council Budget Director Publisher Business Development Commanding General General Manager R&D Engineer Executive Congressional Liaison GEOINT Chief R&D Scientist Business Development Contract Officer GEOINT Division Chief Manager Senior All-source Contracting Officer Geospatial
    [Show full text]
  • Geographic Information Retrieval: Classification, Disambiguation
    Imperial College London Department of Computing Geographic Information Retrieval: Classification, Disambiguation and Modelling Simon E Overell Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of the University of London and the Diploma of Imperial College, July 2009 1 2 Abstract This thesis aims to augment the Geographic Information Retrieval process with information extracted from world knowledge. This aim is approached from three directions: classifying world knowledge, disambiguating placenames and modelling users. Geographic information is becoming ubiquitous across the Internet, with a significant proportion of web documents and web searches containing geographic entities, and the proliferation of Internet enabled mobile devices. Traditional information retrieval treats these geographic entities in the same way as any other textual data. In this thesis I augment the retrieval process with geographic information, and show how methods built upon world knowledge outperform methods based on heuristic rules. The source of world knowledge used in this thesis is Wikipedia. Wikipedia has become a phenomenon of the Internet age and needs little introduction. As a linked corpus of semi-structured data, it is unsurpassed. Two approaches to mining information from Wikipedia are rigorously explored: initially I classify Wikipedia articles into broad categories; this is followed by much finer classification where Wikipedia articles are disambiguated as specific locations. The thesis concludes with the proposal of the Steinberg hypothesis: By analysing a range of wikipedias in different languages I demonstrate that a localised view of the world is ubiquitous and inherently part of human nature. All people perceive closer places as larger and more important than distant ones.
    [Show full text]
  • Beyond Search: What to Do When Your Enterprise Search System Doesn't Work
    Research Report Beyond Search: What to do when Your Enterprise Search System Doesn't Work April 2, 2008 by Stephen E. Arnold Gilbane Group Inc. 763 Massachusetts Avenue Cambridge, MA 02139 USA Tel: 617.497.9443 Fax: 617.497.5256 [email protected] http://gilbane.com Table of Contents List of Figures ........................................................................................................vi List of Tables....................................................................................................... viii Preface .............................................................................................. x Executive Summary............................................................................ 1 Fixing a Broken Search System.............................................................................1 How Do You Avoid Common Mistakes?...............................................................1 Beyond Key Words............................................................................................... 2 The Next-Generation Search Market................................................................... 2 The Market Layout............................................................................................... 4 Beyond Search at a Glance....................................................................................5 I. Introduction: Setting the Stage....................................................... 9 Search 2008 .......................................................................................................
    [Show full text]