Mapping Local News Coverage: Precise location extraction in textual news content using fine-tuned BERT based language model

Sarang Gupta∗ and Kumari Nishu∗
Data Science Institute, Columbia University, NY
[email protected], [email protected]

∗ Equal contribution

Abstract

Mapping local news coverage from textual content is a challenging problem that requires extracting precise location mentions from news articles. While traditional named entity taggers are able to extract geo-political entities and certain non-geo-political entities, they cannot recognize precise location mentions such as addresses, streets and intersections that are required to accurately map the news article. We fine-tune a BERT-based language model to achieve a high level of granularity in location extraction. We incorporate the model into an end-to-end tool that further geocodes the extracted locations for the broader objective of mapping news coverage.

1 Introduction

A media or news desert is an uncovered geographical area that has few or no news outlets and receives little coverage. Mapping the locations mentioned in news articles is the primary step in identifying news deserts. A key challenge in the process is to manually peruse the corpus of news articles, identify the location mentions and assign spatial coordinates, which can then be placed on a map to identify a newsroom's coverage.

Conventional Named Entity Recognition (NER) taggers such as those offered by spaCy (Honnibal and Montani, 2017), the Natural Language Toolkit (NLTK) (Bird et al., 2009) and the Stanford NLP group (Finkel et al., 2005) contain tags to identify organizations, geo-political entities (GPE) and certain non-GPE locations such as mountain ranges and bodies of water in text, but they are not able to extract precise location mentions such as addresses, streets or intersections in their entirety. For example, the sentence "The family will hold shivah from 7 to 9 p.m. Thursday, Oct. 13, and again Saturday, Oct. 15, at Temple Sinai, 5505 Forbes Ave., Pittsburgh.", when passed through the Stanford Named Entity Tagger, returns "Temple Sinai", "Forbes Ave." and "Pittsburgh" as separate location entities. However, to accurately map the location of interest, one requires the whole address, "Temple Sinai, 5505 Forbes Ave., Pittsburgh", to be returned as a single location.

In recent years, powerful pre-trained transformer-based language models have emerged, such as Google's Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), XLNet (Yang et al., 2019) and OpenAI's GPT-2 (Radford et al., 2018). These models can be fine-tuned for specific classification tasks in the absence of abundant training data and thus are often helpful in settings with weak supervision (Ratner et al., 2017). We fine-tune the BERT model to extract precise location mentions in their entirety, as illustrated in the previous paragraph. We first build a dataset of about 10,000 sentences extracted from a corpus of 80,000 news articles spanning January 2018 to June 2019 published in the Philadelphia Inquirer newspaper. We use Amazon Mechanical Turk (MTurk) to label the locations of interest in the sentences and fine-tune a BERT-based NER tagger to classify the words in a text. Finally, we incorporate geocoding of the extracted locations into the pipeline.

In this work, we present an end-to-end system to extract geographic data from text. This tool could be particularly useful for newsrooms to map the coverage of their printed content, helping identify news deserts, and for researchers and other organizations to extract precise location mentions in text. Our main contributions are as follows: (1) Preparing a dataset containing sentences with words tagged as geopolitical entities, organizations, streets and addresses.
(2) Fine-tuning an existing BERT-based NER tagger to identify the aforementioned location entities.

(3) Building an end-to-end system that consumes news content, transforms it into a BERT-readable format and returns a list of the geocoded locations mentioned in the content.

The overall process is illustrated in Figure 1. We first discuss our modelling approach, followed by the process we used for geocoding the extracted locations.

2 Related Work

Named Entity Recognition (NER) is a well-studied area in the field of Natural Language Processing (NLP). It aims to identify different types of entities such as people, organizations, nationalities and locations in text. Tools trained on conventional NER models such as Conditional Random Fields (Lafferty et al., 2001), Maximum Entropy (Ratnaparkhi, 2016) and Labeled LDA (Ramage et al., 2009) have been successful in identifying common named entities. The challenge comes when a high level of granularity is of interest, such as extracting specific addresses, streets or intersections.

Lingad et al. (2013) evaluated the effectiveness of existing NER tools such as Stanford NER, OpenNLP and Yahoo! PlaceMaker on extracting locations from disaster-related tweets. Brunsting et al. (2016) presented an approach combining NER and Part-of-Speech (POS) tagging to develop a set of heuristics that achieve higher granularity for location extraction in text.

There has been some work on end-to-end neural architectures for sequence labelling tasks, including NER (Chiu and Nichols, 2015) and POS tagging (Meftah and Semmar, 2018). Magnolini et al. (2019) explored the use of external gazetteers for entity recognition with neural models, showing that extracting features from a rich model of the gazetteer and concatenating such features with the input embeddings of a neural model outperforms conventional approaches.

Fine-tuning pre-trained language models for domain-specific tasks has become increasingly convenient and effective. Lee et al. (2019) introduced BioBERT, a BERT-based biomedical language representation model for biomedical text mining, and Xue et al. (2019) presented a fine-tuned BERT model for entity and relation extraction in Chinese medical text. Liu (2019) presented advances in extractive summarization using a fine-tuned BERT model.

To our knowledge, no previous work has been done to fine-tune a pre-trained language-based deep learning model to achieve the level of precision and granularity in location extraction that is required for the purpose of mapping news coverage. Furthermore, there does not exist an open-source end-to-end tool to extract and geocode precise locations from a piece of text.

3 Model

Our approach to developing a named-entity tagger for the task of precise location extraction involves fine-tuning an existing neural network on a target dataset. Fine-tuning a model updates its pre-trained parameters, improving its performance on the downstream NLP task.

We treat named-entity tagging in a sentence as token classification within a sequence, assigning each word (token) in the sentence (sequence) a label. The fine-tuning process for token classification involves: (1) preparing the training dataset with expected labels for tokens within each sequence; (2) loading an existing model with pre-trained weights; (3) extending the model with a classification layer at the end, with the number of nodes equal to the number of classes in the task at hand; and (4) training the model on the target dataset.

We use Bidirectional Encoder Representations from Transformers (BERT), developed by Google AI in 2018, as the pre-trained model. BERT makes use of multiple multi-head attention layers to learn bidirectional embeddings for input tokens. It is trained for masked language modeling, where a fraction of the input tokens in a given sequence are masked and the task is to predict the masked word given its context (Devlin et al., 2018). Our decision to choose BERT is motivated by the fact that it is a general-purpose language representation model pre-trained on millions of articles from English Wikipedia and BookCorpus. Given the diversity of topics present in these two training sets, we believe BERT generalizes well to our dataset of news articles. BERT's use of the WordPiece tokenizer mitigates the out-of-vocabulary issue when tokenizing location names, which are often proper nouns. With minimal architecture modification, BERT can be applied to our NER task.
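As a concrete illustration of steps (2) and (3) of this recipe, the minimal sketch below loads pre-trained BERT weights and extends them with a token-level classification head using the Hugging Face library introduced in Section 3.2. The checkpoint name and the three-label set are illustrative placeholders; our actual label set and training configuration are given in Section 3.2.

```python
# Sketch of steps (2) and (3): load pre-trained weights and extend the model
# with a classification layer whose size matches the number of classes.
# The checkpoint name and label list are illustrative placeholders.
from transformers import BertForTokenClassification

labels = ["LOC_B", "LOC_I", "O"]  # placeholder label set; see Section 3.2
model = BertForTokenClassification.from_pretrained(
    "bert-base-cased",        # step (2): existing pre-trained weights
    num_labels=len(labels),   # step (3): classification layer at the end
)
```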

Figure 1: Overview of our methodology

3.1 Dataset Preparation

Our goal is to create a labelled dataset for token classification, with each token labelled as being part of a location entity or not. For this purpose, we assembled a dataset of 10,000 sentences drawn from a corpus of 80,000 news articles published in the Philadelphia Inquirer newspaper between January 2018 and June 2019. The articles represented 'beats' including politics, opinion, sports, food and travel among others, written by a number of different authors. The articles originated from 10 different news sources that had their articles published on the Philadelphia Inquirer website.

Since the majority of sentences in the articles did not contain a location mention, sending whole articles for tagging on Amazon MTurk was not cost-efficient. To overcome this, we broke the articles down into individual sentences using spaCy's Sentencizer and devised a set of heuristics and custom rules on top of spaCy's NER system to capture sentences with mentions of addresses, streets and intersections. Through exploratory analysis and manual perusal of the articles, we identified three patterns based on the syntactic relation between POS and NER tags in a sentence:

Heuristic 1. NER_LOC + POS_PREP + NER_LOC
Two location entities separated by a prepositional tag often indicated a hierarchical relation between the two location entities. For example, the phrase 'Arbor Street in Kensington.' (Figure 2, top) refers to the street, Arbor Street, which is part of the neighborhood, Kensington.

Heuristic 2. NER_NUM + POS_NOUN + POS_PREP + NER_LOC
A number, a noun and a prepositional tag preceding a place were collectively used to identify a precise location within the referenced place. For example, '3100 block of Arbor Street.' (Figure 2, middle).

Heuristic 3. NER_NUM + POS_LOC
A number preceding a location entity, often a street, collectively referred to a building in that street or area. For example, '200 West Street' (Figure 2, bottom) refers to a building on West Street.

Figure 2: Heuristics to identify locations of interest

Using the aforementioned patterns, we filtered in sentences that contained at least one of them; a sketch of this filtering step appears at the end of this subsection. Note that even though the defined heuristics are not exhaustive and cannot cover all possible syntactic patterns in which our locations of interest could appear in text, they offer an effective strategy to create labelled data for the higher-level, less precise supervision required in transfer learning (Ratner et al., 2019). We augmented the dataset to include 20% of sentences that contained simple geo-political entities (such as United States, Pennsylvania, Philadelphia, etc.), as it is easier for a pre-trained language model to identify such mentions. A total of 10,000 sentences were randomly selected from this set.

As a second pass, crowd workers on Amazon MTurk were instructed to identify the locations of interest in the 10,000 sentences and mark their starting and ending character positions in the sequence. The sentences were then tokenized using WordPiece tokenization (Wu et al., 2016) and the tokens were assigned labels using Inside-Outside-Beginning (I-O-B) notation (Ramshaw and Marcus, 1995). For instance, LOC_B represents the start of a location entity and LOC_I represents subsequent tokens that are part of that location entity. An example is depicted in Figure 3, which shows the tokens of a sentence tagged using the I-O-B notation.
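The sketch below illustrates the sentence-filtering step: an article is segmented with spaCy and each sentence is checked against Heuristic 1 (NER_LOC + POS_PREP + NER_LOC); the other two heuristics follow the same pattern-matching style. The spaCy model name and the exact rule encoding are assumptions, not our production rules.

```python
# Sketch of the heuristic sentence filter. Implements Heuristic 1
# (NER_LOC + POS_PREP + NER_LOC), e.g. "Arbor Street in Kensington."
# The spaCy model and rule details are illustrative assumptions.
import spacy

nlp = spacy.load("en_core_web_sm")   # provides sentence splits, POS and NER
LOC_LABELS = {"GPE", "LOC", "FAC"}   # entity labels treated as locations

def is_loc(token):
    return token.ent_type_ in LOC_LABELS

def matches_heuristic_1(sent):
    # a location token, a preposition (ADP), then another location token
    for i in range(1, len(sent) - 1):
        if sent[i].pos_ == "ADP" and is_loc(sent[i - 1]) and is_loc(sent[i + 1]):
            return True
    return False

def candidate_sentences(article_text):
    doc = nlp(article_text)
    return [sent.text for sent in doc.sents if matches_heuristic_1(sent)]

print(candidate_sentences("They met on Arbor Street in Kensington. It rained."))
```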

3.2 Model Training

We used BERT's implementation provided by Hugging Face (Wolf et al., 2019) in PyTorch (Paszke et al., 2019). As the task was framed as a multi-label classification problem, a softmax layer comprising 6 nodes (LOC_B, LOC_I, X, O, SEP, CLS) was added for token-level classification. The weights were initialized using BERT pre-trained for general-purpose entity recognition (Link to GitHub repository of the pre-trained model). We fine-tuned both BERT-Large and BERT-Base and used the original cased vocabulary of the respective models.

The models were trained using the BERTAdam optimizer (Adam (Kingma and Ba, 2014) with weight-decay regularization for BERT), with the learning rate set to 3 × 10^-5. A weight-decay rate of 0.01 on the main weight matrices, alongside early stopping, was used to add regularization. The average length of a WordPiece-tokenized sequence in our dataset was 40 (max: 241, min: 12, std: 16); we used a maximum sequence length of 128 for BERT-Base and 64 for BERT-Large. A batch size of 32 was used for BERT-Base and 16 for BERT-Large, and the models were trained for a total of 20 epochs. NVIDIA Tesla K80 and NVIDIA Tesla P100 GPUs on the Google Cloud Platform were used to fine-tune BERT-Base and BERT-Large respectively. Due to computational limitations, we could not train BERT-Large with larger sequence lengths and batch sizes.

3.3 Experimental Results

In order to ensure the effectiveness of our experiment, we divided the dataset into training, development and test sets in a ratio of 8:1:1. As the task was framed as a multi-label classification problem, we calculate the common performance measures Precision, Recall and F1 score (Liu et al., 2014) for each of our labels (LOC_B, LOC_I and O). The results are presented in Table 1.

Table 1: Results on the Test Set for BERT-Base and BERT-Large

Model        Tag      Precision   Recall   F1
BERT-Base    LOC_B    76.97       87.43    81.85
             LOC_I    74.83       71.97    73.38
             O        99.25       99.10    99.18
BERT-Large   LOC_B    78.27       81.01    79.62
             LOC_I    73.01       70.16    71.56
             O        98.14       98.22    98.18

As shown in Table 1, BERT-Base performs better than BERT-Large for all tags on all three evaluation metrics. We believe this is because we used a shorter sequence length while training the BERT-Large model, which ignores any location mentions after token 64.

To assess the overall performance of our models, we calculated the average precision, recall and F1 scores weighted by the number of LOC_B and LOC_I labels in the test set. As there were a total of 2334 tokens tagged with the label LOC_B and 3355 tokens tagged with LOC_I in the test set, using weighted average scores helps account for this imbalance. We obtain a weighted average F1 of 76.83 with BERT-Base and 74.87 with BERT-Large (Table 2). Note that we have used a conservative approach to measuring the performance of the models: while all tokens might not be essential to identifying a location, our evaluation requires the beginning token (LOC_B) and all subsequent tokens (LOC_I) to be identified for the location to be marked as correct.

Figure 4 illustrates the comparison between our fine-tuned BERT-Base model and standard NER tools such as those offered by spaCy and the Stanford NLP group. Our model is able to extract precise location entities, whereas the other tools split the entities into sub-entities.
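The training setup described in Section 3.2 can be sketched as follows, with AdamW (Adam with weight-decay regularization) standing in for BERTAdam and a plain PyTorch loop; the `train_loader`, yielding padded batches with token-level labels, is assumed to exist.

```python
# Sketch of the Section 3.2 setup for BERT-Base: 6-label token classifier,
# Adam with weight decay (lr 3e-5, decay 0.01), batch size 32, 20 epochs.
# `train_loader` (yielding input_ids, attention_mask, labels) is assumed.
import torch
from transformers import BertForTokenClassification

NUM_LABELS = 6  # LOC_B, LOC_I, X, O, SEP, CLS
model = BertForTokenClassification.from_pretrained("bert-base-cased",
                                                   num_labels=NUM_LABELS)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)

model.train()
for epoch in range(20):       # early stopping on dev loss would wrap this
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(input_ids=batch["input_ids"],
                     attention_mask=batch["attention_mask"],
                     labels=batch["labels"]).loss
        loss.backward()
        optimizer.step()
```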

Figure 3: WordPiece Tokenization and Tagging
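The tokenization-and-tagging step of Figure 3 can be sketched as below: character-level spans from the MTurk annotations are projected onto WordPiece tokens, the first sub-token of a location receiving LOC_B and the remaining sub-tokens LOC_I. This is a simplified reconstruction (it omits the X/SEP/CLS bookkeeping), not our exact labelling code.

```python
# Sketch: project character-level location spans onto WordPiece tokens
# with I-O-B labels. Simplified; omits X/SEP/CLS handling.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

def iob_labels(sentence, spans):
    """spans: list of (start_char, end_char) location mentions."""
    enc = tokenizer(sentence, return_offsets_mapping=True,
                    add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
    labels = []
    for tok_start, tok_end in enc["offset_mapping"]:
        label = "O"
        for span_start, span_end in spans:
            if tok_start >= span_start and tok_end <= span_end:
                # first sub-token of the span gets LOC_B, the rest LOC_I
                label = "LOC_B" if tok_start == span_start else "LOC_I"
        labels.append(label)
    return list(zip(tokens, labels))

# span covering "Temple Sinai, 5505 Forbes Ave., Pittsburgh"
print(iob_labels("Shivah at Temple Sinai, 5505 Forbes Ave., Pittsburgh.",
                 [(10, 52)]))
```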

Figure 4: Illustration of different NER taggers. (a) Original Sentence (b) spaCy NER (c) Stanford NLP Group (d) Fine-tuned BERT-Base. Note: A description of spaCy NER tags can be found at https://spacy.io/api/annotation

Table 2: Weighted Average Scores on the Test Set for BERT-Base and BERT-Large

Model        Precision   Recall   F1
BERT-Base    75.70       78.28    76.83
BERT-Large   75.17       74.62    74.87

4 Mapping

By deploying the fine-tuned BERT model, we present a system that (1) consumes textual news content (or any other text, for that matter), (2) extracts the raw text of the locations referenced and (3) returns the corresponding geocodes. The geocodes can be plotted on a map using any standard mapping tool.

4.1 Location Extraction

The BERT models in Section 3 have been fine-tuned to perform entity recognition at the sentence level. Hence, the first step in location extraction is to split the content into individual sentences. Similar to the approach taken in the data preparation step, we use spaCy's Sentencizer to break the content into individual sentences. The sentences are then passed through the fine-tuned BERT model, which returns a list of precise location mentions for each sentence. The locations are then regrouped at the content level.
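A sketch of this extraction step is below: the content is segmented with spaCy's Sentencizer, each sentence is tagged, and consecutive LOC_B/LOC_I tokens are merged back into location strings. Here `predict_labels` is a hypothetical stand-in for a forward pass through the fine-tuned model.

```python
# Sketch of Section 4.1: sentence-split the content, tag each sentence, and
# regroup LOC_B/LOC_I runs into full location strings.
# `predict_labels(tokens) -> list of tags` stands in for the fine-tuned model.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def extract_locations(content, predict_labels):
    locations = []
    for sent in nlp(content).sents:
        tokens = [tok.text for tok in sent]
        tags = predict_labels(tokens)     # e.g. ["O", "LOC_B", "LOC_I", ...]
        current = []
        for token, tag in zip(tokens, tags):
            if tag == "LOC_B":            # a new location starts
                if current:
                    locations.append(" ".join(current))
                current = [token]
            elif tag == "LOC_I" and current:
                current.append(token)     # continuation of the location
            else:
                if current:
                    locations.append(" ".join(current))
                current = []
        if current:
            locations.append(" ".join(current))
    return locations                      # regrouped at the content level
```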

Figure 5: An example output of the mapping system

4.2 Geocoding

Geocoding is the process of taking input text, such as an address or the name of a place, and returning a latitude/longitude location on the Earth's surface for that place. For instance, geocoding "Atlantic City" yields (39.3643 N, 74.4229 W).

The locations are geocoded using the geocoding APIs offered by Google Places and OpenStreetMap. The Text Search service within the Google Places API returns information about a set of places based on a string, which makes it particularly useful for ambiguous address queries. However, due to the limited number of free queries available every month, we utilize the open-source OpenStreetMap API in conjunction with the Google Places API to minimize the cost associated with geocoding. The OpenStreetMap API does not have a built-in text disambiguation service like Google Places' Text Search, but it is able to return geocodes for geo-political entities with high accuracy. To this end, we use OpenStreetMap for geocoding locations that are geo-political entities and Google Places for other locations. Locations are identified as geo-political by passing the entity list through spaCy's NER tagger.

To facilitate disambiguation, we utilize the location and radius parameters in the Google Places API and the viewbox parameter in the OpenStreetMap API, which allow users to specify the preferred region of search. This is particularly useful for mapping local news coverage, which is generally confined to a single region. Geo-political entities are geocoded as polygons, with a series of coordinates defining the enclosed area. For other locations, a single point coordinate representing the centroid is returned. The final output is a GeoJSON containing a list of all extracted locations with their geocodes, which can readily be consumed by third-party mapping software. In Figure 5, we can see an example of an input text and the corresponding GeoJSON output of the locations referenced in the text.
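As a sketch of the OpenStreetMap branch of this step, the snippet below geocodes extracted locations through Nominatim (here via the geopy client, one of several possible clients) with a viewbox biasing results toward the Philadelphia region, and assembles the GeoJSON output described above. The bounding-box coordinates are approximate, and the Google Places branch and polygon geometries are omitted.

```python
# Sketch of Section 4.2 using the OpenStreetMap Nominatim API via geopy.
# Returns point features only; the Google Places branch and polygon
# geometries for geo-political entities are omitted. Viewbox is approximate.
from geopy.geocoders import Nominatim

geocoder = Nominatim(user_agent="news-coverage-mapping-sketch")
PHILLY_VIEWBOX = [(40.14, -75.28), (39.87, -74.96)]  # (lat, lon) corners

def geocode_locations(location_names):
    features = []
    for name in location_names:
        place = geocoder.geocode(name, viewbox=PHILLY_VIEWBOX, bounded=False)
        if place is None:
            continue  # location could not be geocoded
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point",
                         "coordinates": [place.longitude, place.latitude]},
            "properties": {"name": name},
        })
    # GeoJSON FeatureCollection, ready for third-party mapping software
    return {"type": "FeatureCollection", "features": features}

print(geocode_locations(["Atlantic City"]))
```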

5 Conclusion and Future Work

To map local news coverage, it is important to extract precise location mentions from textual news content. In this paper, we presented a fine-tuned BERT-based language model that achieves the high level of granularity required for this task. Compared to traditional NER taggers, our model is able to extract locations such as addresses, streets and intersections in their entirety, making it possible to accurately place them on a map. We also present an end-to-end system, deploying the model to extract locations from textual content and geocode them in a format that can be directly consumed by mapping software for plotting. The system can be integrated with an interactive user interface to visualize location-related features of news content. An example is illustrated in Figure 6, where 1000 random news articles from the Philadelphia Inquirer are plotted using the Mapbox API. The map shows the locations referenced in these news articles (denoted by the points) and the coverage across different Philadelphia neighborhoods (green polygons, with color indicating the frequency).

Figure 6: 1000 articles from the Philadelphia Inquirer plotted on a map. The green polygons represent Philadelphia neighborhoods, with color indicating the number of times they have been referenced. Dots represent the locations referenced. The tooltip shows the metadata for the article that references one of the locations.

Our work can be advanced and extended from many different perspectives. First, a more comprehensive dataset could be developed with specific tags for different location types, and the BERT model could be further fine-tuned on this dataset to extract the different location types from text. Second, the disambiguation of the extracted locations could be further strengthened using contextual clues to enhance the accuracy of the geocoding process. Lastly, many other state-of-the-art pre-trained language models could be fine-tuned on our dataset and compared in order to select the best-performing model for tagging.

Acknowledgments

We would like to thank the Brown Institute for Media Innovation at Columbia University and the Lenfest Institute for Journalism for providing us the resources to carry out this study and for their guidance throughout.

References

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media Inc.

Shawn Brunsting, Hans De Sterck, Remco Dolman, and Teun van Sprundel. 2016. GeoTextTagger: High-precision location tagging of textual documents using a natural language processing approach.

Jason Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 6.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.

Jenny Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics.

John Lingad, Sarvnaz Karimi, and Jie Yin. 2013. Location extraction from disaster-related microblogs.

Yang Liu. 2019. Fine-tune BERT for extractive summarization. CoRR, abs/1903.10318.

Yangguang Liu, Yangming Zhou, Shiting Wen, and Chaogang Tang. 2014. A strategy on selecting performance metrics for classifier evaluation. International Journal of Mobile Computing and Multimedia Communications, 6:20–35.

Simone Magnolini, Valerio Piccioni, Vevake Balaraman, Marco Guerini, and Bernardo Magnini. 2019. How to use gazetteers for entity recognition with neural models. In Proceedings of the 5th Workshop on Semantic Deep Learning (SemDeep-5), Macau, China. Association for Computational Linguistics.

Sara Meftah and Nasredine Semmar. 2018. A neural network model for part-of-speech tagging of social media texts. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2018. Language models are unsupervised multitask learners.

Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. 2009. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 248–256, Singapore. Association for Computational Linguistics.

Lance A. Ramshaw and Mitchell P. Marcus. 1995. Text chunking using transformation-based learning.

Adwait Ratnaparkhi. 2016. Maximum Entropy Models for Natural Language Processing, pages 1–6. Springer US, Boston, MA.

Alex Ratner, Paroma Varma, Braden Hancock, and Chris Ré. 2019. Weak supervision: A new programming paradigm for machine learning.

Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment, 11(3):269–282.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Kui Xue, Yangming Zhou, Zhiyuan Ma, Tong Ruan, Huanhuan Zhang, and Ping He. 2019. Fine-tuning BERT for joint entity and relation extraction in Chinese medical text.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding.
