
ENHANCING NAMED-ENTITY RECOGNITION WITH DBPEDIA

Ramya Kalathingal and Maurani Saha
Student Scholarship Week
Guided by: Dr. Richard Scherl

Introduction

An important component of processing natural language texts is to identify the entities named in the text. For example, news articles talk about politicians, companies, cities, dates and so on. We have been using off-the-shelf NLP (natural language processing) tools, including parsers and a named entity recognizer. Even though the named entity recognizer is state-of-the-art, there is significant room for improvement, as it makes some errors and assigns named entities to only a limited set of categories. One of our aims is to improve the depth of the information about named entities.

Research Question

Can DBpedia be used to enhance named-entity recognition?

Methods

• We are using a state-of-the-art parser to process natural language texts such as news articles. The parser uses a named-entity recognizer (NER) to identify phrases as names of people, organizations, and locations.
• DBpedia is a semantically organized database that is automatically created from the structured content of Wikipedia. In accordance with the principles of the Semantic Web, DBpedia has a large ontology that further categorizes people, organizations and locations.
• By accessing the latest archives of DBpedia, we are experimenting with matching the named entities found in the text with those categorized by DBpedia to obtain a finer-grained categorization.
• For example, this technique allows us to identify people as scientists or politicians, and organizations as companies or universities. Our goal is to use this information to improve the performance of text clustering and classification methods.

Work Flow

[Flowchart: Input Articles and DBpedia 2014 Archive; POS Tagging; Extract NER Tag from Parser; Search each NER Tag in DBpedia; If Match Found? Yes: Category for NER, Output into the file. No: Continue, Search Next NER; EOF? Yes: Terminate.]

Dependency Parsing

• A dependency parser builds a parse incrementally as it scans the words in a sentence. The approach is relatively fast, as the time taken is linear in the length of the sentence.
• A dependency parse is an analysis of the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads. The parse is represented as a set of triples (the name of the relation, the governing word, and the dependent or modifying word). It also has a natural graphical representation.
• We have experimented with two different dependency parsers: MaltParser and DependencyParseAnnotator, which is part of the Stanford CoreNLP suite of tools. Both are written in Java. We have used them from Java and also with a Python wrapper provided by the NLTK toolkit.
• Given below is a graphical representation of the dependency parse for the text: “Albert Einstein won the 1921 Nobel Prize in Physics.”

[Figure: dependency parse graph]

Stanford NERClassifierCombiner

• The NERClassifierCombiner from the Stanford CoreNLP suite of tools identifies named entities in the text and classifies them as Person, Location, and Organization. It also identifies numerical entities as Money, Number, and Percent, and temporal entities as Date and Time. Below is the visual representation of the output on the text: “Albert Einstein born on March 14, 1879 in Ulm, Württemberg, Germany.”

[Figure: NER output]

DBpedia

• The data stored in DBpedia is written in RDF (Resource Description Framework). Statements in RDF are triples of a subject (a thing represented by a URI), a predicate, and a value.
• The current version of DBpedia describes 4.58 million things. Most of these have a URI which is resolvable to a web page. Additionally, most of the things have a type relation with a value in the DBpedia ontology. Most have the primaryTopic predicate relating them to the relevant Wikipedia page.
• There are many other predicates describing different entities, e.g., birthdate, title (position held for elected officials).

A Portion of the DBpedia Ontology for Person

• Ambassador
• Archaeologist
• Architect
• Aristocrat
• Artist (Actor, Comedian, ComicsCreator, Dancer, FashionDesigner, Humorist, MusicalArtist, Painter, Photographer, Sculptor)
• Astronaut
• Athlete (ArcherPlayer, AthleticsPlayer, AustralianRulesFootballPlayer, BadmintonPlayer, BaseballPlayer….)
• BeautyQueen
• BusinessPerson
• Celebrity
• Cleric
• ...

Sample Output

`Fractivists’ Increase Pressure on Hillary Clinton and Bernie Sanders
New York Times, April 4, 2016

Below is part of the results from the processing of this article. The NER output is on the left and the DBpedia type is on the right.

NER output                            DBpedia type
('Bernie Sanders', 'PERSON')          OfficeHolder
('Greenpeace', 'ORGANIZATION')        Organisation
('New York', 'LOCATION')              AdministrativeRegion
('Pennsylvania', 'LOCATION')          AdministrativeRegion
('United States', 'LOCATION')         Country

Future Work

• Make use of the wide variety of predicates in DBpedia.
• Use the added information provided by DBpedia to enhance the quality of the output of clustering and classification algorithms when applied to news articles.
• Use techniques such as locality-sensitive hashing to speed up the string matching process.

Web Resources

• DBpedia: http://wiki.dbpedia.org/
• MaltParser: http://www.maltparser.org
• Natural Language Toolkit: http://www.nltk.org/
• Semantic Web: http://semanticweb.org
• Stanford CoreNLP Tools: http://stanfordnlp.github.io/CoreNLP/

References

• Daniel Jurafsky and James H. Martin. Speech and Language Processing, Second Edition. Prentice Hall. 2008.
• Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6(2): 167-195 (2015).
• Jure Leskovec, Anand Rajaraman and Jeffrey David Ullman. Mining of Massive Datasets (Second Edition). Cambridge University Press. 2014.
• Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP Natural Language Processing Toolkit. Pages 55-60 in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 2014.
• Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press. 2008.

Tools & Technology

Python 3, Java 1.8, JDOM Parser, Stanford CoreNLP parsers, NLTK parsers, MaltParser.

Acknowledgments

We thank Teena Gopi for sharing her earlier work on using the Stanford CoreNLP Parser and Named Entity Recognizer.
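
Illustrative Sketch of the Matching Step

The matching step of the work flow (search each NER tag in DBpedia, record the category on a match, otherwise continue with the next NER tag) might be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the dictionary DBPEDIA_TYPES is a hypothetical in-memory stand-in for the DBpedia 2014 archive, seeded only with the pairs shown in the sample output, and the function names and the case-insensitive normalization are illustrative choices.

```python
# Hypothetical stand-in for the DBpedia 2014 archive: label -> ontology type.
# The entries are taken from the poster's sample output; a real run would
# load these from the DBpedia type dumps instead.
DBPEDIA_TYPES = {
    "Bernie Sanders": "OfficeHolder",
    "Greenpeace": "Organisation",
    "New York": "AdministrativeRegion",
    "Pennsylvania": "AdministrativeRegion",
    "United States": "Country",
}


def normalize(label):
    """Collapse whitespace and ignore case (an assumed matching policy)."""
    return " ".join(label.split()).casefold()


def build_index(types):
    """Index DBpedia labels by their normalized form for direct lookup."""
    return {normalize(label): dbpedia_type for label, dbpedia_type in types.items()}


def refine_entities(ner_output, index):
    """Pair each (text, NER tag) with its DBpedia type, or None if no match."""
    results = []
    for text, tag in ner_output:
        # Match found: keep the DBpedia category; otherwise record None
        # and continue with the next NER tag.
        results.append((text, tag, index.get(normalize(text))))
    return results


if __name__ == "__main__":
    ner_output = [
        ("Bernie Sanders", "PERSON"),
        ("Greenpeace", "ORGANIZATION"),
        ("United States", "LOCATION"),
    ]
    for row in refine_entities(ner_output, build_index(DBPEDIA_TYPES)):
        print(row)
```

Exact lookup on a normalized label is the simplest possible policy. Approximate string matching, for example with locality-sensitive hashing, would replace the dictionary lookup inside refine_entities.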