Recall-Oriented Learning of Named Entities in Arabic Wikipedia
Behrang Mohit∗  Nathan Schneider†  Rishav Bhowmick∗  Kemal Oflazer∗  Noah A. Smith†
School of Computer Science, Carnegie Mellon University
∗P.O. Box 24866, Doha, Qatar  †Pittsburgh, PA 15213, USA
fbehrang@,nschneid@cs.,rishavb@qatar.,ko@cs.,[email protected]

Abstract

We consider the problem of NER in Arabic Wikipedia, a semisupervised domain adaptation setting for which we have no labeled training data in the target domain. To facilitate evaluation, we obtain annotations for articles in four topical groups, allowing annotators to identify domain-specific entity types in addition to standard categories. Standard supervised learning on newswire text leads to poor target-domain recall. We train a sequence model and show that a simple modification to the online learner (a loss function encouraging it to “arrogantly” favor recall over precision) substantially improves recall and F1. We then adapt our model with self-training on unlabeled target-domain data; enforcing the same recall-oriented bias in the self-training stage yields marginal gains.¹

¹ The annotated dataset and a supplementary document with additional details of this work can be found at: http://www.ark.cs.cmu.edu/AQMAR

1 Introduction

This paper considers named entity recognition (NER) in text that is different from most past research on NER. Specifically, we consider Arabic Wikipedia articles with diverse topics beyond the commonly used news domain. These data challenge past approaches in two ways:

First, Arabic is a morphologically rich language (Habash, 2010). Named entities are referenced using complex syntactic constructions (cf. English NEs, which are primarily sequences of proper nouns). The Arabic script suppresses most vowels, increasing lexical ambiguity, and lacks capitalization, a key clue for English NER.

Second, much research has focused on the use of news text for system building and evaluation. Wikipedia articles are not news, belonging instead to a wide range of domains that are not clearly delineated. One hallmark of this divergence between Wikipedia and the news domain is a difference in the distributions of named entities. Indeed, the classic named entity types (person, organization, location) may not be the most apt for articles in other domains (e.g., scientific or social topics). On the other hand, Wikipedia is a large dataset, inviting semisupervised approaches.

In this paper, we describe advances on the problem of NER in Arabic Wikipedia. The techniques are general and make use of well-understood building blocks. Our contributions are:

• A small corpus of articles annotated in a new scheme that provides more freedom for annotators to adapt NE analysis to new domains;
• An “arrogant” learning approach designed to boost recall in supervised training as well as self-training; and
• An empirical evaluation of this technique as applied to a well-established discriminative NER model and feature set.

Experiments show consistent gains on the challenging problem of identifying named entities in Arabic Wikipedia text.

2 Arabic Wikipedia NE Annotation

Most of the effort in NER has been focused around a small set of domains and general-purpose entity classes relevant to those domains, especially the categories PER(SON), ORG(ANIZATION), and LOC(ATION) (POL), which are highly prominent in news text. Arabic is no exception: the publicly available NER corpora, namely ACE (Walker et al., 2006), ANER (Benajiba et al., 2008), and OntoNotes (Hovy et al., 2006), are all in the news domain.² However, appropriate entity classes will vary widely by domain; occurrence rates for entity classes are quite different in news text vs. Wikipedia, for instance (Balasuriya et al., 2009). This is abundantly clear in technical and scientific discourse, where much of the terminology is domain-specific, but it holds elsewhere. Non-POL entities in the history domain, for instance, include important events (wars, famines) and cultural movements (romanticism). Ignoring such domain-critical entities likely limits the usefulness of the NE analysis.

² OntoNotes contains news-related text. ACE includes some text from blogs. In addition to the POL classes, both corpora include additional NE classes such as facility, event, product, vehicle, etc. These entities are infrequent and may not be comprehensive enough to cover the larger set of possible NEs (Sekine et al., 2002). Nezda et al. (2006) annotated and evaluated an Arabic NE corpus with an extended set of 18 classes (including temporal and numeric entities); this corpus has not been released publicly.

Recognizing this limitation, some work on NER has sought to codify more robust inventories of general-purpose entity types (Sekine et al., 2002; Weischedel and Brunstein, 2005; Grouin et al., 2011) or to enumerate domain-specific types (Settles, 2004; Yao et al., 2003). Coarse, general-purpose categories have also been used for semantic tagging of nouns and verbs (Ciaramita and Johnson, 2003). Yet as the number of classes or domains grows, rigorously documenting and organizing the classes, even for a single language, requires intensive effort. Ideally, an NER system would refine the traditional classes (Hovy et al., 2011) or identify new entity classes when they arise in new domains, adapting to new data. For this reason, we believe it is valuable to consider NER systems that identify (but do not necessarily label) entity mentions, and also to consider annotation schemes that allow annotators more freedom in defining entity classes.

Our aim in creating an annotated dataset is to provide a testbed for evaluation of new NER models. We will use these data as development and testing examples, but not as training data. In §4 we will discuss our semisupervised approach to learning, which leverages ACE and ANER data as an annotated training corpus.

2.1 Annotation Strategy

We conducted a small annotation project on Arabic Wikipedia articles. Two college-educated native Arabic speakers annotated about 3,000 sentences from 31 articles. We identified four topical areas of interest (history, technology, science, and sports) and browsed these topics until we had found 31 articles that we deemed satisfactory on the basis of length (at least 1,000 words), cross-lingual linkages (associated articles in English, German, and Chinese³), and subjective judgments of quality. The list of these articles, along with sample NEs, is presented in Table 1. These articles were then preprocessed to extract main article text (eliminating tables, lists, info-boxes, captions, etc.) for annotation.

³ These three languages have the most articles on Wikipedia. Associated articles here are those that have been manually hyperlinked from the Arabic page as cross-lingual correspondences. They are not translations, but if the associations are accurate, these articles should be topically similar to the Arabic page that links to them.

       History               Science            Sports                   Technology
dev:   Damascus              Atom               Raúl Gonzáles            Linux
       Imam Hussein Shrine   Nuclear power      Real Madrid              Solaris
test:  Crusades              Enrico Fermi       2004 Summer Olympics     Computer
       Islamic Golden Age    Light              Cristiano Ronaldo        Computer Software
       Islamic History       Periodic Table     Football                 Internet
       Ibn Tolun Mosque      Physics            Portugal football team   Richard Stallman
       Ummaya Mosque         Muhammad al-Razi   FIFA World Cup           X Window System

Sample NEs (English glosses): Claudio Filippone (PER); Linux (SOFTWARE); Spanish League (CHAMPIONSHIPS); proton (PARTICLE); nuclear radiation (GENERIC-MISC); Real Zaragoza (ORG)

Table 1: Translated titles of Arabic Wikipedia articles in our development and test sets, and some NEs with standard and article-specific classes. Additionally, Prussia and Amman were reserved for training annotators, and Gulf War for estimating inter-annotator agreement.

Our approach follows ACE guidelines (LDC, 2005) in identifying NE boundaries and choosing POL tags. In addition to this traditional form of annotation, annotators were encouraged to articulate one to three salient, article-specific entity categories per article. For example, names of particles (e.g., proton) are highly salient in the Atom article. Annotators were asked to read the entire article first, and then to decide which non-traditional classes of entities would be important in the context of the article. In some cases, annotators reported using heuristics (such as being proper nouns or having an English translation which is conventionally capitalized) to help guide their determination of non-canonical entities and entity classes. Annotators produced written descriptions of their classes, including example instances.

This scheme was chosen for its flexibility: in contrast to a scenario with a fixed ontology, annotators required minimal training beyond the POL conventions, and did not have to worry about delineating custom categories precisely enough that they would extend straightforwardly to other topics or domains. Of course, we expect inter-annotator variability to be greater for these open-ended classification criteria.

Token position agreement rate   92.6%   Cohen’s κ: 0.86
Token agreement rate            88.3%   Cohen’s κ: 0.86
Token F1 between annotators     91.0%
Entity boundary match F1        94.0%
Entity category match F1        87.4%

Table 2: Inter-annotator agreement measurements.

History: Gulf War, Prussia, Damascus, Crusades
    WAR CONFLICT •••
Science: Atom, Periodic table
    THEORY •              CHEMICAL ••
    NAME ROMAN •          PARTICLE ••
Sports: Football, Raúl Gonzáles
    SPORT ◦               CHAMPIONSHIP •
    AWARD ◦               NAME ROMAN •
Technology: Computer, Richard Stallman
    COMPUTER VARIETY ◦    SOFTWARE •
    COMPONENT •

Table 3: Custom NE categories suggested by one or both annotators for 10 articles. Article titles are translated from Arabic. • indicates that both annotators volunteered a category for an article; ◦ indicates that only one annotator suggested the category. Annotators were not given a predetermined set of possible categories; rather, category matches between annotators were determined by post hoc analysis. NAME ROMAN indicates an NE rendered in Roman characters.
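
As an illustration of the token-level measures listed in Table 2, the following minimal sketch computes an agreement rate, Cohen’s κ, and a token F1 for two annotators’ label sequences. It is not the authors’ evaluation code. The function names, the BIO-style tags, the use of "O" for tokens outside any entity, and the reading of “token F1” as overlap on entity membership are all assumptions made for the example; this excerpt does not spell out the exact definitions behind the reported figures.

```python
from collections import Counter


def agreement_rate(a, b):
    """Fraction of tokens on which the two label sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    n = len(a)
    p_o = agreement_rate(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: both annotators independently pick the same label.
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


def token_f1(a, b, outside="O"):
    """F1 over tokens marked as part of any entity (assumed definition)."""
    in_a = [x != outside for x in a]
    in_b = [y != outside for y in b]
    both = sum(x and y for x, y in zip(in_a, in_b))
    total = sum(in_a) + sum(in_b)
    return 2.0 * both / total if total else 0.0


# Hypothetical annotations of the same eight tokens by two annotators.
ann1 = ["B-PER", "I-PER", "O", "O", "B-LOC", "O", "O", "O"]
ann2 = ["B-PER", "I-PER", "O", "O", "O", "O", "B-ORG", "O"]

print(round(agreement_rate(ann1, ann2), 3))  # 0.75
print(round(cohens_kappa(ann1, ann2), 3))    # 0.568
print(round(token_f1(ann1, ann2), 3))        # 0.667
```

Because the F1 here is computed on entity membership only, it is symmetric in the two annotators, which fits the phrase “between annotators”; a stricter variant would also require the labels to match.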