Reuse of a Proper Noun Recognition System in Commercial and Operational NLP Applications

Chinatsu Aone and John Maloney
SRA International
4300 Fair Lakes Court
Fairfax, VA 22033
[email protected], maloneyj@sra.com

Abstract

SRA's proprietary product, NameTag™, which provides fast and accurate name recognition, has been reused in many applications in recent and ongoing efforts, including multilingual information retrieval and browsing, text clustering, and assistance to manual text indexing. This paper reports on SRA's experience in embedding name recognition in these three specific applications, and the mutual impacts that occur, both on the algorithmic level and in the role that name recognition plays in user interaction with a system. In the course of this, we touch upon various interactions between proper name recognition and machine translation (MT), as well as the role of accurate name recognition in improving the performance of word segmentation algorithms needed for languages whose writing systems do not segment words.

1 Introduction

Fast and accurate name recognition products are only now coming onto the market. SRA's proprietary product, NameTag, has been reused in many applications in recent and ongoing efforts, including multilingual information retrieval and browsing, text clustering, and assistance to manual text indexing. In the following paper, we report on our experience in embedding name recognition in these three specific applications, as well as the mutual impacts that occur, both on the algorithmic level and in the role that name recognition plays in user interaction with a system. In the course of this, we touch upon various interactions between proper name recognition and machine translation (MT), as well as the role of accurate name recognition in improving the performance of word segmentation algorithms needed for languages such as Japanese. Name recognition clearly offers added value when integrated with other algorithms and systems, but the latter also affect the way in which name recognition is performed, specifically the choice of high-recall or high-precision strategies. But first, we discuss the relevant features of NameTag.

2 Description of NameTag

NameTag is a multilingual name recognition system. It finds and disambiguates in texts the names of people, organizations, and places, as well as time and numeric expressions with very high accuracy. The design of the system makes possible the dynamic recognition of names: NameTag does not rely on long lists of known names. Instead, NameTag makes use of a flexible pattern specification language to identify novel names that have not been encountered previously. In addition, NameTag can recognize and link variants of names in the same document automatically. For instance, it can link "IBM" to "International Business Machines" and "President Clinton" to "Bill Clinton."

NameTag incorporates a language-independent C++ pattern-matching engine along with the language-specific lexicons, patterns, and other resources necessary for each language. In addition, the Japanese, Chinese, and Thai versions integrate word segmenters to deal with the orthographic challenges of these languages. (NameTag currently has these language versions available plus ones for English, Spanish, and French.)

NameTag is an extremely fast and robust system that can be easily integrated with other applications through its API. It has been our experience that NameTag has lent itself to so many successful integrations in diverse applications not just due to its accuracy, but to its speed. (Its NT version is currently benchmarked at 300 megabytes/hour on a Pentium Pro.) It is an attractive package to embed in an application, as it does not cause significant retardation of performance.

In the following discussion, we refer to various versions of NameTag, most prominently systems for English and Japanese. Their extraction accuracy varies. For example, in the Sixth Message Understanding Conference (MUC-6), the English system was benchmarked against the Wall Street Journal blind test set for the name tagging task, and achieved a 96% F-measure, which is a combination of recall and precision measures. Our internal testing of the Japanese system against blind test sets of various Japanese newspaper articles indicates that it achieves from high-80 to low-90% accuracy, depending on the types of corpora. Indexing names in Japanese texts is usually more challenging than English for two main reasons. First, there is no case distinction in Japanese, whereas English names in newspapers are capitalized, and capitalization is a very strong clue for English name tagging. Second, Japanese words are not separated by spaces and therefore must be segmented into separate words before the name tagging process. As segmentation is not 100% accurate, segmentation errors can sometimes cause name tagging rules not to fire or to misfire.

3 Proper Name Recognition Integrated With a Browsing & Retrieval System

We have recently developed a system incorporating NameTag that allows monolingual users to access information on the World Wide Web in languages that they do not know (Aone, Charocopos, and Gorlinsky, 1997). For example, previously it was not easy for a monolingual English speaker to locate necessary information written in Japanese. The user would not know the query terms in Japanese even if the search engine accepted Japanese queries. In addition, even when the users located a possibly relevant text in Japanese, they would have little idea about what was in the text. Output of off-the-shelf machine translation (MT) systems are often of low quality, and even "high-end" MT systems have problems particularly in translating proper names and specialized domain terms, which often contain the most critical information to the users.

Now these users have available our multilingual (or cross-linguistic) information browsing and retrieval system, which is aimed at monolingual users who are interested in information from multiple language sources. The system takes advantage of name-recognition software as embodied in NameTag to improve the accuracy of cross-linguistic retrieval and to provide innovative methods to browse and explore multilingual document collections. The system indexes texts in different languages (currently English and Japanese) and allows the users to retrieve relevant texts in their native language (currently English). The retrieved text is then presented to the users with proper names and specialized domain terms translated and hyperlinked. Among the innovations in our system is the stress placed upon proper names and their role as indices for document content.

The system consists of an Indexing Module, a Client Module, and a Term Translation Module. The Indexing Module creates and inserts indices into a database while the Client Module allows browsing and retrieval of information in the database through a Web-browser-based graphical user interface (GUI). The Term Translation Module dynamically translates English user queries into Japanese and the indexed terms in retrieved Japanese documents into English.

The Indexing Module

For the present application, the system indexes names of people, entities, and locations, as well as scientific and technical (S&T) terms in both English and Japanese texts, and allows the user to query and browse the indexed database in English. As NameTag processes texts, the indexed terms are stored in a relational database with their semantic type information (person, entity, place, S&T term) and alias information along with such meta data as source, date, language, and frequency information.

The Client Module

The Client Module lets the user both retrieve and browse information in the database through the Web-browser-based GUI. In the query mode, a form-based Boolean query issued by a user is automatically translated into an SQL query, and the English terms in the query are sent to the Term Translation Module. The Client Module then retrieves documents which match either the original English query or the translated Japanese query. As the indices are names and terms which may consist of multiple words (e.g., "Warren Christopher," "memory chip"), the query terms are delimited in separate boxes in the form, making sure no ambiguity occurs in both translation and retrieval.

In its browsing mode, the Client Module allows the user to browse the information in the database in various ways. For example, once the user selects a particular document for viewing, the client sends it to an appropriate (i.e., English or Japanese) indexing server for creating hyperlinks for the indexed terms, and, in the case of a Japanese document, sends the indexed terms to the Term Translation Module to translate the Japanese terms into English. The result that the user browses is a document each of whose indexed terms are hyperlinked to other documents containing the same indexed terms. Since hyperlinking is based on the original or translated English terms, the monolingual English speaker can follow the links to both English and Japanese documents transparently. In addition, the Client Module is integrated with a commercial MT system for a rough translation of the whole text.

The Term Translation Module

The Term Translation Module is used by the Client Module bi-directionally in two different modes. That is, it translates English query terms into Japanese in the query mode and, in reverse, translates Japanese indexed terms into English for viewing of a retrieved Japanese text in the browsing mode.

Extraction first. MT second." In particular, translation quality of names by even the best MT systems was poor. In an indexing and retrieval application such as the one under discussion, the proper identification and translation of names are critical. There are two cases where an MT system fails to translate names.
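As an illustration of the query mode described for the Client Module (delimited English terms, translation into Japanese, and an SQL query matching either form), the following is a minimal sketch only. The table names `documents` and `term_index`, the `EN_TO_JA` lookup table, the Japanese strings, and the functions `build_query`, `make_demo_db`, and `search` are all hypothetical stand-ins chosen for this example; the paper does not specify the actual schema, translation resources, or SQL produced by the system.

```python
import sqlite3

# Illustrative stand-in for the Term Translation Module: a tiny
# English-to-Japanese term table. The entries are placeholders and are
# not taken from the system described in the paper.
EN_TO_JA = {
    "Warren Christopher": "ウォーレン・クリストファー",
    "memory chip": "メモリチップ",
}


def build_query(terms, operator="AND"):
    """Turn delimited English query terms into a parameterized SQL query.

    Each term is kept whole (multi-word terms are never split), and a
    document matches a term if it is indexed under either the English
    term or its Japanese translation.
    """
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not terms:
        raise ValueError("at least one query term is required")
    clauses, params = [], []
    for term in terms:
        variants = [term]
        if term in EN_TO_JA:
            variants.append(EN_TO_JA[term])
        placeholders = ", ".join(["?"] * len(variants))
        clauses.append(
            "doc_id IN (SELECT doc_id FROM term_index "
            f"WHERE term IN ({placeholders}))"
        )
        params.extend(variants)
    where = f" {operator} ".join(clauses)
    sql = f"SELECT doc_id FROM documents WHERE {where} ORDER BY doc_id"
    return sql, params


def make_demo_db():
    """Create an in-memory database using the hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, language TEXT);
        CREATE TABLE term_index (doc_id INTEGER, term TEXT, type TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?)",
        [(1, "en"), (2, "ja"), (3, "en")],
    )
    conn.executemany(
        "INSERT INTO term_index VALUES (?, ?, ?)",
        [
            (1, "Warren Christopher", "person"),
            (1, "memory chip", "S&T term"),
            (2, "ウォーレン・クリストファー", "person"),
            (2, "メモリチップ", "S&T term"),
            (3, "memory chip", "S&T term"),
        ],
    )
    return conn


def search(conn, terms, operator="AND"):
    """Run the generated query and return the matching document ids."""
    sql, params = build_query(terms, operator)
    return [row[0] for row in conn.execute(sql, params)]
```

With the demo data, `search(make_demo_db(), ["Warren Christopher", "memory chip"])` returns both the English document and the Japanese document that contain the two terms. Passing each term as a separate list element mirrors the separate boxes in the query form, so a multi-word name is never split into independent words.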
