Automatic Lexicon Generation for Unsupervised Part-Of-Speech Tagging Using Only Unannotated Text

Dennis V. Pereira

Thesis submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Dr. Csaba Egyhazy, Chair
Dr. William Frakes
Dr. Gabriella Belli

August 13, 2004
Falls Church, VA

Keywords: automatic, lexicon, lexicon generation, part-of-speech, term categorization

Abstract

With the growing number of textual resources available, the ability to understand them becomes critical. An essential first step in understanding these sources is the ability to identify the parts-of-speech in each sentence. The goal of this research is to propose, improve, and implement an algorithm capable of finding terms (words in a corpus) that are used in similar ways: a term categorizer. Such a term categorizer can be used to find a particular part-of-speech in a corpus, for example nouns, and generate a lexicon. The proposed work is not dependent on any external sources of information, such as dictionaries, and it shows a significant improvement (~30%) over an existing method of categorization. More importantly, the proposed algorithm can be applied as a component of an unsupervised part-of-speech tagger, making it truly unsupervised, requiring only unannotated text. The algorithm is discussed in detail, along with its background and its performance. Experimentation shows that the proposed algorithm performs within 3% of the baseline, the Penn-TreeBank Lexicon.

TABLE OF CONTENTS

I. Introduction
   A. Good vs. “cheap” words
   B. Natural Language Processing
   C. Hypothesis
II. Literature Review
   A. Define Parts-of-Speech
      1) Types of Words
         a) Words with Unambiguous Parts-of-Speech
         b) Unambiguous Usage
         c) Ambiguous Usage – Need for Outside Knowledge
      2) Foundation Part-of-Speech Computation
      3) Using Local Context – “yarzbygu” Example
      4) Part-of-Speech Computation
         a) Multi-Lingual
         b) Written English Focus
         c) Beyond Written Language
   B. Lexicon
      1) Need For Lexicon
         a) POS Tagger Architecture
      2) Sample Lexicon – giraffe example
      3) Cost of Creating a Lexicon
      4) Penn-TreeBank
      5) Limitations of Humans Creating Lexicons
   C. POS Taggers
      1) Brill Taggers – Transformation Based Approach
         a) Supervised Tagger
         b) Unsupervised Tagger
         c) Brill Tagger Pitfall
      2) Statistical POS Taggers
         a) Zipf’s Law
         b) MXPOST – N-Gram Approach
         c) TNT – HMM Approach
         d) BNC – Template Approach
      3) Autotutor – Neural Network Approach
      4) Approach Commonality
      5) Intro to Unknown Word Guessing
   D. Unknown Word Guessing
      1) Context and Morphology
      2) Unknown Word Guessing Methods
         a) Mikheev’s Approach
            (1) Framing the Problem
            (2) Naïve and Simple Approaches
            (3) Comparing Mikheev and Brill
         b) Thede’s Statistical Approach
         c) Cucerzan’s Approach
         d) Summarizing Unknown Word Guessing
   E. Miscellaneous
      1) Clustering and Multi-Words
         a) Clustering

