Very Large Lexical Databases: A Tutorial

Prepared for LingoMotors

James Pustejovsky
LingoMotors Inc., 585 Mass. Ave., Cambridge, MA 02139
[email protected]

Patrick Hanks
LingoMotors Inc., 585 Mass. Ave., Cambridge, MA 02139
[email protected]

July 26, 2001
LingoMotors, Cambridge, MA

Themes and Issues

1. Availability and Use of Very Large Corpora
2. Statistical Techniques for Corpus and Word-Context Analysis
3. Lexical Architecture Motivated by Linguistic Data
4. Heuristics and Algorithms for Lexical Acquisition

Tutorial Outline

1. Challenging Orthodox Assumptions about Lexicons
2. What is a Very Large Lexical Database?
3. Case Study 1: Contributions of Corpus Analysis to Lexicon Design
4. COFFEE BREAK
5. Lexicon Architectures: Possible versus Probable Meanings
6. Case Study 2: Brilliant Parser and Semantics Meet Dull Reality: Reality Bites Back
7. Round-up

Introduction: Challenging Orthodox Assumptions about Lexicons

Questions:
(1) What can we do with existing lexical resources?
(2) What are the evaluation criteria for a good NLP lexicon?
(3) Where do lexicons come from: design or corpora?

Conclusions:

1. Regarding lexicons: Lexicons are for something. The all-purpose lexicon is a myth. There are shared structures, properties, features, forms, and contexts, but it is best to think of lexicons as multiple and application-driven.

2. Regarding senses: Word senses do not exist as objects in themselves. Words have senses, but there are no senses independent of the contextualized word. (More precisely, a word has a meaning potential, components of which are activated when the word is put together with other words in a context.)

Session 1: What is a Very Large Lexical Database?

What is the relationship between corpus and lexicon?

• Corpus: an accumulation of tokens.
• Lexicon: an ordered collection of word-types (spelling forms) or lemmas, with data attached.
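To make the token/type distinction concrete, here is a minimal Python sketch. It is our illustration, not part of the tutorial materials: the sample corpus, the crude suffix-stripping lemmatizer, and the particular data fields attached to each entry are all invented for the example. It accumulates tokens from corpus text and builds an ordered lexicon of lemmas, each carrying frequency data and the spelling forms observed for it.

```python
from collections import Counter
import re


def tokenize(text):
    # A corpus is an accumulation of tokens: here, lowercased word
    # forms pulled from running text by a crude regular expression.
    return re.findall(r"[a-z]+", text.lower())


def lemmatize(token):
    # Toy lemmatizer that strips a few common English suffixes.
    # A real system would use proper morphological analysis.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def build_lexicon(corpus_texts):
    # A lexicon is an ordered collection of word-types (here, lemmas)
    # with data attached -- in this sketch, just token frequency and
    # the set of spelling forms observed in the corpus.
    freq = Counter()
    forms = {}
    for text in corpus_texts:
        for token in tokenize(text):
            lemma = lemmatize(token)
            freq[lemma] += 1
            forms.setdefault(lemma, set()).add(token)
    # Order alphabetically, as in a conventional dictionary.
    return {
        lemma: {"frequency": freq[lemma], "forms": sorted(forms[lemma])}
        for lemma in sorted(freq)
    }


corpus = [
    "The lexicographer analyzed corpora.",
    "Corpora contain tokens; lexicons contain types.",
]
for lemma, data in build_lexicon(corpus).items():
    print(lemma, data)
```

Even on two sentences the point of the session shows up: the corpus supplies an undifferentiated stream of tokens, while the lexicon that falls out of it is a much smaller, ordered set of types with data attached.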