United States Patent 6,076,088 (Paik et al.)
US006076088A

[11] Patent Number: 6,076,088
[45] Date of Patent: Jun. 13, 2000
Paik et al.

[54] INFORMATION EXTRACTION SYSTEM AND METHOD USING CONCEPT RELATION CONCEPT (CRC) TRIPLES

[76] Inventors: Woojin Paik, 405 Crawford Ave., Syracuse, N.Y. 13224; Elizabeth D. Liddy, 104 Victoria Pl.; Jennifer Heverin Liddy, 252 Buckingham Ave., both of Syracuse, N.Y. 13210; Ian Harcourt Niles, Nettleton Commons, 313 E. Willow St., Syracuse, N.Y. 13203; Eileen E. Allen, 4 Dwight Ave., Clinton, N.Y. 13323

[21] Appl. No.: 08/795,658
[22] Filed: Feb. 6, 1997

Related U.S. Application Data
[60] Provisional application No. 60/011,369, Feb. 9, 1996, and provisional application No. 60/015,512, Apr. 16, 1996.

[51] Int. Cl.: G06F 17/30
[52] U.S. Cl.: 707/5
[58] Field of Search: 707/3, 4, 5, 6

[56] References Cited

U.S. PATENT DOCUMENTS
5,148,541   9/1992   Lee et al.        707/2
5,175,814   12/1992  Anick et al.      345/384
5,301,109   4/1994   Landauer et al.   704/9
5,386,556   1/1995   Hedin et al.      707/4
5,418,948   5/1995   Turtle            707/4
5,418,951   5/1995   Damashek          707/5

OTHER PUBLICATIONS
Liddy et al., “DR-LINK System: Phase I Summary,” Tipster: Proceedings of the First Workshop, conference date Sep. 1993, pp. 93-112, Apr. 1994.
Liddy et al., “DR-LINK Linguistic-Conceptual Approach to Document Detection,” TREC-1, pp. 113-130, Nov. 1992.
Croft, Bruce et al., “Applications of Multilingual Text Retrieval,” Proceedings of the 29th Annual Hawaii International Conference on System Sciences, 1996, vol. 5, pp. 98-107.
Liddy, Elizabeth D. et al., “An Overview of DR-LINK and Its Approach to Document Filtering,” Proceedings of the ARPA Workshop on Human Language Technology, Princeton, NJ, Mar. 21-24, 1993, pp. 358-362.
Liddy, Elizabeth D. et al., “Development, Implementation and Testing of a Discourse Model for Newspaper Texts,” ARPA Workshop on Human Language Technology, Princeton, NJ, Mar. 21-24, 1993, pp. 1-6.
Liddy, Elizabeth D., “DR-LINK: A System Update for TREC-2,” Proceedings of Second Text Retrieval Conference (TREC-2), National Inst. of Standards and Technology, Aug. 31-Sep. 2, 1993, pp. 1-15.
Liddy, Elizabeth D., “Development and Implementation of a Discourse Model for Newspaper Texts,” Proceedings of the AAAI Symposium on Empirical Methods in Discourse Interpretation and Generation, Stanford, CA, Dec. 14-17, 1993, pp. 1-6.
Liddy, Elizabeth DuRoss, “An Alternative Representation for Documents and Queries,” Proceedings of the 14th National Online Meeting, 1993, pp. 279-284.
(List continued on next page.)

Primary Examiner: Jack M. Choules
Attorney, Agent, or Firm: Townsend and Townsend and Crew LLP

[57] ABSTRACT

An information extraction system that allows users to ask questions about documents in a database, and responds to queries by returning possibly relevant information which is extracted from the documents. The system is domain-independent, and automatically builds its own subject knowledge base. It can be applied to any new corpus of text with quick results, and no requirement for lengthy manual input. For this reason, it is also a dynamic system which can acquire new knowledge and add it to the knowledge base immediately by automatically identifying new names, events, or concepts.
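The title and abstract describe a knowledge base built from concept-relation-concept (CRC) triples and queried by the user. The sketch below is only an illustration of that idea under stated assumptions: the `CRCTriple` field names, the relation labels, the sample documents, and the Jaccard overlap score are all hypothetical stand-ins chosen for this example, not the extraction rules or similarity measure of the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CRCTriple:
    """One concept-relation-concept triple (field names are illustrative)."""
    concept1: str
    relation: str
    concept2: str


def jaccard(a: set, b: set) -> float:
    """Overlap between two sets of triples; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_documents(query, knowledge_base):
    """Return (doc_id, score) pairs sorted by descending overlap with the query."""
    q = set(query)
    scored = [(doc_id, jaccard(q, set(triples)))
              for doc_id, triples in knowledge_base.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hand-written, made-up triples standing in for the output of an extraction step.
knowledge_base = {
    "doc1": [CRCTriple("company A", "acquires", "company B")],
    "doc2": [CRCTriple("company B", "located_in", "city C")],
}
query = [CRCTriple("company A", "acquires", "company B")]

ranking = rank_documents(query, knowledge_base)
for doc_id, score in ranking:
    print(doc_id, score)
```

Because the triples are frozen dataclasses, they are hashable and can be compared as sets; a real system would need a softer match than exact equality, which this sketch does not attempt.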
43 Claims, 12 Drawing Sheets

[Representative drawing: block diagram of the processing flow. Document path: Raw Documents, Document Processing, CRC Extraction, CRC to KR Translation, KR Database. Query path: Query, Query Processing, CRC Extraction, CRC to KR Translation, Query KR to Document KR Similarity Measure. External resources: External Conceptual Hierarchy, External Knowledge Base, CRC Extraction Rule Base, KR Format Guideline, External Proper Name Database.]

6,076,088
Page 2

OTHER PUBLICATIONS
Liddy, Elizabeth D. et al., “Text Categorization for Multiple Users Based on Semantic Features from a Machine-Readable Dictionary,” ACM Transactions on Information Systems, vol. 12, No. 3, Jul. 1994, pp. 278-295.
Liddy, Elizabeth D. et al., “Document Retrieval Using Linguistic Knowledge,” Proceedings of the RIAO '94 Conference, Oct. 11-13, 1994, pp. 106-114.
Liddy, Elizabeth et al., “Detection, Generation, and Expansion of Complex Nominals,” Proceedings of the Workshop on Compound Nouns: Multilingual Aspects of Nominal Composition, Dec. 2-3, 1994, Geneva, Switzerland, pp. 14-18.
Liddy, Elizabeth D., “Development and Implementation of a Discourse Model for Newspaper Texts,” Proceedings of the ARPA Workshop on Human Language Technology, Princeton, NJ, Mar. 21-24, 1995, pp. 80-84.
Liddy, Elizabeth et al., “A Natural Language Text Retrieval System With Relevance Feedback,” Proceedings of the 16th National Online Meeting, May 2-6, 1995, pp. 259-261.
Liddy, Elizabeth D., “The Promise of Natural Language Processing for Competitive Intelligence,” Proceedings of 10th International Conference of the Society of Competitive Intelligence Professionals, May 4-5, 1995, pp. 328-342.
Paik, Woojin et al., “Interpretation of Proper Nouns for Information Retrieval,” Proceedings of the ARPA Workshop on Human Language Technology, Princeton, NJ, Mar. 21-24, 1993, pp. 1-5.
Paik, Woojin et al., “Categorizing and Standardizing Proper Nouns for Efficient Information Retrieval,” Corpus Processing for Lexicon Acquisition, MIT Press, Cambridge, MA, 1995 (Boguraev, B. (ed)), pp. 1-10.
Paik, Woojin, “Chronological Information Extraction System (CIES),” Proceedings of the Dagstuhl On Summarizing Text for Intelligent Communication, Saarbrücken, Germany, 1995, pp. 1-5.
Wiener, Michael L. et al., “Intelligent Text Processing, and Intelligence Tradecraft,” The Journal of Association for Global Strategic Intelligence (AGSI), Jul. 1995, pp. 1-8.

U.S. Patent Jun. 13, 2000 Sheet 1 of 12 6,076,088

[Sheet 1 figure: block diagram of a server and clients. Server program and data storage: Document Database and Associated Data (KR Database); Query and Document Processing Engines (CRC Extraction, CRC-KR Translation); Query-Document Similarity Measurer; User Interface SW (Accept Queries, Accept Feedback, Reformulate Queries, Output Results); Resources for Document & Query Processing. Server hardware: Processor(s), Modems and Network Interface, User Interface Input & Output Devices. Client: Processor(s), Program and Data Storage, Modem or Network Interface, User Interface Input & Output Devices (Provide Feedback); Other Clients.]

[Sheet 8 figure: screenshot of a Netscape browser window titled "Chess Demo HomePage" (Location: http://lucifer:8080/clar9.html). Panels: Proper Noun Expansion/Clarification (Rupert Murdoch, Bill Murdoch, Charles Murdoch, William F. Murdoch; HutchVision, HutchVision LTD, HutchVision Channel Services); Word Senses (financial interest in something: "a stake in the company's future"; a pole or stake set up to mark something (as the start of a race track); a vertical post that a victim is tied to for execution by burning; the money risked on a gamble); Point in Time (Past, Present, Future); CHESS Request Preferences (CHRPS) (Analytical Information, Cause/Effect Dimension, Attributed Quotations, Predictions).]
INFORMATION EXTRACTION SYSTEM AND METHOD USING CONCEPT RELATION CONCEPT (CRC) TRIPLES

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority from, and is a continuation-in-part of, the following provisional applications, the disclosures of which are hereby incorporated by reference:

Patent application Ser. No. 60/011,369, filed Feb. 9, 1996, entitled “CHRONOLOGICAL INFORMATION EXTRACTION SYSTEM (CIES),” to Woojin Paik; and

Patent application Ser. No. 60/015,512, filed Apr. 16, 1996, entitled “INFORMATION RETRIEVAL SYSTEM AND METHOD USING CONCEPT-RELATION CONCEPT TRIPLES,” to Woojin Paik and Elizabeth D. Liddy.

The following applications are also hereby incorporated by reference:

Patent application Ser. No. 08/696,701, filed Aug. 14, 1996, entitled “MULTILINGUAL DOCUMENT RETRIEVAL SYSTEM AND METHOD USING SEMANTIC VECTOR MATCHING,” to Elizabeth D. Liddy, Woojin Paik, Edmund S. Yu, and Ming Li (Attorney Docket No. 17704-2.00);

Patent application Ser.

tion access systems such as data mining systems have become commercially available; however, until the present invention, domain-independent question-answering systems still exist only as experimental prototypes.

Both types of systems require a pre-constructed repository of information to find answers to users' questions. Data mining systems commonly utilize statistical procedures to detect patterns in data; users are expected to interpret the patterns to find the answers. The current interests and successful commercial uses of data mining systems are due to the premise that these systems are designed to use the same set of data which is already used by the legacy database management systems.

In comparison, question-answering systems are designed to provide answers directly to users as if they were involved in question-answering sessions with other people. This requires systems to perform complex inferencing to draw answers from organized knowledge bases. Over the years, there has been significant progress in the problem-solving aspect of AI research; however, there are no practical AI applications except the ones which are used in a few narrowly-defined domains. This is due to the lack of practical knowledge bases. Research has demonstrated that building the requisite knowledge bases automatically is extremely time consuming and expensive.

For a number of years, both manual and automatic approaches to constructing knowledge bases have been