Combining Text Structure and Meaning to Support Text Mining

Combining Text Structure and Meaning to Support Text Mining Item Type text; Electronic Dissertation Authors McDonald, Daniel Merrill Publisher The University of Arizona. Rights Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author. Download date 26/09/2021 07:17:32 Link to Item http://hdl.handle.net/10150/194015 COMBINING TEXT STRUCTURE AND MEANING TO SUPPORT TEXT MINING by Daniel Merrill McDonald _________________ A Dissertation Submitted to the Faculty of the COMMITTEE ON BUSINESS ADMINISTRATION In Partial Fulfillment of the Requirements For the Degree of DOCTOR OF PHILOSOPHY WITH A MAJOR IN MANAGEMENT In the Graduate College THE UNIVERSITY OF ARIZONA 2006 2 THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE As members of the Dissertation Committee, we certify that we have read the dissertation prepared by Daniel Merrill McDonald entitled Combining Text Structure and Meaning to Support Text Mining and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy _________________________________________________________ Date: 11/13/2006 Hsinchun Chen _________________________________________________________ Date: 11/13/2006 Jay F. Nunamaker Jr. _________________________________________________________ Date: 11/13/2006 Mohan Tanniru Final approval and acceptance of this dissertation is contingent upon the candidate’s submission of the final copies of the dissertation to the Graduate College. I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement. _________________________________________________________ Date: 11/13/2006 Dissertation Director: Hsinchun Chen 3 STATEMENT BY AUTHOR This dissertation has been submitted in partial fulfillment of requirements for an advanced degree at The University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library. Brief quotations from this dissertation are allowable without special permission, provided that accurate acknowledgment of source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the head of the major department or the Dean of the Graduate College when in his or her judgment the proposed use of the material is in the interests of scholarship. In all other instances, however, permission must be obtained from the author. SIGNED: Daniel Merrill McDonald 4 ACKNOWLEDGEMENTS I thank Dr. Hsinchun Chen for his guidance, for being open to a variety of research areas, and for having my best interests in mind. His feedback and high standards have greatly helped me develop professionally. I wish to thank the members of the Artificial Intelligence Lab. Their friendship and esprit de corps helped lighten the mood during trying times and provide encouragement to meet deadlines and goals. In particular, I would like to thank Byron Marshall for the countless times we bounced ideas back and forth and for his energy that was so contagious. I also want to thank Cathy Larson and Kira Joslin for the friendly way they took care of so many details. I benefited greatly from feedback and insight from multiple professors at the University of Arizona. I wish to thank Drs. Jay Nunamaker, Kurt Fenstermacher, Terry Langendoen, and Mohan Tanniru in particular for their feedback, support, and for being available to students. I also thank Dr. Olivia Sheng at the University of Utah for her encouragement and support in my professional development. Also, I am appreciative of the grants by NSF and NLM/NIH that supported this work. Most of all, I wish to thank my wife Karina McDonald. She was the sanity and balance in my life to help me see this work to its end. Her support and willingness to cover for me so many times allowed me to pursue this dream. Finally, I thank my children Andrew, Blake, Mariah, and Natalie for being an inspiration to me. 5 TABLE OF CONTENTS LIST OF ILLUSTRATIONS……………………………………..9 LIST OF TABLES……………………………………………….10 ABSTRACT……………………………………………………...11 1 INTRODUCTION ........................................................................13 1.1 Prevalence of Textual Data Stores...................................................13 1.2 The Knowledge Discovery Challenge .............................................14 1.3 Text Mining Definition....................................................................15 1.4 Text Mining Background.................................................................18 1.4.1 The Finding Stage .....................................................................18 1.4.2 The Processing Stage ................................................................19 1.4.3 The Analysis Stage....................................................................20 1.5 Text Structure and Meaning in the Processing Stage......................21 1.5.1 The Arizona Summarizer..........................................................22 1.5.2 The Arizona Relation Parser.....................................................22 1.5.3 The Arizona Entity Finder ........................................................23 2 THE ARIZONA SUMMARIZER ...................................................24 2.1 Introduction......................................................................................24 2.2 Literature Review.............................................................................25 2.2.1 Generic Summaries...................................................................26 2.2.1.1 Topic Finding using Document Content............................27 2.2.1.2 Topic Finding using Document Structure..........................28 2.2.1.3 Surface-level Analysis .......................................................28 2.2.1.4 Information Extraction Techniques ...................................30 2.2.2 Query-based Summaries ...........................................................31 2.2.3 Evaluating Summaries ..............................................................32 2.2.3.1 Generic Summary Evaluation ............................................32 2.2.3.2 Query-based Summary Evaluation ....................................33 2.2.3.3 Review of Summaries for Information Seeking ................33 2.2.3.4 Information Seeking Tasks ................................................34 2.2.3.5 Document Context .............................................................35 2.2.3.6 Single Document Context Ignored in Browsing Tools......35 6 TABLE OF CONTENTS – CONTINUED 2.3 Research Questions..........................................................................36 2.3.1 Summarizer Development ........................................................37 2.3.2 Summary Type and Task Experiment ......................................37 2.4 Arizona Summarizer Design............................................................38 2.4.1 Structural Analysis....................................................................40 2.4.2 Sentence and Entity-level Analysis ..........................................42 2.4.2.1 Cue Phrases ........................................................................43 2.4.2.2 Proper Nouns......................................................................44 2.4.2.3 Signature Words.................................................................45 2.4.2.4 Sentence Position in a Paragraph.......................................46 2.4.2.5 Sentence Length.................................................................47 2.5 Arizona Summarizer Extensions .....................................................47 2.5.1 Arizona Full-sentence, Hybrid Summary .................................47 2.5.2 Arizona Snippet, Query-based Summary .................................48 2.6 Research Hypotheses .......................................................................49 2.6.1 Generic Arizona Summarizer Performance..............................49 2.6.2 Summaries of Varying Page-level Context in Information Seeking Tasks .........................................................................................49 2.7 Experimental Design........................................................................50 2.7.1 Arizona Summarizer Experiment: Intrinsic Evaluation ...........50 2.7.2 Using Page-level Context in Information Seeking Tasks.........52 2.8 Experimental Results .......................................................................57 2.8.1 Results of Intrinsic Evaluation Experiment..............................57 2.8.2 Discussion of Intrinsic Summarization Experiment .................59 2.8.3 Experimental Results of Information Seeking Experiment......60 2.8.3.1 Summaries with Browse Tasks..........................................61 2.8.3.2 Summaries with Search Tasks ...........................................62 2.8.3.3 Full-sentence Hybrid Versus Query-based Snippet...........63 2.8.3.4 Overall Summary Performance..........................................64 2.8.3.5 Time Spent on Summaries .................................................65 2.8.4 Discussion of Information Seeking Experiment.......................66 2.8.4.1 Summarization in Context .................................................66 2.8.4.2 Implications for Information Retrieval

Combining Text Structure and Meaning to Support Text Mining

Bulletin of the United States Fish Commission

4 References

Leadership for a Global Workforce: Three International Collaboration Projects

List of Cadet Guard at Execution of John Brown

Glengarry Mcdonalds

Bulletin of the United States Fish Commission Seattlenwf

Chart of the Lower Columbia River by Marshall Mcdonald, Commissioner This Map Was Published in an 1894 Report by the U.S

2020 PCL Regular Season Champs, 2004, 2005, 2015 CPL Champs

Defender of the Gate: the Presidio of San

United States National Museum

Death Certificate Index - Marshall (7-12/1918 & 1933-1939) 5/20/2015

Bulletin of the United States Fish Commission Seattlenwf