IMPLICATIONS OF PUNCTUATION MARK NORMALIZATION
ON TEXT RETRIEVAL
Dissertation Prepared for the Degree of
DOCTOR OF PHILOSOPHY
UNIVERSITY OF NORTH TEXAS
August 2013
APPROVED:
William E. Moen, Major Professor
Brian C. O’Connor, Committee Member
Christina Wasson, Committee Member
Suliman Hawamdeh, Chair of the Department of Library and Information Sciences
Herman Totten, Dean of the College of Information
Mark Wardell, Dean of the Toulouse Graduate School

Kim, Eungi. Implications of punctuation mark normalization on text retrieval. Doctor of Philosophy (Information Science), August 2013, 278 pp., 41 tables, 20 figures, references, 94 titles.
This research investigated issues related to normalizing punctuation marks from a text
retrieval perspective. A punctuation-centric approach was undertaken by exploring changes in
meanings, whitespaces, word retrievability, and other issues related to normalizing punctuation
marks. To investigate punctuation normalization issues, various frequency counts of punctuation
marks and punctuation patterns were conducted using the text drawn from the Gutenberg Project
archive and the Usenet Newsgroup archive. A number of useful punctuation mark types that could aid in analyzing punctuation marks were discovered.
This study identified two types of punctuation normalization procedures: (1) lexical
independent (LI) punctuation normalization and (2) lexical oriented (LO) punctuation
normalization. Using these two types of punctuation normalization procedures, this study
discovered various effects of punctuation normalization in terms of different search query
types. By analyzing the punctuation normalization problem in this manner, a wide range of
issues was discovered, such as the need to define different types of searching, to
disambiguate the role of punctuation marks, to normalize whitespaces, and to index
punctuated terms.
This study concluded that to achieve the most positive effect in a text retrieval
environment, normalizing punctuation marks should be based on an extensive systematic
analysis of punctuation marks and punctuation patterns and their related factors. The results of
this study indicate that there were many challenges due to complexity of language. Further, this
study recommends avoiding a simplistic approach to punctuation normalization. Copyright 2013
By
Eungi Kim
ACKNOWLEDGEMENTS
First and foremost, I want to thank God for giving me the strength and perseverance to finish this journey. I wish to express my sincere gratitude and appreciation to everyone who contributed to this dissertation. In particular, I would like to thank my committee members for their patience and unyielding support. I’m grateful to my dissertation chair Dr. William Moen.
He assisted with every aspect of completing this dissertation. We’ve had numerous detailed discussions, and his thoughtful comments were always useful. My other committee members,
Dr. Brian O’Connor and Dr. Christina Wasson, whom I have known all these years, contributed their time generously in serving on my proposal and dissertation defense. They offered intellectually stimulating feedback, and their valuable comments helped me avoid a number of potentially critical mistakes in this dissertation.
I am also indebted to Dr. Shawne Miksa, Ph.D. Program Coordinator at University of
North Texas. She always provided me with assistance and guidance whenever I needed it. I would also like to thank Dr. Suliman Hawamdeh, Dr. Linda Schamber, and Dr. Victor Prybutok for their roles in granting an extension to complete my Ph.D. dissertation. I also want to thank Malaina
Mobley, the other staff members, and the faculty of the Department of Library and Information Sciences at the University of North Texas for their assistance in various matters.
Finally, this dissertation would not have been possible without the support of my family members. I would like to thank my wife Eunhee for her personal support and great patience.
My two precious daughters, Minhae and Ester, have been patient with me all these years.
My sister Eunsil has been a reliable source of support in my life. I cannot thank my father
Young Jong Kim and my mother Sook Ja Kim enough for their endless love and support. They helped me with everything I needed to continue my education and complete this dissertation.
TABLE OF CONTENTS
Page
ACKNOWLEDGEMENTS ...... iii
LIST OF TABLES ...... vii
LIST OF FIGURES ...... ix
CHAPTER 1 INTRODUCTION ...... 1
    The Notion of Punctuation Normalization ...... 6
    Research Motivation ...... 12
    Theoretical Framework ...... 15
    The Linguistic (Computational) Perspective ...... 15
    Information Retrieval (Text Retrieval) ...... 17
    Information Extraction (IE) ...... 20
    Integrated Theoretical Framework ...... 22
    Research Questions ...... 26
    Research Design and Data Collection ...... 27
    Limitation of the Study ...... 28
    Organization ...... 29

CHAPTER 2 BACKGROUND AND LITERATURE REVIEW ...... 30
    Linguistic and Computational Analysis of Punctuation Marks ...... 30
    Punctuation Mark Categorization ...... 32
    Frequency Analysis of Punctuation Marks and Punctuation Pattern ...... 35
    Punctuation Marks in Text Processing Related Procedures ...... 40
    Tokenization ...... 40
    Indexing and Punctuation Marks ...... 43
    Effectiveness of Search Queries and Punctuation Normalization ...... 45
    Existing Approaches to Punctuation Normalization Practices ...... 47
    Google Search Engine ...... 48
    LexisNexis ...... 52
    Concepts Related to Punctuation Normalization ...... 54
    Character Encoding and Character Normalization ...... 54
    Normalization of Capitalized Words ...... 57
    Punctuation Marks in Forms of Words and Expressions ...... 58
    Abbreviations and Non-Standard Words ...... 58
    Non-Standard Expressions ...... 62
    Chapter Summary ...... 62

CHAPTER 3 METHODOLOGY AND RESEARCH DESIGN ...... 64
    Overview of Research Strategy ...... 64
    The Analysis Setting ...... 66
    Methodological Procedures ...... 73
    Highlights of Methodological Procedures ...... 74
    Tokenization (Segmentation) ...... 75
    Frequency Counts and Punctuation Mark Types ...... 80
    Additional Types of Punctuation ...... 81
    Punctuation Normalization Types ...... 84
    Types of Search Queries and Punctuation Normalization Effects ...... 86
    The Normalization Demonstration ...... 87
    Methodological Limitation ...... 89
    Chapter Summary ...... 90

CHAPTER 4 THE PUNCTUATION FREQUENCY ANALYSIS ...... 92
    Tokenization ...... 93
    Frequency Counts and Punctuation Mark Types ...... 95
    Basic Statistics of Punctuation Marks ...... 95
    Infix-Punctuated Token ...... 98
    Infix-Apostrophe ...... 104
    Infix-Hyphens and Dashes ...... 107
    Infix-Slash ...... 112
    Infix-Periods ...... 115
    Prefix-Punctuation Marks ...... 120
    Suffix-Punctuation Marks ...... 121
    Punctuation-Exclusive Tokens ...... 124
    Variations of Affix-Based Token Type ...... 136
    Paired-Punctuation Marks ...... 136
    Additional Types of Punctuation ...... 142
    Basic Punctuation Patterns ...... 142
    Domain-Specific Punctuation Pattern ...... 143
    Combinational Punctuation Marks ...... 145
    The Use of Punctuation Marks with Numbers ...... 147
    Punctuation Marks in Abbreviation and Non-Standard Words ...... 149
    Punctuation Mark Types Based on Referencing Area ...... 152
    Chapter Summary ...... 158

CHAPTER 5 THE NORMALIZATION STRATEGY ANALYSIS ...... 160
    Punctuation Normalization Types ...... 160
    Punctuation Normalization Operations ...... 161
    Lexical Oriented (LO) Punctuation Normalization ...... 166
    LO Normalization and Orthographic Comparison of Strings ...... 168
    Lexical Independent (LI) Normalization ...... 179
    Search Query Types and the Effects on Punctuation Normalization ...... 181
    Lexical Oriented (LO) Punctuation Normalization Effects ...... 192
    Lexical Independent (LI) Punctuation Normalization Effects ...... 207
    The Punctuation Normalization Demonstration ...... 209
    Punctuation Normalization Framework ...... 218
    Chapter Summary ...... 221

CHAPTER 6 CONCLUSIONS ...... 223
    Limitations of Study ...... 226
    Broader Implications of the Study ...... 228
    Sub-Research Questions and Responses ...... 229
    Contributions ...... 236
    Recommendations for Future Studies ...... 238
APPENDICES ...... 241
REFERENCES ...... 272
LIST OF TABLES
Page
Table 1. Jones' Inter-Lexical Categories of Punctuation Marks (1997) ...... 33
Table 2. Panchapagesan’s List of Non-Standard Words ...... 61
Table 3. Example of Punctuation Mark Types and Tokens ...... 79
Table 4. Possible Interpretation of Figure 3 Scenarios ...... 83
Table 5. Basic Criteria for Analyzing Search Results ...... 88
Table 6. Basic Statistics of Punctuation Marks ...... 96
Table 7. Top 20 Infix-Punctuated Tokens (Gutenberg to Usenet) ...... 99
Table 8. Top 20 Infix-Punctuated Tokens (Usenet to Gutenberg) ...... 101
Table 9. Top 4 Infix Tokens with Least Relative % Differences ...... 103
Table 10. Frequency Count of Apostrophe-s and Z'Z Punctuation Pattern ...... 105
Table 11. Hyphen and Dash Variations in the Usenet and Gutenberg Datasets ...... 109
Table 12. Frequency Counts of Single-Letter-Infix-Slash Token ...... 114
Table 13. Top 10 Tokens Infix-Period in the Usenet Dataset ...... 116
Table 14. A Variation of Ellipses in Affix-Based Token Types ...... 118
Table 15. Text Lines Containing the Variation of Ellipses ...... 119
Table 16. The Top 20 Prefix-Punctuation Marks ...... 120
Table 17. The Top 20 Suffix-Punctuation Marks ...... 122
Table 18. The Top 20 Punctuation-Exclusive Tokens ...... 126
Table 19. Hyphenated Token Categorization Comparison ...... 136
Table 20. Selected Paired-Punctuation Marks without Post-Pre Blank Spaces ...... 137
Table 21. The Most Common Triple Punctuation Marks ...... 146
Table 22. Examples of Numeric Token ...... 147
Table 23. Basic Statistics of Numeric Punctuation Marks ...... 158
Table 24. The Top 20 Tokens with x.x. Punctuation Pattern ...... 150
Table 25. Common Punctuation Mark Operations ...... 162
Table 26. LO Normalization Example ...... 167
Table 27. The Punctuated String Distance (PSD) Example ...... 175
Table 28. The PSD Comparison Example ...... 176
Table 29. An Example of LI Punctuation Normalization Results ...... 181
Table 30. Exact Search Queries and Normalized Character Strings ...... 185
Table 31. Approximate Search Queries and LO Normalized Character Strings ...... 187
Table 32. Infix-Hyphen Normalized Pattern ...... 198
Table 33. Sub-Word Indexing Involving a Compound Word ...... 200
Table 34. Sub-Word Indexing Involving a Non-Standard Word ...... 201
Table 35. Matched Abbreviated Token “B.C.” in the Usenet Dataset ...... 202
Table 36. An Example of LO Normalization Effects ...... 207
Table 37. An Example of LI Normalization Effects ...... 208
Table 38. Examples of Text Segments with Two-Letter-followed by Period ...... 211
Table 39. Examples of Two Letter Abbreviated Words with Periods ...... 212
Table 40. Lines Containing Abbreviated Words with Infix Blank Space ...... 216
Table 41. The Categorization of Punctuation Mark Related Concepts ...... 234
LIST OF FIGURES
Page
Figure 1. Theoretical Representation of Punctuation Normalization ...... 24
Figure 2. The Google Search Result for the Phrase cat -v ...... 50
Figure 3. The Google Search Result for the Phrase cat -v Using the Exact Search Feature ...... 51
Figure 4. Demonstration Dataset Creation Process ...... 68
Figure 5. Major Methodological Procedures and Corresponding Chapters ...... 74
Figure 6. Possibilities of Punctuation Marks in Two Datasets ...... 82
Figure 7. An Example Showing Different Types of Tokens ...... 94
Figure 8. Text Segment Containing Infix Em Dash ...... 112
Figure 9. The Least Frequently Used Infix-Period Tokens in the Usenet Dataset ...... 117
Figure 10. Exemplary Text Lines Containing Double Dashes (Usenet) ...... 128
Figure 11. A Comparison Chart of Punctuation-Exclusive Tokens ...... 129
Figure 12. Low Ranked Punctuation-Exclusive Token in the Gutenberg Dataset ...... 131
Figure 13. Example of Punctuation-Exclusive Token in the Gutenberg Dataset ...... 132
Figure 14. Variation of Paired-Punctuation Marks Based on Post-Pre Spaces ...... 139
Figure 15. An Exemplary Sentence with Punctuation Marks ...... 153
Figure 16. LO Normalization and Orthographic Comparison Environment ...... 169
Figure 17. Conceptual Configuration of LO Normalization Effects ...... 205
Figure 18. Conceptual Punctuation Normalization Effects: A.A. and I.D...... 214
Figure 19. Text Lines Containing A. A. from the Gutenberg Dataset ...... 217
Figure 20. Punctuation Normalization Framework ...... 219
CHAPTER 1
INTRODUCTION
Retrieval of text-based information in any domain depends predominantly on a standard set of letters in a given language. That is, the text retrieval systems responsible for analyzing, processing, and delivering text most often operate on alphanumeric characters. Considering only alphanumeric characters when analyzing corpus-based text is standard practice in developing text retrieval applications. After all, the words used every day are mostly represented by a standard set of letters and numbers.
Yet written information is increasingly represented by more than just alphabetic characters. Since punctuation marks became widely used in English in the 16th century, punctuation mark usage has been constantly evolving (Parkes, 1993; Hitchings, 2011).
Punctuation marks have become an integral part of our written language, and in modern times they appear in almost all types of text. The spread of punctuation marks might even be linked to human evolution, as Holden (2008) pointed out. In light of these aspects, however, the focal point of this research is to determine how punctuation marks can be treated for the purpose of search and retrieval of information.
Let us first establish what punctuation marks are. An earlier English word for punctuation was "pointing" (Nunberg, 1990). Punctuation thus refers to various systems of these points and to all types of glyphs found in a language. A glyph is an image that represents a character or part of a character. The word "mark" is usually attached to denote a physical, written form of punctuation. In this research, punctuation and punctuation marks are used synonymously. In general, punctuation marks include symbols, special characters, orthographic characters, and others.
The use of punctuation marks is language dependent, as marks vary from one language to another in varying degrees (Santos, 1998). Although punctuation marks are essentially similar among Indo-European languages, the focus of this study is limited to English-language texts.
Definitions of punctuation marks can be found in more academic studies. They differ slightly from the conventional understanding of punctuation marks. Nunberg (1990) defined punctuation as a set of non-alphanumeric characters that provide information about structural relations among elements of text. The notion of non-alphanumeric characters is defined later.
Jones (1997) embraced perhaps the broadest notion of punctuation. He argued that punctuation is any facet of the text that is not a letter or number. Similarly, Say and Akman
(1997) defined punctuation marks as a standard set of marks and graphical devices in a language.
According to their definition, the standard set of marks includes the comma, colon, period, dash, etc., while graphical devices include textual features such as paragraphing, listing, and emphasizing with italics.
Incorporating the above-mentioned researchers’ notions of punctuation marks, it is reasonable to consider a punctuation mark as any type of glyph that denotes punctuation, including invisible markup characters such as the spaces between characters. These are language-specific elements, often used in conjunction with alphanumeric characters, that either represent a concept of their own or enhance and clarify written text.
In determining what constitutes a set of punctuation marks, the closest term that matches the punctuation mark is the term non-alphanumeric character. As the prefix “non” suggests,
‘non-alphanumeric character’ simply means something other than alphanumeric characters. A character, which is usually associated with numeric code in computer processing, refers to any
one symbol. See a tutorial on character code issues (Korpela, 2012a) for an extended description
and related terms. To restrict ourselves for operational purposes, we can define alphanumeric
characters as the character sets that are associated with numbers “0” to “9” and letters “A” to “Z”.
This term is not only useful for understanding the notion of punctuation marks, but also for
deciding what constitutes a punctuation mark.
One caveat in understanding the notion of punctuation mark is the issue of blank spaces.
Blank spaces between words are unique in that they are typically not represented by a graphical
mark. Blank spaces disambiguate the meaning of characters. Functionally, blank spaces visually
indicate boundaries between certain structural segments of the written language. In a simple
linguistic sense, they often signify a "pause" between words.
In defining punctuation, Jones (1997) takes the most liberal notion of a punctuation mark and
includes blank spaces as punctuation, which he refers to as "inter-whitespaces". Whether blank spaces should be considered punctuation, however, is generally debated. Nunberg
(1990) does not specifically state that the blank space is a punctuation mark. It could be argued that blank spaces should not be considered just "empty" despite the fact that they appear to be.
From a computational perspective, blank spaces correspond to characters in a given character set. For example, there is the American Standard Code for Information Interchange
(ASCII) code for spaces, line breaks, etc. (Code for Information Exchange, 1968). More specifically, a blank space will correspond to ASCII decimal #32, which is labeled as a space in
Appendix A.
Blank spaces play a special function in the text and need an independent treatment in most cases. To operationalize the notion of punctuation mark and to define other related terms based on the notion of punctuation marks, it is inconvenient to include blank spaces. For this
reason, blank spaces will be excluded from the list of punctuation marks and will instead be
referenced specifically in our discussion.
Punctuation marks have become an increasingly integral part of text-based information. For example, the word C++ is a more commonly used form than the
word cplusplus. In this example, the two plus signs ++, which we refer to as punctuation marks according to our earlier definition, are used in conjunction with the alphabetical character C to
form another meaning. Take another example. The word café contains a diacritic – a glyph
added to the letter e. Although the word café is French, it is not uncommon to find this word in
English text with the presence of a diacritic mark. We may refer to the diacritic as a punctuation
mark according to our operational definition.
For another example of punctuation marks, consider the following simple text segment:
Professor Richard J. Evans (2001). “The Two Faces of E.H. Carr”. History in Focus, Issue 2: What is History?. University of London.
This short text segment contains multiple punctuation marks: quotation mark, question mark,
colon, period, parentheses, and comma. Strictly speaking, italics here can be considered as
punctuation marks according to Jones (1997) and Say & Akman (1997). However, italics are
excluded from our definition of punctuation marks for the purposes of this research. The above
text segment contains different types of punctuation marks that serve different purposes. Moreover,
because punctuation marks appear in conjunction with alphanumeric characters, words can
appear with various punctuation marks in a text. For example, consider the word computer. In a
given text-based content, many different types of punctuation marks may surround the word
computer. For example, we may find:
• computer-
• computers)
• computer.
• computer-based
• [computer]
This research refers to these types of words to which punctuation marks are attached as a
punctuated character string. For this study, we will define a character string as a continuous sequence of characters. Similarly, a punctuated character string is a character string that contains at least one punctuation mark. For instance, the word C++ is a punctuated character string. In the above example, the word computer can be also regarded as a term in a text retrieval system.
Similar to the notion of a "word", a term in this research is defined as a word that appropriately describes any element of a subject matter. A lexical item is defined as a word or a series of words representing a single unit of sense or meaning. A blank space is an "empty"-looking character that can be part of a character string, but its purpose is to demarcate the beginning and end of a word.
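Under this operational definition, whether a token is a punctuated character string can be checked mechanically. The following Python sketch is illustrative only; the regular expression and function name are not drawn from the dissertation:

```python
import re

# Operational definition from the text: alphanumeric characters are
# the digits 0-9 and the letters A-Z (upper or lower case); anything
# else appearing in a token counts as a punctuation mark here.
NON_ALNUM = re.compile(r"[^0-9A-Za-z]")

def is_punctuated(token: str) -> bool:
    """Return True if the token contains at least one punctuation mark."""
    return bool(NON_ALNUM.search(token))

examples = ["computer", "C++", "computer-based", "[computer]", "computers)"]
punctuated = [t for t in examples if is_punctuated(t)]
# "computer" is the only plain (unpunctuated) character string above
```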
Given that such terms are indexed by a text retrieval system, the system’s objective is to match the indexed character strings against a user’s search query expression. Users of a text retrieval system typically use search terms as a means to describe their information need. One of the prerequisites of a text retrieval system is to determine optimal ways to retrieve relevant information based on a user’s search query. With this overarching goal in mind, the focus of normalization is to standardize variants that are found in text.
An important term used in this study is "normalization". In general terms, according to the Merriam-Webster Online Dictionary (Normalization, 2012), to normalize means to make something conform to, or reduce it to, a norm or standard. Even within computer processing, the term is used in different ways depending on the type of object being normalized. For
example, text normalization generally means standardizing terms so that terms with the same
meaning but with different variations can be effectively retrieved by a user (Baeza-Yates, 2004).
If one is referring to the term in the context of normalizing data, the definition of normalization is entirely different. Database normalization refers to the process of efficiently organizing data in a database with the goal of reducing redundancies in a database system (Date, 2004). In this
study, only the first definition that deals with the context of text retrieval setting is directly
applicable. More specifically, the primary interest of this study is in normalizing text elements
that contain punctuation marks.
The Notion of Punctuation Normalization
Let us establish terminology that is associated with normalizing punctuation marks.
Punctuation normalization is a new term that is defined for this research. Punctuation normalization is a type of text normalization process that is intended to normalize punctuation marks. This term enables us to focus primarily on the issues regarding punctuation marks in the text retrieval context. To be considered as a punctuation normalization operation, two basic conditions must be met:
1. The text processing purposefully targets only punctuation marks or punctuated strings. Text processing, in this case, means performing any type of operation on a given text.
2. The normalization procedure must include operations that explicitly or implicitly remove, add, or substitute punctuation marks.
Typically, punctuation normalization operates prescriptively based on custom developed algorithms or rules. These algorithms or rules target and determine how punctuation marks should be transformed from one form to another. For instance, a punctuation normalization procedure might include an explicit specification to drop the internal hyphenation and convert a character case from one form to another. As a result, the character string All-you-can will be
transformed to all you can. In this case, a number of specific arrangements can be devised to drop hyphens and change the capitalized character string. These are as follows:
• Convert the uppercase to lower case for all occurrences and drop the hyphens
• Convert the letter A to lower case a and drop the hyphens
• Convert all the incidences of the character string All to all and drop the hyphens
• Convert the word All-you-can to all you can
• Convert the character string All-you-can to all you can only if certain specified conditions are satisfied
The above options are not the only ones that can be used to normalize the character string
All-you-can since other algorithmic procedures can be imagined. All of the above statements that specify operations to change from one form to another consist of specific rules that affect punctuation marks. Such rules are referred to as punctuation normalization rules in this study.
More specifically, all of the punctuation rules above can produce a normalized expression by converting the character string All-you-can to all you can. Notice that the last option in the above example is slightly different since the normalization occurs only if some specified condition is satisfied (e.g., normalize only if the string length is greater than 3).
In the above case, any operation that results in removing the hyphen is considered as punctuation normalization. Considering the earlier example of the term computer, if punctuation normalization is applied to the list of related terms, then we may end up with just the character string computer. In this fashion, every instance of information relevant to the search term and index term computer can be effectively retrieved since they match each other. Changing a capitalized character string is not considered as punctuation normalization in this study; it is considered a type of text normalization.
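As a rough illustration, the first and last rule variants above might be sketched as follows. The function names are illustrative, and the length-3 condition mirrors the example given earlier:

```python
def normalize_unconditional(s: str) -> str:
    """Drop internal hyphens (replacing them with spaces) and lowercase
    all occurrences, as in the first rule variant."""
    return s.replace("-", " ").lower()

def normalize_conditional(s: str, min_length: int = 3) -> str:
    """Apply the same transformation only when a specified condition
    holds, as in the last rule variant (here: length > min_length)."""
    if len(s) > min_length:
        return s.replace("-", " ").lower()
    return s

assert normalize_unconditional("All-you-can") == "all you can"
assert normalize_conditional("A-1") == "A-1"  # too short; left unchanged
```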
In this study, we use the notion of punctuation normalization in reference to a specific
subtype of the text normalization problem. However, it is often inter-mixed with other text
normalization procedures. In the above example, changing the capitalized A to lower case a
requires a text normalization procedure. Often such a type of text processing that involves
converting capitalized character strings might be specified in conjunction with a punctuation
normalization procedure. Within the framework of text normalization, punctuation marks
are normally removed for the purpose of reducing unnecessary character string variants. In
certain instances, the definition of punctuation normalization can be slightly different from the
definition of text normalization. In such cases, the notion of punctuation normalization may be
further clarified.
There are cases where punctuation normalization could be less desirable. In certain cases,
punctuation marks can be regarded as critical elements that create a special meaning in a
character string (e.g., the word C++). Because of the varying roles punctuation marks play, a
closer investigation is required to consider various forms of character strings. This is especially
true for a text retrieval system; these systems are required to process a user’s search query by
taking the user’s search terms and matching those to indexed terms in the system. For example,
common words such as C++ or AT&T may appear in a user’s search query. Without two plus
signs ++ and the ampersand &, the meaning of these words changes, and a greater number of inaccurate, irrelevant search results is likely to be produced.
In the context of punctuation normalization, two important types of punctuation marks discussed in this study are:
• Lexical independent (LI) punctuation marks
• Lexical oriented (LO) punctuation marks
The LI punctuation marks are those that have independent functional roles in a given sentence. Since LI punctuation marks have specific word boundaries, normalizing them does not affect the meaning of the surrounding words. Examples of LI punctuation marks include commas, semicolons, full stops, quotation marks, and others.
On the other hand, LO punctuation marks are those whose functional roles are tied to individual lexical items in a sentence. Characteristically, their functional roles are localized to within two or three words. Normalizing LO punctuation marks therefore often alters the meaning of the word associated with the mark. LO punctuation marks generally include the plus sign, dollar sign, hyphen, percent sign, dot, etc.
Every punctuation mark falls into one of these two types: LI or LO. The category depends on the mark’s specific role in a sentence or a lexical item. For this reason, the same punctuation mark can be categorized as LI or as LO depending on the situation. For example, a comma is usually considered an LI punctuation mark, but it can also be categorized as an LO punctuation mark (e.g., the comma in 1,000). These two types are important since they may indicate the appropriate type of punctuation normalization in the context of text retrieval. As discussed in Chapter 4, the exact determination of LI or LO punctuation marks requires consideration of other factors.
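The context dependence of the comma example can be illustrated with a toy classifier. This is only a sketch; the dissertation’s actual determination of LI versus LO involves further factors discussed in Chapter 4:

```python
def classify_comma(text: str, index: int) -> str:
    """Classify the comma at text[index] as LI or LO.
    A comma flanked by digits (as in 1,000) is treated as lexical
    oriented (LO); otherwise it is treated as lexical independent (LI).
    This single heuristic is illustrative, not the study's full method."""
    assert text[index] == ","
    before = text[index - 1] if index > 0 else ""
    after = text[index + 1] if index + 1 < len(text) else ""
    return "LO" if before.isdigit() and after.isdigit() else "LI"

assert classify_comma("over 1,000 items", 6) == "LO"  # numeric comma
assert classify_comma("red, green", 3) == "LI"        # list comma
```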
In a text retrieval environment, punctuation normalization can occur in two critical places: on the search query side and on the system’s content side. An example of punctuation normalization on the query side is a transformation of the search terms entered by a user. On the content side, where the text is stored and processed, punctuation normalization can occur as part of text processing. Often, search and retrieval of a punctuated character string
such as C++ depends on the text retrieval system’s matching function, which attempts to match the search terms with the index terms, assuming an index was created by the system. In this case, the search terms and the index terms will be normalized to varying degrees, taking the search and retrieval functions into consideration. Differences between the normalization rules on the two sides may lead to less than optimal retrieval.
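Why the two sides must normalize consistently can be sketched as follows. Here strip_punctuation is a hypothetical normalization rule, not one proposed in this study:

```python
def strip_punctuation(s: str) -> str:
    """A hypothetical content-side rule: keep only alphanumeric
    characters and lowercase the result."""
    return "".join(ch for ch in s if ch.isalnum()).lower()

# Content side: terms are normalized before indexing.
indexed_terms = {strip_punctuation(term) for term in ["C++", "AT&T", "computer"]}

# If only the content side normalizes, the raw query "C++" misses:
assert "C++" not in indexed_terms
# If the query side applies the same rule, the match succeeds:
assert strip_punctuation("C++") in indexed_terms
```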
From the standpoint of improving text retrieval, a number of procedures required in the course of designing a text retrieval system must be carefully considered. One of the main procedures is the transformation from one string to another, both for indexing and for retrieving content relevant to the user’s information need. If the user is not aware of the details of such transformations, the user will not understand why certain results occur.
Essentially, depending on the type of search scenario, punctuation normalization choices may lead to better or worse search results. How one should treat punctuation marks also depends on other factors, such as the type of search query (e.g., phrase search) and the type of matching function (e.g., exact match), that a designer of a text retrieval system must also consider.
Consider a situation where a user wishes to find product information based on a product number. The user decides to search the document containing the character string x.22 to find relevant information. However, the user discovers that many variations were retrieved and becomes perplexed by how the retrieval system behaves in retrieving such punctuation marks.
For example, a text retrieval system produces search results that contain the following terms:
• x 22
• x-22
• x22
• x/22
• 17 x 22
• x.22.52
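One way such behavior can arise is a simplistic normalization rule that maps several of the variants above to the same index term. The rule below is a hypothetical illustration, not the behavior of any particular system:

```python
import re

def naive_normalize(s: str) -> str:
    """A deliberately simplistic rule: replace every punctuation mark
    with a space, collapse runs of whitespace, and lowercase."""
    return re.sub(r"\s+", " ", re.sub(r"[^0-9A-Za-z]", " ", s)).strip().lower()

variants = ["x.22", "x 22", "x-22", "x/22", "x.22.52", "17 x 22"]
normalized = [naive_normalize(v) for v in variants]

# The first four variants collapse to the same index term, which is
# why a search for x.22 would retrieve all of them under this rule.
assert normalized[:4] == ["x 22"] * 4
```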
Similarly, a user might want to retrieve only sentences that contain the character string
X:Y algorithm. That is, the user might desire to match the character string exactly as the search query appears, without retrieving the following string variants:
• XY algorithm
• X&Y algorithm
• (X,Y) algorithm
• X - Y algorithm
• “X Y algorithm”
• X,Y,algorithm
As can be seen from the above examples, one of the main issues is that none of these procedures and decisions are apparent. Their effects are not well understood from either the designer’s or the user’s perspective. From a user’s perspective, a matching process that is transparent and easy to understand might be important. From a designer’s perspective, attempts to satisfy the information needs of such a user lead to different punctuation normalization choices, and each choice is linked to different effects on text retrieval.
This type of problem, however, has not been systematically approached. The various issues, or factors, mentioned above are linked together and present a complex set of decisions that may affect the outcome of text retrieval processes. At present, we know very little about the types of punctuation normalization that are desirable and about the effects of punctuation normalization on search and retrieval.
Research Motivation
From a text retrieval perspective, a research opportunity was clearly noticeable. Text retrievability issues that are closely associated with punctuation marks have not been closely examined in the past. While there has been an overwhelming amount of research focused on various text retrieval techniques, including normalization, stemming, etc., most text retrieval related research focuses only on alphanumeric character strings.
Traditionally, text-based retrieval systems have viewed punctuation as extraneous, since it can interfere with identifying words that appear stylistically different but have the same or similar meaning. It is true that in certain situations punctuation marks can be viewed as “noise” that should be eliminated. Although punctuation marks can play a critical role in disambiguating words in a text, from a text retrieval viewpoint punctuation marks have generally been regarded simply as “extraneous” characters.
Based on this view, Brooks (1998) concluded that inconsistencies in treating and processing queries with punctuation marks seemed unavoidable in the realm of IR systems.
Many types of successful text retrieval systems are in existence today, but some users of text retrieval systems may already have noticed that such systems sometimes inadequately process search queries that contain punctuation marks.
One of the problems with normalizing text is that often modifying subtle cues can change the original sense or the intended meaning of the words. While reducing text variants by removing punctuation marks in general is necessary as it often increases retrieval performance, this is not true for all cases due to the side effects of punctuation normalization. Even though a wide range of subtle problems are associated with punctuation marks, punctuation normalization remains a neglected problem.
Meanwhile, other interesting approaches have utilized punctuation marks in text retrieval related applications. Such approaches include works by Mikheev (2000); Sproat et al. (2001); and Kaufmann and Kalita (2010). One of the benefits of punctuation marks is that they can provide additional cues that are hidden in a text. In essence, many small elements within any given text can carry subtle cues to distinguish one type of meaning from another.
Despite this benefit, attempts to explore punctuation marks have been limited from a text retrieval perspective. To increase search and retrieval effectiveness, a sufficient amount of analysis on punctuation marks should be undertaken. As will be shown in the next chapter, punctuation marks can aid in reducing ambiguity as textual elements. Yet, without an empirical investigation, it is difficult to determine how punctuation marks or their normalization can improve search and retrieval effectiveness and what normalization rules to use with existing text retrieval theories.
In a nutshell, prior research has not sufficiently focused specifically on punctuation mark issues and punctuation mark normalization. Brooks (1998), in particular, included some punctuated character strings in his analysis. However, Brooks’ analysis only pointed out some of the complicated problems that are associated with punctuation marks. In addition, his analysis did not consider various relevant factors and did not adequately explore the punctuation normalization problem.
The primary motivation of this study is to determine whether there is a methodologically justifiable way of normalizing punctuation marks for the purpose of improving text retrieval systems. As of now, only fragmentary methods exist specifically for the text retrieval environment. One of the desirable features that seemed to be missing was a sound methodological approach to analyzing punctuation marks. Basically a sound approach should be
able to distinguish and utilize punctuation marks and punctuation patterns, and thereby help in assessing the importance of punctuation marks and punctuation patterns in terms of punctuation normalization.
For the purpose of examining punctuation marks, defining a notion of punctuation pattern can be useful. A punctuation pattern in this study is a term used to indicate any type of recognizable pattern formation that involves at least one punctuation mark. This term can be used rather loosely to include broad types of punctuated patterns, although a specific type of punctuation pattern was named and identified in this study. From a theoretical perspective, determining some sense of “meaningfulness” of punctuation patterns in a text retrieval context is the key issue that has not been sufficiently addressed in previous studies.
In general, this research can be an important step toward understanding the complexity of the normalization process. This research was conducted to provide insights into understanding punctuation marks and punctuation normalization. In particular, this study intended to examine the role of punctuation marks using real-world datasets so various factors that influence punctuation normalization can be examined.
In a larger sense, punctuation mark issues that are relevant to text retrieval are explored, and this will help to understand punctuation normalization related problems. This study mostly deals with investigating and analyzing normalization implications of punctuation marks from the perspective of text retrieval. Toward the end of this study, in Chapter 5, we focused on the complexity of punctuation normalization. We have provided what seems to be the most useful information to understand the effects of punctuation normalization. Some of the findings of this research should be applicable beyond the punctuation normalization problem by shedding light on text normalization and other related aspects of the text retrieval system as a whole.
Theoretical Framework
Theories developed in related disciplines provide the theoretical basis for this study,
which examined punctuation normalization issues in the context of text search and retrieval. To
develop a theoretical framework for this research, some essential components of existing
relevant perspectives from different fields of study were incorporated. The role of punctuation
marks and the application of punctuation marks span three related theoretical disciplines:
• Linguistics (computational)
• Information retrieval (text retrieval)
• Information extraction (IE)
The Linguistic (Computational) Perspective
The uses of punctuation marks have been examined in the discipline of linguistics, notably by Say and Akman (1997, p. 1). The original purpose of punctuation marks in a written
language was to provide additional meaning to minimize ambiguity of a unit of text (e.g., words,
phrases, paragraphs, etc.). To find linguistic patterns of punctuation marks, some computational
linguistic approaches have been applied.
In a broader view, the problem of punctuation marks also touches upon the category of orthography. Orthography in general is a type of language study that is concerned with letters and spelling, representation for word boundaries, stops and pauses in speech, and tonal inflections. Moreover, orthography is “the set of rules for using a script in a particular language
(i.e., the English orthography for the Roman alphabet) like symbol-sound correspondences, capitalization, hyphenation, punctuation, and so on.” (Cheng, n.d., p.2).
A particularly noteworthy aspect of linguistic studies is that punctuation marks indicate the structure and organization of written language. Some linguistic studies suggested that
punctuation marks play diverse roles in a language. Many interesting aspects of punctuation marks were discovered by scholars such as Nunberg (1990), Doran (1998), and Briscoe (1994).
Nunberg (1990), in particular, conducted an extensive syntactical analysis based on “separating” and “delimiting” punctuation marks. Delimiting punctuation marks (e.g., parentheses) are used in pairs, while separating punctuation marks (e.g., the comma, the semicolon) are used to separate different segments of a sentence. The efforts of previous researchers focused mostly on discovering the syntactic roles of punctuation marks.
In particular, computational linguistic theories have discovered some important aspects related to punctuation marks. Jones (1997), in particular, investigated a computer-based approach to examining the roles of punctuation marks. The significance of his work is that he used real world corpora to examine punctuation marks. He carried out practical experiments based on some of the previous linguistic theories and demonstrated that the roles of punctuation marks are much more diverse than previously thought.
Individual roles of punctuation marks can be differentiated depending on their syntactic roles in a language, as Jones (1997) pointed out. For example, a period (ASCII code decimal 46 in Appendix A) can be used within an abbreviated character string or as an end-of-sentence marker (full-stop). Determining the roles of individual punctuation marks and punctuation patterns is crucial to understanding different types of punctuation normalization effects. From this study’s point of view, this is the contributing component of the computational linguistic perspective.
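The two roles that Jones distinguishes for the period can be approximated with a toy classifier. The heuristic below is an invented illustration, not Jones’s method: it inspects only the characters immediately around a period, and it therefore misjudges many real cases, which is precisely why the role of a punctuation mark warrants careful analysis.

```python
def classify_period(text: str, pos: int) -> str:
    """Toy heuristic for the role of the period at index `pos`:
    - flanked by word characters (as in 'B.C.' or 'x.22'): internal mark
    - followed by end-of-text, or by a space and a capital: full stop
    - anything else: ambiguous, needs deeper analysis
    """
    assert text[pos] == "."
    prev_ch = text[pos - 1] if pos > 0 else ""
    next_ch = text[pos + 1] if pos + 1 < len(text) else ""
    if prev_ch.isalnum() and next_ch.isalnum():
        return "internal"
    if next_ch == "" or (next_ch == " " and text[pos + 2:pos + 3].isupper()):
        return "full-stop"
    return "ambiguous"

print(classify_period("x.22", 1))                # → internal
print(classify_period("It rains. The end.", 8))  # → full-stop
print(classify_period("B.C. was long ago.", 3))  # → ambiguous
```

The third call shows the weakness: the final period of an abbreviation followed by a capitalized word defeats purely local rules, motivating the context-sensitive approaches discussed under information extraction.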
Information Retrieval (Text Retrieval)
A research discipline with which this study directly associates is information retrieval
(IR). The central issue of IR is locating information within computerized information systems.
Manning et al. (2008) defined IR as finding material of an unstructured nature that satisfies an information need from within large collections stored on computers. Jones and Willett (1997) defined IR as a subject field covering both the representation and the retrieval sides of information. Jansen and Rieh (2010) provided a similar definition. They defined IR as a process of representing, storing, and finding information objects (e.g., multimedia, documents, etc.) with information technology.
Since this research focuses on text retrieval, which is a branch of IR, some aspects of text retrieval are identical to those of IR. For example, in IR systems, the issues of indexing, ranking, and similarity are common regardless of the type of information objects. The major difference between the two notions is that text retrieval focuses on issues related to retrieving only text-based content, while IR deals with issues related to retrieving information beyond text-based content (e.g., videos, images, sound recordings, etc.). By focusing on text retrieval issues alone, some important aspects of IR are incorporated in developing a theoretical framework for this study.
Key terminology associated with information retrieval needs to be clarified first. A text retrieval system is a computer-based information tool that supports a specific user’s text-based information needs. Text retrieval systems can be associated with a wide range of information sources such as online, automated, digital, and bibliographic data. The term retrieval will be used in the context of extracting information out of text-based content; thus, this term denotes retrieval system functions. From a user’s standpoint, a typical user who needs information issues a search using a search query. As opposed to denoting a system’s function, the term search query is used to denote a formulated query that represents a user’s information need.
Central to the notion of text-retrieval is text processing. Text processing includes various
text related operations such as lexical analysis, elimination of stop words, index terms selection,
thesauri construction, and text normalization (Baeza-Yates & Ribeiro-Neto, 2011). Discussions of these topics are surveyed in the literature review. The specific text processing operations depend on the functionality of the underlying IR system and the underlying IR model that the system supports.
From a text retrieval perspective, text normalization is just one type of text processing.
Regardless of the type of text normalization, typical text processing techniques include
procedures to identify and decompose text into different types of segments: word, sentence,
paragraph, etc. As pointed out earlier, since punctuation normalization is a type of text
normalization, punctuation normalization can be viewed as a part of overall text processing
operations.
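As a small illustration of this point (the example sentence and token counts below are invented for demonstration), whether punctuation marks are kept or replaced with whitespace during text processing changes how a line of text segments into candidate index terms:

```python
import re

sentence = "State-of-the-art systems index x.22 and B.C. differently."

# Tokenize with punctuation left intact vs. replaced by spaces.
kept = sentence.split()
replaced = re.sub(r"[^\w\s]", " ", sentence).split()

print(len(kept), kept)          # 7 tokens: 'x.22' and 'B.C.' stay whole
print(len(replaced), replaced)  # 12 tokens: 'x', '22', 'B', 'C', ...
```

The same input thus yields different index vocabularies under the two treatments, which is why punctuation normalization cannot be considered in isolation from segmentation and indexing.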
For the most part, language is an inseparable component of searching and retrieving information. From the search side, the user's information needs must be formulated into a query that is essentially language based. At the same time, on the text processing side, underlying text must be cleaned, normalized, and indexed for retrieval. From a system’s point of view, truly understanding the user’s search request is difficult because of the complexity of human language.
While language provides users with an enormous ability to specify information needs, most often
systems have difficulties in understanding human languages.
In the early days of IR research, the fundamental language related problem that is
associated with estimating a user's information need for content in a document was poignantly
described by Blair and Maron (1990). They argued that since information can be described in so
many different ways using a natural language, it is difficult for a system to figure out correct
terms to retrieve the relevant documents. According to these authors, the number of different
ways in which an author could write about a particular subject was unlimited and unpredictable.
As an exemplary case, they pointed out that an event such as an “accident” from one person’s
point of view could be described as an “unexpected event,” an “unfortunate occurrence,” an
“untimely incident,” etc.
Another fundamental IR problem that is related to language is pointed out by Brooks
(1998). He contended that orthography is a major impediment to online text retrieval systems.
His point was that orthography, which includes punctuation marks, is a fundamental issue for
text retrieval systems. He argued that text retrieval related research cannot make significant
progress without solving some of the orthography issues. In effect, he demonstrated that
normalizing orthographic elements can have noticeable effects on IR systems.
The problem of retrieving textual elements that are related to punctuation marks seems, in some ways, to be even more interesting than Brooks pointed out. The sheer variety of punctuation marks requires different types of normalization. As noted earlier, the period at the end of a sentence can generally be removed for text retrieval purposes. Let us refer to a period as a full-stop if it is used to mark the end of a sentence; otherwise, it is referred to simply as a period, although the character is officially labeled a full-stop (ASCII code decimal #46) in Appendix A. Each punctuation mark typically requires specific normalization rules. At the same time, punctuation marks occasionally carry important meaning that should not simply be disregarded. From IR theorists’ point of view, assessing the types of “meaningfulness” of punctuation marks with respect to punctuation normalization can be very useful.
Information Extraction (IE)
Another influential area of study that is highly pertinent to punctuation normalization is
IE. IE can be considered a subfield of IR that focuses on extracting useful information such as entities, relationships between entities, and attributes describing entities. IE complements IR
systems, and the area of IE has emerged as a research area that closely influenced the field of IR
(Moens, 2006).
Using an IE technique to determine relevant information can be particularly beneficial.
IE techniques are often employed where applications need to find correct information based on
relationships among text elements. IE can be applied to information processing related tasks
such as indexing, parsing, sentence boundary disambiguation, extraction of named entities (e.g.,
people, places, and organizations), and identification of relationships between named entities
(Sarawagi, 2008). IE techniques rely on analyzing text in an attempt to exploit underlying
semantic clues.
For example, an abbreviated string B.C., in which each letter is separated by a period,
could stand for entirely different notions such as British Columbia, Before Christ, and Boston
College. By examining various elements in the text, an IE algorithm might be used to correctly
identify whether the abbreviation is semantically equivalent to one of the following:
• British Columbia
• Before Christ
• Boston College
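A context-cue lookup gives the flavor of such an IE step. Everything below (the cue words, the expansion table, and the matching rule) is a hypothetical sketch for illustration, not an algorithm drawn from the IE literature:

```python
# Hypothetical cue words for each candidate expansion of "B.C.".
CUES = {
    "British Columbia": {"vancouver", "canada", "province"},
    "Before Christ": {"century", "bce", "roman", "year"},
    "Boston College": {"university", "campus", "students"},
}

def expand_bc(sentence: str) -> str:
    """Pick the expansion whose cue words overlap the sentence;
    leave 'B.C.' unresolved when no cue is present."""
    words = {w.strip(".,;").lower() for w in sentence.split()}
    for expansion, cues in CUES.items():
        if words & cues:
            return expansion
    return "B.C."

print(expand_bc("He moved to Vancouver, B.C., in western Canada."))
# → British Columbia
print(expand_bc("The wall was built in the 2nd century B.C."))
# → Before Christ
```

A real IE component would use trained models or far richer rules, but even this sketch shows how context beyond the punctuated string itself resolves what the abbreviation denotes.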
On the other hand, in a text retrieval environment, if the term B.C. is used in a search
query, an ordinary text retrieval system may retrieve only strings that are orthographically
similar. Unlike the above case, in this case a text retrieval system might attempt to match strings
in a search query against any of the following character strings:
• B.C.
• BC
• -B.C.-
What sets the above task apart from a typical IR system is the following: An IR system
may only retrieve different patterns of the term B.C. and relies on the user to make an ultimate
decision, while an IE based application or component might be able to employ additional
procedures to correctly identify relevant terms. The problem of identifying information units such as people, organizations, location names, and numeric expressions is known as named entity recognition (Grishman & Sundheim, 1996; Nadeau & Sekine, 2007).
As pointed out in this example, the difference between IR and IE is that IR focuses on retrieving a subset of documents in a corpus while IE focuses on identifying particular features of information within a document. IE often uses predefined, extracted type of information and represents specific instances found in the text. Cowie and Wilks (2000) stated that IE is more interested in the structure of the texts, whereas IR views texts as just bags of words.
The techniques and success of IR and IE applications vary depending on the type of content. For example, Vlahovic (2011) pointed out that if the content is natural language, the success of IE is limited due to requiring exhaustive pattern-matching algorithms. He further pointed out that IE from unstructured content does not guarantee accuracy. Despite these
differences, IR and IE techniques are complementary to each other and should be combined to
build powerful text processing tools as suggested by Gaizauskas and Robertson (1997).
The separation of the two areas becomes less distinct if text processing techniques in each
area are combined. In some ways, it may be desirable to incorporate IE techniques into an IR system; such an experiment was carried out by Gaizauskas and Robertson (1997). Sarawagi (2008) pointed out that in addition to classical rule-based methods, statistical and hybrid methods have been developed. Assuming that some type of coupling of IE and IR is desirable, it is necessary to examine IE related procedures that process or utilize punctuation marks.
In the case of utilizing punctuation marks, some previous research provided practical methods to reduce the ambiguity of textual elements. Punctuation marks have sometimes been vital elements that IE uses to disambiguate text. For example, Mikheev (2000) attempted to determine whether a period marks the end of a sentence or an abbreviated character string. For character strings that one wants to disambiguate, IE can further aid in determining the specific technique that should be applied.
Integrated Theoretical Framework
To provide a theoretical basis for conducting this research, various concepts surrounding
punctuation marks from the relevant areas (linguistic, IR, and IE) are used in this study. These
research areas not only provide operational means to normalize punctuation marks but also
provide opportunities to utilize the knowledge that comes along with punctuation marks. In
particular, a multi-disciplinary view can be especially useful in the process of finding
implications of punctuation marks and punctuation normalization effects on text retrieval.
However, it is difficult to draw clear boundaries, as there are overlapping areas among the different theories. In spite of the distinguishable concerns of each research area, there is interaction among different concepts and theoretical disciplines. For this reason, in building this study’s own theory, multiple perspectives from different fields of study are taken into account. By integrating the most relevant perspectives and methods into the study’s research framework, this study assumes that the necessary details can be established for observing the punctuation normalization phenomena.
Different aspects of punctuation marks will be considered to gain insights into the punctuation normalization problem. An integrated view derived from the above-mentioned perspectives (linguistics, IR, and IE) can aid in understanding the complexity of normalizing punctuation marks. The effects of normalizing punctuation can be simple in some cases but not obvious in others; one often cannot comprehend all the competing factors at a glance.
In addition, collectively considering IR, IE, and computational linguistics is important to answer the following overarching question: how and to what extent are punctuation marks essential in terms of text retrieval? In assessing their degree of importance, various perspectives from IR, IE, and computational linguistics need to be considered. After all, the ultimate goal of the research is to contribute knowledge about punctuation marks for the purpose of increasing search and retrieval effectiveness. Combining the best elements from this group of theories enables us to reach this goal.
In this research, an attempt is made to develop a unique theoretical framework based on relevant previous works. The theories that can contribute to building a theoretical framework are summarized as follows:
1. Linguistic theories, especially from the computational side, can offer a characterization of punctuation marks and, thereby, the ability to categorize punctuation marks and punctuation patterns.
2. IR theories address issues of relevance with the goal of satisfying users’ information needs. That is, they often address factors and effects of punctuation normalization from the point of view of delivering pertinent information to a user.
3. IE can bring expertise in systematically analyzing punctuation marks for the purpose of distinguishing and identifying elements that essentially have the same meaning. IE also offers methods to analyze punctuation marks and the characteristics of the textual elements (e.g., capitalization) surrounding each punctuation mark.
The purpose of the theoretical framework is also to draw the boundaries of this research.
A typical theoretical framework provides a schematic description of the relationships among different components. Figure 1 is a conceptual representation of the related components that guided this research and may be able to guide punctuation normalization decisions. From this diagram, the interrelated components that can potentially influence punctuation normalization decisions can be identified.
Figure 1. Theoretical representation of punctuation normalization.
They are:
• Categorization of punctuation marks
• Role of punctuation marks
• Content type
• Segmentation
• Indexing scheme
• Text processing
• Search query expression
• Search feature
Some of these components can be identified as factors that may have effects on punctuation normalization and will be covered in the subsequent chapter. As techniques vary widely, some necessary assumptions were made during the course of this research. In any case, under the overall framework, punctuation marks can best be traced back to the theories of IR, IE, and linguistics. In Figure 1, the tokenization and indexing schemes are typical procedures found in IR applications. These procedures are indicated on the left side of the figure, as they are in the realm of investigating punctuation normalization issues. Based on these core components, which have relationships to punctuation marks, the theoretical assumption of this study can be summed up as follows: in a text retrieval environment, a punctuation-centric approach is needed to systematically analyze issues related to normalizing punctuation marks.
To carry out this study, common patterns of specific punctuation uses in real-world datasets will be identified. The real-world datasets used in this study contain a variety of punctuation marks. By comparing the patterns extracted from the datasets, realistic differences and commonalities among the datasets can be obtained. We will also identify
different types of punctuation marks in respect to punctuation normalization problems. This can
allow future research to analyze punctuation marks more effectively based on the findings of this
research. To this end, this study can be viewed as an initial exploratory and descriptive attempt
to improve existing handling of punctuation marks in text retrieval systems.
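One way to make “common patterns of punctuation uses” concrete is to abstract each punctuated token into a pattern string and count pattern frequencies over a corpus. The abstraction scheme below (letters mapped to 'a', digits to '9') is an illustrative assumption, not the notation this dissertation itself develops:

```python
import re
from collections import Counter

def punctuation_patterns(text: str) -> Counter:
    """Count the punctuation patterns of all tokens that contain at
    least one punctuation mark, so that, e.g., 'x.22' and 'B.52'
    both yield the pattern 'a.99'."""
    counts: Counter = Counter()
    for token in text.split():
        if re.search(r"[^\w\s]", token):            # has a punctuation mark
            pattern = re.sub(r"[A-Za-z]", "a", token)
            pattern = re.sub(r"[0-9]", "9", pattern)
            counts[pattern] += 1
    return counts

print(punctuation_patterns("Call x.22 or B.52 today; x-22 is gone."))
```

Running such counts over the Gutenberg and Usenet datasets, as this study does at much larger scale, would surface which punctuation patterns are corpus-specific and which are pervasive.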
The framework of typical IR systems, as shown in Figure 1, is dynamic. Each type of IR-related procedure can affect the search results in different ways. It is difficult to establish causal relationships among these components that can be used to predict the effects of punctuation
normalization. As a new area of research, many intricate causes and issues have to be discovered first. A careful analysis of how items extracted using IE techniques can be integrated into a general retrieval setting is needed. Occasionally, textual elements extracted using IE techniques can be orthographically different from terms matched using information retrieval. This study is
an exploratory study that focuses on discovering such new and challenging issues.
The components in Figure 1 can be considered as factors that one has to consider in
analyzing punctuation normalization effects. The analysis will likely reveal options that a
system designer should take into account in choosing one approach to normalization versus
another. For this reason, a careful analysis of effects of punctuation normalization along with
punctuation patterns is necessary.
Research Questions
Considering the several intriguing aspects of punctuation marks, a key question associated with text retrieval systems is the following: what are the implications of normalizing punctuation marks from the perspective of text retrieval? Since this is a broad question, a set of more specific research questions guided the inquiry in pursuit of answering the main question:
1. In what ways and to what extent are punctuation marks and punctuated character strings (i.e., character strings that contain punctuation marks) important in terms of text retrieval?
2. How does punctuation normalization affect the process of matching punctuated character strings?
3. What are some of the useful categories of punctuation marks, punctuation patterns, and punctuation normalization?
4. How can we measure similarity between terms in an orthographic sense so that it can be incorporated into normalizing punctuation marks?
5. In what ways can we use underlying punctuation patterns in a text for a text retrieval system?
These questions, in no specific order, are considered from the perspective of text retrieval.
Research Design and Data Collection
To carry out this research, texts from the following corpora were used:
• Gutenberg Project archive (http://www.gutenberg.com)
• Usenet Newsgroup archive (Shaoul & Westbury, 2011)
These corpora were readily available to download from the Internet. The Gutenberg
Project archive contained digitized contents mostly from books representing a wide range of
topics such as literature, cookbooks, reference works and issues of periodicals. The second
dataset came from a Usenet Newsgroup archive representing posting for 2009. The Usenet
Newsgroup posting archives represented a vast amount of modern linguistic expressions and
vocabularies that people commonly use in their daily lives.
By examining punctuation marks in punctuated character strings in the datasets, various roles of punctuation marks were methodologically analyzed, and some of the punctuation
patterns were identified. Based on the analysis, different categories were developed further.
Then, issues related to punctuation normalization were explored. Toward the end of this study,
we focused on finding the effects related to punctuation normalization using the datasets that we sampled from the above two corpora.
Limitation of the Study
As in the case of most research, there are some practical constraints and limitations in carrying out this research. The primary limitation of this study is that it is infeasible to exhaustively analyze every punctuation mark and every pattern that results from punctuation marks. To this end, we seek to explore all types of issues surrounding the normalization of punctuation marks but restrict our examination to the punctuation marks and patterns that appeared to have the potential for major implications on text retrieval. Also, because of the extensive computational effort and time required, this research was based on only two datasets.
Another limitation of this study is that it focused only on punctuation marks found in the English language. Since the uses of punctuation marks vary depending on the language, it is difficult to generalize these patterns to other languages without knowing some details of the language itself. Although the roles of some punctuation marks are essentially the same across languages, without an in-depth analysis it will be difficult in most cases to generalize the punctuation normalization analysis beyond the language it is based on.
Assessing the degree of importance in regards to punctuation marks can be prone to subjectivity. Without conducting an actual user-focused study, assessments will be confined to examining existing corpus-based datasets. In interpreting punctuation mark roles, intuitive, logical explanations are sought. This can be done partly based on existing literature related to punctuation marks.
In assessing the value of punctuation marks from a text retrieval perspective, the most
plausible explanations are sought to minimize subjectivity. Consequently, in certain situations, absolute judgments and decisions, especially in categorizing punctuation mark related notions, have been avoided, as value judgments are basically inexact by nature. During the course of the analysis, additional assumptions had to be stated so that targeted demonstrations could be interpreted appropriately within the context of this study.
Organization
The organization of this research is outlined in this section. Chapter 2 presents background information and a literature review of relevant normalization related approaches in the context of text retrieval. A literature review of related works on punctuation normalization is presented. Chapter 3 provides methodological details and research design considerations. This chapter includes the methodological approaches, major procedures, issues related to designing demonstrations, and limitations of this study. Chapter 4 presents an analysis of various punctuation marks and punctuation patterns. Punctuation mark related statistics are qualitatively interpreted, and they are presented in the context of text retrieval. Chapter 5 provides the results of the normalization strategy analysis. The categorization of punctuation normalization and the effects of punctuation normalization are the focal points of Chapter 5. Chapter 6 provides the concluding remarks of this research. This chapter also re-examines the research questions to see if they are adequately answered. Future areas of research are suggested in this chapter.
CHAPTER 2
BACKGROUND AND LITERATURE REVIEW
This chapter provides additional background information and a literature review that are
relevant to this study. Previous research provided detailed information that was useful for this study. Different approaches emphasized different aspects of punctuation marks, particularly in relation to text processing. This literature review supplied punctuation-normalization-related concepts, elements of which were useful in formulating a strategy for conducting this study. The background concepts and literature review for this research are presented together based on these major categories:
• Linguistic and computational analysis of punctuation marks
• Punctuation marks in text processing related procedures
• Concepts related to punctuation normalization
• Punctuation marks in forms of words and expressions
Linguistic and Computational Analysis of Punctuation Marks
Some of the research in areas related to linguistics focuses on analyzing the syntactic and semantic roles of punctuation marks. Meyer (1983) was one of the pioneers in investigating
different aspects of punctuation marks. According to Meyer, syntax and punctuation are related
in a hierarchical manner. That is, the period, the question mark, and the exclamation mark are at
the highest level in the hierarchy, while the comma is at the lowest level. According to Meyer, by placing appropriate punctuation marks, different text segments (e.g., clauses) form a hierarchy within a sentence. This, in turn, provides different levels of sentence structure. He argued that the rules of punctuation marks are the principles that specify appropriate punctuation mark
usage in a particular text.
Nunberg (1990) identified interesting aspects of punctuation marks from a linguistic perspective by focusing on the syntactic rules of punctuation marks. Nunberg argued that punctuation systems are highly systematic and linguistic, obeying a set of linguistic rules. He argued that punctuation marks can be examined independently from lexical grammar. Nunberg referred to commas, dashes, semicolons, colons, and periods as points. According to Nunberg's analysis, when two points occur adjacently in a sentence, one is absorbed by the other according to the relative strength of the two points. In sum, Nunberg focused on constraints on the grammatical sequences of punctuation marks.
A number of other researchers followed Nunberg's theory and examined punctuation marks within the field of linguistics. Doran (1998) focused on the structural role of punctuation by examining adjunct and subordinating clause constructions in sentences. Dale (1992) identified discourse structure primarily in the context of natural language generation. He pointed out that certain punctuation marks, such as the comma, the colon, the semicolon, the dash, and parentheses, can be interpreted as signals of discourse structure. Osborne (1996) demonstrated that punctuation marks play an important role in enabling machines to learn grammar automatically. Briscoe and Carroll (1997) applied Nunberg's method to develop a method for analyzing parts of speech from a text-based corpus.
Linguistic theories provided useful information, especially for understanding the nature of punctuation marks. The applicability of the linguistic perspective, however, had to be limited due to its inherent complexity in dealing with human language. Another limitation was that only a handful of punctuation marks have been examined in the context of linguistics. Considering this, general aspects of punctuation marks, including useful categorization schemes provided by previous research, are subsequently described in detail.
Punctuation Mark Categorization
A number of researchers have attempted to categorize punctuation marks. This
categorization was constructed from the perspectives of linguistics rather than text retrieval.
Here, we mention some of the notable punctuation mark categories.
Jones (1997) performed a computational analysis of the punctuation marks in various
corpora. He compared different types of corpora based on the frequency count of punctuation
marks. Furthermore, he extracted punctuation patterns from a corpus to generalize the functions
of punctuation marks. For our purposes, whether the function of punctuation marks is at a
lexical level or a sentential level, we refer to the function of punctuation marks as a “role” and
use the terms synonymously. Although his main application was parsing sentences, Jones' general characterization of punctuation marks can be particularly useful in
terms of understanding characteristics of punctuation marks that have syntactical roles in a
sentence.
Jones’ (1997) categorization of punctuation marks is comprehensive and especially applicable in an initial attempt to understand the punctuation mark related issues. Jones categorized punctuation marks into different types mainly on a computational linguistic basis.
Jones provided three basic categories:
• Sub-lexical
• Inter-lexical
• Super-lexical
According to Jones' research, sub-lexical punctuation marks are marks that can change the meaning of words. For example, the punctuation marks in character strings such as X-Ray and she's are considered sub-lexical. A sub-lexical punctuation mark can be source-specific or source-independent, although the distinction between the two is sometimes difficult to draw. Sub-lexical punctuation marks appear within lexical and numerical entities; they include periods (dots), hyphens, apostrophes, decimal points, multiplication signs, power signs, etc.
Jones' (1997) inter-lexical punctuation marks are marks that typically occur between words rather than within them. These typically include punctuation marks such as the colon, comma, and full-stop. He noted that most conventional punctuation marks fall into this category; to this extent, investigating inter-lexical punctuation marks was the primary focus of Jones' study. An example of an inter-lexical punctuation mark is the question mark at the end of a sentence. Table 1 is a summary of Jones' inter-lexical punctuation mark categories.
Table 1.
Jones' Inter-Lexical Categories of Punctuation Marks (1997)
Source-specific
• Syntactical function (role): --
• Punctuation marks: decimal point, multiplication sign, power, etc.
• Remarks: Some of these inter-lexical punctuation marks are also considered sub-lexical.
Source-independent
• Stress markers: exclamation mark, question mark, ellipsis
• Conjunctive (separating): semicolon, dash, and comma
• Adjunctive (delimiting): comma, dash, brackets, colon, quotes, full-stop, capitals, etc.
• Remarks: According to Jones, capitals (e.g., capitalized letters) are considered a punctuation mark because of their pairing role with another punctuation mark.
Jones further categorized inter-lexical punctuation marks as the following two types:
• Source-specific
• Source-independent
According to Jones (1997), if the punctuation mark usage is a generally accepted rule in the English language, then it is considered as source-independent. The source-independent types are elements such as stress markers, conjunctive, and adjunctive marks. While incorporating
Jones’ framework for parsing a text, Doran (1998) acknowledged that Jones’ categorization of source-independent types closely resembled Nunberg’s category, which dealt with delimiting and separating punctuation marks. Doran pointed out that Nunberg’s work mostly dealt with source-independent punctuation mark related issues.
Jones (1997) pointed out that delimiting punctuation marks refer to instances in which
certain text fragments are conceptually separated from the rest of the sentence. Jones argued that delimiting punctuation marks appear in pairs, though not necessarily identical ones. For example, consider the following three sentences:
1. John has been misinformed, not ignorant.
2. Obviously, he slept; she continued working.
3. They had a plan—leave the chaotic city as soon as possible.
In the first sentence, the comma is paired with the ending period. In the second sentence, the comma is paired with the semicolon, and the semicolon is paired with the ending period. In the third sentence, the dash is paired with the ending period. Thus, delimiting punctuation marks are ones that are used as a pair in a sentence. According to Jones, even the capitalization of the first word of a sentence is paired with the period at the end of the sentence, although capitalization is not included as a punctuation mark in this research.
Separating punctuation marks are usually ones that are used to list items. For example,
consider the following two sentences:
• I want to buy apples, oranges, and some vegetables.
• I have been to China, Brazil, and Russia.
In the above cases, the commas are the separating punctuation marks since they are merely used to list the items in a sentence.
According to Jones (1997), if a punctuation mark's usage depends upon its content source to determine its specific role, then it is considered source-specific. Jones' super-lexical punctuation marks include various structure-related markup practices such as paragraphing, underlining, italicizing, and superscripting.
While both Nunberg (1990) and Jones (1997) provided rich linguistic accounts of punctuation marks, one of the noticeable problems associated with generalizing punctuation marks is that some marks can appear in more than one category. For example, the ASCII character with decimal code 46 can be a full-stop or a dot according to Jones' punctuation mark categories. The limitation of Jones' categorization is that it is based on the syntactic and semantic functions of punctuation marks from a linguistic perspective. Jones' category scheme is neither entirely applicable nor efficient in terms of addressing the needs of text retrieval.
Frequency Analysis of Punctuation Marks and Punctuation Patterns
There have been a small number of previous attempts to interpret punctuation marks based on punctuation mark frequency counts. Although the frequency counts of punctuation marks were performed mainly to discover linguistic aspects of the punctuation marks, they
provide useful information in terms of understanding different roles of punctuation marks. Some
of the key aspects of punctuation mark frequency analysis are summarized here.
Meyer (1983) reported the frequency counts of punctuation marks from the Brown corpus
(Kucera & Francis, 1967). The Brown corpus consists of more than 1 million words, but in
Meyer's study, only 72,000 words were used as a sample, which comprised approximately 4,000 sentences. Based on his sample, Meyer reported the following:
• Comma (47%)
• Period (45%)
• Dashes (2%)
• Parenthesis (2%)
• Semicolons (2%)
• Question marks (1%)
• Colon (1%)
The percentages inside the parentheses are relative percentages of the punctuation marks found in Meyer's sample data collection; they sum to 100%. According to Meyer, the most common punctuation marks are the comma and the period.
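A relative frequency count of this kind is straightforward to reproduce. The sketch below is illustrative only (the mark inventory follows Meyer's list, but the sample text and function name are this author's assumptions, not Meyer's method):

```python
from collections import Counter

# Sentence-level marks counted by Meyer (1983); sample text is illustrative.
MARKS = {",": "comma", ".": "period", "-": "dash", "(": "parenthesis",
         ";": "semicolon", "?": "question mark", ":": "colon"}

def punctuation_percentages(text):
    """Return each mark's share (in percent) of all counted marks."""
    counts = Counter(ch for ch in text if ch in MARKS)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {MARKS[ch]: round(100.0 * n / total, 1) for ch, n in counts.items()}

print(punctuation_percentages("Obviously, he slept; she continued working. Fine."))
```

Run over a full corpus rather than a sentence, a count like this yields the comma/period dominance Meyer reported.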
Tavecchio (2010) compared the use of punctuation marks (colons, semicolons, and dashes) in Dutch and in English. His findings were based on four different genres: academic prose,
newspaper articles, short stories, and leaflets. Tavecchio also reported findings in terms of the
percent that each sentence contained a particular punctuation mark. Pertaining to English, his
findings are summarized as follows:
• Use of colons (51.4%) was highest in leaflets
• Use of semicolons (46.4%) was highest in short stories
• Use of dashes (69.6%) was highest in newspaper articles
• Use of semicolon (9.5%) was lowest in the leaflets
From the highest uses to lowest uses, the overall ranked order was dashes, colons, and
semicolons.
As mentioned earlier, Jones (1997) extensively examined punctuation mark frequencies
among corpora. He reported some of the notable characteristics of punctuation marks based on a number of corpora including Philosophy, Project Gutenberg, and Usenet. He stated that the corpora he examined contain a vast variety of punctuation marks, as shown in (1).
. ? ! , : ; - ( ) [ ] < > { } ' # * ^ | / \ - & % $ + = @ ~ (1)
Some of his analyses of punctuation marks are summarized as the following:
• The punctuation mark frequency distribution seems to vary considerably among the corpora.
• Some stylistic differences regarding punctuation marks were noticeable. For example, one of his corpora used the single quote more extensively than the double quote.
• The most popular punctuation marks were the period and the comma.
• The opening and closing single quotation marks were not evenly balanced.
• Underscore and asterisk appeared more in Usenet than in Project Gutenberg.
• Some corpora contained more than one period per sentence.
In addition, Jones also attempted to count sentence-based punctuation patterns in the corpora based on different structural segments. Jones counted punctuation patterns in sentences by substituting each series of words with the letter e. Punctuation marks inside each word
were disregarded. According to Jones (1997), the top 10 punctuation patterns that appeared in
Project Gutenberg were the following:
1. e.
2. e,e.
3. e,e,e.
4. e,e,e,e.
5. e
6. e?
7. e,e,e,e,e.
8. e;e.
9. “e,”e.
10. “e?”
The fifth pattern, the letter e without any surrounding punctuation marks, represents “unstopped” sentences. This means that the series of words were not delimited by a punctuation mark. As
Jones pointed out, the frequency count of “unstopped” sentences revealed problems that were associated with his frequency counting method.
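Jones' substitution method can be approximated in a few lines. The sketch below is an approximation only (the exact word definition Jones used may differ; here apostrophes and hyphens are treated as word-internal, and each run of words collapses to a single e):

```python
import re

def punctuation_pattern(sentence):
    """Collapse each run of words (and the spaces between them) into the
    letter 'e', leaving only inter-lexical punctuation, after Jones (1997).
    Apostrophes and hyphens are treated as word-internal and disregarded."""
    return re.sub(r"[\w'\- ]+", "e", sentence)

print(punctuation_pattern("John has been misinformed, not ignorant."))  # e,e.
```

Note that an abbreviation such as Mr. yields the pattern e.e., which illustrates one source of the multiple-periods-per-sentence counts Jones reported.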
Regardless, Jones' calculation of punctuation patterns seems to be a reasonable way to indicate the underlying structure of sentences in corpora. In the context of text retrieval, a more pertinent question is how such punctuation patterns help to reduce the ambiguity that sometimes comes with particular types of character strings (e.g., abbreviations). Although only sentence-based punctuation patterns were investigated, his research was the only research that extensively reported punctuation mark related statistics based on corpora.
A less extensive yet still interesting frequency count of punctuation marks was reported by Ling and Baron (2007). They examined linguistic characteristics in text messaging and instant messaging based on a survey. They noted that stylistic features were evident particularly in the form of abbreviations, acronyms, emoticons, misspellings, and omission of vowels, subject
pronouns, and punctuation marks. They reported that instant messaging had more punctuation marks overall. Based on their categories of punctuation marks, the instant messaging punctuation marks were the following:
• Sentence punctuation mark (45%)
• Transmission-final punctuation mark (35%)
• Transmission-internal punctuation mark (78%)
• Use of required question mark (100%)
• Use of required period (41%)
Regarding punctuation marks, their analysis suggested that contractions (e.g., can't instead of cannot) appeared more in casual or informal speech and writing. In this context, they concluded that omitting a punctuation mark such as an apostrophe was useful for users because fewer keystrokes were required to send messages. The broader implication of their study was that some punctuation mark uses might be associated with a domain-specific context. That is, stylistic differences in punctuation marks not only depend on a narrowly defined type of content but may also, in a larger sense, vary by domain.
In addition, all of the previous works seemed to suggest that frequency counts are useful in detecting punctuation patterns. The types of punctuation patterns and their applicability are a separate issue, since only some of the punctuation patterns were reported from these studies.
Nevertheless, the previous studies implied that more frequency counts involving punctuation patterns are needed using different corpora. Although all of the mentioned works on punctuation mark frequency analysis are useful to varying degrees, for the purpose of discovering
punctuation patterns, more varieties of frequency analysis of punctuation patterns are clearly
needed.
Punctuation Marks in Text Processing Related Procedures
For a text retrieval system, text processing procedures may affect how punctuation marks
are normalized. Some components of text processing have to be taken into account when normalizing punctuation marks. Although not every detail of text processing is described here, some of the basic concepts pertinent to normalizing punctuation marks are discussed below.
Tokenization
The notion of tokenization can be defined as a process of converting a stream of text into tokens. Thus a token can be defined as a resulting set of character strings that are produced
during the process of tokenization. If undesirable tokens are produced for some reason,
constituents of the tokens can be further defined for the purpose of producing desirable character
strings. One of the common approaches is to attempt to tokenize text into tokens that closely
resemble a more intuitive notion of words (Palmer, 2004). Yet difficulties arise since the notion of a word is often imprecise. In the general sense, a word is a morphological unit that carries some amount of meaning.
Another way to tokenize text is to simply use blank space as a separator. In such a case,
a character string or a lexical item could be split into multiple tokens. For example, a lexical
item health care could be split into two separate units: health and care. Manning et al. (2008)
pointed out that one solution in a text retrieval system for such a tokenization problem is to use phrase indexing and handle searching by a phrase-based search query.
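The limitation is easy to demonstrate with a minimal sketch: splitting on blank space severs the lexical item health care into two tokens, and punctuation marks stay attached to neighboring tokens (the function name is illustrative):

```python
def whitespace_tokenize(text):
    """Naive tokenization: split the text on runs of whitespace only."""
    return text.split()

# The lexical item "health care" is severed into two tokens, and
# sentence-final punctuation stays glued to the last word.
print(whitespace_tokenize("Quality health care matters."))
# ['Quality', 'health', 'care', 'matters.']
```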
Other research indicates that tokenization often needs to carefully consider punctuation marks. Schmid (2007) pointed out some of the punctuation mark related problems that are linked to tokenization. The author claimed that periods and hyphens are particularly problematic. Since periods are used in marking the end of sentences and in abbreviations, it is often difficult to distinguish the difference. Applied to this research in a broader sense, defining words or tokens can affect punctuation normalization procedures.
Palmer (2004) also pointed out that these issues can commonly appear during a typical tokenization process. Palmer also stated that the tokenization decision in these cases is context dependent. According to Palmer, there are two types of words: multi-word expressions and multi-part words.
According to Palmer, a word can be viewed as a multi-part word, as in the case of the term end-of-line. The expression in spite of can likewise be regarded as one word. A multi-word expression, on the other hand, can contain multiple words and differs from a multi-part word.
For example, Nov. 18, 2011 can be considered as just one word. Such a multi-word expression can contain both punctuation marks and a numeric expression. Palmer's example suggests that various punctuation mark practices should be examined for a system to effectively distinguish these types of character strings.
A tokenization strategy can also be a domain-specific issue. For example, Jiang and Zhai (2007) used a heuristic-based tokenization strategy for a biomedical information retrieval application. Their tokenization rules included the following:
1. Replace the following characters with spaces: ! " # $ % & § < = > ? @
2. Remove the following characters if they are followed by a space: . : ; ,
3. Remove the following pairs of brackets if the open bracket is preceded by a space and the close bracket is followed by a space: ( ) [ ]
4. Replace all hyphens with spaces
As a result, a biomedical term such as (MIP)-1alpha was tokenized as mip 1alpha. Additionally, the above normalization rules were modified so that the tokenized text could become mip1alpha and
mip 1 alpha. Their experiment showed that proper tokenization can significantly affect retrieval
accuracy. The importance of their approach is that they demonstrated that punctuation marks are
critical components of heuristic-based tokenization rules.
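The four rules above can be sketched as follows. Note that the rule ordering, the lowercasing step, and the padding of string boundaries with spaces are assumptions made here to reproduce the mip 1alpha example; Jiang and Zhai's actual implementation may differ:

```python
import re

def biomedical_tokenize(text):
    """Heuristic tokenization in the spirit of Jiang and Zhai (2007).
    Rule ordering, boundary padding, and lowercasing are assumptions."""
    text = " %s " % text                           # treat string ends as spaces
    text = re.sub(r'[!"#$%&§<=>?@]', " ", text)    # rule 1: replace with spaces
    text = text.replace("-", " ")                  # rule 4: hyphens to spaces
    text = re.sub(r"[.:;,](?= )", "", text)        # rule 2: remove before a space
    for open_b, close_b in (("(", ")"), ("[", "]")):
        # rule 3: drop bracket pairs delimited by spaces on both sides
        text = re.sub(r"(?<= )\%s(.*?)\%s(?= )" % (open_b, close_b), r"\1", text)
    return text.lower().split()

print(biomedical_tokenize("(MIP)-1alpha"))  # ['mip', '1alpha']
```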
Sometimes tokenization is viewed as a required step in disambiguating underlying textual
elements. For this reason, tokenization is sometimes considered as a procedure in a lexical
analyzer from an IR perspective (Baeza-Yates & Ribeiro-Neto, 2011). Mikheev (2000) pointed
out three different types of disambiguation problems: capitalization, sentence boundary, and
abbreviation. He took a unified approach to solving this type of problem. Similarly, issues
related to sentence segmentation, which is described as a process of breaking up the text to
identify sentences instead of words, were examined by Grefenstette and Tapanainen (1994).
Kiss and Strunk (2006) also described a method to identify abbreviation words and sentence
boundaries. They identified abbreviations based on utilizing three features of punctuation marks:
a final period, shortness, and internal periods.
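Kiss and Strunk's actual method is statistical; the sketch below merely extracts the three surface features they utilize, with an assumed length threshold standing in for "shortness" (the threshold and function name are this author's assumptions):

```python
def abbreviation_features(token):
    """Surface features of an abbreviation candidate in the spirit of
    Kiss and Strunk (2006); their model combines such features statistically."""
    stem = token.rstrip(".")
    return {
        "final_period": token.endswith("."),
        "short": len(stem) <= 3,          # threshold is an assumption
        "internal_periods": "." in stem,  # e.g. the first period in U.S.
    }

print(abbreviation_features("U.S."))
```

A token such as U.S. fires all three features, whereas an ordinary word such as working fires none.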
An indication from all of these previous works is that punctuation marks can play a
critical role in disambiguating text elements. Still, there has not been a sufficient amount of
research in the area of tokenization, as Jiang and Zhai (2007) pointed out. Moreover, previous work on tokenization did not thoroughly consider punctuation patterns.
From a perspective of text retrieval, tokenization that involves punctuation marks has certain
effects on text retrieval. This type of issue was not thoroughly investigated by the previous research in this area.
Indexing and Punctuation Marks
Indexing is a common procedure in a typical text retrieval environment. A word-based indexing strategy is frequently used since a word is the smallest logical unit that carries a basic concept. In spite of its imprecision, it is natural for humans to describe subject matter in terms of words. The words in this type of indexing strategy are referred to as index terms.
One of the reasons for using index terms is that text retrieval systems can use them to match against search query terms. Hence, proper index terms are required for a system to conveniently match these character strings against terms in a search query. In general, if individual concepts are well represented by index terms, then there is a greater chance for increasing the retrievability and accessibility of underlying text. Based on this intuitive reasoning, variants of words and phrases need to be reduced to a reasonable degree so that query search terms would be properly matched with indexed terms.
Furthermore, index terms can be based on controlled vocabularies that contain a pre-determined list of terms relevant to a particular domain (Mackenzie-Robb, 2010). Otherwise, the indexing scheme can be full-text indexing, which is also sometimes referred to as free-text indexing or automatic text indexing. Savoy (2005) compared automatic text indexing and controlled vocabularies and found that, overall, using one method over the other did not significantly improve search precision.
Another relevant consideration is that indexing can be done manually by a human or automatically processed by a machine. A good comparison of human and automatic indexing was provided by Anderson and Perez-Carballo (2001). In the case of human indexing schemes, professional indexers generally follow punctuation mark usage under an explicitly specified
stylistic guideline, usually in the form of metadata (Moen, 2001). In a case where an index is
generated automatically, a more dynamic and substantial volume of indexing terms is possible.
How character strings that contain punctuation marks are indexed differs depending on the type
of indexing scheme.
Regardless, punctuation normalization can be viewed as a procedure from an indexing
point of view. There are common approaches that may change the way punctuation marks
become indexed internally in a text retrieval system. Inverted indexing is a common word-based
indexing scheme. This allows a free-text searching capability even for a large corpus (Baeza-Yates & Ribeiro-Neto, 2011). Because an inverted indexing method is basically word-based, a more accurate identification of words is required. As a consequence, punctuation marks that
surround each word need to be normalized according to the operational definition of a word.
The weakness of inverted indexing is that it is not efficient for all string patterns, since indexing is restricted to word or phrase queries, as Russo et al. (2007) pointed out.
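A minimal sketch of such a word-based inverted index illustrates why the operational definition of a word matters. The word definition below is deliberately naive (lowercasing and splitting on whitespace, with punctuation left attached); the document set and function name are illustrative assumptions:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of document ids containing it.
    Words are identified naively, so punctuation stays attached to them,
    which is exactly the problem punctuation normalization addresses."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {1: "Punctuation marks in text.", 2: "Text retrieval systems"}
index = build_inverted_index(docs)
print(index)
```

Note that "text." and "text" are indexed as different terms here, so a query for text would miss document 1 unless the trailing period is normalized away at indexing time or query time.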
The other notable approach is to index substrings of long character sequences, regardless of the unit size of segmentation. Wiegel (2002) reviewed indexing approaches of this type, which find substrings of a text instead of relying on word-based indexing. Methods such as suffix trees and suffix arrays fall into this category, as one long string is represented in a tree form instead of each word. Unlike inverted indexing, this technique allows searching for subsets of single long
strings. A more recent indexing scheme named the Word Permuterm Index was designed to
support a wide range of wild card queries (Chubak & Rafiei, 2010). The major advantage of
this type of technique is that finding character patterns can be more easily facilitated. A
disadvantage is that these methods are not suitable for a large corpus such as the Web (Baeza-Yates & Ribeiro-Neto, 2011).
Whether an indexing scheme is word-based or just based on a sequence of long characters, punctuation marks must be normalized by a system. For example, for indexing multi-part words that contain hyphens, important design decisions must be made. One has to make technical decisions such as how to index these words in a system and how to retrieve them.
On the search side, what to expect from users who may search for such multi-part words must be estimated, at least roughly.
Effectiveness of Search Queries and Punctuation Normalization
Effectiveness of punctuation normalization and search queries can be measured under different types of conditions (e.g., punctuation normalization versus un-normalized punctuation marks). The classical performance evaluation method of IR experiments is to use precision and recall measurements (Cleverdon, 1972). Precision and recall, which are often used as classical measurements of search and retrieval effectiveness, can be defined as in equations (2) and (3) (Manning et al., 2008).
Precision = (# of Relevant Items Retrieved) / (# of Retrieved Items)    (2)

Recall = (# of Relevant Items Retrieved) / (# of Relevant Items)    (3)
Precision is a measurement of how many of the retrieved items are relevant. Recall is the ratio of relevant items returned by a retrieval system to the total number of relevant items within the collection of data. Items can be defined as any type of sensible linguistic unit (e.g., words or
phrases). The above formulas state that precision is the proportion of relevant search results among all retrieved results, while recall is the proportion of relevant items retrieved out of all relevant items in the collection. In other words, precision is the probability that a retrieved item is relevant, while recall is the probability that a relevant item is retrieved in a search.
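Equations (2) and (3) amount to a simple set computation over item identifiers, sketched below (the item ids and the convention of returning 0.0 for empty sets are illustrative assumptions):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall per equations (2) and (3), over sets of item ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 items retrieved, 2 of them relevant; 3 relevant items exist in total.
print(precision_recall({1, 2, 3, 4}, {2, 4, 5}))  # (0.5, 0.6666666666666666)
```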
Various attempts have been made to critically analyze the precision and recall
measurements (Blair & Maron, 1985; Raghavan et al., 1989; Davis & Goadrich, 2006), and they
are widely known in the IR community. The most notable characteristic of the precision and recall measurements is that the two have an inverse relationship to one another. For example, to increase a
level of recall, one generally needs to perform this at the expense of precision. A goal in general
is to increase precision and recall at the same time. This is generally difficult to achieve in a
text retrieval environment.
In a limited sense, the precision and recall measurements can be useful in terms of
understanding some of the aspects of punctuation normalization phenomena. However, the
usefulness of precision and recall is not clear in terms of measuring the effectiveness of
punctuation normalization. The reason is that the precision and recall measurements have not
been used specifically in the context of measuring punctuation normalization.
From a system’s standpoint, it is not easy to identify the correct context of a character
string. For example, if the user submits a search query that contains BC, a system has to decipher
the user’s intended meaning of character string BC. The character string BC could be an
abbreviation that stands for a number of items, including a person's name, a time (e.g., Before Christ), an entity (e.g., Boston College), or a location (e.g., British Columbia).
In general, it is difficult to determine a user’s information need solely based on the user’s
query terms. Because of the large amount of text used in this study, the precision and recall measurements were a less practical means of investigating the effects of punctuation normalization.
Existing Approaches to Punctuation Normalization Practices
Brooks (1998) observed how some text retrieval systems normalized punctuation marks
and the effects that they had on text retrieval. His research included reviewing the practices of
text retrieval system vendors such as ERIC (http://www.eric.ed.gov/) and Dialog
(http://www.dialog.com/dialog/). In his research, Brooks reported diverse practices of treating and processing queries with punctuation marks. His review of normalizing punctuation marks
was conducted more than a decade ago. Yet, based on the review of current punctuation normalization practices, substantial progress in this area has not been made.
First, some pertinent questions need to be raised in assessing the current practices of
normalizing punctuation marks. As Brooks implied in his research, a text retrieval system
designer needs to make decisions regarding normalizing punctuation marks. For example, a text
retrieval system needs to answer these types of questions:
• What types of character strings containing punctuation marks should be indexed?
• What types of punctuation marks or punctuation mark patterns should be indexed?
• How should a text retrieval system normalize a user's search query that contains punctuation marks?
Furthermore, once punctuation marks are normalized, we need to know, in some cases, the specific
procedures to support the matching functions. Specifically, we need to know how a text retrieval
system matches terms based on the following pairs:
• Un-normalized term in the search query to normalized term in the content
• Un-normalized term in the search query to un-normalized term in the content
• Normalized term in the search query to un-normalized term in the content
• Normalized term in the search query to normalized term in the content
These questions can serve as a guideline in investigating punctuation normalization practices. However, the above-mentioned questions are not easy to answer, since most text retrieval systems do not describe their treatment of punctuation marks in great detail.
There are two main ways to obtain information regarding how the punctuation marks are
processed and normalized. The first source of information is the documentation that is provided
by a text retrieval system vendor. However, as mentioned, punctuation marks generally are not
discussed in detail. The second way to obtain information is to analyze a text retrieval system
based on search query results. Brooks (1998) used this method to assess the vendor practices of
text retrieval systems. Some retrieval patterns should be noticeable, especially when documents related to punctuation marks are available. Still, knowledge of the full details of punctuation normalization procedures and matching functions may remain limited. Through extensive testing, we might be able to accurately describe the full effects of normalizing punctuation marks. Considering these factors and limitations, we review two existing text retrieval systems as examples.
Google Search Engine
It is worthwhile to examine the punctuation normalization policy of the Google search engine (http://google.com), given its popularity and size. Google (2012), which maintains the largest index of web pages to date, announced that some punctuation marks would now be considered for indexing. The official Google blog post stated the following:
Based on analysis of our query stream, we’ve now started to index the following heavily used symbols: “%”, “$”, “\”, “.”, “@”, “#”, and “+”... (Blogger, 2012).
According to the Search Engine Showdown website (2012), the changes make searching possible only for certain cases such as "@bios" and "#cildc". It is apparent that the indexing of punctuation marks has to keep up with changes in how these types of character strings are used. However, the extent to which Google indexes and matches punctuated character strings is not widely known, and it is difficult to know its methodology since Google is a "closed" commercial system. Its detailed methods of indexing, matching, and ranking of search results as they pertain to punctuation marks are not well documented. Answering the above-mentioned questions is therefore severely limited, but it is still possible to obtain a sense of how punctuation marks are normalized by running some search queries.
Google apparently indexes some punctuated character strings such as C++. Running a
search query that contains such punctuated character strings does not seem to be a major issue.
However, based on some punctuated queries, the effect of punctuation marks in the text retrieval process seems to be rather evident. Figure 2 is an example of a Google search result for cat -v as the search query. Using the default search option, the Google search engine produced more than 5 billion matched results for the search term cat -v. In the figure, a few of the top-ranked results on the first search result page are shown. None of the first-page results is relevant to the -v Unix command option. The titles of the retrieved results suggest that only the character string cat is critical in determining the search ranking.
Figure 2. The Google search result for the phrase cat -v.
Such a troublesome problem could be partly resolved by using the exact search feature. That is, it is possible to match the character string exactly as it appears in the search query by enclosing it in double quotation marks. The search results based on the exact search method for the term cat -v are shown in Figure 3. More than 1 billion matched results were produced, in contrast to the 5 billion matched results using the default search method. As shown in the figure, the result is a slight improvement over the previous search query: while the first search result page contains some non-relevant results, it also contains some relevant ones. Even so, it is difficult to know how punctuation marks were normalized and how they matched the terms in the content. The problem is that we attempted to assess a "closed" text retrieval system based on retrieved search results alone. As a result, we can only gain a limited sense of how internal processes such as indexing, matching, and ranking work, and how punctuation marks are normalized as part of those processes.
Figure 3. The Google search result for the phrase cat -v using the exact search feature.
Punctuation marks also affect information viewing tools such as the Google Ngram Viewer (Google Ngram Viewer, 2012). The Ngram Viewer is a program that provides the user with a visual representation of the relative frequencies of lexical bundles, or ngrams, that have occurred in published texts over the years (Corpus Linguistics - The Google Ngram Viewer, 2011). The punctuation normalization policy is available from Google's Ngram Viewer support web page (About Ngram Viewer, 2012). The web page stated the following:
We apply a set of tokenization rules specific to the particular language. In English, contractions become two words (they're becomes the bigram they 're, we'll becomes we 'll, and so on). The possessive 's is also split off, but R'n'B remains one token. Negations (n't) are normalized so that don't becomes do not. In Russian, the diacritic ё is normalized to e, and so on. The same rules are applied to parse both the ngrams typed by users and the ngrams extracted from the corpora, which means that if you're searching for don't, don't be alarmed by the fact that the Ngram Viewer rewrites it to do not; it is accurately depicting usages of both don't and do not in the corpus. However, this means there is no way to search explicitly for the specific forms can't (or cannot): you get can't and can not and cannot all at once.
Still, this type of policy information is inadequate from some users' points of view, as the details of the types of punctuation marks affected are not adequately stated. The need for a methodologically justifiable punctuation normalization policy is clearly apparent.
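The contraction handling quoted in the policy can be approximated with a few substitution rules. The sketch below is an illustration only, not Google's actual tokenizer; note that irregular forms such as can't and won't need an explicit table, since no general rule covers them:

```python
import re

# Irregular contractions need explicit entries; the regular cases quoted
# in the policy are handled by the substitution rules below.
IRREGULAR = {"can't": "can not", "won't": "will not"}

def tokenize_english(text):
    tokens = []
    for word in text.split():
        if word.lower() in IRREGULAR:
            tokens.extend(IRREGULAR[word.lower()].split())
        elif word.endswith("n't"):
            tokens.extend((word[:-3] + " not").split())   # don't -> do not
        else:
            # they're -> they 're ; we'll -> we 'll ; possessive 's splits too
            tokens.extend(re.sub(r"'(re|ll|ve|s|d|m)$", r" '\1", word).split())
    return tokens

assert tokenize_english("they're here") == ["they", "'re", "here"]
assert tokenize_english("don't") == ["do", "not"]
assert tokenize_english("can't") == ["can", "not"]
```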
LexisNexis
LexisNexis (http://www.lexisnexis.com) is a corporation that provides legal research services and products to legal professionals, businesses, academics, and government agencies. LexisNexis maintains more than 37 billion records from more than 10,000 sources that include public, private, regulated, and derived data. Policies and guides on punctuation marks are available on the LexisNexis web page (Searching Symbols and Punctuation Marks, n.d.). The web page indicates specific policies on various punctuation marks including the section symbol §, percent symbol %, registered trademark ®, slash /, and dollar sign $.
According to the punctuation mark policy document, some punctuation marks are ignored entirely. This suggests that some punctuation marks are normalized by removing them in their entirety. For example, searches on LexisNexis ignore the question mark. The policy
document states that “The LexisNexis services do not read question marks.” (Searching Symbols
and Punctuation Marks, n.d.). Even when LexisNexis search was attempted using an exact
search method, the system was unable to locate terms that contained a question mark.
The policy document did not cover all punctuation marks. The plus sign was also examined, but it was not mentioned in the punctuation mark policy document. When we attempted to run a search query containing the term C++, the system converted the query to C, meaning that ++ was disregarded from the search. The punctuation marks were processed the same way even when an exact search using the term C++ was performed. As before, such a punctuated character string produced an unnecessarily large number of unrelated search results.
In certain cases, the system attempted to index and match punctuated character strings exactly as they appear. Because a period can play multiple roles in a sentence, it requires a more elaborate scheme. LexisNexis states the following policy regarding the period:
The LexisNexis® services read a decimal point when it precedes one or more numbers. The LexisNexis services read a period only when it follows just one letter and precedes a series of alternating letters and periods. Examples:
o Term: 1.2 Read as: 1.2
o Term: 100. Read as: 100
o Term: .1a Read as: .1a
o Term: F.R.C.P. Read as: F.R.C.P.
o Term: www.lexisnexis.com Read as: www lexisnexis com (Searching Symbols and Punctuation Marks, n.d.).
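The rules just quoted can be roughly reconstructed as a short procedure. This is a hypothetical reading of the published policy, since the vendor's actual tokenizer is not public:

```python
import re

def normalize_periods(term):
    """A hypothetical reconstruction of the LexisNexis period rules quoted
    above; the vendor's actual tokenizer is not public."""
    # Rule 1: read a decimal point when it precedes one or more numbers
    # ("1.2" -> "1.2", "100." -> "100", ".1a" -> ".1a").
    if re.fullmatch(r"\.?\d+(\.\d+)*\.?[a-z]?", term):
        return term.rstrip(".")
    # Rule 2: read periods in one-letter/period alternations ("F.R.C.P.").
    if re.fullmatch(r"([A-Za-z]\.)+", term):
        return term
    # Otherwise periods act as separators ("www.lexisnexis.com").
    return " ".join(part for part in term.split(".") if part)

assert normalize_periods("1.2") == "1.2"
assert normalize_periods("100.") == "100"
assert normalize_periods("F.R.C.P.") == "F.R.C.P."
assert normalize_periods("www.lexisnexis.com") == "www lexisnexis com"
```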
Thus, periods are read in some terms but ignored in others. Ignoring the period at the end of a character string produced odd results:
o For an exact search query using the term P. O., the search result showed the instances containing P&O.
o For an exact search query using the term B.C., the search result showed instances containing B.C (without the period after the letter C).
o For an exact search query using the term B. C., the search result showed the instances containing B&C.
Such inadequate search results seem to indicate that some punctuation marks might be normalized arbitrarily, without careful analysis. These examples support Brooks' (1998) finding that punctuation normalization issues are a fundamental impediment to text retrieval systems. A methodologically justifiable way of normalizing punctuation marks is clearly needed from the perspective of text retrieval.
Concepts Related to Punctuation Normalization
In the literature covering text normalization concepts in general, issues related to punctuation normalization were mentioned frequently. Text normalization techniques are sometimes integrated within a system, and the concepts often overlap one another. The concepts related to punctuation normalization can be categorized as follows:
• Character encoding and character normalization
• Normalization of capitalized character strings
• Normalization of non-standard words and abbreviation
Each of these concepts is reviewed from a text retrieval perspective in this section.
Character Encoding and Character Normalization
The punctuation mark normalization problem can be associated with character encoding and character normalization problems. Different character encoding schemes (e.g., ASCII, Unicode) are available; useful information regarding them is provided by Korpela (2012a). Each encoding scheme offers a different set of characters, and the character sets of different schemes are often incompatible. The primary reason is that different character sets were created at different times for different purposes. Also, each encoding system offers a different set of characters in order to accommodate a particular language's representation requirements.
Essentially, the problem is that a character can be represented by multiple forms and codes. To reduce the variant character forms, character normalization has been suggested in the literature (W3C website, 2012). For instance, ã can be represented as a single character, or as a combining sequence of the two characters a and ~. Character normalization is mostly concerned with normalizing different characters and typographic character variations. The notion of character normalization is closely associated with punctuation normalization, since the problem domain also involves normalizing one type of punctuation mark to another.
Each character in a character set is assigned a specific numeric code, and even slight typographic variants receive distinct codes. Since the ASCII character set is one of the oldest and most basic typographic representations of characters, this study considered only the punctuation marks listed in Appendix A. Note that there are some typographic variations in this table: ASCII code decimal number 39 ' is assigned to the apostrophe or single quote, while extended code decimal #145 refers to the left single quote ‘ and #146 refers to the right single quote ’.
A useful source where character set normalization issues are discussed is the Unicode
normalization documentation (The Unicode Consortium, 2012). Unicode is a character encoding
system that offers a richer set of characters than ASCII. The technical report published by
Unicode describes problems specifically related to a Unicode environment. Because of the size of the character set it supports, there are fundamental issues in dealing with canonical forms within a Unicode character set. In particular, a character string that is not in English can be represented by different sequences of Unicode characters.
For example, the character string café contains the letter é. In Unicode, é can be encoded as (U+00E9), Latin small letter e with acute. The same letter can also be represented by (U+0065), Latin small letter e, followed by (U+0301), a combining acute accent (a diacritic mark). The first form is known as the precomposed form, while the latter is known as the decomposed form. Four normalization forms are possible according to the Unicode Standard:
• NFC (normalization form canonical composition)
• NFD (normalization form canonical decomposition)
• NFKC (normalization form compatibility composition)
• NFKD (normalization form compatibility decomposition)
Because of the variant forms and characters that Unicode attempts to support, character normalization issues arise in the course of composing or decomposing characters from one equivalent form to another. Adhering to the Unicode normalization forms allows programmers to follow standard ways of dealing with individual characters.
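Python's standard unicodedata module implements all four normalization forms, which makes the café example above easy to verify:

```python
import unicodedata

precomposed = "caf\u00e9"    # é as a single code point (U+00E9)
decomposed = "cafe\u0301"    # e (U+0065) + combining acute accent (U+0301)

# The two strings differ at the code-point level...
assert precomposed != decomposed
# ...but normalizing both to the same form makes them comparable.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
# The compatibility forms additionally fold variants such as the ligature ﬁ.
assert unicodedata.normalize("NFKC", "\ufb01le") == "file"
```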
Character normalization is concerned with normalization pertaining to individual characters due to variations in a character encoding scheme. The issue of character normalization is not directed at examining the functions (e.g., syntactical roles) of individual punctuation marks; rather, the issues are the character differences and representation problems that can occur at the individual character level. This is not to say that the character strings affected by a character normalization procedure are any less important: if a character normalization problem affects a single character for whatever reason, it also affects the entire character string that contains that character. EUROSEC GmbH (2005) pointed out that understanding character set differences and their conversion issues is essential in an effort to reduce unintended normalization effects.
Moen (2005) identified similar problems that are associated with character normalization.
Moen demonstrated that the issues related to normalization of certain characters concern not only
the proper representation of characters, but also proper retrievability of an equivalent character.
For instance, the letter ã, a Latin small letter a with tilde, is supported by Unicode as the character (U+00E3). However, this character is not present in the original ASCII code (see Appendix A for the list of ASCII codes that represent this study's punctuation marks). A proper conversion is necessary if one wants to capture the same meaning in an ASCII form; typically, the character is converted to a plain lowercase letter a. While this normalization might be better suited for matching the letter a, the original character information has been altered and the meaning of the character has changed.
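The lossy conversion Moen describes can be sketched by decomposing a string to NFD and discarding the combining marks. The accent information is irretrievably lost, which is exactly the concern raised above:

```python
import unicodedata

def fold_to_ascii(text):
    """Strip diacritics by decomposing to NFD and dropping combining marks.
    This is lossy: the accent information is discarded, as cautioned above."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

assert fold_to_ascii("s\u00e3o") == "sao"     # ã (U+00E3) becomes plain a
assert fold_to_ascii("caf\u00e9") == "cafe"   # é (U+00E9) becomes plain e
```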
Normalization of Capitalized Words
Traditionally, case-sensitive searches are not provided in many text retrieval systems. That is, text is usually converted to all upper case or all lower case in the preprocessing stage of IR (Baeza-Yates & Ribeiro-Neto, 2011). The simplest method to normalize capitalized character strings is to change them to either all upper case or all lower case. Manning et al. (2008) pointed out that in practice all upper case characters are usually converted to lower case; they referred to this process as case-folding (e.g., changing the character string CAT to lower case cat). Another alternative, although inefficient, is to have the system retrieve the character string both ways and merge the results. Retrieving both ways achieves the same goal, since from a user's standpoint the returned results appear case-insensitive. Either way, the user is relieved from searching twice, once in upper case and once in lower case.
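A minimal sketch of case-folding at index time follows; it keeps the original form alongside the folded key so that capitalization cues are not lost entirely:

```python
def case_fold_index(tokens):
    """Store the case-folded form as the index key while keeping each
    original token, so capitalization evidence remains available."""
    index = {}
    for position, token in enumerate(tokens):
        index.setdefault(token.casefold(), []).append((position, token))
    return index

idx = case_fold_index(["CAT", "sat", "Cat"])
# A query for "cat" in any case reaches both original spellings:
assert idx["cat"] == [(0, "CAT"), (2, "Cat")]
```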
Although capitalization seems to be a straightforward problem in this sense, whether a character string is capitalized or not is a useful characteristic that a system can exploit for a variety of applications. For example, capitalized character strings often denote the start of a new sentence. They can also denote proper names such as names of organizations, locations, people, artifacts, etc. (Quirk et al., 1985). As pointed out earlier, from an IE standpoint, Mikheev (2000) suggested a method to disambiguate capitalization in conjunction with identifying abbreviations. The significance of Mikheev's work is that normalization of an abbreviation should exploit the underlying capitalized character strings in a text, since capitalization can also play an important role in identifying abbreviation marks.
Punctuation Marks in Forms of Words and Expressions
Punctuation marks can appear in various forms of character strings and expressions. Previous work in this area has offered various categorizations of such forms, which were particularly useful for analyzing punctuation mark roles. Punctuation mark usage was especially useful to observe in two forms of character strings and expressions: abbreviations and non-standard words, and non-standard expressions.
Abbreviations and Non-Standard Words
Abbreviations and non-standard words often contain punctuation marks. Non-standard
words do not typically appear in a dictionary. Xydas et al. (2004) stated that non-standard words do not have a regular entry in a lexicon and that their pronunciation needs to be generated by a more complicated natural language process. Apart from this, natural languages often contain a vast number of non-standard words that a typical text retrieval system must deal with, including categories based on textual characterization such as numbers, dates, currency amounts, and acronyms.
From the IE side, the named entity recognition problem often involves matching abbreviated character strings to their non-abbreviated counterparts. Since abbreviated character strings often present a unique problem, solving it generally requires a domain-specific solution. A number of researchers have attempted to identify abbreviations in text. For example, Potamites (2007) examined the nature of abbreviation and claimed that his system is fairly accurate in determining whether a textual element is an abbreviation or not. He focused on identifying abbreviations based on punctuation marks and boundaries between words.
Other studies have shown various ways to normalize non-standard words. Interesting textual patterns involving punctuation marks were observed by Yeates (2002), who focused on the automatic extraction of acronyms within text. He pointed out the differences between an acronym and an abbreviation by arguing that acronyms lack symbols such as apostrophes or periods. An abbreviation, on the other hand, conforms to a more standard construction and uses capital letters. As an example, he claimed that the words can't and etc. are a contraction and an abbreviation respectively, and that neither is an acronym.
Like others, Yeates (2002) took a heuristic approach in finding acronyms in the text.
According to Yeates, acronyms have the following characteristics:
• Acronyms are shorter than their original specification
• Acronyms contain initials of most of the words in their definitions
• Acronyms are usually in upper case
• Shorter acronyms tend to have longer words in their definition
• Longer acronyms tend to have more stop words (such as the and a) in their definitions
His work is particularly useful in devising heuristic based rules for identifying acronyms.
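Yeates's characteristics can be turned into a toy heuristic along the following lines. This is an illustration in the spirit of his rules, not his actual system:

```python
def looks_like_acronym(token, definition_words):
    """Toy detector based on the characteristics listed above; Yeates's
    actual system is heuristic but considerably more elaborate."""
    stop_words = {"the", "a", "of", "and"}
    if len(token) >= sum(len(w) for w in definition_words):
        return False                      # must be shorter than its definition
    if not token.isupper():
        return False                      # acronyms are usually upper case
    if "." in token or "'" in token:
        return False                      # acronyms lack periods and apostrophes
    initials = "".join(w[0] for w in definition_words
                       if w.lower() not in stop_words)
    return token.lower() == initials.lower()

assert looks_like_acronym("CIA", ["Central", "Intelligence", "Agency"])
assert not looks_like_acronym("etc.", ["et", "cetera"])
```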
Sproat et al. (2001) investigated various non-standard words in the context of speech recognition. Although some non-standard words do not involve punctuation marks, their main contribution was identifying various non-standard word types. Non-standard words are often categorized into different types, and specific models are designed to tackle each type; their definition of a non-standard word is a word that falls into one of their predefined types. For instance, Sproat et al. suggested four categories for alphabetic non-standard words: word sequences containing abbreviations (EXPN), letter sequences (LSEQ), standard words (ASWD), and misspellings (MSPL). Examples of each category are the following:
• EXPN: adv, N.Y., mph, gov’t
• LSEQ: CIA, DC, CDs
• ASWD: CAT, proper names
• MSPL: geograpy
Sproat et al. (2001) indicated that not all non-standard expressions contain punctuation marks. For example, adv might stand for advanced. A character string such as gov't, which contains an apostrophe, is both an abbreviation and a non-standard word. Nevertheless, their categorization is useful for recognizing punctuation marks associated with non-standard words, since these words often contain punctuation marks as well.
Likewise, for the medical domain, Pakhomov (2002) demonstrated a novel approach to normalizing acronyms and abbreviations. As shown in Table 2, Panchapagesan et al. (2004) categorized non-standard words into different types. Some of these types appeared to be useful in addressing punctuation normalization issues, and most of these words contain punctuation marks in one form or another.
Table 2.
Panchapagesan’s List of Non-Standard Words
Non-standard Words
• Cardinal numbers and literal strings
• Ordinal numbers
• Roman numerals
• Fractions
• Ratios
• Decimal numbers
• Time
• Telephone numbers
• Date
• Year
• URL, E-mail, etc.
• Range
• Percentage
• Alphanumeric strings
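A few of Panchapagesan's types can be recognized with simple surface patterns. The patterns below are simplified illustrations of my own; a production system would need locale-aware rules for dates, times, and the like:

```python
import re

# Simplified, illustrative patterns for a few of Panchapagesan's types;
# order matters, since a token such as "3.14" could also match the URL rule.
NSW_PATTERNS = [
    ("decimal number", re.compile(r"^\d+\.\d+$")),
    ("time",           re.compile(r"^\d{1,2}:\d{2}$")),
    ("date",           re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),
    ("percentage",     re.compile(r"^\d+(\.\d+)?%$")),
    ("URL",            re.compile(r"^(https?://)?[\w-]+(\.[\w-]+)+(/\S*)?$")),
]

def nsw_type(token):
    for name, pattern in NSW_PATTERNS:
        if pattern.match(token):
            return name
    return None

assert nsw_type("3.14") == "decimal number"
assert nsw_type("12:30") == "time"
assert nsw_type("95%") == "percentage"
assert nsw_type("www.gutenberg.org") == "URL"
```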
The primary limitation of these previous studies, however, is that they did not conduct their research in the context of text retrieval. For example, none of these studies examined abbreviations with search queries that contain punctuation marks; they predominantly focused on correctly identifying certain types of abbreviated character strings. For this reason, the effects of searching abbreviated character strings in a typical full-text retrieval setting have not been adequately addressed.
Non-Standard Expressions
Occasionally, a text may contain non-standard expressions rather than non-standard words. Such expressions may contain non-standard capitalization and punctuation marks. For example, texts based on casual conversations (e.g., online postings) may lack standard capitalization. Rasulo (2008) reported that saving time and effort is the main reason why people do not use conventional capitalization and punctuation marks. For example, he reported that a sentence such as (4) is non-standard because it lacks appropriate capitalization and uses multiple question marks.
can i phone anyone, need help???? (4)
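One elementary normalization for expressions like (4) is to collapse runs of repeated terminal punctuation. This is a deliberately minimal sketch; it does nothing about the missing capitalization:

```python
import re

def squash_repeated_punctuation(text):
    """Collapse runs of ?, !, or . down to a single mark."""
    return re.sub(r"([?!.])\1+", r"\1", text)

assert squash_repeated_punctuation("can i phone anyone, need help????") == \
       "can i phone anyone, need help?"
```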
The extent to which non-standard expressions affect text retrieval is difficult to estimate, since it is a complicated issue that is often domain specific. Also, the extent to which non-standard expressions are used in real-world corpora is not widely known. Rennie (2007) stated that capitalization and punctuation features are noisy in informal communication. Nevertheless, from a text retrieval view, Rasulo's study indicates that domain-specific knowledge in conjunction with IE techniques is most likely needed.
Chapter Summary
The literature review in this chapter highlighted a number of important aspects of punctuation normalization. Theories from linguistic studies revealed a range of punctuation mark roles. Although constructed from the perspective of computational linguistic theory, Jones provided a useful categorization of punctuation marks. Previous work related to punctuation normalization dealt with issues such as character encoding and character normalization, capitalization, and non-standard words and abbreviations.
Some of the previous works focused on analyzing punctuation marks in different contexts. These studies suggested that an underlying complex relationship exists among different text segments (e.g., character, word, phrase, etc.) and that punctuation marks can be part of this complex relationship.
Although issues related to punctuation marks were abundant, the previous studies did not conduct their research in the context of text retrieval. Consequently, they did not examine the effects of punctuation normalization from a text retrieval perspective. Still, the implication of the previous studies is that a number of important aspects, such as normalizing abbreviations, need to be addressed in investigating punctuation normalization effects. The previous studies suggested that for effective text retrieval of character strings such as abbreviations, punctuation patterns should be examined thoroughly.
CHAPTER 3
METHODOLOGY AND RESEARCH DESIGN
This chapter presents the methodological approaches, detailed procedures of the study, and issues related to analyzing the data. In essence, this research can be characterized as an exploratory and analytic study: an attempt was made to analyze relevant factors for the purpose of exploring punctuation normalization issues and their implications for text retrieval.
Several key procedures were used in carrying out this research, and these procedures are discussed in detail below. The chapter ends with a discussion of the limitations of the study.
Overview of Research Strategy
Once again, the main research question that this study attempted to answer was the following: What are the implications of normalizing punctuation marks from the perspective of text retrieval? To answer this research question, a punctuation frequency analysis was conducted to examine real-world corpora. Along with the results of the punctuation frequency analysis, we applied additional analytical procedures to explore the issues surrounding punctuation normalization.
In this study, various factors had to be considered, including types of users, search and retrieval types, indexing methods, and content types. Since it was not easy to quantify any of these factors, a holistic, integrated approach to the problem was suitable for answering the research questions. The following sub-questions guided the investigation:
1. In what ways and to what extent are punctuation marks and punctuated character strings (i.e., character strings that contain punctuation marks) important in terms of text retrieval?
2. How does punctuation normalization affect the process of matching punctuated character strings?
3. What are some of the useful categories of punctuation marks, punctuation patterns, and punctuation normalization?
4. How can we measure similarity between terms in an orthographic sense so that it can be incorporated into normalizing punctuation marks?
5. In what ways can we use underlying punctuation patterns in a text for a text retrieval system?
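As a point of reference for question 4, one off-the-shelf orthographic similarity measure is the character-matching ratio in Python's difflib. It is shown only as an illustration; the measure developed in this study is presented later:

```python
from difflib import SequenceMatcher

def orthographic_similarity(a, b):
    """Ratio of matching characters between two terms (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio()

# A punctuated term stays orthographically close to its normalized form:
assert orthographic_similarity("U.S.A.", "USA") > 0.6
assert orthographic_similarity("cat", "dog") < 0.2
```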
However, relying on a punctuation frequency analysis alone was not the most effective way to answer the research questions; additional analytical procedures were needed. For this reason, punctuation frequency analysis was initially used to explore the underlying text and gain insights into issues related to punctuation mark normalization. Then, normalization strategy analysis procedures were incorporated. These analytical procedures were often based on hypothetical search queries, which were necessary components in finding the effects of punctuation normalization. The latter parts of the study incorporate this type of analysis, and the details are presented in Chapter 5.
In terms of methodological approach, this study is also exploratory and descriptive in nature. Punctuation mark normalization is a complex problem that needs to be addressed in a holistic and systematic way, and there has been a lack of prior research directly dealing with punctuation normalization issues in the context of text retrieval. The methodology is partly descriptive since this study describes phenomena systematically to reveal patterns and their types.
In addition, a comparative analysis approach underpinned the research design. For example, frequency counts from two different datasets were useful for identifying domain- and genre-specific characteristics of punctuation. The overall analysis results could then become a focal point for understanding the normalization decision-making process.
The Analysis Setting
As in most research studies, an appropriate type of data was needed. It was desirable to select datasets representing different genres or domains. After reviewing various types of text-based content, we decided to use two different datasets: the Gutenberg Project archive (http://www.gutenberg.org) and the Usenet Newsgroup archive (Shaoul & Westbury, 2011).
Coincidentally, after deciding to use the Gutenberg archive and the Usenet archive, we found that Jones (1997) had also used these two datasets for his research. Our research can thus be used to confirm Jones's statistical results. However, Jones's research focused on the functional roles of inter-lexical punctuation marks, so its purpose was quite different. Since our purpose was to examine punctuation normalization issues, Jones's punctuation mark frequency analysis results were useful only to a limited extent.
The Gutenberg Project archive (http://www.gutenberg.org) is a corpus that mostly contains content from e-books. These are the full texts of books from publications that are out of copyright. The Gutenberg Project e-books represent a wide range of topics such as literature, cookbooks, reference works, and issues of periodicals.
For the Gutenberg Project, we downloaded all available e-books in plain-text file format from the Gutenberg archive FTP site, a total of 24,657 files. Varying in length, each file was a complete book or an issue of a periodical. In total, the downloaded books comprised approximately 12 gigabytes of uncompressed text files. The Gutenberg e-books were organized by file number; the lowest numbered file was 10.txt, while the highest numbered file was 36444.txt.
The Usenet Newsgroup archive (Shaoul & Westbury, 2011) is a corpus that consists of newsgroup postings collected for the year 2009. The 2009 Usenet Newsgroup archive exceeded 30 gigabytes of data. The primary reason for including the Usenet Newsgroup archive was that it contained a vast amount of the modern linguistic expressions and vocabulary that people commonly use in daily life.
The Usenet was organized into different discussion groups. The compressed file we downloaded contained 51 files. These files are collections drawn from 47,860 discussion groups, containing only part of the discussion posted by Usenet users. The actual file names are shown in Appendix B. It was unclear how many discussion groups each file contained, but by browsing the content of each file, we determined that each contained a mixture of different discussion groups.
To clarify, in the case of Usenet, the term "dataset" refers to the collected, sampled, and reduced collection drawn from the archive. The two terms "archive" and "corpus" are used synonymously, as they refer to the same object in this study. In reducing each corpus to a dataset, the balance between the performance of processing large amounts of text and the variety of linguistic patterns (including punctuation marks) that each dataset could contain had to be carefully weighed. The goal was to reduce each corpus to a modest size to speed up processing, while keeping it large enough to contain sufficient quantities and varieties of punctuation patterns. To estimate the processing speed, we ran a few Unix commands such as sort. Based on our preliminary examination of the downloaded data, we decided to use approximately 200 MB of data for this study. Figure 4 illustrates the overall process of creating the two datasets.
[Figure 4 shows two parallel pipelines. Usenet Archive/Corpus (51 files): extraction of 100,000 lines from each file produces news.temp, the Usenet dataset (218,807,347 bytes); preprocessing (data cleaning) and tokenization then produce news.1, the tokenized Usenet dataset (188,080,560 bytes). Gutenberg Archive/Corpus (24,657 files): extraction of 165 lines from each file produces g.temp, the Gutenberg dataset (193,105,778 bytes); preprocessing (data cleaning) and tokenization then produce g.data, the tokenized Gutenberg dataset (184,825,251 bytes).]
Figure 4. Analysis dataset creation process.
For each archive, the number of files and the size of each file differed. Having two datasets of equivalent size made more sense. In this manner, comparing the punctuation mark analysis results from the two archives could become much more informative during the course
of analysis. To reduce the corpus into the datasets, different procedures were needed for each
archive. Random selection seemed necessary to reduce potential selection bias.
To create two equal-sized datasets, one for Usenet and one for Gutenberg, file size was
used as an initial guide. Working with individual files was inconvenient in many ways, so we
decided to create two files: one for the Usenet archive and another for the Gutenberg
archive. These are referred to as g.temp for the Gutenberg dataset file and news.temp for the
Usenet dataset file. The size of one file was then used as a benchmark when creating the other,
so that the two files would be of equivalent size.
First, in creating the file news.temp, the file size needed to be approximately equivalent to
193,105,778 bytes, the file size of g.temp. The average size of the 51 files in the
Usenet archive was approximately 500 MB (500,000,000 bytes). This meant that only a small
portion of each of the 51 Usenet files was needed to create news.temp. The top 100,000
lines of each of the 51 files were extracted and appended to news.temp. Data cleaning was then necessary since irrelevant and spurious text was present in the dataset. The cleaning process reduced the file size slightly, yielding a final size of 218,807,347 bytes for news.temp.
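The top-of-file extraction and appending step just described can be sketched in a few lines. This is a minimal Python illustration under stated assumptions (the study itself used Unix tools under Cygwin, so the function names and the in-memory representation of files are ours, not the author's script):

```python
def extract_top_lines(lines, n=100_000):
    """Return the first n lines of a file's contents (the 'head' of the file)."""
    return lines[:n]

def build_dataset(files, n=100_000):
    """Append the top n lines of each file to one combined dataset,
    mirroring how news.temp was assembled from the 51 Usenet files."""
    combined = []
    for lines in files:
        combined.extend(extract_top_lines(lines, n))
    return combined
```

In practice the same effect is obtained by piping the head of each archive file into a single output file.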
To create the file g.temp, a Unix shell script was written to extract text segments from the 24,657 files mentioned above. Each file contained a header with descriptions that were not part of the e-book content. For each Gutenberg e-book text file, 165 lines were extracted, starting from line 401, to create g.temp. To determine the exact lines to extract for g.temp, the dataset creation procedure was performed interactively. Some data cleaning was then necessary due to irrelevant content and errors.
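The Gutenberg segment extraction (165 lines starting from line 401, skipping the header region) can be sketched as follows. The defaults come from the text above; the function itself is our illustration, not the study's shell script:

```python
def extract_segment(lines, start_line=401, count=165):
    """Extract `count` lines beginning at 1-based line number `start_line`,
    skipping the Gutenberg header region that precedes the e-book content."""
    # line numbers are 1-based, so line 401 is index 400
    return lines[start_line - 1 : start_line - 1 + count]
```

With these defaults the segment covers lines 401 through 565 of each e-book file.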
There were more issues with cleaning the Usenet dataset than the Gutenberg
dataset. The problems detected in the Usenet dataset were:
• Metadata-type information that is irrelevant to the actual content. For example, the Usenet dataset contained the line “---END.OF.DOCUMENT---” at the end of each posting.
• Responses to someone’s Usenet posts most often started with right-arrow signs (>), and it was decided to remove these arrows.
• Duplicate postings were detected throughout the Usenet corpus.
• Some single arrows > were also placed as a soft line break.
• A fair number of long sequences of hyphens ---- and double right-arrow marks >> were embedded in the text. Sometimes an equal sign = was placed at the end of a line. These types of punctuation marks were removed.
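The cleaning rules listed above can be sketched as a per-line filter. This is a simplified Python illustration; the actual cleaning was applied judiciously rather than mechanically, and the exact regular expressions below are our assumptions (deduplication of repeated postings is omitted):

```python
import re

def clean_usenet_line(line):
    """Apply a sketch of the Usenet cleaning rules; return None to drop a line."""
    if line.strip() == "---END.OF.DOCUMENT---":
        return None                      # drop the per-posting metadata delimiter
    line = re.sub(r"^[>\s]+", "", line)  # strip quoting arrows (> and >>) at line start
    line = re.sub(r"-{4,}", "", line)    # remove long hyphen runs
    line = re.sub(r"=$", "", line)       # drop a trailing soft-break equal sign
    return line
```

Each surviving line is then written to the cleaned dataset file.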
It appeared that the double arrows >> in the Usenet dataset correspond to what Gellens
(2004) calls an embarrassing line wrap. Since the archive was in plain text
format, the embarrassing line wraps were created during the conversion to 80 characters per
line; such a mark denotes a continuation of the previous line. The double arrows also
marked quoted messages in replies or forwarded messages. It is unclear how they accumulated, but the report written by Gellens (2004) discusses various issues related to these types of characters.
A complete and extremely accurate cleaning was difficult to achieve. For the purposes of this research, these types of metadata and punctuation marks were
judiciously removed. Also, the Usenet dataset contained a very few non-printing and control
characters that are not specified in Appendix A. Since an ASCII character such as a TAB was
not considered a punctuation mark, such characters were removed as they appeared. These types of
unrelated information could be viewed as anomalies for understanding the effects of punctuation
mark normalization. It also reflected a real-world scenario in which the text might contain
irrelevant meta-information and erroneous strings that require a thorough cleaning procedure.
The Gutenberg dataset also contained some idiosyncratic text characteristics.
Some patterns of punctuation marks were “local” to the Gutenberg dataset. For example, periods
were used extensively to fill certain spaces. These punctuation marks were left in the datasets as
they clearly required further investigation for the purpose of this research. Such punctuation
marks were regarded as formatting punctuation marks since such punctuation marks were mainly
used for structural purposes in a given text. In a technical sense, removing these types of strings
from the text could be considered as punctuation normalization. It became apparent that
removing formatting punctuation marks could be done in different stages in the process of
developing a text retrieval system.
At the end, the total file size of g.temp was 193,105,778 bytes. Once g.temp and
news.temp were created, the tokenization procedures for both files were applied. A blank-space-based tokenization was applied to g.temp, creating a tokenized dataset, g.data. The same blank-space-based tokenization was applied to news.temp to create a tokenized file, news.1. Although the two tokenized datasets were used most of the time, sometimes g.temp and news.temp, the un-tokenized dataset files, were needed to retrieve text lines containing particular character strings. A trial-and-error process was necessary to obtain files of approximately equal size. The file size of news.1 was 188,080,560 bytes, while the file size of g.data was 184,825,251 bytes.
The tokenization process is described later in this chapter. More detailed tokenization procedures along with basic statistics of tokenized character strings are provided in the next chapter.
These two datasets were available in a plaintext file format. Although the plaintext files were downloaded at the outset, the exact character set encoding schemes (e.g., ASCII, Unicode) and the extent to which character set differences might become an issue were not determined at first. In retrospect, these issues perhaps should have been investigated from the beginning. The character set differences are described in more detail in a subsequent chapter.
Both datasets contained textual content that can commonly be described as natural language. This meant that the text retrieval issues pertained to natural language rather than metadata or controlled vocabularies. Although the text in both datasets is natural language, each dataset reflected a very different genre. The language used in the Usenet dataset was mostly informal: it contained unrestricted language with non-standard expressions. The text in Usenet did not conform to any specific rules of writing or stylistic guidelines, as Jones (1997) pointed out. It also contained more grammatical errors due to the unfiltered nature of the postings.
In contrast, e-books from the Gutenberg Project had gone through typical publication procedures. As far as genres of the contents are concerned, a considerable amount of the
Gutenberg books could be labeled as classics, implying that the dataset probably contained a smaller amount of contemporary slang, idioms, and abbreviations. In other words, the Gutenberg Project e-books reflected the more traditional, common expressions that typically appear in standard published books. To this end, the punctuation mark usage was expected to contain fewer errors, since a typical published work goes through an editing process.
The primary reason for using the two datasets in this study was that a comparative analysis strategy could be employed. This strategy was adopted on the presumption that multiplicity and variety would offer a greater, richer mixture of contents and language usages,
and a variety of punctuation mark types and usages. As a consequence, interpreting analysis
results based on comparative observations was expected to strengthen this study’s findings.
Using the Usenet and Gutenberg corpora to examine punctuation marks is not
entirely new. Among other corpora, Jones (1997) used
datasets from Usenet and the Gutenberg Project for his study. As indicated in an earlier chapter, he
previously reported basic frequency counts of different punctuation marks based on his sample.
This study was an extension of the research done by Jones (1997) and Meyer (1983). However, the overall extent, goals, and treatment of punctuation marks in the current study were entirely different from those of these previous studies.
Lastly, this study used an analysis that required some computer programming. Frequency
counts of different text-based units (e.g., punctuation marks, punctuated tokens) were
conducted with custom programs, since no software tool was available to automatically analyze
punctuation marks. A combination of Unix-based scripting tools (e.g., grep) and the
C++ programming language was employed in this study. To operate in a Unix-like
environment, Cygwin (http://www.cygwin.com/) was used inside the Microsoft
Windows operating system.
Methodological Procedures
Figure 5 illustrates the overall methodological stages involved in this research and their corresponding dissertation chapters. Successful completion of all of these procedures was necessary to achieve the objective of this study. The methodological procedures were “iterative” to a certain extent. The flow of these iterative steps is indicated as the
“feedback” process in Figure 5. In this study, tokenization was viewed as an initial starting point
of categorization for the punctuation frequency analysis. In the course of following the methodological procedures, it was occasionally necessary to revisit a previous procedure to improve the study’s findings.
Figure 5. Major methodological procedures and corresponding chapters.
Highlights of Methodological Procedures
The rest of this chapter describes the methodological procedures and the issues that were encountered. In particular, the basic issues are discussed with respect to the following subcomponents:
• Tokenization (segmentation)
• Frequency counts and punctuation mark types
• Additional types of punctuation
• Punctuation normalization types
• Types of search queries and punctuation normalization effects
• The punctuation normalization demonstration
Tokenization (Segmentation)
Punctuation frequency analysis in this study starts with a tokenization procedure. The
tokenization procedure automatically converted the underlying text into
tokenized elements (e.g., punctuated tokens). The basic strategy was to count
occurrences of tokens based on different punctuation mark types.
In this study, tokenization was carried out based on blank spaces for practical reasons:
• The use of blank spaces was the simplest means of identifying tokens, and a token in this study is the fundamental unit that needs to be identified.
• Different types of punctuation marks could be consistently categorized based on their string position.
The strategy was to consider a more sophisticated tokenization scheme only after understanding the implications of the blank-space-based tokenization procedure. As mentioned in an earlier chapter, previous works have addressed some of the issues related to tokenization. However, the intention of this study was to investigate punctuation patterns and the punctuation normalization phenomenon by focusing on blank-space-generated tokens as a starting point of investigation.
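Blank-space-based tokenization is simple enough to sketch directly. The following Python fragment is an illustration of the procedure described above (the study's own counting was done with Unix tools and C++; the function names are ours):

```python
from collections import Counter

def tokenize(text):
    """Blank-space-based tokenization: split the text on runs of whitespace."""
    return text.split()

def token_frequencies(text):
    """Count occurrences of each blank-space token."""
    return Counter(tokenize(text))
```

Note that punctuation stays attached to the tokens, which is precisely what makes punctuated tokens observable.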
Punctuated Tokens
In determining the fundamental unit of analysis, defining the notion of a punctuated token
was useful. A punctuated token is a punctuated character string in the sense that the token contains both alphabetical characters and punctuation marks. For example, the token C++ is a punctuated
token. In this case, the letter C is the alphabetical character component, and the double plus signs ++
are the punctuation mark component.
If the punctuated token is defined based on a blank space as a separator, a character
string such as B. C., which has a blank space in the middle, is decomposed into two
separate punctuated tokens:
• B.
• C.
Using only blank-space-based tokenization might not be efficient for developing an
effective punctuation normalization scheme. Treating the pieces as separate tokens can often be
permitted, since they can be collectively searched as a phrase. For this reason, the
individual token positions in the text generally needed to be recorded if the pieces were treated separately as
tokens.
In particular, the rationale for using punctuated tokens was twofold. First, punctuation
normalization issues could be explored based on the affix-based punctuation mark types defined
in the subsequent section. Second, some of the punctuation mark characteristics and their roles
could be discovered using a blank-space-based tokenization procedure. Unless noted otherwise,
we assume a blank-space-based tokenization, and “punctuated token” refers to a blank-space-based token.
Tokenization could also be applied to process a user’s search query. For example, a
user’s search query may contain any one of the following abbreviated terms: B. C., B.C., and BC.
Assuming that all of these character strings are considered semantically identical, only stylistic
differences in terms of punctuation marks are evident. Yet notice how blank-space-based
tokenization becomes a relevant issue. The first character string, B. C., consists of two
individual punctuated tokens, B. and C., according to our rule of tokenization. If these tokens
were treated individually, they would not have the same meaning as B. C.; only the
combined tokens properly carry the intended meaning of the term. Such issues had to be further
probed during the tokenization process.
In many cases, a punctuation mark on its own was not a distinguishing feature for
identifying its role in the text. Only when punctuation marks are used in conjunction with
alphabetical characters can they provide a more definitive and proper
context. The underlying rules of punctuation mark usage are flexible enough that many
combinations can be formed, with each representing a slightly different sense of a character string. This rich, flexible set of features was also a factor in text retrieval, since such strings require proper normalization to effectively retrieve relevant items.
The Affix-Based Punctuation Mark Types
Based on the notion of punctuated tokens, individual punctuation marks were categorized according to their relative character position in the text. Initially, identifying different types of punctuation marks and punctuated character strings based on their operational characteristics allowed a succinct and accurate referencing capability. By categorizing or grouping punctuation marks based on their locality, the punctuation normalization problem was
closely examined. As a result, the consequences and retrievability-related issues could be better
understood. These terms, along with additional punctuation mark types, are discussed
throughout the rest of the study. The punctuation mark types are important attributes in this study since they
enabled us to address a variety of punctuation normalization problems. Three fundamental affix-based types of punctuation marks can be formed if a blank space is used as a delimiter and a
character string separator. These are the following:
• Prefix-punctuation mark
• Suffix-punctuation mark
• Infix-punctuation mark
These terms were created for this study to refer to specific locations of punctuation marks.
A prefix-punctuation mark appears at the beginning of a token; more than one prefix-punctuation mark can occur through concatenation with other prefix-punctuation marks. A suffix-punctuation mark appears at the end of a character string; more than one suffix-punctuation mark can occur through concatenation with other suffix-punctuation marks.
Infix-punctuation marks can appear in any token position except at the beginning and
end of a character string; that is, infix-punctuation marks must not be in the prefix or suffix
position. More than one infix-punctuation mark can occur through concatenation with other infix-punctuation marks. Occasionally there can be more than one disjointed infix-punctuation mark in a given token (e.g., all-you-can). These individual marks can be referred to as disjointed infix-punctuation marks. Regardless of the type of infix-punctuation marks, they are collectively referred to as infix-punctuation marks.
Table 3 provides examples of the basic affix types of punctuation marks. Note that some
marks can be associated with more than one affix type. As shown in Table 3, if a decomposed
token is [1989-1999], then this decomposed token contains all three affix-based types:
prefix-, infix-, and suffix-punctuation marks. In this case, the prefix-punctuation mark is [, the infix-punctuation mark is -, and the suffix-punctuation mark is ].
Table 3.
Example of Punctuation Mark Types and Tokens

Textual Element   Punctuated Token   Punctuation Mark   Affix Type
question?         question?          ?                  suffix
AT&T              AT&T               &                  infix
12:00             12:00              :                  infix
C++               C++                ++                 suffix
Easy.             Easy.              .                  suffix
O’Brien           O’Brien            ’                  infix
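The affix-based decomposition can be sketched as follows. This Python fragment is our illustration of the classification, assuming the ASCII punctuation set (the study's Appendix A defines the actual character inventory):

```python
import string

PUNCT = set(string.punctuation)  # ASCII punctuation; an assumption standing in for Appendix A

def affix_types(token):
    """Split a blank-space token into its prefix-, infix-, and suffix-punctuation marks."""
    i, j = 0, len(token)
    while i < j and token[i] in PUNCT:       # peel leading punctuation (prefix)
        i += 1
    while j > i and token[j - 1] in PUNCT:   # peel trailing punctuation (suffix)
        j -= 1
    prefix, core, suffix = token[:i], token[i:j], token[j:]
    infix = "".join(ch for ch in core if ch in PUNCT)
    return {"prefix": prefix, "infix": infix, "suffix": suffix}
```

Applied to the token [1989-1999], the function yields [ as prefix, - as infix, and ] as suffix, matching the example above.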
A punctuation mark such as a period can be ambiguous if the other surrounding textual
elements are not closely examined. In Table 3, the token Easy. contains a period at the end.
This period most likely signifies a full stop. However, the two periods in the token B.C. do not function as full stops, i.e., end-of-sentence markers. If such an abbreviated token were found at the end of a sentence, then the role of the ending period after the letter C, whether it is a full stop or a
period after an abbreviation, might be more difficult to determine computationally. We would
have to craft an exception rule, since the period after the C should be categorized as both an abbreviation period and a full stop.
Such diverse positions and uses of punctuation marks are linked to the complexity of punctuation mark roles in our language. One of the initial observations made from associating text with different affix-based tokens was that disambiguation of the role of the period was necessary. Unlike a human making a judgment, a computer program often requires heuristic-based algorithms to determine the contextual meaning of each token. However, such algorithms tend to be ad hoc in most cases. Even with some type of disambiguation procedure, unpredictable behavior could be expected during the text retrieval process. This was another reason that a systematic method to identify and categorize relevant punctuation patterns had a practical benefit in understanding punctuation normalization issues.
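To make the ad hoc nature of such heuristics concrete, the following is a deliberately naive sketch of a period-role disambiguator. It is our illustration, not the study's algorithm, and its rules (internal periods or a single capital suggest an abbreviation) are assumptions:

```python
def period_role(token, next_token=None):
    """Naive heuristic: classify a token-final period as a full stop,
    an abbreviation period, or ambiguously both."""
    if not token.endswith("."):
        return None
    # internal periods (B.C.) or a lone capital letter (C.) suggest an abbreviation
    if "." in token[:-1] or (len(token) == 2 and token[0].isupper()):
        # at the end of a sentence the same period may play both roles at once
        if next_token is None or next_token[0].isupper():
            return "abbreviation-or-full-stop"
        return "abbreviation"
    return "full-stop"
```

Even this tiny rule set shows why unpredictable behavior arises: the classification of B.C. flips depending on what follows it.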
Frequency Counts and Punctuation Mark Types
One of the reasons for performing various frequency counts was that the results could expose hidden characteristics of punctuation marks that were particular to a dataset. Jones (1997) reported overall counts of punctuation marks using samples drawn from Usenet and Gutenberg.
This study, however, employed a frequency count strategy by examining the affix-based token types and punctuation patterns. Also, based on frequency counts of punctuation patterns, additional useful types could be further identified. During the course of this study, attempts were made to categorize and detect some of the ambiguous punctuation marks.
Once the different types of frequency counts were obtained, the results had to be sorted and ranked. The next step was to investigate the implications of the frequency counts. In general, a higher frequency count indicated greater use of particular punctuation marks, punctuation patterns, or punctuated character strings. The punctuation mark frequency count strategy was
basically used to obtain underlying characteristics of the text of the datasets. Often we were able
to further develop punctuation mark categories by employing additional frequency counts.
Initially, through the frequency counts of affix-based punctuation marks, a certain
understanding of the problems associated with punctuation mark roles could be obtained. The
frequency counts also pointed toward locating punctuation patterns and examining more specific patterns in the datasets. Examining punctuation marks in conjunction with punctuated character
strings was necessary in certain cases. In other cases, inspecting an entire sentence, by
retrieving the lines that contain the character strings, was necessary to determine the particular
role of a punctuation mark. In this respect, the frequency counts assisted in identifying potentially
ambiguous roles of punctuation marks; in most cases, they called for additional inspection
of the punctuation mark types. In a larger sense, factors that potentially influence punctuation
normalization, and the effects of the normalization, were recognized in this procedure.
Additional Types of Punctuation
Additional categories of punctuation marks and punctuation patterns could be developed
based on the datasets. The purpose was to gain further insights into the punctuation patterns.
The categories proved to be an effective means to analyze the punctuation normalization effects.
One possible approach to handling the punctuation normalization problem was to consider a
finite number of plausible punctuation patterns in conjunction with search queries. These variant
categories were constructed logically based on noticeable punctuation patterns.
Certain textual characteristics associated with punctuation patterns were obtained through
various frequency counts within one dataset. The underlying research strategy, as mentioned
earlier, was to compare the two datasets in order to further strengthen the research findings.
The frequency count results from the two datasets varied in different ways.
Since we wanted to compare the two datasets, we conceptualized the different outcomes that a
comparison of punctuation marks could yield. As shown in Figure 6, the strategy of comparing
and contrasting frequency results can be represented in a two-dimensional space. In each of
the analysis scenarios, plausible explanations had to be sought, triggering further investigation
into the deeper problem domain.
[Figure 6 shows a two-dimensional space plotting frequency in Dataset A (high/low) against frequency in Dataset B (high/low), dividing the space into four regions: region 1 (high in both), region 2 (high in A, low in B), region 3 (low in A, high in B), and region 4 (low in both).]
Figure 6. Possibilities of punctuation marks in two datasets.
As shown in the figure, if we label each frequency count with a binary result, “high” or
“low”, the combinations lead to four distinct possibilities. A descriptive interpretation of Figure 6 is shown in Table 4.
One possibility is that the frequency of a punctuation pattern is high in one dataset but low in the other. Regions 2 and 3 represent the areas where such conditions can be observed. In these cases, a plausible explanation was further sought. These
cases needed to be investigated further, since doing so could explain the reasons for substantial differences
between frequency count results. In most cases, substantial frequency count differences were
linked to underlying domain characteristics. Another possibility is that the frequency of a
punctuation pattern is high in both datasets or low in both. Regions 1 and 4 represent
the areas where such conditions can be observed.
Table 4
Possible Interpretation of Figure 6 Scenarios

Region   Dataset A   Dataset B   Possible Interpretation of Scenarios
1        High        High        The occurrences of the punctuation mark in both datasets could be high.
4        Low         Low         The occurrences of the punctuation mark in both datasets could be low.
2        High        Low         The underlying reason for a punctuation mark count
3        Low         High        difference between datasets A and B needs to be investigated.
The sense of “low” and “high” in this figure is rather subjective: it is a relative judgment that one
has to make. Even when only two datasets are compared based on these possibilities, a plausible explanation needed to be pursued when interpreting the punctuation mark frequency results using this strategy.
The initial assumption was that a frequency count of punctuation patterns could easily indicate the role of punctuation marks. This, however, turned out to be a naive view of
comparative analysis, because a punctuation mark sometimes has multiple roles in a text. Nevertheless, the counts did call for more in-depth analysis in most cases. Also, identifying the scenarios pointed out in Table 4 was useful in the sense that it served as an entry point to a deeper
understanding of the different functions and types of punctuation marks.
In sum, the frequency count strategy used in this study was an inductive, so-called
bottom-up, approach. Considering that there was not a sufficient amount of prior research
focused on punctuation normalization issues, some of the more meaningful categories could
emerge as a result of applying punctuation frequency analysis inductively, as Thomas (2004)
pointed out. Because an inductive approach was employed in this study, no formal hypothesis
was formed initially. Category development (e.g., the punctuation normalization types), on the
other hand, proceeded through a combination of deductive and inductive punctuation frequency
analysis steps, partly relying on previous works such as Jones (1997).
Punctuation Normalization Types
To examine punctuation mark normalization, it was necessary to carefully identify
normalized terms in conjunction with the types of punctuation normalization. The previous procedures focused mainly on discovering punctuation patterns; unlike them, this procedure attempted to answer the research questions more directly. Essentially, this was an observational analysis based on punctuation normalization phenomena and punctuation patterns.
The major difference between normalized and un-normalized terms is that normalized terms have undergone a punctuation mark normalization process, whereas un-normalized terms refer to text in its original state, with punctuation marks intact. In effect, the normalization process itself was categorized into a number of possible types, and qualitatively analyzing these types was expected to help in answering the research questions.
In the context of normalizing punctuation marks, the discovered punctuation patterns needed to address the following questions:
• What type of punctuation normalization operation could be applied?
• How did the token change due to applying the punctuation normalization operations?
• In the context of text retrieval, what were the effects of normalizing different types of punctuation patterns?
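The questions above can be made concrete with two illustrative normalization operations: deleting a punctuation mark outright versus replacing it with a blank space. Both the operation names and the default mark set below are our assumptions, not the study's actual rules; they simply show how the same token changes differently under different operations:

```python
MARKS = ".,;:!?'\"()[]"  # an illustrative subset of punctuation marks

def normalize_delete(token, marks=MARKS):
    """Delete punctuation marks outright: 'B.C.' -> 'BC'."""
    return "".join(ch for ch in token if ch not in marks)

def normalize_to_space(token, marks=MARKS):
    """Replace punctuation marks with blank spaces: 'B.C.' -> 'B C'.
    Note this changes the token count after re-tokenization."""
    replaced = "".join(ch if ch not in marks else " " for ch in token)
    return " ".join(replaced.split())
```

The choice between the two operations is exactly what determines whether a query for BC or for B C would later match the normalized text.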
In examining the punctuation normalization problem, it was also important to note that the procedure relied on a statistical account of the various punctuation patterns identified in the previous procedures. That is, punctuation mark normalization was examined qualitatively against a collection of facts based mostly on the interpreted statistical results of punctuation patterns. At this point, previous work on punctuation mark frequency analysis, such as Jones’ work (1997), was taken into account as well.
One of the tasks was to distinguish punctuation patterns that depended on a specific domain or genre. The frequency count procedures might indicate domain- or genre-specific characteristics. In turn, some of these frequency count results helped in defining the categorization of punctuation normalization rule types and normalized punctuation mark patterns. However, identifying the characteristics of normalized punctuation patterns required a holistic approach that
considered several related factors.
Various factors were considered in measuring the similarity between potentially relevant terms. For example, we could have two terms where one is normalized and the other is un-normalized. Let us take B.C. as a search query term and BC as a potential term that should
be matched. We can clearly see that these two character strings are orthographically similar.
Considering this type of situation, we considered different schemes to assess the orthographic
similarity of a search query term and a potentially matchable term.
To this end, the concept of orthographic similarity led to an orthographic difference
measurement. Devising formulas to measure orthographic differences had to consider textual
attributes such as alphabetical string differences and punctuation mark string differences. An orthographic distance formula was considered since it was useful for understanding the orthographic difference between two terms.
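One plausible formulation of such a distance, sketched here as our assumption rather than the study's actual formula, is a weighted sum of two edit distances: one over the alphabetical component and one over the punctuation component. The weights w_alpha and w_punct are illustrative:

```python
import string

def _split(term):
    """Separate a term into its alphabetical and punctuation components."""
    letters = "".join(ch for ch in term if ch not in string.punctuation)
    puncts = "".join(ch for ch in term if ch in string.punctuation)
    return letters, puncts

def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def orthographic_distance(t1, t2, w_alpha=1.0, w_punct=0.5):
    """Weighted sum of alphabetical and punctuation edit distances (weights are assumptions)."""
    a1, p1 = _split(t1)
    a2, p2 = _split(t2)
    return w_alpha * levenshtein(a1, a2) + w_punct * levenshtein(p1, p2)
```

Under this scheme, B.C. and BC differ only in their punctuation component, so their distance is small, matching the intuition that they are orthographically similar.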
Types of Search Queries and Punctuation Normalization Effects
Queries could be categorized into two basic types: exact search and approximate search.
A search query used in an exact search is referred to as an exact search query. As the name suggests, an exact search requires matching strings in a dataset against the terms in the search query exactly as they appear: orthographically, all spellings, individual letters, and punctuation marks must be identical.
On the other hand, for an approximate search, a text retrieval system needs to match terms closely, though not necessarily exactly, against a search query. A search query used to perform approximate searches is referred to as an approximate search query. A more elaborate arrangement was necessary for this query type. For example, a regular-expression-style search (Friedl, 2006), normally represented by a wildcard * operator, could be used to match any characters. A wildcard typically matches any characters zero or more times. For example, B*C* could match any of the following:
• B.C.
• B.C.;
• BC
• BCDE
• B-C-D
• BEACON
In this case, the letter B is matched first. Then, any number of characters can be matched, as long as the letter C is matched at some point afterward. Optionally, any number of characters can be matched again after that point. As another example, if the search query term is expressed as
*B, none of the items in the above list would be matched, since the query states that only terms ending
with the letter B are to be matched. The shortcoming of applying such a wildcard operator
was that too many unrelated items could be retrieved as search results.
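This wildcard behavior can be demonstrated with Python's standard shell-style pattern matching; the study does not specify an implementation, so fnmatch here is our stand-in for whatever matching machinery a retrieval system might use:

```python
import fnmatch

def wildcard_matches(pattern, candidates):
    """Return the candidates matched by a shell-style wildcard pattern,
    where * matches zero or more characters (case-sensitive)."""
    return [c for c in candidates if fnmatch.fnmatchcase(c, pattern)]
```

Running B*C* over the list above matches every item, including the unrelated BEACON, which illustrates the over-retrieval problem just described.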
Finally, we need to consider the length of search queries. Considering the variety of domains, it is difficult to know the average length of a search query. If we consider only the Web environment, previous studies indicate that search queries are generally short, as Gabrilovich et al.
(2009) pointed out; they estimated that the average length of a search query is between 2.4 and
2.7 words.
The Normalization Demonstration
Performing text retrieval demonstrations could provide additional insights into punctuation normalization issues. Assuming that the previous procedures provided sufficient grounds for identifying punctuation patterns, the normalization demonstrations were conducted as a means of testing specific cases involving search terms. To this end, the main purpose of this procedure was to find the effects of normalization in the context of text retrieval by running different types of search query procedures and comparing their search results.
After examining the basic token types, hypothetical search queries had to be formulated to use in the demonstration. In terms of finding answers to the proposed research questions, demonstrations with hypothetical search terms were used. A hypothetical search query is a
“what-if” type of query that was useful for investigating the effects of punctuation normalization.
The purpose of running these types of queries was to find out specific types of punctuation normalization phenomena.
Table 5 shows the basic criteria for comparing search query results and examining punctuation normalization effects. The simplest normalization demonstration was to compare and analyze two fundamental search criteria: normalized text versus un-normalized text. By comparing the search results based on these two criteria, search queries that utilized punctuation marks could be identified. Initially, the study relied on examining normalized results and alternatively normalized results on a comparative basis. In this manner, multiple normalization options or rules could be compared to each other.
As shown in Table 5, the search queries were divided into two basic types: an Un-normalized Search Query and a Normalized Search Query. Using these two searches as the basic queries, search results from the Gutenberg and Usenet datasets were repeatedly compared. Additionally, other punctuation normalization demonstration parameters (e.g., exact vs. approximate search queries) had to be considered, as mentioned earlier.
Table 5.
Basic Criteria for Analyzing Search Results
Gutenberg Dataset Usenet Dataset
Un-normalized Search Query
Normalized Search Query
In most cases, the issue was explored qualitatively to better understand the punctuation
normalization phenomena. For practical reasons, the number of demonstration trials that
involved different types of searches had to be limited. Although many interesting punctuation patterns were expected to be found, only the demonstrations that seemed most important were carried out.
Without going into the specifics of the demonstration results, some preliminary interpretations can be discussed. At the time of designing the procedures, it was believed that the results from the demonstrations could indicate common semantic problems associated with punctuation normalization. In interpreting quantitative results, other frequency counts had to be examined collectively. Also, the relationship between similar punctuated character strings had to be examined in detail. The pertinent question in terms of punctuation marks, or in terms of orthography as a whole, was how close the terms in a search query are to the terms in the datasets. Collectively, an attempt was made to discover the semantic effects of normalizing punctuation marks.
Methodological Limitation
One of the limitations of this study was that we relied on only two datasets as primary sources of data. Although the datasets reflect natural language as used in the real world, generalizations based on only two datasets represent a limited range of genres (e.g., classic novels, children's literature, etc.). There are many other domains (e.g., bibliographical text, SMS messaging, etc.), and each of these domains may reflect distinctive punctuation mark usage.
In this sense, punctuation patterns that are domain- or genre-specific are generalizable only in the context of a specific domain. Although the datasets were large, the size of the data was potentially a relevant issue because the majority of punctuation patterns appeared infrequently in the datasets. This was an indication that other datasets could contain different
types of frequently or infrequently used punctuation patterns that might be useful in other special circumstances.
Also, qualitative relevance judgments were based on this researcher's analytical reasoning alone; users were not involved in the study. Because of this, it was sometimes difficult to achieve certainty in estimating the contextual sense of the character strings in the datasets, especially when they were normalized.
Lastly, even with the identified demonstration types, conducting a larger number of demonstrations was not feasible in the majority of cases; for practical reasons of time and resources, the number of tests that could be performed had to be significantly limited.
Chapter Summary
This chapter presented methodological details and research design issues. With the overarching goal of improving text retrieval practices, the challenge was to address punctuation normalization issues in a holistic and systematic manner. This chapter presented six detailed procedures of the study. The procedures were: tokenization, frequency counts, additional punctuation, punctuation normalization types, types of search queries and punctuation normalization effects, and the normalization demonstrations. All of the procedures were designed to find implications of punctuation normalization from a text retrieval viewpoint.
The first part of this research could be labeled as punctuation frequency analysis. The methodology largely relied on the results from the punctuation frequency analysis. The punctuation frequency analysis was based on the sampled datasets from two corpora: the
Gutenberg Project archive and the Usenet Newsgroup archive. The punctuation frequency
analysis procedures focused for the most part on discovering generalizable punctuation patterns and punctuated tokens. The particular advantage of the punctuation frequency analysis was that various roles of punctuation marks could be characterized and quantified. From this study’s view, qualitatively determining punctuation patterns and the effects that they have on text retrieval were necessary.
Based on the results from the first three procedures, the latter procedures focused on analyzing the effects of punctuation normalization, including an appropriate search and retrieval demonstration. Through the demonstration, we attempted to find additional implications of normalizing punctuation marks. Some of the procedural details were highlighted in this chapter, which ended with a note on the methodological limitations.
CHAPTER 4
THE PUNCTUATION FREQUENCY ANALYSIS
This chapter describes the results of the analysis of punctuation marks and their related categories, based on the method described in the previous chapter. Frequency analysis was applied based on affix types: prefix, infix, and suffix. For this study, identifying different types of punctuation marks and punctuation patterns was an important prerequisite to observing punctuation normalization phenomena. Once patterns were identified, additional useful punctuation mark types were further identified. Particularly based on the locality of punctuation marks, the focus of this analysis was to generalize punctuation mark behavior for the purpose of characterizing the effects of punctuation normalization.
The descriptive statistical approach, relying specifically on frequency counts, was used to quantify and explain the patterns of punctuation marks and punctuated tokens. The approach taken was that, through this initial analysis of punctuation marks, foundational knowledge related to punctuation normalization could be developed. This chapter covers the first part of the study
which utilized a punctuation frequency analysis of the two datasets. Three major punctuation
frequency analysis procedures in this chapter are the following:
• Tokenization
• Frequency counts and punctuation mark types
• Additional types of punctuation
This chapter is organized in such a way that the topics discussed roughly correspond to the punctuation frequency analysis procedures that were outlined in the previous chapter.
Tokenization
Tokenization is the first punctuation frequency analysis procedure of this study.
Tokenization based on blank space was examined initially. Other options were considered after
investigating blank-space-based tokenization. Although tokenization seemed to be a simple text processing procedure at the beginning, it became a basis for finding further issues related to punctuation normalization in this study. Tokenization based on a blank space was a trivial and straightforward process. On Unix-like systems, the simple tr command could be used to transform input into output. The command used to translate each blank space into a newline is shown in (4).
tr ' ' '\n' (4)
The first two apostrophes indicate that there is a blank space; a single-character blank space between the first two apostrophes ' ' is necessary. The next element '\n' stands for a newline. The above tr command was used to replace all occurrences of blank spaces with a newline \n.
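As a sketch, the full tokenization step could be run as the following pipeline. The file names input.txt and tokens.txt and the toy input line are hypothetical.

```shell
# Toy input: one line of text containing punctuated tokens.
printf 'the B.C. era -- the C++ era\n' > input.txt
# Rewrite every blank space as a newline: one token per line.
tr ' ' '\n' < input.txt > tokens.txt
# A simple token frequency list, sorted by descending count.
sort tokens.txt | uniq -c | sort -rn
```

Note that the punctuation-exclusive token -- and the suffix-punctuated token C++ survive tokenization intact, which is what makes the later punctuation frequency counts possible.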
An alternative attempt was to use word-based tokenization. In such a case, a chain of tokens could be regarded as a single word. Since a word has to be precisely defined, this was a separate topic of its own. If a blank space was used as the basis for tokenization, the results would be more consistent and comparable for future studies. Nevertheless, using a blank-space tokenization procedure, a single punctuation mark or a string of punctuation marks could be produced as a result of tokenization.
In defining punctuation mark types, notice that infix-, prefix-, and suffix-punctuation marks refer only to the specific types of affix-based punctuation marks. Based on the above punctuation mark affixes, three primitive token types can be defined as the following:
• A token that contains one or more prefix-punctuation marks is defined as a prefix-punctuated token.
• A token that contains one or more infix-punctuation marks is defined as an infix-punctuated token.
• A token that contains one or more suffix-punctuation marks is defined as a suffix-punctuated token.
In addition, other related terms needed to be established for the purpose of analyzing the punctuation marks. They were the following:
• Punctuated token denotes a token that contains any type of punctuation mark.
• Punctuation-exclusive token denotes a stand-alone punctuation mark that is non-affixed. That means a punctuation-exclusive token comprises only punctuation marks and does not contain alphanumeric characters.
The above defined affix-based token categories were useful for observing punctuation mark
normalization phenomena since they were used to understand the changes that occur with
normalized tokens as a whole. Figure 7 is an example that shows different types of tokens. Note
that a token may fall into more than one punctuated token category. As in the case of the punctuated token B.C., the token is referred to as an infix-suffix punctuated token since it could be categorized as both an infix-punctuated token and a suffix-punctuated token. Thus, in this figure, a punctuated token could be any one of the following: prefix-punctuated token, infix-punctuated token, suffix-punctuated token, infix-suffix punctuated token, and punctuation-exclusive token.
Figure 7. An example showing different types of tokens.
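The token types defined above can be sketched as a small shell classifier built from grep character classes. This is a rough sketch: the classify function and its output format are invented for illustration.

```shell
# Classifies a token by its affix-based punctuation marks:
#   punctuation-exclusive: no alphanumeric characters at all
#   prefix: begins with a non-alphanumeric character
#   infix:  a non-alphanumeric character flanked by alphanumerics
#   suffix: ends with a non-alphanumeric character
classify() {
  t=$1; out=""
  if printf '%s' "$t" | grep -q '^[^0-9a-zA-Z]*$'; then
    printf '%s: punctuation-exclusive\n' "$t"; return
  fi
  printf '%s' "$t" | grep -q '^[^0-9a-zA-Z]' && out="$out prefix"
  printf '%s' "$t" | grep -q '[0-9a-zA-Z][^0-9a-zA-Z].*[0-9a-zA-Z]' && out="$out infix"
  printf '%s' "$t" | grep -q '[^0-9a-zA-Z]$' && out="$out suffix"
  printf '%s:%s\n' "$t" "${out:- none}"
}
classify 'B.C.'     # infix and suffix (an infix-suffix punctuated token)
classify 'C++'      # suffix only: no alphanumeric follows the ++ run
classify '"Hello'   # prefix only
classify '--'       # punctuation-exclusive
```

The B.C. and C++ cases mirror the categorization discussed in this chapter: B.C. falls into both the infix and suffix categories, while the trailing ++ of C++ makes it a suffix-punctuated token.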
Frequency Counts and Punctuation Mark Types
In this study, frequency counts and punctuation mark types constitute the second procedure of the punctuation frequency analysis. Initially, the use of frequency counts allowed us to discover punctuation mark characteristics. Through the process of conducting frequency counts based on affix-based tokens, we were able to identify different types of punctuation marks. This section describes the results.
Basic Statistics of Punctuation Marks
In the previous chapter, we described the two separate datasets that were sampled from each corpus. In the end, each dataset purposefully contained an approximately equal number of tokens. Because a simple blank space was used to tokenize character strings, any sequence of characters surrounded by blank spaces could be considered a token. A token could comprise alphabetic characters, numeric characters, punctuation marks, or a mixture of these characters.
The basic statistics of tokens based on different types of punctuation marks are shown in
Table 6. As shown in this table, each dataset shows approximately 32 million tokens. For the purpose of this study, each corpus was reduced to this size as discussed in the previous chapter.
Table 6 also describes the frequency counts based on punctuation mark affix based (infix, prefix, and suffix) tokens and non-affix based (punctuation-exclusive) tokens.
The grep command was mainly used to match a given text pattern against the input files. Other commands that were used included sed. Sometimes UNIX shell scripts, containing multiple combinations of commands, were used to obtain the frequency counts. See Appendix C for more detailed descriptions of the grep commands used in this study.
For the purpose of conducting different types of frequency counts, an equation to calculate the relative percentage difference between two frequency counts was necessary. It was a useful scheme for detecting common and uncommon punctuation patterns. Given two frequency counts A and B, equation (5) could be established as the basis for finding a relative percentage difference.
Relative Percentage Difference = (MAX(A,B) - MIN(A,B)) / MAX(A,B)    (5)
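Equation (5) can be sketched as a small awk helper; the rpd function name is invented, and the sample figures below are total token counts from Table 6.

```shell
# Relative percentage difference of two frequency counts, per equation (5),
# printed as a percentage with two decimal places.
rpd() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    max = (a > b) ? a : b; min = (a < b) ? a : b
    printf "%.2f\n", (max - min) / max * 100
  }'
}
rpd 32255176 32271957   # total tokens, Usenet vs. Gutenberg -> 0.05
rpd 4429504  5820177    # punctuated tokens -> 23.89
```

These values reproduce the 0.05% and 23.89% entries in the rightmost column of Table 6.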
Table 6.
Basic Statistics of Punctuation Marks
Criteria                                        Usenet               Gutenberg            Relative % Difference
Total Count of Tokens                           32,255,176           32,271,957           0.05%
Total Count of Punctuated Tokens                4,429,504 (13.73%)   5,820,177 (18.03%)   23.89%
Total Count of Infix-Punctuated Tokens          852,057 (2.64%)      822,507 (2.55%)      3.47%
Total Count of Prefix-Punctuated Tokens         539,177 (1.67%)      690,334 (2.14%)      21.90%
Total Count of Suffix-Punctuated Tokens         3,232,149 (10.02%)   4,655,397 (14.43%)   30.57%
Total Count of Punctuation-Exclusive Tokens     274,128 (0.85%)      146,087 (0.45%)      46.71%
Although the language found in the Usenet dataset was much more informal than that found in the Gutenberg dataset, the frequency counts of infix-punctuated tokens were remarkably similar: the relative percentage difference between the two datasets was only 3.47%. A greater difference was found in the frequency of suffix-punctuated tokens, whose total count was considerably higher in the Gutenberg
dataset. The greatest difference was found in the frequency count of punctuation-exclusive
tokens.
As shown in this table, for each dataset, the percentage value inside the parentheses is its share of the total token count. For example, for the Usenet dataset, the infix-punctuated token count is 852,057, which is 2.64% of 32,255,176. Similarly, for the Gutenberg dataset, the infix-punctuated token count is 822,507, which is 2.55% of 32,271,957. When compared, Table 6 indicates that the relative percentage difference is only 3.47%, which is lower than the other relative percentage differences in the same column.
Overall, the total frequency counts indicate that large numbers of punctuation marks exist in the text in one form or another. By comparison, suffix-punctuation marks occur more often than infix- and prefix-punctuation marks. The most reasonable explanation for this result is that inter-lexical punctuation marks such as the period and the comma often occur at the end of a character string.
Note that punctuation marks can occur in combination, so one character string can belong to more than one category. For example, the character string B.C. would be categorized as both an infix-punctuated token and a suffix-punctuated token. It is also common for punctuation marks to occur in a sequence. In such cases, the punctuation marks that occur in the outer affix-based positions take precedence in categorizing the token. To illustrate, because of the two plus signs ++ in the suffix position, the character string C++ was categorized as a suffix-punctuated token rather than an infix-punctuated token.
For each dataset, the punctuation-exclusive tokens were counted by using the grep command (6).
grep -c '^[^0-9a-zA-Z]*$' filename (6)
In this case, filename is simply the name of the file that contains the blank-space-based token list, one token per line. The bracketed pattern matches tokens that consist entirely of non-alphanumeric characters. The -c option instructs grep to report the count of matching lines rather than the lines themselves. The frequency count list was then exported to a spreadsheet for analysis purposes.
The relative percentage difference of punctuation-exclusive tokens between the two datasets was moderately high, yielding a 46.71% difference. As noted earlier, source-independent punctuation marks have delimiting and separating roles according to Jones (1997). Since most source-independent punctuation marks are in the prefix and suffix positions, this prompted an in-depth investigation of punctuation-exclusive tokens. Later in this chapter, we will discuss what the frequency count of punctuation-exclusive tokens implies.
Infix-Punctuated Token
The types of infix-punctuated tokens found in the datasets varied greatly. As shown in Table 7, the frequency count of individual infix-punctuated tokens varied according to their types. Appendix D shows a more complete listing of commonly found infix-hyphen tokens. As previously noted, there was a wide range in the relative percentage difference. A
lower relative percentage difference indicates that the two datasets have a similar amount of
infix-punctuated tokens. In contrast, a higher relative percentage difference indicates that the
amount of infix-punctuated tokens found in the two datasets differ greatly.
Table 7.
Top 20 Infix-Punctuated Tokens (Gutenberg to Usenet)
Infix-Punctuated Token   Gutenberg   Usenet    Relative % Difference   Comment
z'z                        273,112   445,472   38.69%                  Apostrophe, ASCII code dec. #39
z-z                        145,903   112,948   22.59%
z--z                       132,745     3,576   97.31%
z-z,                        36,160     8,862   75.49%
z.--z                       25,435        29   99.89%
"z'z                        24,282     8,828   63.64%
z-z.                        17,735     7,949   55.18%
z,--z                        9,647         7   99.93%
z.z.                         6,837    13,032   47.54%
z'z,                         6,601     2,950   55.31%
z--z,                        6,346       126   98.01%
z-z-z                        5,288     7,827   32.44%
z.z                          4,017    27,614   85.45%
z-z;                         3,538       285   91.94%
z--z'z                       3,382       182   94.62%
z'z.                         3,245     4,558   28.81%
z--z--z                      2,903        23   99.21%
z.z.z.                       2,885     1,658   42.53%
z,z                          2,728    10,325   73.58%
z-z--z                       2,356        18   99.24%
Mean                                           70.07%
Note: The letter z was used to replace all matching instances of alphanumerical characters.
As shown in Table 7, to properly represent infix-punctuation marks, the letter z was used to replace all alphanumeric characters. For example, by using the z notation, both she'll and O'Brien would be represented as z'z, which is the top-ranked infix-punctuated token. This means that the apostrophe was the most frequently used infix-punctuation mark. The frequency count of the infix-punctuated token z'z was almost twice that of the next top-ranked infix-punctuated token, z-z. As shown in this table, some patterns share the same infix-punctuation mark
but can be counted in different ways. For instance, the punctuation patterns z'z and "z'z have a common infix-punctuation mark, which is the apostrophe '.
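The z notation can be reproduced with a single sed substitution that collapses every run of alphanumeric characters into the letter z. This is a sketch; the z_pattern function name is invented.

```shell
# Replace each maximal run of alphanumeric characters with z,
# leaving all punctuation marks in place.
z_pattern() {
  printf '%s\n' "$1" | sed 's/[0-9a-zA-Z][0-9a-zA-Z]*/z/g'
}
z_pattern "she'll"    # -> z'z
z_pattern "O'Brien"   # -> z'z
z_pattern 'B.C.'      # -> z.z.
z_pattern 'C++'       # -> z++
```

Because the prefix and suffix marks are left untouched, patterns such as "z'z and z'z remain distinct, exactly as in the counting procedure described here.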
The reason this infix-frequency count shows a prefix punctuation mark such as "z'z is that the remaining affix-based punctuation marks were not normalized before counting them. The remaining affix-based punctuation marks in this case are the prefix- and suffix-punctuation marks. There were two ways affix-based punctuation marks could be counted:
• Without normalizing remaining punctuation marks
• With normalizing remaining punctuation marks
In this case, we conducted the punctuation mark frequency count without normalizing remaining punctuation marks. For this reason, prefix-punctuation marks such as the double quote " were not normalized prior to conducting a frequency count.
Table 8 was generated by reversing the order of comparison; it shows the top 20 infix-punctuated tokens when the Usenet dataset is compared to the Gutenberg dataset. Since the top ranks of the Usenet dataset are compared to the Gutenberg dataset in Table 8, new infix-punctuated tokens are listed in this table. As a result, the punctuated token z...z was not listed in the previous table, but it is now listed in Table 8.
This procedure has both an advantage and a disadvantage. The advantage is that more variations of remaining punctuation marks can be recognized. If the remaining parts are normalized then such a pattern would not be exposed in the list. The disadvantage is that, for the frequency count of punctuation marks, other variations including the remaining punctuation marks might not show up in this top 20 frequency count list. For example, hypothetically the infix punctuation mark z'z could be in the top 20 list, but z'z! might not show up in the list.
However, most of the frequently used patterns are displayed in this list. Considering the size of
the distribution of frequency counts, some of the small unnoticed variations might not be considered a major factor at this point.
Table 8.
Top 20 Infix-Punctuated Tokens (Usenet to Gutenberg)
Infix-Punctuated Token   Usenet    Gutenberg   Relative % Difference
z'z                      445,472     273,112   38.69%
z-z                      112,948     145,903   22.59%
z.z                       27,614       4,017   85.45%
z/z                       21,281         821   96.14%
z.z.                      13,032       6,837   47.54%
z,z                       10,325       2,728   73.58%
z:z                       10,273         298   97.10%
z...z                      9,058          35   99.61%
z-z,                       8,862      36,160   75.49%
"z'z                       8,828      24,282   63.64%
z-z.                       7,949      17,735   55.18%
z-z-z                      7,827       5,288   32.44%
z'z.                       4,558       3,245   28.81%
z--z                       3,576     132,745   97.31%
z/z/z                      3,547          33   99.07%
z.z.z                      3,283       1,036   68.44%
z'z,                       2,950       6,601   55.31%
z(z)                       2,828         170   93.99%
$z.z                       2,746         270   90.17%
z_z                        2,710          47   98.27%
Mean                                           70.54%
It was possible to obtain an exact count of the infix-apostrophe only by adding up all the variations of remaining punctuation marks. It was also possible to perform the frequency count both ways: with and without normalizing the remaining punctuation marks. Yet performing frequency counts both ways would be more time-consuming and unnecessarily complicated for our purposes of analyzing punctuation marks.
The particular order of the datasets in these tables needs further explanation. In Table 7, the order of the datasets is "Gutenberg" first and "Usenet" second: the Gutenberg dataset is shown on the left side and the Usenet dataset on the right side. This means that the ranking of the top infix-punctuated tokens in that table is based on the Gutenberg dataset. For each type of infix-punctuated token found in the Gutenberg dataset, a frequency count was also performed against the Usenet dataset. Since that list does not include the top 20 infix-punctuated tokens of the Usenet dataset, some of the top infix-punctuated tokens in the Usenet dataset do not show up in it. To see the differences in greater detail, another table showing the top 20 infix-punctuated tokens for the Usenet dataset was desirable. Reversing the order of comparison in this manner, Table 8 shows the most frequently used infix-punctuated tokens of the Usenet dataset on the left side instead of the right side.
As can be seen in Table 8, 8 of the 20 top ranked infix-punctuation marks had a 90% or
higher relative percentage difference. The mean value was calculated by adding the relative
percentage difference values and dividing by the total number of infix-punctuation patterns. In
Table 8, when the Usenet dataset was compared to the Gutenberg dataset, the average relative percentage difference was 70.54%. By reversing the order, in Table 7, when the Gutenberg dataset was compared to the Usenet dataset, the average was 70.07%. Calculated either way, there was not a great difference in the mean relative percentage difference. From these tables, the most common punctuation patterns between the two sets can be identified based on the relative percentage difference alone.
Table 9, on the other hand, shows only the top 4 infix-punctuated tokens with the least relative percentage differences; it was extracted from Table 8. In this table, notice that the relative percentage difference is comparatively low for all listed infix-punctuated tokens. The infix-punctuation mark that shows the lowest relative percentage difference is the infix-hyphen z-z. A low relative percentage difference indicates that the frequency count difference was minimal. However, it does not necessarily indicate that the role of the punctuation mark in one dataset is equivalent to its role in the other. In this case, we can only state that the quantity of hyphens (ASCII code decimal #45) is relatively similar between the Usenet dataset and the Gutenberg dataset.
Furthermore, notice that index #1 z'z and index #4 z'z. show different punctuation patterns due to the ending period in index #4. However, both have an apostrophe as the infix-punctuation mark. For this reason, when counting a specific affix type, its variations must be combined based on that affix type, since the remaining punctuation marks are not normalized. In this case, the frequency counts of both #1 and #4 needed to be combined if we were attempting to count the infix-apostrophes.
Table 9.
Top 4 Infix Tokens with Least Relative % Differences
Index   Infix-Punctuated Token   Usenet Newsgroup Dataset   Gutenberg Dataset   Relative % Difference
1       z'z                      445,472                    273,112             38.69%
2       z-z                      112,948                    145,903             22.59%
3       z-z-z                    7,827                      5,288               32.44%
4       z'z.                     4,558                      3,245               28.81%
From this point forward, a new naming convention, affix type-punctuation mark or affix type-punctuation pattern, will be used. For example, infix-hyphen simply means the infix- punctuation mark in the token is the hyphen.
Infix-Apostrophe
Generally, apostrophes indicate contractions, possessives, and plurals. Typical grammar reference books such as Quirk et al. (1985) and Straus & Fogarty (2007) provide details on uses of the apostrophe. For now, this discussion will focus on different types of apostrophes and their frequency counts. Referring back to Table 8, infix-apostrophes had the highest occurrence in both the Usenet dataset and the Gutenberg dataset: there were 445,472 instances of the infix-apostrophe pattern z'z in the Usenet dataset and 273,112 instances in the Gutenberg dataset.
Given the distinctive roles that the apostrophes play in English grammar, there were
several ways to count apostrophes and examine the character strings that contain apostrophes.
One possibility was to investigate the uses of the apostrophe before the letter s; 's often indicates a possessive. The number of terms that end with 's could be easily obtained by using the grep
command (7).
grep -c "'s$" filename (7)
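Command (7) can be illustrated on a toy token list; the file name and the sample tokens are hypothetical.

```shell
# A tiny one-token-per-line file: three tokens ending in 's,
# one plural ending in a bare s.
printf '%s\n' "John's" "it's" "cats" "Mary's" > tokens.txt
# Counts only the tokens that end with an apostrophe followed by s.
grep -c "'s$" tokens.txt    # -> 3
```

Note that cats is not counted: the $ anchor requires the apostrophe-s to be the final two characters of the token.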
Table 10 shows the frequency counts of apostrophe-s 's and of the punctuation pattern z'z in the Usenet dataset and the Gutenberg dataset. There were 206,980 instances of apostrophe-s 's in the Usenet dataset and 168,821 instances in the Gutenberg dataset. The relative percentage difference is 18.4%.
When the frequency count of apostrophe-s 's is compared to the punctuation pattern z'z, we see that apostrophe-s 's accounts for roughly half of the z'z punctuation pattern in both datasets. This result was a clear indication that substantial variations of apostrophe use were still present in the datasets. For this reason, more in-depth frequency counts were needed to determine the specific uses of apostrophes.
Table 10.
Frequency Count of Apostrophe-s and Z'Z Punctuation Pattern
Infix-Punctuated Pattern   Usenet Newsgroup Dataset   Gutenberg Dataset   Relative % Difference
's                         206,980                    168,821             18.4%
z'z                        445,472                    273,112             38.69%
Although an in-depth frequency analysis was not performed, some patterns of the infix-
apostrophe token can be learned by observing Appendix E. This appendix displays the top 40
most frequently used infix-apostrophe tokens that appeared in the Gutenberg dataset. In contrast,
Appendix G is a list of infrequent infix-apostrophe tokens in the Gutenberg dataset. In
comparison, both lists show character strings that contain apostrophe-s 's. While the list in
Appendix E contains mostly contractions, the list in Appendix G contains both possessives and
contractions. It is evident that the Appendix G list contains more unique tokens which might be
helpful in terms of retrieving relevant text. In the case of infix-apostrophes, infrequent punctuation patterns are likely to be associated with additional unique terms.
As shown in Appendix G, expressions such as Ne'er and ye've are neither standard nor commonly used contractions. If capitalization is normalized and the apostrophe is removed, Ne'er and ye've could be rewritten as neer and yeve. An interesting observation is that, without indexing internal apostrophe marks, a user would have extreme difficulty
in searching these types of punctuated character strings. Here it is clear that neer, if normalized in this way, is semantically different from the original Ne'er. Similarly, the token ye've is semantically different from yeve.
Thus, removing an apostrophe and combining the remaining textual elements was not a suitable punctuation normalization approach in this case. Another contraction that was not in the list was I'd, a contraction of I would. Similar to the above case, a user would not want to retrieve the term Id if I'd was used as a search query term.
In contrast, simply removing apostrophes and concatenating remaining sub-parts of the character string was desirable in some cases. For example, in the Usenet dataset, there were 29 instances of the character string Shi'ite and 53 instances of the character string Shiite. In this case, unless a user is aware of searching the term both ways, character strings with such typographic differences would not be retrieved by a single retrieval process using exact match search.
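The "remove the apostrophe and concatenate" normalization discussed here can be sketched as a one-line sed substitution; the strip_apostrophes function name and the sample tokens are illustrative.

```shell
# Delete every apostrophe from a token, joining the remaining parts.
strip_apostrophes() { printf '%s\n' "$1" | sed "s/'//g"; }
strip_apostrophes "Shi'ite"   # -> Shiite: merges with the unpunctuated variant
strip_apostrophes "Ne'er"     # -> Neer: loses the sense of the original
strip_apostrophes "I'd"       # -> Id: collides with an unrelated word
```

The same rule thus helps retrieval in the Shi'ite/Shiite case while hurting it in the Ne'er and I'd cases, which is exactly why a single lexical-independent rule cannot be applied uniformly.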
These examples are based on unique instances, and the need to retrieve text segments that contain such punctuated character strings might be extremely rare. However, because the same types of infix-punctuation marks are used with other character strings in a text, it is easy to overlook the differences before actual punctuation normalization is performed. To what degree all of these instances have an accumulating effect in the end is a legitimate question to ask but difficult to estimate. Yet the examples demonstrated how sensitive punctuation normalization may need to be and raised the question of punctuation mark roles in search queries. In the subsequent chapter, these issues are analyzed in more depth.
Infix-Hyphens and Dashes
Hyphenated character strings are very common in English. The majority of hyphenated
character strings contain infix-hyphens. The frequency count showed various uses of hyphens
and dashes. As will be discussed later, a distinct ASCII code is assigned to each of the
characters. Typographically, however, hyphens and dashes may appear identical to users.
Because of this, the fact that users may alternatively use hyphens in place of dashes in casual
conversational writing should be taken into account. The implication for a frequency count
result is that a careful examination of hyphenated character strings should be done prior to
applying punctuation normalization procedures.
Let us refer back to Table 9. The hyphen (ASCII code decimal #45) was the second most frequent infix-punctuation mark in both datasets. There were 112,948 instances of infix-hyphenated tokens in the Usenet dataset and 145,903 instances in the Gutenberg dataset. There were almost twice as many infix-apostrophe tokens as infix-hyphen tokens in the Gutenberg dataset, and almost four times as many infix-apostrophe tokens as infix-hyphenated tokens in the Usenet dataset. Appendix H is the top-ranked hyphenated token list for the Gutenberg dataset. When Appendix E is compared to Appendix H, the occurrences of the top frequently used infix-hyphen tokens were much lower than those of the top frequently used infix-apostrophe tokens.
The infix-hyphen indicates that a string of alphabetical characters exists on each side of the mark. One of the most common uses of the hyphen is to connect different character strings into a new multi-part character string. In Appendix H, two types of hyphenated tokens can be identified from this list:
1. Compounded word tokens that consist of free morphemes
2. Affix-based word tokens that contain at least one bound morpheme (affix)
A free morpheme is a single linguistic unit that carries meaning on its own, whereas a bound morpheme carries meaning but depends on another morpheme. That is, a free morpheme can stand by itself as a word, whereas a bound morpheme cannot.
In the case of a compounded word token, a token such as health-care can be decomposed into two independent free morphemes: health and care. Each part can function as an independent word, but the two have been combined to form a new word: health-care. As mentioned in Chapter 2, Jones (1997) referred to most of these types of words as inter-lexical words.
In contrast, in the case of affix-based word tokens, at least one sub-component is dependent upon another lexical component. For example, in the word re-read, re is a bound morpheme that depends on the free morpheme read; the bound morpheme re cannot function as an independent lexical token. Appendix J is a random sample of words taken from the Gutenberg dataset; the sample contains only tokens whose first two letters precede an infix-hyphen. Because bound morphemes are usually short, such tokens can be regarded as prefixed words. In contrast to multi-part words having an inter-lexical hyphen, Jones may have considered some of these hyphens to be sub-lexical punctuation marks. Thus, for different types of hyphenated words, one of the considerations for punctuation normalization was the type of morphemes involved.
Nevertheless, as a simple analysis, infix-punctuated tokens in which exactly two alphanumeric characters precede the hyphen were retrieved using the grep command (8):
grep '^[a-zA-Z0-9][a-zA-Z0-9]-[a-zA-Z0-9]' filename (8)
As a result, 11,662 such tokens were retrieved. The resulting list contains only tokens in which exactly two alphanumeric characters precede the infix-hyphen; combinations with more leading alphanumeric characters could be analyzed in the same way. In any case, as indicated in Appendix J, most of these words, with the exception of words such as on or so, contained bound morphemes. Normalization of an infix-hyphen in an affix-based token is necessary so that the token co-worker can be matched with either co-worker or coworker.
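One way to realize such matching is to emit both surface forms of each infix-hyphen token at index time. The following is a sketch of an assumed procedure, not the study's implementation:

```shell
# For each infix-hyphenated token, produce both the hyphenated and the
# concatenated variant so a query for either "co-worker" or "coworker"
# can match the same underlying token.
token="co-worker"
printf '%s\n' "$token"              # hyphenated form: co-worker
printf '%s\n' "$token" | tr -d '-'  # concatenated form: coworker
```

In practice the same expansion could also be applied at query time instead of index time; either choice yields the co-worker/coworker equivalence discussed above.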
In addition, a character set variation was associated with hyphens. In the Gutenberg dataset, it was evident that certain hyphens had already been normalized to a degree. In the Usenet dataset, however, hyphen-like characters that have the appearance of a normal hyphen were not normalized. The difference was negligible considering the total number of tokens in the Usenet dataset, but such a difference could cause a problem if overlooked. As shown in Table 11, a large number of conventional hyphens (ASCII code decimal #45) was found in both datasets. Although the overall counts were relatively small, slightly varying types of hyphens (ASCII code decimal #45 and #173) and dashes (ASCII code decimal #150 and #151) were found in the Usenet dataset. Note that the frequency counts in this table were calculated from the total count of punctuation marks, not just from an infix-punctuation mark frequency count.
Table 11.
Hyphen and Dash Variations in the Usenet and Gutenberg Datasets
Punctuation Mark   DEC   OCT   Description    Usenet Dataset   Gutenberg Dataset
(soft hyphen)      173   255   soft hyphen          5                  0
-                   45   055   hyphen         269,572            408,934
–                  150   226   en dash             11                  0
—                  151   227   em dash             19                  0
The hyphen represented by ASCII code decimal #173 is a soft hyphen. A soft hyphen is generally inserted by software: when a word falls at the end of a line, a soft hyphen is used to split it across two lines of text. There were 5
instances of the soft hyphen in the Usenet dataset, and no soft hyphen was found in the
Gutenberg dataset. Refer to Appendix A again for the complete list of punctuation marks that
are considered by this study.
To generate such hyphen variations as shown in Table 11, the grep command (9) could be
used to calculate the occurrences of all tokens that contain a hyphen.
grep -c $'\055' filename (9)
Here, 055 is the octal value for the ASCII hyphen character. Command (9) was used to generate Table 11. By using the same type of command with different octal values, the frequency counts of the soft hyphen, en dash, and em dash were obtained for the Usenet dataset.
By searching for octal value 227 with command (9) but without the -c option, we can examine the actual text in the Usenet dataset that contains the em dash.
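Concretely, that variant of command (9) might look as follows. The sample file is fabricated for illustration; $'\227' is bash's ANSI-C quoting for the octal byte 227 (the em dash in the 8-bit table discussed above), the same quoting style already used in command (9), and LC_ALL=C forces byte-wise matching:

```shell
# Create a two-line sample in which one line contains the em dash
# (octal byte \227), then list and count the matching lines.
printf 'a plain line\nST. PETERSBURG \227 Flanked by the mayor\n' > sample.txt
LC_ALL=C grep    $'\227' sample.txt   # print lines containing the em dash
LC_ALL=C grep -c $'\227' sample.txt   # count such lines (here: 1)
```

The LC_ALL=C setting matters: in a UTF-8 locale, a lone 0x97 byte is an invalid sequence and some grep versions would treat the file as binary.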
This type of character-set-related punctuation normalization problem showed that the frequency count strategy was feasible for identifying a punctuation mark that is represented by different character codes. The frequency count demonstrated that potential problems associated with character set differences can be identified beforehand. Thus, it was important to recognize the differences in character set and typographic representation before attempting to compare and analyze punctuation mark differences. By performing the frequency count of infix-punctuation marks, we can see the detailed issues related to character sets and typographic representation.
At this point, we further examined character encoding scheme differences. We discovered that the Usenet dataset contained non-ISO extended-ASCII text, meaning that the file contained very long lines, unusual line terminators, and escape sequences that are not typically associated with extended ASCII characters. While the ASCII code refers to the original character set represented by the decimal numbers 0 to 127, the extended ASCII table refers to the characters with decimal numbers 128 to 255. See Appendix A for the discussion of ASCII codes.
Unlike the Usenet dataset, we confirmed that all of the files that we downloaded from the
Gutenberg website were in the original ASCII format. This means that some characters such as
the left quote ‘ and right quote ’ would not be found in the Gutenberg dataset since these
punctuation marks, unlike the apostrophe ' , are found only in the 8-bit extended ASCII table.
When we examined the character set differences, a combined frequency count of less than 100 was found for the left quote ‘ and right quote ’ in the Usenet dataset. Although the quantity of variations was small, these types of typographic varieties do indicate that careful attention should be paid in dealing with punctuation marks. Figure 8 shows text segments that contained the em dash in the Usenet dataset. Note that em dashes are not infix-hyphens. Although there were 19 instances of the em dash, only 16 lines are retrieved and shown here, since some lines contained more than one em dash. Not all em dashes were spaced evenly in the text. For example, line 2 contains an em dash that is not surrounded by blank spaces, suggesting that in a real-world setting, a variety of hyphens and dashes can be expected.
LINE#  Lines Containing Em Dash
1   CAVE CREEK, Ariz. — Arizona is a modern place, among the fastest-growing
2   identity—and it is responsive to treatment.
3   insisting that homosexuality is a psychological condition—not a fixed
4   lifted a card — a king of hearts — and the crowd roared. Cave Creek had
5   shuffled it — having removed the jokers — six times.
6   “It’s a hell of a way to win — or lose — an election,” Mr. McGuire said.
7   Constitution — not D.C., as Democrats are trying to switch it permanently.
8   Education for the Disadvantaged: $651 million Dept. of Education —
9   Political power in America should remain A.C. — According to the
10  ST. PETERSBURG — Flanked by St. Petersburg Mayor Rick Baker and Tampa
11  candidate — can burnish their conservative credentials, knowing all the
12  governors — some of whom are said to be eyeing White House bids in 2012
13  it also means governors like Sanford and Louisiana's Bobby Jindal — a
14  the parent's claims — and backed years of science that found no risk.
15  years when the federal dollars disappear — a worry also cited by Jindal
16  — are putting their own interests first.
Figure 8. Text segment containing infix em dash.
Infix-Slash
Slashes have been widely used since the medieval period, as documented by Parkes (1993). There is an interesting piece of history about how the slash was used in the medieval period: Parkes pointed out that double slashes // were used to denote a long pause or the start of a new section of text. In the modern period, slashes have many uses in the English language. Wikipedia (“Slash (punctuation),” 2013) shows a fairly good list of modern slash usages. From inspecting the Usenet text that contains slashes, we attempted to categorize some of the common uses in the Usenet dataset. These include:
• Programming language (e.g., // as comments in C++ language)
• Fractions (e.g., ¼)
• Dates (e.g., 1/1/2012)
• Alternatives or versus (e.g., stars/galaxies)
• Separate numbers (e.g., 907/09400)
• Abbreviations (e.g., i/o)
• File locations (e.g., c://temp)
• Internet addresses (e.g., http://www.myaddress.com)
• “Per” or “to” type of convention in measurement (e.g., km/h)
Although an in-depth analysis for every case was not performed, comparing frequency
counts of the slash revealed interesting aspects of the punctuation mark. As shown in Table 8, an
infix-slash was heavily used in the Usenet dataset. There were 21,281 infix-punctuated tokens that contained a slash. In comparison, the Gutenberg dataset contained a considerably smaller number of infix-slash tokens, only 821 instances; the relative percentage difference was 96.14%.
More slashes appeared to be used for abbreviating character strings in the Usenet dataset than in the Gutenberg dataset. To probe this characteristic further, a frequency count was performed on infix-slash tokens with one alphabetical character on each side. To classify this infix-slash type, we will use the term single-letter infix-slash for tokens that consist of two alphabetical characters: one before the slash and one after it.
The frequency counts of single-letter infix-slash tokens for the Usenet dataset are shown in Appendix K. I/O was the most frequently used single-letter infix-slash token; 168 such instances were found in the Usenet dataset. The token I/O usually stands for input and output, an expression often used in application domains that deal with computers and technology. Hypothetically, however, I/O could also stand for I divided by O. Similarly, B/C could stand for B or C, or for B divided by C. As another example, there were 91 instances of w/o, whose meaning also depends on the context. Intuitively, w/o is an abbreviated character string that could easily be found in text containing online conversation. While we can expect a user to specify such terms in a search query, further analysis of abbreviated character strings on the part of the system was clearly required to provide the correct context for such terms.
Command (10) was used to identify single-letter-infix-slash tokens:
grep -c '^[a-zA-Z]/[a-zA-Z]$' filename (10)
When the frequency count based on single-letter-infix-slash tokens was performed, a greater difference between the two datasets was found. A frequency count of single-letter-infix-slash tokens is shown in Table 12. The Usenet dataset had a total of 1,019 instances of single-letter-infix-slash tokens, whereas the Gutenberg dataset had only 10. Since even 1,019 is still a relatively small number in comparison to the total token count of approximately 32 million across the datasets, this was another indication that abbreviations are perhaps more likely to contain periods than slashes.
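The relative percentage difference reported in these tables is consistent with computing (larger − smaller) / larger × 100; a small awk sketch (the formula is inferred from the reported figures, e.g. 99.02% for 1,019 versus 10, rather than stated explicitly in this section):

```shell
# Relative percentage difference between the two dataset counts,
# computed as (larger - smaller) / larger * 100.
usenet=1019
gutenberg=10
awk -v a="$usenet" -v b="$gutenberg" 'BEGIN {
  hi = (a > b) ? a : b; lo = (a > b) ? b : a
  printf "%.2f%%\n", (hi - lo) / hi * 100
}'   # prints 99.02%, matching Table 12
```

Substituting 21,281 and 821 reproduces the 96.14% figure reported for infix-slash tokens as well.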
For the Usenet dataset, some of the frequency counts and their corresponding single-letter-infix-slash tokens are shown in Appendix K. The most likely cause of the high frequency of infix-slash tokens and single-letter-infix-slash tokens is that the Usenet dataset contains modern, informal language. Because of this, abbreviations using slashes were more evident in the Usenet dataset.
Table 12.
Frequency Counts of Single-Letter-Infix-Slash Token
Usenet Gutenberg Relative % Difference
1,019 10 99.02%
It is difficult to estimate how users might use some of these punctuated character strings in formulating a search query. That some character strings appear less frequently than others in the text does not necessarily indicate that they are any less frequently used as search terms. Nevertheless, for the Usenet dataset in particular, single-letter-infix-slash tokens were a punctuation pattern worth investigating. In short, the examples demonstrate that the frequency count is an effective strategy when the objective is discovering a lexical pattern associated with punctuation marks.
Infix-Periods
There was a variety of non-standard uses associated with infix-periods. Comparing the two datasets, occurrences of the infix-period were far fewer in the Gutenberg dataset than in the Usenet dataset. In the Usenet dataset, many instances of computer-related terminology contributed to the high number of period occurrences; infix-periods are commonly used in website addresses, file names, etc. Referring back to Table 8, only 4,017 instances of the infix-period pattern z.z were found in the Gutenberg dataset, whereas 27,614 instances were found in the Usenet dataset. Table 8 includes frequency counts of other variations of infix-periods (e.g., z.z and z.z.z).
Overall, the result from the Usenet dataset showed a considerable number of infix-periods. An infix-period is most often sub-lexical because of the locality of the punctuation mark within a character string. The number of infix-periods could be a legitimate indication of domain-specific punctuation usage. For example, the character string alt.bigfoot refers to a particular newsgroup in the Usenet. This type of infix-period usage is much more likely to appear in a corpus related to the computer and technology domain. Beyond this generalization, we could state that certain genres, such as that of the Usenet dataset, contain a vast number of infix-periods.
Table 13 shows the top 10 frequently used tokens with infix-periods in the Usenet dataset.
As the table indicates, most of these tokens were numbers containing an infix-period. The result was somewhat surprising, since the top tokens with infix-periods were just numbers, without any alphabetical characters. Punctuated tokens such as alt.bigfoot did not appear in the top list.
Table 13.
Top 10 Tokens with Infix-Periods in the Usenet Dataset

Frequency Count   Token Containing Infix-Period (z.z)
304               1.5
286               2.5
275               2.0
223               3.5
197               5.0
178               10.5
160               0.5
155               10.4
154               3.0
150               1.0
To explore infix-periods even further, it was necessary to examine some of the lower-ranked tokens that contained infix-periods. Figure 9 shows some of the least frequently used terms in the Usenet dataset. There were 9,778 unique tokens that contained an infix-period. The total count of infix-period tokens in the Usenet dataset was 27,614, as previously shown; in terms of percentage, 35.41% of the infix-period z.z tokens were unique instances.
Since the resulting infix-period token list was fairly long, only a snapshot is shown in the figure. As shown, the list is alphanumerically sorted. A certain degree of "lexical dependency" of periods can be observed from this list. For example, among tokens starting with the letter "p", there were numerous tokens that referred to page numbers. On the other hand, among tokens starting with the letter "r", many tokens referred to Usenet newsgroup categories. These instances are shown in the figure.
FREQ  TOKEN
…
1     NewsMax.com
1     Newsmax.com
1     NewsGuy.Com
1     newsgroup.s
1     Newsday.com
1     newscast.Sometimes
1     NewsAndOpinion.com
1     news.Least
1     News.Following
1     news.com
…
1     p.33
1     p.306
1     p.263
1     p.252
1     p.229
1     p.217
1     p.2
1     P.1991
1     p.179
1     p.163
1     p.15
…
1     rec.music
1     rec.juggling
1     rec.heraldry
1     Rec.Gambling
1     rec.equers
1     Rec.Climbing
1     rec.climber
1     rec.bloats
1     rec.bird
1     rec.bike
…
1     0.0000000000000
1     0.000000000000000000000000000000000000000000000000001
1     0.00000000000000000001355
1     0.0000000000015
1     0.000000001
1     0.000000002
1     0.0000000132091
1     0.0000004
1     0.000000e0
1     0.000001
Figure 9. The least frequently used infix-period tokens in the Usenet dataset.
In general, an ellipsis denotes a consecutive series of three periods. In the extended ASCII character table, a typographically distinct ellipsis character is also available (ASCII code decimal #135). Only one instance of this character was found in the Usenet dataset; no instance was found in the Gutenberg dataset, since the Gutenberg dataset conformed to the basic ASCII character set. For this study, from here forward, ellipsis refers to a consecutive series of three periods (ASCII code decimal #46).
In any case, a large number of ellipses was found in the Gutenberg dataset. The exact reasons are unclear, but one possibility is that the Gutenberg dataset contains publications that are rather old. There were also technical issues associated with counting ellipses: the surrounding blank spaces were inconsistent, which can become a problem when counting. As shown in Table 14, four distinctive variations of the ellipsis pattern can occur, given that an ellipsis consists of exactly three periods. All of these variations are possible in a given dataset. Ideally, these stylistic differences should be standardized. In addition, the possibility of users writing ellipses with other than three periods (e.g., two or four periods) must also be considered. These and other issues involving inconsistent blank spaces around certain punctuation marks need further investigation.
Table 14.
A Variation of Ellipses in Affix-Based Token Types
Ellipse Variation   Affix-Based Token Type
□…□                 infix-punctuation marks
□... □              suffix-punctuation marks
□ …□                prefix-punctuation marks
□ … □               punctuation-exclusive punctuation marks
Note: the symbol □ indicates any repeatable character such as a combinational punctuation pattern or alphanumeric characters.
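The four placements in Table 14 can be distinguished with grep patterns in the style of the commands used earlier. The following is a sketch over a fabricated four-line sample, with one line per variation; the patterns assume exactly three periods:

```shell
# One sample line per ellipsis variation from Table 14.
printf 'word...word\nword... next\nword ...next\nword ... next\n' > ellipses.txt
grep -c '[[:alnum:]]\.\.\.[[:alnum:]]' ellipses.txt  # infix: 1
grep -c '[[:alnum:]]\.\.\. '           ellipses.txt  # suffix: 1
grep -c ' \.\.\.[[:alnum:]]'           ellipses.txt  # prefix: 1
grep -c ' \.\.\. '                     ellipses.txt  # exclusive: 1
```

Each pattern matches exactly one of the four sample lines, illustrating how the surrounding blank spaces, rather than the periods themselves, determine the affix type.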
Table 15 is an example of text segments that contain the ellipses variations. To provide
an example, the lines that contain variations of affix-based ellipses tokens were matched and
retrieved from the un-tokenized Gutenberg dataset. The variations shown in the table appear to be valid forms of the ellipsis; at the same time, the table shows possible stylistic differences due to surrounding blank spaces.
Table 15.
Text Lines Containing the Variation of Ellipses
Ellipse Variations   Text Lines Containing the Pattern

□…□
  B.J...Bibliophile Jacob, i.e. Paul Lacroix.
  the laws of nations, and are subject to no other limitation...It
  hypothesis, however much beloved...as soon as facts are shown to be
  But he really must find that word. Curves curves...Those little valleys
  phrase, "J'ai enonce les memes idees...que M. Darwin" (volume ii.) is
  the ruins...and worse.... Haven't I seen and heard things enough on

□... □
  stiffened fingers... they were slender white fingers... had sought for
  "Well... I haven't found out anything yet except that he is dead, and
  "Did... did you know father very well?" Jeff asked tremulously.
  East until I am over that, you know... Suppose I never get over it?
  couldn't do it I... But I can stay out there long enough to bring Glenn

□ …□
  ...SCHOOL BOOKS... [Illustration: ...and the doughboy...]
  "Noiselessly he stepped to her side and ...stood in silent prayer"...232
  32: ...as a further testimony to the majestic ...
  8: ...The news of the Memorial Service you ...
  12: ...He is eagerly awaiting to see the friends ...

□ … □
  "Ah ... yes." Rockford was regarding him with disturbing amusement. "You
  "Ah ... good afternoon."
  4. Be careful in the use of "not ... and," "any," "but," "only," "not
  your head ... Only the opening paragraph at present, please!"
  in the highest degree ... disfigured by an obscurity of diction which
  day. And the kids.... It _couldn't_ have happened ... fate COULDN'T do

Note: the symbol □ indicates any repeatable character such as a combinational punctuation pattern or alphanumeric characters.
Prefix-Punctuation Marks
The frequency counts of the prefix-punctuation marks for the two datasets are shown in
Table 16. Although some punctuation marks are placed in the prefix position, punctuation marks such as quotation marks and parentheses are often used in pairs. To simplify the process, these types of punctuation marks were counted here as prefix-punctuation marks without distinguishing the pairs; calculation of paired punctuation marks, such as quotation marks, is investigated later in this chapter.
Table 16.
The Top 20 Prefix-Punctuation Marks
Prefix-Punctuated Mark   Usenet    Gutenberg   Relative Percentage Difference   Comments
"                        188,990   326,505     42.12%
(                        122,945   44,600      63.72%
'                        26,975    40,310      33.08%
*                        18,578    959         94.84%
$                        17,485    1,808       89.66%
[                        12,977    19,684      34.07%
_                        4,707     104,461     95.49%
.                        3,575     556         84.45%
-                        3,570     91          97.45%
#                        3,245     524         83.85%
/                        3,233     50          98.45%
:                        2,194     9           99.59%
...                      1,619     41          97.47%
+                        1,498     426         71.56%
`                        1,314     560         57.38%                           grave accent (ASCII code decimal #96)
%                        1,265     1           99.92%
("                       1,155     747         35.32%
--                       1,101     2,260       51.28%
{                        1,090     2,306       52.73%
|                        766       682         10.97%
The most frequently used prefix-punctuation mark in both datasets was the double quotation mark. Other frequently used prefix-punctuation marks included the left parenthesis and the apostrophe (ASCII code decimal #39). More interestingly, a greater number of tokens with a prefix-period was found in the Usenet dataset.
Also, a large number of prefix-asterisk * tokens was identified in the Usenet dataset, whereas a much smaller number was identified in the Gutenberg dataset. The relative percentage difference for the prefix-asterisk punctuation mark was 94.84%.
Some of the prefix-punctuation mark frequency occurrences were rather obvious. For example, some punctuation marks, such as the dollar sign, are conventionally placed in front of a number. For this reason, there were high frequency occurrences of the prefix-dollar sign $. In the Gutenberg dataset, only a modest number of dollar signs was found. The difference further suggests a genre or domain difference between the two datasets.
Much higher occurrences of prefix-colon punctuation patterns were found in the Usenet dataset than in the Gutenberg dataset. A total of 2,194 occurrences was found in the Usenet
dataset, whereas only 9 occurrences of prefix-colon were found in the Gutenberg dataset. The
relative percentage difference was 99.59%. This high percentage difference implies that the
specific nature of colon usage might be associated with the Usenet dataset. Colons are
commonly used as a suffix-punctuation mark in English.
Suffix-Punctuation Marks
The frequency counts for the suffix-punctuation marks for the two datasets are shown in
Table 17. Here, we discuss the result of the frequency count and interesting aspects of suffix-
punctuation marks found in both datasets. The most frequently used suffix-punctuation mark in the Usenet dataset was the period, while in the Gutenberg dataset it was the comma. Between the two datasets, the relative percentage difference was 15.91% for the period and 47.85% for the comma.
Table 17.
The Top 20 Suffix-punctuation Marks
Suffix-Punctuated Token   Usenet      Gutenberg   Relative Percentage Difference
.                         1,203,987   1,431,821   15.91%
,                         1,174,079   2,251,172   47.85%
?                         97,917      42,886      56.20%
"                         80,223      30,974      61.39%
:                         77,125      61,106      20.77%
)                         76,450      21,963      71.27%
."                        39,966      105,465     62.10%
!                         38,206      52,848      27.71%
;                         32,067      207,088     84.52%
'                         26,531      41,892      36.67%
...                       25,805      1,062       95.88%
).                        24,555      5,373       78.12%
,"                        22,714      89,137      74.52%
*                         15,993      328         97.95%
),                        14,922      8,182       45.17%
".                        12,450      567         95.45%
-                         11,990      933         92.22%
]                         11,904      16,115      26.13%
?"                        9,296       43,716      78.74%
%                         8,785       248         97.18%
Although conventional grammatical usage indicates where punctuation marks should be physically placed, the behavioral pattern in terms of the physical locality of punctuation marks was sometimes less predictable. In the case of the comma, the conventional locality is clearly predictable, since commas are typically placed at the end of a character string. In contrast, in the case of the period, the pattern may be less predictable, as the period can be used in numerous ways other than at the end of a sentence.
Because of the multiple roles that one particular punctuation mark can play in a language, occasionally examining additional surrounding character strings or the sentence is required to determine its linguistic role. The disambiguation of the syntactic role is important: it would clearly indicate whether a period is used as a source-independent or a sub-lexical punctuation mark. As mentioned in Chapter 2, various textual features can be considered for disambiguation, including the length of a character string, a known list of abbreviated character strings, the position of capitalization in a character string, etc.
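As an illustration of how such features could be combined, the following is a deliberately simplified heuristic; the abbreviation list, the length threshold, and the function name are all hypothetical and not part of the study:

```shell
# Classify a period-final token as an abbreviation or as sentence-final
# using two of the features mentioned above: a known-abbreviation list
# and token length. (Both feature settings are illustrative only.)
is_abbreviation() {
  case "$1" in
    Mr.|Mrs.|Dr.|etc.|i.e.|e.g.) return 0 ;;  # known abbreviation list
  esac
  [ "${#1}" -le 3 ]   # very short period-final tokens, e.g. "p." or "U."
}

for token in Mr. p. ended.; do
  if is_abbreviation "$token"; then
    echo "$token -> abbreviation"
  else
    echo "$token -> sentence-final period"
  fi
done
```

A production system would need many more features (capitalization of the following token, position in the line, and so on), but even this sketch shows why a period cannot be normalized by locality alone.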
There were other punctuation marks that had a high number of occurrences and a high relative percentage difference. Ellipses, consisting of three periods ..., were also detected as suffix-punctuation marks in the Usenet dataset. The implication of this frequency count is that when counting punctuation marks, inconsistent variants are likely to exist in the corpus. The uses of ellipses clearly indicate stylistic differences, as discussed earlier; that is, we have to consider how blank spaces are used with some of these punctuation marks.
A greater number of hyphens was found as suffix-punctuated tokens in the Usenet dataset than in the Gutenberg dataset. Since the role of a hyphen can differ, counting correctly based on the individual role of a hyphen can become a rather complex problem. As discussed before, there are typographically similar characters such as en dashes and em dashes; however, the main issue is that hyphens and dashes have a variety of functions, as described by Straus and Fogarty (2007). Considering the frequency counts and the variety of roles that the hyphen has in the English language, normalization of hyphens obviously has to be carefully examined. Moreover, since hyphens and dashes may look alike typographically, we also suspected that hyphens might be
used interchangeably with dashes. These issues will be discussed in more detail later in this
chapter.
Comparing suffix-punctuation marks to prefix-punctuation marks, the overall frequency counts of suffix-punctuation marks were much greater. We can also refer back to Table 6 and clearly notice the differences between the two. One possible reason for the greater frequency count in the suffix position is that most punctuation marks that have separating roles are placed in the suffix position of character strings. For example, punctuation marks such as the comma are usually placed at the end of a character string. Also, punctuation marks associated with stress, such as the exclamation mark, are typically used at the end of a character string. There are rare exceptions where an exclamation mark is used in a prefix position (e.g., in a computer language). As shown in Table 17, a moderate number of exclamation marks was found in both datasets. Nevertheless, the positions of punctuation marks are a reasonable predictor of the particular roles of punctuation marks.
Punctuation-Exclusive Tokens
As discussed previously, a punctuation-exclusive token is a non-affix-based token that is bound by blank spaces. For example, a sequence of punctuation marks such as ++ is a punctuation-exclusive token if no alphanumeric characters are attached to it. Calculating punctuation-exclusive token frequencies was another useful means to discover issues related to punctuation normalization. The frequency of punctuation-exclusive tokens was particularly useful for two reasons:
• It provided means to gain insights into the problems that are associated with normalizing various types of punctuation marks; and
• It provided additional means to characterize the particular dataset, especially in terms of particular uses of punctuation marks.
Most of the punctuation-exclusive tokens represent punctuation marks that were not
source-independent; they were source-specific instead. From the examination, it was clear that punctuation-exclusive tokens were most associated with the following types:
• Source-independent punctuation marks that have a delimiting punctuation mark role (e.g., bracket)
• Punctuation marks that have a structural formatting role (e.g., consecutive periods in the table of contents)
• Source-specific punctuation marks such as the slash, asterisk, etc., which often require examination of neighboring character strings to determine the punctuation mark's function (role)
Table 18 shows the top 20 most frequently occurring punctuation-exclusive tokens found in the Usenet dataset compared to the Gutenberg dataset. Note that reversing the sort order of the datasets would reveal additional noteworthy punctuation-exclusive tokens; however, the reversed-order table is omitted, since Table 18 is sufficient for discussion purposes.
As shown in this table, the most frequently used punctuation-exclusive token in the Usenet dataset was the hyphen. A hyphen-exclusive token is a single token that consists of the hyphen - only; to be categorized as hyphen-exclusive, it must be bounded by blank spaces. Thus, a hyphen-exclusive token implies that no other punctuation marks or alphanumeric characters are attached to the hyphen. From this point on, this type of naming convention (e.g., hyphen-exclusive) will be used for all tokens that are punctuation-exclusive.
All of the following lines contain hyphen-exclusive punctuation tokens:
• December 23, 2009 - Congressional Apportionment...
• Driving conditions were treacherous - lots of ice around and heavy sleet,...
• Embattled Garden" (Lilith) and "Part Real - Part Dream."...
• Every time - every time - we publish an article about Mono...
The ellipses in the above list are used to note the continuation of the text and not the actual punctuation mark patterns found in the dataset.
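On a tokenized file with one token per line (the layout assumed by the earlier grep commands), hyphen-exclusive tokens can be counted by anchoring the pattern to the whole line. A sketch with fabricated sample data:

```shell
# A token qualifies as hyphen-exclusive only if the entire token is a
# single "-": the anchors ^ and $ exclude "--", "-x", and "co-worker".
printf -- '-\nco-worker\n--\n-\n-x\n' > tokens.txt
grep -c '^-$' tokens.txt   # counts only the two bare hyphens
```

The same anchored-pattern approach extends directly to the other punctuation-exclusive tokens in Table 18 (e.g., '^--$' or '^:)$', suitably escaped).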
Table 18
The Top 20 Punctuation-Exclusive Tokens
Punctuation-Exclusive Token   Usenet   Gutenberg   Relative % Difference
-                             59,251   2,764       95.34%
--                            17,682   6,440       63.58%
.                             14,967   60,108      75.10%
|                             12,896   9,241       28.34%
=                             10,253   573         94.41%
&                             9,420    3,267       65.32%
...                           9,014    3,196       64.54%
:                             8,738    45          99.49%
*                             6,783    43,269      84.32%
,                             6,134    22          99.64%
?                             3,938    32          99.19%
+                             3,902    288         92.62%
"                             3,885    5,955       34.76%
/                             3,581    494         86.20%
:)                            2,997    1           99.97%
(                             2,595    29          98.88%
)                             2,204    76          96.55%
}                             441      679         35.05%
***                           285      1,776       83.95%
----                          108      586         81.57%
The above examples were drawn from the Gutenberg dataset. Sometimes hyphens might be used in place of dashes because they look similar. One reason might be that users have difficulty entering true typographical dashes (e.g., the en dash and em dash), which take more effort to find and enter on a conventional keyboard.
In relative terms, a considerably smaller number of hyphen-exclusive tokens was found
in the Gutenberg dataset. The frequency occurrence of hyphen-exclusive tokens was 59,251 for
the Usenet dataset, while the frequency occurrence of hyphen-exclusive tokens was 2,764 for the
Gutenberg dataset. The relative percentage difference between the two datasets was 95.34%.
In contrast to the Usenet dataset, the most frequently used punctuation-exclusive token in
the Gutenberg dataset was the period. The total occurrences of period-exclusive tokens were
60,108 for the Gutenberg dataset, whereas the total occurrences of period-exclusive tokens were
14,967 for the Usenet dataset. The relative percentage difference was 75.10%. Once again, the
result indicated that the period is a dominant punctuation mark that can be used in a variety of
affixes. Overall, there were a number of punctuation-exclusive tokens for which the percentage
of difference was more than 90%. These were the hyphen -, equal sign =, colon :, comma ,,
question mark ?, plus sign +, a colon followed by a right parenthesis :), right parenthesis ), and
left parenthesis (.
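The relative percentage differences reported throughout these comparisons appear consistent with dividing the gap between the two counts by the larger count. A small sketch (our reconstruction of the formula from the reported values, not a script taken from this study):

```python
def relative_pct_difference(a: int, b: int) -> float:
    """(larger - smaller) / larger * 100, consistent with Table 18."""
    hi, lo = max(a, b), min(a, b)
    return (hi - lo) / hi * 100.0

# Hyphen-exclusive tokens: Usenet 59,251 vs. Gutenberg 2,764
print(f"{relative_pct_difference(59251, 2764):.2f}")   # 95.34
# Period-exclusive tokens: Usenet 14,967 vs. Gutenberg 60,108
print(f"{relative_pct_difference(14967, 60108):.2f}")  # 75.10
```

Both values match the figures reported in Table 18, which supports this reading of the measure.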
Notice that an individual punctuation mark in a series is different from the role of the punctuation pattern as a whole. Combinational uses of punctuation marks can be referred to as combinational punctuation marks. For example, the combinational punctuation mark :) shown in the table is semantically different from the colon : and the right parenthesis ). Yet, semantically, the role of a combinational punctuation mark can be closely related to those of its member punctuation marks, especially when every member punctuation mark is identical.
For example, the role of the double hyphen -- is similar to the role of the single hyphen -, as shown in Figure 10. In line 6 and line 9, it was used as a single hyphen -. In other cases, the double hyphen -- was used as a dash. Nevertheless, inconsistent uses of hyphens, often as a double hyphen --, were apparent in the Usenet dataset. There was a possibility that dashes
might have been converted to double hyphens at some point prior to downloading the Usenet archive from the Internet.
Occasionally, combinational punctuation marks produced entirely different meanings when combined, and they were usually used in a relatively non-standard way. For example, in relative terms, a considerable number of colons followed by a right parenthesis :) was found in the Usenet dataset. A colon followed by a right parenthesis :) is often used to represent a “smiling face”. In the Gutenberg dataset, on the other hand, only one such punctuation pattern was found. As a result, this punctuation-exclusive token had the highest relative percentage difference. Intuitively, such punctuation marks are not considered a standard expression in a typical book. It was also evidence of a sense of “casualness” in the Usenet dataset.
LINE#  Lines Containing Em dash
1   Haines City, Florida -- The family of Robert Donaldson has already
2   Hampton was a dangerous radical -- just before his death he had told the
3   Hamrlik -- Gorges
4   Hardly -- that doesn't describe "every" [Muslim] "in a robe".
5   Hash/pound sign -- slanted uprights: // and horizontal cross lines: ==.
6   He also noted -- in his journal -- that he became a close friend of
7   He and VP were both in the Senate for years -- so he definitely was at least
8   He can't -- his club is integrated...
9   He could tell how much liquor was left in a bottle -- and thus, when it was
10  He had eight children -- five sons and three daughters -- and lived
11  He has retired twice now -- two years ago in a full-blown quarrel with the
12  He has the same solidness as a lot of the late G1 molds -- the
13  He once said his philosophy regarding business -- one that valued
14  He pushed in as deeply as he could -- and the balloon burst. He felt
15  He remembers trying the Chamfered Luby -- and failing miserably.
16  He stated that is results were significant -- but did not
17  He was seen about a week and a half ago outside -- a rarity, BTW -- and he
Figure 10. Exemplary text lines containing double dashes (Usenet).
To obtain a visual perspective of punctuation-exclusive tokens, Table 18 was transformed into the graph shown in Figure 11. The x-axis indicates the punctuation-exclusive
token, while the y-axis shows the frequency count. As shown in this figure, the Usenet dataset, in general, had a much higher occurrence of punctuation-exclusive tokens. In the Gutenberg dataset, a smaller number of punctuation-exclusive tokens was found. As shown in Table 18,
10 out of 20 punctuation-exclusive tokens have a 90% or higher relative percentage difference.
[Figure 11: bar chart comparing the frequency counts (0 to 70,000) of the 20 punctuation-exclusive tokens across the Usenet and Gutenberg datasets.]
Figure 11. A comparison chart of punctuation-exclusive tokens.
By examining punctuation-exclusive tokens, a number of causes may be identified. A substantial number of asterisk-exclusive tokens * was discovered in the Gutenberg dataset serving as formatting punctuation marks. This contributed to the high asterisk-exclusive frequency count shown in Figure 11. In a non-conventional sense, any punctuation mark could be used as a formatting indicator. The implication for a text retrieval system is that punctuation mark usages need to be detected and their roles need to be disambiguated.
We propose that the key to identifying punctuation mark roles is to find the root causes of extreme discrepancies. The frequency comparison suggested that there was a greater
chance that the punctuation mark in one of the datasets was source-specific if the relative percentage difference was substantially higher. Also, based on identifiable punctuation mark roles, the causes of a discrepancy can often be intuitively explained.
Figure 12 is a snapshot of less frequently used punctuation-exclusive tokens in the Gutenberg dataset. While Table 18 shows only the most frequently occurring punctuation-exclusive tokens, the less frequently occurring punctuation-exclusive token counts should not be ignored in the punctuation mark analysis. In this figure, a frequency count is shown on the left side while the actual punctuation-exclusive tokens are shown on the right. A more general observation can be made regarding these tokens: in comparison to the most frequently used exclusive tokens, the lower-ranked punctuation-exclusive tokens appeared to be longer in number of characters.
In general, calculating punctuation-exclusive tokens was another useful method of detecting abnormal punctuation patterns. Once some type of sorted list becomes available, finding a punctuation pattern visually can be useful as well. In sum, for each punctuation pattern that cannot be characterized easily, further investigation was necessary. Two notable punctuation marks were selected and discussed further in the subsequent section.
[Figure 12 lists punctuation-exclusive tokens with frequency counts of 9, 10, and 11 in the Gutenberg dataset, e.g., --!, ......, ======, __, +------+, --?, (!), (:), and [*.]
Figure 12. Low-ranked punctuation-exclusive tokens in the Gutenberg dataset.
Period-Exclusive Token
The period was one of the punctuation-exclusive tokens that occurred substantially in both datasets. Figure 13 is a random sample of text lines that contain the period as a punctuation-exclusive token. A period-exclusive token could be formed by one of the following:
• Ellipses
• Lines to fill the space in a chapter of the book
Figure 13. Example of punctuation-exclusive token in the Gutenberg dataset.
In a typical sentence, a period is usually placed at a suffix position or at an infix position. In these cases, a period is placed immediately before or immediately after an alphanumeric character; accordingly, in a typical sentence, a period should not appear alone. The high frequency of period-exclusive tokens indicated that this punctuation mark usage was neither sub-lexical nor source-independent. This type of period usage was clearly a formatting use.
The Gutenberg dataset contained a greater number of period-exclusive tokens than the Usenet dataset. In the Gutenberg dataset, the purpose was to fill in empty blank spaces; to do this, the periods were used as formatting punctuation marks. Since the Gutenberg dataset is mainly comprised of books and periodicals, a fair number of these punctuation marks occurred in the “table of contents”.
By examining exclusive-token usage, we were able to recognize varying degrees of punctuation mark roles in an underlying text. This case also demonstrates how spacing needs to be considered in determining the punctuation mark role. In a typical “table of contents”, the blank space between a chapter heading and a page number often contains a series of periods. Although an ellipsis normally consists of three periods without blank spaces, the blank spaces surrounding each period here caused a high frequency count of period-exclusive tokens in the Gutenberg dataset.
Hyphen-Exclusive Token
As shown in Table 18, the punctuation-exclusive token with the highest frequency count
in the Usenet dataset was the hyphen. The result was interesting since the relative percentage
difference between the two datasets was high in many instances. As Table 18 indicates, 59,251
occurrences of hyphen-exclusive token were found in the Usenet while only 2,764 occurrences
of hyphen-exclusive tokens were found in the Gutenberg dataset. The relative percentage
difference was 95.34%.
Let us consider the overall frequency count of hyphen-infix punctuation marks again. The Usenet dataset contained 112,948 occurrences of hyphen-infix z-z, while the Gutenberg dataset contained 145,903 occurrences. The relative percentage difference was 95.34% for the hyphen-exclusive token, whereas it was only 22.59% for the hyphen-infix punctuation mark. The discrepancy between these two relative percentage differences, one as a hyphen-infix punctuation mark and the other as a hyphen-exclusive token, was considerable.
In general, it was necessary to recognize specific roles of hyphens in the underlying
datasets. In doing so, most hyphens fell into one of the following categories:
(a) Compound words (e.g., health-care)
(b) Prefixed words (e.g., re-read)
(c) Dashes (e.g., en dash and em dash)
(d) Marking the end of a line (e.g., soft hyphen)
(e) Spellings of individual letters (e.g., l-o-u-d-e-r!!)
(f) Range of values (e.g., 90-100)
(g) Dates (e.g., 1-1-2011)
(h) Computer command-line options (e.g., -v)
(i) Mathematical formulas (e.g., y=a-b)
Most of the above hyphen categories were derived by filtering tokens with infix hyphens and manually inspecting the text segments that contain various types of hyphens. Standard grammar references on punctuation marks, such as Quirk et al. (1985) and Straus and Fogarty (2007), could be used as guides in developing these types of categories.
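Categories like these lend themselves to a first-pass, rule-based classifier. The sketch below is illustrative only; the regular expressions are our own simplifications and cover just a few of the listed categories:

```python
import re

# Ordered rules: more specific patterns are tried first. These regexes
# are simplifications for illustration, not an exhaustive treatment.
HYPHEN_RULES = [
    ("date",            re.compile(r"^\d{1,2}-\d{1,2}-\d{2,4}$")),   # 1-1-2011
    ("range",           re.compile(r"^\d+-\d+$")),                   # 90-100
    ("command option",  re.compile(r"^-[A-Za-z]$")),                 # -v
    ("letter spelling", re.compile(r"^([A-Za-z]-){2,}[A-Za-z]!*$")), # l-o-u-d-e-r!!
    ("compound/prefix", re.compile(r"^[A-Za-z]+-[A-Za-z]+$")),       # health-care
]

def classify_hyphen_token(token: str) -> str:
    for label, pattern in HYPHEN_RULES:
        if pattern.match(token):
            return label
    return "other"

print(classify_hyphen_token("1-1-2011"))  # date
print(classify_hyphen_token("-v"))        # command option
```

Note that rule order matters: 1-1-2011 must be tested as a date before the broader range pattern gets a chance to reject it.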
The reason for the greater number of hyphen-exclusive tokens was that the Usenet dataset contained varied forms of hyphens, as mentioned earlier. In the Usenet dataset, hyphen categories (c), (d), (f), (g), (h), and (i) were clearly more predominant. These were evidently linked to the causes of the high number of hyphen-exclusive tokens.
The fact that spacing issues could cause problems was readily noticeable. For the above type (h), -v is a Unix command-line option. Here, -v must be used without any blank space before the letter v; having a blank space between the hyphen and the letter would produce - v. This form of command is not allowed since computer commands are highly sensitive to blank
spaces and punctuation marks. In general, adding an internal blank space in this way might result in one or more of the following problems:
• Stylistically unconventional
• Syntactically unacceptable
• Semantically ambiguous
In other cases, surrounding blank spaces could be used either way: both leaving and not leaving blank spaces were acceptable, since leaving an extra blank space is just a stylistic difference. For example, the mathematical formula c = a - b could also be written as c=a-b. Likewise, the value range 90 - 100 could also be written as 90-100.
Semantically, there is no major difference between these two expressions. This difference in spacing convention, however, can provide clues for interpreting the statistical results. Such a case also implied that multiple tokenization options were possible: segmenting based on blank spaces would have caused the formation of the tokens c, =, a, -, and b.
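The two tokenization outcomes can be made concrete with a short sketch (illustrative only; a real tokenizer would need many more rules):

```python
import re

def tokenize_whitespace(text: str) -> list:
    """Segment on blank spaces only."""
    return text.split()

def tokenize_punctuation(text: str) -> list:
    """Also split at punctuation marks, keeping each mark as a token."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

print(tokenize_whitespace("c = a - b"))   # ['c', '=', 'a', '-', 'b']
print(tokenize_whitespace("c=a-b"))       # ['c=a-b']
print(tokenize_punctuation("c=a-b"))      # ['c', '=', 'a', '-', 'b']
```

Only the punctuation-aware tokenizer produces the same tokens for both spacing conventions, which is the normalization effect discussed above.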
To a certain extent, punctuation-exclusive tokens showed various effects of tokenization. Determining a correct segmenting marker is not a trivial task since it requires some type of generalization, and generalizing from specific cases to a more general case in tokenization is a challenging task of its own. Such tokenization problems indicated that punctuation mark functions might be highly domain-specific or genre-specific in some cases. Thus, neighboring token elements may need to be examined in conjunction with a particular token to determine its context and the type of punctuated item. For this reason, tokenization is a required procedure in determining the punctuation mark role. Moreover, in terms of text retrieval, blank spaces could cause variability of character strings, both on the content side and on the search side. As a consequence, string matching could be performed in multiple ways.
The results suggested that further in-depth analysis needs to be conducted on the hyphen phenomenon. At minimum, the most common issues, such as multi-part character strings, need to be carefully examined from a text retrieval perspective.
Variations of Affix-Based Token Type
Some tokens could be constructed based on various punctuation mark affixes. For example, consider only hyphenated tokens, as shown in Table 19. Based on the punctuation mark affix, four formations are possible: infix-hyphen, prefix-hyphen, suffix-hyphen, and exclusive-hyphen. When the Usenet dataset was compared to the Gutenberg dataset, the counts of prefix, suffix, and punctuation-exclusive tokens were much higher. These differences were noteworthy since, more than likely, they were indications of domain-specific hyphens.
Table 19.
Hyphenated Token Categorization Comparison
                        Infix      Prefix    Suffix    Punctuation-Exclusive Token
                        (z-z)      (-z)      (z-)      (-)
Usenet                  112,948    3,570     11,990    59,251
Gutenberg               145,903    91        933       2,764
Relative % Difference   22.59%     97.45%    92.22%    95.34%
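The four affix formations in Table 19 can be expressed as a simple classification sketch (an illustration using the same z-z / -z / z- / - notation, where z stands for an alphanumeric character):

```python
import re

def hyphen_affix_type(token: str) -> str:
    """Classify a hyphenated token by where its hyphens sit,
    mirroring the four columns of Table 19 (z = alphanumeric)."""
    if re.fullmatch(r"-+", token):
        return "punctuation-exclusive"                 # -
    if re.fullmatch(r"[A-Za-z0-9]+(-[A-Za-z0-9]+)+", token):
        return "infix"                                 # z-z
    if re.fullmatch(r"-+[A-Za-z0-9]+", token):
        return "prefix"                                # -z
    if re.fullmatch(r"[A-Za-z0-9]+-+", token):
        return "suffix"                                # z-
    return "other"

print(hyphen_affix_type("health-care"))  # infix
print(hyphen_affix_type("-v"))           # prefix
```

Running such a classifier over a whitespace-tokenized corpus and tallying the labels would reproduce the kind of comparison summarized in Table 19.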
Paired-Punctuation Marks
Often punctuation marks were placed in pairs. For example, punctuation marks such as parentheses have opening and closing marks. Thus, in terms of affixes, a number of suffix-prefix-punctuated tokens were selected and investigated to understand the punctuation pattern in
detail. The result is shown in Table 20. The grep commands that were used for generating each frequency count are also shown in this table.
In Table 20, the letter x replaces a single alphanumeric character in a token, while the symbol □ can replace any type of character between the pair of x. Because paired punctuation marks reach beyond a word boundary, the frequency count was performed against the un-tokenized datasets, news.temp and g.temp. There is a blank space after the initial opening quotation mark and before the closing quotation mark in the grep patterns since we used the un-tokenized datasets to calculate the paired-punctuation marks.
The frequency counts shown in Table 20 do not accurately reflect all paired-punctuation marks in the dataset. Along with the paired-punctuation marks, another punctuation mark of the same type was often present in the same line. For example, calculating paired period-punctuation marks using the pattern .x□x. did not produce accurate results since, in some instances, the two periods were not functionally used as a pair. Also, some punctuation marks such as quotation marks might extend beyond a single line: an opening quotation mark might be present in one line, and the closing quotation mark in another.
However, we detected a number of interesting patterns through the frequency count of paired-punctuation marks. For example, in the Usenet dataset, occurrences of the paired asterisk *x□x* were much higher compared to the Gutenberg dataset. Also, occurrences of the paired underscore _ were much higher in the Gutenberg dataset compared to the Usenet dataset. The paired underscore _ turned out to be a specific pattern that was anomalously present in the Gutenberg dataset.
Notice that the above calculation is based on “without” post and pre blank spaces. The post blank space is a blank space immediately after the opening punctuation mark (e.g., a blank
space after a left parenthesis). The pre blank space is a blank space immediately prior to the closing punctuation mark (e.g., a blank space before a right parenthesis). Thus, Table 20 is not based on post and pre blank spaces; the reported counts were based on “without” post and pre (post-pre) blank spaces. Producing the representation “with” post-pre blank spaces would require one blank space added after and before each symbol □. For example, calculating a frequency count of parentheses “with” post-pre blank spaces can be represented as ( x□x ), while “without” post-pre blank spaces can be represented as (x□x). We also have to consider that sometimes a paired-punctuation mark might not be evenly spaced due to a writer's own inconsistent behavior. For example, we could have (x□x ), which has a pre blank space on the right side but no post blank space on the left side. Another possibility is ( x□x), which has a post blank space on the left side but no pre blank space on the right side.
Table 20.
Selected Paired-Punctuation Marks Without Post-Pre Blank Spaces
Paired-        Usenet   Gutenberg  Relative %   Grep Command
Punctuation                        Difference
Marks
"x□x"          62,358   18,411     70.48%       grep ' "[a-zA-Z0-9].*[a-zA-Z0-9]" ' filename
*x□x*          14,481       73     99.50%       grep " \*[a-zA-Z0-9].*[a-zA-Z0-9]\* " filename
(x□x)          14,331    9,521     33.56%       grep " ([a-zA-Z0-9].*[a-zA-Z0-9]) " filename
'x□x'          12,399    4,467     63.97%       grep " '[a-zA-Z0-9].*[a-zA-Z0-9]' " filename
[x□x]           3,817    1,496     60.81%       grep " [[a-zA-Z0-9].*[a-zA-Z0-9]] " filename
_x□x_           3,384   30,923     89.06%       grep " \_[a-zA-Z0-9].*[a-zA-Z0-9]\_ " filename
/x□x/           1,033       10     99.03%       grep " \/[a-zA-Z0-9].*[a-zA-Z0-9]\/ " filename
.x□x.             401       35     91.27%       grep " \.[a-zA-Z0-9].*[a-zA-Z0-9]\. " filename
-x□x-             349        0     100.00%      grep " \-[a-zA-Z0-9].*[a-zA-Z0-9]\- " filename
...x□x...          29        0     100.00%      grep " \.\.\.[a-zA-Z0-9].*[a-zA-Z0-9]\.\.\. " filename
%x□x%              21        0     100.00%      grep " \%[a-zA-Z0-9].*[a-zA-Z0-9]\% " filename
#x□x#               7       20     65.00%       grep " \#[a-zA-Z0-9].*[a-zA-Z0-9]\# " filename
Note: the symbol □ indicates any repeatable character such as a combinational punctuation pattern or alphanumeric characters.
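The spacing variants described above can be expressed directly as patterns. The following Python re sketch mirrors the grep notation of Table 20 for the parenthesis case (our translation into Python, not the commands used in this study):

```python
import re

# x = alphanumeric, □ = any run of characters (cf. Table 20 notation).
ALNUM = "[a-zA-Z0-9]"
without_postpre = re.compile(rf" \({ALNUM}.*{ALNUM}\) ")    # (x□x)
with_postpre    = re.compile(rf" \( {ALNUM}.*{ALNUM} \) ")  # ( x□x )
pre_only        = re.compile(rf" \({ALNUM}.*{ALNUM} \) ")   # (x□x )
post_only       = re.compile(rf" \( {ALNUM}.*{ALNUM}\) ")   # ( x□x)

line = " cited (Evan 2013) and ( loosely spaced ) remarks "
print(bool(without_postpre.search(line)))  # True
print(bool(with_postpre.search(line)))     # True
```

As in the grep version, the greedy .* can span multiple parenthesized groups on one line, which is one source of the counting inaccuracies discussed above.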
There was a clear limitation to the above method of calculating the paired-punctuation marks. If we consider a character string Evan, R., 2013 that is surrounded by parentheses, the frequency count should examine all of the combinations, including internal punctuation marks between blank-space-based tokens, as shown in Figure 14. Having a pre or post blank space on only one side is inconsistent and stylistically unconventional, but such inconsistencies appeared frequently when the Usenet dataset was inspected. The inconsistent use of blank spaces was probably due to the writers' casualness in the Usenet. Furthermore, this example depicts how spacing can become a more complicated issue in conducting a frequency count.
Some type of normalization of blank space is clearly required for conducting frequency counts of paired-punctuation marks, although most of these issues could be resolved once we identify meaningful tokens.
Figure 14. Variation of paired-punctuation marks based on post-pre blank spaces.
In addition, the length of character strings or sentences varied. Accurately performing a frequency count “with” post-pre blank spaces was not a trivial task for a punctuation mark such as the double quote due to difficulties in matching the pairs. For example, if there are three double quotes internally with blank spaces, then the calculation will be off since we do not
easily know where the correct pairing of double quotes lies. It could be a simple mistake in the text, or there could be an orphan quote standing for inches (e.g., 5" instead of 5 inches). Overall, for certain punctuation marks such as the quotation mark, a paired-punctuation mark frequency count “with” post-pre blank spaces was no less problematic than a count “without” post-pre blank spaces.
Table 20 shows that the double quotation mark " (ASCII code decimal #34) was the most frequently used paired-punctuation mark. The parentheses ( ) were the second highest paired-punctuation mark in the Usenet dataset, and the asterisk * was the third highest. The brackets [ ] were commonly used in both datasets as a paired-punctuation mark. In contrast, a relatively low number of asterisks * was found as a paired-punctuation pattern in the Gutenberg dataset. This clearly indicates that the paired asterisk * is used much more as a modern, casual expression. Also, some paired asterisks are likely to be associated with mathematics since the asterisk denotes “multiplication”. Both datasets showed less common instances of the number sign # and percentage sign % as paired-punctuation marks.
Based on the relative percentage differences, some of the paired-punctuation mark usages
were clearly different from one another. For example, 1,033 occurrences of paired-slashes were found in the Usenet, whereas only 10 occurrences of paired-slashes were found in the Gutenberg
dataset. The paired-slashes were represented as /x□x/. The paired-period without internal blank
space was represented as .x□x.. There were 401 instances of the paired-period .x□x. in the
Usenet dataset, but only 35 instances of the paired-period .x□x. were found in the Gutenberg
dataset.
Overall, when the frequency count of a lexical-level punctuation pattern was extremely low, drawing a definitive conclusion regarding the pattern's usage was difficult, because too few text lines containing the pattern could be retrieved to examine its contextual usage. In general, however, variant punctuation mark usages, such as those of the period, were much more frequent in the Usenet dataset than in the Gutenberg dataset. Overall, these types of punctuation pattern usage reflected the non-standard, domain-specific language found in the Usenet dataset.
In a conventional sense, leaving a blank space around some punctuation marks, such as a dash or a parenthesis, would be stylistically acceptable. An exception to this generalization is a specific stylistic guideline, which can always dictate, override, or constrain aspects of writing, especially the particular use of punctuation marks.
According to Jones (1997), punctuation marks that have delimiting functions are known to be used in pairs; furthermore, they belong to the category of source-independent punctuation marks. Yet, we noticed that paired-punctuation marks can occur with a variety of punctuation marks other than the ones Jones mentioned. For example, a paired-punctuation mark that had a high frequency count was the underscore _. Compared to the Usenet dataset, a greater number of these paired-punctuation marks was found in the Gutenberg dataset, where the underscore was primarily used as a formatting punctuation mark to fill up blank spaces.
A pertinent question was whether a punctuation mark such as the paired underscore can have any functional role during the preparation of a text retrieval system. The frequency strategy that was used showed that it was not difficult to detect these abnormalities to a certain extent. If the goal is to utilize these punctuation marks to a certain
extent, removing these punctuation marks requires a more judicious consideration of their roles. In addition to blank spaces, perhaps a paired-punctuation mark such as the underscore could be used as a tokenization point. Overall, the frequency count of paired-punctuation marks demonstrated another effective means of examining the characteristics of punctuation marks in the dataset.
Additional Types of Punctuation
Examining additional types of punctuation is the third and last punctuation frequency analysis procedure of this study. So far, the analysis focused on affix-based types, and there were often notable punctuation pattern differences between the two datasets. It was possible to classify additional punctuation mark types and punctuation pattern types based on the frequency counts.
At this point, it became clear that additional categories had to be considered in constructing
punctuation normalization rules. These results and issues are discussed in the following section.
Basic Punctuation Patterns
One of the reasons for applying punctuation frequency analysis procedures was to
identify and examine punctuation patterns. In our discussion so far, the notion of punctuation
pattern was used loosely in many instances. Most often the term punctuation pattern refers to a structural punctuation pattern. This means a recurring use of punctuation marks based on a
particular choice of segmentation unit (e.g., punctuated character string, punctuation marks in a
sentence, etc.). Additional terms can be used to categorize the punctuation pattern for analysis
purposes. We categorized a punctuation pattern into two types:
• Lexical-level punctuation pattern
• Sentence-level punctuation pattern
A lexical-level punctuation pattern is a type of punctuation pattern that is recognizable as
a result of a punctuated character string used in a text. Once again, a punctuated character string
is a character string that contains at least one punctuation mark. For example, the use of an
apostrophe ' , which often indicates a possessive or contraction, is a lexical-level punctuation
pattern. A punctuation pattern that appears as an apostrophe-s 's is a lexical-level punctuation
pattern since the apostrophe is associated with a word. For another example, abbreviated character strings are often associated with lexical-level punctuation patterns since periods often appear after a single letter (e.g., x. x. x.), where the letter x replaces a single alphanumeric character.
A sentence-level punctuation pattern is a type of punctuation pattern that is recognizable as a result of using punctuation marks in a text segment. Suppose we want to analyze a pattern at the sentence level, where the letter e is substituted for any run of alphabetical characters (with blank spaces folded in) that appears in a segmented text. In this case, if we have a sentence such as Mr. Smith, this is my dog., then the structural view of the punctuation pattern could be represented as e.e,e.. In this way, one may calculate how many text segments have this particular pattern in the text.
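This structural view can be generated mechanically. A minimal sketch (our illustration of the e-notation as described, not the study's own implementation):

```python
import re

def sentence_pattern(text: str) -> str:
    """Collapse each run of alphabetic characters and blank spaces to a
    single 'e', keeping punctuation marks in place (the e-notation)."""
    return re.sub(r"[A-Za-z ]+", "e", text)

print(sentence_pattern("Mr. Smith, this is my dog."))  # e.e,e.
```

Counting identical outputs of such a function over all sentences in a corpus would yield the sentence-level pattern frequencies discussed here.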
From a perspective of normalizing punctuation marks, different interpretations of punctuation patterns are possible. Since various punctuation mark characteristics were discovered, the results were used in the latter part of analytical procedure that deals with qualitatively observing the punctuation normalization issues.
Domain-Specific Punctuation Pattern
By using a frequency count strategy, some punctuation patterns could be identified with
specific domains. A domain is a sphere of knowledge or influence (Domain, 2013). Just as the character strings used can differ depending on the context, some punctuation patterns depend upon the domain and have particular characteristics. For example, cat -v is a UNIX command with the -v option; in this command, the -v option contains a hyphen before the letter v. A punctuation pattern having a hyphen prefix can be commonly found in computer programming books. Unless noted otherwise, the hyphen refers to ASCII code decimal #45 in this study.
Identifying domain-specific punctuation patterns would be useful in predicting specific punctuation mark roles. If the content contains a hyphen prefix, then we need to determine whether the hyphen is perhaps a “dash” or a computer command option such as -v. If it is a dash, it could perhaps be normalized by simply removing it; however, if it is a computer command prefix, then we need to keep the hyphen intact with the letter v. Since domain differences were reflected in the form of punctuation patterns, categorizing punctuation patterns according to specific domain types seemed useful for punctuation normalization. In the type of domain we just discussed, it is more appropriate to develop a punctuation normalization rule that is more sensitive in treating a hyphen followed by a single character.
For another example from the datasets, we need a more accurate method to disambiguate character strings such as b/c and i/o. If these character strings are used in the mathematics domain, then the slash most likely represents division. Thus, a useful strategy is to understand the particular domain characteristics in relation to the underlying punctuation patterns. Essentially, a feasible approach may be the following: if a corpus contains a domain-specific type of term or punctuation pattern, then a disambiguating process would be required to determine the context of the character string. Furthermore, based on an identification of domain-
specific punctuation patterns, contextual usage can be determined by examining neighboring character strings.
This type of disambiguation routine might have to rely on a specially created lexicon that includes exceptional punctuation patterns and their variations. In turn, a matching function would be processed according to the disambiguated punctuation pattern. Overall, determining domain-specific punctuation requires examining punctuation patterns, overall frequency occurrences, and relative percentage differences.
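A minimal sketch of such a lexicon-driven routine follows; the lexicon entries and the crude strip-everything fallback are hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical exception lexicon: punctuated character strings whose
# punctuation must be kept intact rather than normalized away.
KEEP_INTACT = {"b/c", "i/o", "-v", ":)"}

def normalize_token(token: str) -> str:
    """Keep lexicon entries intact; otherwise strip punctuation
    (a deliberately aggressive fallback, for illustration only)."""
    if token.lower() in KEEP_INTACT:
        return token
    return "".join(ch for ch in token if ch.isalnum())

print(normalize_token("I/O"))    # I/O
print(normalize_token("dog."))   # dog
```

A production system would replace the fallback with role-sensitive rules, but even this sketch shows how a lexicon lets domain-specific punctuated terms survive normalization.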
In sum, it was important to not only identify punctuation patterns but also generalize these patterns to an extent that they can be useful in punctuation normalization procedures. This generalization is possible with the comparative method that was suggested in this chapter. Once we understand the effects of normalization, then we can support effective punctuation sensitive search queries.
Combinational Punctuation Marks
A combinational punctuation pattern has a specific role in the text. For example, a long repeated use of the hyphen may indicate a breaking line. The semantics of a particular punctuation mark is evidently linked to the length of the punctuation mark series and the unique combination of punctuation marks. A short repeated sequence of the same punctuation mark (e.g., three hyphens) may be used to indicate a pause and emphasize the content.
A series of three or more consecutive punctuation marks can be defined as triple punctuation marks. The triple punctuation marks with the highest frequency counts are shown in Table 21. In the Gutenberg dataset, 33,535 instances of period-hyphen-hyphen .-- were found, whereas in the Usenet dataset, 117 instances were found. Because of double
145
hyphens, the frequency count of period-hyphen-hyphen is substantially higher in the Gutenberg
dataset compared to the Usenet dataset.
Table 21.
The Most Common Triple Punctuation Marks

Punctuation Pattern    Usenet    Gutenberg
.--                       117       33,535
--"                       145       11,352
…                      71,273       12,073
Some of the combinational punctuation marks shown in this table are not typical punctuation patterns that would be found in a sentence. Removing these punctuation patterns seemed desirable since they appeared to be extraneous on first examination. In the above case, the use of ellipses was predominant, especially in the Usenet dataset. The period-hyphen-hyphen could simply be considered a formatting punctuation mark. In general, the longer the punctuation mark series, the greater the chance that it is superfluous from the view of text retrieval. Therefore, in normalizing punctuation marks, the length of combinational punctuation marks is one of the useful elements that could be considered.
The frequency count strategy is effective in detecting nonstandard combinational punctuation marks. Combinational punctuation marks may have special domain-specific roles, and accordingly the frequency count could indicate a greater relative percentage difference when two datasets are compared. In general, by employing a frequency count before normalizing such punctuation mark patterns, some of these elements can be reviewed by humans. This process, however, can be difficult since punctuation patterns with lower frequency counts might be too numerous to review individually. In any case, without careful examination, the effect of removing such punctuation marks could be less than optimal because of the domain-specific roles they play.
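The frequency count strategy for combinational punctuation marks can be sketched as follows. The regular expression and the threshold of three consecutive marks are assumptions made for illustration.

```python
import re
from collections import Counter

# Extract runs of three or more consecutive ASCII punctuation marks
# and tally their frequencies, mirroring the triple punctuation
# frequency counts reported in Table 21.
PUNCT_RUN = re.compile(r'[!-/:-@\[-`{-~]{3,}')

def count_triple_punctuation(text):
    """Count sequences of three or more consecutive punctuation marks."""
    return Counter(PUNCT_RUN.findall(text))
```

Running such a counter over a corpus and sorting by frequency would surface candidates such as .-- and --- for human review before normalization.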
The Use of Punctuation Marks with Numbers
Often some of the unique patterns of punctuation marks are associated with numbers. In general, punctuation marks provide a special meaning to the token. In this research, we refer to tokens that contain numbers and punctuation marks as punctuated numeric tokens. Table 22 summarizes the types of numeric tokens that were found in both datasets.
Table 22.
Examples of Numeric Tokens
Numeric Type Token
Date 03/17/1968 or 03-16-1968
Currency $1
Decimal Points 3.14
Percentage 99%
General Number 3,100
Math (Fraction or Division) ½
Math (Multiplication) 5*2
Math (Addition) 5+2
Math (Subtraction) 5-2
Math (Equal) 2+2=4
Time 4:00
Number on Keypad #9
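The numeric token types in Table 22 could be recognized with simple patterns. The classifier below is an illustrative sketch; its regular expressions are simplified assumptions rather than a complete specification (the fraction character ½, for instance, is not handled).

```python
import re

# Illustrative patterns for the punctuated numeric token types in
# Table 22, tried in order; earlier patterns take precedence.
NUMERIC_PATTERNS = [
    ("date",       re.compile(r'^\d{2}[/-]\d{2}[/-]\d{4}$')),
    ("currency",   re.compile(r'^\$\d[\d,]*(\.\d+)?$')),
    ("percentage", re.compile(r'^\d+(\.\d+)?%$')),
    ("time",       re.compile(r'^\d{1,2}:\d{2}$')),
    ("decimal",    re.compile(r'^\d+\.\d+$')),
    ("number",     re.compile(r'^\d{1,3}(,\d{3})+$')),
    ("keypad",     re.compile(r'^#\d+$')),
    ("math",       re.compile(r'^\d+[*+=/-]\d+(=\d+)?$')),
]

def classify_numeric_token(token):
    """Return a label for a punctuated numeric token."""
    for label, pattern in NUMERIC_PATTERNS:
        if pattern.match(token):
            return label
    return "other"
```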
Some punctuation marks, such as the dollar sign, are often associated with numbers. Since such a punctuation mark provides a unique meaning to numbers, deleting it would cause retrievability problems for a user. As pointed out in Chapter 2, according to Jones (1997), the two broad categories of punctuation marks are source-specific and source-independent. Apart from generally known source-independent punctuation marks such as the comma, the punctuation marks used with numbers were mostly source-specific. For example, the dollar sign and percent sign are regarded as source-specific punctuation marks according to Jones.
In terms of our affix-based punctuation mark category, commas used in numbers are considered infix-commas, without any blank spaces before and after the character. Because of its affix-based positional characteristics, normalizing a comma that is used in a number requires a different scheme than normalizing commas in typical sentences. As a result, one of the requirements for an effective normalization policy would be to disambiguate such source-independent punctuation marks from sub-lexical punctuation marks. Source-independent punctuation marks are considered inter-lexical punctuation marks and not sub-lexical punctuation marks according to Jones' (1997) categories. See Chapter 2 for the discussion of Jones' punctuation mark categories.
Table 23 shows various frequency counts related to numeric punctuation marks. In comparison to non-numeric tokens, the number of numeric tokens is relatively low. Overall, in either dataset, less than 1% of tokens contained numbers. Although the percentage was relatively low, correctly indexing numeric tokens means that proper punctuation normalization is required even with numbers. In essence, an attempt to normalize punctuation marks meant that frequency counts of specific patterns associated with numbers should be examined.
Table 23.
Basic Statistics of Numeric Punctuation Marks

Numeric Types                        Usenet Dataset    Gutenberg Dataset
Punctuated-numeric tokens                   253,553              145,721
Non-punctuated-numeric tokens               270,747              198,759
Infix numeric punctuation marks             111,063               43,431
Prefix numeric punctuation marks             46,855               17,066
Suffix numeric punctuation marks            148,910              121,108
After browsing the patterns of the period, it was apparent that the datasets contained a vast amount of abbreviations and non-standard words. This prompted further analysis of the various patterns of periods. Periods were the punctuation marks most predominantly used for abbreviating character strings. The frequency counts of individual punctuation marks and their patterns indicate that the period was heavily used in both datasets.
Punctuation Marks in Abbreviation and Non-Standard Words
Since the language used on Usenet was informal, this dataset contained non-standard expressions. Compared with the Gutenberg dataset, the Usenet dataset contained a considerable number of abbreviated character strings and non-standard words. The types of abbreviations found in the Usenet dataset varied greatly.
Often abbreviated character strings contain a period, but periods are often not used in a predictable pattern. Thus, a typical dictionary approach to identifying abbreviated character strings will only work for commonly used abbreviations. For example, common abbreviated words such as Dr., Ph.D., etc. and Mr. are typically found in a dictionary. An abbreviated character string such as B.C. can be regarded as an ordinary abbreviated word, and it may well be listed in a general dictionary. Such strings often refer to common well-known entities or notions such as British Columbia, Before Christ, and Boston College. B.C. could also be associated with a wider range of multi-part words that start with the letter B for the first word and the letter C for the subsequent word. Procedures to disambiguate the context of such character strings are beneficial, although such an attempt could be computationally difficult because of the inherent ambiguity associated with abbreviated character strings.
Table 24 shows the top 20 tokens with the x.x. punctuation pattern, ignoring capitalization. We only considered the cases where exactly two alphabetical characters carry an infix-suffix-period pattern. As was discussed earlier, many variations involving periods were possible, including infix blank spaces. When the terms were ranked by their frequency of appearance in the datasets, common abbreviated terms that can be found in a typical dictionary appeared.
Table 24.
The Top 20 Tokens with x.x. Punctuation Pattern
Usenet Dataset              Gutenberg Dataset
Frequency Count    Token    Frequency Count    Token
5705               U.S.     405                A.D.
806                i.e.     288                B.C.
509                e.g.     276                U.S.
460                p.m.     140                M.D.
272                U.N.     127                M.A.
210                D.C.     121                N.Y.
187                P.S.     109                D.D.
172                U.K.     99                 A.M.
172                A.D.     92                 P.M.
123                L.A.     83                 i.e.
In considering the types of punctuation marks involved with abbreviations, it was evident from the frequency count that one of the most common conventions is to use a period right after a single alphabetic letter to abbreviate the character string (e.g., B.C.). Another common way to abbreviate character strings was to use the first letter of a character string, followed by other selected letters. The selected letters are mostly consonants from the character string, and the abbreviated character string would normally end with a period (e.g., St. and recvd.). Based on browsing the punctuated string patterns, a period that appears after three or more lower-case alphabetical characters was more likely a full-stop. Overall, the frequency count results suggest that the length of the individual characters in a token was a useful feature in determining some of the abbreviated character strings.
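These observations can be combined into a rough heuristic. The sketch below is our own illustration; as noted above, abbreviations such as recvd. are exceptions it would misclassify.

```python
import re

# x.x. pattern: alternating single letters and periods (B.C., U.S.)
DOTTED_ABBREV = re.compile(r'^([A-Za-z]\.){2,}$')

def classify_period(token):
    """Guess whether the final period of a token marks an abbreviation
    or a full-stop, using the frequency-count observations."""
    if DOTTED_ABBREV.match(token):
        return "abbreviation"
    core = token.rstrip('.')
    # A period after three or more lower-case letters is more likely
    # a full-stop (this misreads abbreviations such as recvd.).
    if core.isalpha() and core.islower() and len(core) >= 3:
        return "full-stop"
    return "abbreviation"
```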
As mentioned earlier, there are other punctuation marks, such as the slash, which can be used for abbreviation as well. We should note that punctuation marks other than the period that are involved in abbreviations could result in a slightly different punctuation pattern. As pointed out earlier, slashes were used as infixes, as in the case of I/O and w/o.
Utilizing possible variations of abbreviated character strings could enhance our ability to determine the types of punctuation normalization operations to apply. For example, the variations of the term B.C. could be BC, British Columbia, or Boston College. For another example, the variations of PO BOX existed as P.O. BOX, POB, Post Office Box, etc. Again, we ignored capitalization differences for simplicity's sake. Some observations can be made regarding these types of variations:
• Some abbreviated character strings do not contain punctuation marks
• Abbreviated character strings are sometimes comprised of multiple tokens
• A punctuation mark such as a period that follows each abbreviation tends to function collectively within the boundary of a lexical item
• A period is the most common punctuation mark in abbreviated character strings
The above type of approach can be regarded as a dictionary-based approach since the normalization considers all possible forms of semantically equivalent character strings. In most cases, simply associating all of these types of variations with each other is quite limited because of the non-standard nature of our language. For example, the character string gov't is a contraction and not a standard abbreviated character string, as we pointed out in Chapter 2. It is difficult to exactly pinpoint a non-standard abbreviated word, but we could simply define a non-standard word as a word that is not found in a basic dictionary. For example, the Usenet dataset contained the abbreviated character string M.Kh. This type of abbreviation could be viewed as a less commonly used abbreviation and might not be listed in a typical dictionary. Thus, keeping up with every possible character string variation in a general dictionary lookup table might be impractical and imprecise due to the homonymy of abbreviated character strings.
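A minimal sketch of such a dictionary-based lookup table follows. The entries are illustrative, and non-standard strings such as M.Kh. simply fall outside the lexicon, which is exactly the limitation discussed above.

```python
# Hypothetical variation lexicon mapping a canonical key to its
# semantically related forms, ignoring capitalization.
VARIATIONS = {
    "b.c.": ["bc", "british columbia", "before christ", "boston college"],
    "p.o. box": ["po box", "pob", "post office box"],
}

def expand_variations(term):
    """Return known variations of a term, or an empty list if the
    term is not in the lexicon (e.g., a non-standard abbreviation)."""
    return VARIATIONS.get(term.lower(), [])
```

Note that the ambiguity of B.C. is preserved here: the lookup returns all candidate expansions, and a separate disambiguation step would still be required to pick one.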
In terms of examining abbreviated character strings and contractions, one of the main issues was how to distinguish the roles of punctuation marks such as the period and apostrophe. At the same time, a dictionary-based list of known abbreviated character strings could aid in disambiguating the role of punctuation marks to a certain extent. However, to distinguish punctuation mark roles in abbreviated character strings from those in non-abbreviated character strings, all the elements of the surrounding text along with abbreviated term variations must be considered. For this reason, it is desirable to distinguish abbreviated character strings from the rest of the non-abbreviated character strings. Regardless, these types of problems are lexically related issues mainly involving apostrophes, periods, and slashes.
Punctuation Mark Types Based on Referencing Area
We were able to find out that each punctuation mark role can be observed in conjunction with its distinctive locality in the text. That is, the boundary area can be visually identified by noticing the role of punctuation marks. We call such a boundary area the referencing area.
Based upon this punctuation mark locality characteristic, we argue that punctuation marks could be categorized broadly for the purpose of normalization. This notion can be better explained with an example. Figure 15 is an exemplary sentence with punctuation marks (periods and a comma). Here, we have the following three punctuation marks:
• Punctuation mark #1 (periods)
• Punctuation mark #2 (comma)
• Punctuation mark #3 (full-stop)
In Figure 15, each punctuation mark has a distinctive role in a sentence, but for each punctuation mark, the referencing boundary area is different from one to another. The following observations could be made:
• Punctuation marks #1 are only local to the character string B.C.
• Punctuation mark #2 is applicable to two clauses: “This is an ultimate guide to Vancouver Island B.C.” and “the best vacation destination”
• Punctuation mark #3 is applicable to the entire sentence: “This is an ultimate guide to Vancouver Island B.C., the best vacation destination.”
Figure 15. An exemplary sentence with punctuation marks.
As shown in this example, we observe that each punctuation mark's referencing area can differ depending on the individual punctuation mark role. The most elementary level of referencing area is confined to just one character. For example, the character normalization of the letter é, as in the character string café, is confined to just one character. The affected area could be two characters at most due to the character normalization. In terms of tokens, most sub-lexical punctuation marks and inter-lexical punctuation marks are usually confined to one to three tokens. Here, we are only roughly estimating the number of tokens. Yet, technically, if blank space tokenization is used, the number of tokens could be even greater if we consider an expression such as * end of line * as one lexical unit.
Nevertheless, in the above example, punctuation mark #1 is a period that is confined to just one character string, B.C. In the case of source-independent punctuation marks (e.g., punctuation marks #2 and #3), the affected linguistic unit can be greater than just one word.
Considering the referencing areas of individual punctuation marks, a punctuation mark such as a period used with an abbreviation is always dependent on a lexical item, whereas a punctuation mark such as a full-stop is not bounded by a lexical item. In this study, punctuation marks could be roughly divided into two major categories: lexical independent (LI) punctuation marks and lexical oriented (LO) punctuation marks.
These are important notions in terms of our ability to address and generalize different types of punctuation normalization effects. For now, we need to determine what constitutes LI and LO punctuation marks. In comparison to Jones' (1997) punctuation categories as pointed out in Chapter 2, LI punctuation marks refer to the following categories: stress markers, delimiting, and separating. In general, LI punctuation marks may include the following punctuation marks:
• Exclamation mark
• Question mark
• Ellipsis
• Semicolon
• Dash
• Comma
• Parenthetical (braces, brackets, parenthesis)
• Colon
• Quotes
• Full-stop
We have to stress the fact that the above categories are general since these punctuation marks could be categorized as LO punctuation marks in certain circumstances. Unlike LI punctuation marks, LO punctuation marks refer to punctuation marks that have functional roles associated with lexical items. Usually, the boundaries of LO punctuation marks are confined to two to three words. In the case of a hyphen (ASCII code decimal #45), the role of the punctuation mark could be associated with two words since the hyphen often neighbors two free morphemes. In comparison to Jones' (1997) categories of punctuation marks, this category is closer to source-specific punctuation marks and sub-lexical punctuation marks.
The following list is a general guideline, and the list is not absolute since a rich set of punctuation marks is available. These are:
• Dot
• Hyphen
• Apostrophe
• Decimal point
• Asterisk (multiplication)
• Caret (power)
• Equal sign
• Plus sign
• Dollar sign
• Pound sign
• Slash
• Etc.
Our categorization of punctuation marks into LI and LO had to differ from Jones' (1997) and Nunberg's (1990) punctuation mark categories due to the fact that we are mainly interested in efficiently addressing the effects of punctuation normalization. Not only are the categories of punctuation marks more simplified in a sense, but the categories of punctuation marks are also slightly different. We theorize that even punctuation marks that are associated with delimiting and separating roles can have a distinctively defined role at a lexical level.
For example, a comma in most cases is used as an LI punctuation mark. Yet, a conventional rule also states that a comma needs to be placed if an individual writes a person's last name before the first name (e.g., Washington, George). Since the referencing area is fixed and consistently refers to the lexical items, the last name and the first name, such a use of a comma can be identified as an LO punctuation mark. Thus the notion can be further examined by observing the punctuation marks' referencing areas as described earlier.
In general, each punctuation mark has certain observable characteristics as far as referencing areas are concerned. More importantly, LI and LO punctuation marks have different referencing characteristics that could be considered in distinguishing the punctuation mark types. The referencing area of a particular LI punctuation mark is not consistently fixed to one particular linguistic unit. That is, the referencing area of LI punctuation marks can vary widely depending on the particular use in a sentence. For example, a quotation mark can be used to quote only one character or an entire paragraph.
Another example would be to consider the uses of the comma in the following two sentences:
1. Consequently, we were able to obtain a passport for her travel.
2. This is an ultimate guide to Vancouver Island B.C., the best vacation destination.
In sentence #1, the comma is referencing the word "Consequently". However, in sentence #2, the comma is referencing the entire clause "This is an ultimate guide to Vancouver Island B.C.". Unlike LI punctuation marks, the referencing area of an LO punctuation mark is fixed to the areas immediately surrounding the lexical items. This means the referencing areas are usually fixed within the boundary of the neighboring character strings.
Also, note that the role of a punctuation mark is the major factor in deciding which category the punctuation mark should belong to. For example, if a period is used at the end of a sentence (full-stop), then it is an LI punctuation mark, and if it is used within a lexical item, then it is an LO punctuation mark. Since an LI punctuation mark can be used in the context of an LO punctuation mark, inspecting the surrounding text and computationally developing heuristic-based rules to distinguish functional roles is important. In this respect, all of the aspects of punctuation mark types and frequency count strategies discussed earlier aid in developing such heuristic-based rules.
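A heuristic-based rule of this kind might look as follows. It is a simplified sketch that inspects only the token itself; a real system would also need to examine the surrounding text.

```python
import re

# Assumed patterns: the x.x. abbreviation form and the comma-grouped
# number form discussed in this chapter.
X_DOT = re.compile(r'^([A-Za-z]\.){2,}$')
NUM_COMMA = re.compile(r'^\d{1,3}(,\d{3})+$')

def classify_mark(token, mark):
    """Classify one use of `mark` within `token` as LI or LO."""
    if mark == ".":
        if X_DOT.match(token):
            return "LO"   # period inside an abbreviation (B.C.)
        return "LI"       # otherwise assume a sentence-final full-stop
    if mark == ",":
        if NUM_COMMA.match(token):
            return "LO"   # comma inside a number such as 100,000
        return "LI"       # comma separating clauses or items
    return "unknown"
```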
As shown previously, a punctuation mark within the scope of an abbreviated character string would be an LO punctuation mark. On the other hand, the ending period would be considered an LI punctuation mark. A comma used within a number would be an LO punctuation mark, but a comma placed after a word, such as punctuation mark #2, is an LI punctuation mark. Almost all LI punctuation marks can also be used as LO punctuation marks. The reason is that a punctuation mark can be used to create and add a unique semantic sense, or meaning, to a word.
Also, as pointed out earlier, there are always special rules. For example, a comma is typically regarded as an LI punctuation mark, but when used in the context of a number it can be regarded as an LO punctuation mark (e.g., the comma in 100,000).
Furthermore, any punctuation mark can be altered to have a specific meaning in a domain-specific context that a user or system designer might be aware of. For example, a punctuation mark such as an exclamation mark is always used at a suffix position in natural language. For this reason, an exclamation mark is considered an LI punctuation mark.
On the other hand, in a computer programming language, even an exclamation mark could be modified to possess a unique semantic effect. The exclamation mark is sometimes used as the logical operator not. It can be categorized as an LO punctuation mark for this reason. In certain types of genres, punctuation marks can easily belong to multiple categories.
Nevertheless, in such cases the system has to rely on heuristics to determine the punctuation mark roles before deciding to normalize such punctuation marks.
Chapter Summary
This chapter provided an analysis of punctuation focusing on two datasets: Usenet and
Gutenberg. Based on the punctuation frequency analysis, this chapter demonstrated that comparing various types of punctuation marks can provide preliminary insights into understanding
the issues related to punctuation normalization. By and large, the frequency count strategy was
an effective means to expose potential issues that are associated with punctuation normalization.
One of the central approaches taken in categorizing the punctuation marks was to make
inferences based on the frequency counts of affix-based token types (prefix, suffix, and infix)
and non-affix-based token types (punctuation-exclusive). For example, based on the datasets we examined, a single-letter-infix-slash often appeared as an abbreviated character string. In characterizing punctuation mark usages in the datasets, frequency count results related to punctuation marks were reported in this chapter. Often the frequency count strategy helped to determine additional punctuation mark types. The punctuation pattern frequency counts showed
that a wide range of punctuation mark types and punctuation patterns exist in each dataset. A
myriad of interesting punctuation mark related patterns were discovered through the frequency
count approach.
In particular, identification of the affix-based types and frequency count based results
were effective means to gain an overall sense of punctuation mark roles. There were several
noticeable thematic patterns of analysis that emerged from examining the frequency counts and helped to identify additional punctuation mark and punctuation pattern types: domain-specific punctuation patterns, combinational punctuation mark patterns, punctuation marks in numbers, and abbreviations and non-standard words. In particular, based on punctuation mark referencing
areas, all of the discussed punctuation marks were categorized into two major types: LO
punctuation marks and LI punctuation marks. These types of punctuation marks and punctuation
patterns were used to analyze punctuation normalization phenomena in the subsequent
procedures.
CHAPTER 5
THE NORMALIZATION STRATEGY ANALYSIS
This chapter discusses the results of the punctuation normalization analysis. Based on the knowledge accumulated from the previous chapters, this analysis focused on punctuation normalization issues from a holistic perspective. Also, by analytically observing punctuation normalization phenomena, the procedures were used as a means to discover the implications of punctuation normalization. They comprise the following analytical procedures:
• Punctuation normalization types
• Search query types and effects of punctuation normalization
• The normalization demonstration
These procedures were extended from the first part of the punctuation frequency analysis procedures that were presented in the previous chapter. Yet, it was not appropriate to include these procedures as part of the punctuation frequency analysis. Unlike the previous punctuation frequency analysis procedures, these procedures were designed to deal with the issues related to punctuation normalization more directly. At the end of this chapter, the implications of all of these analytical procedures are presented in the form of a discussion.
Punctuation Normalization Types
Identifying punctuation normalization types is the first analytical procedure. In this procedure, attempts were made to identify different types of punctuation normalization. In some instances,
the types of punctuation normalization were identified by observing the normalizations that
could be applied to the text in the datasets. In other cases, characterizing the procedures that were needed to normalize punctuated terms themselves led to the identification of a punctuation normalization strategy.
Punctuation marks were identified as either Lexical Oriented (LO) or Lexical
Independent (LI), and the punctuation normalization problems could be approached using these two major punctuation mark types. However, before discussing the punctuation mark normalization types, it is necessary to review the concept of punctuation normalization operations. Punctuation normalization operations and blank space normalization operations are identified as necessary procedures in the framework of punctuation normalization types.
Punctuation Normalization Operations
Punctuation normalization operations refer to the normalization operations more specifically pertaining to punctuation marks. Primitive types of punctuation normalization operations can include:
• Add a punctuation mark
• Remove a punctuation mark
• Replace a punctuation mark
Similarly, two basic blank space normalization operations are:
• Add a blank space
• Remove a blank space
Operations to normalize punctuation marks are tightly integrated with operations to normalize blank spaces. Based on the basic normalization operations of punctuation marks and blank spaces, more combinational operations could be formulated. Operations that involved both punctuation marks and a blank space could include the following:
• Remove-contract - delete a punctuation mark and delete a blank space
• Add-expand - add a blank space and add a punctuation mark
Thus, both punctuation marks and blank spaces could be normalized simultaneously in some
normalization operations.
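These primitive and combined operations can be sketched in code. The function names are our own, and each function corresponds to one of the operations described above.

```python
def remove_suffix(token, mark):
    """Remove a suffix punctuation mark (z◊ -> z)."""
    return token.rstrip(mark)

def remove_infix(token, mark):
    """Remove an infix punctuation mark, replacing it with a blank
    space so that multiple tokens are formed (z◊z -> z z)."""
    return token.replace(mark, " ")

def remove_contract(token, mark):
    """Remove-contract: delete the punctuation mark without leaving a
    blank space, concatenating the remainder (z◊z -> zz)."""
    return token.replace(mark, "")

def add_expand_suffix(token, mark):
    """Add-expand: add a blank space and then the punctuation mark
    (z -> z ◊)."""
    return token + " " + mark
```

For example, `remove_infix("health-care", "-")` yields the two tokens "health care", whereas `remove_contract("health-care", "-")` yields the single token "healthcare".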
Table 25 shows examples of punctuation normalization operations. As shown in this
table, there are punctuation mark affixes: prefixes, suffixes, and infixes. On the right side of the table, the character strings before and after the punctuation normalization operation are shown.
The column that is labeled Token Form after Normalization refers to the changes to the existing token forms after punctuation normalization. Thus, the token form shows a result of changes to the blank spaces and the number of tokens. As before, let us assume a blank space based tokenization method in identifying the number of tokens.
It is useful to observe the token form after punctuation normalization since the number of tokens sometimes can change. As a result of punctuation normalization, sometimes the tokenization points, the points that are used to break a stream of text into tokens, are also changed. The
ability to accurately match user’s search query terms to the terms in the content may depend on
how a term is tokenized and indexed. For this reason, we need to observe these types of
variations carefully. Based on blank space tokenization, after applying a punctuation
normalization procedure, three token forms are possible:
1. Unchanged
2. Single token to multi-token
3. Multi-token to single-token
In the first case, if an exclamation mark ! is removed from the token Yahoo!, then the resulting token would be Yahoo. Since the token Yahoo! is a single token, the punctuation mark removal has not changed the token form. In the second case, if the hyphen is removed from the multi-part word health-care, based on blank space tokenization, the resulting tokens would be health and care. Since two separate tokens are formed, we refer to this phenomenon as single token to multi-token. For the third case, consider the character string health care. If we remove the blank space and concatenate the remaining tokens, then we end up with just a single token. We refer to this phenomenon as multi-token to single-token.
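Under blank space tokenization, the three token forms can be detected by comparing token counts before and after normalization. The helper below is an illustrative sketch.

```python
def token_form_change(before, after):
    """Label the token form change, assuming blank space tokenization."""
    n_before, n_after = len(before.split()), len(after.split())
    if n_before == n_after:
        return "unchanged"
    if n_before < n_after:
        return "single token to multi-token"
    return "multi-token to single-token"
```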
Table 25.
Common Punctuation Mark Operations

Punctuation  Primitive             Blank Space    Token Form After             Before         After
Affixes      Operations            Normalization  Normalization                Normalization  Normalization
Prefix       remove prefix ◊       N/A            unchanged                    ◊z             z
             remove                remove         multi-token to single token  ◊ z            ◊z
             add blank space       add            single token to multi-token  ◊z             ◊ z
Suffix       remove suffix ◊       N/A            unchanged                    z◊             z
             add suffix ◊          N/A            unchanged                    z              z◊
             add-expand suffix ◊   add            two tokens                   z              z ◊
Infix        remove infix ◊        unchanged      single token to multi-token  z◊z            z z
             remove-contract       removed        unchanged                    z◊z            zz (or z)
             infix ◊
Note: ◊ stands for repeatable punctuation mark(s)
This example demonstrates how an operation to normalize blank spaces can be integrated into a punctuation normalization process. Normalizing blank spaces to a standard form is referred to as blank space normalization. Blank space normalization can be simply viewed as a process of normalizing blank spaces to reduce blank space inconsistencies. For example, we have to consider both character strings c=a+b and c = a + b; not normalizing the inconsistent spacing may cause tokenization issues, and furthermore, it may cause indexing and retrievability problems.
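A simple blank space normalization routine might collapse whitespace runs and standardize the spacing around operators so that both forms tokenize identically. The canonical form chosen here (no spaces around operators) is an assumption made for illustration.

```python
import re

def normalize_blank_space(text):
    """Normalize blank spaces: collapse runs of whitespace and drop
    spaces around arithmetic operators."""
    text = re.sub(r'\s+', ' ', text.strip())        # collapse whitespace runs
    text = re.sub(r'\s*([=+*/-])\s*', r'\1', text)  # drop spaces around operators
    return text
```

With this routine, both c=a+b and c = a + b normalize to the same character string and therefore index identically.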
As shown in Table 25, with respect to blank spaces, the token form needs to be taken into account in the cases when multi-token to single-token and single token to multi-token changes occur.
Using the z notation, the normalized form z z could represent two tokens that are separated by a
blank space. Also, using the z notation, the normalized form zz could represent a resulting single
token. For example, healthcare is zz. The left side z represents the word health while the right
side z represents care. This token form could be expressed as z since they are all alphanumeric
characters. The diamond symbol ◊ stands for one or more punctuation marks. Notice that in
these two types of cases, the number of tokens could change because of the blank space whereas
in other cases the number of tokens is unaffected.
As shown in this table, various operations might be required to carry out punctuation
normalization. From an operational standpoint, often the decision in regards to suffix and prefix-
punctuation marks was whether to remove or leave punctuation marks intact. Typically, these
decisions are dependent upon a number of factors such as punctuated character string,
punctuation mark types, search query types, etc.
For infix-punctuation marks z◊z, the normalization operation was slightly different due to
the fact that a blank space could be formed after removing the punctuation mark. In effect,
multiple tokens could be formed as a result. For example, a hyphenated token z-z could be
transformed to z z. An alternative option was to remove infix-punctuation marks and then
concatenate the remaining tokens together to form a single token. In this manner, the final form
of normalized tokens could be represented as zz. Thus, normalizing a hyphenated character string by using the remove-contract infix operation would transform health-care to healthcare.
One of the effects of such a remove-contract operation is that a token can become concatenated
without containing any hyphen.
For prefix-punctuation marks and suffix-punctuation marks, which are represented as ◊z and z◊ respectively, the number of tokens is not altered as a result of punctuation normalization.
However, adding blank spaces between the alphanumeric characters and the punctuation marks
would alter the number of tokens.
In general, there were some distinctive characteristics that could be observed with affix-
based punctuation mark types. These include the following:
• LI punctuation marks were more closely associated with suffix-punctuation marks and prefix-punctuation marks
• LO punctuation marks were more closely associated with exclusive punctuation marks
• Most infix-punctuation marks could be regarded as LO punctuation marks
Also, as suggested earlier, an interesting finding from observing the above punctuation
mark operations was the effect of punctuation normalization with respect to blank spaces. That is, blank spaces could be characterized as a non-punctuation element that often requires some
type of normalization. As in the case of c = a + b, blank space normalization is particularly
required for tokens that contain blank spaces.
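Blank space normalization for a punctuation-exclusive string such as c = a + b can be sketched minimally as follows; the helper name is illustrative, not from the study.

```python
def contract_blank_spaces(s):
    """Blank space normalization: remove internal blank spaces so that
    a punctuation-exclusive string such as 'c = a + b' becomes 'c=a+b'."""
    return s.replace(" ", "")

print(contract_blank_spaces("c = a + b"))  # c=a+b
```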
Further, punctuation normalization differs depending on the particular type of
punctuation mark. Since punctuation marks could be differentiated depending on their roles and
their referencing areas, punctuation normalization could be categorized into the following:
• LO punctuation normalization
• LI punctuation normalization
Lexical Oriented (LO) Punctuation Normalization
In applying LO normalization, the affected area is limited to lexical items. That is, LO punctuation normalization usually does not affect more than two or three tokens. For example, for infix-hyphens and infix-apostrophes, typical normalization procedures are applied at a lexical level. Because of this, all character strings associated with infix-punctuation marks are usually subject to LO punctuation normalization. Almost all infix-punctuation marks are LO punctuation marks, since LI punctuation marks are normally placed in prefix or suffix positions.
For example, hyphenated character strings and character strings that contain apostrophes are subject to LO punctuation normalization because of the infix positions of those marks.
The effects of normalizing LO punctuation marks, however, depend on the particular form of the character string. We have shown how one form of a character string could be semantically related to another form where the only difference is the punctuation mark. For example, the character strings B.C. and BC could be semantically equivalent to each other. Other forms of string variations should be considered if an aggressive approximate search query needs to be facilitated. Possible equivalent character strings to B.C. could include variations such as British Columbia, Before Christ, and Boston College. To determine the contextual usage of such abbreviated character strings, LO punctuation normalization could be coupled with a named entity recognition type of algorithm.
Table 26 is an example of LO normalization that shows possible normalized punctuated character strings. Some of the affix-based punctuation mark types are listed in the
Punctuation Mark Types column. By observing the un-normalized and normalized character strings, it is possible to formulate punctuation normalization operations in such a way that they produce the corresponding normalized character strings.
Table 26.
LO Normalization Example
Un-Normalized String | Punctuation Mark Types | Type of String Form Before Normalization | Possible Normalization Operations → Possible Normalized Character Strings
b/c | infix-slash; single-letter word | non-standard abbreviation | none → b/c; dictionary lookup → because
B.C. | infix-period; suffix-period | abbreviation | none → B.C.; remove suffix-period & remove-contract infix-period → BC
C++ | suffix-plus | typical word in a dictionary | none → C++; dictionary lookup & remove-contract → Cplusplus
l-o-u-d-e-r!! | infix-hyphen; suffix-exclamation | non-standard exclamation | remove-contract infix-hyphen & remove suffix-exclamation → louder
90-100 | infix-hyphen | number (range) | remove-contract → 90-100; add internal blank spaces (before & after) → 90 - 100
cat -v | prefix-hyphen | computer command | none → cat -v
c = a - b | punctuation-exclusive | mathematical equation | remove-contract blank space → c=a-b
health-care | infix-hyphen | multi-word | none → health-care; remove hyphen → health care; remove-contract → healthcare
re-read | infix-hyphen | prefix “re” | none → re-read; remove-contract → reread
Shi’ite | infix-apostrophe | person/people apostrophe | unchanged → Shi’ite; remove-contract → Shiite
Comparing an un-normalized punctuated character string with a normalized punctuated character string is often helpful in this regard. Similarly, we need to consider possible string
variations first. For example, for the first character string b/c, the possible normalized character
strings are b/c and because. Note that these are not the only variations of character strings that could
be produced.
In this example, several types of normalization operations can be formulated:
• Remove [options]
• Add [options]
• Remove-contract [options]
The valid options could be affix-based punctuation marks or a particular type of blank space (e.g.,
an internal blank space before and after a punctuation mark). There is no exact formula for creating
this type of list, as we have only used a limited number of character strings. Also, as discussed
later, some of the normalization procedures depend on the type of search query. The point of
this example is to illustrate the various punctuation normalizations that need to be applied in
producing normalized character strings.
LO Normalization and Orthographic Comparison of Strings
A step before normalizing LO punctuation marks is to perform an orthographic
comparison of character strings. The two most basic criteria of orthographic features that can be
selected are punctuation marks and alphanumeric characters. Although alphanumeric characters
of character strings are a factor, we focused on the punctuation mark part as a basis for its potential application. We call this distance value the punctuated string distance (PSD) in this study. A PSD can be defined as a score of how similar a non-punctuated character string and a punctuated character string are. The basic objective of defining a PSD value is to incorporate it into a punctuation normalization procedure.
Figure 16 shows pertinent components related to an orthographic comparison procedure.
Some of the procedures shown in this figure are fairly common to a typical text retrieval system.
Figure 16. LO normalization and orthographic comparison environment.
However, some of the procedures are unique. Punctuation normalization, which can be viewed as one type of text normalization as pointed out in Chapter 1, is placed in this figure. It
is divided into LI and LO punctuation normalization; collectively, they can be considered
text normalization. The calculation of the PSD can be further incorporated into an LO punctuation
normalization process. The LI and LO punctuation normalizations might have to be processed
iteratively in some cases, since some character strings might be surrounded by LI punctuation
marks such as a comma. For instance, the token ‘B.C.,’, which contains a pair of single quotation
marks, needs both LI and LO punctuation normalization procedures to normalize a number of
punctuation marks: the single quotes, the comma, and the periods. The normalization procedures may have to be processed iteratively until all punctuation marks are correctly processed. Moreover, the normalization procedure might need some type of precedence rule in terms of which normalization occurs first. As in the case of the character string
‘B.C.,’, the LI punctuation marks typically appear farther outside than the LO punctuation marks.
As is discussed later, disambiguation of a punctuation mark role is important in applying
correct punctuation normalization procedures. Thus, LI and LO punctuation normalization often
need to be examined collectively. As a result, building an index would depend on the character
strings produced by LI and LO punctuation normalization procedures.
In Figure 16, the orthographic comparison procedure, shown in the shaded circle, is an
optional component that could be coupled with the LO punctuation normalization procedure.
For this orthographic comparison procedure, the PSD can be used; measuring the PSD is
discussed later. Once a PSD value is obtained, an LO punctuation normalization procedure can
take the quantified value into consideration in handling punctuation mark variations between two
character strings.
During the process of building indexes, conditional rules can be used to check whether the
character strings meet certain orthographic distance criteria. The figure also shows indexing and
matching procedures as critical components surrounding the LO punctuation normalization
process. As pointed out in Chapter 2, various indexing schemes exist. One option is to
allow a tree-like structure (e.g., a suffix tree) that would not require indexing individual words. Yet
word-based indexing, or lexical-item-based indexing, was assumed in this study.
As shown in the figure, distance measurement can be applied to both un-normalized
character strings and normalized character strings. Notice that an un-normalized punctuated
character string and a normalized punctuated character string could be differentiated by a number
of punctuation marks and by the spellings of the character strings. Since spelling only considers the
standard alphabet, we could optionally expand the notion to include alphanumeric characters.
This means orthographic measurement can be devised by using a number of orthographic
features such as:
• Punctuation mark differences
• Alphanumeric differences of character strings
However, for simplicity’s sake, in applying orthographic measurement for our case, we
will only examine character strings that have an equal set of alphanumeric characters. Thus,
misspellings of alphanumeric characters are not considered (e.g., helth-care instead of health-care). This problem is often referred to as a string similarity problem; more specifically, it is considered an approximate string matching problem (Navarro, 2001). To be semantically equivalent, it is not necessary to have an equal set of alphanumeric characters. For example, the character strings B.C. and BC have an equal set of alphanumeric characters. We only focused on investigating the distance involving punctuation marks. Yet the applicability of combining both punctuation mark differences and spelling differences into a single distance measurement needs to be investigated.
Despite the numerous available string similarity metrics, we decided to investigate the potential application of the Levenshtein distance (Levenshtein, 1966) and the Stoilos distance (Stoilos et al.,
2005). First, the Levenshtein algorithm is popular and has many applications, as
Wichmann and Grant (2012) pointed out. The idea behind the Levenshtein distance is that the distance can be calculated by counting the number of steps to transform one character string into another. Each operation (insertion, removal, or substitution) is considered one step.
That is, the Levenshtein distance is the least-sum of the costs required to transform one character string into another. Here, we only suggest the possibility of applying it in the context of measuring the PSD.
For example, let us calculate the Levenshtein distance between the two terms B.C. and
BC. The minimum editing steps that are required to change one character string into the other are the following:
• B.C. → BC. (remove infix-period) +1 edit
• BC. → BC (remove suffix-period) +1 edit
By adding up these values, 2 is obtained. This value is the Levenshtein distance between the two terms B.C. and BC. Note that this example does not fully illustrate the
Levenshtein distance algorithm, as it does not show rigorous algorithmic details. The most convenient way to implement the Levenshtein algorithm is to use what is known as dynamic programming. An extended description of the algorithm and an application example of the
Levenshtein distance algorithm can be found in Haldar and Mukhopadhyay (2011).
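A dynamic-programming implementation along the lines described above might look like the following sketch; Python is used purely for illustration.

```python
def levenshtein(s1, s2):
    """Classic dynamic-programming Levenshtein distance: the minimum
    number of insertions, removals, and substitutions needed to turn
    s1 into s2, each operation counting as one step."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # remove all of s1's first i characters
    for j in range(n + 1):
        d[0][j] = j          # insert all of s2's first j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # removal
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

print(levenshtein("B.C.", "BC"))  # 2, matching the two-step example above
```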
To apply the PSD correctly, we need to examine how punctuated character strings and non-punctuated character strings are semantically equivalent, semantically close, or semantically different. First, let us consider character strings that contain only a single punctuation mark:
• gov’t versus govt
• don’t versus dont
• 15 in. versus 15 in
• Yahoo! versus Yahoo
For each of the above paired character strings, the Levenshtein distance is 1. Although the normalized character strings listed above, shown on the right side, are not in a
grammatically correct form, they maintain a close semantic sense in relation to the character strings shown on the left side.
The Levenshtein distance itself can be normalized. Normalizing the Levenshtein
distance incorporates the length of the character strings and, at the same time, makes it more
convenient to compare the distance value across other pairs of character strings. Yujian and Bo
(2007) described their version of the normalized Levenshtein distance metric; the normalized
distance formula is:
Norm.Lev(S1, S2) = Lev(S1, S2) / max(length(S1), length(S2))

Lev is the Levenshtein distance value. Unlike the Levenshtein distance value, the normalized
Levenshtein distance, shown as Norm.Lev(S1, S2), has a value from 0 to 1. The maximum value
is 1, which indicates that there is no match between the two character strings S1 and S2.
However, if we assume an equal set of letters, the value of 1 is not obtained. A distance value
closer to 0 indicates that the two character strings are extremely close.
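The normalized distance can be sketched as follows; a compact Levenshtein helper is included so the example is self-contained, and both function names are illustrative.

```python
def lev(a, b):
    # Compact dynamic-programming Levenshtein distance (two rows).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # removal
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def norm_lev(s1, s2):
    """Normalized Levenshtein: Lev(S1,S2) / max(length(S1), length(S2))."""
    return lev(s1, s2) / max(len(s1), len(s2))

print(norm_lev("gov't", "govt"))  # 0.2, as in Table 27
```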
The second distance formula is the Stoilos distance measurement (Stoilos et al., 2005).
The Stoilos commonality measurement was used to measure the PSD since it seemed
useful in incorporating a number of matching substrings into formulating the string distance.
The Stoilos commonality measurement is given as the following:

Comm(S1, S2) = 2 * Σ_i length(max_component_sub_i) / (length(S1) + length(S2))

Comm(S1, S2) is the Stoilos commonality measurement value of strings S1 and S2.
max_component_sub_i is an individual matching substring component, where i indexes each matching substring. Let us consider the two character strings gov’t and govt. We have
max_component_sub_1 as 3 because of the subcomponent gov, and max_component_sub_2 as 1 because of the subcomponent t. The length of gov’t, length(gov’t),
is 5; the length of govt, length(govt), is 4. Therefore, Comm(gov’t, govt) is 8/9, which is
0.89. The Stoilos commonality measurement can have a value ranging from 0 to 1. However, in
contrast to the normalized Levenshtein distance, a higher value indicates that the
two character strings are orthographically closer, whereas a lower value indicates that the two
character strings are more orthographically distant.
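The commonality computation, which repeatedly removes the longest common substring and sums the matched lengths, can be sketched as follows. This is a plain, unoptimized illustration of the formula above, not a faithful reimplementation of the original authors' code.

```python
def commonality(s1, s2):
    """Stoilos-style commonality: repeatedly remove the longest common
    substring from both strings, summing the matched lengths, then
    compute 2 * total / (len(s1) + len(s2))."""
    total, a, b = 0, s1, s2
    while True:
        best = ""
        # Brute-force longest common substring of the remainders.
        for i in range(len(a)):
            for j in range(i + 1, len(a) + 1):
                if a[i:j] in b and j - i > len(best):
                    best = a[i:j]
        if not best:
            break
        total += len(best)
        a = a.replace(best, "", 1)
        b = b.replace(best, "", 1)
    return 2 * total / (len(s1) + len(s2))

print(round(commonality("gov't", "govt"), 2))  # 0.89, i.e. 8/9 as above
```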
Table 27 is an example that shows the PSD using the normalized Levenshtein distance
and the Stoilos commonality measurement. Based on the PSD values, each pair of character
strings shown in this table is semantically close. It can be observed that as the length of the character string increases, the value of the Stoilos distance increases slightly. On the other hand, the value of the normalized Levenshtein distance decreases somewhat proportionally. From this example, there seem to be three different types of correlations:
• Between the length of the character strings and the Stoilos distance,
• Between the length of the character strings and the normalized Levenshtein distance
• Between Stoilos distance and the normalized Levenshtein distance
The key question is: What do the distance values indicate, and how can we interpret them? In some cases, the punctuation marks in a punctuated character string play a more vital role in maintaining the original meaning of the character string. In such cases, when such a punctuation mark is removed, the meaning of the changed character string and that of the original character string are completely different. The changed character string can convey a completely different sense from the original. For example, consider the following pairs:
• C++ versus C
• A-B versus AB
• a=b+c versus abc
Table 27.
The Punctuated String Distance (PSD) Example
Un-Normalized Character String (U) | Normalized Character String (N) | Levenshtein Distance (U,N) | Max Length | Normalized Levenshtein Distance | Stoilos Distance (Commonality)
gov’t | govt | 1 | 5 | 0.2 | 0.89
don’t | dont | 1 | 5 | 0.2 | 0.89
15 in. | 15 in | 1 | 6 | 0.17 | 0.91
Yahoo! | Yahoo | 1 | 6 | 0.17 | 0.91
The left side shows the un-normalized character strings, while the right side shows the
normalized character strings. Notice that the length of character strings is relatively short in this
example. According to Sproat et al. (2001), and as mentioned in Chapter 2, all the above
punctuated character strings are non-standard.
In contrast, let us consider another example with following pairs of character strings:
• thousand++ versus thousand
• health-care versus health care
• Shi’ite versus Shiite
As can be seen in this example, all of the pairs of character strings indicate a sense of semantic “closeness”. Marton et al. (2009) pointed out that two character strings are considered
to be semantically close if there is a lexical semantic relation between them. As they noted:
“relation may be a classical relation such as hypernymy, troponymy, meronymy, and antonymy,
or it may be what have been called an ad-hoc non-classical relation” (Marton et al., 2009, p. 776).
Their definition of the semantic “closeness” seems like a reasonable one to adopt for this study.
In comparing these examples, as the length of the character string increases, the punctuated character string seems to become closer in sense to the non-punctuated character string. This
observation, however, is based only on these instances.
Table 28 shows the PSD comparison of short character strings and long character strings.
In this table, punctuation normalization was applied to both short character strings and long
character strings.
Table 28.
The PSD Comparison Example
Un-Normalized String (U) | Normalized String (N) | Levenshtein Distance (U,N) | Max Length | Normalized Levenshtein Distance | Stoilos Distance (Commonality)
Short character strings:
C++ | C | 2 | 3 | 0.67 | 0.5
A-B | AB | 1 | 3 | 0.33 | 0.8
a=b+c | abc | 2 | 5 | 0.4 | 0.75
(average) | | 1.67 | 3.67 | 0.47 | 0.68
Long character strings:
thousand++ | thousand | 2 | 10 | 0.2 | 0.89
health-care | healthcare | 1 | 11 | 0.09 | 0.95
Shi’ite | Shiite | 1 | 7 | 0.14 | 0.92
(average) | | 1.34 | 9.34 | 0.14 | 0.92
Relative % difference between the two groups | | 20% | 60.71% | 69.29% | 25.72%
The normalized Levenshtein distance value is much higher for the first group (short character strings) than for the second group (long character strings). In the opposite manner, the Stoilos
measurement value is much higher for the second group (long character strings)
than for the first group (short character strings). In terms of the average Levenshtein distance, there was not a substantial relative percentage difference between the two groups. However, in terms of the normalized Levenshtein distance, the relative percentage difference indicated a considerable difference between the two groups: 69.29% for the normalized Levenshtein distance versus 25.72% for the
Stoilos distance.
Based on the Stoilos distance and the normalized Levenshtein distance, it is useful to rank the items for the purpose of observing their patterns. Ordered from lowest to highest value, the ranked list based on the normalized Levenshtein distance is the following:
1. health-care versus healthcare
2. Shi’ite versus Shiite
3. thousand++ versus thousand
4. A-B versus AB
5. a=b+c versus abc
6. C++ versus C
Similarly, we could rank the pairs of character strings according to the Stoilos commonality distance values. In contrast to the above distance values, a higher value indicates that the pair is semantically closer. Ordered from highest to lowest value, the ranked list is the following:
1. health-care versus healthcare
2. Shi’ite versus Shiite
3. thousand++ versus thousand
4. A-B versus AB
5. a=b+c versus abc
6. C++ versus C
Comparing the two ranked lists, no difference between the two ranked orders can be found.
Yet the ranked order could be slightly different depending on which of the two methods was
used. Regardless of which method was used, the longer character strings ranked higher than the
shorter character strings. However, we cannot be certain that this pattern holds, since we have
generalized it from just a few example cases.
Semantically, the following observations can be made. If the PSD value is very low, in
the case of the normalized Levenshtein method, the semantic effects of punctuation
normalization seem to be minimal. On the other hand, if the normalized Levenshtein value is
very high, then the semantic effects of punctuation normalization seem to be substantial.
However, these observations need to be verified by conducting additional experiments involving a much larger number of character strings, with more calculations similar to those in Table 28. Moalla et al. (2011) investigated the applicability of the Levenshtein and
Stoilos distances for spell-checking applications. They considered using a threshold value
(e.g., Lev < 0.2 AND Stoilos > 0.8) for correcting misspelled words. Although the specific
application was different, such a threshold value could be useful in building an index or
establishing matching rules in a text retrieval system component.
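Such a combined threshold rule might be sketched as follows. The function name and the idea of taking precomputed PSD values as inputs are illustrative; the thresholds mirror the example above but would need to be tuned empirically.

```python
def accept_as_variant(norm_lev_value, stoilos_value,
                      lev_threshold=0.2, stoilos_threshold=0.8):
    """Threshold rule in the spirit of Moalla et al. (2011):
    treat two strings as variants of each other when the normalized
    Levenshtein distance is low AND the Stoilos commonality is high.
    Both inputs are assumed to be precomputed PSD values."""
    return norm_lev_value < lev_threshold and stoilos_value > stoilos_threshold

# health-care vs healthcare (values from Table 28): accepted
print(accept_as_variant(0.09, 0.95))  # True
# C++ vs C (values from Table 28): rejected
print(accept_as_variant(0.67, 0.5))   # False
```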
So far we have explored the notion of the PSD by adopting existing methods that
measure orthographic distance. In general, an orthographic distance value might indicate the
likelihood that character strings are related to each other. If such measurements are properly
designed and applied, they could be incorporated into a punctuation normalization process.
However, the applicability of the PSD measurement still needs to be investigated further to
produce a meaningful pattern. In doing so, we need to investigate the problems using various
types of tokens and different lengths of tokens. It should be stressed that, in spite of the usefulness of the PSD, applying the distance measurement to understand the semantic effects of punctuation normalization is still limited. Instead, the PSD should be considered an attribute that can be combined with punctuation normalization, as mentioned earlier.
In a larger sense, although the notion of the PSD was mainly discussed, the more important point of this discussion was the general applicability of orthography as a whole. The
PSD metric is only one type of measurement that seems to be useful for measuring punctuation mark differences. We could also consider measuring distance values involving alphanumeric-based character strings in conjunction with punctuation marks. The application of such a metric, however, needs further investigation.
Once the terms are normalized, matching can take place based on four types of combinations:
• Non-punctuated character strings in the search query to punctuated character strings in the content
• Non-punctuated character strings in the search query to non-punctuated character strings in the content
• Punctuated character strings in the search query to non-punctuated character strings in the content
• Punctuated character strings in the search query to punctuated character strings in the content
Lexical Independent (LI) Normalization
While LO normalization applies to LO punctuation marks, LI normalization applies only to LI punctuation marks. In Chapter 4, we discussed how to distinguish LO punctuation
marks from LI punctuation marks. Since LI punctuation marks have a separating role and a delimiting role, the LI punctuation mark role is not bound to any lexical item.
Table 29 is an example showing the results of LI normalization. A sentence was selected from the Gutenberg dataset to describe the difference between LO normalization and LI
normalization. The example deals with only a small set of punctuation marks. As shown in the
table, LI punctuation normalization was applied by simply removing the punctuation marks. The results of LI punctuation normalization differ depending on the two types of tokenization:
• Blank space based tokenization
• Lexical item based tokenization
After applying LI punctuation normalization operations, there is a need to keep track of the punctuation normalization positions. By doing this, adverse effects of punctuation normalization
can be avoided. These positions are indicated by the double vertical line ||. Such positions
could be physically tagged by a retrieval system as part of applying LI punctuation
normalization. In the table, the complete views of all the available tokenization options are not
shown. To simplify this example, the two tokenization options are blank space based
tokenization and word-based tokenization. In this example, in option #1, the string 14 in. is
separated into the two tokens 14 and in., whereas in option #2, the string 14 in. is not separated into
two tokens. In option #2, we assume that it is the text retrieval system’s responsibility to detect
and tokenize the terms into proper lexical items. Let us assume that such tokenization could be
arranged, since the details of the tokenization process are not provided here. The effects of LI punctuation normalization are also discussed later in this chapter.
Table 29.
An Example of LI Punctuation Normalization Results
Un-normalized sentence:
She is completely belted with 14 in. armor, with a 15 in. backing, and has the central battery armored with plates of 91/2 in. in thickness.

Option #1 (blank space based tokenization with LI normalization):
She|is|completely|belted|with|14|in.|armor||with|a|15|in.|backing||and|has|the|central|battery|armored|with|plates|of|91/2|in.|in|thickness||

Option #2 (lexical item based tokenization with LI normalization):
She|is|completely|belted|with|14 in.|armor||with|a|15 in.|backing||and|has|the|central|battery|armored|with|plates|of|91/2 in.|in|thickness||

Normalized punctuation marks:
remove comma-suffix after “armor”; remove comma-suffix after “backing”; remove period-suffix after “thickness”

Note: || indicates LI normalization points, whereas | indicates tokenization points.
Once again, the role of each punctuation mark in this sentence is worth noting. In this
example, commas and periods are considered LI punctuation marks. However, both periods and
commas can be LO punctuation marks when they are used with numbers (e.g., the decimal point and
the comma in 1,000). In essence, disambiguation of punctuation marks is a prerequisite for
applying punctuation normalization procedures.
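Option #1 in Table 29 can be sketched as follows. This is a simplified illustration: the abbreviation set stands in for the punctuation mark disambiguation step that the chapter assumes has already taken place, and the helper name is hypothetical.

```python
ABBREVIATIONS = {"in."}  # assumed to be pre-disambiguated LO tokens

def li_normalize(sentence):
    """Blank space based tokenization with LI normalization:
    strip a trailing LI comma or period and mark the normalization
    point with an extra bar, so it appears as || after joining."""
    out = []
    for tok in sentence.split():
        if tok not in ABBREVIATIONS and tok[-1] in ",.":
            out.append(tok[:-1] + "|")  # LI normalization point
        else:
            out.append(tok)
    return "|".join(out)

print(li_normalize("belted with 14 in. armor, with a 15 in. backing,"))
# belted|with|14|in.|armor||with|a|15|in.|backing|
```

Note how in. survives as a token because it was disambiguated as an abbreviation, while the LI comma after armor is removed and tagged.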
Search Query Types and the Effects on Punctuation Normalization
Search query types and their effects on punctuation normalization constitute the second analytical
procedure incorporated in this study. In this procedure, attempts were made to identify
search query types along with their effects on punctuation normalization. We noticed that the
effects of punctuation normalization could be described in terms of different types of search
queries. First, for simplicity’s sake, let us only consider short search queries (2 to 3 words). To
investigate punctuation normalization problems, the user’s search queries were categorized into two types:
• Punctuated queries versus non-punctuated queries
• Exact search queries versus approximate search queries
A punctuated query is a query that contains at least one punctuated token. In contrast, as
the name suggests, a non-punctuated query is a query that does not contain any punctuated token.
It was possible to further categorize a punctuated query into an LI punctuated query and an LO
punctuated query. A punctuated query can contain both LI punctuation marks and LO
punctuation marks.
In identifying the types of search queries, the notion of exact search query and
approximate search query had to be examined. Given a variety of user’s search queries, by
conceptualizing and comparing these two search types, we can analyze the search results even
further for the purpose of analyzing punctuation normalization effects. Furthermore, we can
identify the type of scenario in which the user can maximize the particular search feature. Let us
assume that an exact search query requires a search function that is exact, while an approximate
search query is possible with an approximate search function.
The notion of “exactness” could be subjective and imprecise due to the consideration of punctuation marks in the text. The question that could be raised was: Does “exact” mean only a continuous stream of alphanumeric characters, or does “exact” actually mean consideration of all strings including punctuation marks as they appear? Since this requires a decision, in this study, let us assume that an exact search query requires an exact sequence of all of characters: alphanumeric characters and punctuation marks.
Also, there was a question of “exactness” in terms of matching different units of search query. A typical search query could be comprised of one or more units:
• Sub-words (e.g., affixes or free-morphemes in a hyphenated-lexical item)
• Blank space based tokens
• Lexical items
• Phrases (e.g., more than two blank space based tokens)
The question that can be asked is: Does matching only a subpart of the above text units comprise an exact search query? In this study, to be considered an exact search query, the entire unit of the exact search query specified by a user needs to be matched exactly. If a user’s search query is specified at the phrase level, then only matching the phrase is considered an exact match. On the other hand, if the user’s exact search query has only one token (e.g., B.C.), then only matching that particular token is considered an exact match.
Another issue that needs to be determined is capitalization. The next question that can be asked is: Does a case-insensitive search still comprise an exact search query? Although it could be defined either way, as suggested earlier, we chose to disregard case sensitivity in this study.
Since we assume an exact search is not case sensitive, an exact search query will produce case-insensitive results. In practice, however, case-sensitive searches should be optionally provided by a text retrieval system.
So far we have defined an exact query as a query that is intended to match terms exactly as they appear. Under this definition, all punctuation marks would be matched as they appear. As
we stated, the only caveat to this definition is that the capitalization of the search query is not treated differently. In general, exact searches offer a user optional searching capabilities. This
alternative searching method can be sought by a user when a default search fails to meet the
user’s expectation.
Unlike exact search queries, approximate search queries are intended to approximately
match the character strings on the content side to the character strings in the search query. For
example, if a user enters the search query BC, we may match both BC and B.C. by assuming an
approximate searching method. In addition, B.C. could be semantically matched with other
forms of lexical items such as British Columbia and Boston College. However, for the purpose
of this discussion, let us set aside the issue of matching character strings that are semantically
equivalent but contain different spellings.
For exact search queries, the most pertinent question is: What are the effects of matching
punctuation marks for exact search queries? For exact search queries, the punctuation
normalization requirement is straightforward, especially under the definitions we have just provided:
the system merely has to support matching exactly as the query appears. Thus, punctuation marks need not be
normalized. For an exact search query type, the possible normalized character strings are shown as an
expanded list of character strings in some cases, since we would want to be able to generate more
matches than conservative approximate searching would allow.
As shown in Table 30, search query types can be punctuated or non-punctuated. To support case-insensitive searches, we have “folded” text to all lower case. By applying case-folding, terms behave as case insensitive during search and retrieval. Regardless, there is an obvious need for punctuated search queries in some cases. For a character string such as C++, it is difficult to imagine a situation where a user might want to type plusplus instead of the plus signs ++ as a search query term. There seems to be no reasonable way to specify the query other than by using the punctuated character string C++.
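The behavior just described can be sketched in a few lines of Python. The token list and function name are illustrative assumptions, not part of any particular retrieval system; the point is only that case is folded while punctuation is matched as it appears.

```python
# A minimal sketch of an exact search: case is folded, but punctuation
# marks are matched exactly as they appear. The content list is an
# illustrative assumption.

def exact_match(query, tokens):
    """Return tokens that match the query exactly, ignoring case only."""
    folded = query.lower()
    return [t for t in tokens if t.lower() == folded]

content = ["C++", "c++", "CPlusPlus", "cat", "-v", "B.C.", "bc"]

# Punctuation is matched as it appears; only case is folded.
print(exact_match("c++", content))   # ['C++', 'c++']
print(exact_match("b.c.", content))  # ['B.C.']
```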
Table 30.
Exact Search Queries and Normalized Character Strings

  Search Query Type        Search Query Example   Normalized Character Strings
  Punctuated queries       cat -v                 cat -v
  in exact searching       shi’ite                shi’ite
                           ne’ve                  ne’ve
                           b.c.                   b.c.
                           c++                    c++
                           thousand++             thousand++
  Non-punctuated queries   cat command            cat command
  in exact searching       shiite                 shiite
                           never                  never
                           british columbia       british columbia
                           thousand               thousand
On the other hand, the options for approximate search can vary widely. For an approximate search, we do not have to match the search queries as they appear. Also, there is an issue of what constitutes an approximate search. The notion of an approximate search query implies that its scope has to be pre-determined by a designer and carried out by a system. A designer has to define the notion and specify the type of matching process the system executes for such search requests. We could assume an approximate search is the default search and consider an exact search an optional feature that must be manually selected by the user in a text retrieval system.
In general, the notion of an approximate search query can be expanded further to include conservative approximate searching and aggressive approximate searching. To facilitate these types of searches, punctuation normalization has to be applied according to the matching functions required by these two broad categories. In conservative approximate searching, searching is closer to the notion of the exact search that we just discussed. Under this scenario, the punctuation normalization operations can be viewed as less aggressive. It needs
to avoid normalization attempts that might produce non-relevant search results. Generally, it should make fewer attempts to find other character strings that might match the user’s search term semantically. It is a stricter type of search that avoids matching a wide variety of string patterns as well.
The idea behind this method is to match only known matching patterns under a stricter set of specifications. Because semantically equivalent lexical terms are not pursued, the system only needs to match lexical terms on the search side and on the content side that are within a close string distance. This means the PSD value should be low, as in the case of a normalized Levenshtein distance value.
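One way to make this concrete is a normalized Levenshtein distance, which the text names as a possible measure of close string distance. The sketch below is a standard dynamic-programming implementation; the example strings come from the surrounding discussion, and any threshold a system would apply is left as an assumption.

```python
# A sketch of a normalized Levenshtein distance, one possible measure of
# "close string distance" for conservative approximate matching.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

print(normalized_distance("shi'ite", "shiite"))       # about 0.14 (one deletion / 7 characters)
print(normalized_distance("b.c.", "boston college"))  # a much larger distance
```

A conservative search would accept only pairs below some small threshold, so shi'ite/shiite would match while b.c./boston college would be left to semantic expansion.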
On the other hand, in aggressive approximate searching, matching can be more relaxed, so that more related terms are produced by pattern matching techniques. More possible matching character strings are pursued under this approach, including character strings whose spellings differ but that are semantically equivalent. As we discussed in the previous chapters, gains made by means of punctuation normalization come at the expense of retrieving more irrelevant results. Once again, distinguishing approximate searches as conservative or aggressive is useful, but a system designer needs to define what constitutes a conservative search method and what constitutes an aggressive one. Once the proper decisions are made, character strings should be normalized, configured, and indexed according to these types.
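The two categories can be sketched as variant-generation functions. This is a hypothetical illustration in the spirit of the discussion: the specific rewrite rules and the small semantic table are assumptions, not a definitive normalization procedure.

```python
# A hypothetical sketch of conservatively vs. aggressively normalized
# query variants. The rules and the semantic table are illustrative
# assumptions.

import re

# Assumed semantic expansions an aggressive search might consult.
SEMANTIC_TABLE = {
    "b.c.": ["british columbia", "before christ", "boston college"],
    "shi'ite": ["shiites", "shiah"],
}

def conservative_variants(term):
    """Close string-distance variants: keep, strip, or space out punctuation."""
    term = term.lower()
    return {term,
            re.sub(r"[^\w\s]", "", term),            # remove-contract: b.c. -> bc
            re.sub(r"[^\w\s]", " ", term).strip()}   # remove-expand: b.c. -> b c

def aggressive_variants(term):
    """Conservative variants plus semantically related strings."""
    variants = conservative_variants(term)
    variants.update(SEMANTIC_TABLE.get(term.lower(), []))
    return variants

print(sorted(conservative_variants("B.C.")))
print(sorted(aggressive_variants("B.C.")))
```

The aggressive set is a strict superset of the conservative one, which mirrors the trade-off described above: more matches at the risk of more irrelevant results.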
Based on the notion of conservative and aggressive search query types, Table 31 shows a few examples of possible LO normalized character strings. Again, there is no precise way to generate the possible matching character strings. This is our own version of hypothetical normalized character strings that illustrates the notion of approximate search query types.
Table 31.
Approximate Search Queries and LO Normalized Strings

Search Query Type: Punctuated queries in approximate searching

  Search Query Example   Conservatively Normalized Strings           Aggressively Normalized Character Strings
  cat -v                 cat -v, cat-v, cat v                        cat +v, cat command, cat commands
  shi’ite                shi’ite, shiite                             shi’ite, shiite, shiites, shiah
  ne’ve                  ne’ve, neve                                 ne’ve, neve, never
  b.c.                   b.c., bc                                    b.c., -bc-, b/c, british columbia, boston college, before christ
  c++                    c++                                         c++, cplusplus
  thousand++             thousand++, thousand                        thousand++, thousand, 1000, thousandplusplus

Search Query Type: Non-punctuated queries in approximate searching

  Search Query Example   Conservatively Normalized Strings           Aggressively Normalized Character Strings
  cat minus v option     cat minus v option, cat -v, cat -v option   cat minus v options, cat command, cat options
  never                  never, ne’ve                                never
  shiite                 shiite, shi’ite                             shi’ite, shiite, shiites, shiah
  cplusplus              cplusplus, c++                              cplusplus, c++
  british columbia       british columbia, bc, b.c.                  british columbia, (british columbia), b/c
  bc                     b.c., bc                                    b.c., bc, _bc_, (bc), british columbia, before christ, boston college
  thousand               thousand, thousand++                        1000, thousand, thousand++, thousand+, -thousand-, *thousand*
By comparing the sets of normalized character strings shown in this table, we can obtain a sense of what conservative punctuation normalization and aggressive punctuation normalization might be. Whether a specific character string is aggressively or conservatively normalized is subjective to a degree. As in the previous table, punctuated and non-punctuated search queries are shown. Notice that the aggressively normalized character strings have a longer list of character strings, since in some cases we would want to be able to generate more matches than the conservative approximate searching method.
As pointed out earlier, decisions must be made regarding what constitutes a conservative approximate search query and what constitutes an aggressive one. Facilitating these two distinctive search query types is optional. The terms defined here are only useful for understanding the issues and comparing their features to each other. This study suggests that approximate search queries could be categorized into these two types.
As shown in Table 31, in some cases a punctuated term in a search query is as vital as any good discriminating character string. Sometimes punctuation marks are desirable for punctuated character strings such as cat -v and C++. This type of use is particularly suitable for an exact search query, where these LO punctuation marks are highly effective discriminators. We noticed that in this situation, not using a punctuated query could produce unpredictable results. Generally, it would be extremely difficult to find instances of these punctuated character strings without using some type of punctuated search term.
While both cat -v and C++ are punctuated character strings, a distinction can be made between these two types. The distinction is necessary since it can help us to formulate a specific approach needed to normalize these types of punctuated character strings. C++ is a well-known
word that can be found in a lexicon, whereas cat -v is a domain-specific character string that is
not expected to be found in a lexicon.
Normalizing LO punctuated character strings such as C++ and cat -v could be approached in two ways. The first approach is to use a customized look-up dictionary. A dictionary-based approach can be implemented through what we refer to as exceptional routines. Exceptional routines can exempt these types of character strings from being normalized, or they can carry out more targeted punctuation normalization operations only for specific terms such as C++ and cat -v. The disadvantage for a less well-known expression such as cat -v is that it takes more effort to identify this type of string. In such cases, we would have to rely on methods such as frequency counts to identify these domain-specific character strings. Maintaining the exceptional routines and the customized look-up table ensures that special character strings are not treated as typical words in the text. However, this approach has its disadvantages, since it is difficult to detect and maintain a large number of such character strings.
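The exceptional-routine idea can be sketched as a look-up table consulted before the general normalization pass. The table contents and the fallback rule below are illustrative assumptions, not a prescription.

```python
# A sketch of an "exceptional routine": a customized look-up table that
# exempts known punctuated terms such as C++ and cat -v from the general
# punctuation normalization pass. Table contents are assumptions.

import re

# Known punctuated terms that must survive normalization intact.
EXCEPTIONS = {"c++", "cat -v", "c#"}

def normalize_term(term):
    folded = term.lower()
    if folded in EXCEPTIONS:
        return folded                          # exceptional routine: skip normalization
    return re.sub(r"[^\w\s]", "", folded)      # default: strip punctuation

print(normalize_term("C++"))       # 'c++'    (preserved)
print(normalize_term("Shi'ite"))   # 'shiite' (normalized)
```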
The second approach is to examine the lexical-level punctuation patterns that exist on the content side and carefully craft the punctuation normalization rules accordingly. For example, when the ++ punctuation pattern is used in a punctuated character string, it is treated as a set of LO punctuation marks. Let us consider ++ in the character string C++ in more detail. These are known facts about these combinational punctuation marks:
known facts about these combinational punctuation marks:
• ++ at the suffix position
• ++ could be a unique discriminator since ++ is a combinational punctuation mark
• ++ are LO punctuation marks
Instead of using a dictionary approach, we could take a more straightforward approach by considering all of these facts. We could also perform frequency counts based on the suffix ++ pattern. However, unlike the example involving the character string C++, in some cases ++ only adds additional senses to the existing character string. For example, the character string thousand++ is not substantially different in meaning from the character string thousand. There was an instance of the punctuated character string thousand++ in the Usenet dataset. The ++ was used to indicate that the quantity is much larger than a thousand; this is an unconventional, invented character string. The question is whether we want to allow matching thousand++ with thousand for an approximate search case.
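The frequency-count idea mentioned above can be sketched as a scan for tokens ending in the combinational pattern ++. The sample corpus below is an illustrative assumption, not data from the study.

```python
# A sketch of the frequency-count approach: scanning a corpus for tokens
# that end with the combinational punctuation pattern ++. The sample
# corpus is an illustrative assumption.

from collections import Counter

corpus = "I wrote it in C++ and it ran a thousand++ times , C++ rules"

suffix_plus = Counter(
    tok for tok in corpus.split() if tok.endswith("++") and len(tok) > 2
)
print(suffix_plus)
```

Tokens surfaced this way (C++, thousand++) could then be reviewed and either added to the exception table or assigned a targeted normalization rule.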
In the previous case, C and C++ could be considered semantically close. These two character strings refer to distinct, yet related, computer languages. For a typical search term, we probably do not want to substitute one term for the other, since they have distinctively different meanings. In comparison, the sense of distinction is somewhat weaker for the pair thousand++ and thousand. Thus, if the search query contains the character string thousand, it might be acceptable to match the character string thousand++ as well. Even so, a punctuated character string that ends with ++ should not simply be normalized in most cases. This type of punctuation pattern offers an essential discriminating feature because of its unique form.
On the other hand, we want to normalize the punctuation marks for some of these types of character strings so that the alphanumeric portions alone can be matched. For cases such as these, indexing the character string in two separate ways (e.g., thousand and thousand++), as discussed earlier, might be an option. In addition, employing an orthographic comparison approach in conjunction with
alphanumeric characters might be useful. Matching such punctuated character strings with non-punctuated character strings also depends on the particular search features (e.g., conservative or aggressive).
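Indexing a punctuated string two separate ways can be sketched with a small inverted index. The index layout and the document ID are illustrative assumptions.

```python
# A sketch of indexing a punctuated character string two separate ways,
# as suggested for cases like thousand++: the inverted index maps both
# the punctuated form and its stripped alphanumeric form to the same
# posting. Index layout and document IDs are illustrative assumptions.

import re
from collections import defaultdict

index = defaultdict(set)

def index_token(token, doc_id):
    index[token.lower()].add(doc_id)           # punctuated form: thousand++
    stripped = re.sub(r"[^\w]", "", token.lower())
    if stripped and stripped != token.lower():
        index[stripped].add(doc_id)            # stripped form: thousand

index_token("thousand++", doc_id=7)
print(sorted(index["thousand++"]))  # exact, punctuated query finds the document
print(sorted(index["thousand"]))    # approximate, non-punctuated query also finds it
```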
In other cases, the operations that are required to normalize LO punctuation marks and
support the matching of strings may seem fairly straightforward. This is especially true for cases
where the LO punctuation marks do not add strong discriminatory sense to the character string.
For example, if a character string such as Shi’ite appears on the content side, this punctuated term could be normalized as the non-punctuated term Shiite. In this case, the semantic effect of the apostrophe is so subtle that Shi’ite and Shiite are substantially equivalent and merely signify stylistic differences. This is also true for most hyphenated lexical items (e.g., health-care vs. healthcare). In such cases, users should be able to retrieve all of the known variations of the character string, provided proper normalization procedures are applied on both the content side and the search side.
We also have to consider users’ behavioral characteristics in formulating search queries. In some situations, users may only consider punctuated character strings as an alternative search attempt. Users might neglect to specify punctuation marks, yet still expect a retrieval system to return relevant search results. For example, a user might simply type cant instead of can’t because it is faster and more convenient to type without the apostrophe.
We do not know to what extent users avoid punctuation marks to save time and effort. However, although the application was different, as pointed out in Chapter 2, saving time and effort was observed as a reason for omitting punctuation marks during text messaging. Similar behavioral characteristics might carry over to formulating search queries.
Compared to conservative approximate matching and exact search queries, a much simpler punctuation normalization operation is required for LI punctuation marks. Notice that LI punctuation marks need a different treatment from LO punctuation marks. LI punctuation marks should be used to disambiguate the role of LO punctuation marks, but this is a separate task. Generally, users have little need to search content directly by specifying LI punctuation marks. The same is applicable to the content side as well. For example, if the term B.C., appears with a trailing comma as a term in the content, then removing the ending comma is appropriate, because the ending comma is an LI punctuation mark.
Lexical Oriented (LO) Punctuation Normalization Effects
The analysis indicated various effects of LO punctuation normalization. These effects can vary significantly depending on the type of LO punctuation normalization. Some of the issues related to LO punctuation marks were described in the previous chapter. As pointed out there, infix-punctuation marks are LO punctuation marks. Additionally, for the purpose of an in-depth discussion, let us consider two types of LO punctuation normalization effects:
• Effects of a search query involving infix-hyphens
• Effects of a search query involving abbreviations
The Effects of a Search Query Involving Infix-Hyphens
We examined the effects of a search query involving infix-hyphens. Some of the
character strings discussed in this section were already examined in the previous chapter.
However, in this section, different types of users’ search queries are factored into the analysis of punctuation normalization. There are three types of infix-hyphenated situations:
• Affix-based word
• Compound word
• Abbreviated or non-standard character string
In the first case, one way to retrieve character strings associated with a prefixed word, such as re-read, was the following:
• Retrieve character strings that contain reread
• Retrieve character strings that contain re-read
To find all text segments that contain reread and re-read, the results could be merged after running independent search queries. An alternative option is to normalize both reread and re-read into a single form. The latter option eliminates the overhead associated with the merging step. As a result, normalizing the character string is a much more efficient means of processing such a search query.
To elaborate on the normalization options, two types of punctuation normalization for the word re-read are possible. These are:
• re-read to reread
• reread to re-read
Either of these options is suitable. By normalizing these words, text segments that contain either reread or re-read can be retrieved by a user. To simplify the matching process, the user’s search query needs to be normalized in the same manner.
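Applying the same normalization to both sides can be sketched as follows. Choosing the hyphen-free form as the canonical one is an illustrative assumption; the reverse mapping would work equally well, as the text notes.

```python
# A sketch of normalizing both content and query to one canonical form,
# as in the re-read / reread example. The choice of the hyphen-free form
# as canonical is an illustrative assumption.

def canonical(term):
    """Map re-read and reread to a single indexed form."""
    return term.lower().replace("-", "")

content_terms = ["Re-read", "reread", "re-read"]
query = "re-read"

# Content side and search side receive the same normalization.
matches = [t for t in content_terms if canonical(t) == canonical(query)]
print(matches)   # ['Re-read', 'reread', 're-read']
```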
In the second case, one way to retrieve short text segments that contain instances of a
compound word such as health-care is to normalize the lexical item into one of the following:
• health care
• healthcare
• health-care
Preferably, all of these character string variants should be normalized into just one of the above. By normalizing the punctuation marks identically on the search side and on the content side, any variants of the character string will match each other. First, we need to define the notion of a sub-word for this study. A sub-word in this study is any main word fragment that consists of a contiguous string. Generally, sub-words are contiguous alphanumeric sequences separated by one or more punctuation marks. For example, for the lexical item health-care, the two character strings health and care are individually considered sub-words, as they are separated by a punctuation mark. Later in this chapter, indexing and retrieving sub-words are examined in conjunction with punctuation marks.
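The sub-word definition above can be sketched directly. The regular expression is an illustrative assumption for "contiguous alphanumeric fragment"; a real system might use a more elaborate tokenizer.

```python
# A sketch of the sub-word notion: contiguous alphanumeric fragments of
# a token, separated by punctuation marks. The regular expression is an
# illustrative assumption.

import re

def sub_words(token):
    """Split a token into contiguous alphanumeric fragments."""
    return re.findall(r"[A-Za-z0-9]+", token)

print(sub_words("health-care"))    # ['health', 'care']
print(sub_words("(MIP)-1alpha"))   # ['MIP', '1alpha']
```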
The third case involves hyphens in abbreviated or non-standard character strings. For an abbreviated character string that contains an infix-hyphen, let us consider the character string cu-ft, which could reside in the content. For this character string, it is possible to search using character string patterns such as:
• cuft
• cu-ft
• cu ft
• cubic feet
• cubic-feet
Similar to compound word normalization, the remove-contract operation is acceptable in this case. If a remove-contract operation is applied, the original character string cu-ft becomes cuft.
Since the character string was abbreviated, more possible string variations are linked to the
character string cu-ft.
In this case, since the normalized hyphenated character string is cuft, this character string still has the same meaning as cu-ft. However, disregarding capitalization, the character string cuft could also stand for Colegio Universitario Fermin Toro or Children of the Universe Foundation according to the TheFreeDictionary website (cuft, 2013). For these entities, the character string cu-ft, as opposed to cuft, might not be an appropriate abbreviation because of the infix-hyphen. While cuft could be viewed as semantically the same as cu-ft, it could also be associated with other contextual uses.
In other cases of non-standard character strings, an abbreviated character string may produce a character string with an entirely different meaning if a remove-contract operation is applied. Thus, the semantic effects of normalizing infix-hyphens could be riskier in this type of case. For instance, when changing a-b to ab, the remove-contract operation might not be a suitable solution, as it may destroy the original meaning of the non-standard word. Therefore, it is difficult to generalize and develop an effective LO punctuation normalization procedure for non-standard words.
Comparing all of the above cases, each type of character string with infix-hyphens should have a different LO punctuation normalization procedure. In the case of the word re-read, we would not expect a user to specify re read as a search query term, since re is a bound morpheme. In the case of the character string cu-ft, cu could be regarded as an abbreviated character string rather than an affix. There is a difference between re and cu in this case: since cu ft is regarded as a searchable character string, it should be indexable as a
potentially matchable character string. For the compound word health-care, the character string health care is an acceptable alternative representation, since it is a compound word.
All of the above variations should either be indexed as one single unit or supported through phrase searching. Although partially indexing sub-words of infix-hyphenated character strings such as re-read, health-care, and cu-ft makes sense, it is questionable whether users need to be able to search re or cu as an individual character string. In this case, re is simply an affix and cu is a less distinguishable abbreviated character string when used as an individual term.
In the case of a short non-standard hyphenated term, such as a-b, it is also highly
questionable whether we want to index individual sub-words. The first choice is to index the
elements as a, b, and a-b. The second choice is to just index a-b. Also, we have to consider the
fact that blank space variations might exist for the character string a-b which increases the level
of complexity in supporting this type of search.
Matching Sub-Words and Neighboring Character String
Earlier, we discussed that LO punctuation normalization can produce different token forms (e.g., z-z to z z). Accordingly, we discussed the need for indexing and searching sub-words as a result of LO punctuation normalization. In turn, we further discussed the types of effects on supporting exact and approximate searches.
Before continuing with the notion of indexing sub-words, there is a caveat, as mentioned in Chapter 2. If full text is treated as an unbroken stream of characters without word boundaries, then an underlying structure to support sub-string matching is possible and word-based indexing is not required. Therefore, indexing
sub-words becomes a somewhat different issue, since finding any part of a sub-string is naturally supported by a substring indexing method such as a suffix tree.
In this study, we mainly focused our attention on analyzing punctuation normalization issues based on blank-space and word-based tokenization. To index every term in a corpus, it would be natural to proceed in the following order: tokenization, punctuation normalization, and indexing. However, tokenization and punctuation normalization might require an iterative procedure, in the sense that appropriate tokenization points must be determined in conjunction with punctuation marks. For example, a suffix comma usually indicates a definitive tokenization point. These types of issues are discussed further later in this chapter.
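The tokenize-normalize-index order can be sketched as a minimal pipeline. All normalization rules here are illustrative assumptions; a real system would iterate between tokenization and normalization as described above.

```python
# A minimal sketch of the processing order: blank-space tokenization,
# then punctuation normalization, then indexing. The rules and sample
# documents are illustrative assumptions.

import re
from collections import defaultdict

def tokenize(text):
    return text.split()                        # blank-space based tokenization

def normalize(token):
    token = token.lower()
    return re.sub(r"[.,;:!?]+$", "", token)    # strip trailing LI punctuation (e.g., a suffix comma)

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            tok = normalize(tok)
            if tok:
                index[tok].add(doc_id)
    return index

idx = build_index({1: "Born in B.C., he moved east.", 2: "The cat -v option"})
print(sorted(idx["b.c"]))   # trailing LI punctuation removed, infix periods kept
```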
In terms of lexical-level patterns, if a user’s search query pattern is the hyphenated string pattern z-z, the matching pattern could be the z-z, z z, or zz pattern. Table 32 summarizes the possible normalized forms. The possibilities shown in this table are not exhaustive, since some exceptions can be expected. In addition, retrieving all of the relevant hyphenated character strings requires a careful examination of the individual types of hyphenated character strings. Such a process might involve looking up existing dictionary-based terms. At the same time, it may involve algorithms to distinguish one form of hyphenated character string from another. In the case of abbreviations and non-standard text, features such as the length of the alphanumeric characters should be an important attribute, used in conjunction with terms in a dictionary.
In terms of indexing, different options are possible. These decisions are dependent on the
individual model of text retrieval that a system designer decides to adopt. In the case of a
compound word, indexing sub-words as bound-free morphemes is desirable since a user may
only search for a sub-word. For example, health and care could be individually indexed. In the case of the character string b/c, both b and c could be considered sub-words.
Table 32.
Infix-Hyphen Normalized Pattern

  Types of Hyphenated        Allowable            Avoidable           Sub-Word Indexing
  Character String           Normalized Pattern   Pattern
  Affix-Based Word           z-z, zz              z z                 index bound-free morpheme;
                                                                      optional for affixes
  Compound Word              z-z, z z, zz         none                index both bound-free
                                                                      morphemes
  Abbreviation and           z-z, z z             depending on the    depending on the type of
  Non-Standard Form of                            type of             abbreviation and non-standard
  Character String                                abbreviation and    character string
                                                  non-standard
                                                  character string
In some cases, indexing sub-words individually could be viewed as an important problem that needs to be addressed for the efficiency of text retrieval. Whether to index only the non-standard parts of a character string might depend upon the application. Let us consider Jiang and Zhai’s (2007) earlier example again. As mentioned in Chapter 2, the character string (MIP)-1alpha was used in the biomedical domain. For such a character string, searching a sub-word pattern could be useful. In this case, we could index the character string using two sub-words: MIP and 1alpha. Indexing sub-words works for compound words, since compound words contain two bound-free morphemes. On the other hand, for affix-based words, the affix could be disregarded and the remaining free morpheme could be indexed.
To observe the effects of punctuation normalization of infix punctuation marks, it is particularly important that we consider the phenomena with following factors:
• Types of searches (exact vs. approximate)
• Types of indexing (e.g., sub-word vs. compound-word)
• Types of character strings (e.g., affixes, compound words, non-standard)
• Types of punctuation marks
We also observed that sub-words and neighboring character strings sometimes had to be considered in examining the effects of normalizing infix-punctuation marks. Let us consider the following phrase: important health-care provider. To examine the effects holistically, a slight modification to the z-notation is suitable. For an infix-hyphenated lexical item, the left-side character string is represented as z and the right-side character string is represented as z!. Thus, the lexical item health-care can be represented as z-z!. In addition, the token on the left side of z is represented as z-, whereas the token on the right side of z! is represented as z+. In sum, for the sub-words and neighboring character strings, we have the following representation:
• important : z-
• health : z
• care: z!
• provider: z+
Collectively, using the modified z-notation, the phrase important health-care provider can simply be represented as z- z-z! z+. Notice that there are blank spaces between the elements of this phrase representation that should not be ignored. Let us examine a case where an exact search method is used to retrieve short text segments. Table 33 shows an exact search query scenario with various search query terms. Here, one method indexes sub-words while the other does not. In comparison, sub-word indexing retrieves every type of
search term combination. On the other hand, if the method without sub-word indexing is used, some terms in search queries would not be matched to the lexical items on the content side.
Table 33
Exact Search Query Involving a Compound Word (health-care)

  Phrase:       important health-care provider
  z notation:   z- z-z! z+

  Search Query                      z Notation    Matched String        Matched String
                                                  (Sub-Word Indexing)   (Without Sub-Word Indexing)
  health                            z             +                     -
  care                              z!            +                     -
  important                         z-            +                     +
  provider                          z+            +                     +
  health-care                       z-z!          +                     +
  important health                  z- z          +                     -
  care provider                     z! z+         +                     -
  important health-care provider    z- z-z! z+    +                     +

  Note: The plus sign (+) indicates a matched condition, whereas the minus sign (-) indicates an unmatched condition.
There is a clear difference in how these two methods treat sub-words when matching terms. While sub-word indexing is suitable for approximate search, in this type of situation it may not be a suitable option for exact search queries. By indexing the sub-words, the neighboring character strings were retrieved as a phrase. Although this example may not seem problematic, in some situations, particularly those involving non-standard character strings, the problem could have a noticeable effect.
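The contrast summarized in Table 33 can be sketched as follows. The index construction is an illustrative assumption; the point is only which single-term queries match under each method.

```python
# A sketch contrasting the two indexing methods compared in Table 33:
# indexing sub-words of health-care versus indexing only whole tokens.
# Index construction details are illustrative assumptions.

import re

phrase = ["important", "health-care", "provider"]

def index_terms(tokens, sub_word_indexing):
    terms = set(t.lower() for t in tokens)
    if sub_word_indexing:
        for t in tokens:
            terms.update(re.findall(r"[a-z0-9]+", t.lower()))
    return terms

with_sub = index_terms(phrase, sub_word_indexing=True)
without_sub = index_terms(phrase, sub_word_indexing=False)

print("health" in with_sub)          # matched under sub-word indexing
print("health" in without_sub)       # unmatched without sub-word indexing
print("health-care" in without_sub)  # the whole token matches either way
```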
For example, Table 34 shows an exact search query scenario involving a non-standard word. In this table, search terms such as ne and er would be matched under the sub-word indexing scheme. The character strings ne and er are short abbreviations that could stand for a number of lexical items. For example, the character string ne could stand for Nebraska and the
character string er could stand for emergency room. In both cases, the context of these lexical items is quite different. In this sense, unlike the compound-word case in Table 33, the adverse effect of sub-word indexing involving non-standard character strings could be more apparent.
Table 34
Exact Search Query Involving a Non-Standard Word (ne’er)

  Phrase:       youth ne'er left
  z notation:   z- z'z! z+

  Search Query        z Notation    Matched String        Matched String
                                    (Sub-Word Indexing)   (Without Sub-Word Indexing)
  ne                  z             +                     -
  er                  z!            +                     -
  youth               z-            +                     +
  left                z+            +                     +
  ne'er               z'z!          +                     +
  youth ne            z- z          +                     -
  er left             z! z+         +                     -
  youth ne'er left    z- z'z! z+    +                     +

  Note: The plus sign (+) indicates a matched condition, whereas the minus sign (-) indicates an unmatched condition.
The Effects of a Search Query Involving Abbreviations
As discussed in the previous chapter, abbreviations, which often contain periods, appear frequently in a corpus. To explore the punctuation normalization problems associated with abbreviated character strings, variants of abbreviated character strings had to be investigated. A relevant issue was the ways in which users might express a particular abbreviated character string. In examining the effects of punctuation normalization, we first need to consider what type of character strings users might use in a punctuated query. Then, we can develop a punctuation normalization scheme accordingly.
Consider the abbreviated character string B.C. As shown in Table 35, by running exact search queries, various segments in the Usenet dataset that contain the term could be examined. Let us assume a situation where the ability to determine the context of the character string B.C. from the search query alone is quite limited; for instance, a user might have entered only this character string as a search query.
Using the exact search query method, we also retrieved a few lexical items that contained the string BC. When the string BC was used as a search term, more search results were produced than when B.C. was used. It is difficult to determine whether omitting periods in abbreviated character strings produces more results in general.
Table 35
Matched Abbreviated Token “B.C.” in the Usenet Dataset

  Term    Character String and Neighboring Strings
  B.C.    • 3300 B.C.
          • the B.C. Court
  BC      • 2370 BC Nov 20
          • Social Club in Van BC
          • BC = { (bi*cj,(BiCj))
          • CASE NO. BC 245271
          • deborah BC
More importantly, there were contextual differences between the two terms. In some instances, B.C. and BC were used in the same context: the string B.C. was used as in 3300 B.C. and BC was used as in 2370 BC Nov 20, and in both cases the meaning of the character string is Before Christ. However, there were additional instances in which these two abbreviated forms could not be treated as equivalent. For example, in BC = { (bi*cj,(BiCj)), the string BC was used in the context of a mathematical formula. In CASE NO.
BC 245271, BC was used in the context of referencing indexes (e.g., BC1, BC2, BC3, etc.). In
these types of instances, B.C. could not be substituted for BC.
Assuming that we have limited means of knowing the context of the character string, if a user decides to use the string BC, then the character string could be associated with any of the following:
• All instances of character strings that are semantically equivalent to the abbreviated string B.C.
• A mathematical formula
• A referencing index, such as a product or part number
Some other effects of normalizing punctuated abbreviated character strings are noticeable in this example. Here, without considering any contextual usage, only a string-based matching method was used to retrieve text segments. The result indicates that resolving such a lexical ambiguity problem requires additional disambiguation procedures. The problems involving lexical ambiguity are noticeable in Table 35. Two terms with identical spellings can be associated with the classic homonymy and polysemy problem (Krovetz, 1997). However, because we are dealing with punctuated character strings, the definitions need to be established more clearly. Nevertheless, more than one meaning is associated with the abbreviated string B.C. As shown in this table, B.C. was used to indicate Before Christ, and to indicate a street name.
This case effectively demonstrates that punctuation marks provide additional contextual
meaning or additional senses to a character string. However, ambiguity of the character strings
can be high if the contextual usage is not clearly identifiable. Theoretically, since this term is in
the form of an abbreviated string, the abbreviated strings B.C. and BC could stand for an
extremely large number of character strings. Domain knowledge is critical, as it could narrow
the possibilities.
Similarly, b/c is not a typical dictionary-based word, as it can be used loosely to represent various lexical items; the character string could stand for because, b divided by c, and so on. Using an exact search method, string matching of b/c is the primary concern: identifying the contextual role of the string is not required, as the user has the responsibility to utilize the search feature correctly. As previously discussed, such character strings require additional clues from neighboring character strings or punctuation patterns to determine the correct contextual usage. Even so, from the text retrieval point of view, it is unclear at this point whether it is useful to retrieve all variants of abbreviated character strings. In a larger sense, judicious decisions are required to normalize punctuation marks in abbreviated character strings.
The Semantic Effect Categorization
The semantic effects of LO punctuation normalization can be observed at a lexical level.
In most cases, since punctuation marks clarify or add distinctive context to the text, applying LO punctuation normalization procedures can easily produce additional, false meanings of a character string. Most of the time, semantic effects can be observed by comparing an un-normalized term with its corresponding normalized term.
We can derive the semantic effects of punctuation normalization by first establishing these criteria:
• Two character strings are semantically equivalent
• Two character strings could be semantically equivalent depending on the contextual usage
• Two character strings are totally unrelated
It is useful to conceptualize the effects of LO punctuation normalization by examining the conditions between an un-normalized character string with the normalized character string.
Figure 17 is a conceptual configuration of LO normalization effects. This figure depicts the semantic relationship between an un-normalized character string and a normalized character string. Each circle represents the semantic space of the un-normalized or the normalized character string. These diagrams are abstract representations of the contextual usages of character strings. The shaded area represents the semantic space where the two types of character strings intersect, implying that the two terms can have the same meaning, or the same sense. The three configurations shown in this figure can be used to represent the degree of similarity between un-normalized character strings and normalized character strings.
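Although Figure 17 is conceptual, the degree of overlap between the two semantic spaces could be approximated empirically, for instance by comparing the sets of words that co-occur with each character-string variant. The sketch below uses a Jaccard measure over neighboring words; the helper names and sample segments are hypothetical illustrations, not the study's method:

```python
def context_words(segments):
    """Collect the set of lowercased neighboring words from text segments."""
    words = set()
    for segment in segments:
        words.update(w.strip(".,;:!?").lower() for w in segment.split())
    return words

def semantic_overlap(segments_a, segments_b):
    """Jaccard similarity between the contexts of two string variants.

    Values near 1.0 suggest configuration (a) in Figure 17, intermediate
    values configuration (b), and values near 0.0 configuration (c)."""
    a, b = context_words(segments_a), context_words(segments_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical context segments for the variants B.C. and BC:
bc_dot = ["built around 3300 B.C. by settlers"]
bc_plain = ["abandoned by 2370 BC by settlers", "CASE NO. BC 245271 filed"]
print(round(semantic_overlap(bc_dot, bc_plain), 2))
```

A partial overlap such as this would correspond to box (b): some contexts (dates) are shared while others (case numbers) are not.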
Figure 17. Conceptual configuration of LO normalization effects.
The condition when the two character strings − the un-normalized character string and the normalized character string − are semantically equivalent is shown as box (a). The two circles are perfectly aligned with each other. This condition represents the ideal state of punctuation normalization: there is no semantic variation between the character strings, and the side effects of LO punctuation normalization, if any, are minimal.
The condition when two character strings could be semantically equivalent but differ in
the contextual usage is shown as box (b). In most cases, box (b) should represent a “semantic
closeness” condition since some sense or meaning of the two character strings overlap with each
other. The overlapping area of two circles shown in box (b) can differ depending on a particular
situation. That is, the two circles might move more closely together, or they could move away
from each other slightly. Since the overlapping area could vary depending on the nature of the
character string, the effects are difficult to predict beforehand. Moreover, depending on the
user’s intended context of the character string, the normalized character string could fall either
into the shaded area or the non-shaded area.
The circles in box (c), on the other hand, do not overlap at all. This is the condition
where the two character strings are entirely unrelated, and there is no overlapping shaded area as
shown. The search results would consist mostly of unrelated items for the user. This condition
should be avoided, as it would be difficult to meet the end user’s need.
Table 36 shows some of the effects of LO normalization. Sometimes these categories are
less obvious due to the imprecise nature of interpretation, and sometimes punctuation normalization produces less conventional forms of lexical items. For example, non combatants appears to be less conventional because non is an affix and not a free morpheme; collectively, however, the character string provides a sufficient sense as non-combatants. Since all of these are exemplary cases, further examination of the context would be useful in most cases in assessing semantic equivalence. As in this case, the sense of non-standard character strings that contain LO punctuation marks is difficult to generalize without further investigation.
Table 36
An Example of LO Normalization Effects

Category                     Un-normalized       Normalized          Unconventional
                             Character String    Character String    Form?
Semantically Equivalent      white-space         blank space
                             shi’ite             shiite
                             non-combatants      non combatants      yes
Equivalent Depending on      No.                 No
the Contextual Usage         B.C.                B C                 yes
                             9-1-1               911
                             B.C.                British Columbia
                             #9                  9
                             King, James         King James
                             re-read             re read             yes
Totally Not Related          cat -v              cat v
                             b-c                 bc
Lexical Independent (LI) Punctuation Normalization Effects
The effects of LI punctuation normalization are more apparent and straightforward. The most distinctive effect of LI punctuation marks is whether neighboring character strings are associated as a single phrase. Earlier, we pointed out that LI punctuation marks could be normalized by removing them as long as the removal points were tagged by the system. By marking these points, when a phrase search comes along, we are able to prevent the two neighboring character strings from matching the search. Table 37 is an example of LI punctuation normalization effects.
Table 37
An Example of LI Normalization Effects

                        Without LI             Option #1:              Option #2:
                        Normalization          Text After Simply       Text After LI
                        (Un-normalized) Text   Removing Punctuation    Normalization by
                                               Marks                   Tagging the Points
Text                    Johnny Justice, King   Johnny Justice King of  Johnny Justice || King of
                        of this living         this living Universe    this living Universe ||
                        Universe, wants to     wants to have a word    wants to have a word with
                        have a word with       with every American     every American in
                        every American in      in America              America
                        America
Exact Search Query
“Justice King”          -                      +                       -
Exact Search Query
“Johnny Justice”        -                      +                       +
Exact Search Query
“living Universe”       -                      +                       +

Note: The plus sign ( + ) indicates a matched condition, whereas the minus sign ( - ) indicates an unmatched condition.
According to Option #1 in this table, if we apply punctuation normalization by simply
removing commas, then we end up disregarding LI punctuation marks entirely. This means the
character strings Justice King are matched and returned to the user. However, in a strict phrase
search, this type of match should not be returned, since a user searching for the phrase might be
interested only in Justice King exactly as it appears, and not in Justice, King. Thus, this effect is
particularly notable with phrase-type search queries. The Option #2 approach could therefore be
used to prevent two adjacent character strings from matching as an exact single phrase.
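The contrast between Option #1 and Option #2 can be sketched in code: LI punctuation marks are replaced with a tagged boundary token (shown as || in Table 37) rather than silently deleted, and the phrase matcher then cannot match across a boundary. This is a simplified, hypothetical illustration; the function names and the choice of the comma as the only LI mark are assumptions:

```python
import re

BOUNDARY = "||"  # tag recording where an LI punctuation mark was removed

def li_normalize(text):
    """Option #2: replace LI punctuation marks (here, only the comma)
    with a tagged boundary token instead of silently deleting them."""
    return re.sub(r"\s*,\s*", f" {BOUNDARY} ", text)

def phrase_match(normalized_text, phrase):
    """Exact phrase search over tokens; the boundary token occupies a
    slot, so a phrase can never span a removed LI punctuation mark."""
    tokens = normalized_text.split()
    target = phrase.split()
    return any(tokens[i:i + len(target)] == target
               for i in range(len(tokens) - len(target) + 1))

text = li_normalize("Johnny Justice, King of this living Universe, "
                    "wants to have a word with every American in America")
print(phrase_match(text, "Justice King"))     # False: blocked by boundary
print(phrase_match(text, "living Universe"))  # True
```

Under Option #1 the boundary token would be absent, and the query "Justice King" would incorrectly match across the removed comma.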
The Punctuation Normalization Demonstration
The punctuation normalization demonstration is the third and last analytical procedure of this study. In this section, we report brief results of our punctuation normalization demonstration.
More extensive normalization demonstrations were designed initially, but this plan was reconsidered because sufficient punctuation normalization analysis had already been performed qualitatively.
Providing a general framework to address these issues was more important for this study since many concepts had to be defined and categorized according to their punctuation normalization characteristics.
The demonstrations that could be conducted vary widely. Since observations were entirely based on the texts from the corpus, only one demonstration was selected for the purpose of gaining further insights into the punctuation normalization effects in certain areas of this research. As mentioned earlier, more difficult issues are involved with LO punctuation normalization, since punctuation mark usage varies significantly. First, selecting a punctuated character string was necessary. Although a search term could have been randomly selected, we decided to examine punctuated character strings that contained two alphabetical characters followed by a period. The top 10 character strings that met this criterion in the Usenet dataset were the following:
• me.
• Mr.
• up.
• do.
• on.
• is.
• in.
• Dr.
• so.
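The selection step described above — counting stand-alone strings of exactly two alphabetic characters followed by a period — can be sketched as follows. This is a hypothetical illustration; the regex boundaries and sample text are assumptions rather than the study's exact procedure:

```python
import re
from collections import Counter

def two_letter_period_counts(text, top=10):
    """Count stand-alone strings of exactly two alphabetic characters
    followed by a period (e.g., 'Mr.', 'me.')."""
    # Lookarounds keep matches out of longer words such as 'Jones.'
    tokens = re.findall(r"(?<![\w.])[A-Za-z]{2}\.(?![\w.])", text)
    return Counter(tokens).most_common(top)

sample = "Call me. Ask Mr. Smith or Dr. Jones. Trust me. It is so."
print(two_letter_period_counts(sample, top=3))
```

Run over a full corpus, the most frequent results would mix abbreviation periods (Mr., Dr.) with sentence-final periods on short words (me., so.), which is exactly the ambiguity examined next.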
Because these were the most frequently used character strings, they can easily be found in a typical dictionary. Although it was not verified, more than likely some of the unabbreviated character strings in the above list contain full stops, or end-of-sentence periods, rather than abbreviation periods. To investigate more interesting cases, we attempted to find slightly less frequently used character strings in the Usenet dataset. The randomly selected character strings with two alphabetical characters followed by a period were the following:
• ID.
• Ah.
• AA.
Viewing the stand-alone punctuated character strings brought about uncertainty regarding their context. Therefore, we used these character strings to retrieve short segments from the two datasets to view the context of the character strings. These segments indicated some possibilities of how each character string was used.
Table 38 shows a few lines of text that contain the character strings with a suffix-period.
When these character strings were used for matching, there were notable similarities among the
abbreviated character strings that contained a suffix-period. As expected, all of the periods after
the two letters indicated the end of a sentence, marking a full stop. For the instances that
involved ID., Ah., and AA., after inspecting the text, the periods were categorized as LI
punctuation marks. In contrast, in the previous case, the suffix-period of a character string such
as Mr. was not a full stop.
Table 38
Examples of Text Segments with Two Letters Followed by a Period

Dataset    Punctuated  Example of Short Text Segments          Comments               Possible
           String                                                                     String Form
Usenet     ID.         ...pay to get a new ID.                 • ID as                Abbreviation
                       ...with your verifiable ID.               identification
                       ...Idaho Falls, ID.                     • ID as Idaho
                       ...with the Product ID.
                       ...close to my real ID.
           Ah.         Ah. Oh, darn.                           • Expresses emotion    Normal Word
                       Ah. But how can you be...                 of relief or regret
                       Ah. Generally "affordable housing"...
                       Ah. I misspoke, you're right...
                       Ah. Hines, the wayward...
           AA.         ...for my involvement in AA.            • Alcoholics           Abbreviation
                       ...sponsor and AA.                        Anonymous
                       ...but I do recognize they in AA.
                       ...realized the "New Age" into AA.
                       ...don't got no business in AA.
Gutenberg  ID.         n/a                                     • no record found
           Ah.         KING: Ah.                               • Expresses emotion    Normal Word
                       Ah. God in heaven, is...                  of relief or regret
           AA.         NOTE AA. Has _Macbeth_                  • Indexing Letters     Non-Standard
                       sack and ale call themselves AA.        • Person
                       AA. For Kodak Ball.

Note: The ellipses indicate continuation and do not indicate actual ellipses in the content.
These tokens further demonstrate that it is difficult to determine whether the ending period marks a full stop based only on viewing the character string. Additional clues from examining surrounding words and patterns are needed in such cases, where the period is attached to short words (e.g., one to two characters). To disambiguate the role of periods, methods such as the one suggested by Grefenstette and Tapanainen (1994) could be adopted. In any case, once the period’s
role is disambiguated, a proper selection of punctuation normalization can be applied. Broadly
speaking, we need to select LO punctuation normalization, LI punctuation normalization, or
some combinations of both.
Although most of the character strings appeared in both datasets, occasionally the sense
of each character string was slightly different. In the Usenet dataset, AA was used as an
abbreviation for the lexical item Alcoholics Anonymous. In the Gutenberg dataset, however, no
instances of the lexical item Alcoholics Anonymous were found; as shown in the table, the token
AA was used to represent some type of indexing reference number or a person. The Gutenberg
dataset did contain instances of texts with ID. To find out the meaning of this character string in
the Gutenberg dataset, the term ID without the ending period was searched. The retrieved
results showed that this character string could stand for identification or a person’s name.
Since two abbreviated character strings were identified from this example, a useful demonstration to conduct involves abbreviated character strings with infix-periods and a suffix-period.
Since the character strings ID and AA were identified as proper abbreviated character strings, we searched for these character strings in the Usenet dataset and the Gutenberg dataset, this time using the forms with infix- and suffix-periods. Table 39 shows examples of abbreviated character strings with infix- and suffix-periods.
The results showed that there was no significant contextual usage difference between the
two datasets except for the string A.A. As shown in the table, in the Gutenberg dataset this
character string was used as an abbreviation for a person’s name; no record was found of it as an
abbreviated character string for Alcoholics Anonymous. The missing record is probably due to
the fact that the Gutenberg dataset consisted mostly of older books. The results suggest that a
more targeted approach is desirable for certain types of collections. However, let us keep in
mind that this is only one type of instance and involved a specific type of abbreviated string
pattern.
Table 39
Examples of Two-Letter Abbreviated Words with Periods

Dataset    Punctuated        Example of Short Text Segments               Comments
           Character String
Usenet     I.D.              this I.D. Kit will give authorities          • ID as an
                             license and secondary I.D.                     identification
                             Branson, Missouri I.D.                       • ID as Idaho
                             U.S.A. - Child I.D. program
           A.A.              A.A. membership is a desire                  • Alcoholics
                             An A.A. group ought never endorse              Anonymous
                             A.A. Milne.
Gutenberg  I.D.              The receptionist took his I.D.               • ID as an
                             Kennon's I.D. and a small yellow plastic       identification
           A.A.              Portland, Maine; the Rev. A.A. ...           • Person
                             thanks to Mr. A.A. Robinson ..
                             Mr. A.A. Boyden, Mr. Bruce...
                             In 1860, Dr. A.A. Liebeault ...
The more relevant issue is the determination of a situation where one might want to
associate the abbreviated character string to its corresponding non-abbreviated character string.
Obviously, numerous instances of the non-abbreviated character strings identification and
Alcoholics Anonymous could be found in the datasets. As we have shown, in a corpus there could be many instances where non-abbreviated character strings and abbreviated character strings share the same meaning. Figure 18 is a Venn diagram representation of the character
strings: A.A. and AA, and I.D. and ID. The left side box shows typically what happens when LO
punctuation normalization is applied. As discussed earlier, the normalized character string and
un-normalized character string would match depending on the contextual usage. On the right
side, the senses of the character strings are the same for both I.D. and ID. Even though such
diagrammatically represented conditions exist, unfortunately it is difficult to generalize the
effects of lexical-level punctuation patterns for character strings other than A.A. and I.D.
To sum up, as in the earlier discussion of B.C., simply applying punctuation
normalization operations to a string such as B.C. could produce unintended semantic effects.
To probe the semantic effects of LO punctuation marks, we would have to consider the different senses of the individual LO lexical term itself. However, this representation is based only on specific instances involving the Gutenberg dataset.
Figure 18. Conceptual punctuation normalization effects: A.A. and I.D.
One last consideration is the blank space issue. This adds a slight complexity to the Venn diagram representation, but, in principle, examining the punctuation normalization effects should work the same way as before. As mentioned previously, blank spaces are an integral part of punctuation normalization procedures. We also need to consider the blank space variations for the above character strings. By applying add-infix blank space and expanding, we could form
the character strings as A. A. and I. D..
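Generating these blank-space and period variants of an abbreviation can be sketched mechanically from its bare letters. The following helper is a hypothetical illustration, not the procedure used in the study:

```python
def abbreviation_variants(letters):
    """Generate blank-space and period variants of an abbreviation from
    its bare letters, e.g. 'AA' -> ['AA', 'A A', 'A.A.', 'A. A.']."""
    return ["".join(letters),                       # AA   (plain)
            " ".join(letters),                      # A A  (add-infix blank space)
            "".join(ch + "." for ch in letters),    # A.A. (infix/suffix periods)
            " ".join(ch + "." for ch in letters)]   # A. A. (periods and blanks)

print(abbreviation_variants("ID"))  # ['ID', 'I D', 'I.D.', 'I. D.']
```

A retrieval experiment like the one reported next would search for each variant in turn and compare the contexts in which the variants occur.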
Using the Gutenberg dataset, we retrieved the text lines containing the above string
variations: A A, I D, A. A. and I. D.. Table 40 shows various retrieved text lines from the
Gutenberg dataset. As shown, in both datasets, there were no instances of text lines that
contained the string I. D.. However, when I D was used, different results were found. In the
Gutenberg dataset, the character string matched the text line that contained BY D A V I D S A M
W E L L,. If the aggressive searching method could be used, it might be reasonable to match I D
with BY D A V I D S A M W E L L,. Although this is a non-punctuated search query case, this
particular search implies that various search type scenarios need to be considered, including such
a case involving abnormal blank spaces.
If the context of the abbreviation could be determined by a system, this type of string
matching could be avoided. In the Usenet dataset, most uses of the string I D were associated
with entity or personal names; however, use of the term as identification was still evident (e.g.,
Medical Alert I D Bracelet). We also searched for the string A A in the Gutenberg dataset, but
there was only one instance where a text line contained the string. The returned result was the
following: were marked A A A A. This line contained a period at the end. In this case, the four
A’s, A A A A, should have been recognized as a single character string by a tokenizer that
attempts to find proper tokens.
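A tokenizer of the kind suggested here — one that recognizes a run of spaced single letters as a single unit — can be sketched with a regular expression that rejoins such runs. This is a hypothetical illustration; a real tokenizer would need additional safeguards:

```python
import re

def collapse_letter_runs(text):
    """Rejoin a run of two or more stand-alone single letters into one
    token by deleting the internal blank spaces, so that e.g. 'A A A A'
    becomes the single character string 'AAAA'."""
    return re.sub(r"\b(?:[A-Za-z] )+[A-Za-z]\b",
                  lambda m: m.group(0).replace(" ", ""), text)

print(collapse_letter_runs("were marked A A A A."))
print(collapse_letter_runs("BY D A V I D S A M W E L L,"))
```

Note that multi-letter words adjacent to a run (e.g., BY above) are left untouched, while the letter-spaced name collapses into one unit, which is what an aggressive match of I D against such text would require.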
Table 40
Lines Containing Abbreviated Words with Infix Blank Space

Punctuated   Gutenberg                      Usenet
Strings
A A          were marked A A A A.           Share Business ExchangeTwitterFacebook| Email | Print | A A A
                                            * Size A A A
                                            7. H A A G U N
                                            A A A A TYPE(one_type) :: ot(:)
                                            A A A A ot = one_func(a, b, c, update(tt)) ! -- A????
                                            A A A A TYPE(two_type) :: tt
                                            A A A A - A A A A TYPE(two_type) :: tt
                                            A A END FUNCTION one_func
                                            A A END TYPE one_type
                                            A A END TYPE two_type
                                            A A FUNCTION one_func(a, b, c, callback)
                                            A A TYPE one_type
                                            A A TYPE two_type
                                            A A USE one
                                            D G A A B B A A 7. 0 0 H A A G U N
                                            British A A Driving Directions
                                            Career As A A Graphic Designer
                                            Find A A Person
                                            Writing A A Essay
A. A.        (see Figure 19)                them from A. A. Attanasio's "In Other Worlds"
                                            part of the Radix that cause more than half our ills. Dr. A. A. Brill, an American
I D          BY D A V I D S A M W E L L,    D I D Motorcycle Wheels
                                            Gold Diamond I D Bracelet
                                            Hill's I D Pet Food
                                            Hills I D Canine Pet Food
                                            Hills I D Pet Food
                                            Juvenile Diabete I D Bracelets
                                            K I D Productions Mini Truck
                                            Medical Alert I D Bracelet
                                            Medical Alert I D S
                                            Photo I D Software
                                            Science Diet I D Canine Recall
                                            Science Diet I D Recall
                                            Science Diet I D Reviews
I. D.        (none)                         (none)
There were numerous instances of A. A. in the Gutenberg dataset. Figure 19 shows 24 text lines that contained the string A. A.; most of them were used in the context of personal name references. Similarly, there were two occurrences of text containing A. A. in the Usenet dataset that were
used in the context of personal names. As suggested in the previous chapter, as the length of the
alphanumeric portion of string increases the punctuation marks in the string may have less
semantic effects on the string.
Another important implication of this example is that all four variations of the character
strings − A.A., AA, A. A., and A A − need to be examined collectively because of the blank
spaces in some of these character strings. In these types of cases, it is difficult to devise blank
space normalization rules that do not alter character string senses. Yet in the case of A.A. and
A. A., unlike AA and A A, the character strings appear to have equivalent senses. Thus, the
effects of blank space normalization need to be examined together with these types of cases.
Index Text Lines
1. annotated by E. H. and E. W. Blashfield and A. A. Hopkins. Passing
2. Ill., The Berlin Machine Works, Beloit, Wis., A. A. Loetscher,
3. Mrs. Padraic Colum, Mr. A. A. Boyden, Mr. Ellery Sedgwick, Mr. Henry A.
4. REV. C. A. A. TAYLOR, OCALA, FLA. 80 R. A. A. L.. A. A. Bungart, J. C. McDaniel
5. In the spring of 1862, the Rev. A. A. Miner, D.D., was elected to
6. TWX COL. A. A. NEUMAN TO WHITE HOUSE:
7. TWX COL. A. A. NEUMAN TO WHITE HOUSE:
8. Diseases of the fetlock, ankle, and foot. By A. A. Holcombe 395
9. adjutant was Lieutenant A. A. Ross, who rose from the ranks and some
10. Carrol D. Bush, A. A. Bungart, J. C. McDaniel
11. MEMBERSHIP--Mrs. S. H. Graham, A. A. Bungart, Mrs. Herbert Negus,
12. untiring hospitality. To Miss A. A. Pirie, who was with us for the
13. A. M. mi. N Nejapa, Oaxaca, taken by A. A. Alcorn, on August 6, 1955,
14. collected on June 19, 1954, by A. A. Alcorn, are from 1/2 mi. E Piaxtla
15. _CAMPING OUT._ By A. A. MACDONELL. [Double vol., 2_s._]
16. They are a goodly number, and include Mr. A. A. Arnold, Mr. Stephen T.
17. (1950); J. R. and A. A. Alcorn (1954 and 1955); R. H. Baker and a party
18. by J. R. Alcorn and A. A. Alcorn in 1955 and 1956. A few
19. Miss A. A. Cowan, 30, Gauden Road, Clapham, London, S.W.
20. of subsistence for the men of his detail-- The A. A. Q. M. will A. MILNE
21. Miss A. A. Cowan, 30, Gauden Road, Clapham, London, S.W.
22. November--having as his Chief of Staff and A. A. General, Captain
23. William O. Kimball of Boston, Mrs. A. A. Lord of Newton, Mrs. Charles M.
24. amongst them Dr. A. A. Riddel--my "Archie"--and my dearest friend Dr.
Figure 19. Text lines containing A. A. from the Gutenberg dataset.
Punctuation Normalization Framework
Until now, the two types of punctuation normalization discussed were LI punctuation normalization and LO punctuation normalization. In addition, there were distinctive operations associated with these two punctuation normalization types: blank space normalization and various punctuation normalization operations (e.g., remove prefix ◊).
While each punctuation normalization problem was isolated from other text normalization issues, the procedures can often be co-dependent on each other. In the overall framework of punctuation normalization, one type of procedure has a direct effect on the other. The relationships among these punctuation normalization procedures and the components of a text retrieval environment are shown in Figure 20.
In this figure, punctuation normalization is a common required procedure for both sides of the text retrieval environment. That is, there is the content side and the search side. LO punctuation normalization and LI punctuation normalization are directly associated with additional notions such as disambiguation, blank space normalization, and text normalization operations. These two types of normalization can be treated independently, or they can be integrated into a particular design.
As the figure suggests, disambiguating punctuation marks and punctuated character strings is sometimes essential for effective punctuation normalization. As indicated earlier, the need to disambiguate the role of punctuation marks often arises. For example, a person’s name such as
Jones, B. contains a comma and a period. These punctuation marks cannot simply be removed when the user’s intention is to search for the name exactly as it appears, as it could be useful for a user to find a bibliographic reference that contains the above string pattern. LI punctuation marks, by contrast, do not apply semantic effects at a lexical level.
Figure 20. Punctuation normalization framework.
By the process of elimination, LI punctuation marks can aid in determining the role of
LO punctuation marks, and LO punctuation marks can aid in determining the role of LI punctuation marks. For example, by identifying a comma after a period, we know that the period is more likely used for an abbreviation than for ending a sentence. Text features such as capitalization, the length of the character string in an LO context, the order of punctuation marks that appear in a character string, and co-occurrences of punctuation marks within a text segment are all important attributes that can aid in disambiguating the role of punctuation marks. There have been attempts to disambiguate the role of punctuation marks, such as the disambiguation of the period suggested by Grefenstette and Tapanainen (1994) and by Kiss and Strunk (2006). Such
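The surface features just listed can be combined into a simple rule-based sketch for guessing a period's role. The heuristics and the abbreviation list below are hypothetical illustrations, loosely inspired by the cited work rather than reproducing either algorithm:

```python
ABBREV_HINTS = {"mr", "dr", "rev", "col", "vs", "etc"}  # illustrative only

def period_role(token, next_token):
    """Guess whether the period ending `token` marks an abbreviation (an
    LO punctuation mark) or a sentence boundary (an LI punctuation mark),
    using simple surface features of the token and its right neighbor."""
    if not token.endswith("."):
        return "none"
    core = token.rstrip(".")
    if core.lower() in ABBREV_HINTS or "." in core:   # e.g. Mr., B.C.
        return "LO (abbreviation)"
    if next_token and next_token[0] in ",;:)":        # e.g. 'Jones, B., et'
        return "LO (abbreviation)"
    if len(core) <= 2 and core.isupper():             # e.g. AA. -- unclear
        return "ambiguous"
    if next_token and next_token[0].isupper():        # likely new sentence
        return "LI (full stop)"
    return "ambiguous"

print(period_role("Mr.", "Smith"))   # LO (abbreviation)
print(period_role("B.C.", "and"))    # LO (abbreviation)
print(period_role("me.", "The"))     # LI (full stop)
```

Short all-caps tokens such as AA. remain ambiguous under these rules, which matches the corpus observations above: such cases require inspecting neighboring words.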
efforts are needed to disambiguate more types of punctuation marks in a corpus, particularly for
the purpose of distinguishing LI punctuation marks from LO punctuation marks.
Earlier, we pointed out that blank spaces had an effect on matching character strings or
phrases. In particular, one phenomenon associated with blank spaces was the fact that removing
infix-punctuation marks produces blank spaces. Blank space behavior involving abbreviated
character strings was also demonstrated: we compared the effects of having a blank space in the
middle of an abbreviated character string and of not having one. The variability of the blank space character had to be
taken into account in all aspects of processing punctuation marks. Identifying tokenization
points can become a problematic issue since some units such as A A A should have been treated
as one unit when searching for A A. Also, we had to keep in mind that blank spaces were not
always consistently used (e.g., A.A. instead of A. A.). The demonstration procedures performed
showed that blank spaces and search query types were strongly linked to the problem of
punctuation normalization.
We should stress that, frequently, matching a character string to another character
string does not involve any punctuation marks at all (e.g., healthcare to health care). From this
standpoint, issues related to punctuation marks could be viewed as a somewhat independent
issue. Still, blank space normalization is a special type of issue that could be examined
along with hyphenated character strings. Also, since compound words can be hyphenated,
they may still relate to the punctuation normalization issue.
One of the primary goals of this study was to investigate the meaningfulness of
punctuation marks and punctuation patterns. Comprehensively, several concrete factors had to
be considered in assessing punctuation marks, and for this reason a simplistic decision could not be reached.
Chapter Summary
This chapter has presented qualitatively analyzed results of punctuation normalization
strategies. An analytical, observational approach was undertaken to identify various types of
punctuation normalization: LI normalization, LO normalization, blank space
normalization, and normalization operations. LO punctuation normalization focuses on
normalizing LO punctuation marks, whereas LI punctuation normalization focuses on
normalizing LI punctuation marks.
In addition to these two types of punctuation normalization is blank space normalization.
Blank space normalization and tokenization play a critical role in punctuation normalization as
well as disambiguating the role of punctuation marks and punctuated character strings. Blank
space normalization was categorized into its operational type: add, delete, or replace.
Punctuation normalization decisions may depend upon the type of search queries. In
particular, the effects of punctuation normalization were investigated based on different types of
exact search queries and approximate search queries. This study suggests that search queries can
be largely divided into the following: exact search query, approximate search query, punctuated
search query, or non-punctuated query. Approximate search queries could be conservative or
aggressive in this study, but the criteria had to be defined clearly by a designer.
Semantic relationships between character strings were visually represented using a Venn
diagram. We also discussed how the disambiguation of punctuation marks can be a vital aspect
of punctuation normalization. Normalization demonstrations were conducted to gain additional
insights into the punctuation normalization effects. The demonstrations indicated that characterizing character string meaning was important for different forms of abbreviated character strings. Finally, a discussion of the implications of the overall analysis of punctuation normalization was presented, covering how all of these elements can be taken into account in assessing the meaningfulness of punctuation marks and punctuation patterns.
CHAPTER 6
CONCLUSIONS
This research attempted to identify the implications of normalizing punctuation marks from a text retrieval perspective. We conducted this study to assess the feasibility of a punctuation-centric approach to analyzing issues related to normalizing punctuation marks. For this purpose, we relied on a punctuation frequency analysis to initially explore punctuation marks and understand issues surrounding punctuation normalization. Subsequently, the normalization strategy analysis was performed. We attempted to analyze the effects of punctuation normalization by assuming that a particular set of decisions is required for a text retrieval
system. At the same time, we investigated various factors that need to be taken into consideration for punctuation normalization; we demonstrated this by examining varying types of punctuation marks. The punctuation normalization framework presented in Chapter 5 provides an overview of the complex relationships among factors that need to be considered in normalizing punctuation marks.
Let us examine how we went about finding the detailed issues relevant to punctuation normalization and how we investigated the effects of punctuation normalization. Using two datasets, we first employed a frequency count strategy to characterize punctuation marks. This study mainly used punctuation frequency analysis procedures to analyze punctuation marks, punctuation patterns, and punctuated character strings in the datasets. By exploring punctuation marks, we came to a conclusion that dividing punctuation marks into two groups, namely Lexical
Independent (LI) punctuation marks and Lexical Oriented (LO) punctuation marks, was useful.
This punctuation mark categorization was important to understand and distinguish the types of punctuation marks from a text retrieval perspective. In terms of normalizing these broad types of
punctuation marks, an LI punctuation normalization approach was suitable for LI punctuation
marks and an LO punctuation normalization approach was suitable for LO punctuation marks.
In the analysis of punctuation normalization procedures, blank spaces and search query
types were examined in conjunction with punctuation normalization. In particular, by examining
the semantic differences between abbreviated character strings and non-abbreviated character
strings, we identified potential semantic effects that were associated with LO punctuation marks.
Toward the end of this study, we discussed different types of semantic effects that punctuation
normalization can have on LO punctuation marks.
In many respects, the role of an individual punctuation mark is a critical aspect of
punctuation normalization. We have shown that, depending on the role of a punctuation mark, its effects on the character string differed. We further discussed how
disambiguation of the individual role of punctuation marks and associated punctuated character
string was necessary in determining the type of punctuation normalization technique to apply.
Some punctuation marks have multiple roles in our language. This often causes problems by raising the complexity of the character string, since a system needs to determine the roles of punctuation marks computationally. Determining a punctuation mark's role meant assigning it to the proper category of punctuation marks (e.g., LI punctuation marks or LO punctuation marks). Once the type of punctuation mark was determined, an appropriate approach could
be undertaken to normalize punctuation marks.
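The role-then-category-then-procedure sequence just described can be sketched as a dispatch between the two normalization approaches. This is an illustrative sketch only: the function name, the category-as-input design, and the specific variant rules are assumptions for illustration, not the dissertation's procedure.

```python
def normalize_token(token: str, category: str) -> list[str]:
    """Apply the normalization procedure matching a resolved category."""
    if category == "LI":
        # Lexical independent: the mark only delimits text; strip it.
        return [token.strip(".,;:!?")]
    if category == "LO":
        # Lexical oriented: the mark is part of the lexical item; keep
        # the original form and index common variants alongside it.
        variants = {token, token.replace(".", ""), token.replace("-", " ")}
        return sorted(variants)
    raise ValueError(f"unknown category: {category}")
```

Under this sketch, an LI period simply disappears, while an LO form such as B.C. is retained together with its unpunctuated variant for matching.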
Many types of character strings can be formed as a result of punctuation normalization. For example, a hyphenated character string z-z could be normalized to zz, to z z, or left as z-z. Along with this issue, blank spaces and tokenization played critical roles in punctuation normalization. As demonstrated in the study, the problems had to be explored in conjunction with different types of
assumptions related to search features (e.g., different types of approximate searching).
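A minimal sketch of the hyphen-variant issue just described, assuming a hypothetical helper name:

```python
def hyphen_variants(token: str) -> set[str]:
    """Return the three normalization candidates for a hyphenated
    character string, e.g. z-z -> {z-z, zz, z z}."""
    return {token, token.replace("-", ""), token.replace("-", " ")}

# Each candidate tokenizes differently: "zz" is one word, "z z" is two
# words, and "z-z" keeps the punctuation mark, so the index and the
# query processor must agree on which form(s) to store and match.
```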
A somewhat unexpected result of the study was how complex the effects of punctuation normalization can be due to the sense, or meaning, of a character string. The semantic effect of normalizing LO punctuation marks was that the normalization procedure could change the senses associated with the original character strings.
For example, we showed how BC and B.C. were used in different contexts. It is clear
that differences among punctuated character strings and non-punctuated character strings can
potentially produce unexpected search results. Thus, determining the contextual usage is
required in many types of situations. Because of undesirable semantic effects, substituting one
form of a punctuated character string with another form of the character string had to be carefully
examined. In particular, we have explored the semantic problems through the use of Venn diagrams focusing on the abbreviated and non-abbreviated character strings.
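The Venn-diagram analysis above can be illustrated with a toy comparison of result sets. The document identifiers here are entirely hypothetical; the point is only that the punctuated and normalized query forms overlap partially, so substituting one form for the other changes what is retrieved.

```python
# Hypothetical result sets for the two query forms.
hits_punctuated = {"d1", "d2", "d3"}   # documents matching "B.C."
hits_normalized = {"d3", "d4"}         # documents matching "BC"

shared = hits_punctuated & hits_normalized   # retrieved by both forms
lost = hits_punctuated - hits_normalized     # lost after normalizing
gained = hits_normalized - hits_punctuated   # spurious extra matches
```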
Attempts to solve this type of problem could be computationally challenging. A
challenge in improving punctuation normalization is to determine the different senses or meanings of punctuated character strings in the context of improving a text retrieval system. In many instances, simple answers to punctuation normalization were not available. For LO punctuation marks, optimal punctuation normalization is more difficult to achieve because of three issues:
• A singular approach to normalize punctuation is often inadequate because of varying types of punctuated character strings.
• Some LO punctuation marks are highly sensitive to normalization (e.g., the character string C++).
• The semantic effects of LO punctuation marks are difficult to generalize since in each case the specific senses that were involved could be different as a result of LO punctuation normalization.
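The sensitivity of strings like C++ can be seen with a deliberately naive normalizer. This is a sketch of what can go wrong, not a recommended procedure:

```python
import string

def naive_strip(token: str) -> str:
    # Naive normalization: delete every punctuation character.
    return token.translate(str.maketrans("", "", string.punctuation))

# "C++" collapses to the single letter "C", conflating a programming
# language name with an unrelated term; "B.C." collapses to "BC",
# merging the era abbreviation with other uses of the letters BC.
```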
Limitations of Study
There were a number of limitations of this study that need to be mentioned. Some of the limitations were due to the fact that we had to constrain the analysis setting to control the study, while other limitations were associated with the findings of this study. As we pointed out, the complexity of punctuation normalization is enormous, and it requires further investigation. More specifically, the following are noteworthy limitations of the study:
First, although punctuation marks were extensively analyzed, a more thorough investigation of punctuation patterns could have been made. Some examples were used extensively in this study, but we did not investigate LO punctuation patterns comprehensively. In the previous chapter, we pointed out that un-normalized and normalized punctuated character strings
could be equivalent to each other depending on their contextual usage. Additional analysis that
focused more on the effects of LO punctuation normalization could have discovered more sub-
categories of “equivalent depending on the contextual usage” since the effects in this category
were rather simplified.
Second, this research indicated that there are cases of domain- and genre-specific punctuation marks. However, these results were based on the examination of only two datasets. We did not sufficiently examine domain-specific punctuation patterns or the character strings that involve domain-specific punctuation marks. Identifying domain-specific punctuation marks across different types of domains could provide certain domain-specific knowledge, but such domain-specific punctuation patterns were not fully explored.
Third, the framework of punctuation normalization in the previous chapter pointed out
the need for disambiguation of the roles of punctuation marks. Yet we were unable to provide
sufficient details on specific methods to disambiguate all types of punctuation mark roles. While
previous research has suggested the feasibility of disambiguating the role of punctuation marks, a computational approach to determining whether the role of a punctuation mark is LO or LI might be challenging.
Fourth, there could be limitations due to the various assumptions that we laid out in conducting this study. For instance, the search terms we considered applied only to short queries that involved just 2-3 words. The indexing we relied upon was word-based; as pointed out in this study, word-based indexing is not the only internal data structure that can support string matching. Because our research was based on such restrictive assumptions, there could be other punctuation normalization approaches than the ones we considered. Although the overall framework might still be applicable and useful for understanding the issues, under other scenarios the effects of punctuation normalization could differ and might require elaborate tuning.
Lastly, our research did not provide concrete, procedural methods to improve the overall effectiveness of text retrieval systems. In general, text contains a vast number of punctuation marks that need to be normalized in one way or another. In many instances, we have shown specific cases where proper punctuation normalization can lead to optimal search results for a user. Yet we have not shown concrete procedural steps for normalizing punctuation patterns. Moreover, admittedly, there are certain text elements inherent to our language that do not necessarily involve punctuation marks. For these reasons, the effect this research has on the overall performance of text retrieval systems has not yet been proven. Instead, by focusing on punctuation marks, we attempted to shed light on text retrieval and text normalization issues as a whole.
Broader Implications of the Study
In a broader sense, we provided some noteworthy findings that could point toward designing more effective text retrieval systems. This research has shown that a punctuation-centric view was a feasible way to tackle the punctuation normalization problem. After conducting this study, we realized that punctuation normalization decisions must be made at an extremely detailed level.
Yet, we believe that for effective normalization, determining the individual roles of
punctuation marks in a corpus is essential. While punctuation marks are not particularly
ambiguous to a human, they are often ambiguous from a computational perspective.
Computationally, because one punctuation mark can be used in multiple types of situations, sometimes not even consistently, substantial effort is required to resolve these ambiguities in a corpus. Resolving them would help the overall punctuation normalization process.
Nevertheless, systematically analyzing punctuation marks seemed necessary for all types of domains, even if we are not certain about the outcome of the text retrieval implications in general. Rather than recommending solutions at a detailed level, this study explored the issues
so that practitioners of system design can utilize this research as a means to gain insights into the
relevant factors and effects of punctuation normalization. In the form of sub-research questions
and responses, more implications of punctuation normalization are presented.
Also in a broader sense, the bottleneck to resolving punctuation normalization problems
is due to the complexity of our language and how it is used. In many ways, the issues related to
punctuation normalization touched upon the core issue of language versus the ability to retrieve
relevant texts efficiently. We have shown that punctuation marks often played a critical role in reducing lexical ambiguity by clarifying the notion of character strings in various ways. At the
same time, because variants of punctuated character strings require punctuation normalization, most often for the purpose of retrieving the relevant text, the goal of retrieving only relevant matches is challenging. Moreover, a user’s perspective has a significant impact on how that particular user attempts to find new information, and this should be taken into account in building a representation (O’Connor et al., 2009). In a sense, this study was an initial attempt to build such a representation that could satisfy a user’s information need.
To produce effective search and retrieval results, we believe that an approach to the punctuation normalization problem needs to be systematic at a micro-level. Our results indicated that many punctuation patterns are likely to exist in a corpus. A simple approach to punctuation normalization without understanding the effects is not desirable, since it would not improve the text retrieval process. Thus, the many detailed decisions involved in handling punctuation normalization would require weighing many relevant factors. In analyzing punctuation normalization problems, a systematic method that includes the use of frequency counts needs to be developed. We have shown that frequency counts are a viable tool for uncovering domain- and genre-specific characteristics of punctuated character strings and punctuation marks. There also needs to be a systematic method of evaluating scenarios with various types of search queries for the purpose of tackling issues related to the semantic effects of punctuation marks and their normalization.
Sub-Research Questions and Responses
This research had the overarching research question: What are the implications of normalizing punctuation marks from the perspective of text retrieval? In addition, the study had these research questions:
1. In what ways and to what extent are punctuation marks and punctuated character strings (i.e., character strings that contain punctuation marks) important in terms of text retrieval?
2. How does punctuation normalization affect the process of matching punctuated character strings?
3. What are some of the useful categories of punctuation marks, punctuation patterns, and punctuation normalization?
4. How can we measure similarity between terms in an orthographic sense so that it can be incorporated into normalizing punctuation marks?
5. In what ways can we use underlying punctuation patterns in a text for a text retrieval system?
Answering these sub-research questions provides a way of summarizing our research results efficiently. Before continuing, readers are reminded that the main research question focused on finding implications of the normalization of punctuation marks. The following summarizes findings from the research by answering each of these questions.
First, in what ways and to what extent are punctuation marks and punctuated character strings important in terms of text retrieval? From a text retrieval perspective, the degree of importance was dependent upon the roles and utilization of punctuation marks in text retrieval systems. The degree of punctuation mark importance varied depending upon the specific types of search and retrieval circumstances.
The degree of importance could be examined from the search side first. In some instances, punctuation marks provide users with an ability to search for precise information. It meant that users could be searching for segments of text that contain any type of punctuated
character strings. For example, a search query might contain terms such as the following: X.22,
C++, (MIP)-1alpha, and cat -v. Punctuation marks are important in these types of situations because of the discriminating attribute that they confer on the character string.
In general, there are situations where punctuation normalization is straightforward. Yet one effect of LI punctuation normalization was that two adjacent character strings separated by an LI punctuation mark could often be retrieved as a single continuous phrase. We have shown that the LI punctuation marks present in such character strings were generally less meaningful except in these situations. However, LI punctuation marks as a whole were still essential because of the tokenization points they provide. Also, disambiguating LI punctuation marks meant that nearby LO punctuation marks could be disambiguated collectively.
Also, the longer a combinational punctuation mark, the less meaningful it became in the context of text retrieval. It is highly improbable that the user of a text retrieval system can specify a search query having the exact pattern of a long combinational punctuation mark. We have shown infrequent uses of long combinational punctuation marks in Chapter 4 (see Figure 9 and Table 21). None of these punctuation marks appeared to be particularly common in the context of text retrieval. Even disregarding the semantics of a punctuated character string, to be useful as a search query a pattern must be natural and easy enough for the user to specify exactly, on the assumption that such a query will be processed effectively by a text retrieval system.
In assessing the importance of punctuation marks, the potential utilization and the
adverse effects of punctuation marks can be viewed as important criteria. Based on the roles of
punctuation marks, we have seen that punctuation marks can be used to disambiguate the
character strings in a text. For instance, we pointed out that a period used after a capitalized letter might be an indication of an abbreviated character string.
When the search results of B.C. and BC were compared, at least for the case that we examined, there was a greater sense of abbreviation with the character string B.C. than with the
character string BC. The character string BC could have been used in other contexts (e.g., mathematical equations) rather than as an abbreviated character string. The specific role of the punctuation marks
in this case clarifies that this is an abbreviated character string. This attribute, in turn, can be
viewed as an indication of importance in the context of clarifying the particular form of the
abbreviated character string. We have also demonstrated some of the adverse effects of
punctuation marks using the Venn diagram. From the perspective of improving text retrieval
systems, procedures that could create adverse effects of punctuation normalization need to be
avoided.
Second, how does punctuation normalization affect the process of matching punctuated
character strings? Answering this question required an examination of the goals of text retrieval
systems. We can assume that the responsibility of a text retrieval system is to deliver the results of search requests in an effective manner without overloading a user with irrelevant
information. In this sense, the system needs to carefully sort out the types of punctuation marks
and punctuated character strings for the purpose of normalizing punctuation marks. Simply
eliminating punctuation marks could bring unwanted irrelevant information to the user. In some
situations we examined, simply normalizing punctuation marks could produce adverse effects by
finding too many irrelevant matches.
Punctuated character strings comprise two types of punctuation marks: Lexical
Independent (LI) punctuation marks and Lexical Oriented (LO) punctuation marks. Punctuation
normalization techniques differed depending on these two punctuation mark types. Also,
depending on whether the punctuated character string contained LI punctuation marks or LO
punctuation marks, the effects of these types of punctuation normalization differed. LO
punctuation marks operate at the lexical level. As a result, the variant forms of
punctuated character strings and semantically similar character strings had to be considered in
the matching process.
On the other hand, LI punctuation marks were not associated with the variants of
punctuated character strings. In general, the system could normalize LI punctuation marks by
simply removing them. However, we pointed out that LI punctuation mark locality should be tracked, perhaps by tagging the locations. Otherwise, two adjacent character strings separated by an LI punctuation mark could form a single phrase when the mark is removed; by tracking locality, such neighboring character strings would not be returned for exact phrase searches. Moreover, the punctuation normalization issues also had to be examined according to additional types of searches, namely exact versus approximate searches.
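The locality-tracking idea can be sketched as follows, under the study's word-based indexing assumption. The function names and the simple strip-based tokenizer are illustrative assumptions, not the dissertation's procedure.

```python
def tokenize_with_boundaries(text: str):
    """Strip LI marks from word edges, but record where one occurred
    so phrase matching can refuse to cross that point."""
    tokens, boundaries = [], set()
    for word in text.split():
        core = word.strip(".,;:!?")
        if core:
            tokens.append(core)
        if word != core:                      # an LI mark was stripped
            boundaries.add(len(tokens) - 1)   # boundary after this token
    return tokens, boundaries

def phrase_match(tokens, boundaries, phrase):
    # Exact phrase search that refuses to cross a recorded boundary.
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase and not any(
                j in boundaries for j in range(i, i + n - 1)):
            return True
    return False
```

For the text "red fish. blue fish", the phrase "fish blue" spans the removed period and is rejected, while "blue fish" still matches.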
Third, what are some of the useful categories of punctuation marks, punctuation patterns,
punctuation normalization, and the effects of punctuation normalization? Various categorization
schemes were provided in this study. Based on the frequency count approach, this study
identified various types of punctuation marks and related concepts, different types of punctuation
normalization, and the effects of punctuation normalization.
Table 41 is a summary of the categorization of punctuation marks related concepts that
were discussed in this study.
Table 41.
The Categorization of Punctuation Mark Related Concepts
Punctuation Marks and Punctuation Patterns (types/categorization, with key criteria or attributes):
• Jones punctuation marks (source-independent versus source-specific punctuation marks): linguistic role
• paired punctuation marks: punctuation marks occurring as a pair; delimiting punctuation mark roles
• affixes (prefix, suffix, and infix; punctuation mark exclusive): locality of punctuation marks
• LI punctuation marks and LO punctuation marks: referencing area; role of punctuation marks (sentence level versus lexical level)
• sentential punctuation pattern: sentence-based punctuation
• combinational punctuation mark: multiple combinations of punctuation marks
• lexical-level punctuation pattern: pattern of LO punctuated character string; length of punctuated character string

Punctuation Normalization:
• LI punctuation normalization and LO punctuation normalization: LI and LO punctuation marks; blank space normalization; internal blank spaces

Effects of Punctuation Normalization:
• LI punctuation normalization effects: exact search/approximate search; semantic effects
• LO punctuation normalization effects: blank space effects; exact search/approximate search; semantic effects
Fourth, how can we measure the similarity between terms in an orthographic sense?
In this study, we noticed that an orthographic measurement could provide a quantifiable value in
determining similarity between two terms. In considering a particular formula, we suggested
measuring orthographic distances based on spelling and punctuation mark differences. In
calculating the Punctuated String Distance (PSD), we assumed that a set of alphanumeric
characters were identical. Thus, the PSD metric was used to measure the orthographic distance of character strings having different sets of punctuation marks.
By observing the characteristic of string strength and the senses of lexical items, we
pointed out that the PSD could be an additional attribute to be utilized during the process of
punctuation normalization and building indexes. LO punctuated character strings can often be associated with different types of character string variants. In this study's context, the orthographic distance would be non-zero except for character strings that have exactly the same spelling, including the punctuation marks.
This study indicated that PSD could be measured by adopting algorithms such as the Levenshtein and Stoilos distance measurements. In particular, we showed how these measurements could indicate the sense of “closeness” of two character strings if they meet a certain
threshold value. However, as pointed out in Chapter 4, this measurement alone could be limited
in showing the effects of LO punctuation normalization.
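A sketch of the PSD idea under the study's assumption that the alphanumeric characters match, using the standard Levenshtein dynamic program. The `psd` wrapper and the threshold value in `close_enough` are illustrative assumptions, not the study's definitions.

```python
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def psd(a: str, b: str) -> int:
    # With identical alphanumerics, the edit distance counts only the
    # punctuation mark differences, e.g. "B.C." vs "BC" differs by 2.
    return levenshtein(a, b)

def close_enough(a: str, b: str, threshold: int = 2) -> bool:
    # Illustrative "closeness" test against a chosen threshold.
    return psd(a, b) <= threshold
```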
Lastly, in what ways can we use underlying punctuation patterns that exist in a corpus?
This question was partly answered by the previous question responses. However, this question
was about the punctuation patterns rather than the type of punctuation marks. This study
categorized punctuation patterns as lexical-level and structural-level punctuation patterns. Broadly speaking, we categorized punctuation marks as LI and LO. Nevertheless, since no previous study pointed to the extent of punctuation marks in search queries, this study assumed that an average user cannot be expected to specify a punctuated search query that utilizes any type of underlying punctuation pattern.
On the other hand, from a system’s point of view, utilizing the underlying punctuation pattern to disambiguate the role of punctuation marks and the type of lexical word (e.g., abbreviation) was more critical. The disambiguation of character strings could be accomplished by inspecting different textual features: the type of punctuation mark and punctuation pattern, capitalization, neighboring character strings, etc. For example, a single letter followed by a period often indicates an abbreviated character string.
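The single-letter-plus-period heuristic can be expressed as a small pattern check. This is a hypothetical rule for illustration, not the study's rule set:

```python
import re

# Tokens consisting of one or more letter-period pairs, e.g. "B.C."
# or "U.S.A.", are flagged as likely abbreviations.
ABBREV_PATTERN = re.compile(r"^(?:[A-Za-z]\.)+$")

def looks_abbreviated(token: str) -> bool:
    return bool(ABBREV_PATTERN.match(token))
```

A fuller rule set would also weigh capitalization and neighboring character strings, as noted above.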
To this end, heuristic-based rules for LO punctuation normalization procedures could be crafted specific to a corpus. Developing such rules was not undertaken here. However, our experience with the datasets showed that such tasks require extensive examination of punctuation patterns in character strings. By far the most challenging task seems to be punctuation normalization that maintains the original meaning of a character string while taking into account all the string variations that the user might use in a search query.
Contributions
Numerous contributions of this study can be mentioned. Most of the contributions were
based on the fact that punctuation normalization issues were explored from a text retrieval
perspective. We summarize our contributions according to the following key areas:
• Characterization of punctuation marks and punctuation patterns: By exploring
punctuation marks from a text retrieval perspective, we increased the general understanding of
punctuation mark related issues. In particular, various roles of punctuation marks and
punctuation patterns were explored. We demonstrated how punctuation marks and punctuation
patterns can be utilized in terms of developing punctuation normalization procedures.
• Categorization of punctuation marks and punctuation patterns: Several categories of punctuation marks and punctuation patterns were developed in this study. They were primarily
used as a means to investigate this study’s larger research question. Overall, we demonstrated
how this categorization could aid in understanding the effects of punctuation normalization. This
categorization should be useful for future studies that are related to text retrieval.
• Methodological approach: This study demonstrated that frequency counting is a useful strategy for identifying underlying punctuation mark characteristics. In our case, frequency counts were an effective tool to characterize domain- and genre-specific punctuation marks and punctuated character strings, and numerous important punctuation mark characteristics were obtained by conducting various frequency counts. Many terms, or concepts, were also defined through the study; in turn, these concepts can be used as a basis for investigating any type of punctuation mark related study.
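The frequency count strategy underlying this methodological contribution can be sketched in a few lines (the helper name is an assumption):

```python
from collections import Counter
import string

def punctuation_frequencies(text: str) -> Counter:
    # Tally each punctuation mark occurrence in a text, the basic
    # operation behind the corpus-level frequency analyses.
    return Counter(ch for ch in text if ch in string.punctuation)

freq = punctuation_frequencies("Dr. Smith arrived at 3 p.m., didn't he?")
# Periods dominate here (Dr., p.m.), with one comma, one apostrophe,
# and one question mark.
```

Run over a whole corpus, such tallies surface the domain- and genre-specific distributions discussed above.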
• Punctuation normalization framework: In the previous chapter, we provided a punctuation normalization framework. We categorized punctuation normalization problems into two types, namely LI punctuation normalization and LO punctuation normalization. We also suggested how disambiguation problems can depend on each type of text normalization procedure and often require conceptualization in terms of these punctuation normalization categories. We further suggested the possibility of exploring orthographic distance, particularly the PSD metric, for the purpose of incorporating it into punctuation normalization procedures. Collectively, the framework suggests how the punctuation normalization problem should be approached.
Recommendations for Future Studies
As in many research studies, some additional research topics were identified as a result of
conducting this study. Overall, this research has laid a foundation for a systematic analysis of the punctuation normalization problem. Many detailed issues including the need for algorithms to
disambiguate punctuation mark roles have not been resolved. Research in some of these areas
might further support this study’s findings. Topics for future suggested research areas include the following:
• Additional corpus-based investigation on punctuation patterns
• Domain-specific and genre-specific punctuation mark studies
• Disambiguation of the role of punctuation mark and punctuated character string
• Investigation of the applicability of the PSD measurement
• Punctuation normalization issues in languages other than English
• Comparison of punctuated character strings and non-punctuated character strings based on character string senses
First, more corpus-based investigations of punctuation patterns are desirable. Because insufficient amounts of statistics were available in the literature, it was difficult to differentiate domain-specific punctuation patterns from more general punctuation patterns. More frequency analyses will provide a better understanding of the characteristics related to punctuation patterns.
Since there was little previous research that could be used as a guideline in exploring the frequency strategy to investigate punctuation patterns, identifying many types of punctuated character strings was limited due to time constraints. Future studies can advance this area by utilizing the punctuation mark categories developed in this study. As a result, investigating more punctuation patterns should point toward devising the algorithms that are needed to disambiguate punctuation marks and punctuated character strings.
Second, more research is needed in examining domain- and genre-specific punctuation marks. In this study, we concluded that punctuation mark usage is different depending on the particular domain or corpus. For example, research could be done using a corpus that might represent academic written materials. By analyzing punctuation patterns in such a domain, more
generalizable domain-specific punctuation patterns could be discovered for text retrieval. In conjunction with punctuation marks from natural languages, the underlying punctuation patterns from content areas such as bibliographic references and mathematical equations could be explored.
Third, future research can examine the issue of disambiguation problems by further
examining punctuation marks and punctuated character strings. Although the LI and LO
punctuation marks were established in this study, in terms of the effects of LO punctuation
normalization, sufficient investigations have not been carried out. Thus, any future studies can
simply extend this research by examining LI and LO punctuation normalization. In particular,
this study suggested that generally LO punctuation marks are dependent on punctuated character
strings at a lexical level or inter-lexical level. We pointed out that disambiguating the roles of LI
and LO punctuation marks can help identify the type of character strings such as abbreviated
character strings and non-standard words. Such issues could be examined in terms of developing
algorithms.
Fourth, studies focusing on the applications of the PSD measurement are clearly
desirable. We investigated the notion of the PSD measurement in our study. Yet, as mentioned
in the previous chapter, we have not fully investigated the applicability of the PSD measurement
for text retrieval purposes. By investigating character strings in a corpus, we could find the
specific patterns that are associated with PSD measurements. Also, we considered only
matching sets of alphabetical characters in our study. By extending the problem to non-matching alphabetical characters, we could find more intriguing patterns. For instance, character strings that are misspelled but have the same pattern of punctuation marks could be examined using the same approach.
Fifth, in this study punctuation normalization was focused on the English language only.
While punctuation marks are highly language dependent (Santos, 1998), there are some notable similarities among Indo-European languages, as pointed out by Parkes (1993). Taking the study further, punctuation marks could be examined from an evolutionary point of view (Holden, 2008). It would be interesting to set up corpora to compare punctuation marks in different languages, as Tavecchio (2010) attempted. From this study’s point of view, we would be more interested in punctuation normalization issues in an attempt to develop an accurately responsive text retrieval system.
Lastly, research efforts need to focus more on the comparison of punctuated character strings and non-punctuated character strings based on string senses or meanings. This study indicated that punctuation marks add a different sense to a character string, although un-normalized and normalized character strings can be semantically equivalent depending on the contextual usage. We demonstrated how different punctuated character strings are associated with or change the meaning of a character string. The PSD measurement was important in this type of investigation. The Venn diagram approach to modeling different semantic relationships between normalized and un-normalized character strings could also be extended. In sum, issues related to the senses of individual types of punctuated character strings could provide additional understanding of the semantic effects of LO punctuation normalization.
APPENDIX A
ASCII CODES
The punctuation marks for this study are shown in the following ASCII table. The only exception to this table is the space (decimal value #32), which can be considered a special class of punctuation mark. For this study, the blank space corresponds to the space character (decimal value #32). This item is highlighted in Table A.1. See (Final Text of DIS 8859, 1998) for a complete listing of ASCII and 8-bit ASCII characters. A number of ASCII character variants exist. ISO-8859-1 was intended for Western European languages and is also commonly known as ASCII Latin-1. Decimal values #0 through #127 belong to the 7-bit ASCII code, while decimal values #128 and higher belong only to the 8-bit ASCII code. The term "extended" ASCII table refers to the 8-bit ASCII table.
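The decimal and octal correspondences in Table A.1 can be spot-checked from a POSIX shell; this is a minimal sketch, assuming a standard printf utility:

```shell
# Print the glyph for an octal code: %b interprets backslash escapes,
# so octal 055 (decimal 45) prints the hyphen-minus character.
printf '%b\n' '\055'

# Print the numeric code of a character: per POSIX, a leading quote in a
# %d argument makes printf emit the character's code value.
printf '%d\n' "'-"
```

The same two commands can be used to verify any row of the table, e.g. octal 100 for the commercial at sign.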
The ASCII codes from decimal #128 to #159 are reserved for control purposes (Korpela, 2012b). The character set commonly referred to as the Windows character set, more officially known as Code Page 1252 Windows Latin-1, utilizes some of these character numbers. As shown in this table, characters #128 to #159 from Code Page 1252 Windows Latin-1 were included as punctuation marks. See (Code Page 1252 Windows, 2013) for a complete listing of Code Page 1252 Windows Latin-1.
Table A.1
Punctuation Marks Corresponding to the ASCII Code
DEC OCT Symbol Description
32 040 Space
33 041 ! Exclamation Mark
34 042 " Quotation Mark
35 043 # Number Sign
36 044 $ Dollar Sign
37 045 % Percent Sign
38 046 & Ampersand
39 047 ' Apostrophe
40 050 ( Left Parenthesis
41 051 ) Right Parenthesis
42 052 * Asterisk
43 053 + Plus Sign
44 054 , Comma
45 055 - Hyphen-Minus
46 056 . Full Stop
47 057 / Solidus
58 072 : Colon
59 073 ; Semicolon
60 074 < Less-Than Sign
61 075 = Equals Sign
62 076 > Greater-Than Sign
63 077 ? Question Mark
64 100 @ Commercial At
91 133 [ Left Square Bracket
92 134 \ Reverse Solidus
93 135 ] Right Square Bracket
94 136 ^ Circumflex Accent
95 137 _ Low Line
96 140 ` Grave Accent
123 173 { Left Curly Bracket
124 174 | Vertical Line
125 175 } Right Curly Bracket
126 176 ~ Tilde
128 200 € Euro Sign
130 202 ‚ Single Low-9 Quotation Mark
131 203 ƒ Latin Small Letter F With Hook
132 204 „ Double Low-9 Quotation Mark
133 205 … Horizontal Ellipsis
134 206 † Dagger
135 207 ‡ Double Dagger
136 210 ˆ Modifier Letter Circumflex Accent
137 211 ‰ Per Mille Sign
138 212 Š Latin Capital Letter S With Caron
139 213 ‹ Single Left-Pointing Angle Quotation Mark
140 214 Œ Latin Capital Ligature Oe
142 216 Ž Latin Capital Letter Z With Caron
145 221 ‘ Left Single Quotation Mark
146 222 ’ Right Single Quotation Mark
147 223 “ Left Double Quotation Mark
148 224 ” Right Double Quotation Mark
149 225 • Bullet
150 226 – En Dash
151 227 — Em Dash
152 230 ˜ Small Tilde
153 231 ™ Trade Mark Sign
154 232 š Latin Small Letter S With Caron
155 233 › Single Right-Pointing Angle Quotation Mark
156 234 œ Latin Small Ligature Oe
158 236 ž Latin Small Letter Z With Caron
159 237 Ÿ Latin Capital Letter Y With Diaeresis
161 241 ¡ Inverted Exclamation Mark
162 242 ¢ Cent Sign
163 243 £ Pound Sign
164 244 ¤ Currency Sign
165 245 ¥ Yen Sign
166 246 ¦ Broken Bar
167 247 § Section Sign
168 250 ¨ Dieresis
169 251 © Copyright Sign
170 252 ª Feminine Ordinal Indicator
171 253 « Left-Pointing Double Angle Quotation Mark
172 254 ¬ Not Sign
173 255 Soft Hyphen
174 256 ® Registered Sign
175 257 ¯ Macron
176 260 ° Degree Sign
177 261 ± Plus-Minus Sign
180 264 ´ Acute Accent
181 265 µ Micro Sign
182 266 ¶ Pilcrow Sign
183 267 · Middle Dot
184 270 ¸ Cedilla
185 271 ¹ Superscript One
186 272 º Masculine Ordinal Indicator
187 273 » Right-Pointing Double Angle Quotation Mark
191 277 ¿ Inverted Question Mark
215 327 × Multiplication Sign
247 367 ÷ Division Sign
APPENDIX B
USENET FILES FOR YEAR 2009
news.for.20090101040503.txt.bz2 news.for.20090108040502.txt.bz2 news.for.20090115040502.txt.bz2 news.for.20090122040502.txt.bz2 news.for.20090129040502.txt.bz2 news.for.20090205040502.txt.bz2 news.for.20090212040503.txt.bz2 news.for.20090219040502.txt.bz2 news.for.20090226040503.txt.bz2 news.for.20090305040502.txt.bz2 news.for.20090312040503.txt.bz2 news.for.20090319040503.txt.bz2 news.for.20090326040503.txt.bz2 news.for.20090402040502.txt.bz2 news.for.20090409040502.txt.bz2 news.for.20090416040503.txt.bz2 news.for.20090423040502.txt.bz2 news.for.20090430040502.txt.bz2 news.for.20090507040502.txt.bz2 news.for.20090514040502.txt.bz2 news.for.20090521040503.txt.bz2 news.for.20090528040502.txt.bz2 news.for.20090604040502.txt.bz2 news.for.20090611040502.txt.bz2 news.for.20090625040502.txt.bz2 news.for.20090702040502.txt.bz2 news.for.20090709040502.txt.bz2 news.for.20090716040502.txt.bz2 news.for.20090730040502.txt.bz2 news.for.20090806040503.txt.bz2 news.for.20090813040503.txt.bz2 news.for.20090820040502.txt.bz2 news.for.20090827040502.txt.bz2 news.for.20090903040502.txt.bz2 news.for.20090910040502.txt.bz2 news.for.20090917040502.txt.bz2 news.for.20090924040503.txt.bz2 news.for.20091001040502.txt.bz2 news.for.20091008040502.txt.bz2 news.for.20091015040503.txt.bz2 news.for.20091022040502.txt.bz2 news.for.20091029040502.txt.bz2 news.for.20091112040503.txt.bz2 news.for.20091119040502.txt.bz2 news.for.20091126040502.txt.bz2 news.for.20091203040502.txt.bz2 news.for.20091210040502.txt.bz2 news.for.20091217040502.txt.bz2 news.for.20091224040503.txt.bz2 news.for.20091231040502.txt.bz2
APPENDIX C
SUMMARY OF GREP COMMANDS
In this Appendix, the grep commands that were used in this study are described in further detail. Grep searches the named input files for lines containing a match to the given pattern. By default, grep prints the matching lines. The grep command can be used in the following manner:
grep [options] PATTERN filename
Several options are available with the grep command. The -c option was mainly used to perform the frequency counts. The PATTERN needs to contain details about the strings that need to be matched.
In Chapter 3, the following grep commands were used:
• The command that was used to calculate the occurrences of punctuated tokens was
grep -c '^[^0-9a-zA-Z]*$' filename
The above grep command matches tokens that consist entirely of non-alphanumeric characters. The filename is simply the name of a file that contains a token list produced by splitting the text on blank spaces. The caret symbol ^ matches the expression at the start of a line, while the dollar sign $ matches the expression at the end of a line. The asterisk * matches zero or more occurrences of the previous character. Furthermore, a bracket expression is a list of characters enclosed by the brackets []. A caret ^ as the first character inside the brackets matches any character not in the list.
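As a quick illustration, with a hypothetical token list (one token per line, produced by splitting on spaces), the command counts only tokens made up entirely of punctuation:

```shell
# Hypothetical token file: one token per line.
printf '%s\n' 'hello' '--' 'world,' '...' '!!!' > tokens.txt

# Count tokens consisting solely of non-alphanumeric characters.
grep -c '^[^0-9a-zA-Z]*$' tokens.txt    # → 3 (--, ..., and !!!)
```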
• The command that was used to calculate the occurrences of terms that end with ( 's ) was
grep -c "'s$" filename
In the above grep command, the pattern inside the quotation marks " describes the character strings that need to be matched. The apostrophe-s with dollar sign 's$ specifies that lines ending with apostrophe-s need to be matched. The -c option was used to count the lines containing the pattern.
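A small demonstration with a hypothetical token list (the \$ inside double quotes passes a literal dollar sign to grep):

```shell
# Hypothetical token file; only tokens ending in apostrophe-s match.
printf '%s\n' "man's" "don't" "father's" "cat" > tokens.txt
grep -c "'s\$" tokens.txt    # → 2 (man's and father's)
```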
• The command that was used to calculate the occurrences of hyphen-infix punctuated tokens
with two prefix characters was
grep '^[a-zA-Z0-9][a-zA-Z0-9]-[a-zA-Z0-9]' filename
In the above grep command, the pattern ^[a-zA-Z0-9][a-zA-Z0-9] states that character strings that begin with two alphanumeric characters need to be matched. Then a hyphen followed by any alphanumeric character is matched.
• The command that was used to calculate the occurrences of all tokens that contain a hyphen
was
grep -c $'\055' filename
In the above grep command, the pattern inside the quotes specifies the octal value 055, the ASCII code for the hyphen. Using this command, any token containing the character with octal value 055 was matched and the occurrences were counted.
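For illustration, with a hypothetical token list ($'\055' is ANSI-C quoting, available in bash and ksh, which expands to the hyphen; adding -e guards against the expanded "-" being parsed as an option):

```shell
# Hypothetical token file.
printf '%s\n' 'so-called' 'hello' 'one-way' > tokens.txt

# Count tokens containing octal 055, i.e. the hyphen.
grep -c -e $'\055' tokens.txt    # → 2
```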
• The commands that were used to calculate the occurrences of all tokens that contain paired punctuation marks were the following:
"x□x" grep ' "[a-zA-Z0-9].*[a-zA-Z0-9]" ' filename
*x□x* grep " \*[a-zA-Z0-9].*[a-zA-Z0-9]\* " filename
(x□x) grep " ([a-zA-Z0-9].*[a-zA-Z0-9]) " filename
'x□x' grep " '[a-zA-Z0-9].*[a-zA-Z0-9]' " filename
[x□x] grep " [[a-zA-Z0-9].*[a-zA-Z0-9]] " filename
_x□x_ grep " \_[a-zA-Z0-9].*[a-zA-Z0-9]\_ " filename
/x□x/ grep " \/[a-zA-Z0-9].*[a-zA-Z0-9]\/ " filename
.x□x. grep " \.[a-zA-Z0-9].*[a-zA-Z0-9]\. " filename
251
-x□x- grep " \-[a-zA-Z0-9].*[a-zA-Z0-9]\- " filename
...x□x... grep " \.\.\.[a-zA-Z0-9].*[a-zA-Z0-9]\.\.\. " filename
%x□x% grep " \%[a-zA-Z0-9].*[a-zA-Z0-9]\% " filename
#x□x# grep " \#[a-zA-Z0-9].*[a-zA-Z0-9]\# " filename
The period . matches any character and the asterisk * matches zero or more occurrences of the preceding character. Collectively, they are used to match any characters between the two alphanumeric characters, as shown in each command. As in the case of the previous grep commands, the pattern [a-zA-Z0-9] matches an alphanumeric character. The backslash \ is an escape character: it tells the system that the following character is to be taken literally rather than as a grep metacharacter.
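As an illustration with a hypothetical sample line, the surrounding spaces in each pattern force the paired-punctuation token to be space-delimited within the line:

```shell
# Hypothetical sample text containing a quoted and a parenthesized token.
echo 'he said "hello there" loudly (so to speak) today' > sample.txt

grep -c ' ([a-zA-Z0-9].*[a-zA-Z0-9]) ' sample.txt    # → 1, matches (so to speak)
grep -c ' "[a-zA-Z0-9].*[a-zA-Z0-9]" ' sample.txt    # → 1, matches "hello there"
```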
• The command that was used to calculate the occurrences of all tokens with a single-letter-infix-slash (e.g., w/o) was
grep -c '^[a-zA-Z]/[a-zA-Z]$' filename
The above grep command specifies to match one alphabetical character, followed by a slash, followed by another alphabetical character. The caret ^ indicates the beginning of the line and the dollar sign $ indicates the end of the line. Using this command, only tokens that contain a single-letter-infix-slash are matched and the occurrences were counted.
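With a hypothetical token list, the anchors make the pattern match exactly letter-slash-letter and nothing longer:

```shell
# Hypothetical token file.
printf '%s\n' 'w/o' 'b/c' 'and/or' 'http/1.1' > tokens.txt

# Anchored pattern: exactly one letter, a slash, one letter.
grep -c '^[a-zA-Z]/[a-zA-Z]$' tokens.txt    # → 2 (w/o and b/c)
```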
APPENDIX D
INFIX-PUNCTUATION PATTERN (GUTENBERG TO USENET)
Infix-punctuation Pattern   Gutenberg   Usenet   Relative % Difference
z-z 145903 112948 22.59%
z'z. 3245 4558 28.81%
z.z., 2077 1454 30.00%
z-z-z 5288 7827 32.44%
z'z 273112 445472 38.69%
z.z.z. 2885 1658 42.53%
z.z. 6837 13032 47.54%
z-z. 17735 7949 55.18%
z'z, 6601 2950 55.31%
"z'z 24282 8828 63.64%
z.z.z 1036 3283 68.44%
z,z 2728 10325 73.58%
z-z, 36160 8862 75.49%
z.z, 465 2233 79.18%
(z'z 340 2212 84.63%
z.z 4017 27614 85.45%
z:z, 302 2608 88.42%
$z.z 270 2746 90.17%
$z,z 197 2040 90.34%
z(z) 170 2828 93.99%
z/z, 88 1495 94.11%
z/z. 73 1628 95.52%
z/z 821 21281 96.14%
z:z 298 10273 97.10%
z--z 132745 3576 97.31%
z_z 47 2710 98.27%
z&z 16 1306 98.77%
z/z/z 33 3547 99.07%
z....z 12 2340 99.49%
z...z 35 9058 99.61%
z(z 4 1577 99.75%
z..z 2 1301 99.85%
APPENDIX E
TOP RANKED FREQUENTLY USED INFIX-APOSTROPHE (GUTENBERG)
14366 don't
6173 didn't
5515 it's
5247 can't
3472 man's
3456 won't
3185 that's
2940 father's
2872 It's
2848 couldn't
2779 wouldn't
2726 you're
2721 he's
2699 isn't
2653 wasn't
2490 "It's
2302 ain't
2140 there's
2102 mother's
2031 "That's
1906 you'll
1873 one's
1763 doesn't
1586 "Don't
1527 haven't
1496 That's
1490 he'd
1438 hadn't
1410 you've
1381 God's
1350 Don't
1335 we'll
1279 you'd
1218 King's
1134 He's
1117 There's
1089 woman's
1076 she's
1060 we're
1056 they're
APPENDIX F
TOP RANKED INFIX-APOSTROPHE S ('S) (GUTENBERG)
33 Crow's
33 brother's,
33 Aristotle's
33 agent's
33 "isn't
32 ye've
32 wolf's
32 Smollett's
32 sea's
32 Quixote's
32 Prophet's
32 Poe's
32 Ne'er
32 Mason's
32 Lamb's
32 Hugo's
32 Horace's
32 Harold's
32 Guy's
32 Fletcher's
32 CHRIST'S
32 chap's
32 Chancellor's
32 can't,"
32 CAN'T
32 Browne's
32 bride's
32 Beethoven's
32 Angel's
32 Andy's
32 Amy's
32 afternoon's
32 _don't_
32 "she's
31 Wright's
31 Willie's
31 Taine's
31 steamer's
31 spider's
APPENDIX G
LESS FREQUENTLY USED INFIX-APOSTROPHE (GUTENBERG)
32 ye've
32 wolf's
32 Smollett's
32 sea's
32 Quixote's
32 Prophet's
32 Poe's
32 Ne'er
32 Mason's
32 Lamb's
32 Hugo's
32 Horace's
32 Harold's
32 Guy's
32 Fletcher's
32 CHRIST'S
32 chap's
32 Chancellor's
32 can't,"
32 CAN'T
32 Browne's
32 bride's
32 b'lieve
32 Beethoven's
32 Angel's
32 Andy's
32 Amy's
32 afternoon's
32 _don't_
32 "she's
31 Wright's
31 Willie's
31 Taine's
31 steamer's
31 spider's
31 Sarah's
31 Quincey's
31 peasant's
31 mother's.
31 Irving's
31 Hope's
31 Goldsmith's
31 fortnight's
31 Fanny's
31 emperor's
31 D'ye
31 Dolly's
31 didn't,"
31 Daddy's
APPENDIX H
TOP RANKED INFIX-HYPHEN (GUTENBERG)
793 so-called
527 long-term
359 right-wing
288 on-line
242 built-in
241 well-known
212 no-one
199 On-line
173 short-term
171 health-care
149 follow-up
142 full-time
140 one-way
131 year-old
124 non-profit
119 high-speed
117 left-wing
117 Jean-Luc
113 man-made
112 off-topic
110 14-year-old
109 would-be
109 plug-in
108 same-sex
103 part-time
103 old-fashioned
98 low-level
98 all-time
97 two-thirds
97 real-world
94 low-income
94 long-time
93 al-Qaeda
93 African-American
92 re-read
APPENDIX I
SUFFIX PUNCTUATION MARKS (GUTENBERG)
2251172 ,
1431821 .
207088 ;
105465 ."
89137 ,"
61106 :
52848 !
50901 _
43716 ?"
42886 ?
41892 '
31075 !"
30974 "
21963 )
19876 _,
16115 ]
14562 ._
13014 .,
8985 _.
8182 ),
7513 ,'
7294 --
7141 .'
5373 ).
5282 .]
4334 --"
3709 .)
3194 ....
2645 ?'
2106 _)
2070 !'
2047 }
1904 _;
1890 :--
1801 ',
1728 );
1698 ;"
1568 ----"
1343 ";
1319 ,--
1062 ...
1002 .--
980 .;
977 '.
943 =
933 -
820 _."
809 .'"
809 ._]
804 ],
778 ",
776 _:
745 _).
717 .=
689 _,"
685 ._)
656 !)
646 ,_
615 ].
595 ...."
589 ----
587 ,)
567 ".
549 ._,
524 !_
520 ")
503 _!
459 _"
458 "),
447 ..."
443 :_
417 !_"
416 _]
389 _),
371 ."]
369 ';
352 '"
351 /
351 _!"
336 ?_
328 *
318 .),
313 |
311 .*
297 ._"
289 ;'
271 _?
266 .:
265 '."
261 !'"
257 ):
248 %
245 ')
234 _?"
233 '?"
231 ._:
224 .).
223 ."_
APPENDIX J
FIRST TWO LETTERS WITH INFIX-HYPHEN (GUTENBERG)
to-night to-morrow. br-read, dr-rink!" to-day, re-exports; re-imports. so-called to-day. ex-Premier's to-morrow up-to-date Li-a-bil-ity so-styled re-ception to-day? to-day, to-day, to-day. to-day. to-night," to-night to-morrow. To-day, to-night? To-night, to-night to-night. up-river to-day," to-day." to-morrow. to-night," to-night, to-night, to-night. in-sensitiveness to-morrow re-edified, to-day," to-day; to-day. re-echoing to-day. So-and-so, Me-doras re-appeared. to-day, in-dulgences, to-day." he-doll,) un-frequent re-echoing to-morrow," up-to-date to-morrow
to-night." to-night to-night. to-morrow, to-night; to-day?" go-cart to-day, re-created Bo-peep.] Bo-peep Bo-peep Bo-peep mo-ther al-ways mo-ther fa-ther's mo-ther be-fore in-stead; al-ways ar-bour ar-bour ar-bour, ar-bour be-fore. ex-student 40-pagxan to-day, to-day. ha-has,
APPENDIX K
FREQ. COUNTS OF SINGLE-LETTER-INFIX-SLASH (USENET)
168 I/O
91 w/o
57 b/c
41 N/A
37 A/C
31 i/o
25 a/c
21 O/S
16 s/w
13 A/B
13 A/V
13 y/o
12 S/N
11 c/o
10 m/s
10 o/p
10 v/s
9 d/l
8 M/C
8 m/n
7 a/b
7 b/w
7 G/K
7 L/D
7 s/n
7 S/S
6 B/W
6 e/w
5 C/I
5 D/A
5 g/f
5 n/a
5 o/s
5 S/R
5 v/o
4 b/d
4 C/S
4 G/B
4 G/H
4 H/K
4 N/R
4 N/S
REFERENCES
About Ngram Viewer (2012). Retrieved June 28, 2013 from http://books.google.com/ngrams/info
Anderson, J. & Perez-Carballo, J. (2001). The nature of indexing: how humans and machines analyze messages and texts for retrieval - Part I: Research, and the nature of human indexing. Information Processing and Management, 37, 255-277.
Baeza-Yates, R. (2004). Challenges in the interaction of information retrieval and natural language processing. In Proceedings of 5th international conference on Computational Linguistics and Intelligent Text Processing (CICLing), Volume 2945 of Lecture Notes in Computer Science, 445–456. Springer.
Baeza-Yates, R. & Ribeiro-Neto, B. (2011). Modern information retrieval: the concepts and technology behind search (2nd ed.). ACM Press Books.
Blair, D. & Maron, M. (1985). An evaluation of retrieval effectiveness for a full-text document- retrieval system. Communications of the ACM, 28(3), 280-299.
Blair, D. & Maron, M. (1990). Full-text information retrieval: Further analysis and clarification. Information Processing & Management. 26(3), 437-447.
Blogger. (2012). Search Quality Highlights: 50 Changes for March. Retrieved February 10, 2012, http://insidesearch.blogspot.kr/2012/04/search-quality-highlights-50-changes.html
Briscoe, E. (1994). Parsing (with) punctuation etc. Technical Report MLTT-TR002. Rank Xerox Research Centre, Grenoble.
Briscoe, E. & Carroll, J. (1997). Automatic extraction of subcategorization from corpora. In Proceedings of the 5th ACL Conference on Applied Natural Language Processing, Washington, DC. 356-363.
Brooks, T. A. (1998). Orthography as a fundamental impediment to online information retrieval. JASIS. 49.
Cheng, L. (n.d.). The evolution of English writing system: A review of spelling reform. Retrieved Jan 6, 2013 from http://research.ncl.ac.uk/ARECLS/vol4_documents/cheng.pdf
Chubak, P. & Rafiei, D. (2010). Index structures for efficiently searching natural language text. Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM '10). ACM, New York, NY, USA, 689-698. doi:10.1145/1871437.1871527 http://doi.acm.org/10.1145/1871437.1871527
Cleverdon, C. (1972). On the inverse relationship of recall and precision. Journal of Documentation, 28, 195-201.
Code for Information Interchange. (1968). Federal information processing standards publication (FIPS Pub). 1. National Bureau of Standards. Washington, D.C.
Code Page 1252. (n.d.). Retrieved June 2, 2013 from http://msdn.microsoft.com/en-us/library/cc195054.aspx
Corpus Linguistics – The Google Ngram View (2011). Retrieved June 28, 2013 from http://sdhanel.com/corpuslinguitics/ngramviewer.html
Cowie, J. & Wilks, Y. (2000). Information extraction. In R. Dale, H. Moisl, & H. Somers, A Handbook of Natural Language Processing: Techniques and Applications for the Processing of Language as Text. New York, NY: Marcel Dekker, Inc.
cuft. (2013). In Thefreedictionary. Retrieved Mar 25, 2013, from http://acronyms.thefreedictionary.com/cuft
Dale, R. (1992). Exploring the role of punctuation in the signaling of discourse structure. In Proceedings of a Workshop on Text Representation and Domain Modeling, Technical University of Berlin.
Date, C. J. (2004). An introduction to database systems (8th ed.). Addison Wesley.
Davis, J. & Goadrich, M. (2006). The relationship between precision-recall and ROC curves. In the Proceedings of the 23rd International Conference on Machine Learning (ICML).
Domain. (2013). In Merriam-Webster Dictionary. Retrieved June. 27, 2013, from http://www.merriam-webster.com/dictionary/domain
Doran, C. D. (1998). Incorporating punctuation into the sentence grammar: A lexicalized tree adjoining grammar perspective. Ph.D. Dissertations. Retrieved Jan. 28, 2012 from http://repository.upenn.edu/dissertations/AAI9829892
EUROSEC GmbH Chiffriertechnik & Sicherheit. (2005). Input normalization and character encodings. Retrieved June 28, 2012 from http://www.secologic.org/downloads/web/060313_Secologic_InputNormalization.pdf
Final Text of DIS 8859-1. (1998). Retrieved June 28, 2012 from http://www.open-std.org/jtc1/sc2/wg3/docs/n411.pdf
Friedl, J. (2006). Mastering regular expression (3rd ed.). Oreilly: Cambridge, MA.
Gabrilovich, E., Broder, A., Fontoura, M. et al. (2009). Classifying search queries using the Web as a source of knowledge. ACM Transactions on the Web, 3(2), 1-28.
Gaizauskas, R. & Robertson, A. (1997). Coupling information retrieval and IE: A new text technology for gathering information from the web. In Proceedings of RIAO 97: Computer-Assisted Information Searching on the Internet. Montreal, Canada, 356-370.
Gellens, R. (2004). The text/plain format and DelSp parameters. Retrieved Mar 25, 2013 from: http://tools.ietf.org/html/rfc3676
Google Ngram Viewer (2012). Retrieved June 28, 2013 from http://books.google.com/ngrams/graph
Grefenstette, G. & Tapanainen, P. (1994). What is a word, what is a sentence? Problems of tokenization. In Proc. 3rd Int. Conf. Comp. Lexicography. 79-87.
Grishman, R. and B. Sundheim. (1996). Message Understanding Conference - 6: A brief history. In Proceedings of COLING, 466-471.
Haldar, R. & Mukhopadhyay, D. (2011). Levenshtein distance technique in dictionary lookup methods: An improved approach. CoRR abs/1101.1232.
Hitchings, H. (2011). The language wars: A history of proper English. New York: Farrar, Straus and Giroux.
Holden, C. (2008). Punctuation marks in language evolution? Science NOW, 31 January, 2008.
Jansen, B. & Rieh, S. (2010). The seventeen theoretical constructs of information searching and information retrieval. Journal of the American Society for Information Science & Technology, 61(8), 1517-1534.
Jiang, J. & Zhai, C. (2007). An empirical study of tokenization strategies for biomedical information retrieval. Information Retrieval. 10(4-5), 341-363. DOI=10.1007/s10791-007-9027-7 http://dx.doi.org/10.1007/s10791-007-9027-7
Jones, B. (1997). What’s the point? A (computational) theory of punctuation, Ph.D. Thesis, Centre for Cognitive Science, University of Edinburgh, Edinburgh, UK.
Jones, S. & Willett, P. (1997). Readings in information retrieval. San Francisco: Morgan Kaufmann.
Kaufmann, M. & Kalita, J. (2010). Syntactic normalization of Twitter messages. International Conference on NLP, Kharagpur, India.
Kiss, T. & Strunk, J. (2006). Unsupervised multilingual sentence boundary detection. Computational Linguistics. 32(4), 485-525
Korpela, J. (2012a). Characters and encodings. Retrieved Mar 25, 2013 from: http://www.cs.tut.fi/~jkorpela/chars/index.html
Korpela, J. (2012b). A tutorial on character code issues. Retrieved Mar 25, 2013 from: http://www.cs.tut.fi/~jkorpela/chars.html
Krovetz, R. (1997). Homonymy and polysemy in information retrieval. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
(ACL '98). Association for Computational Linguistics, Stroudsburg, PA, USA, 72-79. DOI=10.3115/976909.979627 http://dx.doi.org/10.3115/976909.979627
Kucera, H. & Francis, W. (1967). Computational analysis of present-day English. Providence: Brown University Press.
Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10 , 707 .
Ling, R. & Baron, N. (2007). Text messaging and IM: Linguistic comparison of American college data. Journal of Language and Social Psychology, September 2007, 26, 291-298.
Mackenzie-Robb, L. (2010). Controlled vocabularies vs. full text indexing. Vantaggio Ltd. Retrieved Mar 20, 2013 from http://www.vantaggio-learn.com/White%20papers/vocabularies%20vs.%20text%20indexing.pdf
Manning, C., Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Marton, Y, Mohammad, S., & Resnik. P. (2009). Estimating semantic distance using soft semantic constraints in knowledge-source-corpus hybrid models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. (EMNLP '09), Vol. 2. Association for Computational Linguistics, Stroudsburg, PA, USA, 775-783.
Meyer, C. (1983). A Linguistic Study of American Punctuation, Ph.D. Thesis, University of Wisconsin-Milwaukee, WI.
Mikheev, A. (2000). Document centered approach to text normalization, Research and Development in Information Retrieval (Proceedings of SIGIR), 136-143.
Moalla Z, Soualmia L., Prieur-Gaston E, Lecroq T. & Darmoni S. (2011). Spell-checking queries by combining Levenshtein and Stoilos distances. Proceedings of Network Tools and Applications in Biology.
Moen, W. E. (2001). The metadata approach to accessing government information. Government Information Quarterly. 18(3), 155-165.
Moen, W. E. (2005). Realizing the vision of networked access to library resources: An applied research and demonstration project to establish and operate a Z39.50 interoperability testbed. Retrieved July 21, 2012 from http://www.unt.edu/zinterop
Moens, M. F. (2006). Information extraction: Algorithms and prospects in a retrieval context (The Information Retrieval Series). New York: Springer.
Nadeau, D. & Sekine, S. (2007). A survey of named entity recognition and classification. Linguisticae Investigationes 30(1), 3–26.
Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1), 31-88.
Normalization. (2012). In Merriam-Webster Dictionary. Retrieved Dec. 26, 2012, from http://www.merriam-webster.com/dictionary/normalization
Nunberg, G. (1990). The linguistics of punctuation. CSLI Lecture Notes 18, Stanford, CA.
O’Connor, B. C., Kearns, J., & Anderson, R. L. (2008). Doing things with information: Beyond indexing and abstracting. Westport, CT: Libraries Unlimited.
Osborne, M. (1996). Can punctuation help learning? Retrieved Jan. 28, 2012 from http://citeseer.nj.nec.com/miles96can.html
Pakhomov, S. (2002). Semi-supervised maximum entropy based approach to acronym. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). Philadelphia, July 2002, 160-167.
Palmer, D. (2004). Tokenization and sentence segmentation. In R. Dale, H. Moisl, and H. Somers, editors, Handbook of natural language processing. Marcel Dekker, NY.
Panchapagesan, K, Talukdar, P., Sridhar, N. , Krishna, N., Bali, K, & Ramakrishnan, A.G. (2004). Hindi text normalization, Fifth International Conference on Knowledge Based Computer Systems (KBCS), Dec. 2004, Hyderabad India, 19-22.
Parkes, M. B. (1993). Pause and effect: An introduction to the history of punctuation in the West. Berkeley: University of California.
Potamites, P. (2007). Modeling abbreviations and punctuation for sentence. Retrieved Jan. 28, 2012 from http://www-scf.usc.edu/~potamite/seg.pdf
Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985). A comprehensive grammar of the English language. Longman.
Raghavan, V., Gwang, S., Bollmann, P. (1989). A critical investigation of recall and precision as measures of retrieval system performance. ACM Trans. Inf. Syst. 7(3): 205-229.
Rasulo, M. (2008). The role of participant discourse in online community formation. Roma: Aracne.
Rennie, J.D. (2007). Extracting information from informal communication. PhD thesis, Massachusetts Institute of Technology.
Russo, L., Navarro, G., & Oliveira, A. (2011). Fully-compressed suffix trees. ACM Transactions on Algorithms, 7(4).
Santos, D. (1998). Punctuation and multilinguality: Reflections from a language engineering perspective. In J. T. Ydstie & A.C. Wollebæk (Eds.), Working papers in applied linguistics 4/98, Oslo: Department of Linguistics, Faculty of Arts, University of Oslo, 138-160.
Sarawagi, S. (2008). Information extraction. Foundations and Trends in Databases. 1(3), 261- 377.
Savoy J. (2005). Bibliographic database access using free-text and controlled vocabulary: An evaluation. Information Processing & Management. 41 (4), 873-890.
Say, B., & Akman, V. (1997). Current approaches to punctuation in computational linguistics. Computers and the Humanities, 30(6), 457-469. Retrieved from http://cogprints.org/198/
Schmid, H. (2007). Tokenizing. Corpus linguistics: An international handbook. Walter de Gruyter, Berlin.
Search engine showdown. (2012, May 14). More punctuation searching at Google [Blog Post]. Retrieved from http://www.searchengineshowdown.com/blog/2012/05/
Searching symbols and punctuation marks. (n.d). Retrieved from http://support.lexisnexis.com/ilexis/record.asp?ArticleID=10947
Shaoul, C. & Westbury C. (2011). A USENET corpus (2005-2010), Edmonton, AB: University of Alberta. Retrieved from http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
Slash (punctuation). (n.d.). In Wikipedia. Retrieved Mar 9, 2013, from http://en.wikipedia.org/wiki/Slash_(punctuation)
Sproat, R., Black, A., Chen, S., Kumar, S., Ostendorf, M., & Richards, C. (2001). Normalization of non-standard words. Computer Speech and Language. 15(3), 287-333.
Stoilos, G, Stamou G, & Kollias S. (2005). A string metric for ontology alignment. In Proceedings of the International Semantic Web Conference, 2005, 624-637.
Straus, J., & Fogarty, M. (2007). The blue book of grammar and punctuation: An easy-to-use guide with clear rules, real-world examples, and reproducible quizzes. San Francisco, CA: Jossey-Bass.
Tavecchio, L. M. (2010). Sentence patterns in English and Dutch: A contrastive corpus analysis. LOT dissertation series 248. Utrecht: LOT.
Thomas, D. R. (2004). A general inductive approach for qualitative data analysis. Auckland, University of Auckland, New Zealand. Retrieved July 14, 2012 from http:// www.health.auckland.ac.nz/hrmas/resources/qualdatanalysis.html
Unicode Standard Documentation. (2012). Retrieved Jan. 28, 2012 from http://www.unicode.org/standard/principles.html
Unicode Consortium. (2012). Standard, Unicode 6.0.0. Retrieved Jan. 28, 2012 from http://www.unicode.org/versions/Unicode6.0.0/
Vlahovic, N. (2011). Information retrieval and IE in Web 2.0 environment. International Journal of Computers, (5)1, NAUN University Press, 1–9.
W3C Website. (2012). Charlint - A Character Normalization Tool. Retrieved Jan. 28, 2012 from http://www.w3.org/International/charlint/
Wichmann, S. & Grant, A. P. (Eds.). (2012). Quantitative approaches to linguistic diversity: Commemorating the centenary of the birth of Morris Swadesh. Benjamins.
Wiegel, F. (2002). A survey of indexing techniques for semistructured documents. Institute for Computer Science, Munich University. Retrieved from: http://www.pms.ifi.lmu.de/publikationen/projektarbeiten/Felix.Weigel/master.pdf
Yeates, S. (2002). Automatic extraction of acronyms from text. Retrieved Feb 3, 2010 from http://citeseer.nj.nec.com/yeates99automatic.html
Yujian, L. and Bo, L. (2007). A normalized Levenshtein distance metric. IEEE Transaction on Pattern Analysis and Machine Intelligence, 1091-1095.
Xydas, G., Karberis, G. & Kouroupetroglou, G. (2004). In George A. Vouros, Themis Panayiotopoulos (Eds.), Methods and applications of artificial intelligence, Third Helenic Conference on AI, SETN 2004, Samos, Greece, May 5-8, 2004, Proceedings. Volume 3025 of Lecture Notes in Computer Science, Springer, 390-399.