Layout Inference
LAYOUT INFERENCE: FILE SCHEMA RECOGNITION VIA CONTENT-BASED ORACLES

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

By

Reid A. Phillips
University of Arkansas, 2006
Master of Science in Computer Science

August 2009
University of Arkansas

ABSTRACT

Some organizations routinely (e.g., monthly) process tens of thousands of flat files received from third parties, where a flat file consists of records containing a fixed number of fields. Currently, characterizing each file’s encoding, formatting elements, structure, and content is a manual process that is expensive: it costs human time, delays processing of the files, and is error-prone. This dissertation provides methods for automatically inferring this metadata.

In order to mine, persist, transform, or in some other way process structured data contained within a flat file, the properties associated with the file must first be known. Within this paper, the identification of these properties will be referred to as the layout inference problem, where a layout is a specification of the characteristics associated with a file. Typically a manual task, layout inference can benefit from an automated tool designed to replace or assist human involvement in this process.

In defining the result of this process, the layout, the first step is to identify the properties to be inferred. These characteristics are required to read and process the contents of a file and include, but are not necessarily limited to, the schema of the data records contained within the file, the character encoding, and other formatting details. Thus layout inference is concerned with providing an encompassing description of a file rather than a single characteristic (e.g., only the character encoding).
Once the layout is available, the final step in the layout inference problem is to communicate it in a meaningful manner to any interested parties.

The approach to this problem described in this paper is primarily statistical in nature. Statistical solutions, while potentially more ambiguous, can be considered better than other solutions because they are more adaptive: they gracefully handle a limited amount of error and incomplete information, along with many unforeseen circumstances. Another important characteristic of the approach detailed herein is its use of a conglomeration of expert agents. These agents provide the means of identifying the file properties, as each agent is an expert concerning a respective property. By applying their respective knowledge in various ways, as appropriate to the property being determined, the various layout characteristics may be inferred. Together, the statistical results of expert knowledge agents provide a powerful approach to solving the layout inference problem.

The applicability of this approach to the layout inference problem will be shown through results generated by an implemented prototype. These results will indicate the prototype’s performance (i.e., accuracy and run time) with respect to a representative set of data files, thereby demonstrating the capability of the defined approach and the promise of certain areas of future work.

This thesis is approved for recommendation to the Graduate Council.

Thesis Directors:

____________________________________
Wing-Ning Li

____________________________________
Craig Thompson

Thesis Committee:

____________________________________
Gordon Beavers

____________________________________
David Douglas

THESIS DUPLICATION RELEASE

I hereby authorize the University of Arkansas Libraries to duplicate this thesis when needed for research and/or scholarship.
Agreed ________________________________________

Refused ________________________________________

ACKNOWLEDGEMENTS

I thank the Lord, Jesus Christ, for granting me the ability and endurance necessary for this project and my educational career.

I am grateful to my parents, Timothy and Charmaine Phillips, for their continuous support. In all aspects of my life their assistance and guidance have been a wonderful blessing.

I thank my advisors, Drs. Craig Thompson and Wing-Ning Li, for their advice and direction concerning both this thesis and graduate research. It is unlikely I would have reached this point without their continual patience, guidance, and support. I also appreciate Drs. Gordon Beavers and David Douglas: Dr. Beavers for his involvement in this research effort and both for their assistance as committee members.

I also thank Acxiom Corporation for sponsoring me in present and past research projects, especially for suggesting the Layout Inference problem and providing guidance and test data. I especially thank Jonathan Loghry and David Nash for sharing their understanding of the problem and reviewing this work.

Finally, I would also like to thank Wesley Deneke and Patrick Benham, who have both directly contributed to this project. Your assistance has had a significant impact. Other friends, who have contributed in a less direct, yet no less meaningful way, include: Evan Kirkconnel, Jonathan Fleming, Kyle Stacey, Adam Vermillion, Nick LaSorte, Jonathan Schisler, and John Dixon. I appreciate you all.

My thesis is dedicated to all of you.

TABLE OF CONTENTS

1. Introduction
  1.1 Problem Definition
  1.2 Thesis Statement
  1.3 Organization of this Thesis
2. Related Work
  2.1 Formal Languages
  2.2 Information Extraction and Named Entity Recognition
  2.3 Statistical Learning
  2.4 Tabular Data Recognition
  2.5 Text Mining
3. Layout Inference Approach
  3.1 Layout Characteristics
  3.2 Initial Steps
  3.3 Parsing a Record
    3.3.1 Field Characteristics
    3.3.2 Record Length
    3.3.3 Field Content Type
    3.3.4 Results from Evidence
  3.4 Reporting the Results
4. Prototype Description
  4.1 Introduction
  4.2 Prototype Invocation
  4.1 Setting Parameters
  4.2 Character Encoding
  4.3 Delimiters and File Type
  4.5 Oracles
    4.5.1 Interfaces
    4.5.2 Container Class
    4.5.3 Oracle Definitions
    4.5.4 Oracle Implementations
  4.6 Record Length
  4.7 Record Layout Analysis
  4.8 Identifying and Removing False Positives
    4.8.1 Vertical Analysis
    4.8.2 Conflict Resolution