Layout Inference and Table Detection in Spreadsheet Documents
Dissertation

submitted April 20, 2020

by M.Sc. Elvis Koci
born May 09, 1987 in Sarande, Albania

at Technische Universität Dresden and Universitat Politècnica de Catalunya

Supervisors:
Prof. Dr.-Ing. Wolfgang Lehner
Assoc. Prof. Dr. Oscar Romero

THESIS DETAILS

Thesis Title: Layout Inference and Table Detection in Spreadsheet Documents
Ph.D. Student: Elvis Koci
Supervisors: Prof. Dr.-Ing. Wolfgang Lehner, Technische Universität Dresden
             Assoc. Prof. Dr. Oscar Romero, Universitat Politècnica de Catalunya

The main body of this thesis consists of the following peer-reviewed publications:

1. Elvis Koci, Maik Thiele, Oscar Romero, and Wolfgang Lehner. A machine learning approach for layout inference in spreadsheets. In IC3K 2016: The 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management: volume 1: KDIR, pages 77–88. SciTePress, 2016.

2. Elvis Koci, Maik Thiele, Oscar Romero, and Wolfgang Lehner. Cell classification for layout recognition in spreadsheets. In Ana Fred, Jan Dietz, David Aveiro, Kecheng Liu, Jorge Bernardino, and Joaquim Filipe, editors, Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K '16: Revised Selected Papers), volume 914 of Communications in Computer and Information Science, pages 78–100. Springer, Cham, 2019.

3. Elvis Koci, Maik Thiele, Oscar Romero, and Wolfgang Lehner. Table identification and reconstruction in spreadsheets. In the International Conference on Advanced Information Systems Engineering (CAiSE), pages 527–541. Springer, 2017.

4. Elvis Koci, Maik Thiele, Wolfgang Lehner, and Oscar Romero. Table recognition in spreadsheets via a graph representation. In the 13th IAPR International Workshop on Document Analysis Systems (DAS), pages 139–144. IEEE, 2018.

5. Elvis Koci, Maik Thiele, Oscar Romero, and Wolfgang Lehner. A genetic-based search for adaptive table recognition in spreadsheets. In 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019, pages 1274–1279. IEEE, 2019.

6. Elvis Koci, Maik Thiele, Josephine Rehak, Oscar Romero, and Wolfgang Lehner. DECO: A dataset of annotated spreadsheets for layout and table recognition. In 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019, pages 1280–1285. IEEE, 2019.

7. Elvis Koci, Dana Kuban, Nico Luettig, Dominik Olwig, Maik Thiele, Julius Gonsior, Wolfgang Lehner, and Oscar Romero. Xlindy: Interactive recognition and information extraction in spreadsheets. In Sonja Schimmler and Uwe M. Borghoff, editors, Proceedings of the ACM Symposium on Document Engineering 2019, Berlin, Germany, September 23-26, 2019, pages 25:1–25:4. ACM, 2019.

This thesis is jointly submitted to the Faculty of Computer Science at Technische Universität Dresden (TUD) and the Department of Service and Information System Engineering (ESSI) at Universitat Politècnica de Catalunya (UPC), in partial fulfillment of the requirements within the scope of the IT4BI-DC program for the joint Ph.D. degree in computer science (TUD: Dr.-Ing., UPC: Ph.D. in Computer Science). The thesis is not submitted to any other organization at the same time. The author has obtained the rights to include parts of the already published articles in the thesis.

ABSTRACT

Spreadsheet applications have evolved to be a tool of great importance for businesses, open data, and scientific communities. Using these applications, users can perform various transformations, generate new content, and analyze and format data such that they are visually comprehensible. The same data can be presented in different ways, depending on the preferences and the intentions of the user. These functionalities make spreadsheets user-friendly, but not as machine-friendly.
When it comes to integrating with other sources, the free-for-all nature of spreadsheets is disadvantageous. It is rather difficult to algorithmically infer the structure of the data when they are intermingled with formatting, formulas, layout artifacts, and textual metadata. Therefore, user involvement is often required, which results in cumbersome and time-consuming tasks. Overall, the lack of automatic processing methods limits our ability to explore and reuse a great amount of rich data stored in partially-structured documents such as spreadsheets.

In this thesis, we tackle this open challenge, which so far has been scarcely investigated in the literature. Specifically, we are interested in extracting tabular data from spreadsheets, since they hold concise, factual, and to a large extent structured information. It is easier to process such information, in order to make it available to other applications. For instance, spreadsheet (tabular) data can be loaded into databases. Thus, these data would become instantly available to existing or new business processes. Furthermore, we can eliminate the risk of losing valuable company knowledge, by moving data or integrating spreadsheets with other more sophisticated information management systems.

To achieve the aforementioned objectives and advancements, in this thesis, we develop a spreadsheet processing pipeline. The requirements for this pipeline were derived from a large-scale empirical analysis of real-world spreadsheets, from business and Web settings. Specifically, we propose a series of specialized steps that build on top of each other with the goal of discovering the structure of data in spreadsheet documents. Our approach is bottom-up, as it starts from the smallest unit (i.e., the cell) to ultimately arrive at the individual tables of the sheet.

Additionally, this thesis makes use of sophisticated machine learning and optimization techniques.
In particular, we apply these techniques for layout analysis and table detection in spreadsheets. We target highly diverse sheet layouts, with one or multiple tables and arbitrary arrangement of contents. Moreover, we foresee the presence of textual metadata and other non-tabular data in the sheet. Furthermore, we work even with problematic tables (e.g., containing empty rows/columns and missing values). Finally, we bring flexibility to our approach. This not only allows us to tackle the above-mentioned challenges but also to reuse our solution for different (spreadsheet) datasets.

CONTENTS

1 INTRODUCTION
  1.1 Motivation
  1.2 Contributions
  1.3 Outline

2 FOUNDATIONS AND RELATED WORK
  2.1 The Evolution of Spreadsheet Documents
    2.1.1 Spreadsheet User Interface and Functionalities
    2.1.2 Spreadsheet File Formats
    2.1.3 Spreadsheets Are Partially-Structured
  2.2 Analysis and Recognition in Electronic Documents
    2.2.1 A General Overview of DAR
    2.2.2 DAR in Spreadsheets
  2.3 Spreadsheet Research Areas
    2.3.1 Layout Inference and Table Recognition
    2.3.2 Unifying Databases and Spreadsheets
    2.3.3 Spreadsheet Software Engineering
    2.3.4 Data Wrangling Approaches

3 AN EMPIRICAL STUDY OF SPREADSHEET DOCUMENTS
  3.1 Available Corpora
  3.2 Creating a Gold Standard Dataset
    3.2.1 Initial Selection
    3.2.2 Annotation Methodology
  3.3 Dataset Analysis
    3.3.1 Takeaways from Business Spreadsheets
    3.3.2 Comparison Between Domains
  3.4 Summary and Discussion
    3.4.1 Datasets for Experimental Evaluation
    3.4.2 A Processing Pipeline

4 LAYOUT ANALYSIS
  4.1 A Method for Layout Analysis in Spreadsheets
  4.2 Feature Extraction
    4.2.1 Content Features
    4.2.2 Style Features
    4.2.3 Font Features
    4.2.4 Formula and Reference Features
    4.2.5 Spatial Features
    4.2.6 Geometrical Features
  4.3 Cell Classification
    4.3.1 Classification Datasets
    4.3.2 Classifiers and Assessment Methods
    4.3.3 Optimum Under-Sampling
    4.3.4 Feature Selection
    4.3.5 Parameter Tuning
    4.3.6 Classification Evaluation
  4.4 Layout Regions
  4.5 Summary and Discussions

5 CLASSIFICATION POST-PROCESSING
  5.1 Dataset for Post-Processing
  5.2 Pattern-Based Revisions
    5.2.1 Misclassification Patterns
    5.2.2 Relabeling Cells
    5.2.3 Evaluating the Patterns
  5.3 Region-Based Revisions
    5.3.1 Standardization Procedure
    5.3.2 Extracting Features from Regions
    5.3.3 Identifying