
ABSTRACT

RECOGNIZING TABLE FORMATTING FROM TEXT FILES

By Venkatprabhu Rajendran

Some text documents, legacy documents in particular, do not format sections of text containing tables. Such documents are not as readable as they could be because their columns are not aligned, and non-aligned columns prevent the reader from seeing the important patterns in the text. This thesis presents an algorithm to help insert table formatting into free text. The algorithm parses the text to identify common syntactic patterns such as dates, dollar amounts, and times. Pattern matching techniques are then used to identify models of what type of data each column should contain; these models are called templates. Ambiguities often exist, which makes it necessary to rank the alternative templates and their associated tables. This thesis focuses on evaluating the candidate templates and associated tables to rank the different alternatives. The scoring function attempts to mimic the process that a human might go through when performing the same task. The effectiveness of the scoring is evaluated on a set of tables that have appeared in real electronic feeds.

RECOGNIZING TABLE FORMATTING FROM TEXT FILES

A Thesis

Submitted to the

Faculty of University

in partial fulfillment of

the requirements for the degree of

Master of Computer Science

Department of Computer Science and Systems Analysis

By

Venkatprabhu Rajendran

Miami University

Oxford, Ohio

2006

Advisor______Dr. Michael Zmuda

Reader______Dr. Valerie Cross

Reader______Dr. Jim Kiper

Table of Contents

1. Introduction
2. Types of Table Recognition
   2.1 Document Imaging
   2.2 Tabular Recognition from Text
3. Previous Work
   3.1 The Interpretation of Tables in Texts
   3.2 Table Information Based Text in Query
      3.2.1 Preprocessing module
      3.2.2 Indexing module
      3.2.3 Retrieval module
   3.3 Table Recognition by Reference Tables
   3.4 Table Recognition Based on Robust Block Segmentation
   3.5 Learning to Recognize Tables in Free Text
4. System Concepts
   4.1 Templates, Tokens and Entities
   4.2 Template Complexity (TX)
   4.3 Template Consistency (TC)
   4.4 Template Evaluation Score (TES) and Results
5. Table Evaluation
   5.1 Row completeness (Rc)
   5.2 Column cohesion (CC)
   5.3 Table recognition - Example
6. Conclusion
7. References
8. Supporting Papers
Appendix A. Sample input tables

Table of Figures

Figure 1 - Sample ASCII document without formatting
Figure 2 - Text with extra spacing
Figure 3 - Text without extra spacing
Figure 4 - TINTIN System [PCTINTIN97]
Figure 5 - Features from [HCJ99]
Figure 6 - Tokens
Figure 7 - Entity matrix for the input text ‘1. Hank Aaron 755’
Figure 8 - Entity matrix for the input text ‘Michigan Systems 40 1 / 2’
Figure 9 - Sample entities for the input text ‘1. Hank Aaron 755’
Figure 10 - Possible Templates
Figure 11 - Example for complex syntactic entities
Figure 12 - Template Complexity value for input text A2
Figure 13 - Template Weight calculation for the input 'Ohio 5/5/2005'
Figure 14 - Template Consistency - Motivation
Figure 15 - Template Consistency - Need to make the list small
Figure 16 - Template Consistency - input text from A13
Figure 17 - Templates generated for input text from A13
Figure 18 - Top 4 templates for all the input text considered
Figure 19 - Example of table interpretation
Figure 20 - Row completeness values for the Table A2
Figure 21 - Unique templates generated for input text A6

List of Tables

Table 1 - Information shown with table formatting
Table 2 - Syntactic entities
Table 3 - Entity Weights
Table 4 - Entity inheritance
Table 5 - Template evaluation score
Table 6 - One valid table for the input text A1
Table 7 - Table generated for A6 with template text parenthetical text
Table 8 - Table generated for A6 with template text country text
Table 9 - Table generated for A6 with template text parenthetical text integer text
Table 10 - Table generated for A6 with template text country text integer text
Table 11 - Tables Scores for input text A6

1. Introduction

Structured data is often presented in a table to improve readability. A large number of legacy documents, however, do not display information using tables because the original table constructs were dropped. Thus, it would be useful to be able to automatically reformat certain sections of information into a table to increase the documents’ value. This type of automated tool could also be useful in office automation software, assisting a user in automatically creating tables (similar to how MS Word automatically recognizes numbered lists and reformats them accordingly). Figure 1, for example, contains unformatted text, shown in a proportional font.

1. Stefano Zanini (Italy) 4 hours 23 minutes 13 seconds 2. Thomas Muhlbacher (*) (Austria) 3. Marcel Strauss (Switzerland) 4. Marc Lotz (Netherlands) all same time 5. Ellis Rastelli (Italy) 11 seconds behind 6. Fred Rodriguez (U.S.) 7:31 7. Markus Zberg (Switzerland) 8. Rene Haselbacher (Austria) 9. Richard Virenque (France) 10. Oscar Mason (Italy) all same time

Figure 1 - Sample ASCII document without formatting

The ideal table for the text in Figure 1 is shown in Table 1. Creating this table manually is easily done by recognizing that the numeric sequence should reside in its own column and that the parenthesized country should also be placed in its own column. Finally, the pieces of text around these two entities are placed in their own columns.


 1.  Stefano Zanini         (Italy)        4 hours 23 minutes 13 seconds
 2.  Thomas Muhlbacher (*)  (Austria)
 3.  Marcel Strauss         (Switzerland)
 4.  Marc Lotz              (Netherlands)  all same time
 5.  Ellis Rastelli         (Italy)        11 seconds behind
 6.  Fred Rodriguez         (U.S.)         7:31
 7.  Markus Zberg           (Switzerland)
 8.  Rene Haselbacher       (Austria)
 9.  Richard Virenque       (France)
10.  Oscar Mason            (Italy)        all same time

Table 1 - Information shown with table formatting

2. Types of Table Recognition

Table recognition is generally divided into two categories: recognition via document imaging and recognition from free text.

2.1 Document Imaging

Document imaging is the method of recognizing tabular information from document images obtained by scanning pages into digital form [JSNG05]. The problem of table reconstruction then becomes primarily an image-processing problem. The image, a matrix of pixels, is examined for constructs such as horizontal and vertical lines and large regions of white space. These graphical entities are used to identify the table rows and columns as well as the table headings. An important aspect here is not just extracting table cells but also maintaining the spatial relationships between them so that the entire table can be extracted without columns and rows being merged.

2.2 Tabular Recognition from Text

Table recognition from text has been an active research topic [MT96, WTSW98, HCJ99, JSNG05]. In this area, there are two subcategories of recognizing tables in text: 1) the text includes extra spaces that help preserve the original alignment; Figure 2 illustrates an example of this type of text. 2) The text does not attempt to align the columns with extra spaces; Figure 3 illustrates this type of text.

 1. Hank Aaron        755
 2. Babe Ruth         714
 3. Willie Mays       660
 4. Frank Robinson    586
 5. Harmon Killebrew  573
 6. Reggie Jackson    563
 7. Mike Schmidt      548
 8. Mickey Mantle     536
 9. Jimmie Foxx       534
10. Willie McCovey    521
    Ted Williams      521
12. Ernie Banks       512
    Eddie Matthews    512
14. Mel Ott           511
15. Eddie Murray      500
16. Lou Gehrig        493
17. Stan Musial       475
    Willie Stargell   475
19. Dave Winfield     465
20. Carl Yastrzemski  452

Figure 2 - Text with extra spacing

If the input text does not contain extra spaces, as shown in Figure 3, it is much more difficult to automatically detect what belongs to each column, although a human can easily accomplish this task. Some work has been done in this area using machine learning, but it has achieved only limited success [MAZ01].

1. Hank Aaron 755 2. Babe Ruth 714 3. Willie Mays 660 4. Frank Robinson 586 5. Harmon Killebrew 573 6. Reggie Jackson 563 7. Mike Schmidt 548 8. Mickey Mantle 536 9. Jimmie Foxx 534 10. Willie McCovey 521 Ted Williams 521 12. Ernie Banks 512 Eddie Matthews 512 14. Mel Ott 511 15. Eddie Murray 500 16. Lou Gehrig 493 17. Stan Musial 475

Figure 3 - Text without extra spacing

3. Previous work

Much work has been done on recognizing and formatting tables, and each method has its own pros and cons. The following sections discuss some of the more important work.

3.1 The Interpretation of Tables in Texts

Hurst’s work [MFH00] involves developing technology to process tables in textual documents so that their contents can be accessed and then interpreted by standard information extraction and natural language processing systems. His thesis consists of three parts: the first part gives a general description of tables and the research surrounding them; the second part describes a layered model of the table and provides notation for encoding tables in these component layers; the final part focuses on the design, implementation, and evaluation of a system that actually produces a model for the tables found in a document.

3.2 Table Information Based Text in Query

The TINTIN system [PCTINTIN97] uses heuristic approaches to detect tables in text documents. The method uses three modules: an initial preprocessing module, an indexing module, and a retrieval module.

3.2.1 Preprocessing module

Here the tabular constructs are extracted from the text documents and thus separated from the rest of the data. This module consists of two stages:

• Table extractor: The white spaces in each contiguous block of text are first extracted. Once the white spaces are identified, a data structure called the Character Alignment Graph (CAG) is formed. This graph records the number of characters that appear at a particular location on ‘k’ different lines. The row and column headings and other captions are extracted using the table extractor.

• Component tagger: Extracted tabular constructs are tagged so they can be distinguished. Croft and Reddy used syntactic rather than semantic heuristics. Some of their heuristics are the Gap Structure heuristic, the Alignment heuristic, the Pattern Regularity heuristic, the Differential Column Count heuristic, and the Differential Gap heuristic.
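As a rough illustration of the CAG idea, the sketch below counts, for each column position, how many lines have a non-space character there. The function name and representation are hypothetical; the actual TINTIN data structure is more elaborate.

```python
def character_alignment_graph(lines):
    # For each column position, count how many lines have a non-space
    # character there; runs of zero-count columns suggest column gaps.
    width = max(len(line) for line in lines)
    counts = [0] * width
    for line in lines:
        for col, ch in enumerate(line):
            if ch != ' ':
                counts[col] += 1
    return counts

rows = ["Hank Aaron   755",
        "Babe Ruth    714"]
cag = character_alignment_graph(rows)
# Columns 10-12 score 0 in both rows, hinting at a column boundary there.
```

Columns where the count stays at zero across many lines are candidate column separators; the tagging heuristics then operate on this structure.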

3.2.2 Indexing module

Field indexing is the process of creating specific terms related to the captions and tables tagged in the previous module. The authors used the “Inbuild” indexing program for this module. Inbuild is part of a suite of tools for IR systems developed at the Center for Intelligent Information Retrieval, University of Massachusetts, Amherst. The output of this module is a structured document database.

3.2.3 Retrieval module

The authors used a search engine called INQUERY to query the structured document database. Since the database is field indexed, it is easy to query using the fields. The main disadvantage of this methodology is that it distinguishes only between table captions and table entries. The method could be further improved by distinguishing between table entries, captions, and row and column headings.


[Figure: architecture of the TINTIN system. Documents flow into the preprocessing module (table extractor and component tagger); the tagged output is indexed by Inbuild into a structured document database, which INQUERY queries in the retrieval module.]

Figure 4 - TINTIN System [PCTINTIN97]

3.3 Table Recognition by Reference Tables

[WTSW98] is an example of a knowledge-based approach. Two types of tables are used: reference tables and the table to be recognized. First, a reference table is specified using a well-designed user interface, giving the algorithm a table model. When a new table is given as input, it is compared with the tables in the knowledge base to look for a match. The algorithm works by first detecting the header of the new table. It then compares this header with the headers of the reference tables to find a match. Once a match is found, the algorithm continues by detecting the rest of the tabular structure. The main significance of this method is that it works well for tables that either violate table layout conventions or contain text that does not belong to the table. The main disadvantage is that this method requires a reference table to be created, which increases the space and time complexity. Additionally, this method requires prior information about the tables pertaining to the domain under consideration.

3.4 Table Recognition Based on Robust Block Segmentation

Kieninger [TGK98] implemented the T-Recs system, which identifies table columns using spacing information. T-Recs receives text input along with word bounding box information and outputs the cells of the tables. The algorithm starts by selecting a random word as a seed and then draws a vertical stripe over the seed to find overlapping words on the lines above and below. All words that overlap with the seed word constitute a block and are marked as expanded. This step is repeated for each word, and the algorithm continues until no words are left unmarked.
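A toy version of this expansion step might look as follows. The (line, start_col, end_col) word representation and the function names are my own simplification, not the actual T-Recs implementation.

```python
def overlaps(a, b):
    # Two words overlap horizontally if their column ranges intersect.
    return a[1] <= b[2] and b[1] <= a[2]

def build_block(seed, words):
    """Expand from a seed word to every word reachable through horizontal
    overlaps on adjacent lines (the 'vertical stripe' expansion)."""
    block, frontier = {seed}, [seed]
    while frontier:
        w = frontier.pop()
        for other in words:
            if (other not in block and abs(other[0] - w[0]) == 1
                    and overlaps(w, other)):
                block.add(other)
                frontier.append(other)
    return block

words = [(0, 0, 4), (0, 10, 14),   # line 0: two words
         (1, 1, 3), (1, 10, 13)]   # line 1: two words
col1 = build_block((0, 0, 4), words)
# col1 contains only the two left-column words
```

Each resulting block approximates one column of the table; words that never join a multi-word block become candidates for the post-processing steps mentioned below.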

The main disadvantage of this algorithm is that it does not work correctly for headers, since a header has no overlapping words among its vertical neighbors. Because of this, a header row would be treated as separate table elements, with each word considered a column of the table. The second problem is with words that appear at the end of a non-justified block, which mostly do not overlap with other words. The third problem concerns columns that share a common header. However, the author proposed post-processing steps on the initial block segmentation that overcome these drawbacks. It is worth noting that T-Recs assumes the input uses spacing to put the columns in some form of alignment. This assumption is not used in the work proposed here.

3.5 Learning to Recognize Tables in Free Text

Ng, Lim, and Koo [HCJ99] developed a machine learning approach to the problem of recognizing tables in free text. The C4.5 algorithm is used first to identify the boundaries of the table, followed by detecting the table columns and rows. The authors used an X-window-based GUI to quickly annotate a table with its boundary, column, and row demarcations. With this information, training examples are created in the form of feature-value vectors with correctly assigned classes. Thus, a set of training examples is generated for each subproblem of detecting table boundaries, rows, and columns. The feature-value vectors differ for each subproblem. The algorithm then takes a new, unseen example and classifies it. First, feature variables are created for each subproblem; their values depend on the characters the tables actually contain, for example the space character, special characters, and alphanumeric characters. The possible values for these variables are predefined. For example, the features for the table boundary subproblem are defined by the 9 values shown in Figure 5:

Feature  Description
F1       Whether the line consists of only space characters. The value is ‘t’ if the line is blank or ‘f’ otherwise.
F2       Number of leading space characters in the line.
F3       The first non-space character in the line. Possible values are (){}[]<>+-*/=!@#$%^& or ‘N’ otherwise.
F4       The last non-space character in the line. Values are the same as for F3.
F5       Whether the line has only one special character. Values are the same as for F3.
F6       Number of segments in the line with 2 or more contiguous space characters.
F7       Number of segments in the line with 3 or more contiguous space characters.
F8       Number of segments in the line with 2 or more contiguous separator characters.
F9       Number of segments in the line with 3 or more contiguous separator characters.

Figure 5 - Features from [HCJ99]

Consider the following lines of text:

Line 13: Net tons Capability
Line 14: Produced Utilization
Line 15: Week to March 14 ………………………… 1,633,000 75.8%
Line 16: Week to March 7 …………………………... 1,570,000 71.9%

The feature-value vector generated for line 16 was ‘f, 3, N, %, N, 4, 3, 1, 1’. The main disadvantage of this approach is its inability to distinguish different types of rows, such as the header, the caption, and the table body rows.
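The nine boundary features can be approximated in code as below. The exact segment definitions for F6-F9 and the separator character set are my reading of Figure 5, so treat this as an illustrative sketch rather than the authors' implementation.

```python
SPECIALS = set("(){}[]<>+-*/=!@#$%^&")
SEPARATORS = set(".-*=")  # assumed set of leader/separator characters

def runs_of(line, charset, n):
    # Count maximal runs of characters from charset that reach length n.
    count, run = 0, 0
    for ch in line:
        run = run + 1 if ch in charset else 0
        if run == n:
            count += 1
    return count

def boundary_features(line):
    stripped = line.strip()
    first = stripped[0] if stripped else 'N'
    last = stripped[-1] if stripped else 'N'
    specials = [c for c in stripped if c in SPECIALS]
    return [
        't' if not stripped else 'f',                 # F1: blank line?
        len(line) - len(line.lstrip(' ')),            # F2: leading spaces
        first if first in SPECIALS else 'N',          # F3: first non-space char
        last if last in SPECIALS else 'N',            # F4: last non-space char
        specials[0] if len(specials) == 1 else 'N',   # F5: the lone special char
        runs_of(line, {' '}, 2),                      # F6: space runs >= 2
        runs_of(line, {' '}, 3),                      # F7: space runs >= 3
        runs_of(line, SEPARATORS, 2),                 # F8: separator runs >= 2
        runs_of(line, SEPARATORS, 3),                 # F9: separator runs >= 3
    ]
```

A vector of this shape, labeled with the annotated class, is what each C4.5 training example would contain.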

4. System Concepts

4.1 Templates, Tokens and Entities

This section explains the table recognition methodology used in this work. The main concept is that a table has a template associated with it. A template is a sequence of syntactic elements, where each element corresponds to the contents of a particular column in the table. For example, the template associated with the table in Figure 3 would be: integer text integer. This particular template indicates that the table consists of 3 columns, where the first and third columns contain integer data values and the middle column contains arbitrary text data. One of the main challenges of this work is discovering what the correct template actually is.

Parsing tools are used to systematically scan the lines of text looking for the presence of common syntactic elements, referred to as entities. An entity is any contiguous sequence of characters that can be interpreted as a well-known concept. Table 2 lists the entities used in this work. The compiler tool ANTLR [ANTLR06] is used to parse the text in order to identify the presence of these entities. The rules shown in Table 2 include all the common entities found in the set of documents. It is important to note that the set of entities is expandable. That is, other grammar rules could be easily integrated into the design. With an appropriate database, it is also possible to have a company entity, person entity, product entity etc. For example, a database for recognizing valid countries is used.

Grammar rules, which are written in ANTLR’s grammar specification language, describe the syntax of each of the listed entities. In addition to the syntactic structure of each entity, constraints are placed on the attributes to further refine the parsing process. The rules for the clocktime entity are shown below to illustrate the general form of these rules. The first rule shows that a clocktime is composed of an integer followed by a colon followed by an integer. Further, the first integer must be between 0 and 23 (to account for military time) and the second integer must have length two and a numeric value between 0 and 59. The second rule is designed to capture clocktimes appearing without a specific number of minutes (e.g., 3am). We note that this example is a pseudo-code version of the actual ANTLR rules.

clocktime → INT COLON INT
              CONSTRAINT[the first integer is between 0 and 23]
              CONSTRAINT[the second integer is between 0 and 59]
              CONSTRAINT[the length of the second integer is two]
          | INT meridian
              CONSTRAINT[the integer is between 1 and 12]

meridian  → “am” | “a.m.” | “AM” | “A.M.” | “pm” | “p.m.” | “PM” | “P.M.”

ws        → SPC | TAB
optws     → ws optws | ε
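A rough Python equivalent of the pseudo-rules above, using regular expressions rather than the actual ANTLR grammar, might look like this:

```python
import re

# Sketch of the clocktime entity: "HH:MM" with constraints, or "H am/pm".
CLOCK = re.compile(r'^(\d{1,2}):(\d{2})$')
MERIDIAN = re.compile(r'^(\d{1,2})\s*(am|a\.m\.|pm|p\.m\.)$', re.IGNORECASE)

def is_clocktime(text):
    m = CLOCK.match(text)
    if m:
        hh, mm = int(m.group(1)), int(m.group(2))
        # First integer 0-23 (military time); second 0-59, two digits long
        # (the two-digit constraint is enforced by the \d{2} pattern).
        return 0 <= hh <= 23 and 0 <= mm <= 59
    m = MERIDIAN.match(text)
    if m:
        # e.g. "3am": the integer must be between 1 and 12.
        return 1 <= int(m.group(1)) <= 12
    return False
```

Each entity in Table 2 would get a recognizer of this general shape, with its own structural pattern and attribute constraints.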

Syntactic Entities   Comments / Example
Record               13-8-6, 56-9-7
Range                0-3, 4-6 …
Real                 5.3, 4.3, 65.343 …
Integer              10, 121, 1000 …
Fraction             1/2, 5/4 …
Clocktime            Noon, 11 P.M, 3:35 A.M
Date                 1-2, 1-3-1983, 3/3/2000, Jan 27
Dollar               $, $0, $45, $34-$56 …
Parenthetical        ( ), (Italy) …
Text                 1. , Tom , …
Network              TNT, AMC …
University           Ohio, Michigan …
State                Washington, Ohio, N.Y. …
City                 Kansas, Oxford …
Dollarrange          $34-54, $55-$67 …
Percentage           20%, 45% …
Day                  Monday, Tuesday …
Country              Italy, Austria, France …

Table 2 - Syntactic entities

The parsing rules are applied to all contiguous regions of the text to determine whether that section of text corresponds to one or more of the entities. In order to do this, a concept simpler than an entity, called a token, is used. A token is the smallest unit of text that is considered indivisible. The tokens are very basic and are shown in Figure 6.

WORD  → sequence of upper- or lowercase letters
INT   → sequence of digits
COLON → :
SHARP → #
…

Figure 6 - Tokens

The first step is to tokenize each line of text. For example, consider the following two input texts.

Input text 1: 1. Hank Aaron 755

“1. Hank Aaron 755” is tokenized into a sequence of eight tokens:

<INT:1> <PERIOD:.> <SPC: > <WORD:Hank> <SPC: > <WORD:Aaron> <SPC: > <INT:755>

Input text 2: Michigan Systems 40 1/2

“Michigan Systems 40 1/2” is tokenized into a sequence of nine tokens:

<WORD:Michigan> <SPC: > <WORD:Systems> <SPC: > <INT:40> <SPC: > <INT:1> <SLASH:/> <INT:2>
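These tokenizations can be reproduced with a small scanner. The token names follow Figure 6, but the regular-expression spelling is my own sketch, not the thesis lexer.

```python
import re

# One named group per token type; re tries alternatives left to right.
TOKEN_SPEC = [
    ('WORD',   r'[A-Za-z]+'),
    ('INT',    r'\d+'),
    ('PERIOD', r'\.'),
    ('COLON',  r':'),
    ('SLASH',  r'/'),
    ('SPC',    r' '),
]
MASTER = re.compile('|'.join(f'(?P<{n}>{p})' for n, p in TOKEN_SPEC))

def tokenize(text):
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)]

tokens = tokenize("1. Hank Aaron 755")
# [('INT','1'), ('PERIOD','.'), ('SPC',' '), ('WORD','Hank'), ('SPC',' '),
#  ('WORD','Aaron'), ('SPC',' '), ('INT','755')]
```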

Entities are then recognized based on this sequence of tokens. All contiguous sequences of tokens are parsed for the presence of the higher-level entities. This set of entities can be visualized as the upper-diagonal matrices shown in Figure 7 and Figure 8 for the two input texts. Each element of the grid represents a particular region of text. For example, the element at row 1, column 2 represents the set of entities that correspond to the sequence of tokens starting with <INT:1> and ending with <PERIOD:.>. Note that, for the sake of conciseness, the entity matrix does not include the ubiquitous catch-all entity text, which would appear in every cell.

                                      Ending Token
                       1        2         3     4          5     6           7     8
                       INT:1    PERIOD:.  SPC:  WORD:Hank  SPC:  WORD:Aaron  SPC:  INT:755
Start  1  INT:1        integer  integer
Token  2  PERIOD:.
       3  SPC:
       4  WORD:Hank
       5  SPC:
       6  WORD:Aaron
       7  SPC:
       8  INT:755                                                                  integer

Figure 7 - Entity matrix for the input text ‘1.Hank Aaron 755’

                                              Ending Token
                           1              2     3             4     5       6     7        8        9
                           WORD:Michigan  SPC:  WORD:Systems  SPC:  INT:40  SPC:  INT:1    SLASH:/  INT:2
Starting  1  WORD:Michigan state
Token     2  SPC:
          3  WORD:Systems
          4  SPC:
          5  INT:40                                                 integer
          6  SPC:
          7  INT:1                                                                integer            fraction,
                                                                                                     date
          8  SLASH:/
          9  INT:2                                                                                   integer

Figure 8 - Entity matrix for the input text ‘Michigan Systems 40 1 / 2’
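The matrices in Figures 7 and 8 can be built by trying every contiguous token span. The recognizer below handles only a couple of entity forms and is purely illustrative, standing in for the full ANTLR grammar.

```python
def recognize(span):
    """Return the set of (non-text) entities that a token span matches."""
    kinds = [k for k, _ in span]
    found = set()
    if kinds == ['INT']:
        found.add('integer')
    if kinds == ['INT', 'PERIOD']:          # "1." is still grouped as integer
        found.add('integer')
    if kinds == ['INT', 'SLASH', 'INT']:
        found.add('fraction')
    return found

def entity_matrix(tokens):
    n = len(tokens)
    grid = {}
    for i in range(n):
        for j in range(i, n):
            ents = recognize(tokens[i:j + 1])
            if ents:
                grid[(i + 1, j + 1)] = ents   # 1-based, as in Figures 7 and 8
    return grid

toks = [('INT', '1'), ('PERIOD', '.'), ('SPC', ' '), ('WORD', 'Hank'),
        ('SPC', ' '), ('WORD', 'Aaron'), ('SPC', ' '), ('INT', '755')]
grid = entity_matrix(toks)
# grid[(1, 1)] and grid[(1, 2)] are {'integer'}; grid[(8, 8)] is {'integer'}
```

Only the non-empty cells are stored; as in the figures, the catch-all entity text is left implicit.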

One can also view the entities using the original text, as shown in Figure 9. The rectangular box in the input text in Figure 9 represents the blank space between the individual data values. After breaking the input into a sequence of tokens, the tokens are grouped as shown in Figure 9 to be matched with particular entities. In Figure 9, the empty spaces are grouped with the data value ‘Hank Aaron’ and matched with the entity text. Similarly, the tokens ‘1’ (an integer) and ‘.’ (a period) are grouped together and matched with the higher-level entity integer.

1. Hank Aaron 755
integer   text   integer

Figure 9 - Sample entities for the input text ‘1. Hank Aaron 755’

A primary motivation of this work is to resolve the ambiguity between competing templates. This objective makes it necessary to rank the alternative templates and their associated tables. Evaluating templates requires developing functions that produce good results in a large number of cases (see Appendix A for sample tables). A scoring function was developed to help choose the best templates. This function was developed based on two factors - template consistency and template complexity that are discussed next.

4.2 Template Complexity (TX)

The previous examples show that the tokenized input text can be grouped in multiple ways to form different sets of entities. A sequence of entities such as “integer text” or “integer text integer” from Figure 9 is called a template. Though multiple templates match the input text, it is important to select the few templates that best describe it. In order to assist in this process, the concept of entity complexity is developed. One motivation for entity complexity is that complex entities are preferred to simple entity types, since they typically do not occur randomly in text. For example, the repeated occurrence of date in the rows of a table is strong evidence that a date resides in its own column. In order to differentiate complex entities from simple ones, weights are assigned to entities, based on the idea that “complex syntactic entities are likely indicators of good templates”. The values in the second column of Table 3 are the weights associated with each entity. For simplicity, the weights are chosen between 0 and 1. Common syntactic entities like integer and text are assigned small weights (0.4 and 0.01, respectively), whereas complex syntactic entities like real, fraction, clocktime and date are assigned large weights (0.8, 1, 1 and 1, respectively). The rationale for these assignments is that common entities like integer and text are basic derivatives of the more complex entities. There is no formal basis for selecting these particular weights; instead, human intuition was used to set values that produced reasonable results. The next section defines this function and presents several examples showing how it works.

Syntactic Entities   Weights
Record               1
Range                1
Real                 0.8
Integer              0.4
Fraction             1
Clocktime            1
Date                 1
Dollar               1
Parenthetical        1
Text                 0.01
Network              0.8
University           0.8
State                0.8
City                 0.8
Dollarrange          1
Percentage           1
Day                  0.8
Country              0.8

Table 3 - Entity Weights

Figure 10 shows several other example text fragments and their possible templates. It is often the case that entities like integer and text are embedded in other, more complex entities like real, fraction, clocktime, date, and dollar. The template shown in bold indicates the preferred template, which, in general, has a high complexity weight.

Text       Possible Templates
5.3        5 → integer    .3 → text
           5. → text      3 → integer
           5.3 → real

5:45       5 → integer    :45 → text
           5: → text      45 → integer
           5:45 → clocktime

5/5/2005   5/5/ → text    2005 → integer
           5 → integer    /5/2005 → text
           5/5 → fraction  /2005 → text
           5 → integer    /5/ → text    2005 → integer
           5 → integer    / → text    5 → integer    / → text    2005 → integer
           5/5/2005 → date

Figure 10 - Possible Templates

We score the complexity of a template by adding the individual weights of the entities and dividing by the number of entities in the template, giving the average complexity of the entities:

TX = (sum of the entity weights) / (# of entities)
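In code, the TX score is a one-liner over a weight table (weights abridged from Table 3):

```python
# Subset of the entity weights from Table 3.
WEIGHTS = {'text': 0.01, 'integer': 0.4, 'real': 0.8, 'fraction': 1.0,
           'clocktime': 1.0, 'date': 1.0, 'city': 0.8, 'state': 0.8}

def template_complexity(template):
    # TX = average entity weight over the template.
    return sum(WEIGHTS[e] for e in template) / len(template)

tx = template_complexity(['text', 'city', 'integer', 'fraction', 'integer', 'text'])
# (0.01 + 0.8 + 0.4 + 1 + 0.4 + 0.01) / 6 ≈ 0.436
```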

For example, consider the text ‘Central Garden & Pet Co. (Lafayette) 9 1/2 18 +89’. Several templates consistent with this text are shown inside the rectangular boxes in Figure 11.


Central Garden & Pet Co. (Lafayette) 9 1/2 18 +89 (input text)

template 1: text integer text integer text
template 2: text integer text
template 3: text city integer fraction integer text

Figure 11 - Example for complex syntactic entities

Among these three templates, templates 1 and 2 contain only basic syntactic entities like integer and text, whereas template 3 contains complex entities like city and fraction. Hence, template 3 should be given more weight than the other two. Using the weights in Table 3, we compute each template’s total weight. The total weight of template 3 is 2.62, which is higher than 0.83 for template 1 and 0.42 for template 2. This relative ranking of the scores is desirable.

For the same input text, the complexity values of the templates are shown in Figure 12. It can be seen that template 3 is ranked higher than templates 1 and 2.

Template                                  TX
text integer text integer text            (0.01+0.4+0.01+0.4+0.01) / 5 = 0.166
text integer text                         (0.01+0.4+0.01) / 3 = 0.14
text city integer fraction integer text   (0.01+0.8+0.4+1+0.4+0.01) / 6 = 0.436

Figure 12 - Template Complexity value for input text A2


For a second concrete example, the input text ‘Ohio 5/5/2005’ is used. Only two templates are considered here: text date and text integer text integer. Figure 13 shows the template complexity scores for the two templates, calculated by adding the weights of the entities that make up each template and dividing by the number of entities. We can see from this figure that the score of the template text date is higher than that of text integer text integer, and hence text date is preferred.

Ohio 5/5/2005

text date:                   (0.01 + 1) / 2 = 1.01/2 ≈ 0.50
text integer text integer:   (0.01 + 0.4 + 0.01 + 0.4) / 4 = 0.82/4 ≈ 0.21

Figure 13 - Template Weight calculation for the input 'Ohio 5/5/2005'

4.3 Template Consistency (TC)

The template consistency of a template is defined as the percentage of rows of the input text that it matches. Each template of a row is compared with the templates of all other rows, to look for templates that appear consistently across rows. After comparing a template against the templates of all rows, the total count of matches is divided by the number of rows of the input text; this gives the template consistency of the given template. Each line of incoming text can be interpreted in more than one way, since a line typically has more than one template associated with it. Moreover, within a row, ambiguities occur since overlapping sections can be interpreted in multiple ways. For example, the text “IBM 40 1/4” can have the templates shown in Figure 14:


IBM 40 1/4

text integer text integer text
text integer text fraction
text fraction
text integer text date
text

Figure 14 - Template Consistency - Motivation

The correct template text fraction is easy for humans to select since we understand the context of the text. An algorithm, however, will need to compare these 5 templates with those found in other rows. If the next line were “AAPL 30”, then one of its templates would be: text integer. When considered with the previous line (IBM 40 1/4), there is evidence that the correct table has two columns with the first containing text and the second containing numeric information. With many templates formed for each row, the algorithm must select a small number of templates that are consistent with all (or most) of the rows and that fit all of the text into a table conforming to each template. In other words, the process determines what the table would look like if a particular template is used. For example, consider the input text of Table A2 from Appendix A. The four templates that were generated for each of the inputs are shown in Figure 15. The check marks indicate that the template matches that line of input text.


                          Templates
Input text                integer text integer   integer text   text integer   text
1. Hank Aaron 755                  √                  √              √           √
2. Babe Ruth 714                   √                  √              √           √
3. Willie Mays 660                 √                  √              √           √
4. Frank Robinson 586              √                  √              √           √
5. Harmon Killebrew 573            √                  √              √           √
6. Reggie Jackson 563              √                  √              √           √
7. Mike Schmidt 548                √                  √              √           √
8. Mickey Mantle 536               √                  √              √           √
9. Jimmie Foxx 534                 √                  √              √           √
10. Willie McCovey 521             √                  √              √           √
Ted Williams 521                                                     √           √
12. Ernie Banks 512                √                  √              √           √
Eddie Matthews 512                                                   √           √
14. Mel Ott 511                    √                  √              √           √
15. Eddie Murray 500               √                  √              √           √
16. Lou Gehrig 493                 √                  √              √           √
17. Stan Musial 475                √                  √              √           √
Willie Stargell 475                                                  √           √
19. Dave Winfield 465              √                  √              √           √
20. Carl Yastrzemski 452           √                  √              √           √
Template Consistency          0.85 (17/20)       0.85 (17/20)    1 (20/20)    1 (20/20)

Figure 15 - Template Consistency - Need to make the list small

For the above example with 20 rows, the total number of unique templates generated is 4. Typical tables, however, have a larger number of unique templates and thus, it becomes imperative to eliminate templates which are not good candidates. For example, templates that are unique to only one or two rows are unlikely descriptions for the table data. We rank the templates based on their consistency value. The formula for calculating the template consistency of a template is given by:

TC = (# of matches) / (# of rows)

The value of the consistency ranges between 0 and 1. A template that is consistent with all the rows is rated higher than one that matches fewer rows. Generally, better templates will be consistent with a majority of the rows. The steps involved in calculating the consistency value are as follows:

The first step is to calculate the # of matches. Once the templates are constructed, each template is compared across all the templates in all other rows to find the potential matches. The matching of templates is done in two different ways.

First, a template is compared with all the templates from other rows for an exact match. Two templates match if and only if their lengths are the same and their corresponding entities are the same.

Second, when a particular row does not have an exact template match, we check for the presence of a template whose entities could be inherited types of the given template's entities. For example, if the given template is "text integer dollarrange university", then the template "text integer dollar state" is considered a match for the given template, since dollar is an inherited entity of dollarrange (e.g., $100-$200 and $150 should match) and state is an inherited entity of university (e.g., a state should be an allowable match for a university). This process is reasonable since dollar and integer data entities are often used interchangeably. For example, consider some text from table A3 of Appendix A:

1992 Hurricane Andrew $15.5 billion
1989 Hurricane Hugo 4.2 billion
…………


The first row of the input text has a dollar value ($15.5 billion), while the other rows have just a numeric value without the $ sign (4.2 billion). Of the templates created for these inputs, only the first row's templates include the dollar entity, to represent $15.5 billion; the other rows' templates do not. It is common for rows to switch syntax in this way (i.e., a value can be an integer in one row and a real in another). To capture these kinds of discrepancies, this system takes into consideration the inherited entity type concept. So, while calculating the template consistency value, the template integer text dollar text generated for the first row and the template integer text integer text generated for the other rows can be considered the same. Table 4 lists the entities and their respective inherited entities used in this work.

Entity        Inherited Entities
Range         {integer, real, dollar, dollarrange}
Real          {integer, dollar, dollarrange}
Integer       {fraction, real}
Fraction      {integer}
Dollar        {dollarrange, integer, real}
University    {state, city}
State         {university}
City          {university}
Dollarrange   {dollar}

Table 4 - Entity inheritance
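The two matching steps and the consistency formula can be sketched as follows. The inheritance table below is a subset of Table 4, and checking inheritance in both directions is an assumption made for brevity; function names are illustrative.

```python
# Subset of the inherited-entity table (Table 4), consulted when a row
# has no exact template match.
INHERITED = {
    "range": {"integer", "real", "dollar", "dollarrange"},
    "real": {"integer", "dollar", "dollarrange"},
    "integer": {"fraction", "real"},
    "fraction": {"integer"},
    "dollar": {"dollarrange", "integer", "real"},
}

def entities_match(a, b):
    """Exact match, or one entity is an inherited type of the other."""
    return a == b or b in INHERITED.get(a, set()) or a in INHERITED.get(b, set())

def templates_match(t1, t2):
    """Templates match when lengths agree and entities match pairwise."""
    return len(t1) == len(t2) and all(entities_match(a, b) for a, b in zip(t1, t2))

def template_consistency(template, row_template_sets):
    """TC = (# of rows with a matching template) / (# of rows)."""
    matches = sum(
        1
        for row_templates in row_template_sets
        if any(templates_match(template, t) for t in row_templates)
    )
    return matches / len(row_template_sets)
```

On the A3 example, the template integer text dollar text matches a row whose only template is integer text integer text, because dollar inherits integer, so its TC is not penalized by the missing $ sign.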

The template consistency values computed with this formula for the 4 unique templates of A2 are shown in the last row of Figure 15.

As a second example, consider the text shown in Figure 16. The number of unique templates generated for this text is 41. Figure 17 lists all 41 templates with their corresponding template consistency values. Of these, the template "text dollar text range text" is consistent with all five rows, whereas the template "text record text" is consistent with only three. The consistency value for the first template is 1.0 (5/5), whereas for the second it is 0.6 (3/5), so the first template is preferred over the second with regard to consistency (but not necessarily complexity). Note also that more than one template can have the same consistency value. Thus the template consistency value alone is not enough to choose the best template; it must be combined with the complexity value. The next section discusses this.

A+ Auto $2,750 - 3,250 B
Jim Smith's Professional Auto Body $2,999 C
Marco's Body Shop $219.95 - 750 C+
Pro Auto Paint Center $3,000 - 5,000 B
Steve's Auto Body of Chicago $199.95 B

Figure 16 - Template Consistency - input text from A13

S.No   Template                                                   TC
1      text                                                       1.0
2      text integer integer text integer text                     0.2
3      text city text integer text integer text                   0.2
4      text city text integer text                                0.2
5      text city text integer integer text                        0.2
6      text city text                                             0.2
7      text dollar integer text integer text                      0.2
8      text city text dollar text integer text                    0.2
9      text city text real text                                   0.2
10     text integer record text                                   0.2
11     text city text dollar text                                 0.2
12     text integer text integer text integer text integer text   0.4
13     text city text dollar integer text                         0.2
14     text integer integer text                                  0.2
15     text dollar text integer text integer text integer text    0.4
16     text dollar record text                                    0.2
17     text integer text record text integer text                 0.4
18     text integer text integer text integer text                0.6
19     text record text integer text                              0.4
20     text dollar text record text integer text                  0.4
21     text integer text                                          1.0
22     text dollar integer text                                   0.4
23     text integer range text                                    0.4
24     text integer text range text integer text                  0.6
25     text dollar text integer text integer text                 0.6
26     text integer text integer text                             1.0
27     text integer text record text                              0.6
28     text dollar range text                                     0.4
29     text record text                                           0.6
30     text dollar text range text integer text                   0.6
31     text dollar text record text                               0.6
32     text real text integer text                                1.0
33     text real text                                             1.0
34     text integer text range text                               1.0
35     text dollarrange text integer text                         1.0
36     text range text integer text                               1.0
37     text dollar text integer text                              1.0
38     text dollar text                                           1.0
39     text dollarrange text                                      1.0
40     text range text                                            1.0
41     text dollar text range text                                1.0

Figure 17 - Templates generated for input text from A13


4.4 Template Evaluation Score (TES) and Results

The evaluation score of a template is calculated as the product of template consistency and template complexity. This measure was developed because we favor templates that have high values for both TC and TX; the product of TC and TX proved to be a useful function to express this property and was used to rank the templates. For example, consider the input text A2 from Appendix A:

1. Hank Aaron 755
2. Babe Ruth 714
3. Willie Mays 660
4. Frank Robinson 586
5. Harmon Killebrew 573
6. Reggie Jackson 563
7. Mike Schmidt 548
8. Mickey Mantle 536
9. Jimmie Foxx 534
10. Willie McCovey 521
Ted Williams 521
12. Ernie Banks 512
Eddie Matthews 512
14. Mel Ott 511
15. Eddie Murray 500
16. Lou Gehrig 493
17. Stan Musial 475
Willie Stargell 475
19. Dave Winfield 465
20. Carl Yastrzemski 452

The total number of unique templates generated for this input text is 4. Table 5 gives the list of templates with their corresponding evaluation scores.


Template                TC     TX      TES = TC * TX
text                    1      0.01    0.01
integer text            0.85   0.205   0.17425
text integer            1      0.205   0.205
integer text integer    0.85   0.27    0.2295

Table 5 - Template evaluation score
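Ranking by TES = TC × TX can be expressed directly. This is a sketch; the function name is illustrative, and the template tuples and score values below are taken from Table 5.

```python
def rank_templates(scored):
    """Sort templates by TES = TC * TX, best first.

    scored maps template -> (TC, TX).
    """
    return sorted(scored, key=lambda t: scored[t][0] * scored[t][1], reverse=True)

# Values from Table 5 (input text A2).
scores = {
    ("text",): (1.0, 0.01),
    ("integer", "text"): (0.85, 0.205),
    ("text", "integer"): (1.0, 0.205),
    ("integer", "text", "integer"): (0.85, 0.27),
}

ranking = rank_templates(scores)
```

The first element of the ranking is integer text integer (TES 0.2295), matching the discussion below, and taking the first four elements of such a ranking implements the top-4 selection used later in this section.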

In Table 5, the columns TC, TX, and TES give the Template Consistency, Template Complexity, and Template Evaluation Score values respectively. The value of 1 for TC for the templates text and text integer indicates that they are consistent with all 20 rows. The template integer text integer has the highest complexity value of 0.27 and is considered the best overall template since it has the highest evaluation score of 0.2295. It is interesting to note that even though this template's consistency value is lower, its higher complexity value makes it better than the others. The equations for calculating TC, TX, and TES are meant to balance the relative advantages of template consistency and template complexity.

Constructing tables for all generated templates would be very time consuming: on average, a total of 25 unique templates are formed for each table, and constructing that many tables for evaluation is impractical. Thus, we desire to reduce the number of templates to a small number. After calculating the evaluation scores of all the templates, the next task is to select the best templates and construct the tables using the input text. First, all the unique templates are sorted by the evaluation score TES. From the experiments conducted on the 15 input texts in Appendix A, we noted that in most cases more than one template had the same rank for a particular input text. After ranking the templates, it is important to choose the best templates to construct the table.

We have observed that the top 4 templates contain the correct one in all but one case (A11). Taking this into consideration, the four highest-ranked templates are selected for constructing tables and the others are disregarded. Figure 18 lists the top 4 templates with their corresponding TES values for all the tables considered (refer to Appendix A for the list of all input texts). The boldface templates are the correct templates (from our judgment). Note that the desired template ranks in the top 4 in all but one case (A11), giving an accuracy of 93% when considering the top 4 templates. For A11, the top four templates do contain a reasonable alternative: text parenthetical text.

(A1) Holiday / Cinemax, Sunday, noon ...
     text clocktime                                                 0.505
     text day text clocktime                                        0.455
     text network text clocktime                                    0.455
     text network text day text clocktime                           0.439

(A2) 1. Hank Aaron 755 ...
     integer text integer                                           0.229
     text integer                                                   0.205
     integer text                                                   0.174
     text                                                           0.010

(A3) 1992 Hurricane Andrew $15.5 billion ...
     text dollar                                                    0.505
     integer text dollar                                            0.470
     text real                                                      0.405
     integer text real                                              0.403

(A4) Cuba 3-4-6 13 ...
     country text date text                                         0.455
     country text range text                                        0.455
     country text record text                                       0.455
     country text record text integer                               0.454

(A5) 12:30 p.m.: Arizona at Iowa (ESPN) ...
     clocktime text parenthetical                                   0.670
     clocktime text university text parenthetical                   0.564
     clocktime text state text parenthetical                        0.564
     clocktime text university text university text parenthetical   0.518

(A6) Markus Zberg (Switzerland) ...
     text parenthetical text                                        0.340
     text country text                                              0.273
     text parenthetical text integer text                           0.085
     text country text integer text                                 0.073

(A7) $0 to $14,999: 20% ...
     dollar text percentage                                         0.670
     dollar text dollar text percentage                             0.604
     dollar text                                                    0.505
     text dollar text percentage                                    0.505

(A8) David Cone 1-1 0.64 ...
     text record text real                                          0.455
     text range text real                                           0.455
     text real                                                      0.405
     text range text integer integer                                0.364

(A9) Drama series: "ER," NBC. ...
     text network text                                              0.260
     text quote text network text                                   0.158
     text state text network text                                   0.088
     text state text                                                0.074

(A10) Essex: 12 Sept Sussex (h), 19 Sept Glamorgan (h) ...
     text parenthetical text parenthetical                          0.505
     text date text parenthetical                                   0.505
     text parenthetical text date text parenthetical                0.505
     text date text parenthetical text parenthetical                0.505

(A11) Essex (5) 15 8 3 4 241 ...
     text parenthetical text integer                                0.355
     text parenthetical text                                        0.340
     text parenthetical text integer text integer                   0.305
     text parenthetical text integer text                           0.286
     text parenthetical text integer text integer text integer text 0.255

(A12) 6. Lily White, Susan Isaacs ...
     integer text                                                   0.205
     text university                                                0.040
     integer text university                                        0.040
     text                                                           0.010

(A13) A+ Auto $2,750 - 3,250 B ...
     text dollar text range text                                    0.406
     text range text                                                0.340
     text dollarrange text                                          0.340
     text dollar text                                               0.340

(A14) Gloria So, 30; Jeffery Lloyd, 30. ...
     text integer text integer text                                 0.166
     text integer                                                   0.145
     text integer text integer                                      0.145
     text integer text                                              0.140

(A15) 1 April 7, 1977 ...
     integer text date text                                         0.355
     text date text                                                 0.340
     integer text date text integer text                            0.217
     integer text integer text date text                            0.217

Figure 18 - Top 4 templates for all the input text considered


5 Table Evaluation

For each of the 4 selected templates, a table is constructed from the input text so as to evaluate how well the text fits a table with the given template. Tables are constructed by matching each line of input text with the selected template. Each row of the input text is broken into entities and matched with the individual entities of the selected 4 templates. For example, the template "text dollarrange text" would match the text "A+ Auto $2,750 – 3,250 B" in the following way: text = "A+ Auto"; dollarrange = "$2,750 – 3,250"; text = "B". Once this is done for all rows, the candidate table can be evaluated.

Constructing the table in this way is not trivial because often the given text will not match the template. While it is desirable to have all columns filled, there are situations where empty cells are acceptable. For example, consider the following input text from A2 and the template integer text integer:

1. Hank Aaron 755
2. Babe Ruth 714
Ted Williams 521

The table interpretation for this template is shown in Figure 19. The first column is empty for the input text 'Ted Williams 521', since that line has no text matching the entity 'integer'. It therefore becomes necessary to evaluate tables in a way that is tolerant of empty cells and other ambiguities. Tables with no empty cells should be scored higher than those with empty cells.

integer    text            integer
1.         Hank Aaron      755
2.         Babe Ruth       714
           Ted Williams    521

Figure 19 - Example of table interpretation
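The fitting step described above can be sketched as a small backtracking aligner. This is an assumption about the mechanics, not the thesis's ANTLR-based implementation: the caller supplies, for each token, the set of entity labels it could carry, and a 'text' entity is allowed to absorb any number of tokens.

```python
def fit_row(tokens, labels, template):
    """Align tokens to a template, tolerating empty cells.

    tokens: list of words; labels: per-token sets of possible entities.
    Returns one cell string per template entity ('' where nothing
    matched), or None if no alignment exists. Filling a cell is tried
    before leaving it empty, so empty cells appear only when forced.
    """
    def solve(ti, ei):
        if ei == len(template):
            return [] if ti == len(tokens) else None
        ent = template[ei]
        if ent == "text":
            # A text entity may absorb one or more tokens.
            for end in range(ti + 1, len(tokens) + 1):
                rest = solve(end, ei + 1)
                if rest is not None:
                    return [" ".join(tokens[ti:end])] + rest
        elif ti < len(tokens) and ent in labels[ti]:
            rest = solve(ti + 1, ei + 1)
            if rest is not None:
                return [tokens[ti]] + rest
        rest = solve(ti, ei + 1)  # leave this cell empty
        if rest is not None:
            return [""] + rest
        return None

    return solve(0, 0)
```

For the Figure 19 example, fitting "Ted Williams 521" against integer text integer leaves the first cell empty, exactly as shown in the figure.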

Evaluating the associated tables is done with another scoring function, based on the concepts of row completeness and column cohesion (i.e., a table's columns must be homogeneous). Row completeness is a criterion that determines whether the generated rows in the table are complete, that is, zero (or very few) columns are empty. The details of row completeness and column cohesion are discussed below.

5.1 Row completeness (Rc)

Row completeness is a criterion that determines whether the generated rows in the table are complete (i.e. all columns contain data that matches the corresponding entity). Row completeness is a measure of how well a particular template realizes the rows for the table. The formula for calculating row completeness is defined as:

Row completeness (Rc) = Σ over all rows of (# of columns containing text) / (# of columns)

For example, consider the input text of Figure 19: the first column of the third row is empty because the input text has no matching text for the integer column. Figure 20 gives the row completeness values for all the rows of the input text from A2.

Input text (template: integer text integer)    Rc
1. Hank Aaron 755                              1
2. Babe Ruth 714                               1
3. Willie Mays 660                             1
4. Frank Robinson 586                          1
5. Harmon Killebrew 573                        1
6. Reggie Jackson 563                          1
7. Mike Schmidt 548                            1
8. Mickey Mantle 536                           1
9. Jimmie Foxx 534                             1
10. Willie McCovey 521                         1
Ted Williams 521                               0.667
12. Ernie Banks 512                            1
Eddie Matthews 512                             0.667
14. Mel Ott 511                                1
15. Eddie Murray 500                           1
16. Lou Gehrig 493                             1
17. Stan Musial 475                            1
Willie Stargell 475                            0.667
19. Dave Winfield 465                          1
20. Carl Yastrzemski 452                       1

Figure 20 - Row completeness values for the Table A2
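The per-row term of the formula above is a one-line sketch; the function name is illustrative.

```python
def row_completeness(row):
    """Rc for one table row: fraction of cells that contain text."""
    return sum(1 for cell in row if cell.strip()) / len(row)
```

On the Figure 20 data, a fully matched row such as "1. Hank Aaron 755" scores 1, while "Ted Williams 521", whose integer cell is empty, scores 2/3 ≈ 0.667.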

5.2 Column cohesion (CC)

Column cohesion is a factor that determines how well the generated column's contents, over all the rows, actually fall under the correct column. Column cohesion is similar to the concept of column homogeneity. The main motivation for considering this factor is that row completeness alone is not enough to select the best table: row completeness simply favors templates that have data in a matching cell. Ambiguities arise that make it necessary to differentiate between the alternatives. To identify which alternative is the best interpretation, the concept of column cohesion was developed. Consider table A1 from Appendix A:

Holiday / Cinemax, Sunday, noon
On the Waterfront / TNT, Sunday, 11 p.m.
In Cold Blood / TMC, early Monday, 3:35 a.m.
GoodFellas / Cinemax, Monday, 9:30 p.m.
Sullivan's Travels / AMC, early Thursday, 3:15 a.m.
Love Me Tonight / AMC, Thursday, 1 p.m.

After evaluating all the templates, the template "text clocktime" is selected as one of the best templates and a table is constructed from the above input text, as shown in Table 6.

text                                            clocktime
Holiday / Cinemax, Sunday,                      noon
On the Waterfront / TNT, Sunday,                11 p.m.
In Cold Blood / TMC, early Monday,              3:35 a.m.
GoodFellas / Cinemax, Monday,                   9:30 p.m.
Sullivan's Travels / AMC, early Thursday, 3:15  a.m.
Love Me Tonight / AMC, Thursday,                1 p.m.

Table 6 - One valid table for the input text A1

Table 6 gives the impression that it has realized all the input text into the desired table. Actually, this table is not the desired one since row 5 is not perfect: the time value "3:15" should also be under the clocktime column, not the text column. (Note that "a.m." alone is considered a clocktime, as is "3:15 a.m.".) It is necessary to identify that the second alternative is better, which can be done by comparing the cell's contents with those of other rows. When comparing the contents of a cell with others in the same column, we have identified a preliminary list of relevant factors: similarity of text length, similarity of letter case, and column sortedness. These factors are based on the general human intuition of how the data values of a particular column should look.

Numerical, date, or text columns can be checked as to whether they have been arranged in ascending or descending order. The existence of this property is highly suggestive of having the data in the correct column. The text from A4 in appendix A is shown next:

3M - 10/24/96 5.175/160 5.165/165 5.316 171
6M - 1/23/97 5.300/290 5.300/295 5.526 5
1Y - 7/24/97 5.540/525 5.535/530 5.858 445

The values 10/24/96, 1/23/97, and 7/24/97 of the three rows can be considered to be the data values of a single column as they are arranged in ascending order, representing dates.

At this point, this work only considers the similarity of the length of the data in a column. The function for column cohesion favors columns that have contents with similar lengths. The idea is that columns in which all elements have the same length are thought to be evidence of a correct column. The standard deviation of the lengths is used to assess the column cohesion:


Column Cohesion (Cc) = (MAXLENGTH − STD(lengths of the column's values)) / MAXLENGTH

This function favors columns with data of similar lengths, which makes the standard deviation small.

Finally, tables are scored by adding the two values:

Table Score = Row Completeness + Column Cohesion
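The two formulas combine into a table score as sketched below. MAXLENGTH is an assumed constant (the thesis does not state its value), and statistics.pstdev computes the standard deviation of the cell lengths.

```python
import statistics

# Assumed bound on cell length; the thesis does not give MAXLENGTH's value.
MAXLENGTH = 80

def column_cohesion(column):
    """Cc = (MAXLENGTH - std of cell lengths) / MAXLENGTH.

    Columns whose cells have similar lengths score close to 1.
    """
    lengths = [len(cell) for cell in column]
    return (MAXLENGTH - statistics.pstdev(lengths)) / MAXLENGTH

def table_score(table):
    """Table Score = total row completeness + total column cohesion."""
    rc = sum(sum(1 for c in row if c.strip()) / len(row) for row in table)
    cc = sum(column_cohesion(list(col)) for col in zip(*table))
    return rc + cc
```

A column with identical-length cells has standard deviation 0 and cohesion 1; ragged columns score lower, which is how the "3:15" misplacement of Table 6 is penalized.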

5.3 Table recognition - Example

Having seen how to evaluate the templates and tables, an example will be given that shows the entire algorithm step by step using table A6.

Stefano Zanini (Italy) 4 hours 23 minutes 13 seconds
Thomas Muhlbacher (*) (Austria)
Marcel Strauss (Switzerland)
Marc Lotz (Netherlands) all same time
Ellis Rastelli (Italy) 11 seconds behind
Fred Rodriguez (U.S.) 7:31
Markus Zberg (Switzerland)
Rene Haselbacher (Austria)
Richard Virenque (France)
Oscar Mason (Italy) all same time

The above input text is tokenized and then grouped to match with entities to generate the templates. The unique templates generated for this input text are shown below in Figure 21.

Template                                                          TC    TX        TES
text                                                              1     0.01      0.01
text integer text integer text                                    0.1   0.166     0.0166
text integer text integer text integer text                      0.1   0.177143  0.0177143
text integer                                                      0.1   0.205     0.0205
text integer text integer                                         0.1   0.205     0.0205
text country text integer text integer text integer text          0.1   0.227778  0.0227778
text country text integer text integer text                       0.1   0.234286  0.0234286
text parenthetical text integer text integer text integer text    0.1   0.25      0.025
text parenthetical text integer text integer text                 0.1   0.262857  0.0262857
text country text integer text integer                            0.1   0.271667  0.0271667
text university text                                              0.1   0.273333  0.0273333
text parenthetical text integer text integer                      0.1   0.305     0.0305
text country text integer                                         0.1   0.305     0.0305
text university text country text                                 0.1   0.326     0.0326
text parenthetical text integer                                   0.1   0.355     0.0355
text parenthetical text country text                              0.1   0.366     0.0366
text university text parenthetical text                           0.1   0.366     0.0366
text integer text                                                 0.3   0.14      0.042
text country text clocktime                                       0.1   0.455     0.0455
text parenthetical                                                0.1   0.505     0.0505
text parenthetical text clocktime                                 0.1   0.505     0.0505
text parenthetical text parenthetical                             0.1   0.505     0.0505
text clocktime                                                    0.1   0.505     0.0505
text country text integer text                                    0.3   0.246     0.0738
text parenthetical text integer text                              0.3   0.286     0.0858
text country text                                                 1     0.273333  0.273333
text parenthetical text                                           1     0.34      0.34

Figure 21 - Unique templates generated for input text A6

The top 4 templates are text parenthetical text, text country text, text parenthetical text integer text and text country text integer text. The HTML tables generated for the top 4 templates are shown below.


1. text parenthetical text

text                     parenthetical    text
Stefano Zanini           (Italy)          4 hours 23 minutes 13 seconds
Thomas Muhlbacher (*)    (Austria)
Marcel Strauss           (Switzerland)
Marc Lotz                (Netherlands)    all same time
Ellis Rastelli           (Italy)          11 seconds behind
Fred Rodriguez           (U.S.)           7:31
Markus Zberg             (Switzerland)
Rene Haselbacher         (Austria)
Richard Virenque         (France)
Oscar Mason              (Italy)          all same time

Table 7 - Table generated for A6 with template text parenthetical text

2. text country text

text                       country        text
Stefano Zanini (           Italy          ) 4 hours 23 minutes 13 seconds
Thomas Muhlbacher (*) (    Austria        )
Marcel Strauss (           Switzerland    )
Marc Lotz (                Netherlands    ) all same time
Ellis Rastelli (           Italy          ) 11 seconds behind
Fred Rodriguez (           U.S.           ) 7:31
Markus Zberg (             Switzerland    )
Rene Haselbacher (         Austria        )
Richard Virenque (         France         )
Oscar Mason (              Italy          ) all same time

Table 8 - Table generated for A6 with template text country text


3. text parenthetical text integer text

text                     parenthetical    text            integer   text
Stefano Zanini           (Italy)                          4         hours 23 minutes 13 seconds
Thomas Muhlbacher (*)    (Austria)
Marcel Strauss           (Switzerland)
Marc Lotz                (Netherlands)    all same time
Ellis Rastelli           (Italy)                          11        seconds behind
Fred Rodriguez           (U.S.)                           7         :31
Markus Zberg             (Switzerland)
Rene Haselbacher         (Austria)
Richard Virenque         (France)
Oscar Mason              (Italy)          all same time

Table 9 - Table generated for A6 with template text parenthetical text integer text

4. text country text integer text

text                       country        text             integer   text
Stefano Zanini (           Italy          )                4         hours 23 minutes 13 seconds
Thomas Muhlbacher (*) (    Austria        )
Marcel Strauss (           Switzerland    )
Marc Lotz (                Netherlands    ) all same time
Ellis Rastelli (           Italy          )                11        seconds behind
Fred Rodriguez (           U.S.           )                7         :31
Markus Zberg (             Switzerland    )
Rene Haselbacher (         Austria        )
Richard Virenque (         France         )
Oscar Mason (              Italy          ) all same time

Table 10 - Table generated for A6 with template text country text integer text


Each of the tables is evaluated using the scoring functions discussed in the previous section. The scores for these tables are shown in Table 11.

Template                                Table Score
text country text integer text          37.5
text parenthetical text integer text    27.3
text country text                       42.8
text parenthetical text                 42.8

Table 11 - Tables Scores for input text A6

Table 11 shows that the templates text parenthetical text and text country text have the highest table score of 42.8. The tables associated with these templates, therefore, are considered the best ones. Note that extraneous "(" and ")" characters are present in the columns adjacent to the country column (see Table 8). It would be possible to post-process the data to remove leading and trailing punctuation in columns.
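The suggested post-processing step could be sketched as follows. The set of stripped punctuation characters is an assumption, and stripping is deliberately naive (it would, for example, also remove a trailing period from an abbreviation such as "U.S.").

```python
def strip_outer_punctuation(cell):
    """Drop leading/trailing punctuation left behind by entity extraction."""
    return cell.strip(" ()[]{},.:;-")

def postprocess(table):
    """Clean every cell of a constructed table."""
    return [[strip_outer_punctuation(cell) for cell in row] for row in table]
```

Applied to Table 8, this turns a cell like "Stefano Zanini (" back into "Stefano Zanini" and ") 7:31" into "7:31".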

6. Conclusion

In this research, a new technique for reconstructing tabular constructs from free text is presented. The technique is accomplished through the use of scoring functions. These scoring functions do not require alignment information for the input, such as column boundaries, row boundaries, cell boundaries, or the size, width, and height of the cells. This methodology supports the selection of correct templates to ultimately reconstruct the table. It takes into consideration two important criteria for selecting the correct templates that have not been used previously: template consistency and template complexity.

Other work in this research involved examining this methodology to ensure it selects the correct templates. Template consistency (TC) and complexity (TX) values were developed to assess templates. The scoring function (TES) used to evaluate a template was created on the idea that templates that are consistent with a majority of the rows and exhibit reasonable complexity are good indicators of a good template. Templates are created from sample input texts, based on syntactic entities that match the rows of the input text. The matching process also incorporates the concept of entity inheritance. The effectiveness of this methodology was evaluated on a set of tables that have appeared in real electronic feeds. This work uses the compiler tool ANTLR [ANTLR06] to parse the input text in order to identify the presence of entities; grammar rules were written to describe each entity.

Once a template is selected for consideration, the original text is mapped into a table using the template. A scoring function was developed to evaluate how well the text mapped into that table. This scoring function is based on two important aspects: row completeness and column cohesion. Row completeness favors rows that have data in all cells. Column cohesion makes sure that the data inside a column has similar properties. These properties include similarity of column length, similarity of column case sensitiveness, and the sortedness of a column. The results presented here use only the similarity of column length. This idea of using column cohesion and row completeness to evaluate the tables is novel.

Future work could include: expanding the number of entities in the rule set, identifying the presence of a header row, recognizing rows that span multiple lines in the input and incorporating more features into the column cohesion measure.

7. References

1) [ANTLR06] Terence Parr, "An Introduction to ANTLR", www.antlr.org, accessed December 2005.
2) [HCJ99] Ng, H.T., Lim, C.Y., and Koo, J.L.T., "Learning to Recognize Tables in Free Text", Proceedings of the 37th Annual Meeting of the ACL, pp. 443-450, 1999.
3) [JSNG05] Jiwon Shin and Nick Guerette, "Table Recognition and Evaluation", Proceedings of the Class of 2005 Senior Conference, pp. 8-13, 2005.
4) [MT96] Marcy Thomson, "A Tables Manifesto", Proceedings of SGML Europe 1996, Munich, Germany, pp. 51-153, May 1996.
5) [MFH00] Matthew Francis Hurst, "The Interpretation of Tables in Texts", Ph.D. Dissertation, The University of Edinburgh, 2000.
6) [MAZ01] Zmuda, M., "Recovering Tabular Information from ASCII Documents Using Evolutionary Programming", Proceedings of the Artificial Neural Networks in Engineering Conference, ASME Press, pp. 189-195, 2001.
7) [PCTINTIN97] Pyreddy, P. and Croft, W.B., "TINTIN: A System for Retrieval in Text Tables", Proceedings of the Second ACM International Conference on Digital Libraries, pp. 193-200, 1997.
8) [TGK98] T.G. Kieninger, "Table Structure Recognition Based on Robust Block Segmentation", Proc. Document Recognition V, SPIE, volume 3305, San Jose, CA, pp. 22-32, January 1998.
9) [WTSW98] W. Tersteegen and C. Wenzel, "ScanTab: Table Recognition by Reference Tables", Proc. Third Workshop on Document Analysis Systems, Nagano, Japan, 1998.

8. Supporting Papers

1) H. Chen, S. Tsai, and J. Tsai, "Mining Tables from Large Scale HTML Texts", 18th International Conference on Computational Linguistics (COLING), pp. 166-172, 2000.
2) J. Hu, R. Kashi, D. Lopresti, and G. Wilfong, "Table Detection Across Multiple Media", Proceedings of the Workshop on Document Layout Interpretation and its Applications, Bangalore, India, 1999.
3) R. Zanibbi, D. Blostein, and J.R. Cordy, "A Survey of Table Recognition: Models, Observations, Transformations, and Inference", 2003.
4) Y. Wang and J. Hu, "A Machine Learning Based Approach for Table Detection on the Web", The Eleventh International World Wide Web Conference, Honolulu, USA, 2002.
5) Eugene Agichtein and Venkatesh Ganti, "Mining Reference Tables for Automatic Text Segmentation", Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, USA, pp. 20-29, 2004.
6) Ashwin Tengli, Yiming Yang, and Nian Li Ma, "Learning Table Extraction from Examples", Technical Report, Carnegie Mellon University, 2004.
7) David Pinto, Andrew McCallum, Xing Wei, and W. Bruce Croft, "Table Extraction Using Conditional Random Fields", Proceedings of the 26th ACM SIGIR, 2003.

Appendix A. Sample input tables

The following is a list of input texts that are representative of those examined in this work.

Input text A1
Holiday / Cinemax, Sunday, noon
On the Waterfront / TNT, Sunday, 11 p.m.
In Cold Blood / TMC, early Monday, 3:35 a.m.
GoodFellas / Cinemax, Monday, 9:30 p.m.
Sullivan's Travels / AMC, early Thursday, 3:15 a.m.
Love Me Tonight / AMC, Thursday, 1 p.m.

Input text A2
1. Hank Aaron 755
2. Babe Ruth 714
3. Willie Mays 660
4. Frank Robinson 586
5. Harmon Killebrew 573
6. Reggie Jackson 563
7. Mike Schmidt 548
8. Mickey Mantle 536
9. Jimmie Foxx 534
10. Willie McCovey 521
Ted Williams 521
12. Ernie Banks 512
Eddie Matthews 512
14. Mel Ott 511
15. Eddie Murray 500
16. Lou Gehrig 493
17. Stan Musial 475
Willie Stargell 475
19. Dave Winfield 465
20. Carl Yastrzemski 452


Input text A3
1992 Hurricane Andrew $15.5 billion
1989 Hurricane Hugo 4.2 billion
1994 Northridge, Calif., earthquake 2.5 billion
1992 Hurricane Iniki 1.6 billion
1991 Oakland, Calif., fires 1.6 billion
1989 Loma Prieta, Calif., earthquake 960 million
1993 Southern California wildfires 950 million
1982 Wind, snow, freezing temperatures 880 million
1992 riots 775 million
1979 Hurricane Frederic 752 million

Input text A4
12-16-5 33
Russia 13-8-6 27
Germany 3-9-12 24
China 7-6-6 19
France 7-4-7 14
Italy 5-5-4 14
Cuba 3-4-6 13
Australia 3-2-7 12
Poland 5-3-3 11
Hungary 3-2-5 10

Input text A5
Noon: Michigan State at Nebraska (WJLA-7)
Noon: Georgia Tech at North Carolina State (WMAR-2)
12:30 p.m.: Arizona at Iowa (ESPN)
3:30 p.m.: Duke at Florida State (WJLA-7, WMAR-2)
3:30 p.m.: Miami (Ohio) at Ball State (WNVT-53)
3:30 p.m.: Kentucky at Cincinnati (HTS)
7 p.m.: Howard at Marshall (WNVT-53)
8 p.m.: UCLA at Tennessee (WUSA-9, WJZ-13)

Input text A6
Stefano Zanini (Italy) 4 hours 23 minutes 13 seconds
Thomas Muhlbacher (*) (Austria)

Marcel Strauss (Switzerland)
Marc Lotz (Netherlands) all same time
Ellis Rastelli (Italy) 11 seconds behind
Fred Rodriguez (U.S.) 7:31
Markus Zberg (Switzerland)
Rene Haselbacher (Austria)
Richard Virenque (France)
Oscar Mason (Italy) all same time

Input text A7
$0 to $14,999: 20%
$15,000 to $24,999: 15%
$25,000 to $49,999: 41%
$50,000 to $74,999: 17%
$75,000 to $99,999: 5%
$100,000 or more: 2%

Input text A8
1-2 5.71
Jack McDowell 0-3 8.30
0-1 4.15
1-1 0.64
0-0 25.08
0-0 15.00

Input text A9
Drama series: "ER," NBC.
Comedy series: "Frasier," NBC.
Miniseries: "Gulliver's Travels," NBC.
Television movie: "Truman," HBO.
Variety, music or comedy special: "The Kennedy Center Honors," CBS.
Variety, music or comedy series: "Dennis Miller Live," HBO.
Actor, drama series: Dennis Franz, "NYPD Blue," ABC.
Actress, drama series: Kathy Baker, "Picket Fences," CBS.
Actor, comedy series: John Lithgow, "3rd Rock From the Sun," NBC.
Actress, comedy series: Helen Hunt, "Mad About You," NBC.
Actor, miniseries or special: Alan Rickman, "Rasputin," HBO.
Actress, miniseries or special: Helen Mirren, "Prime Suspect: Scent of Darkness," PBS.
Supporting actor, drama series: Ray Walston, "Picket Fences," CBS.
Supporting actress, drama series: Tyne Daly, "Christy," CBS.

Supporting actor, comedy series: Rip Torn, "The Larry Sanders Show," HBO.
Supporting actress, comedy series: Julia Louis-Dreyfus, "Seinfeld," NBC.
Supporting actor, miniseries or special: Tom Hulce, "The Heidi Chronicles," TNT.
Supporting actress, miniseries or special: Greta Scacchi, "Rasputin," HBO.
Guest actor, drama series: Peter Boyle, "The X-Files: Clyde Bruckman's Final Repose," Fox.
Guest actress, drama series: Amanda Plummer, "The Outer Limits," Showtime.
Guest actor, comedy series: Tim Conway, "Coach: The Gardener," ABC.
Guest actress, comedy series: Betty White, "The John Larroquette Show: Here We Go Again," NBC.

Input text A10
Leicestershire: 12 Sept Durham (a), 19 Sept Middlesex (h)
Surrey: 12 Sept Glamorgan (a), 19 Sept Worcestershire (h)
Derbyshire: 12 Sept Warwickshire (h), 19 Sept Durham (h)
Essex: 12 Sept Sussex (h), 19 Sept Glamorgan (h)
Kent: 12 Sept Hampshire (h), 19 Sept Gloucestershire (a)

Input text A11
Leicestershire (7) 15 8 1 6 248
Surrey (12) 15 8 1 6 247
Derbyshire (14) 15 8 2 5 242
Essex (5) 15 8 3 4 241
Kent (18) 15 8 1 6 233
Yorkshire (8) 15 7 5 3 214

Input text A12
1. Executive Orders, Tom Clancy
2. Servant of the Bones, Anne Rice
3. The Last Don, Mario Puzo
4. The Runaway Jury, John Grisham
5. Out of Sight, Elmore Leonard
6. Lily White, Susan Isaacs
7. Falling Up, Shel Silverstein
8. Cause of Death, Patricia Cornwell
9. How Stella Got Her Groove Back, Terry McMillan
10. The Burning Man, Phillip Margolin

Input text A13
A+ Auto $2,750 - 3,250 B

Jim Smith's Professional Auto Body $2,999 C
Marco's Body Shop $219.95 - 750 C+
Pro Auto Paint Center $3,000 - 5,000 B
Steve's Auto Body of Chicago $199.95 B

Input text A14
Monica Allen, 25; Mason Hudspeth, 20.
Mara Baccus, 19; Matthew Allen, 25.
Kristy Cantrell, 24; Brian Thurman, 26, both of Coweta.
April Ford, 16; Richard Mestas, 21.
Andrea Griffith, 17; Eric Dunn, 20.
Constance Hampton, 17; Joshua Taylor, 18, both of Broken Arrow.
Sherrie Hicks, 42; Marvin Rankins, 47.
Shannon Hixon, 23; Alex Risley, 23.
Holly McClung, 29; Michael Biggs, 29, both of Rowlett, .
Mary Mermoud, 39; Gary Bridwell, 43.
Tracy Miller, 39; Phillip Bailey, 37.
Teresa Mongon, 32; Charles Tyler, 29, both of Glenpool.
Heather Moretz, 29; Scott Palmer, 23.
Diana Naifeh, 42; Ronald Lakey, 45.
Thu Nego, 28; Buu Nguyen, 29, of Philadelphia.
Lisa Patterson, 19; Jason Miller, 20.
Rachel Pierce, 24; Timothy Beck Jr., 24.
Nancy Reynolds, 44; Robert Phillips, 46.
Rhonda Rice-Dougall, 35; Joseph Nappo, 40, both of Castle Rock, Colo.
Gloria So, 30; Jeffery Lloyd, 30.
Mary Tramel, 28; Eric Giangreco, 27.
Jeanette Walker, 19; Napoleon Manzano, 28, both of Mounds.
Jill Wolfenbarger, 26; Clifton Roberts, 25.
Tamera Young, 18; Steven Burley, 19.

Input text A15
1 April 7, 1977 Bert Blyleven
500 Sept. 9, 1979 Win Remmerswaal
1,000 April 10, 1983
1,500 Sept. 14, 1985
2,000 Sept. 12, 1988

2,500 Sept. 30, 1991 Dennis Rasmussen
3,000 June 30, 1995
