TURL: Table Understanding Through Representation Learning
Xiang Deng∗ (The Ohio State University, Columbus, Ohio)
Huan Sun∗ (The Ohio State University, Columbus, Ohio)
Alyssa Lees (Google Research, New York, NY)
You Wu (Google Research, New York, NY)
Cong Yu (Google Research, New York, NY)
∗Corresponding authors.

ABSTRACT

Relational tables on the Web store a vast amount of knowledge. Owing to the wealth of such tables, there has been tremendous progress on a variety of tasks in the area of table understanding. However, existing work generally relies on heavily-engineered task-specific features and model architectures. In this paper, we present TURL, a novel framework that introduces the pre-training/fine-tuning paradigm to relational Web tables. During pre-training, our framework learns deep contextualized representations on relational tables in an unsupervised manner. Its universal model design with pre-trained representations can be applied to a wide range of tasks with minimal task-specific fine-tuning. Specifically, we propose a structure-aware Transformer encoder to model the row-column structure of relational tables, and present a new Masked Entity Recovery (MER) objective for pre-training to capture the semantics and knowledge in large-scale unlabeled data. We systematically evaluate TURL with a benchmark consisting of 6 different tasks for table understanding (e.g., relation extraction, cell filling). We show that TURL generalizes well to all tasks and substantially outperforms existing methods in almost all instances.

PVLDB Reference Format:
Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. TURL: Table Understanding through Representation Learning. PVLDB, 14(3): 307-319, 2021. doi:10.14778/3430915.3430921

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/sunlab-osu/TURL.

[Figure 1: An example of a relational table from Wikipedia.]

1 INTRODUCTION

Relational tables are in abundance on the Web and store a large amount of knowledge, often with key entities in one column and attributes in the others. Over the past decade, various large-scale collections of such tables have been aggregated [4, 6, 7, 25]. For example, Cafarella et al. [6, 7] reported 154M relational tables out of a total of 14.1 billion tables in 2008. More recently, Bhagavatula et al. [4] extracted 1.6M high-quality relational tables from Wikipedia. Owing to the wealth and utility of these datasets, various tasks such as table interpretation [4, 16, 29, 35, 47, 48], table augmentation [1, 6, 12, 42, 45, 46], etc., have made tremendous progress in the past few years.

However, previous work such as [4, 45, 46] often relies on heavily-engineered task-specific methods such as simple statistical/language features or straightforward string matching. These techniques suffer from several disadvantages. First, simple features only capture shallow patterns and often fail to handle the flexible schema and varied expressions in Web tables. Second, task-specific features and model architectures require effort to design and do not generalize well across tasks.

Recently, the pre-training/fine-tuning paradigm has achieved notable success on unstructured text data. Advanced language models such as BERT [15] can be pre-trained on large-scale unsupervised text and subsequently fine-tuned on downstream tasks using task-specific supervision. In contrast, little effort has been extended to the study of such paradigms on relational tables. Our work fills this research gap.

Promising results in some table-based tasks were achieved by the representation learning model of [13]. This work serializes a table into a sequence of words and entities (similar to text data) and learns embedding vectors for words and entities using Word2Vec [28]. However, [13] cannot generate contextualized representations, i.e., it does not consider the varied use of words/entities in different contexts and only produces a single fixed embedding vector per word/entity. In addition, shallow neural models like Word2Vec have relatively limited learning capabilities, which hinder the capture of complex semantic knowledge contained in relational tables.

We propose TURL, a novel framework for learning deep contextualized representations on relational tables via pre-training in an unsupervised manner and task-specific fine-tuning.

There are two main challenges in the development of TURL: (1) Relational table encoding. Existing neural network encoders are designed for linearized sequence input and are a good fit with unstructured texts. However, data in relational tables is organized in a semi-structured format. Moreover, a relational table contains multiple components including the table caption, headers and cell values. The challenge is to develop a means of modeling the row-and-column structure as well as integrating the heterogeneous information from different components of the table. (2) Factual knowledge modeling. Pre-trained language models like BERT [15] and ELMo [31] focus on modeling the syntactic and semantic characteristics of word use in natural sentences. However, relational tables contain a vast amount of factual knowledge about entities, which cannot be captured by existing language models directly. Effectively modeling such knowledge in TURL is a second challenge.

To address the first challenge, we encode information from different table components into separate input embeddings and fuse them together. We then employ a structure-aware Transformer [38] encoder with masked self-attention. The conventional Transformer model is a bi-directional encoder, so each element (i.e., token/entity) can attend to all other elements in the sequence. We explicitly model the row-and-column structure by restraining each element to only aggregate information from other structurally related elements. To achieve this, we build a visibility matrix based on the table structure and use it as an additional mask for the self-attention layer.
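As a concrete illustration of this mechanism, the following PyTorch sketch builds a visibility matrix from row/column membership and applies it as an additional mask inside scaled dot-product self-attention. It is a minimal sketch of the idea rather than the released implementation: the sentinel id -1 for metadata elements, the single attention head, and the absence of learned query/key/value projections are all simplifying assumptions.

```python
# Minimal sketch (not the authors' code) of masked self-attention with a
# table-structure visibility matrix. Shapes and the visibility convention
# (sentinel id -1 for caption/topic-entity elements) are assumptions.
import torch
import torch.nn.functional as F

def build_visibility_matrix(row_ids, col_ids):
    """row_ids, col_ids: LongTensors of shape (seq_len,).
    Metadata elements use id -1 so they are visible to every element."""
    same_row = row_ids.unsqueeze(0) == row_ids.unsqueeze(1)
    same_col = col_ids.unsqueeze(0) == col_ids.unsqueeze(1)
    is_meta = (row_ids == -1).unsqueeze(0) | (row_ids == -1).unsqueeze(1)
    return same_row | same_col | is_meta          # (seq_len, seq_len) bool

def masked_self_attention(x, visibility):
    """x: (seq_len, d); visibility: (seq_len, seq_len) bool.
    Pairs that are not structurally related get -inf before the softmax."""
    d = x.size(-1)
    scores = x @ x.transpose(0, 1) / d ** 0.5     # (seq_len, seq_len)
    scores = scores.masked_fill(~visibility, float("-inf"))
    return F.softmax(scores, dim=-1) @ x

# toy input: one caption token (ids -1) plus the four cells of a 2x2 table
row_ids = torch.tensor([-1, 0, 0, 1, 1])
col_ids = torch.tensor([-1, 0, 1, 0, 1])
vis = build_visibility_matrix(row_ids, col_ids)
out = masked_self_attention(torch.randn(5, 16), vis)
```

In a full encoder layer, the same boolean mask would simply be applied (as -inf entries) to the attention scores of every head, on top of the usual learned projections.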
For the second challenge, we first learn embeddings for each entity during pre-training. We then model the relation between entities in the same row or column with the assistance of the visibility matrix. Finally, we propose a Masked Entity Recovery (MER) pre-training objective. The technique randomly masks out entities in a table with the objective of recovering the masked items based on other entities and the table context (e.g., caption/header). This encourages the model to learn factual knowledge from the tables and encode it into entity embeddings. In addition, we utilize the entity mention by keeping it as additional information for a certain percentage of masked entities. This helps our model build connections between words and entities. We also adopt the Masked Language Model (MLM) objective from BERT, which aims to model the complex characteristics of word use in table metadata.
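The corruption step behind MER can be sketched as follows. This is a hypothetical illustration of the procedure described above: the masking ratio, the fraction of masked entities that keep their mention, and the [MASK_E] placeholder are assumed values, not numbers taken from the paper.

```python
# Sketch of the entity-masking step for the MER objective. The ratios and
# the [MASK_E] placeholder id are illustrative assumptions.
import random

MASK_E = "[MASK_E]"

def mask_entities(entity_cells, mask_ratio=0.2, keep_mention_ratio=0.5):
    """entity_cells: list of (entity_id, mention) pairs from one table.
    Returns the corrupted input and the recovery labels."""
    corrupted, labels = [], []
    for entity_id, mention in entity_cells:
        if random.random() < mask_ratio:
            keep_mention = random.random() < keep_mention_ratio
            # mask the entity; optionally keep the mention as a textual clue
            corrupted.append((MASK_E, mention if keep_mention else None))
            labels.append(entity_id)      # the model must recover this id
        else:
            corrupted.append((entity_id, mention))
            labels.append(None)           # not part of the MER loss
    return corrupted, labels

cells = [("Q60", "New York City"), ("Q100", "Boston"), ("Q65", "Los Angeles")]
corrupted, labels = mask_entities(cells)
```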
We pre-train our model on 570K relational tables from Wikipedia to generate contextualized representations for tokens and entities. We evaluate TURL on a benchmark consisting of 6 different tasks for table understanding, in addition to including results for existing datasets when publicly available. Experimental results show that TURL substantially outperforms existing task-specific and shallow Word2Vec based methods.

Our contributions are summarized as follows:

• To the best of our knowledge, TURL is the first framework that introduces the pre-training/fine-tuning paradigm to relational Web tables. The pre-trained representations along with the universal model design save tremendous effort on engineering task-specific features and architectures.
• We propose a structure-aware Transformer encoder to model the structure information in relational tables. We also present a novel Masked Entity Recovery (MER) pre-training objective to learn the semantics as well as the factual knowledge about entities in relational tables.
• To facilitate research in this direction, we present a benchmark that consists of 6 different tasks for table interpretation and augmentation. We show that TURL generalizes well to various tasks and substantially outperforms existing models. Our source code, benchmark, as well as pre-trained models will be available online.

2 PRELIMINARY

We now present our data model and give a formal task definition. In this work, we focus on relational Web tables and are most interested in the factual knowledge about entities. Each table T ∈ 𝒯 is associated with the following: (1) Table caption C, which is a short text description summarizing what the table is about. When the page title or section title of a table is available, we concatenate these with the table caption. (2) Table headers H, which define the table schema. (3) Topic entity e_t, which describes what the table is about and is usually extracted from the table caption or page title. (4) Table cells E containing entities. Each entity cell e ∈ E contains a specific object with a unique identifier. For each cell, we define the entity as e = (e^e, e^m), where e^e is the specific entity linked to the cell and e^m is the entity mention (i.e., the text string). (C, H, e_t) is also known as the table metadata and E is the actual table content.

Table 1: Summary of notations for our table data.

Symbol   Description
T        A relational table T = (C, H, E, e_t)
C        Table caption (a sequence of tokens)
H        Table schema H = {h_0, ..., h_i, ..., h_m}
h_i      A column header (a sequence of tokens)
E        Columns in the table that contain entities
e_t      The topic entity of the table, e_t = (e_t^e, e_t^m)
e        An entity cell e = (e^e, e^m)
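As a reading aid, the data model defined above can be written down as a small Python sketch; the class and field names here are illustrative choices and do not come from the released code.

```python
# Sketch of the data model T = (C, H, E, e_t) from Section 2.
# Class and field names are hypothetical, chosen for readability.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EntityCell:
    entity_id: str   # e^e: unique identifier of the linked entity
    mention: str     # e^m: the text string appearing in the cell

@dataclass
class RelationalTable:
    caption: str                               # C (page/section title concatenated in)
    headers: List[str]                         # H = {h_0, ..., h_m}
    entity_columns: List[List[EntityCell]]     # E: entity cells, grouped per column
    topic_entity: Optional[EntityCell] = None  # e_t

table = RelationalTable(
    caption="Example caption",
    headers=["Team", "Stadium"],
    entity_columns=[[EntityCell("Q1", "Example FC")]],
)
```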