RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation

Nan Tang (QCRI, HBKU, Qatar; [email protected])
Ju Fan∗ (Renmin University of China; [email protected])
Fangyi Li (Renmin University of China; [email protected])
Jianhong Tu (Renmin University of China; [email protected])
Xiaoyong Du (Renmin University of China; [email protected])
Guoliang Li (Tsinghua University, China; [email protected])
Sam Madden (CSAIL, MIT, USA; [email protected])
Mourad Ouzzani (QCRI, HBKU, Qatar; [email protected])

ABSTRACT

Can AI help automate human-easy but computer-hard data preparation tasks that burden data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising autoencoder for tuple-to-X models ("X" could be tuple, token, label, JSON, and so on). RPT is pre-trained for a tuple-to-tuple model by corrupting the input tuple and then learning a model to reconstruct the original tuple. It adopts a Transformer-based neural translation architecture that consists of a bidirectional encoder (similar to BERT) and a left-to-right autoregressive decoder (similar to GPT), leading to a generalization of both BERT and GPT. The pre-trained RPT can already support several common data preparation tasks such as data cleaning, auto-completion and schema matching. Better still, RPT can be fine-tuned on a wide range of data preparation tasks, such as value normalization, data transformation, data annotation, etc. To complement RPT, we also discuss several appealing techniques such as collaborative training and few-shot learning for entity resolution, and few-shot learning and NLP question-answering for information extraction. In addition, we identify a series of research opportunities to advance the field of data preparation.

Figure 1: Motivating Scenarios.

(a) Sample Tasks for Value Filling ([M]: value to fill)
Q1: r1[name, expertise, city] = (Michael Jordan, Machine Learning, [M])    A1: Berkeley
Q2: r3[name, affiliation] = (Michael [M], CSAIL MIT)    A2: Cafarella
Q3: r2[name, expertise, [M]] = (Michael Jordan, Basketball, New York City)    A3: city

(b) A Sample Entity Resolution Task

|    | product   | company   | year | memory | screen     |
|----|-----------|-----------|------|--------|------------|
| e1 | iPhone 10 | Apple     | 2017 | 64GB   | 5.8 inchs  |
| e2 | iPhone X  | Apple Inc | 2017 | 256GB  | 5.8-inch   |
| e3 | iPhone 11 | AAPL      | 2019 | 128GB  | 6.1 inches |

(c) A Sample Information Extraction Task (s1: example, t1: task)

|    | type     | description                                                                                             | label |
|----|----------|---------------------------------------------------------------------------------------------------------|-------|
| s1 | notebook | 2.3GHz 8-Core, 1TB Storage, 8GB memory, 16-inch Retina display                                            | 8GB   |
| t1 | phone    | 6.10-inch touchscreen, a resolution of 828x1792 pixels, A14 Bionic processor, and come with 4GB of RAM    | (to extract) |

PVLDB Reference Format:
Nan Tang, Ju Fan, Fangyi Li, Jianhong Tu, Xiaoyong Du, Guoliang Li, Sam Madden, and Mourad Ouzzani. RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation. PVLDB, 14(8): 1254 - 1261, 2021.
doi:10.14778/3457390.3457391

∗ Ju Fan is the corresponding author.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 8 ISSN 2150-8097.
doi:10.14778/3457390.3457391

1 INTRODUCTION

Data preparation — including data cleaning [1], data transformation [31], entity resolution [24], information extraction [10], and so forth — is the most time-consuming and least enjoyable work for data scientists [18]. Next, we present several scenarios to better understand these problems.

Scenario 1: Data Cleaning. Figure 1(a) Q1 and Q2 show two typical data cleaning problems. (i) Cell Filling: Question Q1 asks for the city of the "Michael Jordan" whose expertise is "Machine Learning". (ii) Value Filling: Q2 asks for the last name of someone who works at "CSAIL MIT" with the first name "Michael". Answering Q1 can help solve a series of problems such as error detection, data repairing, and missing value imputation; and answering Q2 can help auto-completion (e.g., give the answer A2 "Cafarella") and auto-suggestion (e.g., provide a list of candidate names such as {Cafarella, Stonebraker}).

Scenario 2: Attribute Filling for Schema Matching. Figure 1(a) Q3 asks for the attribute name for the value "New York City", w.r.t. name "Michael Jordan" and expertise "Basketball". Answering this question can help schema matching, a core data integration problem [21], by better aligning attributes from different tables.

Scenario 3: Entity Resolution (ER). Figure 1(b) shows a typical ER task that asks whether e1, e2 and e3 are the "same". A human with enough knowledge can tell that "iPhone 10" = "iPhone X" ≠ "iPhone 11", "Apple" = "Apple Inc" = "AAPL", and "inches" = "-inch". Hence, one can decide that e1 and e2 do not match e3, and e1 matches e2 (if the memory does not matter).

Scenario 4: Information Extraction (IE). Figure 1(c) shows an IE task, which is typically done via crowdsourcing [39].

A requester provides several samples (e.g., s1) that show what "label" should be extracted, and asks workers to perform similar tasks (e.g., t1). A crowd worker first needs to interpret the task by analyzing s1 (and maybe a few more examples) and concretize it as "what is the memory size". Afterwards, he can perform t1 by extracting the label 4GB from t1[description], knowing that "RAM" refers to memory.

Challenges. Scenarios (1–4) are simple for humans, but are hard for computers. To solve them, computers face the following challenges. (1) Knowledge: computers need to acquire background knowledge through understanding an enormous corpus of tables. (2) Experience: computers should be able to learn from prior and various tasks. (3) Adaptation: computers should (quickly) adjust to new inputs and new tasks.

Vision. Indeed, these problems have been seen as 'holy grail' problems for the database community for decades [1, 21, 25, 30, 36, 46], but despite thousands of papers on these topics, they still remain unsolved. Recent evidence from the NLP community, where DL-based models and representations have been shown to perform nearly as well as humans on various language understanding and question answering tasks, suggests that a learned approach may be a viable option for these data preparation tasks as well.

Desiderata. The desiderata for AI-powered tools to achieve near-human intelligence for data preparation are summarized below, in response to Challenges (1–3), respectively. (1) Deep learning architecture and self-supervised pre-training. We need a deep learning architecture that can learn from many tables, similar to language models that can learn from large text corpora. This simulates how humans gain knowledge. Moreover, it should be pre-trained without human provided labels, i.e., self-supervision. (2) Transfer learning. The model should be able to obtain knowledge from different tasks on different datasets. (3) Fine-tuning and few-shot learning. The pre-trained model should allow customization on different downstream applications through fine-tuning. In addition, it should be able to understand a new task from a few examples.

RPT Is *Almost* All You Need. The design of the Relational Pre-trained Transformer (RPT) is inspired by recent successes of DL models in NLP. The fundamental questions for relational data understanding on data preparation are: (1) what is the architecture? and (2) what is the surrogate task for pre-training?

(1) RPT Architecture. Typical choices are encoder-only such as BERT [20], decoder-only such as GPT-3 [8], or encoder-decoder such as BART [44] and T5 [55]. In fact, the encoder-decoder architecture can be considered as a generalization of the encoder-only model (e.g., BERT) and the decoder-only model (e.g., GPT-3). Recent studies from BART and T5 found that encoder-decoder models generally outperform encoder-only or decoder-only language models. Thus, a Transformer-based [64] encoder-decoder model provides more flexibility, can be adapted to a wide range of data preparation tasks, and hence is used by RPT.

(2) RPT Pre-training. There have been several works on pre-training using tables, such as TaPas [35], TURL [19], TaBERT [72] and TabFact [13]. However, since most data preparation tasks are at the granularity of tuples, instead of entire tables, we posit that training RPT tuple-by-tuple is more desirable. For the pre-training objectives, most recent studies confirm that fill-in-the-blank style denoising objectives (where the model is trained to recover missing pieces in the input) work best; examples include BERT [20], BART [44], and T5 [55] for NLP, and TURL [19] for relational tables.

Contributions. We make the following notable contributions.
• RPT: We describe a standard Transformer-based denoising autoencoder architecture to pre-train sequence-to-sequence models for tuple-to-tuple training, with new tuple-aware masking mechanisms. (Section 2)
• Fine-tuning RPT: We discuss a wide range of data preparation tasks that can be supported by fine-tuning RPT. (Section 3)
• Beyond RPT: We discuss several appealing techniques that can complement RPT in specific data preparation tasks, e.g., collaborative training and few-shot learning for ER, and few-shot learning and NLP question-answering for IE. (Section 4)

2 RPT

2.1 Architecture

RPT uses a standard sequence-to-sequence (or encoder-decoder) Transformer [64] model, similar to BART [44], as shown in Figure 2.

Encoder. RPT uses a bidirectional encoder (similar to BERT [20]) because it has the advantage of learning to predict the corrupted data bidirectionally, from both the context on the left and the context on the right of the corrupted data. Moreover, it is Transformer-based, which can use self-attention to generate a richer representation of each input token. Hence, a Transformer-based bidirectional encoder is a natural fit for reading tuples where, by definition, the ordering of (attribute name, attribute value) pairs is irrelevant.

Decoder. RPT uses a left-to-right autoregressive decoder (similar to GPT-3 [8]).

2.2 Pre-training RPT

RPT is pre-trained on tuples, for which we just need to corrupt tuples and then optimize a reconstruction loss – the cross-entropy between the model output and the original tuple.

Tuple Tokenization. We represent each tuple as a concatenation of its attribute names and values. For example, tuple r1 in Figure 1(a) can be tokenized as:

name Michael Jordan expertise Machine Learning city Berkeley

Token Embeddings. Because there is a clear semantic difference between attribute names and values, we can add special tokens, [A] before an attribute name and [V] before an attribute value. Token embeddings are widely used in NLP tasks, such as the [CLS] (indicating the start) and [SEP] (indicating the next sentence) tokens used by BERT [20]. Hence, we can get a sequence of r1 with richer tuple-aware semantics as:

[A] name [V] Michael Jordan [A] expertise [V] Machine Learning [A] city [V] Berkeley

Positional and Column Embeddings. We can add additional meta-data such as positional embeddings (i.e., indicating the token's position in the sequence) and segment embeddings (e.g., adding the same segment embedding to the tokens belonging to the same attribute value), which are inspired by TaPas [35] for table parsing.
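To make the tuple-to-sequence step concrete, the following is a minimal sketch (in Python, not the authors' code) of the [A]/[V] serialization described above; `serialize_tuple` is a hypothetical helper name.

```python
# A minimal sketch of the tuple tokenization above: linearize a tuple
# into the "[A] attribute [V] value" sequence that is fed to RPT.

def serialize_tuple(tuple_dict):
    """Concatenate attribute names and values, marking each attribute
    name with [A] and each attribute value with [V]."""
    return " ".join(f"[A] {attr} [V] {value}" for attr, value in tuple_dict.items())

r1 = {"name": "Michael Jordan", "expertise": "Machine Learning", "city": "Berkeley"}
print(serialize_tuple(r1))
# -> [A] name [V] Michael Jordan [A] expertise [V] Machine Learning [A] city [V] Berkeley
```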

Figure 2: The RPT Architecture. (Input with token embeddings: [A] name [V] Michael Jordan [A] expertise [V] [M] [A] city [V] Berkeley → (Bidirectional) Transformer-based Encoder → intermediate representation h → Autoregressive Decoder, which reconstructs "Machine Learning" from [M] → Output: [A] name [V] Michael Jordan [A] expertise [V] Machine Learning [A] city [V] Berkeley.)

Working Mechanism. Given an input sequence of a tuple with some value to be masked out (e.g., "Machine Learning" in Figure 2) represented by a "mask token" [M], along with rich semantic information (e.g., an attribute [A] or a value [V], which position, and which column), the bidirectional encoder will look at the information before and after the masked token [M], learn which tokens to pay attention to (using Transformer [64]), and generate an intermediate vector representation h. The autoregressive decoder will take h as input, and generate an output sequence, by denoising the masked input [M] to be "Machine Learning". By doing so, we can train RPT in an unsupervised fashion, without any human labels.

One difficulty is to predict how many tokens are masked by one [M]. Note that BERT [20] masks each token with one [M]; i.e., "Machine Learning" will be masked as [M] [M], which tells explicitly how many tokens are missing. We cannot do the same (masking each token with one [M]), because during prediction, we do not know how many tokens are missing. The ability to teach the model to predict the length of the missing tokens masked by one single mask token [M] can be achieved by text infilling [40], which will be discussed next.

Token Masking. (1) Attribute Name Masking: We randomly select attribute names to mask, e.g., name.
(2) Entire Attribute Value Masking: We randomly select entire attribute values to mask, e.g., "Machine Learning" is masked with one [M] (see Figure 2), which forces RPT to first predict the number of masked tokens and then predict the tokens.
(3) Single Attribute Value Masking: We randomly select a single attribute value token (i.e., one token) to mask, e.g., "Jordan".
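The three masking strategies can be illustrated with a small sketch (again not the authors' code; `corrupt_tuple` and its strategy names are hypothetical):

```python
# A minimal sketch of the three tuple-aware masking strategies above,
# applied to the [A]/[V] serialization of a tuple.
import random

def corrupt_tuple(attrs, strategy, seed=0):
    """Serialize a tuple and corrupt it with one of three strategies:
    'attr_name'   - replace a randomly chosen attribute name with [M];
    'whole_value' - replace an entire attribute value with a single [M],
                    hiding how many tokens are missing (text infilling);
    'one_token'   - replace a single randomly chosen value token with [M]."""
    rng = random.Random(seed)
    target = rng.choice(list(attrs))
    parts = []
    for attr, value in attrs.items():
        if attr == target:
            if strategy == "attr_name":
                attr = "[M]"
            elif strategy == "whole_value":
                value = "[M]"
            elif strategy == "one_token":
                tokens = value.split()
                tokens[rng.randrange(len(tokens))] = "[M]"
                value = " ".join(tokens)
        parts.append(f"[A] {attr} [V] {value}")
    return " ".join(parts)

r1 = {"name": "Michael Jordan", "expertise": "Machine Learning", "city": "Berkeley"}
print(corrupt_tuple(r1, "whole_value"))  # masks one randomly chosen value, e.g.,
# [A] name [V] Michael Jordan [A] expertise [V] [M] [A] city [V] Berkeley
```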
Note that one possible optimization to the above masking process is as follows. Instead of giving RPT the full freedom to learn how input tokens attend to each other in the form of an attention matrix [64], we add some explicit rules. For example, (i) an attribute name (e.g., name) can only attend to the other attribute names (e.g., expertise and city) and the tokens of its associated attribute value (e.g., "Michael" and "Jordan"), but not other attribute values (e.g., "Berkeley"); and (ii) a token of an attribute value (e.g., "Berkeley") can only attend to all attribute values of all attributes and its own attribute name (i.e., city), but not other attribute names (e.g., name). TURL [19] also uses this technique, called a visibility matrix.

2.3 Related Work

Data Cleaning. We categorize prior work on data cleaning into three categories. (i) Only examine the data at hand: this includes integrity constraint based methods (FDs [6], their extensions CFDs [26] and PFDs [54], and denial constraints [16]), rule-based methods [33, 66], and probabilistic methods (e.g., HoloClean [57]). They need enough signals or data redundancy from the dataset at hand. Supervised ML based methods (e.g., GDR [70], SCARE [69], Raha [49] and Baran [48]) learn only from the data at hand, which cannot be generalized to other datasets. (ii) Use external reliable sources: this includes the use of master data [28, 37] or knowledge bases [17, 32]. These methods require experts to define or confirm the matching between the data at hand and the external source, e.g., matching rules for table-table matching [28] or graphical patterns for table-graph matching [17]. (iii) Human- or crowd-in-the-loop: when neither type (i) nor type (ii) solutions work, a last resort is to fall back on humans to clean the dataset. Intuitively, with enough signals, data redundancy, reliable external sources with sufficient coverage, and the availability of experts to bootstrap and tune the process, we can leverage (i) and (ii) style solutions. Unfortunately, this is usually not the case in practice [1] — cleaning is frequently of type (iii) with its high human cost. Automating type (iii) solutions is the main motivation of RPT.

Knowledge Bases. There are two ways to encode knowledge: "explicit knowledge" such as knowledge graphs, or "implicit knowledge" memorized by DL models. In practice, both explicit knowledge graphs and implicit pre-trained DL models have been widely studied in industry and academia. Both directions are important, and RPT belongs to the latter. One drawback is that it is hard to explain, for which explainable AI techniques [22, 42, 58] will play an important role in grounding RPT-like tools.

Relational Table Understanding. TaBERT [72], TaPas [35] and TabFact [13] study question-answering tasks that involve joint reasoning over both free text and structured tables. They take a natural language utterance as input and produce a structured query (e.g., an SQL query in TaBERT or aggregations in TaPas [35]), or a classification result (e.g., support or refute in TabFact). They focus on learning a joint representation over textual utterances and tabular data with Transformer models, and on designing various pre-training tasks to this end.

Closer to this work is TURL [19]. However, RPT differs from TURL in two aspects: (i) TURL employs an encoder-only architecture for learned representations, instead of generating a complicated output, e.g., a tuple or a table; the additional decoder architecture of RPT provides the flexibility of generating sequences in multiple forms. (ii) TURL has to be used with a KB to extract values. For example, for cell infilling, TURL uses a pre-trained model (1.2GB) to generate a representation, which has to be linked to the KB (i.e., a collection of web tables, 4.6GB) to get the actual value. RPT (1.6GB in our experiment) does not need such a KB to fill missing values.

Limitations. RPT faces similar limitations that pre-trained language models (LMs) face. (1) Numeric values: numeric values are usually mapped into unknown tokens, causing the model to fail on tasks that require precise prediction of numeric values. (2) Max sequence length: restricted by the GPU memory size, most pre-trained LMs are limited by the sequence length; thus data preparation on wide tables may require additional optimization. (3) Not fully reliable: similar to GPT-3, a generative model cannot be fully trusted. One way to combat this is to treat it as an AI-assistant with a human-in-the-loop, as discussed in Section 2.4 Opportunities (O3).

2.4 Opportunities

RPT naturally supports several common data preparation tasks, e.g., error detection, data repairing, auto-completion, auto-suggestion, and schema matching. Yet there are also many opportunities.

(O1) Hybrid Solutions. While RPT does not differentiate between categorical or numeric data during pre-training, it works better for categorical data (i.e., human-easy). A promising direction is to combine RPT with other (quantitative) data cleaning methods [53] from the rich set of type (i) and (ii) data cleaning solutions (Section 2.3).

(O2) Dirty Data. Many tables are dirty. Pre-training RPT on these dirty tables may yield a biased result. Currently, we learn directly from dirty tables, by assuming that the frequency of correct values is higher than the frequency of wrong values. There are several open problems. First, we would like to provide some guarantee of model robustness while still learning from dirty data. Second, a cleaned version of training data that can be used as a benchmark is highly desired, similar to the Colossal Clean Crawled Corpus (C4) for the Text-To-Text-Transfer-Transformer (T5) [55].

(O3) An AI-assisted Tool with Human-in-the-loop. Achieving high accuracy in diverse data preparation tasks and domains is still a challenge for RPT; it would require substantial in-domain training data. Hence, a practical usage of RPT is to use it as an AI-assisted tool that can suggest meaningful results in many human-in-the-loop tasks, which can guide users and thus reduce human cost.

2.5 Preliminary Result

We have conducted preliminary experiments to show that RPT can reconstruct the masked token(s) in tuples. Our baseline is BART [44], which is pre-trained with a large corpus of text, including from the product domain. Because BART and RPT have the same architecture (Figure 2), we can use the parameters pre-trained by BART, instead of a random initialization. We used tables about products, including Abt-Buy [2] and Walmart-Amazon [65]. Note that these two tables are naturally dirty.

For testing, we used Amazon-Google [3], tables that were not seen by BART or RPT. We masked attribute values and asked BART and RPT to predict the original values. Table 1 shows some results, where [M] means that the value is masked out, column Truth is the ground truth, and columns RPT-C and BART provide the results predicted by each, respectively. Tuples 1-2 involve predicting missing prices, where RPT gives close predictions but BART does not. Tuples 3-4 involve predicting missing manufacturers; RPT-C provides good predictions. Tuple 5 involves predicting a missing title, and RPT provides a partially correct prediction.

Table 1: Compare RPT with BART ([M]: the masked value; Truth: the ground truth).

| title                                 | manufacturer         | price  | Truth                        | RPT-C          | BART   |
|---------------------------------------|----------------------|--------|------------------------------|----------------|--------|
| instant home design (jewel case)      | topics entertainment | [M]    | 9.99                         | 9              | Topics |
| disney's 1st & 2nd grade bundle ...   | disney               | [M]    | 14.99                        | 19             | Dis    |
| adobe after effects professional 6.5 ... | [M]               | 499.99 | adobe                        | adobe          | $1.99  |
| stomp inc recover lost data 2005      | [M]                  | 39.95  | stomp inc                    | stomp          | 39.95  |
| [M]                                   | write brothers       | 269.99 | write brothers dramatica ... | write brothers | 1.99   |

This preliminary experiment shows that RPT pre-trained on tables can learn structural data values from tables better than directly using a pre-trained language model (e.g., BART), which is not customized for relational data. The main reason is that, by pre-training on the product tables, RPT can better learn the dependency among columns, and thus is more capable of predicting missing values. Of course, RPT sometimes makes wrong predictions, but for those cases, BART also fails. Our belief is that these preliminary results are suggestive enough of the effectiveness of the approach that it merits significant additional investigation.

3 FINE-TUNING RPT

The encoder-decoder architecture of RPT (pre-trained on tuple-to-tuple) provides the flexibility to be fine-tuned for different downstream data preparation tasks (i.e., tuple-to-X).

Value Normalization. Because RPT has an autoregressive decoder, it can be directly fine-tuned for sequence generation tasks such as value normalization (e.g., "Mike Jordan, 9 ST, Berkeley" → "Mike Jordan, 9th Street, Berkeley"; see the sketch at the end of this section). The encoder takes the input value as a sequence and the decoder generates the output autoregressively. In addition, normalizing "Mike" to "Michael" or "Sam" to "Samuel" can be fine-tuned as a neural name translation [63] task.

Data Transformation. Similarly to what is described above, RPT can be fine-tuned for transformation of data from one format (e.g., a tuple) to another format (e.g., JSON or XML), where the decoder will autoregressively serialize the output in the target format.

Data Annotation. Given a tuple, data annotation requires adding a label (e.g., a classification task). We can use the final hidden state of the final decoder token to fine-tune a multi-class linear classifier.

Information Extraction (IE). Given a tuple, IE extracts a span or multiple spans of relevant text, which can be done by fine-tuning the decoder to produce the (start, end) pairs of spans.

Learned Tuple Representation for Entity Resolution. The embeddings of entities have been used in entity resolution for both blocking [23] and entity matching [23, 50]. A typical trick is to do cross-tuple training (or contrastive learning [12]) via Siamese NNs [14], such that similar entities have similar embeddings. Similarly, the encoder of RPT can be fine-tuned in Siamese NNs for learned representations w.r.t. entity resolution.
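To illustrate the Value Normalization fine-tuning above, here is a minimal sketch that uses a generic BART checkpoint from the Hugging Face `transformers` library as a stand-in for a pre-trained RPT model (no RPT checkpoint is assumed to be available); the model name, training pair, and hyper-parameters are illustrative assumptions, not the paper's setup.

```python
# A minimal sketch of seq2seq fine-tuning for value normalization,
# with a generic BART checkpoint standing in for pre-trained RPT.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.train()

# (raw value, normalized value) pairs; in practice these come from labeled data.
pairs = [("Mike Jordan, 9 ST, Berkeley", "Mike Jordan, 9th Street, Berkeley")]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
for src, tgt in pairs:
    batch = tokenizer(src, return_tensors="pt")
    labels = tokenizer(tgt, return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss  # cross-entropy against the target
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# After fine-tuning, the decoder generates the normalized value autoregressively.
model.eval()
inputs = tokenizer("Mike Jordan, 9 ST, Berkeley", return_tensors="pt")
outputs = model.generate(**inputs, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Data transformation and data annotation follow the same pattern, swapping the target sequence for a serialized JSON/XML record or a class label, respectively.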

Figure 3: Collaborative Learning and Few-shot Learning for Entity Resolution. (Records A and B pass through a Blocker (off-the-shelf tools) into blocks P1, ..., Pm of candidate pairs (a, b); a pre-trained Matcher groups them into clusters C1, ..., Cn; and a Consolidator merges each cluster into golden entities g1, ..., gn. Few-shot examples E1, e.g., (iPhone 12 black, iPhone 12 red): True and (iPhone 11, iPhone 12): False, guide the Matcher; active learning resolves conflicting examples E2, e.g., (a1, b1): True, (a2, b1): True, (a2, b2): True, (a1, b1): False; and examples E3, e.g., (iPhone 9, iPhone 10) -> iPhone 10 and (iPhone 10, iPhone 12) -> iPhone 12, guide the Consolidator.)

4 BEYOND RPT

In this section, we explore other, but related, techniques that can help on specific tasks.

4.1 Entity Resolution

Given two sets of entities, A and B, an end-to-end entity resolution pipeline (Figure 3) is: (1) find duplicated entity pairs (a ∈ A, b ∈ B) (blocking to improve efficiency); (2) merge them into clusters, typically through transitive closure; and (3) consolidate each cluster into one entity.

Blocking. There is a rich literature on automatic blocking for ER (see [51] for a survey). There are also DL-based methods [23, 45, 67] to generate blocks. These prior works are automatic and already work well, hence will not be covered in this paper.

Matcher. The state-of-the-art matchers are all ML based, e.g., random forests (e.g., Magellan [43]), or DL based (e.g., DeepMatcher [50] and DeepER [23]). Recent works [9, 45] also study how to leverage pre-trained LM models for generating entity representations.

Consolidator. There are rule-based methods [27] and learning-based approaches [34] for entity consolidation – both need either significant human involvement or a large amount of training data.

Vision and Opportunities. The Matcher and Consolidator should be able to perform effectively through pre-trained models (i.e., to obtain knowledge) and a few examples (i.e., to interpret the task). However, this pipeline cannot be fully automated, because some judgments are objective, e.g., "iPhone 10", "iPhone ten", and "iPhone X" are the same, while some others are subjective, e.g., whether "iPhone 12 red" matches "iPhone 12 black" is user dependent. Our intuition is that the objective criteria can be pre-trained (such as "iPhone 10" matches "iPhone X" and "Mike" matches "Michael"), but the subjective criteria need task-specific samples, for both the Matcher and the Consolidator (Figure 3), which could be achieved by getting a few examples from humans. We identify two major opportunities for the entire ER pipeline.

(O1) Collaborative Learning or Federated Learning (FL) [7, 71]. This is to learn the "objective" criteria for the Matcher. Note that there are many public and private ER benchmarks, which share common domains. It is promising to collaboratively train one Matcher, and the knowledge can be learned and transferred from one dataset to another dataset. Better still, this can be done securely [47], without data sharing. Note that there have been transfer learning techniques on ER [41, 61] that show early success in this direction. We believe that we should build a platform collaboratively for ER, with a pre-trained model M for each domain. Anyone who wants to benefit from M can download M, retrain using his/her data to get a model M1, and send back an update of parameters ∆1 = M1 − M; the platform will then merge the model updates with M, from multiple users [7]. Because different entities may have different schemas, we use a pre-trained model such as BERT to be schema-agnostic.
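A minimal sketch of the parameter-update merge in (O1) follows; averaging the user deltas (federated-averaging style [7]) is our illustrative assumption, since the paper does not fix a particular merge operator.

```python
# A minimal sketch of merging user updates into the shared model M:
# each user i contributes a delta against M, and the platform applies
# the average of the deltas. Assumes floating-point parameters only.
import torch

def merge_deltas(shared_state, user_states):
    """shared_state and user_states are state_dicts with identical keys."""
    merged = {}
    for key, base in shared_state.items():
        deltas = [user[key] - base for user in user_states]   # Δi = Mi − M
        merged[key] = base + torch.stack(deltas).mean(dim=0)  # M ← M + mean_i(Δi)
    return merged

# Usage with any torch modules of identical architecture:
# new_state = merge_deltas(M.state_dict(), [M1.state_dict(), M2.state_dict()])
# M.load_state_dict(new_state)
```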
(O2) Few-shot Learning. This is to learn the "subjective" criteria, for the Matcher and Consolidator, through a human-in-the-loop approach. The goal is to infer a better specified task from a few examples, e.g., using Pattern-Exploiting Training [59].

[Matcher.] Consider E1 in Fig. 3 that contains two user provided examples, from which we want to automatically generate a clearer task for workers, e.g., "color does not matter but model matters". We can design two templates like (T1) "True: if a and b have the same [M]1" and (T2) "False: if a and b have different [M]2". By plugging the first (matching) pair in E1 into template (T1), we can infer a pattern "model" or "series" (but not "color") for [M]1. Similarly, by plugging the second (un-matching) pair in E1 into template (T2), we can infer a pattern "model" or "series" for [M]2.

Moreover, when merging matching entities into clusters based on transitive closure, conflicts may be automatically detected within clusters (e.g., E2 in Fig. 3); such conflicts can be resolved by the users through active learning. Note that doing active learning from conflicting predictions is different from traditional active learning methods on ER that use confusing/informative entity pairs [4, 15].

[Consolidator.] Consider E3 with two examples, "iPhone 10 is more preferred than iPhone 9", and "iPhone 12 is more preferred than iPhone 10". We can use them to make the task clearer by asking questions "iPhone 10 is [M] than iPhone 9" and "iPhone 12 is [M] than iPhone 10", and enforcing a language model to fill the two masked tokens with the same value, which might be "newer" (see the sketch below).

Another powerful method related to few-shot learning is meta-learning (or learning to learn fast) [62], with the main goal to learn new concepts and skills fast with a few examples.
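The cloze-style templates above can be tried directly with an off-the-shelf masked language model; a minimal sketch (not the authors' code) follows, where the choice of roberta-base is an illustrative assumption.

```python
# A minimal sketch of filling the [M] slot of a cloze-style template with a
# masked LM, in the spirit of Pattern-Exploiting Training [59].
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")

# RoBERTa's mask token is <mask>; we expect a comparative such as "newer".
for template in ("iPhone 10 is <mask> than iPhone 9",
                 "iPhone 12 is <mask> than iPhone 10"):
    candidates = fill(template, top_k=3)
    print(template, "->", [c["token_str"].strip() for c in candidates])
```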

Preliminary Results. We have conducted some preliminary experiments on the "product" domain for the Matcher, because this domain has a rich collection of ER benchmarks, and its cases are known to be hard because they are text-rich. Specifically, we use five well-known benchmarks: (D1) Abt-Buy [2], (D2) Amazon-Google [3], (D3) Walmart-Amazon [65], (D4) iTunes-Amazon [38], and (D5) the SIGMOD 2020 programming contest [60] (we took 1000/8000 matching/unmatching pairs).

Specifically, when testing on D1, we train with D2–D5, and when testing on D2, we train with D1 and D3–D5. We only tested on D1 and D2, so we can directly compare with ZeroER [68] (without any training data) and DeepMatcher [50] (trained with hundreds/thousands of examples), where the F1 scores and the numbers of labels are reported from their original papers. We also compare with Ditto [45]. We progressively add labeled data to fine-tune Ditto until its F1 is very close to that of collaborative training (CT).

We also note that the paper [9] is quite similar to Ditto. First, both studies use the same strategy to model a record pair as a sequence and fine-tune a pre-trained model to output the matching results. Second, [9] compares different pre-trained LM models for entity resolution and reports that RoBERTa achieves the best performance. Similarly, the latest version of Ditto [45] also uses RoBERTa as its pre-trained model. Third, both studies achieve similar results in the experiments. For example, on the Abt-Buy dataset, [9] and Ditto achieve 0.91 and 0.90 F1 scores, respectively.

Table 2: Comparison with the state of the art.

| Method                      | Abt-Buy F1 | Abt-Buy # labels | Amazon-Google F1 | Amazon-Google # labels |
|-----------------------------|------------|------------------|------------------|------------------------|
| Collaborative Training (CT) | 0.72       | 0                | 0.53             | 0                      |
| ZeroER                      | 0.52       | 0                | 0.48             | 0                      |
| DeepMatcher                 | 0.63       | 7689             | 0.69             | 9167                   |
| Ditto                       | 0.71       | 1500             | 0.50             | 1000                   |

Table 2 shows that CT outperforms ZeroER and is comparable with DeepMatcher, which was trained with 1000+ examples. Moreover, CT uses zero examples from the test ER dataset to achieve the performance of Ditto, which is trained by fine-tuning a pre-trained model with 1000+ examples. This result verifies the opportunity (O1): it is promising to collaboratively train a Matcher to decide whether two entities (even in different schemas) match or not.

4.2 Information Extraction

Information Extraction (IE) is the process of retrieving specific information from unstructured (e.g., text) or (semi-)structured (e.g., relational) data. We consider a simple IE problem: given a text or text-rich tuple t, extract a span (i.e., a continuous sequence of tokens) from t, denoted by I(t). See Figure 1(c) for a sample IE task. Although simple, the above definition is general enough to cover a wide range of IE problems.

Connection with Question-Answering in NLP. There is a natural connection between the IE problem and a typical question-answering problem in NLP. For question answering, there is an input question such as "Q: where do water droplets collide with ice crystals to form precipitation", and an input paragraph "P: ... Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. ...". The task is to find a span, e.g., "within a cloud", of paragraph P to answer the question Q. There are many NLP question-answering benchmarks, such as SQuAD [56], where AI has outperformed human performance [73].

Figure 4: Connecting IE with NLP Question-Answering. (A bidirectional Transformer-based encoder (BERT, fine-tuned with question-answering tasks) takes [CLS] Token 1 ... Token N [SEP] Token 1 ... Token M, i.e., the Question (Q) followed by the Paragraph (P), and outputs a start/end span. Tuple tokenization turns the tuple, e.g., t1 in Figure 1(c), into P; question generation turns the example s1 into Q, e.g., "What is the memory size?".)

As shown in Figure 4, given a query Q and a paragraph P, a pre-trained model fine-tuned using question-answering benchmarks can provide a span, i.e., (start, end) positions, as output. We can tokenize a tuple t as a paragraph P (Section 2). The remaining problem is to generate the question Q. We can have a question template such as "what is the [M]", where the [M] can be instantiated with one-shot learning (e.g., from the label of s1) via, e.g., PET [59], which gives "what is the memory size" as the question Q; a sketch follows below.
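Here is a minimal sketch of this tuple-as-paragraph question answering, using an off-the-shelf extractive QA model from the Hugging Face `transformers` library; the model choice is an illustrative assumption, not the paper's setup.

```python
# A minimal sketch of casting tuple-based IE as extractive question answering:
# the serialized tuple is the paragraph P, and the generated question is Q.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

# Tuple t1 of Figure 1(c), tokenized into a paragraph P (Section 2).
paragraph = ("type phone description 6.10-inch touchscreen, a resolution of "
             "828x1792 pixels, A14 Bionic processor, and come with 4GB of RAM")

# Question Q instantiated from the template "what is the [M]".
result = qa(question="What is the memory size?", context=paragraph)
print(result["answer"], (result["start"], result["end"]))  # e.g., "4GB" and its span
```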
Opportunities. (O1) Connect more DB-related IE tasks to well-studied NLP tasks, so as to obtain pre-trained knowledge (e.g., NeurON [5] uses a seq-to-seq model to extract tuples from question-answer pairs). (O2) Currently, many IE tasks are performed by crowd workers (or crowd-in-the-loop). Instead of fully replacing these crowd workers, we are studying how to train multiple RPT-I models as AI-workers, and mix the AI-workers and crowd workers to reduce the total cost of a crowdsourcing task.

5 CALL TO ARMS

We have presented our vision and concrete steps for democratizing data preparation: RPT, fine-tuning RPT, and other appealing techniques. Several recent successes (e.g., Termite [29], EmbDI [11], TURL [19], Ditto [45] and NeurON [5]) have shed some light on this direction. Our preliminary results, along with these related papers, suggest that learning-based approaches have the potential to outperform more traditional methods, much as they have revolutionized NLP. However, the data preparation field is vast, the problems are diverse, and much work remains to be done. In particular, a major obstacle to advancing all the above topics is the limited availability of real-world benchmarks, e.g., C4 for T5 [55]. Now is the time for the data preparation and larger database communities to come together to explore the potential of these new techniques.

ACKNOWLEDGMENTS

This work was partly supported by the National Key Research and Development Program of China (2020YFB2104101), NSF of China (61632016, 61925205, 62072461, U1911203), Huawei, TAL Education, and the National Research Center for Information Science and Technology.

REFERENCES
[1] Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, and Nan Tang. 2016. Detecting Data Errors: Where are we and what needs to be done? Proc. VLDB Endow. 9, 12 (2016), 993–1004.
[2] Abt-Buy. [n.d.]. https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md#abt-buy.
[3] Amazon-Google. [n.d.]. https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md#amazon-google.
[4] Kedar Bellare, Suresh Iyengar, Aditya G. Parameswaran, and Vibhor Rastogi. 2013. Active Sampling for Entity Matching with Guarantees. ACM Trans. Knowl. Discov. Data 7, 3 (2013), 12:1–12:24.
[5] Nikita Bhutani, Yoshihiko Suhara, Wang-Chiew Tan, Alon Y. Halevy, and H. V. Jagadish. 2019. Open Information Extraction from Question-Answer Pairs. In NAACL-HLT, Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). 2294–2305.
[6] Philip Bohannon, Michael Flaster, Wenfei Fan, and Rajeev Rastogi. 2005. A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification. In SIGMOD, Fatma Özcan (Ed.). 143–154.
[7] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konecný, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. 2019. Towards Federated Learning at Scale: System Design. In MLSys.
[8] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In NeurIPS.
[9] Ursin Brunner and Kurt Stockinger. 2020. Entity Matching with Transformer Architectures - A Step Forward in Data Integration. In EDBT. 463–473.
[10] Michael J. Cafarella, Jayant Madhavan, and Alon Y. Halevy. 2008. Web-scale extraction of structured data. SIGMOD Rec. 37, 4 (2008), 55–61.
[11] Riccardo Cappuzzo, Paolo Papotti, and Saravanan Thirumuruganathan. 2020. Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. In SIGMOD. 1335–1349.
[12] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. 2020. A Simple Framework for Contrastive Learning of Visual Representations. In ICML.
[13] Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2019. TabFact: A Large-scale Dataset for Table-based Fact Verification. CoRR abs/1909.02164 (2019).
[14] Davide Chicco. 2021. Siamese Neural Networks: An Overview. In Artificial Neural Networks - Third Edition. Methods in Molecular Biology, Vol. 2190. 73–94.
[15] Victor Christen, Peter Christen, and Erhard Rahm. 2019. Informativeness-Based Active Learning for Entity Resolution. In ECML PKDD, Peggy Cellier and Kurt Driessens (Eds.), Vol. 1168. 125–141.
[16] Xu Chu, Ihab F. Ilyas, and Paolo Papotti. 2013. Discovering Denial Constraints. Proc. VLDB Endow. 6, 13 (2013), 1498–1509.
[17] Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, and Yin Ye. 2015. KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. In SIGMOD. 1247–1261.
[18] Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang. 2017. The Data Civilizer System. In CIDR.
[19] Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. 2020. TURL: Table Understanding through Representation Learning. Proc. VLDB Endow. (2020).
[20] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. [n.d.]. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). 4171–4186.
[21] AnHai Doan, Alon Y. Halevy, and Zachary G. Ives. 2012. Principles of Data Integration. Morgan Kaufmann.
[22] Finale Doshi-Velez and Been Kim. 2017. A Roadmap for a Rigorous Science of Interpretability. CoRR abs/1702.08608 (2017).
[23] Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq R. Joty, Mourad Ouzzani, and Nan Tang. 2018. Distributed Representations of Tuples for Entity Resolution. Proc. VLDB Endow. 11, 11 (2018), 1454–1467.
[24] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng. 19, 1 (2007), 1–16.
[25] Wenfei Fan and Floris Geerts. 2012. Foundations of Data Quality Management. Morgan & Claypool Publishers.
[26] Wenfei Fan, Floris Geerts, Xibei Jia, and Anastasios Kementsietsidis. 2008. Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst. 33, 2 (2008), 6:1–6:48.
[27] Wenfei Fan, Floris Geerts, Nan Tang, and Wenyuan Yu. 2013. Inferring data currency and consistency for conflict resolution. In ICDE, Christian S. Jensen, Christopher M. Jermaine, and Xiaofang Zhou (Eds.). 470–481.
[28] Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Wenyuan Yu. 2010. Towards Certain Fixes with Editing Rules and Master Data. Proc. VLDB Endow. 3, 1 (2010), 173–184.
[29] Raul Castro Fernandez and Samuel Madden. 2019. Termite: a system for tunneling through heterogeneous data. In Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM@SIGMOD 2019, Amsterdam, The Netherlands, July 5, 2019. 7:1–7:8.
[30] Behzad Golshan, Alon Y. Halevy, George A. Mihaila, and Wang-Chiew Tan. 2017. Data Integration: After the Teenage Years. In PODS, Emanuel Sallinger, Jan Van den Bussche, and Floris Geerts (Eds.). ACM, 101–106.
[31] Mazhar Hameed and Felix Naumann. 2020. Data Preparation: A Survey of Commercial Tools. SIGMOD Rec. 49, 3 (2020), 18–29.
[32] Shuang Hao, Nan Tang, Guoliang Li, and Jian Li. 2017. Cleaning Relations Using Knowledge Bases. In ICDE. 933–944.
[33] Jian He, Enzo Veltri, Donatello Santoro, Guoliang Li, Giansalvatore Mecca, Paolo Papotti, and Nan Tang. 2016. Interactive and Deterministic Data Cleaning. In SIGMOD. 893–907.
[34] Alireza Heidari, George Michalopoulos, Shrinu Kushagra, Ihab F. Ilyas, and Theodoros Rekatsinas. 2020. Record fusion: A learning approach. CoRR abs/2006.10208 (2020).
[35] Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Martin Eisenschlos. 2020. TaPas: Weakly Supervised Table Parsing via Pre-training. In ACL, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.). 4320–4333.
[36] Ihab F. Ilyas and Xu Chu. 2019. Data Cleaning. ACM.
[37] Matteo Interlandi and Nan Tang. 2015. Proof positive and negative in data cleaning. In ICDE. 18–29.
[38] Itunes-Amazon. [n.d.]. https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md#itunes-amazon.
[39] Ayush Jain, Akash Das Sarma, Aditya G. Parameswaran, and Jennifer Widom. 2017. Understanding Workers, Developing Effective Tasks, and Enhancing Marketplace Dynamics: A Study of a Large Crowdsourcing Marketplace. Proc. VLDB Endow. 10, 7 (2017), 829–840.
[40] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Trans. Assoc. Comput. Linguistics 8 (2020), 64–77.
[41] Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, and Lucian Popa. 2019. Low-resource Deep Entity Resolution with Transfer and Active Learning. In ACL, Anna Korhonen, David R. Traum, and Lluís Màrquez (Eds.). 5851–5861.
[42] Been Kim and Finale Doshi-Velez. 2017. Interpretable Machine Learning: The fuss, the concrete and the questions. In ICML Tutorial.
[43] Pradap Konda, Sanjib Das, Paul Suganthan G. C., AnHai Doan, Adel Ardalan, Jeffrey R. Ballard, Han Li, Fatemah Panahi, Haojun Zhang, Jeffrey F. Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, and Vijay Raghavendra. 2016. Magellan: Toward Building Entity Matching Management Systems. Proc. VLDB Endow. 9, 12 (2016), 1197–1208.
[44] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL. 7871–7880.
[45] Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. [n.d.]. Deep Entity Matching with Pre-Trained Language Models. Proc. VLDB Endow. ([n.d.]).
[46] Bin Liu, Laura Chiticariu, Vivian Chu, H. V. Jagadish, and Frederick Reiss. 2010. Automatic Rule Refinement for Information Extraction. Proc. VLDB Endow. 3, 1 (2010), 588–597.
[47] Yang Liu, Yan Kang, Chaoping Xing, Tianjian Chen, and Qiang Yang. 2020. A Secure Federated Transfer Learning Framework. IEEE Intell. Syst. 35, 4 (2020), 70–82.
[48] Mohammad Mahdavi and Ziawasch Abedjan. 2020. Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning. Proc. VLDB Endow. 13, 11 (2020), 1948–1961.
[49] Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2019. Raha: A Configuration-Free Error Detection System. In SIGMOD. 865–882.
[50] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In SIGMOD, Gautam Das, Christopher M. Jermaine, and Philip A. Bernstein (Eds.). 19–34.
[51] George Papadakis, Dimitrios Skoutas, Emmanouil Thanos, and Themis Palpanas. 2020. Blocking and Filtering Techniques for Entity Resolution: A Survey. ACM Comput. Surv. 53, 2 (2020), 31:1–31:42.
[52] Ofir Press, Noah A. Smith, and Omer Levy. [n.d.]. Improving Transformer Models by Reordering their Sublayers. In ACL, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.). 2996–3005.
[53] Nataliya Prokoshyna, Jaroslaw Szlichta, Fei Chiang, Renée J. Miller, and Divesh Srivastava. 2015. Combining Quantitative and Logical Data Cleaning. Proc. VLDB Endow. 9, 4 (2015), 300–311.
[54] Abdulhakim Ali Qahtan, Nan Tang, Mourad Ouzzani, Yang Cao, and Michael Stonebraker. 2020. Pattern Functional Dependencies for Data Cleaning. Proc. VLDB Endow. 13, 5 (2020), 684–697.
[55] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 21 (2020), 140:1–140:67.
[56] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP, Jian Su, Xavier Carreras, and Kevin Duh (Eds.). 2383–2392.
[57] Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: Holistic Data Repairs with Probabilistic Inference. Proc. VLDB Endow. 10, 11 (2017), 1190–1201.
[58] Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In SIGKDD. 1135–1144.
[59] Timo Schick and Hinrich Schütze. 2020. Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference. CoRR abs/2001.07676 (2020).
[60] SIGMOD 2020 Programming Contest. [n.d.]. http://www.inf.uniroma3.it/db/sigmod2020contest/task.html.
[61] Saravanan Thirumuruganathan, Shameem Ahamed Puthiya Parambath, Mourad Ouzzani, Nan Tang, and Shafiq R. Joty. 2018. Reuse and Adaptation for Entity Resolution through Transfer Learning. CoRR abs/1809.11084 (2018).
[62] Sebastian Thrun and Lorien Y. Pratt. 1998. Learning to Learn: Introduction and Overview. In Learning to Learn, Sebastian Thrun and Lorien Y. Pratt (Eds.). Springer, 3–17.
[63] Arata Ugawa, Akihiro Tamura, Takashi Ninomiya, Hiroya Takamura, and Manabu Okumura. 2018. Neural Machine Translation Incorporating Named Entity. In COLING. 3240–3250.
[64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS. 5998–6008.
[65] Walmart-Amazon. [n.d.]. https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md#walmart-amazon.
[66] Jiannan Wang and Nan Tang. 2014. Towards dependable data repairing with fixing rules. In SIGMOD. ACM, 457–468.
[67] Zhengyang Wang, Bunyamin Sisman, Hao Wei, Xin Luna Dong, and Shuiwang Ji. 2020. CorDEL: A Contrastive Deep Learning Approach for Entity Linkage. In ICDM.
[68] Renzhi Wu, Sanya Chaba, Saurabh Sawlani, Xu Chu, and Saravanan Thirumuruganathan. 2020. ZeroER: Entity Resolution using Zero Labeled Examples. In SIGMOD. 1149–1164.
[69] Mohamed Yakout, Laure Berti-Équille, and Ahmed K. Elmagarmid. 2013. Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes. In SIGMOD, Kenneth A. Ross, Divesh Srivastava, and Dimitris Papadias (Eds.). ACM, 553–564.
[70] Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville, Mourad Ouzzani, and Ihab F. Ilyas. 2011. Guided data repair. Proc. VLDB Endow. 4, 5 (2011), 279–289.
[71] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated Machine Learning: Concept and Applications. ACM Trans. Intell. Syst. Technol. 10, 2 (2019), 12:1–12:19.
[72] Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In ACL. 8413–8426.
[73] Zhuosheng Zhang, Junjie Yang, and Hai Zhao. 2020. Retrospective Reader for Machine Reading Comprehension. arXiv:cs.CL/2001.09694.
