Semantic Annotation for Tabular Data

Udayan Khurana, IBM Research AI ([email protected])
Sainyam Galhotra, University of Massachusetts, Amherst ([email protected])

arXiv:2012.08594v1 [cs.AI] 15 Dec 2020

ABSTRACT

Detecting the semantic concept of columns in tabular data is of particular interest to many applications, ranging from data integration, cleaning, and search to feature engineering and model building in machine learning. Recently, several works have proposed supervised learning-based or heuristic pattern-based approaches to semantic type annotation. Both have shortcomings that prevent them from generalizing over a large number of concepts or examples. Many neural network based methods also present scalability issues. Additionally, none of the known methods works well for numerical data. We propose C2, a column-to-concept mapper that is based on a maximum likelihood estimation approach through ensembles. It is able to effectively utilize vast amounts of, albeit somewhat noisy, openly available table corpora, in addition to two popular knowledge graphs, to perform effective and efficient concept prediction for structured data. We demonstrate the effectiveness of C2 over available techniques on 9 datasets, the most comprehensive comparison on this topic so far.

1 INTRODUCTION

Semantic annotation of structured data is crucial for applications ranging from information retrieval and data preparation to training classifiers. Specifically, the schema matching component of data integration requires accurate identification of column types for input tables [29], while automated techniques for data cleaning and transformation use semantic types to construct validation rules [21]. Dataset discovery [34] and feature procurement in machine learning [15] rely on semantic similarity of the different entities across a variety of tables. Many commercial systems like Google Data Studio [1], Microsoft Power BI [2], and Tableau [3] use such annotations to understand input data, identify discrepancies, and generate visualizations.

Semantic annotation of a column in a table refers to the task of identifying the real-world concepts that capture the semantics of the data. For example, a column containing 'USA', 'UK', 'China' is of the type Country. Even though semantic annotation of columns (and tables in general) is crucial for various data science applications, the majority of commercial systems use regular expression or rule based techniques to identify the column type. Such techniques require pre-defined patterns to identify column types, are not robust to noisy datasets, and do not generalize beyond the input patterns.

There has been recent interest in leveraging deep learning based techniques to detect semantic types. These techniques demonstrate robustness to noise in data and superiority over rule based systems. Prior techniques can be categorized into two types based on the kind of data used for training and the kind of concepts identified. (a) Knowledge graph: Colnet [7] and HNN [8] are the most recent techniques that leverage curated semantic types from knowledge graphs like DBPedia to generate candidate types and train classifiers to calculate the likelihood of each candidate. These techniques identify semantic types for columns that are partially present in knowledge graphs and do not generalize well to generic types like Person name. (b) Data lake: Sherlock [17] and Sato [33] are two recent techniques that treat the concept labelling task as a multi-class classification problem. These techniques leverage open data to train classifiers; however, their systems are limited to concepts that are also an exact match in the DBPedia concept list.

Even though these techniques have shown a leap of improvement over prior rule based strategies, they suffer from the following challenges. (a) Data requirement: Training deep learning based classifiers requires large amounts of gold standard training data, which is often scarce for less popular types; there is a long tail of semantic types over the web. (b) Scale: Training deep learning networks for multi-class classification with thousands of classes is highly inefficient. Techniques like Colnet [7] train a separate classifier for each type and do not scale beyond 100 semantic types. (c) Numerical data: These techniques are primarily designed for textual categorical columns, i.e., ones with named entities, and do not work for numerical types like population, area, etc. (d) Open data: Colnet and HNN rely on the presence of cell values in curated knowledge graphs like DBPedia and do not leverage the millions of web tables present over the web. On the other hand, Sato and Sherlock are able to utilize a small subset of open data lakes for training, but their approach is forced to discard most of it, and they do not benefit from well curated sources like DBPedia, Wikidata, etc. either. (e) Table context: Sato and HNN consider the table context, i.e., neighboring columns, when predicting a column's type, through an extension of the above methods using convolutional neural network style architectures, where the problem of data sparsity intensifies. (f) Quality with scale: Besides the limited current ability, it is not clear how the approach of multiple binary classifiers or multi-class classifiers can maintain quality when the number of classes is in the thousands. It is also not clear how they would be able to utilize the nature of overlapping concepts, such as Person and Sports Person.

Example 1.1. Consider an input table containing information about airports from around the world in Figure 1. For the first column, IATA code (the international airport code) is the most appropriate semantic concept. When searching structured data sources, it might not be the top choice for any individual entity (JFK, SIN, ...), yet the challenge is to find it as the top answer. One may observe that while it is not the top choice in individual results, it is the most pervasive option.

To address these limitations, we make the following observations. (a) There is a plethora of openly available structured data from a diverse set of sources such as data.gov, Wikipedia tables, collections of crawled Webtables, and others (often referred to as data lakes), as well as knowledge graphs like DBPedia and Wikidata. (b) It is true that not all sources are well curated, and one should consider a degree of noise within each source. However, a robust ensemble over the multiple input sources can help with noise elimination. While a strict classification modeling method may require gold standard data, a carefully designed likelihood estimation method may be more noise tolerant and scale to larger amounts of data. (c) Numerical data needs special handling compared to categorical entity data. While a numerical value is less unique than a named entity (e.g., 70 can mean several things), a group of numbers representing a certain concept follows particular patterns. Meta-features like range and distribution allow quick identification of numerical concepts and are robust to small amounts of noise. (d) Instead of considering individual columns in isolation, the global context of the dataset can help to jointly estimate the likelihood of each concept.

Figure 1: Example column to concept mapping using Maximum Likelihood Estimation.

Our contributions in this paper are as follows: (1) Considering the above observations, we present C2, a column to concept mapper using joint estimation through piecewise maximum likelihood estimation, utilizing large openly available structured data. (2) We provide special estimation methods for categorical, numeric, and alphanumeric-symbolic data, respectively, while unifying them under a common framework of likelihood estimation. (3) We present novel and extensive indexes to support quick estimation computations for numeric, categorical, and mixed type data. (4) We effectively perform context utilization in tables without a polynomial increase in runtime or resources. (5) We provide the most comprehensive experiments on this topic using 9 datasets. We compare against 5 recent benchmarks and show our method to vastly outperform them, in addition to being considerably faster.

2 RELATED WORK

In this section, we discuss prior techniques that have been developed to identify the type of a column.

Regular expression and lookup based techniques: The use of manually defined patterns to identify the type of a column forms one of the components of various data science pipelines [4]. Some of the recent applications include feature enrichment [14], schema mapping, data cleaning and transformation [19, 21], structured data search [13], and other information retrieval tasks. Many prior techniques have developed heuristics to identify such patterns. In order to reduce the manual effort of enumerating patterns and rules, these techniques [9, 15, 18, 25, 27, 28, 30] perform fuzzy lookup of each cell value over knowledge graphs to identify concepts. They assume that the cell values are present in knowledge graphs and are not robust to noise in values. Deng et al. [10] presented a method based on fuzzy matching between the entities for a concept and the cell values of a column, ranking candidates using certain similarity scores. Their technique is highly scalable but lacks robustness: many movies have the same name as novels, and a majority counting based approach would confuse a column of movies with one of novels. Other techniques that consider such heuristics include [9, 13-15, 18, 25, 27, 28, 32]. Neumaier et al. [26] is specifically designed for numerical columns. Their approach clusters the values in a column and uses nearest neighbor search to identify the most likely concept. It does not leverage column meta-data or the context of co-occurring columns.

Graphical models: Some of the advanced concept identification techniques generate features for each input column and use probabilistic graphical models to predict the label. Limaye et al. [23] use a graphical model to collectively determine cell annotations, column annotations, and binary column relationships. These techniques have been observed to be sensitive to noisy values and do not capture semantically similar values (which have been successfully captured by recent word embedding based techniques).

Learning approaches using neural networks: Chen et al. [7] present ColNet, a CNN-based approach for classification. It constructs positive and negative examples by looking up the cell values over DBPedia. These examples are then embedded using word2vec to train a CNN. The CNN is useful to build context amongst different cells of the column. Chen et al. [8] further extend prior methods to leverage inter-column semantics by training a hybrid neural network (HNN). This technique is extremely slow and requires more than 30 hrs to train over 40 semantic types. Both [7, 8] rely on knowledge graph lookup to generate candidates and do not extend to numerical attributes.

Sherlock [17] is one of the recent techniques that models concept identification as a multi-class classification problem. It trains a multi-input feed forward deep neural network over a corpus of open data containing more than 600K columns referring to 78 semantic types. All types with fewer than 1000 training examples are ignored. The dependence of Sherlock on large amounts of training data limits its applicability to the long tail of concepts.

SATO [33] builds upon Sherlock by using context from co-occurring columns and jointly predicting the concept of all columns in the dataset. It uses an LDA model, treating the table as a document, to generate a vector of terms representative of the table context. The inter-column context, however, is restricted to neighboring columns only, which may not work well in situations where the column ordering is irrelevant.

3 PRELIMINARIES

In this section, we discuss the setup and then define the problem.

• Given: a table T containing m column vectors, each of which contains n values. Alternatively, T, also written as T_{n x m}, may be described as an ordered list of the m column vectors (the order is only to preserve correspondence with the solution, an ordered list of concepts), i.e., T = {v_0, v_1, ..., v_{m-1}}. Each vector contains n values, v_i = {v_{i,0}, v_{i,1}, ..., v_{i,n-1}}, for 0 <= i < m.
• Let C be the set of all concepts in the English language. For practical purposes, this is the union of all property labels in DBPedia and Wikidata, all classes in DBPedia, and all frequently occurring (100 or more) column names in our table corpus.
• ω(x) = c denotes that c is the concept for entity x. Similarly, ω(v_i) denotes the concept for the column v_i. We treat concepts as a probabilistic notion. Concepts are considered neither exclusive nor absolute. They depend on the context surrounding the object in question and exist in a degree of certainty or uncertainty, instead of definitiveness. This premise is crucial to the reasoning in our framework.
• D is the data lake available for our reference. It is the reference against which we aim to optimize our estimation. As a simple case, if (wherever available) the header or title of a column containing the value "Chattanooga" is City 9 out of 10 times, and Football Team 1 out of 10 times, then Pr(ω(Chattanooga) = City) = 0.9 and Pr(ω(Chattanooga) = Football Team) = 0.1. For this study, D is a set of over 32 million tables originating from various sources [6, 16] and two knowledge graphs, Wikidata and DBPedia. In the case of an entity from a knowledge graph, its concept is found from an incoming property label or its class (the value of the typeOf property).

Figure 2: Concept prediction process illustrated for a hypothetical categorical entity column using our index, followed by smoothing, and maximum likelihood estimation.

PROBLEM: For a given table T with m columns and n rows, find the set of m concepts {c_0, c_1, ..., c_{m-1}} that, with reference to D, maximizes the likelihood of:
• ω(v_{i,j}) = c_i for all 0 <= i < m and 0 <= j < n [the likeliest concept for all cells of a column];
• c_i implying c_k, for all 0 <= i, k < m, i != k [likely co-occurrence of a pair of concepts in the same table];
• ω(v_{i,j}) = c_i implying ω(v_{k,j}) = c_k, for all 0 <= i, k < m and 0 <= j < n [likely existence of these tuples, i.e., pairs of values belonging to said concepts].

It is important to note that our goal is not to find a solution that holds true for the above conditions; instead, it is to find a solution, an ordered set of concepts, that is the most likely amongst all possible solutions as per the knowledge existing in D.

Figure 3: An example of effective tuple validation. All entities in the first column are books as well as movies, the latter being more popularly mentioned across all sources. However, the second column's (year) values suggest that it is the year of the book's publication, instead of the movie's release.

4 APPROACH

In the previous section, we formulated the problem of concept mapping as three joint questions of likelihood maximization. However, it is a complex and expensive problem to solve: we would need to evaluate an exponential number of candidates to find the ones that maximize the specified joint likelihood. Instead, we propose an alternative practical approach in which we solve a series of connected smaller likelihood optimization problems with certain assumptions (primarily thresholds) based on our observations, wherever suitable and necessary. We start by independently finding the top candidates for categorical entity columns. Using those as pivots, we narrow the search for numerical and mixed-type data columns (with the argument that a column like Population is likely to be found in a table talking about a Country or a City, as opposed to a table containing Movie). Considering ranked (top-k) candidate concepts for all columns, we finally reason about which narrative, i.e., which combination of concepts, best explains the given table with respect to the millions of records in our reference data. In this section, we explain our proposed estimation methods, and in the next section we describe the necessary data processing and data structures required to efficiently realize this logic.
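For illustration, the per-entity probabilities introduced in Section 3 (the "Chattanooga" example) can be derived from reference header counts as in the following minimal Python sketch; the dictionary layout is an assumption made for the example and not the exact implementation.

```python
from collections import Counter

def concept_distribution(entity, concept_counts):
    """Convert raw (concept -> count) header statistics for one entity
    into a probability distribution over concepts."""
    counts = concept_counts.get(entity, {})
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}

# "Chattanooga" appeared under the header "city" 9 times and under
# "football team" once in one reference source.
counts = {"chattanooga": Counter({"city": 9, "football team": 1})}
print(concept_distribution("chattanooga", counts))
# {'city': 0.9, 'football team': 0.1}
```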

For an n-valued column vector v_i = {v_{i,0}, v_{i,1}, ..., v_{i,n-1}}, let Pr(ω(v_{i,j}) = c_l) denote the probability that the entity v_{i,j} belongs to the concept c_l. Then Σ_{c_l ∈ C} Pr(ω(v_{i,j}) = c_l) = 1, i.e., the probabilities of all concepts for an entity sum up to 1. We estimate these probabilities from our corpora D. The process is described in Section 5 and illustrated in Figure 2. For now, it is sufficient to keep the "Chattanooga" example in mind.

For each entity v_{i,j}, there exists a list of probabilities Pr(ω(v_{i,j}) = c_l), for all c_l ∈ C, and each entity in a column presents different probabilities for different concepts. We posit that:
• The likelihood of all entities of a column belonging to concept c is the joint probability of all its entities belonging to concept c at the same time, i.e., assuming independence,
L(ω(v_i) = c) = Pr(ω(v_i) = c) = Pr(ω(v_{i,0}) = c) × Pr(ω(v_{i,1}) = c) × ... × Pr(ω(v_{i,n-1}) = c) = Π_j Pr(ω(v_{i,j}) = c).
• The concept of a column is the concept that most likely explains all of its components, i.e., c* = MLE(c) = arg max_{c ∈ C} L(ω(v_i) = c).
• For computational convenience, we use the log likelihood: c* = arg max_{c ∈ C} Σ_j log Pr(ω(v_{i,j}) = c).

Multiple Sources of Data: Multiple sources of background information provide a rich and diverse set of data and concepts, which is particularly useful for improving our predictions through an ensemble [11]. Instead of treating the entire reference data as one large source, counting and calculating probabilities from it, we utilize the diversity provided by these sources to aid in likelihood estimation. If we were to collapse all sources into one, a concept that heavily dominates in a particular source (with a very high presence for an entity) while being insignificant in other sources could overwhelm all other concept options. Diversity through ensembles rewards consistency, to our benefit. Note that this diversity is in addition to the one introduced by the results for the various entities in the column, and the same notion of a concept's consistency throughout the entities is rewarded through the proposed method. For |D| sources and n entities in the column, we obtain n × |D| lists of concept-count arrays on which to perform likelihood estimation. Figure 2 illustrates this process.
• c* = arg max_{c ∈ C} Σ_{d ∈ D} Σ_{0 <= j < n} log(α_d · Pr_d(ω(v_{i,j}) = c)), where Pr_d denotes the probability estimated from source d.
• α_d is the weight assigned to each source, based on a simple regression experiment.
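As an illustration of this source-weighted scoring (combined with the add-one smoothing described in the next paragraph), the following Python sketch scores candidate concepts for one categorical column. The data layout (`sources` as a list of per-entity concept-count dictionaries) and the weights `alpha` are assumptions made for the example, not the exact implementation.

```python
import math
from collections import defaultdict

def score_column(values, sources, alpha):
    """sources: list of dicts, one per reference source,
       mapping entity -> {concept: count}.
       alpha: per-source weights (the alpha_d above).
       Returns candidate concepts ranked by log-likelihood."""
    scores = defaultdict(float)
    # Candidate concepts: any concept fetched for any value in any source.
    candidates = {c for src in sources for v in values
                  for c in src.get(v, {})}
    for d, src in enumerate(sources):
        for v in values:
            counts = dict(src.get(v, {}))
            # Add-one smoothing: every candidate gets at least count 1.
            for c in candidates:
                counts.setdefault(c, 1)
            total = sum(counts.values())
            for c in candidates:
                scores[c] += math.log(alpha[d] * counts[c] / total)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy example with two sources and two column values (hypothetical counts).
src1 = {"jfk": {"iata code": 9, "airport": 3}, "del": {"iata code": 7}}
src2 = {"jfk": {"faa code": 5, "iata code": 4}, "del": {"station": 2, "iata code": 6}}
print(score_column(["jfk", "del"], [src1, src2], alpha=[1.0, 1.0])[:3])
```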
Smoothing: Since each entity of the column fetches a different set of concepts from each source, and due to our multiplicative joint-probability estimation, we perform smoothing [31] of the concept-count pairs. For each concept that occurs in the list fetched for any entity of a certain column, wherever that concept is not found we add it with the minimal count (1), so that the probability of its existence is non-zero, yet small. This is particularly useful for unseen entity values, such as a new video game: if the results corresponding to the other entities strongly suggest a concept, we do not wish to lose it as a candidate just because one or a few entities or sources do not contain that information.

Concept co-occurrence: A table's columns usually represent related concepts. Sometimes a table is based around a central column, with other columns representing its attributes, such as census data tables containing attributes of an individual, from name (which can be thought of as the central concept, the person) to attributes like age, income, etc. Other tables may be less centralized, especially joinable tables. In either case, column concepts are usually linked to each other directly or indirectly. So far in our concept inference, we have only considered a column independently, conducting maximum likelihood estimation with respect to only its values, but not considering the concepts or values of co-occurring columns.

To obtain appropriate column context, we prepare the information of all column-name pair co-occurrences in the reference data D. This gives us the ability to reason about how likely two concepts are to co-exist in a given table. We utilize this to restrict the scope of search for numerical and mixed-type data, as well as to re-rank the final combination of m concepts, as follows:
• We perform the concept search for categorical entity-based columns first, due to their inherently lower ambiguity compared to numerical attributes. Based on the results (top-k answers) for categorical entity attributes, we influence the search performed by Numerical Column Detection. We search up to 25 possibly co-occurring concepts for each numerical column. While mixed-type columns have less ambiguity than numerical columns, we search for their concepts after the categorical entity columns as well, as these attributes are rarely the central concept in any table.
• If c_1 is the concept for a categorical column v_1, then using Bayes' theorem, the probability that a numerical column v_2 in the same table represents the concept c_2 is Pr(ω(v_2) = c_2 | ω(v_1) = c_1) = Pr(ω(v_1) = c_1 ∧ ω(v_2) = c_2) / Pr(ω(v_1) = c_1).
• However, v_1 has other possible concepts with different probabilities, and there are other categorical columns with an array of probable concept values (with various probabilities). The complete probabilistic estimation of a numerical column's concept is a probabilistic average of the above. We omit the mathematical expression here.
• Note that the above weighted conditional probability is used to determine the scope of search for Numerical Column Detection. Each candidate there is then evaluated based on the Numerical Interval Tree for that concept, as explained in Section 5.
• Once we have obtained the candidate results for each column, we once again compute the joint probability for each candidate list [c_0, c_1, ..., c_{m-1}], by approximating Pr(c_0 ∧ c_1 ∧ ... ∧ c_{m-1}).

Column co-occurrence validation improves the quality of predictions as well as efficiency, by reducing the scope of numerical and mixed-valued candidates to consider.
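A minimal Python sketch of this candidate restriction follows. It assumes a hypothetical `pair_counts` dictionary of column-name pair co-occurrence frequencies from D, and it simplifies the combination step by taking the best conditional probability per candidate, whereas the method above uses a probabilistic average over all categorical candidates.

```python
def cooccurring_candidates(categorical_concepts, pair_counts, top_k=25):
    """categorical_concepts: ranked concepts already predicted for the
       categorical columns of the table.
       pair_counts: {(concept_a, concept_b): frequency} from the reference data.
       Returns up to top_k numeric-column candidates ranked by Pr(c2 | c1)."""
    scores = {}
    for c1 in categorical_concepts:
        # Pr(c1) is proportional to the total frequency of pairs involving c1.
        total_c1 = sum(f for (a, b), f in pair_counts.items() if c1 in (a, b))
        if total_c1 == 0:
            continue
        for (a, b), f in pair_counts.items():
            if c1 not in (a, b):
                continue
            c2 = b if a == c1 else a
            scores[c2] = max(scores.get(c2, 0.0), f / total_c1)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

pair_counts = {("country", "population"): 320, ("country", "area"): 150,
               ("movie", "year of release"): 220}
print(cooccurring_candidates(["country"], pair_counts))
# ['population', 'area']
```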
Tuple Validation: Consider the case in Figure 3, where the first column appears to represent well-known movies and the second column contains a year. With the help of the column co-occurrence check, we find a popular pair of columns, (movie, year released). While we have considered the relationship between the column concepts, upon closer examination it is revealed that the combination of title and year actually represents a book of the same name, along with its year of publication. For instance, the book The Godfather was released in 1969, whereas the movie of the same name was released in 1972. Even though movie is a more popular concept in our database and occurs almost thrice as frequently as book, and the combination of movie and year is frequent, the validation of tuples gives credence to the concept pair (book, year published) as being more likely.

We perform this operation by looking up tuple pairs in the Column Co-occurrence and Tuple Validation index (Section 5 and Figure 4) for up to ten randomly selected rows. The prior likelihood of each candidate is multiplied by the number of occasions on which the candidate led to a match; a candidate that did not match at all is left unchanged. The matching for numerical attributes is performed within a multiplicative error 0 <= ε < 1 (default 0.1), whereas categorical entities must match exactly. Each validated pair must contain at least one categorical entity. For example, for the tuple Mountain: Mount Everest and Height: 8,884, a match must have Mount Everest as a mountain and the height within 10% of 8,884 m. If the potential height column instead contained a value of, say, 1, then it was either an error or another concept.

Belief Sharing: Several column headers collected from different sources may refer to the same concept. For example, all scientists are persons and some persons are scientists; any person can be considered a human and vice versa. The reference data, which originates from multiple sources, has no standardized way of naming columns. When we collect counts from the index and subsequently compute probabilities, the computation considers concepts as independent. However, in reality, related concepts are not independent. Hypothetically, if an entry was marked Human 30% of the time, Person 30% of the time, and Object 40% of the time, the correct answer is either Human or Person, with 0.6 probability each, and not Object. The probabilities after Belief Sharing do not add up to one, and the concepts are no longer considered independent of each other. We perform this belief sharing through (a) Word2Vec [24] similarity; (b) a list of synonyms extracted from DBPedia and Wikidata; and (c) the concept hierarchy from DBPedia and Wikidata. If two concepts are found to be synonyms, their counts are merged. Else, if a parent-child relationship is found in the hierarchy, the parent transfers the equivalent of its count (while retaining its own) to its children in proportion to the children's existing counts, whereas a child transfers half of its count to its parent (while retaining its own count). Finally, if neither of those two criteria is matched, we resort to Word2Vec, and if a pair of concepts is found to be at least 40% similar (the 0.4 threshold was determined through an empirical search), they transfer their respective counts to each other in the fraction of the similarity (while retaining their own). This step has a profound impact on the quality of predictions, as it provides a convergence between the diverse sources following different nomenclature and other concept naming practices. It also happens to be expensive, which is partially alleviated by indexing word similarity.
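The count-transfer rules above can be sketched as follows; the synonym, hierarchy, and similarity lookups (`synonyms`, `parent_of`, `similarity`) stand in for the indexes of Section 5, and the sketch approximates the described behaviour (for instance, it does not distribute a parent's count proportionally among multiple children).

```python
def share_beliefs(counts, synonyms, parent_of, similarity, sim_threshold=0.4):
    """counts: {concept: count} collected for one entity or column.
       Returns adjusted counts after sharing mass between related concepts."""
    adjusted = dict(counts)
    concepts = list(counts)
    for i, a in enumerate(concepts):
        for b in concepts[i + 1:]:
            if b in synonyms.get(a, set()):
                # Synonyms: merge counts (both keep the merged total).
                merged = counts[a] + counts[b]
                adjusted[a] = adjusted[b] = merged
            elif parent_of.get(b) == a:       # a is the parent of b
                adjusted[b] += counts[a]      # parent passes its count down
                adjusted[a] += counts[b] / 2  # child passes half of its count up
            elif parent_of.get(a) == b:       # b is the parent of a
                adjusted[a] += counts[b]
                adjusted[b] += counts[a] / 2
            else:
                sim = similarity(a, b)
                if sim >= sim_threshold:
                    adjusted[a] += sim * counts[b]
                    adjusted[b] += sim * counts[a]
    return adjusted

counts = {"human": 30, "person": 30, "object": 40}
print(share_beliefs(counts, {"human": {"person"}, "person": {"human"}},
                    {}, lambda a, b: 0.0))
# {'human': 60, 'person': 60, 'object': 40}
```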
5 DATA PREPARATION

In this section, we discuss the reference data preparation, which is crucial for the runtime of query processing in C2. We describe our indexing components for categorical string data (an inverted count index), numeric data (a modified interval tree), and mixed-type data (a new pattern search tree). The approach of such indexing for column type detection is novel and unique to C2. For comparison, at runtime Colnet fetches information from DBPedia based on each cell value and hence takes hours to perform queries for which C2 performs a similar lookup in less than 10 seconds. Moreover, the other approaches have no way to address numerical column type detection (except selective cases such as year, which is practically a categorical attribute), because their embedding methods are inadequate for numerical values.

Inverted Entity Concept Count Index: This key-value inverted index lists the number of times a categorical entity has been mentioned under respective column names. This is done separately for each source of data. For knowledge graph triples, the column name and entity are replaced by (a) a property and its object, respectively; or (b) an entity and the object of its "typeOf" property (indicating the class name, if any), respectively. Its construction is briefly described as follows. For a categorical entity column in a table in D, with column name c and column values e_0, e_1, ..., e_{l-1}, a key-value pair e_i : (c, 1) is produced for all 0 <= i < l. All pairs are aggregated based on the key, and the counts for each column name or concept are aggregated into a list. The final index contains the entity names as keys, where the values are lists of (concept, count) pairs. This is performed separately for each source of data, and the entity keys are prefixed by a code for the data source. The utilization of this index is illustrated in Figure 2.
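A compact Python sketch of this construction over a toy corpus follows; the source-prefix convention and variable names are illustrative assumptions rather than the exact implementation.

```python
from collections import defaultdict, Counter

def build_inverted_index(tables, source_code):
    """tables: iterable of (column_name, values) pairs from one source.
       Returns {source-prefixed entity: Counter({concept: count})}."""
    index = defaultdict(Counter)
    for column_name, values in tables:
        concept = column_name.strip().lower()
        for value in values:
            key = f"{source_code}:{value.strip().lower()}"
            index[key][concept] += 1   # emit (entity, (concept, 1)) and aggregate
    return index

tables = [("city", ["New York", "New Delhi", "Singapore"]),
          ("film", ["Singapore", "Casablanca"])]
idx = build_inverted_index(tables, source_code="wt")
print(idx["wt:singapore"])   # Counter({'city': 1, 'film': 1})
```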

Numerical Interval Tree Index: Numbers need special treatment compared to categorical entities. For instance, the number 70 could be the age of a person, a systolic blood pressure value, the number of wins by a team in a season, ad infinitum. Categorical entities like "University of Virginia", in contrast, are less ambiguous, belonging to a much smaller set of candidate concepts. On the other hand, a group of numbers provides us with an additional piece of information, a linear scale, against which we can check whether a set of numbers lies in a range typically associated with a concept. So, while 70 might plausibly mean the age of a person, the group of values {70, 100, 130, 120, 140} collectively is unlikely to represent the age of a person according to our data. Hence, we need not check the exact values but rather the likely range of a concept against that of the given data.

We design a modified interval tree for this purpose. For each numerical concept in the reference data, we build a separate tree in which we effectively compare the range of the given set (an interval) with all historic ranges in which that concept has occurred. Through interval queries, we find the number of intervals that intersect with the given column's range, i.e., [min, max]. The modification with respect to the traditional interval tree is that instead of the leaf nodes linking to objects falling in the range of atomic intervals, our tree stores the number of times the concept was true when the range was that atomic interval. This operation works in O(log n) in the number of intervals indexed in that tree, i.e., those belonging to that concept in our data. For fundamental details on external interval trees, we refer the reader to Arge et al. [5].
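As a simplified stand-in for the modified interval tree (a linear scan here instead of the O(log n) tree, purely to illustrate the scoring), the historic ranges recorded for a concept can be matched against a column's [min, max] as follows; the resulting fraction is combined with the concept prior as described next.

```python
def range_support(column_values, historic_ranges):
    """historic_ranges: list of (lo, hi) intervals under which the concept
       occurred in the reference data (stored with counts in the real tree).
       Returns the fraction of historic ranges intersecting the column range,
       i.e., a rough estimate of Pr(r | c)."""
    lo, hi = min(column_values), max(column_values)
    hits = sum(1 for (a, b) in historic_ranges if a <= hi and b >= lo)
    return hits / len(historic_ranges) if historic_ranges else 0.0

ages = [(0, 90), (18, 75), (5, 110), (20, 65)]  # ranges seen for "age of a person"
print(range_support([70, 100, 130, 120, 140], ages))  # 0.75
print(range_support([23, 41, 67], ages))               # 1.0
```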


Once this count is obtained, we divide it by the total number of intervals (ranges) for concept c in the reference data. For a range r, this gives us the conditional probability Pr(r | c). However, we are actually looking for Pr(c | r). Now, Pr(c) is the probability of occurrence of this concept (given that the column is numeric, since that is the context of this index), which can be obtained simply as the ratio of occurrences of concept c among all numeric columns seen in the reference data. Using Bayes' theorem:

Pr(c | r) = Pr(r | c) × Pr(c) / Pr(r) ∝ Pr(r | c) × Pr(c)

For all candidate concepts c, the denominator Pr(r) is constant and can be ignored. Thus, such a search over all the numeric concept trees can tell us the most likely column concept, without any other table context available. The cost of construction of each tree is O(n · log n) and the update cost is O(log n).

Figure 4: System Architecture depicting the layers of (a) reference data, i.e., knowledge graphs and web tables; (b) different indexes based upon the column data type; and (c) the estimators that perform inference on a given table using the indexes.

Composite Pattern Tree Index: This index deals with data that is neither of the categorical/string type nor of the numeric type (and neither of the natural language type, which is beyond the scope of this work). Instead, it deals with the mixed type of alpha-numeric-symbolic data, such as emails, dates, website addresses, DOIs, computer filenames, and so on. We run a regular expression (regex) generator (TDDA: https://tdda.readthedocs.io/en/v1.0.30/rexpy.html) on each such column in the reference data and aggregate and deduplicate the regexes by each concept or column name. The index is a decision tree, which separates the cases by whether an entity contains symbols or not, followed by whether it contains one symbol or another, and so on. The leaf nodes contain links to groups of regexes, all corresponding to one concept. For example, as seen in Figure 5, the main separator for emails is whether the entity contains the '@' and '.' symbols. One concept may occur in clusters at different leaf nodes. For example, dates of the format '11/07/2005' and 'November 07, 2005' land in separate Date clusters. The clusters are found by analyzing all regexes and columns belonging to a concept, and for any symbol combination occurring above a threshold (10%) of the total columns for that concept, a new leaf node is created. At runtime, an input column may well qualify for multiple branches of the tree. For example, '@dan.singer' qualifies for both Instagram handle and Email, and the regex sets corresponding to both root nodes will be evaluated. The evaluation and inference follow the same logic as that of a concept's tree in the Numerical Interval Tree Index described above. The number of regexes in a leaf node set is limited to 100 using random selection.

Figure 5: Composite Pattern Index for mixed-type data.
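A minimal Python sketch of the runtime routing step follows, keying leaf clusters by the set of symbols present in a value; the regexes per concept would come from the offline TDDA/rexpy pass, and the concrete patterns and concept names here are illustrative assumptions.

```python
import re

# Symbol-signature -> {concept: [compiled regexes]}; built offline per concept.
PATTERN_TREE = {
    frozenset("@."): {"email": [re.compile(r"^[\w.]+@[\w.]+\.\w+$")],
                      "twitter handle": [re.compile(r"^@\w+$")]},
    frozenset("/"):  {"date": [re.compile(r"^\d{2}/\d{2}/\d{4}$")]},
}

def candidate_concepts(value):
    """Route a value by its symbol signature, then count matching regexes
       per concept among the leaf clusters that apply."""
    symbols = frozenset(ch for ch in value if not ch.isalnum())
    matches = {}
    for signature, clusters in PATTERN_TREE.items():
        if signature <= symbols:            # the branch's symbols are present
            for concept, regexes in clusters.items():
                hits = sum(1 for rx in regexes if rx.match(value))
                if hits:
                    matches[concept] = matches.get(concept, 0) + hits
    return matches

print(candidate_concepts("jane.doe@example.org"))  # {'email': 1}
print(candidate_concepts("11/07/2005"))            # {'date': 1}
```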

Column Co-occurrence and Tuple Validation Index: This key-value index stores the following: (a) concept pairs in the reference data; (b) the frequency of each pair in the reference data; and (c) which entity pairs exist under that concept pair and how frequently. This index has two important uses: (a) finding the likelihood of two concepts co-existing; and (b) finding how likely a pair of entities is under a concept pair. An illustration of this index can be seen in Figure 4 as the rightmost index.

Belief Sharing Index: We index concept pair similarity scores using a pre-trained Word2Vec model (https://code.google.com/archive/p/word2vec/), as we compute them only once. In addition, the synonyms and the concept hierarchy from DBPedia and Wikidata are read into an in-memory hash during initialization.

6 EVALUATION

In this section, we empirically evaluate the effectiveness of C2 on a collection of nine real-world datasets and compare it with prior literature.

6.1 Datasets

We evaluate results on a collection of 9 datasets, the most comprehensive so far. Out of these, 8 are from existing papers or competitions: Limaye [23], Semantification [20], SemTab challenge rounds 1 to 4 [18], T2Dv2 [22] (http://webdatacommons.org/webtables/goldstandardV2.html), and ISWC17 [12]. Additionally, we generated the 9th dataset by manually tagging a subset of columns from a 10% random sample of the Manyeyes data [16]; all these tables were removed from our reference data. This dataset contains categorical, numerical, and mixed types. Semantification contains numerical and categorical; the rest contain categorical type columns only. Table 1 summarizes the datasets.

Table 1: Characteristics of datasets used for evaluation: code name used in plots, dataset name, number of test cases, number of concept classes, and whether it contains categorical (Cat), numerical (Num), or mixed (Mixed) attributes.

Code  Name             Cases  Concepts  Cat  Num  Mixed
LIM   Limaye           84     18        Yes  No   No
SEM   Semantification  492    140       Yes  Yes  No
ST1   SemTab Round 1   120    32        Yes  No   No
ST2   SemTab Round 2   14780  356       Yes  No   No
ST3   SemTab Round 3   5762   266       Yes  No   No
ST4   SemTab Round 4   1732   118       Yes  No   No
T2D   T2Dv2            411    55        Yes  No   No
ISW   ISWC 2017        620    31        Yes  No   No
OUR   Ours             120    56        Yes  Yes  Yes

6.2 Setup

All code was implemented in Python and run on a server with 128GB RAM. We observed the following challenges in evaluating prior concept mapping systems. First, due to the lack of a standardized concept ontology in the majority of the datasets (the DBPedia ontology is well-defined but quite small), there is a lot of discrepancy in the target names representing the same concept; pairs such as Film and Movie are common and an issue for evaluation. Second, the presence of overlapping concepts such as Region and Province presents another challenge. If the test case expects one but a system provides the other, accepting it as the correct answer overlooks the fine semantic distinction; however, rejecting the match takes away the partial credit deserved by the system. A special case of this is the parent-child relationship. For example, if the precise answer is Mayor, should Person be accepted as a correct answer? While all mayors are persons, such a classification falls short of the desired precision.

We have purposefully stayed away from partial scoring (between 0 and 1) in the above cases due to the lack of an appropriate framework. For a fair comparison, we manually investigated the output of the various techniques and collected a "reasonable" set of synonyms in the English language for all concepts in the ground truth, such as Film: Movie, Region: Administrative Region, etc. Ontology based substitutions for parent-child relationships, where the parent is a broad category such as Place, Thing, or Agent, were not considered. However, two datasets from the ISWC SemTab challenge track have done that explicitly using available hierarchies (datasets ST2 and ST3), and we considered those in our evaluation.

Baselines. We compared C2 to three recent openly available systems, Sherlock [17], SATO [33], and Colnet [7], to the extent that we could run their code within a reasonable limit of time (48 hours per dataset). In addition, we provide experiments offering insights into the components of our system. We compute the accuracy of the top-ranked concept to compare the quality of returned concepts, and the time taken to predict the concept to compare efficiency.

6.3 C2 Performance

Figure 6 compares the accuracy of the top 1, 2, and 3 concepts returned by C2 on the considered datasets.

Figure 6: Accuracy of C2's top 1, 2, and 3 (cumulative) results across different datasets.

Figure 7: Comparing Sherlock, SATO, and C2 (accuracy of the top result).

Across all datasets, the top-3 candidates identified by C2 contain the correct concept with more than 80% accuracy for 5 out of 9 datasets, and on average the accuracy is 62.9% for the top result, 74.6% for either of the top-2 results, and 79.1% for any of the top-3. C2's performance on numerical attributes alone (in the SEM and OUR datasets) is 41.2%, 44.3%, and 45.1% for the top 1, 2, and 3 results, respectively.

Impact of Belief Sharing: When C2 is run with Belief Sharing turned off, its average accuracy drops to 51%, 64.2%, and 75.4% for the top 1, 2, and 3 results, respectively.

Impact of Concept Pair and Tuple Validation: When C2 is run with Concept Pair and Tuple Validation turned off, its accuracy drops to 60.6%, 71.8%, and 75.9% for the top 1, 2, and 3 results, respectively.

6.4 Comparison with Sherlock and SATO

For evaluating Sherlock and SATO, we utilized the respective pre-trained models provided by their authors. Figure 7 compares the accuracy of the top result from C2, Sherlock, and SATO (their APIs did not provide subsequent result choices). The accuracy of C2 is higher than both SATO and Sherlock for all datasets. On average, the accuracy of C2 is 62.9%, Sherlock's is 26.6%, and SATO's is 26%. With respect to the datasets, C2 provides the maximum benefit over the LIM and SEM datasets. This experiment validates the effectiveness of C2 in correctly identifying concepts for columns containing textual, categorical, and numerical values. C2 and SATO consume about the same order of time, while Sherlock takes about 6-7 times as much.

6.5 Comparison with Colnet

Colnet does not provide a pretrained model; training occurs afresh with respect to the entities provided in each test dataset, and classifiers are then built specifically for the classes in the test data. Training based on each test dataset required more than 24 hrs for a dataset containing 150 concepts. It trains a different classifier for each concept, considers pairs of cell values (this procedure generates a quadratic number of pairs for each column; for efficiency reasons, we considered 200 values of each column to generate pairs), and tests the input column against the classifier of every candidate concept. As a result, the inference process also takes a vast amount of time. As an example, it required more than 12 hrs to identify the concepts for the 411 columns in T2Dv2 (this time includes the lookup time to search cell values over the knowledge graph). We could only obtain results for 4 smaller datasets (LIM, SEM, T2Dv2, OUR) within a reasonable amount of time (2 days per dataset). Figure 8 compares the quality of the top 1 and 2 results. On average, for these 4 datasets, Colnet's top-1 and top-2 accuracies are 13.9% and 35.9%, respectively, compared to C2's 60% and 71.6%.

HNN, which is closely related to Colnet, takes even significantly more time than Colnet, as it considers pairs across columns as well. None of the test data ran on it within 48 hours.

Figure 8: Comparing C2 with Colnet (top 1 and top 2 results).

7 CONCLUSION

The approaches for annotating structured data so far have relied either on handcrafted rules or on gold standard training data, neither of which has demonstrated generalizability. Our method presented in this paper reflects a significant departure from both approaches. It is based on large, openly available, albeit somewhat noisy, data collected from a plethora of sources.
The main contribution presented in this paper is an effective maximum likelihood approach through ensembles, indirectly using the "wisdom of the crowds". This far out-scales and out-generalizes both the idea of hand-crafting each and every rule for pattern matching and that of building a classifier for each concept upon collecting massive amounts of gold standard training data. Our next steps are to (a) investigate more advanced likelihood estimation methods; (b) incorporate a better understanding of the relationships between output concepts; and (c) find ways to utilize supervised learning in the current framework.

REFERENCES
[1] 2020. Google Data Studio.
[2] 2020. Microsoft Power BI.
[3] 2020. Tableau software.
[4] 2020. Trifacta, Data Wrangling Tools and Software.
[5] Lars Arge and Jeffrey Scott Vitter. 1996. Optimal dynamic interval management in external memory. In Proceedings of the 37th Conference on Foundations of Computer Science. IEEE.
[6] Chandra Sekhar Bhagavatula, Thanapon Noraset, and Doug Downey. 2013. Methods for exploring and mining tables on Wikipedia. In Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics. 18-26.
[7] J. Chen, E. Jiménez-Ruiz, I. Horrocks, and C. Sutton. 2019. ColNet: Embedding the semantics of web tables for column type prediction. In AAAI.
[8] J. Chen, E. Jiménez-Ruiz, I. Horrocks, and C. Sutton. 2019. Learning Semantic Annotations for Tabular Data. arXiv preprint arXiv:1906.00781 (2019).
[9] Marco Cremaschi, Roberto Avogadro, and David Chieregato. 2019. MantisTable: an Automatic Approach for the Semantic Table Interpretation. In SemTab@ISWC. 15-24.
[10] D. Deng, Y. Jiang, G. Li, J. Li, and C. Yu. 2013. Scalable column concept determination for web tables using large knowledge bases. VLDB (2013).
[11] Thomas G. Dietterich. 2000. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems. Springer, 1-15.
[12] Vasilis Efthymiou, Oktie Hassanzadeh, Mariano Rodriguez-Muro, and Vassilis Christophides. 2017. Matching web tables with knowledge base entities: from entity lookups to entity embeddings. In International Semantic Web Conference. Springer, 260-277.
[13] Sainyam Galhotra and Udayan Khurana. 2020. Semantic Search over Structured Data. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management.
[14] Sainyam Galhotra, Udayan Khurana, Oktie Hassanzadeh, Kavitha Srinivas, and Horst Samulowitz. 2019. KAFE: Automated Feature Enhancement for Predictive Modeling using External Knowledge. (2019).
[15] Sainyam Galhotra, Udayan Khurana, Oktie Hassanzadeh, Kavitha Srinivas, Horst Samulowitz, and Miao Qi. 2019. Automated Feature Enhancement for Predictive Modeling using External Knowledge. In IEEE ICDM.
[16] Kevin Hu, Neil Gaikwad, Michiel Bakker, Madelon Hulsebos, Emanuel Zgraggen, César Hidalgo, Tim Kraska, Guoliang Li, Arvind Satyanarayan, and Çağatay Demiralp. 2019. VizNet: Towards a large-scale visualization learning and benchmarking repository. In ACM SIGCHI.
[17] M. Hulsebos, K. Hu, M. Bakker, E. Zgraggen, A. Satyanarayan, T. Kraska, Ç. Demiralp, and C. Hidalgo. 2019. Sherlock: A Deep Learning Approach to Semantic Data Type Detection. ACM SIGKDD (2019).
[18] Ernesto Jiménez-Ruiz, Oktie Hassanzadeh, Vasilis Efthymiou, Jiaoyan Chen, and Kavitha Srinivas. 2020. SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems. In European Semantic Web Conference. Springer, 514-530.
[19] Zhongjun Jin, Yeye He, and Surajit Chauduri. 2020. Auto-transform: learning-to-transform by patterns. Proceedings of the VLDB Endowment 13, 12 (2020), 2368-2381.
[20] Emilia Kacprzak, José M. Giménez-García, Alessandro Piscopo, Laura Koesten, Luis-Daniel Ibáñez, Jeni Tennison, and Elena Simperl. 2018. Making sense of numerical data: semantic labelling of web tables. In European Knowledge Acquisition Workshop. Springer, 163-178.
[21] Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. 2011. Wrangler: Interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 3363-3372.
[22] Oliver Lehmberg, Dominique Ritze, Robert Meusel, and Christian Bizer. 2016. A large public corpus of web tables containing time and context metadata. In Proceedings of the 25th International Conference Companion on World Wide Web. 75-76.
[23] Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. 2010. Annotating and searching web tables using entities, types and relationships. Proceedings of the VLDB Endowment 3, 1-2 (2010), 1338-1347.
[24] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111-3119.
[25] Hiroaki Morikawa. 2019. Semantic Table Interpretation using LOD4ALL. In SemTab@ISWC. 49-56.
[26] S. Neumaier, J. Umbrich, J. X. Parreira, and A. Polleres. 2016. Multi-level semantic labelling of numerical values. In ICWS.
[27] Phuc Nguyen, Natthawut Kertkeidkachorn, Ryutaro Ichise, and Hideaki Takeda. 2019. MTab: Matching tabular data to knowledge graph using probability models. arXiv preprint arXiv:1910.00246 (2019).
[28] Daniela Oliveira and Mathieu d'Aquin. 2019. ADOG: Annotating Data with Ontologies and Graphs. In SemTab@ISWC. 1-6.
[29] Erhard Rahm and Philip A. Bernstein. 2001. A survey of approaches to automatic schema matching. The VLDB Journal 10, 4 (2001), 334-350.
[30] Dominique Ritze, Oliver Lehmberg, and Christian Bizer. 2015. Matching HTML tables to DBpedia. In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics. 1-6.
[31] Jeffrey S. Simonoff. 2012. Smoothing Methods in Statistics. Springer Science & Business Media.
[32] Petros Venetis, Alon Y. Halevy, Jayant Madhavan, Marius Pasca, Warren Shen, Fei Wu, and Gengxin Miao. 2011. Recovering semantics of tables on the web. (2011).
[33] Dan Zhang, Yoshihiko Suhara, Jinfeng Li, Madelon Hulsebos, Çağatay Demiralp, and Wang-Chiew Tan. 2019. Sato: Contextual Semantic Type Detection in Tables. arXiv preprint arXiv:1911.06311 (2019).
[34] Yi Zhang and Zachary G. Ives. 2020. Finding Related Tables in Data Lakes for Interactive Data Science. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1951-1966.