Automatic Discovery of the Statistical Types of Variables in a Dataset
Isabel Valera 1  Zoubin Ghahramani 1 2

Abstract

A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of the variables in a dataset, and usually also the likelihood model, are known. However, as the availability of real-world data increases, this assumption becomes too restrictive. Data are often heterogeneous, complex, and improperly or incompletely documented. Surprisingly, despite their practical importance, there is still a lack of tools to automatically discover the statistical types of, as well as appropriate likelihood (noise) models for, the variables in a dataset. In this paper, we fill this gap by proposing a Bayesian method, which accurately discovers the statistical data types in both synthetic and real data.

1. Introduction

Data analysis problems often involve pre-processing raw data, which is a tedious and time-demanding task for several reasons: i) raw data are often unstructured and large-scale; ii) they contain errors and missing values; and iii) documentation may be incomplete or unavailable. As a consequence, as the availability of data increases, so does the interest of the data science community in automating this process. In particular, a growing body of work focuses on automating the different stages of data pre-processing, including data cleaning (Hellerstein, 2008), data wrangling (Kandel et al., 2011), and data integration and fusion (Dong & Srivastava, 2013).

The outcome of data pre-processing is commonly a structured dataset, in which the objects are described by a set of attributes. However, before proceeding to the predictive analytics step of the data analysis process, the data scientist often needs to identify which kind of variables (i.e., real-valued, categorical, ordinal, etc.) these attributes represent. This labeling of the data is necessary to select the appropriate machine learning approach to explore, find patterns or make predictions on the data. As an example, a prediction task is solved differently depending on the kind of data to be predicted: while prediction on categorical variables is usually formulated as a classification task, for ordinal variables it is formulated as an ordinal regression problem (Agresti, 2010). Moreover, different data types should be pre-processed and input differently into the predictive tool: categorical inputs are often transformed into as many binary inputs (which state whether the object belongs to a category or not) as there are categories; positive real inputs might be log-transformed; etc.

Information on the statistical data types in a dataset becomes particularly important in the context of statistical machine learning (Breiman, 2001), where the choice of a likelihood model appears as a main assumption. Although extensive work has focused on model selection (Ando, 2010; Burnham & Anderson, 2003), the likelihood model is usually assumed to be known and fixed. As an example, a common approach is to model continuous data as Gaussian variables, and discrete data as categorical variables. However, while extensive work has shown the advantages of capturing the statistical properties of the observed data in the likelihood model (Chu & Ghahramani, 2005a; Schmidt et al., 2009; Hilbe, 2011; Valera & Ghahramani, 2014), there is still a lack of tools to automatically perform likelihood model selection or, equivalently, to discover the most plausible statistical type of the variables directly from the data.

In this work, we aim to fill this gap by proposing a general and scalable Bayesian method to solve this task. The proposed method exploits the latent structure in the data to automatically distinguish among real-valued, positive real-valued and interval data as types of continuous variables, and among categorical, ordinal and count data as types of discrete variables.
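The type-dependent pre-processing mentioned above (one binary indicator per category for categorical inputs, a log transform for positive reals) can be sketched as follows. This is only an illustration of standard practice; the function names and error handling are ours, not part of the paper's method.

```python
import numpy as np

def one_hot(column):
    """Expand a categorical column into one binary indicator per category."""
    categories = sorted(set(column))
    return np.array([[1 if value == c else 0 for c in categories]
                     for value in column])

def log_transform(column):
    """Map strictly positive real values onto the real line."""
    x = np.asarray(column, dtype=float)
    if (x <= 0).any():
        raise ValueError("log transform assumes strictly positive inputs")
    return np.log(x)

# One row per object, one binary column per category (here: blue, green, red).
encoded = one_hot(["red", "green", "red", "blue"])
transformed = log_transform([0.5, 1.0, 8.0])
```

Note that both transforms presuppose that the attribute's type is already known, which is exactly the labeling step this paper aims to automate.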
1 University of Cambridge, Cambridge, United Kingdom; 2 Uber AI Labs, San Francisco, California, USA. Correspondence to: Isabel Valera <[email protected]>. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

The proposed method is based on probabilistic modeling and exploits the following key ideas:

i) There exists a latent structure in the data that captures the statistical dependencies among the different objects and attributes in the dataset. Here, as in standard latent feature modeling, we assume that we can capture this structure by a low-rank representation, such that, conditioning on it, the likelihood model factorizes in both the number of objects and the number of attributes.

ii) The observation model for each attribute can be expressed as a mixture of likelihood models, one per considered data type, where the inferred weight associated to each likelihood model captures the probability of the attribute belonging to the corresponding data type.

We derive an efficient MCMC inference algorithm to jointly infer both the low-rank representation and the weight of each likelihood model for each attribute in the observed data. Our experimental results show that the proposed method accurately discovers the true data type of the variables in a dataset and, by doing so, fits the data substantially better than modeling continuous data as Gaussian variables and discrete data as categorical variables.

2. Problem Statement

As stated above, the outcome of the pre-processing step of data analysis is a structured dataset, in which a set of objects are described by a set of attributes, and our objective is to automatically discover which type of variables these attributes correspond to. In order to distinguish between discrete and continuous variables, we can apply simple logic rules, e.g., count the number of unique values that the attribute takes and how many times each of these values is observed. Moreover, binary variables are invariant to the labeling of the categories, and therefore both categorical and ordinal models are equivalent in this case. Similarly, distinguishing between ordinal and count data would require exact information on whether consecutive values are equidistant; however, this information depends on how the data have been gathered. For example, an attribute that collects information on the "frequency of an action" corresponds to an ordinal variable if its categories belong to, e.g., {"never", "sometimes", "usually", "often"}, and to a count variable if it takes values in {"0 times per week", "1 time per week", ...}.

Previous work (Hernandez-Lobato et al., 2014) proposed to distinguish between categorical and ordinal data by comparing the model evidence and the predictive test log-likelihood of ordinal and categorical models. However, this approach can only be used to distinguish between ordinal and categorical data, and it does so by assuming access to a real-valued variable that contains information about the presence of an ordering in the observed discrete (ordinal or categorical) variable. As a consequence, it cannot be easily generalized to label the data type of all the variables (or attributes) in a dataset. In contrast, in this paper we propose a general method that allows us to distinguish among real-valued, positive real-valued and interval data as types of continuous variables, and among categorical, ordinal and count data as types of discrete variables. Moreover, the general framework we present can be readily extended to other data types as needed.
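The simple logic rules mentioned above can be made concrete as follows. This is only an illustrative heuristic with arbitrary thresholds (the function names and cutoffs are ours), not the Bayesian method proposed in the paper; as discussed next, such heuristics cannot resolve the finer distinctions among types.

```python
import numpy as np

def looks_discrete(column, max_unique=20, min_repeats=2):
    """Heuristic: treat an attribute as discrete if its observed values are
    integers and it takes few unique values, each observed repeatedly.
    The thresholds are illustrative, not from the paper."""
    x = np.asarray(column, dtype=float)
    unique_values = np.unique(x)
    integer_valued = np.allclose(x, np.round(x))
    few_unique = len(unique_values) <= max_unique
    repeated = len(x) / len(unique_values) >= min_repeats
    return bool(integer_valued and few_unique and repeated)

def is_binary(column):
    """Binary attributes make categorical and ordinal models equivalent."""
    return len(set(column)) == 2
```

Such rules can separate discrete from continuous attributes, but they cannot tell a Beta-distributed variable from a Gamma-distributed one, or a categorical variable from an ordinal one.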
However, distinguishing among the different types of discrete and continuous variables cannot be easily solved using simple heuristics. In the context of continuous variables, given the finite size of observed datasets, it is complicated to identify whether a variable may take values on the entire real line, or only on an interval of it, e.g., (0, ∞) or (θ_L, θ_H). In other words, due to the finite observation sample, we cannot distinguish whether the data distribution has an infinite tail that we have not observed, or whether its support is limited to an interval. As an illustrative example, Figures 2(d) & (f) in Section 4 show two data distributions that, although similar at first sight, correspond respectively to a Beta variable, which therefore takes values in the interval (0, 1), and a Gamma variable, which takes values in (0, ∞).

In the context of discrete data, it is impossible to tell the difference between categorical and ordinal variables in isolation. The presence of an order in the data only makes sense given a context. As an example, while colors in M&Ms usually do not present an order, colors in a traffic light clearly do.

3. Methodology

In this section, we introduce a Bayesian method to determine the statistical type of variable that corresponds to each of the attributes describing the objects in an observation matrix X. In particular, we propose a probabilistic model in which we assume that there exists a low-rank representation of the data that captures its latent structure and, therefore, the statistical dependencies among its objects and attributes. In detail, we consider that each observation x_n^d can be explained by a K-length vector of latent variables z_n = [z_n1, ..., z_nK] associated to the n-th object and a weighting vector b^d = [b_1^d, ..., b_K^d] (with K being the number of latent variables), whose elements b_k^d weight the contribution of the k-th latent feature to the d-th attribute in X. Then, given the latent low-rank representation of the data, the attributes describing the objects in a dataset are assumed to be independent, i.e.,

p(X | Z, {b^d}_{d=1}^D) = ∏_{d=1}^D p(x^d | Z, b^d),
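As a numerical illustration of this conditional-independence statement, the following sketch verifies that, given Z and the weighting vectors b^d, a joint log-likelihood decomposes into a sum of per-attribute terms. The Gaussian likelihood and all dimensions here are our assumptions for illustration only; the full model instead expresses each factor p(x^d | Z, b^d) as a mixture of likelihoods over the candidate data types, with weights inferred by MCMC.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D = 100, 3, 5                        # objects, latent features, attributes

Z = rng.normal(size=(N, K))                # latent low-rank representation
B = rng.normal(size=(K, D))                # columns are weighting vectors b^d
X = Z @ B + 0.1 * rng.normal(size=(N, D))  # noisy observations

def gaussian_loglik(x, mean, sigma=0.1):
    """Log-density of independent Gaussians, summed over all entries."""
    return -0.5 * np.sum(((x - mean) / sigma) ** 2
                         + np.log(2 * np.pi * sigma ** 2))

# Conditioned on Z, the likelihood factorizes over attributes:
#   log p(X | Z, {b^d}) = sum_d log p(x^d | Z, b^d)
joint = gaussian_loglik(X, Z @ B)
per_attribute = sum(gaussian_loglik(X[:, d], Z @ B[:, d]) for d in range(D))
assert np.isclose(joint, per_attribute)
```

This factorization is what makes the method scalable: inference can treat each attribute's likelihood (and its per-type mixture weights) separately once the shared low-rank representation is conditioned on.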