Automatic Discovery of the Statistical Types of Variables in a Dataset

Isabel Valera 1 Zoubin Ghahramani 1 2

¹University of Cambridge, Cambridge, United Kingdom; ²Uber AI Labs, San Francisco, California, USA. Correspondence to: Isabel Valera.
Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017. JMLR: W&CP. Copyright 2017 by the author(s).

Abstract

A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of variables, and usually also the likelihood model, are known. However, as the availability of real-world data increases, this assumption becomes too restrictive. Data are often heterogeneous, complex, and improperly or incompletely documented. Surprisingly, despite their practical importance, there is still a lack of tools to automatically discover the statistical types of, as well as appropriate likelihood (noise) models for, the variables in a dataset. In this paper, we fill this gap by proposing a Bayesian method, which accurately discovers the statistical data types in both synthetic and real data.

1. Introduction

Data analysis problems often involve pre-processing raw data, which is a tedious and time-demanding task for several reasons: i) raw data is often unstructured and large-scale; ii) it contains errors and missing values; and iii) documentation may be incomplete or not available. As a consequence, as the availability of data increases, so does the interest of the data science community in automating this process. In particular, there is a growing body of work that focuses on automating the different stages of data pre-processing, including data cleaning (Hellerstein, 2008), data wrangling (Kandel et al., 2011) and data integration and fusion (Dong & Srivastava, 2013).

The outcome of data pre-processing is commonly a structured dataset, in which the objects are described by a set of attributes. However, before being able to proceed with the predictive analytics step of the data analysis process, the data scientist often needs to identify which kind of variables (i.e., real-valued, categorical, ordinal, etc.) these attributes represent. This labeling of the data is necessary to select the appropriate machine learning approach to explore, find patterns or make predictions on the data. As an example, a prediction task is solved differently depending on the kind of data to be predicted: while prediction on categorical variables is usually formulated as a classification task, in the case of ordinal variables it is formulated as an ordinal regression problem (Agresti, 2010). Moreover, different data types should be pre-processed and input differently into the predictive tool; e.g., categorical inputs are often transformed into as many binary inputs (which state whether the object belongs to a category or not) as there are categories, positive real inputs might be log-transformed, etc.

Information on the statistical data types in a dataset becomes particularly important in the context of statistical machine learning (Breiman, 2001), where the choice of a likelihood model appears as a main assumption. Although extensive work has focused on model selection (Ando, 2010; Burnham & Anderson, 2003), the likelihood model is usually assumed to be known and fixed. As an example, a common approach is to model continuous data as Gaussian variables, and discrete data as categorical variables. However, while extensive work has shown the advantages of capturing the statistical properties of the observed data in the likelihood model (Chu & Ghahramani, 2005a; Schmidt et al., 2009; Hilbe, 2011; Valera & Ghahramani, 2014), there still exists a lack of tools to automatically perform likelihood model selection, or equivalently to discover the most plausible statistical type of the variables in the data, directly from the data.

In this work, we aim to fill this gap by proposing a general and scalable Bayesian method to solve this task. The proposed method exploits the latent structure in the data to automatically distinguish among real-valued, positive real-valued and interval data as types of continuous variables, and among categorical, ordinal and count data as types of discrete variables. The proposed method is based on probabilistic modeling and exploits the following key ideas:

i) There exists a latent structure in the data that captures the statistical dependencies among the different objects and attributes in the dataset. Here, as in standard latent feature modeling, we assume that we can capture this structure by a low-rank representation, such that conditioning on it, the likelihood model factorizes for both the number of objects and the number of attributes.

ii) The observation model for each attribute can be expressed as a mixture of likelihood models, one per considered data type, where the inferred weight associated to a likelihood model captures the probability of the attribute belonging to the corresponding data type.

We derive an efficient MCMC inference algorithm to jointly infer both the low-rank representation and the weight of each likelihood model for each attribute in the observed data. Our experimental results show that the proposed method accurately discovers the true data type of the variables in a dataset, and by doing so, it fits the data substantially better than modeling continuous data as Gaussian variables and discrete data as categorical variables.

2. Problem Statement

As stated above, the outcome of the pre-processing step of data analysis is a structured dataset, in which a set of objects are defined by a set of attributes, and our objective is to automatically discover which type of variables these attributes correspond to. In order to distinguish between discrete and continuous variables, we can apply simple logic rules, e.g., count the number of unique values that the attribute takes and how many times we observe each of them. Moreover, binary variables are invariant to the labeling of the categories, and therefore, both categorical and ordinal models are equivalent in this case. However, distinguishing among different types of discrete and continuous variables cannot be easily solved using simple heuristics.
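As a concrete illustration of such simple logic rules, the following minimal sketch (ours, not part of the original method; the 5% unique-value threshold is an arbitrary assumption chosen for illustration) flags an attribute as discrete when all of its observed values are integers and the number of distinct values is small relative to the sample size:

```python
import numpy as np

def looks_discrete(x, max_unique_ratio=0.05):
    """Heuristically flag an attribute as discrete rather than continuous.

    max_unique_ratio is an arbitrary threshold used only for illustration.
    """
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]                          # ignore missing entries
    all_integer = np.all(np.mod(x, 1) == 0)      # every observed value is an integer
    few_unique = np.unique(x).size <= max_unique_ratio * x.size
    return bool(all_integer and few_unique)
```

Such a rule can separate discrete from continuous attributes, but, as discussed next, it cannot resolve the finer distinctions among continuous types or among discrete types.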
In the context of continuous variables, given the finite size of observed datasets, it is complicated to identify whether a variable may take values in the entire real line, or only on an interval of it, e.g., (0, ∞) or (θ_L, θ_H). In other words, due to the finite observation sample, we cannot distinguish whether the data distribution has an infinite tail that we have not observed, or whether its support is limited to an interval. As an illustrative example, Figures 2(d) and (f) in Section 4 show two data distributions that, although at first sight they look similar, correspond respectively to a Beta variable, which therefore takes values in the interval (0, 1), and a Gamma variable, which takes values in (0, ∞).

In the context of discrete data, it is impossible to tell the difference between categorical and ordinal variables in isolation. The presence of an order in the data only makes sense given a context. As an example, while colors of M&Ms usually do not present an order, colors in a traffic light clearly do. Similarly, we cannot easily distinguish between ordinal data (which take values in a finite ordered set) and count data (which take values in an infinite ordered set with equidistant values) for two main reasons. First, similarly to continuous variables, since datasets contain a finite number of examples, it is difficult to tell whether we have observed the finite set of possible values of a variable, or simply a finite subsample of an infinite set. Second, we would need access to exact information on whether its consecutive values are equidistant or not; however, this information depends on how the data have been gathered. For example, an attribute that collects information on the "frequency of an action" will correspond to an ordinal variable if its categories belong to, e.g., {"never", "sometimes", "usually", "often"}, and to a count variable if it takes values in {"0 times per week", "1 time per week", ...}.

Previous work (Hernandez-Lobato et al., 2014) proposed to distinguish between categorical and ordinal data by comparing the model evidence and the predictive test log-likelihood of ordinal and categorical models. However, this approach can only be used to distinguish between ordinal and categorical data, and it does so by assuming that it has access to a real-valued variable that contains information about the presence of an ordering in the observed discrete (ordinal or categorical) variable. As a consequence, it cannot be easily generalized to label the data type of all the variables (or attributes) in a dataset. In contrast, in this paper we propose a general method that allows us to distinguish among real-valued, positive real-valued and interval data as types of continuous variables, and among categorical, ordinal and count data as types of discrete variables. Moreover, the general framework we present can be readily extended to other data types as needed.

3. Methodology

In this section, we introduce a Bayesian method to determine the statistical type of variable that corresponds to each of the attributes describing the objects in an observation matrix X. In particular, we propose a probabilistic model in which we assume that there exists a low-rank representation of the data that captures its latent structure, and therefore, the statistical dependencies among its objects and attributes. In detail, we consider that each observation x_n^d can be explained by a K-length vector of latent variables z_n = [z_n1, ..., z_nK] associated to the n-th object and a weighting vector b^d = [b_1^d, ..., b_K^d] (with K being the number of latent variables), whose elements b_k^d weight the contribution of the k-th latent feature to the d-th attribute in X. Then, given the latent low-rank representation of the data, the attributes describing the objects in a dataset are assumed to be independent, i.e.,

p(X | Z, {b^d}_{d=1}^D) = \prod_{d=1}^{D} p(x^d | Z, b^d),

where we gather the latent feature vectors z_n in an N × K matrix Z. For convenience, here z_n is a K-length row vector, while b^d is a K-length column vector. The above model resembles standard latent feature models (Salakhutdinov & Mnih, 2007; Griffiths & Ghahramani, 2011), which assume known and fixed likelihood models p(x^d | Z, b^d).

In contrast, in this paper we aim to infer the statistical data type (or equivalently, the likelihood model) that better captures the distribution of each attribute in X. To this end, here we assume that the likelihood model of the d-th attribute in X is a mixture of likelihood functions such that

p(x^d | Z, {b_ℓ^d}_{ℓ∈L_d}) = \sum_{ℓ∈L_d} w_ℓ^d p_ℓ(x^d | Z, b_ℓ^d),

where L_d is the set of possible types of variables (or equivalently, likelihood models) to be considered for this attribute, and the weight w_ℓ^d captures the probability of the likelihood function ℓ for the d-th attribute of the observation matrix X. Note that the above expression is a valid likelihood model as long as \sum_{ℓ∈L_d} w_ℓ^d = 1 and each p_ℓ(x^d | Z, b_ℓ^d, Ψ_ℓ^d) is a normalized probability density function or probability mass function for, respectively, continuous and discrete variables. Hence, under the proposed model, which is illustrated in Figure 1a, the likelihood factorizes as

p(X | Z, {b_ℓ^d}) = \prod_{d=1}^{D} \sum_{ℓ∈L_d} w_ℓ^d p_ℓ(x^d | Z, b_ℓ^d).   (1)

Figure 1. Model illustration: (a) proposed model; (b) alternative representation.

We place a Dirichlet prior distribution on the likelihood weights w^d = [w_ℓ^d]_{ℓ∈L_d} and, similarly to (Salakhutdinov & Mnih, 2007), assume that both the latent feature vectors z_n and the weighting vectors b_ℓ^d are Gaussian distributed with zero mean and covariance matrices σ_z² I and σ_b² I, respectively. Here, I denotes the identity matrix of size equal to the number of latent features K.

Moreover, we consider the following types of data for, respectively, continuous and discrete variables:

Continuous variables:
1. Real-valued data, which takes values in the real line, i.e., x_n^d ∈ ℝ.
2. Positive real-valued data, which takes values in the positive real line, i.e., x_n^d ∈ ℝ+.
3. Interval data, which takes values in an interval of the real line, i.e., x_n^d ∈ (θ_L, θ_H), where θ_L, θ_H ∈ ℝ and θ_L ≤ θ_H.

Discrete variables:
1. Categorical data, which takes values in a finite unordered set, i.e., x_n^d ∈ {1, ..., R_d}.
2. Ordinal data, which takes values in a finite ordered set, i.e., x_n^d ∈ {1, ..., R_d}.
3. Count data, which takes non-negative integer values, i.e., x_n^d ∈ {0, ..., ∞}.

Inference in this model poses two main challenges. First, we need to jointly infer the low-rank representation of the data (which includes the latent feature matrix Z and the corresponding weighting vectors {b_ℓ^d}_{ℓ∈L_d, d=1,...,D}) and the likelihood weights {w^d}_{d=1}^D. Second, we need to do so given a heterogeneous (and non-conjugate) observation model, which combines D different likelihood models, each of them corresponding to a mixture of likelihood functions and coupled through the latent feature matrix Z. Additionally, these likelihood functions do not only correspond to either a probability density function or a probability mass function depending on whether we are dealing with a continuous or a discrete variable, but also each mixture combines likelihood functions with different supports. For example, while real-valued data lead to a likelihood function with the real line as support, interval data only accounts for a segment of the real line. Similarly, both categorical and ordinal data assume a finite support, while count data requires an infinite-support likelihood function.

In order to allow for efficient inference, we exploit the key idea in (Valera & Ghahramani, 2014) to propose an alternative and equivalent model representation (shown in Figure 1b), which efficiently deals with heterogeneous likelihood functions. In this alternative model representation, we include for each observation x_n^d as many Gaussian variables (or pseudo-observations) y_nℓ^d ∼ N(z_n b_ℓ^d, σ_y²) as the number of likelihood functions in L_d, and assume that there exists a transformation function over the variables y_nℓ^d which maps the real line into the support Ω_ℓ of the likelihood function ℓ, i.e.,

f_ℓ: ℝ → Ω_ℓ.   (2)
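To make the role of the weights in Eq. (1) concrete, the following sketch (our illustration, not code from the paper) evaluates the mixture likelihood of a single attribute, with the mixture applied per observation, which matches the per-observation type assignments s_n^d used for inference in Section 3.2; the per-type log-likelihood functions are passed in as placeholders:

```python
import numpy as np

def attribute_log_likelihood(x_d, Z, b_d, w_d, log_lik_fns):
    """log p(x^d | Z, {b_l^d}, w^d) with the type mixture applied per observation.

    x_d         : (N,) observations of attribute d
    Z           : (N, K) latent feature matrix
    b_d         : dict mapping type name -> (K,) weighting vector b_l^d
    w_d         : dict mapping type name -> mixture weight w_l^d (summing to one)
    log_lik_fns : dict mapping type name -> function (x_d, Z, b) -> (N,) log p_l
    """
    types = list(w_d)
    per_type = np.stack([np.log(w_d[t]) + log_lik_fns[t](x_d, Z, b_d[t])
                         for t in types])              # (L, N)
    m = per_type.max(axis=0)                           # log-sum-exp over types
    return float(np.sum(m + np.log(np.exp(per_type - m).sum(axis=0))))
```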

3.1. Likelihood functions

In this section, we provide the set of transformations that map from the Gaussian pseudo-observations y_nℓ^d into the types of data defined above, specifying also the six likelihood functions that our method accounts for.

3.1.1. Continuous Variables

In the case of continuous variables, we assume that the mapping functions f_ℓ are continuous, invertible and differentiable, such that we can obtain the corresponding likelihood function (after integrating out the pseudo-observation y_nℓ^d and the auxiliary Gaussian noise u, with u ∼ N(0, σ_u²)) as

p_ℓ(x_n^d | z_n, b_ℓ^d, s_n^d = ℓ) = \frac{1}{\sqrt{2π(σ_y² + σ_u²)}} \left| \frac{d f_ℓ^{-1}(x_n^d)}{d x_n^d} \right| \exp\!\left( −\frac{(f_ℓ^{-1}(x_n^d) − z_n b_ℓ^d)²}{2(σ_y² + σ_u²)} \right),

where f_ℓ^{-1} is the inverse function of the transformation f_ℓ(·), i.e., f_ℓ^{-1}(f_ℓ(v)) = v. Next, we provide examples of mapping functions that allow us to account for real-valued, positive real-valued, and interval data.

1. Real-valued Data. In order to obtain real-valued observations, i.e., x_n^d ∈ ℝ, we need a transformation that maps from the real numbers to the real numbers, i.e., f_ℝ: ℝ → ℝ. The simplest case is to assume that x = f_ℝ(y + u) = y + u, and therefore, each observation is distributed as x_n^d ∼ N(z_n b_ℝ^d, σ_y² + σ_u²). Nevertheless, other mapping functions can be used; e.g., in our experiments we use the transformation

x = f_ℝ(y + u) = w(y + u) + μ,

where w and μ are parameters allowing attribute rescaling, tuneable by the user.

2. Positive Real-valued Data. As an example of a function that maps from the real numbers to the positive real numbers, i.e., f_ℝ+: ℝ → ℝ+, we consider

x = f_ℝ+(y + u) = log(1 + exp(w(y + u))),

where w allows attribute rescaling.

3. Interval Data. As an example of a function that maps from the real numbers into the interval (θ_L, θ_H), i.e., f_Int: ℝ → (θ_L, θ_H), we consider the transformation

x = f_Int(y + u) = \frac{θ_H − θ_L}{1 + \exp(−w(y + u))} + θ_L,

where w, θ_L and θ_H are user hyperparameters. In our experiments, we assume θ_L = min_n(x_n^d) − ε and θ_H = max_n(x_n^d) + ε, where ε → 0 is a user hyperparameter, and we set the rescaling parameter w = 2/max(x^d) for the three continuous data types.
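The three continuous mappings above, together with the change-of-variables likelihood they induce, can be sketched as follows (our illustration under the stated choices of f_ℓ; only the inverse and its Jacobian for the positive real-valued case are spelled out):

```python
import numpy as np

# transformations x = f_l(y + u); w, mu, theta_L, theta_H are the user-set
# rescaling parameters described in the text
def f_real(v, w=1.0, mu=0.0):                 # R -> R
    return w * v + mu

def f_pos(v, w=1.0):                          # R -> R+ (softplus)
    return np.log1p(np.exp(w * v))

def f_int(v, theta_L, theta_H, w=1.0):        # R -> (theta_L, theta_H)
    return (theta_H - theta_L) / (1.0 + np.exp(-w * v)) + theta_L

# inverse and derivative of the inverse for the positive real-valued mapping
def f_pos_inv(x, w=1.0):
    return np.log(np.expm1(x)) / w

def f_pos_inv_grad(x, w=1.0):
    return 1.0 / (w * (1.0 - np.exp(-x)))

def log_lik_continuous(x, f_inv, f_inv_grad, mean, var):
    """log p_l(x | z_n, b_l^d): Gaussian log-density of f_l^{-1}(x) with mean z_n b_l^d
    and variance sigma_y^2 + sigma_u^2, plus the log-Jacobian |d f_l^{-1}(x) / dx|."""
    v = f_inv(x)
    return (-0.5 * np.log(2.0 * np.pi * var)
            - 0.5 * (v - mean) ** 2 / var
            + np.log(np.abs(f_inv_grad(x))))
```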
3.1.2. Discrete Variables

1. Categorical Data. Now we account for categorical observations, i.e., each observation x_n^d can take values in the unordered index set {1, ..., R_d}. Hence, assuming a multinomial probit model, we can write

x_n^d = f_cat(y_ncat^d) = \arg\max_{r ∈ \{1,...,R_d\}} y_ncat^d(r),

where in this case there are as many pseudo-observations as number of categories, and each pseudo-observation can be sampled as y_ncat^d(r) ∼ N(z_n b_cat^d(r), σ_y²), where b_cat^d(r) denotes the K-length weighting vector which weights the influence of the latent features for a categorical observation x_n^d taking value r. Note that, under this likelihood model, we need one pseudo-observation y_ncat^d(r) and one weighting vector b_cat^d(r) for each possible value of the observation r ∈ {1, ..., R_d}.

Under the multinomial probit model, we can obtain the probability of x_n^d taking value r ∈ {1, ..., R_d} as (Girolami & Rogers, 2005)

p_cat(x_n^d = r | z_n, b_cat^d, s_n^d = cat) = E_{p(u)}\!\left[ \prod_{r'=1, r'≠r}^{R_d} Φ\!\left( u + z_n (b_cat^d(r) − b_cat^d(r')) \right) \right],

where Φ(·) denotes the cumulative density function of the standard normal distribution and E_{p(u)}[·] denotes the expectation with respect to the distribution p(u) = N(0, σ_y²).

2. Ordinal Data. Consider ordinal data, in which each element x_n^d takes values in the ordered index set {1, ..., R_d}. Then, assuming an ordered probit model, we can write

x_n^d = f_ord(y_nord^d) =
  1    if y_nord^d ≤ θ_1^d
  2    if θ_1^d < y_nord^d ≤ θ_2^d
  ...
  R_d  if θ_{R_d−1}^d < y_nord^d,

where again y_nord^d is Gaussian distributed with mean z_n b_ord^d and variance σ_y², and θ_r^d for r ∈ {1, ..., R_d − 1} are the thresholds that divide the real line into R_d regions. We assume the thresholds θ_r^d are sequentially generated from the truncated Gaussian distribution θ_r^d ∼ TN(0, σ_θ², θ_{r−1}^d, ∞), where θ_0^d = −∞ and θ_{R_d}^d = +∞. As opposed to the categorical case, now we have a unique weighting vector b_ord^d and a unique Gaussian variable y_nord^d for each observation x_n^d, and the value of x_n^d is determined by the region in which y_nord^d falls.

Under the ordered probit model (Chu & Ghahramani, 2005b), the probability of each element x_n^d taking value r ∈ {1, ..., R_d} can be written as

p_ord(x_n^d = r | z_n, b_ord^d, s_n^d = ord) = Φ\!\left( \frac{θ_r^d − z_n b_ord^d}{σ_y} \right) − Φ\!\left( \frac{θ_{r−1}^d − z_n b_ord^d}{σ_y} \right).
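For instance, the ordered-probit probability above can be evaluated directly from the Gaussian CDF. A minimal sketch (ours, with σ_y treated as a free parameter):

```python
import numpy as np
from scipy.stats import norm

def ordinal_prob(r, z_n, b_ord, thresholds, sigma_y=1.0):
    """p_ord(x_n^d = r | z_n, b_ord^d): Gaussian mass of the region (theta_{r-1}, theta_r].

    thresholds = [theta_1, ..., theta_{R-1}]; theta_0 = -inf and theta_R = +inf.
    """
    t = np.concatenate(([-np.inf], np.asarray(thresholds, dtype=float), [np.inf]))
    mean = float(np.dot(z_n, b_ord))
    return norm.cdf((t[r] - mean) / sigma_y) - norm.cdf((t[r - 1] - mean) / sigma_y)
```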

3. Count Data. In count data each observation x_n^d takes non-negative integer values, i.e., x_n^d ∈ {0, ..., ∞}. Then, we assume

x_n^d = f_count(y_n^d) = ⌊g(y_n^d)⌋,

where ⌊v⌋ returns the floor of v, that is, the largest integer that does not exceed v, and g: ℝ → ℝ+ is a monotonic differentiable function; in our experiments g(y) = log(1 + exp(wy)). We can thus write the likelihood function as

p_count(x_n^d | z_n, b_count^d, s_n^d = count) = Φ\!\left( \frac{g^{−1}(x_n^d + 1) − z_n b_count^d}{σ_y} \right) − Φ\!\left( \frac{g^{−1}(x_n^d) − z_n b_count^d}{σ_y} \right),

where g^{−1}: ℝ+ → ℝ is the inverse function of the transformation g(·).

Algorithm 1 Inference Algorithm.
Input: X
Initialize: S, {b_ℓ^d} and {y_nℓ^d}
 1: for each iteration do
 2:   Update Z given {b_ℓ^d} and {y_nℓ^d}
 3:   for d = 1, ..., D do
 4:     for ℓ ∈ L_d do
 5:       for n = 1, ..., N do
 6:         Sample y_nℓ^d given x_n^d, Z, {b_ℓ^d} and s_n^d
 7:       end for
 8:       Sample b_ℓ^d given Z and {y_nℓ^d}
 9:     end for
10:     for n = 1, ..., N do
11:       Sample s_n^d given x_n^d, Z and {b_ℓ^d}
12:     end for
13:     Sample w^d given S
14:   end for
15: end for
Output: Likelihood weights {w^d}

3.2. Inference Algorithm

Here, we exploit the model representation in Figure 1b to derive an efficient inference algorithm that allows us to infer all the latent variables in the model, providing as output the likelihood weights w^d, which determine the probability of the d-th attribute in X belonging to each of the above data types. Algorithm 1 summarizes the inference.

Sampling the low-rank decomposition. In order to sample the latent feature matrix Z and the associated weighting vectors {b_ℓ^d}, we condition on the pseudo-observations, such that we can efficiently sample the feature vectors as z_n ∼ N(μ_z, Σ_z), where

Σ_z = \left( \sum_{d=1}^{D} \sum_{ℓ∈L_d} b_ℓ^d (b_ℓ^d)^⊤ + σ_z^{−2} I \right)^{−1}  and  μ_z = Σ_z \left( \sum_{d=1}^{D} \sum_{ℓ∈L_d} b_ℓ^d y_{nℓ}^d \right).

Note that this step involves a matrix inversion of size K (the number of latent features) per iteration of the algorithm. Similarly, the weighting vectors can be sampled as b_ℓ^d ∼ N(μ_ℓ^d, Σ_b), where

Σ_b = \left( σ_y^{−2} Z^⊤ Z + σ_b^{−2} I \right)^{−1}  and  μ_ℓ^d = Σ_b \left( \sum_{n=1}^{N} z_n^⊤ y_{nℓ}^d \right).

Since Σ_b is shared by all {b_ℓ^d} with ℓ ∈ L_d and d = 1, ..., D, this step also involves a single matrix inversion of size K per iteration of the algorithm.
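The two Gaussian conditionals above are standard Bayesian linear-Gaussian updates. A minimal sketch (ours, not the authors' implementation) with the pseudo-observations and weighting vectors stacked into matrices Y (N × M) and B (M × K), one column/row per attribute-likelihood pair, and with the noise variance σ_y² kept explicit in both updates:

```python
import numpy as np

def sample_Z(Y, B, sigma_y=1.0, sigma_z=1.0, rng=None):
    """Draw Z | Y, B. Y: (N, M) pseudo-observations, B: (M, K) weighting vectors."""
    rng = rng or np.random.default_rng()
    K = B.shape[1]
    Sigma = np.linalg.inv(B.T @ B / sigma_y**2 + np.eye(K) / sigma_z**2)  # shared (K, K)
    Mu = (Y @ B / sigma_y**2) @ Sigma                                     # (N, K) row-wise means
    L = np.linalg.cholesky(Sigma)
    return Mu + rng.standard_normal(Mu.shape) @ L.T

def sample_B(Y, Z, sigma_y=1.0, sigma_b=1.0, rng=None):
    """Draw B | Y, Z; the posterior covariance is shared across all weighting vectors."""
    rng = rng or np.random.default_rng()
    K = Z.shape[1]
    Sigma = np.linalg.inv(Z.T @ Z / sigma_y**2 + np.eye(K) / sigma_b**2)  # shared (K, K)
    Mu = (Y.T @ Z / sigma_y**2) @ Sigma                                   # (M, K) row-wise means
    L = np.linalg.cholesky(Sigma)
    return Mu + rng.standard_normal(Mu.shape) @ L.T
```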
Sampling the pseudo-observations. Given the low-rank decomposition and the likelihood assignments S, we sample each pseudo-observation y_nℓ^d from its prior distribution if s_n^d ≠ ℓ, and from its posterior distribution if s_n^d = ℓ.

In the case of continuous variables, the posterior distribution of the pseudo-observation can be obtained as

p(y_nℓ^d | x_n^d, z_n, b_ℓ^d, s_n^d = ℓ) = N(y_nℓ^d | μ̂_y, σ̂_y²),

where μ̂_y = \left( \frac{z_n b_ℓ^d}{σ_y²} + \frac{f_ℓ^{−1}(x_n^d)}{σ_u²} \right) σ̂_y² and σ̂_y² = \left( \frac{1}{σ_y²} + \frac{1}{σ_u²} \right)^{−1}.

In the case of discrete variables, the posterior distribution of the pseudo-observation can be computed as follows.

1. For categorical observations:

p(y_ncat^d(r) | x_n^d = T, z_n, b_cat^d, s_n^d = cat) =
  TN( z_n b_cat^d(r), σ_y², max_{j≠r}(y_ncat^d(j)), ∞ )   if r = T,
  TN( z_n b_cat^d(r), σ_y², −∞, y_ncat^d(T) )             if r ≠ T.

In words, if x_n^d = T = r we sample y_ncat^d(r) from a truncated normal distribution with mean z_n b_cat^d(r), variance σ_y², truncated on the left by max_{j≠r}(y_ncat^d(j)). Otherwise, we sample from a truncated Gaussian (with the same mean and variance) truncated on the right by y_ncat^d(T), with T = x_n^d. Note that sampling the variables y_ncat^d(r) corresponds to solving a multinomial probit regression problem. Hence, to achieve identifiability we assume, without loss of generality, that the regression function f_{R_d}(z_n) is identically zero, and thus we fix b_cat^d(R_d) = 0.

2. For ordinal observations:

p(y_nord^d | x_n^d = r, z_n, b_ord^d, s_n^d = ord) = TN( y_nord^d | z_n b_ord^d, σ_y², θ_{r−1}^d, θ_r^d ).

Note that in this case we also need to sample the values of the thresholds θ_r^d with r = 1, ..., R_d − 1 as

p(θ_r^d | y_nord^d) = TN( θ_r^d | 0, σ_θ², θ_min^d, θ_max^d ),

where θ_min^d = max( θ_{r−1}^d, max_n( y_nord^d | x_n^d = r ) ) and θ_max^d = min( θ_{r+1}^d, min_n( y_nord^d | x_n^d = r + 1 ) ). In words, each θ_r^d is constrained to lie between θ_{r−1}^d and θ_{r+1}^d, as well as to ensure that the pseudo-observations y_nord^d associated to the observations x_n^d = r and x_n^d = r + 1 fall respectively at the left and at the right side of θ_r^d. Since in this ordinal regression problem the thresholds {θ_r^d}_{r=1}^{R_d} are unknown, we set θ_1^d to a fixed value in order to achieve identifiability.

3. For count observations:

p(y_ncount^d | x_n^d, z_n, b_count^d, s_n^d = count) = TN( y_ncount^d | z_n b_count^d, σ_y², g^{−1}(x_n^d), g^{−1}(x_n^d + 1) ),

where g^{−1}: ℝ+ → ℝ is the inverse function of g, i.e., g^{−1}(g(y)) = y. Therefore, y_ncount^d is sampled from a Gaussian truncated on the left by g^{−1}(x_n^d) and on the right by g^{−1}(x_n^d + 1).
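As an example of these truncated-Gaussian updates, the ordinal case (2. above) can be sketched as follows (our illustration; note that scipy's truncnorm parameterises the truncation bounds in standard-normal units):

```python
import numpy as np
from scipy.stats import truncnorm

def sample_ordinal_pseudo_obs(r, z_n, b_ord, thresholds, sigma_y=1.0, rng=None):
    """Redraw y_nord^d ~ N(z_n b_ord^d, sigma_y^2) truncated to (theta_{r-1}, theta_r].

    thresholds = [theta_1, ..., theta_{R-1}]; theta_0 = -inf and theta_R = +inf.
    """
    rng = rng or np.random.default_rng()
    t = np.concatenate(([-np.inf], np.asarray(thresholds, dtype=float), [np.inf]))
    mean = float(np.dot(z_n, b_ord))
    a = (t[r - 1] - mean) / sigma_y            # lower bound in standard units
    b = (t[r] - mean) / sigma_y                # upper bound in standard units
    return truncnorm.rvs(a, b, loc=mean, scale=sigma_y, random_state=rng)
```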

Figure 2. [Synthetic Continuous Data] The first column shows the distribution of the inferred likelihood weights w^d when the ground-truth data is (a) interval, (e) positive real-valued, and (i) real-valued. The remaining columns show example histograms of the datasets: (b) Beta(0.5, 0.5), (c) Beta(0.5, 1), (d) Beta(0.5, 3), (f) Γ(1, 1), (g) Γ(3, 1), (h) Γ(5, 1), (j) N(0, 10), (k) N(10, 10), (l) N(10, 100).

Sampling the likelihood assignments. In order to improve the mixing properties of the sampler, when sampling s_n^d we integrate out the pseudo-observations {y_nℓ^d}. Then, the posterior probability of each observation being assigned to the likelihood model ℓ can be obtained as

p(s_n^d = ℓ | w^d, Z, {b_ℓ^d}) = \frac{w_ℓ^d \, p_ℓ(x_n^d | z_n, b_ℓ^d)}{\sum_{ℓ'∈L_d} w_{ℓ'}^d \, p_{ℓ'}(x_n^d | z_n, b_{ℓ'}^d)}.

Sampling the likelihood weights. We assume the prior distribution on the vector w^d to be a Dirichlet distribution with parameters {α_ℓ}_{ℓ∈L_d}. Then, by conjugacy, we can sample w^d given the likelihood assignments S from a Dirichlet distribution with parameters {α_ℓ + \sum_n δ(s_n^d = ℓ)}_{ℓ∈L_d}.

Scalability. The overall complexity of Algorithm 1 is O(N D L_max + K³) per iteration, where N is the number of objects, D the number of attributes, L_max the maximum number of considered data types (or likelihood models), and K the size of the low-rank representation. In all of our experiments, we run the MCMC for 5,000 iterations, which lasts 10-100 minutes depending on the dataset.

4. Evaluation

4.1. Experiments on synthetic data

In this section, we show that the proposed method is able to accurately discover the true statistical type of variables in synthetic datasets, where we have perfect knowledge of the distribution from which the data have been generated.

First, we focus on continuous variables by generating univariate datasets with 1,000 observations sampled from a known probability density function, which corresponds to: i) a Gaussian distribution when considering real-valued data; ii) a Gamma distribution for positive real-valued data; and iii) a (scaled) Beta distribution for interval data lying in the interval (0, θ_L), where θ_L takes values in {0.1, 1, 100}. Figure 2 shows the distribution, by means of a boxplot (in which the central mark is the median, the edges of the box are the 25th and 75th percentiles, and the whiskers extend to the 10th and 90th percentiles), of the inferred likelihood weights w^d for 10 independent simulations of Algorithm 1 with 500 iterations on 10 independent datasets generated with the parameters detailed in the figure. Reassuringly, we observe that the proposed method identifies interval data as the most likely type of data for the three considered Beta distributions; moreover, as the tail of the Beta distribution increases, so does the weight given to the positive real-valued type. This effect can be explained by the finite size of the dataset, since it is hard to determine whether the variable is limited to values smaller than θ_L, or whether we simply have not observed them in the finite set of observations. A similar effect occurs when applying our method to data sampled from Gamma (Figure 2(e)-(h)) and Gaussian (Figure 2(i)-(l)) distributions. Here, we observe that in addition to, respectively, the positive real-valued and real-valued data types, our model finds that the variable may also be of the interval data type. This effect is larger for Gaussian variables, since in this example the Gaussian is a more heavy-tailed distribution than the Gamma.
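For reference, the continuous synthetic datasets described above can be regenerated along the following lines (a sketch; the seed is arbitrary, the Gamma parameters are read as shape/scale and the second Gaussian parameter as a variance, which is one possible reading of the panel labels in Figure 2):

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
N = 1000                         # observations per dataset

synthetic = {
    # interval data: (scaled) Beta distributions, cf. Figure 2(b)-(d)
    "Beta(0.5,0.5)": rng.beta(0.5, 0.5, size=N),
    "Beta(0.5,3)":   rng.beta(0.5, 3.0, size=N),
    # positive real-valued data: Gamma distributions, cf. Figure 2(f)-(h)
    "Gamma(1,1)":    rng.gamma(shape=1.0, scale=1.0, size=N),
    "Gamma(5,1)":    rng.gamma(shape=5.0, scale=1.0, size=N),
    # real-valued data: Gaussian distributions, cf. Figure 2(j)-(l)
    "N(0,10)":       rng.normal(loc=0.0, scale=np.sqrt(10.0), size=N),
    "N(10,100)":     rng.normal(loc=10.0, scale=np.sqrt(100.0), size=N),
}
```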

Next, we study whether the proposed model is able to disambiguate among different discrete types of variables, particularly among categorical, ordinal and count data. To this end, we generate three types of datasets of size 1,000. In the first type, we account for categorical data by sampling a multinomial variable with R categories, where the probability of the categories is sampled from a Dirichlet distribution. Then, for each category we sample a multidimensional Gaussian centroid that corresponds to the mean of the multivariate Gaussian observations that complete the dataset. To account for ordinal observations, we first sample the first variable in our dataset from a uniform distribution in the interval (0, R), which we randomly divide into R categories that correspond to the ordinal variable in our dataset. Finally, to account for count data, we first generate a Gamma variable sampled from Γ(α, α/4), and then generate the count variable in the dataset by taking the floor of the Gamma variable. For both categorical and ordinal data, we generate 10 independent datasets for each value of the number of categories R ∈ {3, ..., 10}, and for count data we generate another 10 datasets for each value of α ∈ {2, ..., 8}.

Figure 3 summarizes the likelihood weights obtained for each type of dataset (i.e., for each type of discrete variable) after running, on each dataset, 10 independent simulations of Algorithm 1 with 500 iterations for different model complexity values, i.e., for different numbers of latent feature variables K = 1, ..., 10. In this figure we observe that we can accurately discover the true type of discrete variable, robustly and independently of the assumed model complexity K. We also observe in the top row of Figure 3(a)-(b) that i) as the number of categories R in the discrete variable decreases, it becomes harder to distinguish between ordinal and categorical data, i.e., to find out whether the data take values in an ordered or in an unordered set; and ii) as R in ordinal data increases, the ordinal variable is more likely to be identified as count data. Both of these effects are intuitively sensible.

Figure 3. [Synthetic Discrete Data] Distribution of the inferred likelihood weights w^d when the ground-truth data is (a) categorical, (b) ordinal, and (c) count data. For categorical and ordinal data, we plot the likelihood weight distribution with respect to both the number of categories in the data and the model complexity K, and for count data with respect to K.

4.2. Experiments on real data

In this section, we evaluate the performance of the proposed method on seven real datasets collected from the UCI machine learning repository (Lichman, 2013). Table 1 summarizes these datasets by providing the number of objects and attributes in each dataset, as well as how many of these attributes are discrete.

Table 1. Information on real datasets.
Dataset       N        D    # of Discrete   # of Binary
Abalone       4,177    9    2               0
Adult         32,561   15   12              2
Chess         28,056   7    7               0
Dermatology   366      35   35              0
German        1,000    21   20              4
Student       395      33   33              13
Wine          177      14   2               0

In order to quantitatively evaluate the performance of the proposed method, we select at random 10% of the observations in each dataset as a held-out set and compare the predictive performance, in terms of average test log-likelihood per observation, of our method with a baseline method. The baseline method corresponds to a latent feature model in which all the continuous variables are modeled as real-valued data and the discrete variables as categorical data. Figure 4 shows the obtained results for our method (solid line) and the baseline (dashed line) for several values of the model complexity (i.e., the number of latent features K), averaged over 10 independent runs of the corresponding inference algorithms. Here, we observe that i) both methods provide robust results with respect to the number of latent variables K; and ii) our method clearly outperforms the baseline in all the datasets, except for the Student dataset, where the baseline performs slightly better. In other words, this figure shows that by taking into account the uncertainty in the statistical types of the variables, we provide a better fit of the data.

Additionally, Table 2 shows the list of (non-binary) attributes in the Adult and the German datasets together with the data types with the largest inferred likelihood weights, i.e., the discovered statistical data types (in cases in which two data types present very similar likelihood weights, with less than a 10% difference, we display both of them). Here, the number in parentheses corresponds to the observed number of categories in discrete data. The very heterogeneous nature of these datasets explains the substantial gain observed in Figure 4.

Figure 4. [Real Data] Comparison between our model (solid) and the baseline (dashed) in terms of average test log-likelihood per observation, evaluated on a held-out set containing 10% of the observations in each dataset (Abalone, Adult, Chess, Dermatology, German, Student, Wine), as a function of the number of latent variables K.

Table 2. Inferred data types.
Adult                            |  German
Attribute             Type       |  Attribute            Type
age (74)              ord.       |  status account (4)   cat.
workclass (8)         cat.       |  duration (69)        ord.
final weight          positive   |  credit hist. (5)     cat./ord.
education (16)        cat.       |  purpose (10)         cat./ord.
education num. (16)   cat.       |  amount               interval
marital status (7)    cat.       |  savings (5)          ord.
occupation (14)       cat./ord.  |  installment (5)      cat./ord.
relationship (6)      ord.       |  personal status (4)  cat.
race (5)              cat.       |  debtors (4)          ord.
sex (2)               binary     |  residence (3)        cat.
capital-gain          real       |  property (4)         cat./ord.
capital-loss          real       |  age (57)             count
hours per week (99)   cat./ord.  |  plans (3)            cat.
native-country (41)   ord.       |  housing (3)          ord.
                                 |  # credits (4)        ord.
                                 |  job (4)              ord.

Moreover, Table 2 shows some expected results; e.g., marital status and race are identified as categorical, while the age is of count data type for both datasets. However, other results might seem surprising. For example, the duration (in months), which one would expect to be count data, is identified as ordinal; and the a priori categorical attributes native country and job are inferred to be ordinal.

In order to better understand these results, we show the histograms of several variables in these datasets and the associated inferred likelihood weights. Figure 5 shows the histograms of two continuous variables, the length and the weight of the Abalone dataset, which take only positive real values but are assigned to different data types (respectively, to real-valued and positive real-valued data). This can be explained by the fact that, while the distribution of the length presents large tails, the distribution of the weight is clearly truncated at zero.

Figure 5. [Abalone dataset] Histograms of (a) Length (mm), with inferred weights w_Re = 0.99, w_Re+ = 0.00, w_Int = 0.00; and (b) Weight (grams), with inferred weights w_Re = 0.07, w_Re+ = 0.92, w_Int = 0.01.

Additionally, Figure 6(a)-(b) shows two discrete variables, the duration (in months) and the age in the German dataset, which based on the documentation are expected to be count data. However, our model assigns the duration to ordinal data. This result can be explained by the irregular distribution of this variable. In count data, the distance between every two consecutive values should be roughly the same (there is the same distance from "1 pen" to "2 pens" as from "2 pens" to "3 pens", that is, 1 pen), resulting therefore in smooth probability mass functions. We found in Figure 6(c)-(d) that, while the number of credits and the job variables can a priori be thought of as, respectively, count and categorical data, they are both inferred to be ordinal data. In the case of the number of credits, this can be explained by the small (finite) number of values that the variable takes, while in the case of the job, this assignment can be explained by the labels of its categories, i.e., {unskilled non-resident, unskilled resident, skilled employee, highly qualified employee}, which clearly represent an ordered set.

Figure 6. [German dataset] Histograms of (a) Duration (months), with inferred weights w_cat = 0.22, w_ord = 0.56, w_count = 0.22; (b) Age, with w_cat = 0.16, w_ord = 0.22, w_count = 0.61; (c) # of Credits, with w_cat = 0.29, w_ord = 0.54, w_count = 0.16; and (d) Job, with w_cat = 0.31, w_ord = 0.54, w_count = 0.15.

From these results, we can conclude that i) our model accurately discovers the true statistical type of the data, which might not be easily extracted from its documentation, and, by doing so, ii) it provides a better fit of the data. Moreover, apparent failures are in fact sensible when the data histograms are carefully examined.

5. Conclusions

In this paper, we presented the first approach to automatically discover the statistical types of the variables in a dataset. Our experiments showed that the proposed approach accurately infers the data type, or equivalently the likelihood model, that best fits the data.

Our work opens many interesting avenues for future work. For example, it would be interesting to extend the proposed method to account for other data types. We would like to include directional data, also called circular data, which arise in a multitude of data-modelling contexts ranging from robotics to the social sciences (Navarro et al., 2016). Moreover, since the proposed method can be seen as a likelihood selection method, it would be interesting to study how to incorporate our framework into any statistical machine learning tool, where the likelihood model, instead of being fixed a priori, would be inferred directly from the data jointly with the rest of the model parameters.

Acknowledgement

Isabel Valera acknowledges her Humboldt Research Fellowship for Postdoctoral Researchers, which funded this research during her stay at the Max Planck Institute for Software Systems.

References

Agresti, A. Analysis of ordinal categorical data, volume 656. John Wiley & Sons, 2010.

Ando, T. Bayesian model selection and statistical modeling. CRC Press, 2010.

Breiman, L. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199-231, 2001.

Burnham, K. P. and Anderson, D. R. Model selection and multimodel inference: a practical information-theoretic approach. Springer Science & Business Media, 2003.

Chu, W. and Ghahramani, Z. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6:1019-1041, 2005a.

Chu, W. and Ghahramani, Z. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6:1019-1041, December 2005b. ISSN 1532-4435.

Dong, X. L. and Srivastava, D. Big data integration. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pp. 1245-1248. IEEE, 2013.

Girolami, M. and Rogers, S. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18, 2006.

Griffiths, T. L. and Ghahramani, Z. The Indian buffet process: an introduction and review. Journal of Machine Learning Research, 12:1185-1224, 2011.

Hellerstein, J. M. Quantitative data cleaning for large databases, 2008.

Hernandez-Lobato, J. M., Lloyd, J. R., Hernandez-Lobato, D., and Ghahramani, Z. Learning the semantics of discrete random variables: Ordinal or categorical? In NIPS Workshop on Learning Semantics, 2014.

Hilbe, J. M. Negative binomial regression. Cambridge University Press, 2011.

Kandel, S., Heer, J., Plaisant, C., Kennedy, J., van Ham, F., Riche, N. H., Weaver, C., Lee, B., Brodbeck, D., and Buono, P. Research directions in data wrangling: Visualizations and transformations for usable and credible data. Information Visualization, 10(4):271-288, 2011.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Navarro, A. K. W., Frellsen, J., and Turner, R. E. The multivariate generalised von Mises: Inference and applications. arXiv preprint arXiv:1602.05003, 2016.

Salakhutdinov, R. and Mnih, A. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, 2007.

Schmidt, M. N., Winther, O., and Hansen, L. K. Bayesian non-negative matrix factorization. In International Conference on Independent Component Analysis and Signal Separation, pp. 540-547. Springer, 2009.

Valera, I. and Ghahramani, Z. General table completion using a Bayesian nonparametric model. In Advances in Neural Information Processing Systems 27, 2014.