Nonparametric Statistical Inference and Imputation for Incomplete Categorical Data∗

Statistics and Its Interface Volume 0 (2014) 1 Nonparametric Statistical Inference and Imputation for Incomplete Categorical Data∗ Chaojie Wang, Linghao Shen, Han Li and Xiaodan Fany Book1 Book2 Book3 Book4 Book5 Customer1 1 1 0 Missingness in categorical data is a common problem in Customer2 1 0 0 various real applications. Traditional approaches either uti- Customer3 1 1 lize only the complete observations or impute the missing Customer4 1 1 data by some ad hoc methods rather than the true condi- Customer5 0 1 0 tional distribution of the missing data, thus losing or dis- Customer6 0 1 0 torting the rich information in the partial observations. In this paper, we propose the Dirichlet Process Mixture of Col- Table 1. Toy data example with the goal to infer the empty lapsed Product-Multinomials (DPMCPM) to model the full cells. 1: bought; 0: recommended but not bought; empty: data jointly and compute the model efficiently. By fitting an missing values. infinite mixture of product-multinomial distributions, DPM- CPM is applicable for any categorical data regardless of the true distribution, which may contain complex association When dealing with datasets with missingness, naive ap- among variables. Under the framework of latent class anal- proaches, such as the complete-case analysis (CCA) and ysis, we show that DPMCPM can model general missing overall mean imputation, would waste the information in mechanisms by creating an extra category to denote miss- the missing data and may bias the inference [2]. When missingness, which implicitly integrates out the missing part ingness is high, CCA is hardly applicable due to the lack of with regard to their true conditional distribution. Through complete cases. Advanced methods, such as multiple impu- simulation studies and a real application, we demonstrate tation [18, 20], impose a parametric model on the data and that DPMCPM outperforms existing approaches on statis- then draw multiple sets of samples to account for the un- tical inference and imputation for incomplete categorical certainty of the missing information. For categorical cases, data of various missing mechanisms. DPMCPM is imple- [19] advocated the log-linear model for multiple imputation, mented as the R package MMDai, which is available from which can capture certain types of association among the the Comprehensive R Archive Network at https://cran.r- categorical variables. However, this model works only when project.org/web/packages/MMDai/index.html. the number of variables is small, as the full multi-way cross- tabulation required for the log-linear analysis increases ex- Keywords and phrases: infinite mixture model, product- multinomial distribution, missing data, imputation. ponentially with the number of variables [27]. There are two basic ideas for imputing multivariate missing data: fully conditional specification (FCS) and joint 1. INTRODUCTION modeling. FSC [25] specifies a collection of univariate conditional imputation models that condition on all the other arXiv:1712.02214v2 [stat.ME] 11 Jul 2019 Missingness in categorical data is a common problem in many real applications. For examples, in social surveys, variables. A popular application based on FCS is known as data collected by questionnaires are often incomplete be- Multiple Imputation by Chained Equation (MICE), which cause subjects may be unwilling or unable to respond to specifies a sequence of regression models iteratively [24, 31]. some items [11]. In biological experiments, data may be in- [23] provided the R package mice to implement this method complete for either biologically-driven or technically-driven efficiently. Although it has been shown to work well for many reasons [7]. In recommendation system problems [16], ana- datasets and become a popular method [9], MICE still has lysts often have a dataset as the toy example in Table1. The some common drawbacks of FCS. A typical application of goal is to predict the potential purchase behavior without MICE is to use multinomial logistic regression for the cat- direct observation, e.g., whether Customer3 and Customer4 egorical data, but the relationship among variables may be would buy Book5. nonlinear and may involve complex interaction or higher- order effects [15]. Besides, there is no guarantee that the ∗ The authors gratefully acknowledge three grants from the Research iterations of sequential regression model will converge to Grants Council of the Hong Kong SAR, China (400913, 14203915, 14173817) and two CUHK direct grants (4053135, 3132753). the true posterior distribution of missing values [27]. Other yCorresponding author: [email protected] parametric approaches include missMDA based on principal component analysis [10], MIMCA based on correspondence real application, we demonstrate DPMCPM performs bet- analysis [1], and so on. These parametric methods can be ap- ter statistical inference and imputation than existing ap- plied in some specific problems but not in general cases. [14] proaches. To our knowledge, this is the first non-parametric provided a detailed review for current advances of multiple tool which can model arbitrary categorical distributions and imputation. handle high missingness. From the perspective of data analysts, no matter the This paper is organized as follows. Section2 introduces data is raw with missingness or has been imputed when the DPMCPM model and the Gibbs sampler algorithm. Sec- arriving, it is important to understand the detailed mecha- tion3 performs simulation studies on the synthetic data and nisms of pre-processing, including imputation. [32] reviewed Section4 presents a real application in a recommendation and discussed the imputation danger caused by the differ- system problem. Section5 concludes the paper. ence of the God's model, imputer's model and analyst's 2. METHOD model. They pointed that any attempt of pre-processing to make the data \more usable" implies potential assump- 2.1 Dirichlet Process Mixture of tions. In this aspect, joint modeling provides a powerful tool Product-Multinomials to model the underlying distribution behind the data. For categorical data, [27] proposed a latent class model, i.e., We begin with the finite mixture product-multinomial the finite mixture of product-multinomial model. The la- model for the case of complete dataset x = fxijg, where tent class model can characterize both simple association i = 1; ··· ; n and j = 1; ··· ; p. Suppose x comprises of n independent samples associated with p categorical variables, and complex higher-order interactions among the categori- and the j-th variable has d categories. Let x 2 f1; ··· ; d g cal variables as long as the number of latent class is large j ij j denote the observed category for the i-th sample in the j- enough [13]. [3] proposed the Dirichlet Process Mixture of th variable. The finite mixture product-multinomial model Products of Multinomial distributions (DPMPM) to model assumes that those x are generated from a multinomial complete multivariate categorical datasets, which avoids the ij distribution indexed by a latent variable zi 2 f1; ··· ; kg. arbitrary setting of the number of mixture components. Fur- A finite mixture model with k latent components can be thermore, [3] proved that any multivariate categorical data expressed as: distribution can be approximated by DPMPM for a sufficiently large number of mixture components. [21] general- (1) x jz ; (j) ∼ multinomial( (j) ; ··· ; (j) ); ij i zi zi1 zidj ized the DPMPM framework to analyze incomplete categorical datasets, which works well for low missingness but performs poorly for high missingness since the number of (2) zijΘ ∼ multinomial(θ1; ··· ; θk); parameters would increase dramatically. Based on the work (j) (j) (j) where Θ = fθ1; ··· ; θkg and = f ; ··· ; g. We of [21], [12] introduced a variant of this model for edit- zi zi1 zidj (j) imputation, which accounts for the values that are logically further define Ψ = f h : h = 1; ··· ; k; j = 1; ··· ; pg and impossible but present due to measurement error. [8] ex- z = fzi : i = 1; ··· ; ng. tended this model to nested data structures in the presence [3] proved that any multivariate categorical data distri- of structural zeros. Other related works include the divi- bution can be represented by the mixture distribution in sive latent class model [26], Bayesian multilevel latent class Equation1 and2 for a sufficiently large k. However, speci- models [29] and so on. [28] presented a detailed overview fying a good k to avoid over-fitting and over-simplification is non-trivial, and it becomes even harder when the dataset is of recent researches on multiple imputation using the latent highly incomplete [3, 21]. This motivates the use of an infi- class model. nite extension of the finite mixture model, i.e., the Dirichlet In this paper, we propose DPMCPM, which extends process mixture. A Dirichlet process can be represented by DPMPM for modelling incomplete multivariate categorical various schemes, including the Pólya urn scheme, the Chi- data efficiently. DPMCPM inherits some nice properties of nese restaurant process and the stick-breaking construction DPMPM. It avoids the arbitrary setting of the number of [22]. [3] chose the stick-breaking construction to model the mixture components by using the Dirichlet process. In ad- Dirichlet process. However, in practice, the slice Gibbs sam- dition, DPMCPM is applicable for any categorical data re- pler in their construction may often be trapped in a single gardless of the true distribution and can capture the com- component when n is relatively large due to numeric limits plex association among variables. Different from the miss- and thus fail to identify the correct number of components ingness viewed as unknown parameters in [21], DPMCPM [30]. To avoid this drawback, we construct the Dirichlet pro- creates an extra category to denote the missingness. It shall cess by using the Chinese restaurant process in DPMCPM: reduce computation burden and gain better performance nh;−i when the missingness is high. Under the framework of the la- P (zi = hjz−i) = ; h = 1; ··· ; k; n + α − 1 tent class analysis, we show that DPMCPM can model gen- (3) α P (z = k + 1jz ) = ; eral missing mechanisms.

Load more