Robust Variational Autoencoders for Outlier Detection and Repair of Mixed-Type Data

Robust Variational Autoencoders for Outlier Detection and Repair of Mixed-Type Data Sim~aoEduardo∗1 Alfredo Nazábal∗2 Christopher K. I. Williams12 Charles Sutton123 1School of Informatics, University of Edinburgh, UK 2The Alan Turing Institute, UK; 3Google Research Abstract et al., 2017), annotations of anomalous cells are often not readily available in practice. Instead, unsupervised We focus on the problem of unsupervised cell OD attempts to infer the underlying clean distribution, outlier detection and repair in mixed-type tab- and explains outliers as instances that deviate from that ular data. Traditional methods are concerned distribution. It is important to focus on the joint dis- only on detecting which rows in the dataset tribution over features, because although some outliers are outliers. However, identifying which cells can be easily identified as anomalous by considering corrupt a specific row is an important problem only the marginal distribution of the feature, many in practice, and the very first step towards others are only detectable within the context of the repairing them. We introduce the Robust other features (section 2.2 of (Chandola et al., 2009)). Variational Autoencoder (RVAE), a deep gen- Recently deep models outperformed traditional ones erative model that learns the joint distribution for tabular data tasks (Klambauer et al., 2017), cap- of the clean data while identifying the out- turing their underlying structure better. They are an lier cells, allowing their imputation (repair). attractive choice for OD, since they have the flexibility RVAE explicitly learns the probability of each to model a wide variety of clean distributions. How- cell being an outlier, balancing different likeli- ever OD work has mostly focused on image datasets, hood models in the row outlier score, making repairing dirty pixels instead of cells in tabular data, the method suitable for OD in mixed-type e.g. (Wang et al., 2017b; Zhou and Paffenroth, 2017; datasets. We show experimentally that not Akrami et al., 2019). only RVAE performs better than several state- Outliers present unique challenges to deep generative of-the-art methods in cell OD and repair for models. First, most work focuses on detecting anoma- tabular data, but also that is robust against lous data rows, without detecting which specific cells the initial hyper-parameter selection. in a row are problematic (Redyuk et al., 2019; Schelter et al., 2018). Work on cell-level detection and repair often focuses on real-valued features, e.g. images (Zhou 1 Introduction and Paffenroth, 2017; Wang et al., 2017b; Schlegl et al., 2017), or does not provide a principled way to detect The existence of outliers in real world data is a problem anomalous cells (Nguyen and Vien, 2018). Since focus is data scientists face daily, so outlier detection (OD) has on row OD, not enough care is given to cell granularity, been extensively studied in the literature (Chandola which means it is often difficult to properly repair the et al., 2009; Emmott et al., 2015; Hodge and Austin, dirty cells, e.g. large number of columns exist or when 2004). The task is often unsupervised, meaning that the data scientist is not a domain expert. Second, tabu- we do not have annotations indicating whether indi- lar data is often mixed-type, including both continuous vidual cells in the data table are clean or anomalous. and categorical columns. Although modelling mixed- Although supervised OD algorithms have been pro- type data has been explored (Nazabal et al., 2018; posed (Lee et al., 2018; An and Cho, 2015; Schlegl Vergari et al., 2019), the difficulty arises when han- ∗ Joint first authorship. dling outliers. Standard outlier scores are based on the probability that the model assigns to a cell, but these values are not comparable between likelihood models, Proceedings of the 23rdInternational Conference on Artificial performing poorly for mixed-type data. Finally, the Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. effect of outliers in unsupervised learning can be insidi- PMLR: Volume 108. Copyright 2020 by the author(s). ous. Since deep generative models are highly flexible, Robust Variational Autoencoders for Outlier Detection and Repair of Mixed-Type Data they are not always robust against outliers (Hendrycks lihood pθ(xndjzn) differently for each feature type. For and Dietterich, 2019), overfitting to anomalous cells. real features pθ(xndjzn) = N (xndjmd(zn); σd), where When the model overfits, it cannot identify these cells σd is a global parameter. For categorical features as outliers, because it has modelled them as part of pθ(xndjzn) = f(ad(zn)), where ad(zn) is an unnor- the clean distribution, and consequently, most repair malized vector of probabilities for each category and proposals are skewed towards the dirty values, and not f is the softmax function. All md(zn) and ad(zn) are the underlying clean ones. parameterized by feed-forward networks. Our main contributions are: (i) Our Robust Varia- As exact inference for pθ(znjxn) is generally intractable, tional Autoencoder (RVAE), a novel fully unsupervised a variational posterior qφ(znjxn) is used; in VAEs this is deep generative model for cell-level OD and repair for also known as the encoder. It is modelled by a Gaussian mixed-type tabular data. It uses a two-component mix- distribution with parameters µ(xn) and Σ(xn) ture model for each feature, with one component for clean data, and the other component that robustifies qφ(znjxn) = N (znjµ(xn); Σ(xn)) (2) the model by isolating outliers. (ii) RVAE models the underlying clean data distribution by down-weighting where φ = fµ(xn); Σ(xn)g are feed-forward neural the impact of anomalous cells, providing a competitive networks, and Σ(xn) is a diagonal covariance matrix. outlier score for cells and a superior estimate of cell VAEs are trained by maximizing the lower bound on repairs. (iii) We present a hybrid inference scheme for the marginal log-likelihood called the evidence lower optimizing the model parameters, combining amortized bound (ELBO), given by and exact variational updates, which proves superior standard amortized inference. (iv) RVAE allows us to N D present an outlier score that is commensurate across 1 X X L = Eqφ(znjxn) [log pθ(xndjzn)] mixed-type data. (v) RVAE is robust against the selec- N n=1 d=1 tion of its hyper-parameters, while other OD methods −D (q (z jx )jjp(z )); (3) suffer from fine tuning of their parameters to each KL φ n n n specific dataset. where the neural network parameters of the decoder θ and encoder φ are learnt with a gradient-based opti- 2 Variational Autoencoders mizer. When VAEs are used for OD, typically an instance in a tabular dataset is an outlier if the expected We consider a tabular dataset X with n 2 f1; ··· ;Ng likelihood Eqφ(znjxn) [log pθ(xnjzn)] is small (An and instances and d 2 f1; ··· ;Dg features, where each cell Cho, 2015; Wang et al., 2017b). xnd in the dataset can be real (continuous), xnd 2 R, or categorical, xnd 2 f1; ::; Cdg with Cd the number of unique categories of feature d. Cells in the dataset are 3 Robust Variational Autoencoder potentially corrupted with an unknown noising process (RVAE) appropriate for the feature type. The objective in this work is not only detecting the anomalous instances in To improve VAEs for OD and repair, we want to make the dataset, termed row outliers, but also determining them more robust, by automatically identifying poten- the specific subset of cells that are anomalous, termed tial outliers during training, so they are downweighted cell outliers, proposing potential repair values for them. when training the generative model. We also want A common approach to unsupervised OD is to build a cell-level outlier score which is comparable across a generative model p(X) that models the distribution continuous and categorical attributes. We can achieve of clean data. A powerful class of deep generative both goals by modifying the generative model. models are variational autoencoders (VAEs) (Kingma We define here our robust variational autoencoder and Welling, 2014), which model p(X) as (RVAE), a deep generative model based on a two- N component mixture model likelihood (decoder) per fea- Y Z p(X) = dzn p(zn)pθ(xnjzn) (1) ture, which isolates the outliers during training. RVAE n=1 is composed of a clean component pθ(xndjzn) for each dimension d, explaining the clean cells, and an outlier QD where pθ(xnjzn) = d=1 pθ(xndjzn) and pθ(xndjzn) is component p0(xnd), explaining the outlier cells. The K the conditional likelihood of feature d, zn 2 R is mixing variable wnd 2 f0; 1g acts as a gate to determine the latent representation of instance xn, and p(zn) = whether cell xnd should be modelled by the clean com- N (0; I) is an isotropic multivariate Gaussian prior. To ponent (wnd = 1) or the outlier component (wnd = 0). handle mixed-type data, we choose the conditional like- We define the marginal likelihood of the mixture model Sim~aoEduardo∗1, Alfredo Nazábal∗2, Christopher K. I. Williams12, Charles Sutton123 model under dataset X as1 3.1 Outlier Model N Y X Z The purpose of the outlier distribution p (x ) is to p(X) = dz p(z )p(w )p(x jz ; w ); (4) 0 nd n n n n n explain the outlier cells in the dataset, completely n=1 w n removing their effect in the optimization of the pa- D Y rameters of clean component p . For categorical p(x jz ; w ) = p (x jz )wnd p (x )1−wnd ; (5) θ n n n θ nd n 0 nd features, we propose using the uniform distribution d=1 −1 p0(xnd) = Cd .

Load more