Independent Vector Analysis for Molecular Data Fusion: Application to Property Prediction and Knowledge Discovery of Energetic Materials
Zois Boukouvalas∗, Monica Puerto∗, Daniel C. Elton†, Peter W. Chung‡, and Mark D. Fuge‡

∗Dept. of Mathematics and Statistics, American University, Washington, DC 20016
†Imaging Biomarkers and Computer-Aided Diagnosis Lab, Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, MD 20892
‡Dept. of Mechanical Engineering, University of Maryland, College Park, College Park, MD 20742

Support for this work is gratefully acknowledged from the U.S. Office of Naval Research under grant number N00014-17-1-2108 and from the Energetics Technology Center under project number 2044-001. Partial support is also acknowledged from the Center for Engineering Concepts Development in the Department of Mechanical Engineering at the University of Maryland, College Park. We thank Dr. Ruth M. Doherty and Dr. Bill Wilson from the Energetics Technology Center for their encouragement, useful thoughts, and for proofreading the manuscript. Daniel C. Elton contributed to this article in his personal capacity. The views expressed are his own and do not necessarily represent the views of the National Institutes of Health or the United States Government.

Abstract—Due to its high computational speed and accuracy compared to ab-initio quantum chemistry and forcefield modeling, the prediction of molecular properties using machine learning has received great attention in the fields of materials design and drug discovery. A main ingredient required for machine learning is a training dataset consisting of molecular features—for example fingerprint bits, chemical descriptors, etc.—that adequately characterize the corresponding molecules. However, choosing features for any application is highly non-trivial, since no "universal" method for feature selection exists. In this work, we propose a data fusion framework that uses Independent Vector Analysis to uncover underlying complementary information contained in different molecular featurization methods. Our approach takes an arbitrary number of individual feature vectors and generates a low-dimensional set of features—molecular signatures—that can be used for the prediction of molecular properties and for knowledge discovery. We demonstrate this on a small and diverse dataset of energetic compounds by predicting several energetic properties, as well as by showing how the approach provides insights into the relationships between molecular structures and properties.

Index Terms—Data fusion, blind source separation, interpretability

I. INTRODUCTION

Machine learning (ML) has recently been used for the prediction of molecular properties, and studies have shown that it can provide accurate and computationally efficient solutions for this task [1]–[5]. The prediction ability of an ML model depends strongly on the proper selection of a training dataset, via feature vectors, that can fully capture certain characteristics of a given set of molecules. A common way to represent a molecule is through a string representation called the Simplified Molecular-Input Line-Entry System (SMILES) string [6]. Although working directly with SMILES strings has been shown to be effective in some ML tasks [7], most ML methods require vector or matrix variables rather than strings. Basic classes of featurization methods that have been widely used in the literature include cheminformatic descriptors, molecular fingerprints, and custom graph-convolution-based fingerprints, where each featurization method provides different—though not necessarily unique—information about a molecule.

Thus, it is important to find answers to the question "how can disparate datasets, each associated with unique featurization methods, be integrated?" The question is motivated by the desire to create automated approaches where data generation techniques are integrable with ML models. Data fusion methods [8], [9] may serve this purpose, since they enable the simultaneous study of multiple datasets by, for instance, exploiting alignments of data fragments where there is a common underlying latent space.

A naive approach could be to simply concatenate data that has been generated by different featurization methods. However, this typically leads to a curse of dimensionality that can degrade the performance of an ML algorithm and make it impossible to discover the features of greatest importance. Therefore, selecting a model that generates a set of molecular feature vectors by effectively exploiting complementary information among multiple datasets is an important issue.

Blind source separation (BSS) techniques enable the joint analysis of datasets and the extraction of summary factors with few assumptions placed on the data. This is generally achieved through the use of a generative model. One of the most widely used BSS techniques is independent component analysis (ICA) [10], [11]. Its popularity is likely because, by requiring only statistical independence of the latent sources, it can uniquely identify the true latent sources, subject only to scaling and permutation ambiguities. However, ICA can decompose only a single dataset. In many applications, multiple sets of data are gathered about the same phenomenon. These sets can often be disjoint and merely misaligned fragments of a shared underlying latent space. Multiple datasets could therefore share dependencies.
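The single-dataset ICA behavior described above, recovery of the latent sources up to scaling and permutation, can be sketched in a few lines. The following is a minimal illustration, assuming scikit-learn is available; the toy sources and mixing matrix are hypothetical and not from the paper's experiments:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
N = 2000  # number of samples (molecules, in the paper's setting)

# Two independent, non-Gaussian latent sources, shape (N, 2).
S = np.column_stack([rng.laplace(size=N),
                     rng.uniform(-1.0, 1.0, size=N)])

# Noiseless linear mixture X = S A^T with an invertible mixing matrix A.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = S @ A.T

# ICA estimates the sources from X alone, up to scaling and permutation.
ica = FastICA(n_components=2, random_state=0, max_iter=1000)
S_hat = ica.fit_transform(X)

# Each true source matches exactly one estimate (in some order and scale),
# so each row of the absolute cross-correlation matrix has one entry near 1.
corr = np.abs(np.corrcoef(S.T, S_hat.T)[:2, 2:])
print(np.round(corr, 2))
```

Each row of `corr` having a single large entry reflects the permutation ambiguity: the estimates come back in arbitrary order and sign, but each one tracks exactly one true source.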
978-9-0827-9705-3 1030 EUSIPCO 2020

This motivates the application of methods that can jointly factorize multiple datasets, like independent vector analysis (IVA) [11]. IVA is a recent generalization of ICA to multiple datasets that can achieve improved performance over performing ICA on each dataset separately.

In this paper, we propose a novel data fusion framework that uses IVA to exploit underlying complementary information contained across multiple datasets. Namely, we show that information that has been generated by different molecular featurization methods can be fused to improve learning of the chemical relationships. Note here that achieving perfect regression error is not the goal of this work. Instead, our goal is to determine how to generate feature vectors by combining datasets from multiple featurization methods to improve the learned response of the data-fused model over the performance of the individual feature vectors treated separately. Particularly noteworthy is that the proposed approach is completely unsupervised, parameter-free, computationally attractive when compared with existing methods [5], easily interpretable due to the simplicity of the generative model, and does not require a large number of training samples in order to achieve a desirable regression error.

The remainder of this paper is organized as follows. In Section II, we provide a brief background on the IVA method and the regression procedure. Section III provides the results of the regression procedure and associated discussions. The conclusions and future research directions are presented in Section IV.

II. MATERIALS AND METHODS

The BSS model is formulated as follows. Let X ∈ R^(d×N) be the observation matrix, where d denotes the dimension of the feature vector and N denotes the total number of molecules. The noiseless BSS generative model is given by

X = AS, (1)

where A ∈ R^(d×P) is the mixing matrix and S ∈ R^(P×N) is the matrix that contains the sources that need to be estimated and will be used as the new feature vector for the ML task. One could perform ICA separately on each dataset and align the subsequent results. However, this approach could be considered suboptimal, since performing ICA individually on each dataset will ignore any dependencies that exist among them.

A. Independent Vector Analysis

It is common for datasets that have been generated from different featurization methods to have some inherent dependence among them. IVA generalizes the ICA problem by allowing for full exploitation of this dependence, leading to improved performance beyond what is achievable by applying ICA separately to each dataset. Additionally, IVA automatically aligns dependent sources across the datasets, thus bypassing the need for a second permutation correction algorithm.

IVA is similar to ICA, except that now we have K datasets, x^[k], k = 1, ..., K, where each dataset is a linear mixture of P statistically independent sources. Under the assumption that PCA preprocessing results in d = P, the noiseless IVA model is given by x^[k] = A^[k] s^[k], k = 1, ..., K, where A^[k] ∈ R^(P×P), k = 1, ..., K, are invertible mixing matrices and s^[k] = [s_1^[k], ..., s_P^[k]]^T is the vector of latent sources for the kth dataset. In the IVA model, the components within each s^[k] are assumed to be independent, while at the same time, dependence across corresponding components of s^[k] is allowed. To mathematically formulate the dependence across components that IVA can take into account, we define the source component vector (SCV) by vertically concatenating the pth source from each of the K datasets as

s_p = [s_p^[1], ..., s_p^[K]]^T, (2)

where s_p is a K-dimensional random vector. The goal in IVA is to estimate K demixing matrices to yield source estimates y^[k] = W^[k] x^[k], such that each SCV is maximally independent of all other SCVs. It should be mentioned that dependence across datasets is not a necessary condition for IVA to work. In the case where this type of statistical property does not exist, IVA reduces to individual ICAs on each dataset. The IVA
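To make the IVA generative model and the SCV definition of Eq. (2) concrete, the following numpy sketch constructs K linearly mixed datasets whose corresponding sources are correlated across datasets, and forms one SCV. The sizes K, P, N and all variable names are hypothetical, not from the paper; estimating the demixing matrices W^[k] requires an IVA algorithm (e.g., IVA-G, which exploits exactly this kind of second-order dependence) and is beyond this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
K, P, N = 3, 4, 1000  # K datasets, P sources per dataset, N molecules

# Draw P source component vectors (SCVs): the p-th source is correlated
# across the K datasets, while different sources are independent.
S = np.empty((K, P, N))
scv_cov = 0.2 * np.eye(K) + 0.8 * np.ones((K, K))  # within-SCV covariance
for p in range(P):
    S[:, p, :] = rng.multivariate_normal(np.zeros(K), scv_cov, size=N).T

# Noiseless IVA model x^[k] = A^[k] s^[k] with random (almost surely
# invertible) mixing matrices A^[k], one per dataset.
A = rng.standard_normal((K, P, P))
X = np.einsum('kpq,kqn->kpn', A, S)   # X[k] = A[k] @ S[k]

# Eq. (2): the p-th SCV stacks the p-th source from every dataset.
p = 0
s_p = S[:, p, :]   # shape (K, N): one K-dimensional random vector per sample
```

The dependence lives only inside each SCV: sources with the same index p are strongly correlated across datasets, while sources with different indices are independent, which is the structure IVA is designed to exploit and align.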