GENERATING A SYNTHETIC DATASET FOR KIDNEY TRANSPLANTATION USING GENERATIVE ADVERSARIAL NETWORKS AND CATEGORICAL LOGIT ENCODING

John Bartocci

A Thesis

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

May 2021

Committee:

Robert C. Green II, Advisor

Sankardas Roy

Qing Tian

Copyright © May 2021 John Bartocci
All rights reserved

ABSTRACT

Robert C. Green II, Advisor

A synthetic data set for kidney transplantation is developed using a Wasserstein generative adversarial network (WGAN) and a donor-recipient HLA matching algorithm. Like many medical data sets, much of the kidney transplant data set is categorical. A new method for dealing with categorical data in GANs is proposed and the results are analyzed. The real-valued data is prepared with a process similar to one-hot encoding, but instead of using ones and zeros, the values are logit in nature, with a large positive value corresponding to one and a large negative value corresponding to zero. By capturing the logit distributions and correlations between categories within a GAN, the generator can create a synthetic version of the data that will resemble the real, un-encoded data set after a softmax function is applied and values are stochastically selected. While the statistical metric used demonstrated that the synthetic and real data sets did not come from the same distribution, a visual inspection shows general similarity between the two data sets. A rematching simulation performed on the real and synthetic data set shows relatively similar results.

To my wife, for all the love and encouragement during these interesting times.

ACKNOWLEDGMENTS

I would like to thank Dr. Green for his support and guidance during my graduate career. His dedication to his students was an inspiration. I would also like to acknowledge the help that Dr. Bekbolsynov provided during the early phases of my thesis when working with drug regimens. I would also like to thank Dr. Tian and Dr. Roy for their valuable feedback on my thesis.

The data reported here have been supplied by the Hennepin Healthcare Research Institute (HHRI) as the contractor for the Scientific Registry of Transplant Recipients (SRTR). The interpretation and reporting of these data are the responsibility of the author(s) and in no way should be seen as an official policy of or interpretation by the SRTR or the U.S. Government. Principles of the Helsinki declaration were followed when working with SRTR data.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION ...... 1

CHAPTER 2 RELATED WORKS AND BACKGROUND INFORMATION ...... 4
    2.1 Generative Adversarial Neural Networks ...... 4
    2.2 Synthetic Electronic Health Records ...... 7
    2.3 Privacy Evaluation ...... 7
    2.4 Synthetic Data Quality Evaluation ...... 7
    2.5 Generation of Kidney Transplant Data ...... 8

CHAPTER 3 METHODOLOGY ...... 9
    3.1 Data Set Preparation ...... 9
        3.1.1 Data Cleaning ...... 10
        3.1.2 Induction Drug Columns ...... 10
        3.1.3 Maintenance Drug Columns ...... 13
        3.1.4 Data Set Summary ...... 14
    3.2 Method Overview ...... 16
    3.3 Categorical Encoding ...... 17
    3.4 WGAN with Logit-based Critic (WGAN-LC) ...... 18
    3.5 Random Sampling of Conditional Probability Distributions ...... 20
    3.6 Donor and Recipient Matching Algorithm ...... 22

CHAPTER 4 RESULTS ...... 24
    4.1 Platform ...... 24
    4.2 Patient Medical Profile Results ...... 25
    4.3 Age Assignment Results ...... 31
    4.4 Donor and Recipient Matching Results ...... 34
    4.5 Rematching Results ...... 40

CHAPTER 5 CONCLUSION ...... 42
    5.1 Future Work ...... 42

BIBLIOGRAPHY ...... 44

APPENDIX A SELECT PYTHON CODE FRAGMENTS ...... 48

LIST OF FIGURES

2.1 Generic GAN setup...... 5

3.1 Example of the process used to develop the induction drug columns...... 12
3.2 Example of the process used to develop the maintenance drug columns...... 16
3.3 Example of the process used to encode the categorical data...... 18
3.4 WGAN architecture using logit-encoded categorical data...... 19

4.1 Patient profile frequency distributions, part 1 of 5...... 26
4.2 Patient profile frequency distributions, part 2 of 5...... 27
4.3 Patient profile frequency distributions, part 3 of 5...... 28
4.4 Patient profile frequency distributions, part 4 of 5...... 29
4.5 Patient profile frequency distributions, part 5 of 5...... 30
4.6 Comparison between the real donor and recipient data CW1 HLA distributions...... 31
4.7 Age distribution plots for real and synthetic data sets...... 32
4.8 Age distribution plots for real and synthetic data sets for white males...... 33
4.9 Distribution plots for donor columns in matched data sets, part 1 of 2...... 36
4.10 Distribution plots for donor columns in matched data sets, part 2 of 2...... 37
4.11 Distribution plots for recipient columns in matched data sets, part 1 of 2...... 38
4.12 Distribution plots for recipient columns in matched data sets, part 2 of 2...... 39
4.13 Rematching results on real and synthetic data sets...... 41

LIST OF TABLES

3.1 Donor column descriptions and data types [4]...... 10
3.2 Candidate column descriptions and data types [4]...... 11
3.3 Transplant column descriptions and data types [4]...... 11
3.4 The disposition of missing data that was not dropped...... 12
3.5 Induction drug categories...... 13
3.6 Drugs categorized as "other"...... 13
3.7 Description of categories used to summarize the induction data...... 14
3.8 Maintenance drug categories...... 15
3.9 Maintenance regimens...... 15
3.10 Description of categories used to summarize the maintenance regimens...... 16
3.11 Patient information columns and their data types that are generated by the WGAN...... 21
3.12 Summary of the dimensions and activation functions in the generator architecture...... 21
3.13 Summary of the dimensions and activation functions in the critic architecture...... 21
3.14 Key values used in the generator and critic models...... 22

4.1 Fisher-Exact results for the patient profile data generated by the WGAN...... 25
4.2 Fisher-Exact results for the paired patient data...... 35

CHAPTER 1 INTRODUCTION

Machine learning has seen major advances in a wide variety of domains in recent years. While the medical domain has seen its share of advances, one major challenge that has impeded its progress is the lack of publicly available data sets. Medical data, by its very nature, routinely contains protected health information (PHI) and in the US is regulated by HIPAA privacy rules. According to [1], one possibility to solve this issue is through de-identification of the PHI. When de-identification is not feasible due to restrictions or disclosure risk, another option is generating a synthetic data set. Specifically, this research focused on assembling a synthetic data set for kidney transplantation. While generating a synthetic data set solves many of the privacy concerns, it is not a panacea for privacy [2]. Synthetic data also has its own challenges, such as evaluating the quality of the data set, handling categorical and time-series data, and architecture-specific challenges such as mode collapse with generative adversarial networks (GANs) [3]. While knowledge of the domain-specific nature of any given medical data set would be helpful, it is not a requirement for understanding the more general issues and methods presented in this thesis for handling and generating categorical data. The SRTR kidney transplant data set [4] has a significant number of categorical columns with many different levels. In order to generate a synthetic version of such a data set, the more general issue of synthesizing categorical data with GANs must be addressed. This research attempted to simplify the deployment of GANs for categorical data by training in the latent space that contains the continuous distribution of logits, encoding the real data in a logit-like manner. This eliminated the need for stochastic selection from a softmax during training or for a distribution that can be hardened via a temperature parameter, like the Gumbel-Softmax [5].
This thesis attempted to reduce the number of joint distributions that must be modeled at once by breaking the data generation up into functional units, emphasizing the joint distributions that matter most. This synthetic data generation treated the generation of the donor and recipient data separately and then worked to combine them using a matching simulation. This functional data aggregation should increase the level of privacy in the synthetic data set while attempting to maintain the most important joint distributions. While a statistical analysis is performed on the resultant data using a Fisher-Exact test, the limitations of statistical analysis on nominal categorical data necessitate some visual inspection of the distributions. The results of the Fisher-Exact test demonstrate that the synthetic data set generated by this method and another data set generated by the Synthetic Data Vault (SDV) [6] both fail to come from the same distribution as the real data set. The distribution plots, however, show that some of the distribution information has been captured with this method. Additionally, the machine learning efficacy test provides some hints that the synthetic data set can generate some machine learning insights for rematching algorithms. Considering these issues, this thesis makes three contributions regarding data related to kidney transplantation:

1. A proposal for a new method to train a WGAN using raw logits for categorical data as direct input to a critic;

2. A phased process to build up a synthetic data set using different methods; and

3. The application of the previous two contributions to partially generate a synthetic data set for kidney transplants that is similar to the actual data set.

Stated differently, this thesis centers on answering three questions:

1. Can a Wasserstein GAN (WGAN) generate categorical data while avoiding the need for Gumbel-Softmax by using a logit output layer on the generator and encoding the real data in a logit-like scheme?

2. Can a synthetic data set be built using an iterative method that combines generative and non-generative methods?

3. Can a synthetic data set be developed for kidney transplants that is statistically similar to the original data set?

In answering these questions, the remainder of this thesis is structured as follows: Chapter 2 covers related works on these topics; Chapter 3 details the methods used to generate and evaluate the synthetic data set; Chapter 4 covers the results of each phase of the iterative method as well as the overall results; and Chapter 5 concludes the thesis and highlights areas for future work.

CHAPTER 2 RELATED WORKS AND BACKGROUND INFORMATION

In considering the generation of synthetic data for use in the simulation and evaluation of kidney transplantation, there are many related topics that are of interest including:

• Generative Adversarial Neural Networks

• Synthetic Electronic Health Records

• Privacy Evaluation

• Synthetic Data Quality Evaluation

• Generation of Kidney Transplant Data

This chapter reviews these areas.

2.1 Generative Adversarial Neural Networks

GANs were introduced in [7]. At their introduction, they consisted of two neural networks, a generator and a discriminator, that face each other in a competitive game of sorts. The discriminator's role is to take in data as input and classify that data as coming from the training data or as an output of the generator. The generator takes in a noise vector as input, produces an output, and attempts to fool the discriminator into classifying the generated data as part of the training data set. The general setup can be seen in Fig. 2.1. Any advance that the generator makes comes at the detriment of the discriminator and vice versa. The two compete in this zero-sum game and, in the ideal case, find a Nash equilibrium. For the generator this equilibrium corresponds to an output whose probability distribution matches the training data set [7]. In the original GAN, the log-likelihood was used as the objective function for the neural networks, with the networks being updated via stochastic gradient descent (SGD) [7].

As work continued on GANs, other objective functions saw use, such as the Wasserstein Distance in what has been aptly named Wasserstein GANs (WGANs) [8]. The definition of the exact

Figure 2.1 Generic GAN setup.

Wasserstein Distance can be found in [3, 8], but is intractable. It is estimated using (2.1), where the generator G maps the noise vector z, drawn from the distribution p_z, into a generated distribution in an effort to minimize the expectation E of D(G(z)), while the discriminator maximizes the expectation of D(x) and the negative expectation of D(G(z)). The optimum for this is when p_g = p_data, the real data distribution [8]. This formulation requires the discriminator to have Lipschitz continuity, which can be enforced with weight clipping [8]. Alternative methods for enforcing Lipschitz continuity include adding a gradient penalty, as proposed in [9].

W(x, z) = \min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[D(x)] - \mathbb{E}_{z \sim p_z(z)}[D(G(z))] \qquad (2.1)
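In practice, the expectations in (2.1) are estimated with sample means over minibatches of critic scores. The following sketch is illustrative only; the function names and the NumPy formulation are assumptions, not the thesis's implementation:

```python
import numpy as np

def critic_objective(d_real, d_fake):
    """Empirical estimate of E[D(x)] - E[D(G(z))] from minibatch
    critic scores. The critic ascends this quantity while the
    generator descends it, per (2.1)."""
    return float(np.mean(d_real) - np.mean(d_fake))

def clip_weights(weights, c=0.01):
    """Weight clipping used in the original WGAN to enforce the
    Lipschitz constraint on the critic (a gradient penalty is the
    common alternative)."""
    return [np.clip(w, -c, c) for w in weights]
```

In a typical WGAN training loop, the critic is updated several times per generator step, with its weights clipped after each critic update.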

With typical GAN training using SGD, the layers of the generator need to be differentiable, but this has posed a challenge for GANs trying to model discrete data [3]. Another notable challenge for GANs is an issue called mode collapse, where the generator only models certain modes in the training data [3]. Attempts have been made to address this challenge by adding regularization terms to the objective function or changing the objective function entirely [3, 8]. Some machine learning methods, such as Random Forest, can generate synthetic categorical data without issue and have been used to generate partially synthetic data sets [10], but they don't generate fully synthetic data sets from noise vectors and require data to traverse the decision trees. In GANs, however, the difficulty with discrete data extends to categorical data. The work in [9] achieved some level of success by applying the softmax to the generator output without including a sampling step before sending the data to the critic. The real data was maintained as one-hot vectors and sent to the critic in that form. After training, samples were taken from the generator by applying the argmax function. There have been some methods established to overcome this limitation by relaxing the discrete random variables into continuous distributions [11], which can alternatively be viewed as applying the Gumbel-Softmax to the output layer of the generator [5]. The Gumbel-Softmax takes a continuous distribution and approaches a one-hot encoded vector representation as the temperature is lowered towards zero. By eliminating the stochastic selection required to transform a standard softmax into a one-hot representation, the layer remains differentiable and accommodates SGD. The Gumbel-Softmax is represented in (2.2), where y_i is an element in a sample vector y of length k, π_i is the probability of that element, g_i is a sample from the Gumbel(0, 1) distribution, and τ is the temperature. Another method for multi-categorical data was discussed in [12], which proposed adding, at the output of the model, a separate dense layer for each category, each followed by a Gumbel-Softmax, with the results concatenated back together to form the output.

y_i = \frac{\exp((g_i + \log \pi_i)/\tau)}{\sum_{j=1}^{k} \exp((g_j + \log \pi_j)/\tau)} \quad \text{for } i = 1, \ldots, k. \qquad (2.2)
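A NumPy sketch of sampling from (2.2) (illustrative only; the method proposed in this thesis avoids this layer entirely):

```python
import numpy as np

def gumbel_softmax(logits, tau, rng=np.random.default_rng(0)):
    """Sample a relaxed one-hot vector from class log-probabilities.

    logits: log(pi_i) for each of the k classes; tau: temperature.
    As tau -> 0 the output approaches a one-hot vector.
    """
    # g_i ~ Gumbel(0, 1) via the inverse CDF of uniform samples
    u = rng.uniform(1e-12, 1.0, size=len(logits))
    g = -np.log(-np.log(u))
    y = (g + logits) / tau
    y = np.exp(y - np.max(y))   # numerically stable softmax
    return y / y.sum()
```

Because the sample stays a continuous vector (rather than a hard one-hot selection), gradients can flow through it during SGD.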

In many ways, the proposed method takes the simplicity of sending continuous data to the critic (similarly to [9]), while transforming the real data into a continuous space more akin to the precursor values for Gumbel-Softmax methods [5]. Other alternatives would be abandoning SGD and updating network weights with other optimization methods like Particle Swarm Optimization (PSO) [13] or other non-gradient-based methods.

2.2 Synthetic Electronic Health Records

There are several notable works that have progressed the efforts to synthesize electronic health records (EHRs) using some form of GAN [14–18]. Autoencoders were used in [14, 17, 18] to decode the continuous outputs of the generator into discrete features. Frameworks like medGAN [14], medWGAN [17], and PATE-GAN [16] were limited to binary and count features and did not directly address the synthesis of more advanced data types, like multi-label categorical features. Even COR-GAN [18] was limited to discrete and continuous variables. Recent attempts to improve EHR synthetic data generation have employed the Wasserstein distance [15, 17, 18].

2.3 Privacy Evaluation

While not exclusively the domain of EHR, privacy metrics for synthetic data are a major focus in the synthetic EHR frameworks since even fully synthetic data sets can be prone to certain disclosure risks [19]. Two categories of risks are membership and disclosure attacks, and either one or both are measured in several of the discussed frameworks as a privacy metric [14, 15, 17, 18]. Some methods [16, 19] take a more mathematically rigorous approach to privacy guarantees. In [16], differential privacy is produced through training the discriminator with differentially private real data. This mechanism in essence produces noisy gradients which provide the privacy guarantees.

2.4 Synthetic Data Quality Evaluation

Evaluating the quality of synthetic data can take several forms. One broad category is a focus on statistical similarity via dimension-wise statistics and dimension-wise predictions [18, 20], while the other looks at the utility of the data. The utility approach is very use-case specific and is highly dependent on the goals of the synthetic data set. A utility-based approach might choose to focus on a synthetic-data-trained model's performance on real data [21] or on maintaining relative model performance over several models evaluated on real and synthetic data [16, 22]. The Synthetic Data Vault [6] has a suite of synthetic data quality metrics of both major types.

2.5 Generation of Kidney Transplant Data

Simulations that develop patient pools for kidney transplants have been developed [23], but typically do not generate patient HLA information and instead rely on a simplification that collapses the matching down to probabilities. While this can be effective to test certain kidney allocation programs, for survival analysis [24] and rematching simulations that involve items such as high-resolution HLA and immunogenicity [25], a more comprehensive set of patient data is required. This has left these types of analyses dependent on access to the real patient data.

CHAPTER 3 METHODOLOGY

The methodology proposed in this thesis leverages customized GANs (among other processes) in order to generate synthetic kidney transplantation data for use in simulation, etc. In order to accomplish this, multiple steps were needed including data set preparation, categorical data encoding, GAN training, donor and recipient patient pool generation, patient age generation, and donor-recipient matching.

3.1 Data Set Preparation

This study used data from the Scientific Registry of Transplant Recipients (SRTR). The SRTR data system includes data on all donors, wait-listed candidates, and transplant recipients in the US, submitted by the members of the Organ Procurement and Transplantation Network (OPTN). The Health Resources and Services Administration (HRSA), U.S. Department of Health and Human Services provides oversight to the activities of the OPTN and SRTR contractors.

The transplant data set obtained covers the years 1987 through 2019. The focus of this project was specifically on the kidney transplant subset of that data set. Transplant records were developed by joining various tables, including DONOR DECEASED, REC HISTO, CAND KIPA, and TX KI. Maintenance and induction drug categories were derived from IMMUNO and FOL IMMUNO. The details of the table linking are provided with the data set, or can be viewed online¹. All categorical columns were dictionary encoded using the SRTR 1912 Public SAFs Data Dictionary [4]. Only specific columns were extracted from the tables and can be broadly categorized into three groups: Donor Information in Table 3.1, Candidate Information in Table 3.2, and Transplant Information in Table 3.3. The bold columns in Table 3.3 were derived and not columns in the data set. The induction and maintenance drug tables were condensed down to form the columns for each transplant instance. The transplants were limited to those with kidneys from deceased donors, as that tends to be the focus of many pairing simulations. The pre-cleaned kidney transplant data set with deceased donors contained almost 470,000 records.

¹ https://www.srtr.org/assets/media/docs/SAFsLinkingDiagram.pdf

Table 3.1 Donor column descriptions and data types [4].

COLUMN NAME    TYPE           DESCRIPTION
DONOR ID       Numerical      Donor Identifier
DON AGE        Numerical      Donor Age in Years
DON GENDER     Categorical    Donor Gender
DON RACE       Categorical    Donor Race
DON ABO        Categorical    Donor Blood Type
DON A1         Categorical    Donor HLA - A (1) antigen
DON A2         Categorical    Donor HLA - A (2) antigen
DON B1         Categorical    Donor HLA - B (1) antigen
DON B2         Categorical    Donor HLA - B (2) antigen
DON CW1        Categorical    Donor HLA - CW (1) antigen
DON CW2        Categorical    Donor HLA - CW (2) antigen
DON DR1        Categorical    Donor HLA - DR (1) antigen
DON DR2        Categorical    Donor HLA - DR (2) antigen
DON DQ1        Categorical    Donor HLA - DQ (1) antigen
DON DQ2        Categorical    Donor HLA - DQ (2) antigen

3.1.1 Data Cleaning

The process of data cleaning involved dropping all records with missing data, with the exceptions and their dispositions listed in Table 3.4. Later, the candidate body mass index (BMI) values that were out of range were dropped. In hindsight, it would have been better to drop the BMI outliers before calculating the mean value for missing data replacement. While included in this section for completeness, the BMI values and other columns that were also exceptions were not used for later portions of this project.

3.1.2 Induction Drug Columns

To derive the recipient's induction drug category (IND CAT) and whether a steroid or prednisone was used for induction (IND PRED), the drug information for each patient from the IMMUNO table was used. Each drug was translated into a category via a dictionary lookup based on information provided in [26], a summary of which can be found in Table 3.5. Some drugs were categorized

Table 3.2 Candidate column descriptions and data types [4].

COLUMN NAME        TYPE           DESCRIPTION
PX ID              Numerical      Candidate Identifier
REC AGE AT TX      Numerical      Candidate Age at Date of Transplant
CAN GENDER         Categorical    Candidate Gender
CAN RACE           Categorical    Candidate Race
CAN ABO            Categorical    Candidate Blood Type
REC A1             Categorical    Candidate HLA - A (1) antigen
REC A2             Categorical    Candidate HLA - A (2) antigen
REC B1             Categorical    Candidate HLA - B (1) antigen
REC B2             Categorical    Candidate HLA - B (2) antigen
REC CW1            Categorical    Candidate HLA - Cw (1) locus
REC CW2            Categorical    Candidate HLA - Cw (2) locus
REC DR1            Categorical    Candidate HLA - DR (1) antigen
REC DR2            Categorical    Candidate HLA - DR (2) antigen
REC DQW1           Categorical    Candidate HLA - DQ (1) locus
REC DQW2           Categorical    Candidate HLA - DQ (2) locus
CAN EDUCATION      Categorical    Candidate Education Status
CAN PRIMARY PAY    Categorical    Source of Payment
CAN BMI            Numerical      Candidate Body Mass Index
CAN DIAB           Categorical    Candidate Diabetes Status
CAN DIAL           Categorical    Candidate Dialysis Status
CAN LISTING DT     datetime       Date Candidate added to Wait List

Table 3.3 Transplant column descriptions and data types [4].

COLUMN NAME         TYPE           DESCRIPTION
TX ID               Numerical      Transplant Identifier
REC HISTO TX ID     Numerical      Another Transplant Identifier
REC COLD ISCH TM    Numerical      Total Cold Ischemic Time for Organ
REC TX DT           datetime       Date of Transplant
TX ERA              Categorical    Calculated Column to Bin Transplant Date Ranges
IND CAT             Categorical    Category of Induction Drug Regimen
IND PRED            Categorical    Status of Steroid/Prednisone use for Induction
IS Summary          Categorical    Maintenance Regimen Category over the History
IS Discharge        Categorical    Maintenance Regimen Category at Discharge
Pred Summary        Categorical    Steroid/Prednisone use at any time for Maintenance
Pred Discharge      Categorical    Steroid/Prednisone use at Discharge

Table 3.4 The disposition of missing data that was not dropped.

COLUMN NAME        DISPOSITION
CAN DIAB           Replaced with code for unknown value
CAN DIAL           Replaced with code for unknown value
CAN EDUCATION      Replaced with code for unknown value
CAN PRIMARY PAY    Replaced with code for unknown value
CAN LISTING DT     Replaced with value from REC TX DT
CAN BMI            Replaced with mean BMI

as "other" and can be found in Table 3.6. These categories expand upon the induction categories in [24]. This categorization was used to one-hot encode the category for each patient entry if it was also tagged as used for induction in the REC DRUG INDUCTION column. Then, for each patient, the rows were reduced using a summation of the one-hot vectors to provide a single-row summary of all the drug categories used for induction. An example is provided in Fig. 3.1. The reduced rows were then analyzed to determine which category was appropriate according to the conditions in Table 3.7. A separate column exclusively tracks prednisone use for induction; it copies the prednisone column of the one-hot vector described above.
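The reduction illustrated in Fig. 3.1 amounts to grouping the per-drug rows by patient and summing one-hot indicators. A sketch under assumed column and category names (not the thesis code):

```python
from collections import Counter

# Hypothetical category labels; the real set comes from Table 3.5.
CATEGORIES = ["ATG", "IL-2 receptor inhibitor", "Alemtuzumab",
              "Prednisone", "Other"]

def reduce_induction_rows(rows):
    """Collapse per-drug rows into one count vector per patient.

    rows: iterable of (patient_id, category, used_for_induction).
    Returns {patient_id: Counter} counting only drugs flagged as
    used for induction -- equivalent to summing one-hot vectors.
    """
    summary = {}
    for patient_id, category, used_for_induction in rows:
        counts = summary.setdefault(
            patient_id, Counter({c: 0 for c in CATEGORIES}))
        if used_for_induction:
            counts[category] += 1
    return summary
```

Each patient's count vector can then be mapped to an IND CAT value using the conditions of Table 3.7.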

Figure 3.1 Example of the process used to develop the induction drug columns.

Table 3.5 Induction drug categories extracted from the Excel file provided by [26].

Drug name                       Induction category 1       Induction category 2
ALG                             T-cell depleting agent     ATG
OKT4                            T-cell depleting agent     ATG
Atgam                           T-cell depleting agent     ATG
NRATG/NRATS                     T-cell depleting agent     ATG
OKT3 (Orthoclone, muromonab)    T-cell depleting agent     ATG
Anti-LFA-1                      LFA1 blocker               LFA1 blocker
IL-1 Receptor Antagonist        IL1 blocker                IL1 blocker
T10B9 (Medimmune)               T-cell depleting agent     ATG
Thymoglobulin                   T-cell depleting agent     ATG
Zenapax                         IL-2 receptor inhibitor    IL-2 receptor inhibitor
Simulect ()                     IL-2 receptor inhibitor    IL-2 receptor inhibitor
Campath ()                      CD52 antagonist            Alemtuzumab
Rituxan ()                      CD20 antagonist            CD20 antagonist
Nulojix ()                      CTLA-4 analog              CTLA-4 analog

Table 3.6 Drugs categorized as "other" extracted from the Excel file provided by [26].

Drug name
Anti-ICAM-1
Leflunomide (LFL)
Cytoxan (cyclophosphamide)
(Folex PFS, Mexate-AQ, Rheumatrex)
(Bredinin)
Xoma Zyme-CD5+
DAB486-IL-2
Anti-IL-6
Anti-TNF
Soluble IL-1 Receptor
Aldesleukin (IL-2)
Deoxyspergualin (DSG, 15-DSG, Gusperimus, Spanidin)
FTY 720

3.1.3 Maintenance Drug Columns

Developing the maintenance drug categorization was more involved than the induction drugs, as it deals with a time series of treatment regimens. As in the case of the induction drugs, each

Table 3.7 Description of categories used in IND CAT to summarize the induction data.

CATEGORY             CONDITION
No Induction         Only zero values recorded
Prednisone Only      Only non-zero value is prednisone
specific category    If only one non-prednisone category has a non-zero value, that specific category is listed.
specific category    If only one non-prednisone category has a non-zero value, in addition to one "other" categorized drug, that specific category is listed.
Multiple             For non-zero values not included in the above conditions
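The conditions in Table 3.7 amount to a small decision rule. A sketch, where the function name, the dictionary shape, and the category labels are assumptions for illustration:

```python
def induction_category(counts):
    """Map a patient's per-category drug counts to an IND CAT value
    following the conditions of Table 3.7. 'Prednisone' and 'Other'
    are treated specially."""
    active = {c for c, n in counts.items() if n > 0}
    if not active:
        return "No Induction"
    if active == {"Prednisone"}:
        return "Prednisone Only"
    specific = active - {"Prednisone", "Other"}
    if len(specific) == 1:
        # one specific category, optionally alongside prednisone/'other'
        return specific.pop()
    return "Multiple"
```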

drug was translated into a category via a dictionary lookup based on information provided in [26], a summary of which can be found in Table 3.8. The treatment regimens are a combination of these categories and also expand upon the maintenance categories in [24]. This categorization was used to one-hot encode the category for each patient entry if it was also tagged as used for a current maintenance regimen in the TFL IMMUNO DRUG MAINT CUR or REC DRUG MAINT columns. Instead of reducing all the rows down to one for each patient, the rows are reduced by patient and time code. Each row is then translated into a treatment regimen based on the combination of drug categories used. The maintenance regimens are listed in Table 3.9. This time series of treatment regimens is then condensed down into the treatment regimen at discharge and a historical summary of treatment regimens. The historical summary is assigned in one of three ways listed in Table 3.10. There are two other columns that track the use of prednisone for maintenance. One tracks if prednisone was used initially after discharge from transplant surgery. The other tracks if prednisone was used at any time during the time series of maintenance regimens. An example is provided in Fig. 3.2; note that the column TFL FOL CD tracks the time code bins, with 10 being time of discharge.

3.1.4 Data Set Summary

Out of the almost 470,000 kidney transplants listed in the original data, only 58,656 entries survived the cleaning process. The final version of the data set consists of 46

Table 3.8 Maintenance drug categories extracted from the Excel file provided by [26].

Drug name                                    Maint. Category 1        Maint. Category 2
Imuran (, AZA)                               Antimetabolite           Aza
CellCept (MMF)                               Antimetabolite           Aza
Brequinar Sodium (BQR)                       Antimetabolite
Myfortic ()                                  Antimetabolite           MMF
Generic MMF (generic CellCept)               Antimetabolite           MMF
Generic Mycophenolic Acid                    Antimetabolite           MMF
Sandimmune                                   Calcineurin inhibitor    CsA
Neoral                                       Calcineurin inhibitor    CsA
Prograf (Tacrolimus)                         Calcineurin inhibitor    TAC
Cyclosporin                                  Calcineurin inhibitor    CsA
Sang Cy A                                    Calcineurin inhibitor    CsA
Gengraf                                      Calcineurin inhibitor    CsA
EON (generic cyclosporine)                   Calcineurin inhibitor    CsA
Generic cyclosporine                         Calcineurin inhibitor    CsA
Astagraf XL (extended release tacrolimus)    Calcineurin inhibitor    TAC
Generic tacrolimus (generic Prograf)         Calcineurin inhibitor    TAC
Envarsus XR (tacrolimus XR)                  Calcineurin inhibitor    TAC
Rapamune ()                                  mTOR inhibitor           mTOR inhibitor
Zortress ()                                  mTOR inhibitor           mTOR inhibitor
Generic sirolimus                            mTOR inhibitor           mTOR inhibitor
Prednisone                                   Steroids
Methylprednisolone                           Steroids
Steroids                                     Steroids

Table 3.9 Maintenance regimens [26] as an expanded list from the regimens used in [24].

REGIMEN
TAC & MMF
TAC
CsA & AzA
CsA
TAC & AzA
TAC & mTOR
Other
None

Figure 3.2 Example of the process used to develop the maintenance drug columns.

Table 3.10 Description of categories used in IS Summary and IS Discharge to summarize the maintenance regimens.

CATEGORY            CONDITION
No Maintenance      If no non-null regimens exist
specific regimen    If only one regimen is used, the specific regimen is listed
Multiple            If more than one regimen is used
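The rules of Table 3.10 reduce to a few lines of code. A sketch with assumed names (not the thesis implementation):

```python
def summarize_regimens(regimens):
    """Condense a time-ordered list of maintenance regimen labels
    into an IS Summary value per Table 3.10. None entries represent
    time codes with no recorded regimen."""
    used = {r for r in regimens if r is not None}
    if not used:
        return "No Maintenance"
    if len(used) == 1:
        return used.pop()      # the single specific regimen
    return "Multiple"
```

The discharge value (IS Discharge) would simply be the regimen recorded at the discharge time code (TFL FOL CD of 10).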

columns, 36 categorical, 8 numerical, and 2 dates. The categorical column for the transplant era, TX ERA, was not included in this final version, but could be derived from the included transplant date.

3.2 Method Overview

The development of the synthetic data set was done in several phases:

• Phase I. The patient race, gender, blood type, and HLA information were generated using a WGAN. Two patient pools were produced, one for donors and one for recipients;

• Phase II. Patient age was added to each record by randomly sampling from the probability distribution conditioned on race and gender;

• Phase III. An existing kidney transplant re-matching algorithm [25] was re-purposed to provide the patient matching between the synthetic donor and recipient pools.

At the end of the donor-recipient matching, the resultant synthetic data set contained race, gender, blood type, and HLA information for the donor-recipient pairs. One caveat of the final synthetic data set is that the age information was lost during the rematching algorithm. This could be rectified by either modifying the rematching algorithm or applying the age phase after the matching algorithm. Also, the race information was converted from a higher-resolution, numerically encoded categorical value to a lower-resolution categorical column. A statistical analysis was performed using the Fisher-Exact test, and distribution plots were used as visualizations.
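For a 2×2 contingency table, the Fisher-Exact p-value can be computed directly from the hypergeometric distribution. A self-contained sketch for intuition (the thesis's analysis presumably used a library routine such as SciPy's; this function is illustrative):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def p_table(x):
        # P(first cell = x) given the fixed row and column margins
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p_table(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(p for x in range(lo, hi + 1)
               if (p := p_table(x)) <= p_obs + 1e-12)
```

Exact tests like this are attractive for nominal categorical data because they make no large-sample assumptions, though, as noted above, they cannot capture everything that visual inspection of the distributions reveals.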

3.3 Categorical Encoding

To facilitate modeling of categorical variables without the Gumbel-Softmax used in [21], this thesis proposes a method to encode the categorical data of the real data set that eliminates the need for a Gumbel-Softmax at the output of the generator and lets the discriminator deal directly with the logit data coming out of the generator. The encoding process takes a categorical column and makes a separate column for each unique category, similar to one-hot encoding. Instead of encoding a value with a one in its respective column, a large positive number is used; instead of zeros in the other columns, a large negative number is used. See Listings 4 and 5 in Appendix A for an implementation of a helper class that stores the encoding and decoding information. A value of ±10 produced an encoding that was reversible with the use of a softmax function to reproduce the original data. While the donor data reversed without error in this trial, reversibility is not mathematically guaranteed. The value of ten was chosen to be as small as possible while still successfully reversing the donor data without error; this value can be changed to bound the probability of complete reversibility given the number of categorical columns, the number of unique categories in each, and the number of rows in the data set. Noise was introduced into the encoding by adding, to each value, a random variable drawn from a normal distribution centered at zero with a standard deviation of one, ten percent of the chosen encoding scale of ten. See Listing 1 in Appendix A for the noising code, which uses the NumPy module [27]. Noise was added to mitigate the potential for the discriminator to use the uniformity of the real data against the generated data. An abbreviated example is provided in Fig. 3.3.

Figure 3.3 Example of the process used to encode the categorical data.
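The encode/decode round trip described above can be sketched in a few lines. This is a toy illustration, not the thesis implementation; the column name and category values are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy categorical column (name and values invented for illustration).
df = pd.DataFrame({"ABO": ["A", "O", "B", "O"]})
cats = sorted(df["ABO"].unique())        # ['A', 'B', 'O']
scale = 10.0

# One column per category: +scale where the row matches, -scale elsewhere.
encoded = np.where(df["ABO"].to_numpy()[:, None] == np.array(cats)[None, :],
                   scale, -scale)

# Decoding: softmax each row into probabilities, then sample a category.
shifted = encoded - encoded.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
decoded = [cats[np.argmax(rng.multinomial(1, p))] for p in probs]
```

At a scale of ±10 the softmax places essentially all probability mass on the encoded category, which is why the stochastic selection reverses the encoding with near certainty.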

3.4 WGAN with Logit-based Critic (WGAN-LC)

The donor and candidate patient profiles share a number of columns. In addition to the categorical columns listed in Table 3.11, the patient profiles both have two numerical columns, patient age and an ID number. Both numerical columns are added in a later phase. This leaves thirteen categorical columns to generate. The categorical data was encoded using the method described in the previous section.

A WGAN was chosen due to the success on tabular data experienced with WGAN variants [14, 15, 17, 21]. The basic architecture of the implemented WGAN can be seen in Fig. 3.4. The WGAN was implemented using TensorFlow [28]. The Wasserstein loss implementations can be seen in Listing 2 in Appendix A.

Figure 3.4 WGAN architecture using logit-encoded categorical data. Dotted arrows correspond to the unencoded real data input and fake data output.

The generator and critic architectures consist of two hidden layers and use a leaky ReLU activation function. The details of the generator architecture can be seen in Table 3.12 and of the critic in Table 3.13. The exact architecture and hyperparameters did not undergo rigorous tuning and were instead set based on experience. Other design choices of the GAN are summarized in Table 3.14. The critic was trained five times per training step with weight clipping, while the generator was trained only once per step. One way to accomplish the weight clipping in TensorFlow can be seen in Listing 3 in Appendix A. The generator output and discriminator input dimension of 369 corresponds to the encoded size of the selected categorical columns.

Training was performed over 200,000 epochs on the donor records. The training was monitored at approximately 50,000-epoch intervals, and the decision to continue training was based on experience. To facilitate a form of temperature annealing of the data, the encoded values of the real data started at ±0.5 and were linearly scaled up to ±10.0 over the first 500 epochs to help facilitate separation between the two modes. Once the model was trained, the generator was used to generate two separate sets of 58,656 samples, corresponding to the size of the real data set after cleaning. One synthetic set was compared to the donor data while the other was compared to the recipient data. The synthetic and real column distributions were plotted for visual comparisons.

Initially, the plan was to train separate models for the donors and recipients. As such, the first model was trained on only the donor data. Once the donor generator was trained, samples were generated and analyzed for suitability for both donors and recipients. The donor-trained generator proved suitable for donors and, to some extent, recipients.
There were some differences between the donor and recipient information, including in the CW1 antigen column. Since CW1 was not used in the subsequent matching algorithm, the CW1 columns were dropped from both the donor and recipient pools for convenience. Other differences, observable in the results in Chapter 4, are in blood type and, to a minor extent, in all HLA columns. A separate network trained on the recipient data is highly desirable for future work, but was not implemented in this research.
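The linear annealing schedule described above can be sketched as a small helper. The function name, signature, and defaults are my own; this illustrates the schedule rather than reproducing the thesis code.

```python
import numpy as np

def anneal_real_logits(encoded, epoch, start=0.5, end=10.0, warmup=500):
    """Linearly rescale real-data logits from ±start to ±end over the first
    `warmup` epochs. Hypothetical helper illustrating the schedule in the
    text; `encoded` is assumed to be stored at full magnitude ±end."""
    scale = end if epoch >= warmup else start + (end - start) * epoch / warmup
    return np.asarray(encoded) * (scale / end)
```

Early in training the real logits sit near ±0.5, so the two modes are close together; by epoch 500 they have spread to the full ±10 used by the encoder.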

3.5 Random Sampling of Conditional Probability Distributions

The numerical column of patient age was added to the donor and recipient data by means of random selection from the real conditional data distributions. Separate age distributions were

Table 3.11 Patient information columns and their data types that are generated by the WGAN.

CANDIDATE COLUMN   DONOR COLUMN   TYPE
CAN GENDER         DON GENDER     Categorical
CAN RACE           DON RACE       Categorical
CAN ABO            DON ABO        Categorical
REC A1             DON A1         Categorical
REC A2             DON A2         Categorical
REC B1             DON B1         Categorical
REC B2             DON B2         Categorical
REC CW1            DON CW1        Categorical
REC CW2            DON CW2        Categorical
REC DR1            DON DR1        Categorical
REC DR2            DON DR2        Categorical
REC DQW1           DON DQ1        Categorical
REC DQW2           DON DQ2        Categorical

Table 3.12 Summary of the dimensions and activation functions in the generator architecture.

Layer     Size   Activation
Input     100    None
Hidden1   200    Leaky ReLU
Hidden2   369    Leaky ReLU
Output    369    None

Table 3.13 Summary of the dimensions and activation functions in the critic architecture.

Layer     Size   Activation
Input     369    None
Hidden1   369    Leaky ReLU
Hidden2   100    Leaky ReLU
Output    1      None

Table 3.14 Key values used in the generator and critic models.

                            GENERATOR   CRITIC
Leaky ReLU negative slope   0.1         0.1
Weight clipping             None        ±0.01
Learning rate               0.00001     0.00001
Optimizer                   RMSprop     RMSprop

captured for donors and recipients, and those distributions were further conditioned on race and gender. This provided a level of data aggregation while maintaining at least some important joint distributions. An age was randomly chosen using the applicable distribution for each row in a donor or recipient table. For race-gender combinations with fewer than 10 samples in the real data, the non-conditional probability distribution was used to preserve privacy. Upon review of the final data set, it is clear that this age data was lost during the matching algorithm used in the next phase. Since some analysis was performed at the end of each phase, results of using this method are still presented in Chapter 4.
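The conditional sampling just described can be sketched as follows. The function and column names are illustrative assumptions, not the thesis code, but the logic (resample real ages per race-gender group, falling back to the unconditional distribution for small groups) mirrors the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def sample_ages(real, synth, min_group=10):
    """Assign ages to synthetic rows by resampling real ages conditioned on
    (race, gender), falling back to the unconditional distribution for
    combinations with fewer than `min_group` real samples. A sketch with
    illustrative column names, not the thesis implementation."""
    pools = {key: grp["AGE"].to_numpy()
             for key, grp in real.groupby(["RACE", "GENDER"])}
    all_ages = real["AGE"].to_numpy()
    ages = []
    for _, row in synth.iterrows():
        pool = pools.get((row["RACE"], row["GENDER"]), all_ages)
        if len(pool) < min_group:
            pool = all_ages          # privacy fallback for rare combinations
        ages.append(rng.choice(pool))
    return pd.Series(ages, index=synth.index, name="AGE")
```

Because each synthetic age is drawn from an empirical pool, the synthetic marginal should track the real one up to sampling noise, which is why the deviations reported in Chapter 4 point at the implementation rather than the method.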

3.6 Donor and Recipient Matching Algorithm

To create the donor and recipient pairs, a rematching algorithm from [25] was used to match the donor and recipient pools. Some code modifications were required to use the low-resolution HLA data that the synthetic and real data sets contained. Since the rematching code was designed to produce the best possible matches within the given patient pools, using the original rematching code, modified for low-resolution HLA, produced pairs skewed towards low HLA mismatches when compared to the HLA mismatch distribution of the real data set. Additionally, the rematching code only checked blood type compatibility, which did not correspond to the distributions in the real data set, which were biased towards exact blood type matches.

To adjust the HLA mismatch distribution to more closely resemble the real data set, an HLA mismatch value was randomly chosen from the real calculated distribution for each recipient, and the first donor that matched that mismatch value exactly was chosen. To adjust the blood type matching, a new hyperparameter was introduced that set the fraction of the time that an exact blood type match was required when pairing, instead of blood type compatibility alone. A value of 0.25 was chosen manually, by trial and error, to produce blood type pairings similar to the real data set. This blood type restriction was not used when an exact HLA match was searched for, as it proved too restrictive and underrepresented perfect HLA matches.

To allow for the more restrictive matching, it was necessary to have more donors than recipients. A recipient pool of 10,000 was sampled from a pool of almost 60,000, whereas the donor pool consisted of 50,000 samples from a pool of almost 60,000. An abbreviated version of the adapted donor-recipient matching code is in Listing 6 of Appendix A. The race columns for the donor and recipient pools, DON RACE and CAN RACE, had to be translated into lower-resolution race columns, DON RACE SRTR and CAN RACE SRTR.
These RACE SRTR columns happen to also be in the real data set, but were not originally slated for capture and generation. This translation was performed with a simple translation function that assigned the five race codes for White, Black, Asian, Native, and Pacific to those categories, and grouped all other race codes under a catch-all “Multi” category.
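A translation of this kind can be sketched as a dictionary lookup with a default. The numeric codes below are placeholders for illustration, not the actual SRTR code book.

```python
# Hypothetical numeric race codes mapped to the five named SRTR categories;
# every other code collapses into the catch-all "Multi" bucket.
RACE_SRTR_MAP = {8: "White", 16: "Black", 64: "Asian",
                 32: "Native", 128: "Pacific"}

def to_race_srtr(code):
    return RACE_SRTR_MAP.get(code, "Multi")
```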

CHAPTER 4 RESULTS

Between each phase (described in the previous chapter), an analysis was performed to evaluate the method used for that particular phase. The initial intention was to apply the chi-squared test as the primary goodness-of-fit metric, as it is well suited to categorical data. The chi-squared goodness-of-fit test turned out to be inappropriate, however, since fewer than 5 expected or observed samples existed for some levels in a category. Instead, the Fisher-Exact test was performed using the stats library in R [29] through the Python library rpy2 [30]. The statistical analysis was complemented by a visual analysis of the distribution plots. These analyses include:

• The WGAN with logit-based critic (WGAN-LC), which produced the patient medical profile of HLAs, blood type, gender, and race;

• The conditional distributions that generated and assigned an age based on gender and race; and

• An evaluation of the matching algorithm that generated donor and recipient pairs.
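As noted above, the chi-squared test was ruled out because some expected counts fell below five. That condition can be checked directly: the expected counts of a real-vs-synthetic contingency table are the outer product of its margins divided by the total. The counts below are invented for illustration.

```python
import numpy as np

def expected_counts(table):
    """Expected cell counts for a chi-squared test of independence on a
    contingency table (rows: real/synthetic, columns: category levels)."""
    table = np.asarray(table, dtype=float)
    rows = table.sum(axis=1, keepdims=True)
    cols = table.sum(axis=0, keepdims=True)
    return rows * cols / table.sum()

observed = np.array([[5000, 300, 2],     # real counts per level (invented)
                     [4900, 410, 1]])    # synthetic counts (invented)
# A rare third level drives an expected count below 5, invalidating chi-squared.
assert (expected_counts(observed) < 5).any()
```

Rare HLA antigens behave exactly like the third level here, which is what pushed the analysis towards the Fisher-Exact test.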

For a final comparison, two Fisher-Exact test benchmarks were developed: one using a random sampling from the real data set, the second using a synthetic data set generated with the Gaussian Copula from the Synthetic Data Vault (SDV) [6]. For a utility metric, the real and final synthetic data sets were both run through an unmodified, low-resolution HLA rematching algorithm with a sample size of 1000, and their HLA mismatch score distributions were compared.
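For context, the Gaussian-copula baseline works, in outline, by mapping each column to normal scores, capturing their correlation, sampling correlated normals, and mapping back through the empirical quantiles. The sketch below illustrates that idea for numeric columns only; it is not the SDV implementation, which also handles discrete columns.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
nd = NormalDist()

def gaussian_copula_sample(data, n):
    """Sample n rows resembling `data` (an m x d numeric array) via a
    Gaussian copula: ranks -> normal scores -> correlated normals ->
    empirical quantiles. Illustrative sketch only."""
    data = np.asarray(data, dtype=float)
    m, d = data.shape
    ranks = data.argsort(axis=0).argsort(axis=0)   # 0..m-1 within each column
    u = (ranks + 0.5) / m                          # uniform scores in (0, 1)
    z = np.vectorize(nd.inv_cdf)(u)                # normal scores
    corr = np.corrcoef(z, rowvar=False)            # dependence structure
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n)
    u_new = np.vectorize(nd.cdf)(z_new)
    # Back-transform through each column's empirical quantiles.
    return np.column_stack([np.quantile(data[:, j], u_new[:, j])
                            for j in range(d)])
```

Because every generated value passes through the empirical quantiles, the marginals of the output closely track the input while the copula preserves the pairwise dependence.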

4.1 Platform

All computational work was performed on a BGSU server approved for PHI storage and with the following hardware specifications:

• Intel Xeon E5-2620 (x2)

• 32GB RAM (4GB x8)

• 1TB 7200RPM HDD

• NVIDIA Tesla K20

4.2 Patient Medical Profile Results

The WGAN using the logit-encoded real data produced the following results after training on the donor data for 200,000 epochs. For the donor data, a synthetic sample of 59,856 records was generated and compared to the real data of the same size via a Fisher-Exact test. The Fisher-Exact results are in Table 4.1. With the p-values less than 0.05, we can reject the null hypothesis that the real and synthetic data are drawn from the same distribution.

Table 4.1 Fisher-Exact results for the patient profile data generated by the WGAN.

Column       p-value   Column       p-value
DON A1       0.0005    REC A1       0.0005
DON A2       0.0005    REC A2       0.0005
DON B1       0.0005    REC B1       0.0005
DON B2       0.0005    REC B2       0.0005
DON DR1      0.0005    REC DR1      0.0005
DON DR2      0.0005    REC DR2      0.0005
DON ABO      0.0005    CAN ABO      0.0005
DON DQ1      0.0005    REC DQW1     0.0005
DON DQ2      0.0005    REC DQW2     0.0005
DON GENDER   0.0000    CAN GENDER   0.0000

Given that the distributions are not statistically similar enough to say they come from the same distribution, a visual analysis of the distributions was performed. The categories were reindexed and ordered in descending order of frequency based on the real data distribution for each column. It is important to note that the order of reindexed categories is not guaranteed to be the same between the donor and recipient data. While the y-axis has been normalized to frequency, note that there are almost 60,000 samples; frequency deviations that might have been statistically acceptable at low sample sizes are further restricted at high sample sizes. The visualizations for the race data and CW HLA are omitted here, as the race had already been translated in Phase III and the CW antigen dropped before this visualization was created. The distribution plots can be seen in Figs. 4.1 – 4.5.
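The reindexing used for these plots can be sketched as follows (the helper name is illustrative): category levels are ranked by their frequency in the real column, and both the real and synthetic columns are mapped onto that shared integer index.

```python
import pandas as pd

def reindex_by_real_frequency(real_col, synth_col):
    """Map both columns onto integer indices ordered by descending
    frequency in the real data (0 = most frequent real level)."""
    order = real_col.value_counts().index.tolist()   # most frequent first
    mapping = {level: i for i, level in enumerate(order)}
    return real_col.map(mapping), synth_col.map(mapping)

real = pd.Series(["O", "A", "O", "B", "A", "O"])     # toy category column
synth = pd.Series(["A", "O", "B"])
r_idx, s_idx = reindex_by_real_frequency(real, synth)
```

Since the ordering is derived from each real column independently, the same integer index can denote different antigens in the donor and recipient plots, which is the caveat noted above.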

Figure 4.1 Patient profile frequency distributions, part 1 of 5.

Figure 4.2 Patient profile frequency distributions, part 2 of 5.

Figure 4.3 Patient profile frequency distributions, part 3 of 5.

Figure 4.4 Patient profile frequency distributions, part 4 of 5.

Figure 4.5 Patient profile frequency distributions, part 5 of 5.

Comparing the blood type graphs for the donor and recipient, it is clear that there are significant differences between the real data of the two, even without identifying the exact blood types. There are also differences between the HLA distributions of the real donors and recipients. This can be seen with the CW HLA, for example, in Fig. 4.6, which compares the real CW1 data for the donors and recipients. A separate generator could be trained on the recipient data to improve the results.

Figure 4.6 Comparison between the real donor and recipient data CW1 HLA distributions.

4.3 Age Assignment Results

The Fisher-Exact results for the donor and recipient age data were both 0.0005. With the p-values less than 0.05, we can reject the null hypothesis that the real and synthetic ages are drawn from the same distribution. At first glance this might seem a surprising result, considering the method of generation was random sampling from the real distribution. A deeper consideration of the method should focus on the conditional distributions of age on race and gender. Inspecting the age distribution results for white males in Fig. 4.8, there seems to be a bias of the synthetic data towards the higher ages. This might indicate an issue in the random sampling implementation used in this method. The deviations in the overall age distribution are some combination of the distribution anomalies in race and gender and potential issues with the random sampling implementation.

Figure 4.7 Age distribution plots for real and synthetic data sets. Donor plot on top and recipient plot on bottom.

Figure 4.8 Age distribution plots for real and synthetic data sets for white males. Donor plot on top and recipient plot on bottom.

4.4 Donor and Recipient Matching Results

To utilize the matching algorithm to pair the donor and recipient data, the race columns had to be converted to RACE SRTR columns. Since the RACE SRTR columns have fewer race categories, it is worth checking the distributions again with the Fisher-Exact test before the matching process. The Fisher-Exact result for the RACE SRTR columns is 0.0005 for both donor and recipient.

The pairing of donors and recipients provides an opportunity to look at the Fisher-Exact results for the pairs. The distribution plots should also be reevaluated, since the pairing algorithm used 1000 recipients and 10,000 donors to produce 1000 donor-recipient pairs, which could change the donor distributions. Unsurprisingly, the Fisher-Exact results were the same as in previous phases, with the phased approach of WGAN and matching algorithm achieving a p-value of 0.0005 in all columns. The Fisher-Exact results for the synthetic data set generated by the SDV [6] with Gaussian copulas also had a p-value of 0.0005 in all columns. With the p-values being less than 0.05, the null hypothesis that the real and synthetic data sets come from the same distribution can be rejected for both synthetic data sets. The statistical results are summarized in Table 4.2. The distribution plots for the donor columns can be viewed in Figs. 4.9 and 4.10 and for the recipient columns in Figs. 4.11 and 4.12.

It is worth noting that several variables were lost during the matching algorithm, including age, gender, and the DQ HLA, and were not included in the statistical tests after matching. Had the CW HLA data not been dropped previously, it too would have been lost in this step. This could be corrected in future work by further modifying the matching algorithm to retain those columns.

Table 4.2 Fisher-Exact results for the paired patient data.

Column          WGAN+M p-value   SDV p-value
DON A1          0.0005           0.0005
DON A2          0.0005           0.0005
DON B1          0.0005           0.0005
DON B2          0.0005           0.0005
DON DR1         0.0005           0.0005
DON DR2         0.0005           0.0005
DON ABO         0.0005           0.0005
DON RACE SRTR   0.0005           0.0005
REC A1          0.0005           0.0005
REC A2          0.0005           0.0005
REC B1          0.0005           0.0005
REC B2          0.0005           0.0005
REC DR1         0.0005           0.0005
REC DR2         0.0005           0.0005
CAN ABO         0.0005           0.0005
CAN RACE SRTR   0.0005           0.0005

Figure 4.9 Distribution plots for donor columns in matched synthetic and real data sets, part 1 of 2.

Figure 4.10 Distribution plots for donor columns in matched synthetic and real data sets, part 2 of 2.

Figure 4.11 Distribution plots for recipient columns in matched synthetic and real data sets, part 1 of 2.

Figure 4.12 Distribution plots for recipient columns in matched synthetic and real data sets, part 2 of 2.

4.5 Rematching Results

The rematching results for the real and synthetic data yielded two improved and visually similar results. The real data improved from an average mismatch score of 4.03 to 1.77, and the synthetic data improved from 4.06 to 1.53. Not reflected in the scores are the 149 and 148 unmatched donor-recipient pairs for the real and synthetic data, respectively. The original mismatch distribution at the top of Fig. 4.13 can also be inspected as a derived column of the original matching in Phase III, which also shows visually similar results between the real and synthetic data. Overall, it seems this synthetic data set may provide insights for rematching models that are translatable to real-world data performance. The synthetic data show that the current greedy algorithm makes better matches, with a mode of 2 mismatches, but leaves about 15% of the pairs unmatched in a 1000-recipient by 1000-donor rematching trial.
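For reference, the mismatch score underlying these distributions is computed per pair as in Listing 6: the number of the donor's six low-resolution antigens (A1, A2, B1, B2, DR1, DR2) absent from the recipient's six. The antigen values below are invented for illustration.

```python
# Illustrative low-resolution antigen sets (values invented):
donor_hla = {"A2", "A24", "B7", "B8", "DR4", "DR15"}
recipient_hla = {"A2", "A3", "B7", "B44", "DR4", "DR7"}

# Mismatch score: donor antigens absent from the recipient's set.
mismatches = len(donor_hla - recipient_hla)
```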

Figure 4.13 Rematching results on real and synthetic data sets. Note that 7 mismatches represents the unmatched pairs.

CHAPTER 5 CONCLUSION

Significant progress was made towards developing a synthetic data set for kidney transplantation. The kidney transplant data set is highly categorical, which has been a challenge for GAN-type architectures. This thesis functioned as a proof of concept for a new method to encode categorical data using a logit-based encoding that removes the need for a Gumbel-Softmax at the output layer of the generator. It also demonstrated that an iterative approach that divided the data into functional groups and assembled them in phases was feasible. While the Fisher-Exact test showed that the synthetic data set was not statistically similar to the real data, the distribution plots displayed significant similarities between the two data sets. The results of the rematching algorithm performed on the completed synthetic and real data sets demonstrate that exact statistical similarity is not necessary to achieve useful machine learning results. While a complete synthetic data set was not finished within the bounds of this project, these methods and results demonstrate that such a synthetic data set is achievable with a combination of WGANs and matching algorithms.

5.1 Future Work

There are several avenues available for future work. Focusing on the kidney transplant data set, work should continue on expanding to the columns listed in Tables 3.1 – 3.3 and on improving the columns generated in this project. One such improvement would be a recipient-trained generator during Phase I of patient data generation. In general, the statistical and utility metrics used to evaluate the synthetic data set should be expanded, with a focus on machine learning efficacy metrics.

It would also be interesting to expand the patient generator to produce complete donor-recipient pairs and evaluate whether the matching algorithm phase could be bypassed completely. It is unclear whether the blood type compatibility and HLA mismatch distributions would be maintained, but if they are, it would simplify the generation process significantly. This could potentially be expanded to all categorical columns.

A proper privacy analysis could be performed on the synthetic data set to evaluate the privacy of the proposed method. This could then be compared to other methods with empirical privacy measurements, such as EMR-WGAN [14] and medWGAN [17], and also to ones specifically designed with differential privacy guarantees, such as CorGAN [18] and PATE-GAN [16].

Another focus of future work would be a deeper investigation into using the logit encoding as a means to replicate categorical data in a GAN architecture. Significant benchmarking could be performed versus alternatives such as the Gumbel-Softmax. In the domain of benchmarking, a comparative analysis could be performed between the WGAN using the logit encoding and other synthetic data generators, such as CTGAN [21], Gaussian copulas, and Tabular Variational Autoencoders (TVAE) [6].

BIBLIOGRAPHY

[1] Office for Civil Rights and B. Malin, “Methods for de-identification of PHI,” Nov 2015, last accessed on 2021-03-12. [Online]. Available: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html

[2] S. M. Bellovin, P. K. Dutta, and N. Reitinger, “Privacy and synthetic datasets,” Stan. Tech. L. Rev., vol. 22, p. 1, 2019.

[3] J. Gui, Z. Sun, Y. Wen, D. Tao, and J. Ye, “A review on generative adversarial networks: Algorithms, theory, and applications,” 2020.

[4] Scientific Registry of Transplant Recipients, “SAF Data Dictionary.” Accessed Mar. 21, 2021. [Online]. Available: https://www.srtr.org/requesting-srtr-data/saf-data-dictionary/

[5] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with Gumbel-Softmax,” 2017.

[6] N. Patki, R. Wedge, and K. Veeramachaneni, “The Synthetic Data Vault,” in 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Oct 2016, pp. 399–410.

[7] I. J. Goodfellow et al., “Generative adversarial nets,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’14. Cambridge, MA, USA: MIT Press, 2014, p. 2672–2680.

[8] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ser. ICML’17. JMLR.org, 2017, p. 214–223.

[9] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of Wasserstein GANs,” in Advances in Neural Information Processing Systems, 45 I. Guyon et al., Eds., vol. 30. Curran Associates, Inc., 2017. [Online]. Available: https: //proceedings.neurips.cc/paper/2017/file/892c3b1c6dccd52936e27cbd0ff683d6-Paper.pdf

[10] G. Caiola and J. P. Reiter, “Random forests for generating partially synthetic, categorical data,” Trans. Data Privacy, vol. 3, no. 1, p. 27–42, Apr. 2010.

[11] C. J. Maddison, A. Mnih, and Y. Whye Teh, “The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables,” arXiv e-prints, p. arXiv:1611.00712, Nov. 2016.

[12] R. D. Camino, “GAN applications with discrete data,” SEDAN Lab, SnT, University of Luxembourg, Luxembourg, 2018.

[13] M. Zamani and A. Sadeghian, “A variation of particle swarm optimization for training of artificial neural networks,” in Computational Intelligence and Modern Heuristics, A.-D. Ali, Ed. IntechOpen, February 2010, pp. 131–144.

[14] E. Choi et al., “Generating multi-label discrete patient records using generative adversarial networks,” in Proceedings of the 2nd Machine Learning for Healthcare Conference, ser. Proceedings of Machine Learning Research, F. Doshi-Velez et al., Eds., vol. 68. Boston, Massachusetts: PMLR, 18–19 Aug 2017, pp. 286–305. [Online]. Available: http://proceedings.mlr.press/v68/choi17a.html

[15] Z. Zhang, C. Yan, D. A. Mesa, J. Sun, and B. A. Malin, “Ensuring electronic medical record simulation through better training, modeling, and evaluation,” Journal of the American Med- ical Informatics Association, vol. 27, no. 1, pp. 99–108, 10 2019.

[16] J. Yoon, J. Jordon, and M. van der Schaar, “PATE-GAN: Generating synthetic data with differential privacy guarantees,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=S1zk9iRqF7

[17] M. K. Baowaly, C. C. Lin, C. L. Liu, and K. T. Chen, “Synthesizing electronic health records using improved generative adversarial networks,” J Am Med Inform Assoc, vol. 26, no. 3, pp. 228–241, 03 2019.

[18] A. Torfi and E. A. Fox, “CorGAN: Correlation-capturing convolutional generative adversarial networks for generating synthetic healthcare records,” in Proceedings of the Thirty-Third International Florida Artificial Intelligence Research Society Conference, originally to be held in North Miami Beach, Florida, USA, May 17-20, 2020, R. Barták and E. Bell, Eds. AAAI Press, 2020, pp. 335–340. [Online]. Available: https://aaai.org/ocs/index.php/FLAIRS/FLAIRS20/paper/view/18458

[19] J. Hu, J. P. Reiter, and Q. Wang, “Disclosure risk evaluation for fully synthetic categorical data,” in Privacy in Statistical Databases, J. Domingo-Ferrer, Ed. Cham: Springer Interna- tional Publishing, 2014, pp. 185–199.

[20] C. Soneson and M. D. Robinson, “Towards unified quality verification of synthetic count data with countsimQC,” Bioinformatics, vol. 34, no. 4, pp. 691–692, 02 2018.

[21] L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni, “Modeling tabular data using conditional GAN,” in Advances in Neural Information Processing Systems, H. Wallach et al., Eds., vol. 32. Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/254ed7d2de3b23ab10936522dd547b78-Paper.pdf

[22] J. Jordon, J. Yoon, and M. van der Schaar, “Measuring the quality of synthetic data for use in competitions,” arXiv e-prints, vol. abs/1806.11345, 2018. [Online]. Available: http://arxiv.org/abs/1806.11345

[23] N. Santos, P. Tubertini, A. Viana, and J. P. Pedroso, “Kidney exchange simulation and optimization,” Journal of the Operational Research Society, vol. 68, no. 12, pp. 1521–1532, 2017. [Online]. Available: https://doi.org/10.1057/s41274-016-0174-3

[24] B. Li et al., “Predicting patient survival after deceased donor kidney transplantation using flexible parametric modelling,” BMC Nephrology, vol. 17, no. 1, 2016.

[25] J. Kleinknecht, “Machine learning and computational methods for evaluating kidney graft allocation,” Master’s thesis, Bowling Green State University, Aug 2020.

[26] D. A. Bekbolsynov, “Drug codes 092320,” Email exchange, Sep 2020, Excel file.

[27] C. R. Harris et al., “Array programming with NumPy,” Nature, vol. 585, no. 7825, pp. 357–362, Sep. 2020. [Online]. Available: https://doi.org/10.1038/s41586-020-2649-2

[28] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/

[29] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2013, ISBN 3-900051-07-0. [Online]. Available: http://www.R-project.org/

[30] L. Gautier, “rpy2: Python interface to the R language (embedded R),” last accessed on 2021-03-19. [Online]. Available: https://pypi.org/project/rpy2/

APPENDIX A SELECT PYTHON CODE FRAGMENTS

import numpy as np

data += np.random.normal(0., encoding_magnitude / 10.0, size=data.shape)

Listing 1: Adding noise to encoded categorical data.

import tensorflow as tf

def discriminator_loss(real_output, gen_output):
    return tf.reduce_mean(gen_output) - tf.reduce_mean(real_output)

def generator_loss(gen_output):
    return -tf.reduce_mean(gen_output)

Listing 2: Loss functions for the generator and discriminator.

# During each discriminator training step
for weights in discriminator.trainable_variables:
    weights.assign(tf.clip_by_value(weights, -0.01, 0.01, name=None))

Listing 3: Discriminator weight clipping.

import numpy as np

def random_choice(choices, p):
    pick = np.random.multinomial(1, p)
    pick = np.argmax(pick)
    return choices[pick]

Listing 4: Function to help with softmax decoding.

import pandas as pd
from scipy.special import softmax  # import added; decode() relies on it
# random_choice is defined in Listing 4

class LogitEncoder():
    def fit(self, df, cat_columns):
        self.non_cat = df.columns.tolist()
        self.columns_orig = df.columns.tolist()
        self.dtype_orig = {}
        for column in df.columns:
            self.dtype_orig[column] = df[column].dtype
        for column in cat_columns:
            self.non_cat.remove(column)
        self.length = len(self.non_cat)
        self.cat = cat_columns
        self.cat_dict = {}
        cat_len = 0
        for column in self.cat:
            self.cat_dict[column] = df[column].unique()
            cat_len += len(df[column].unique())
        self.length += cat_len
        self.columns = self.non_cat.copy()
        for column in self.cat:
            for cat in self.cat_dict[column]:
                self.columns.append(str(column) + '-' + str(cat))

    def encode(self, df, values=(-10.0, 10.0)):
        data = []
        def f1(x, cat):
            if x == cat:
                return values[1]
            else:
                return values[0]
        for column in self.non_cat:
            data.append(df[column])
        for column in self.cat:
            for cat in self.cat_dict[column]:
                data.append(df[column].apply(lambda x: f1(x, cat)))
        encoded_df = pd.concat(data, axis=1, keys=self.columns)
        return encoded_df

    def decode(self, encoded_df):
        data = []
        for column in self.non_cat:
            data.append(encoded_df[column])
        for column in self.cat:
            names = []
            for cat in self.cat_dict[column]:
                names.append(str(column) + '-' + str(cat))
            # wrap the softmax output back into a DataFrame so apply() works
            encoded_soft = pd.DataFrame(
                softmax(encoded_df[names].to_numpy(), axis=1),
                columns=names, index=encoded_df.index)
            data.append(encoded_soft.apply(lambda x: random_choice(names, x),
                        axis=1).apply(lambda x: x[len(str(column)) + 1:]))
        df = pd.concat(data, axis=1, keys=self.columns_orig)
        for column in df.columns:
            df[column] = df[column].astype(self.dtype_orig[column])
        return df

Listing 5: Helper class to store encoding information.

import numpy as np

def customMatchHLA(donData, recData, cutOff, resultsPath, resultsFilename,
                   considerBloodType=True, writeOutput=True):
    results = []
    for rRow in recData.itertuples():
        matchFound = False
        # misM_p: empirical HLA mismatch distribution (a pandas Series of
        # probabilities indexed by mismatch count), defined elsewhere
        cutOff = np.random.choice(misM_p.index, size=1,
                                  p=misM_p.values, replace=True)[0]
        alpha = 0.25  # tuning parameter
        choice = 1
        if cutOff > 0:
            choice = np.random.choice([0, 1], size=1,
                                      p=[alpha, 1.0 - alpha], replace=True)[0]
        for dRow in donData.itertuples():
            runMatch = True
            if considerBloodType == True:
                if choice == 0:
                    runMatch = checkExactABO(dRow[7], rRow[7])
                else:
                    runMatch = checkABO(dRow[7], rRow[7])
            if runMatch == True:
                misMatches = len(set(dRow[1:7]).difference(set(rRow[1:7])))
                if misMatches == cutOff:
                    result = {
                        "DON_ID": dRow[8],
                        "DON_RACE_SRTR": dRow[9],
                        "DON_A1": dRow[1],
                        "DON_A2": dRow[2],
                        "DON_B1": dRow[3],
                        "DON_B2": dRow[4],
                        "DON_DR1": dRow[5],
                        "DON_DR2": dRow[6],
                        "DON_ABO": dRow[7],
                        "REC_ID": rRow[8],
                        "REC_RACE_SRTR": rRow[9],
                        "REC_A1": rRow[1],
                        "REC_A2": rRow[2],
                        "REC_B1": rRow[3],
                        "REC_B2": rRow[4],
                        "REC_DR1": rRow[5],
                        "REC_DR2": rRow[6],
                        "REC_ABO": rRow[7],
                        "Mismatches": misMatches,
                        "cutOff": cutOff,
                        "bloodTypeConsidered": considerBloodType,
                    }
                    matchFound = True
                    results.append(result)
                    recData.drop(rRow[0], inplace=True)
                    donData.drop(dRow[0], inplace=True)
                    break  # Only doing first come, first served
    return donData, recData, results

Listing 6: Matching algorithm adapted from [25].