GENERATING A SYNTHETIC DATASET FOR KIDNEY TRANSPLANTATION USING GENERATIVE ADVERSARIAL NETWORKS AND CATEGORICAL LOGIT ENCODING

John Bartocci

A Thesis

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

May 2021

Committee:

Robert C. Green II, Advisor

Sankardas Roy

Qing Tian

Copyright © May 2021 John Bartocci
All rights reserved

ABSTRACT

Robert C. Green II, Advisor

A synthetic data set for kidney transplantation is developed using a Wasserstein generative adversarial network (WGAN) and a donor-recipient HLA matching algorithm. Like many medical data sets, much of the kidney transplant data set is categorical. A new method for dealing with categorical data in GANs is proposed and the results are analyzed. The real-valued data is prepared with a process similar to one-hot encoding, but instead of using ones and zeros, the values are logit in nature, with a large positive value corresponding to one and a large negative value corresponding to zero. By capturing the logit distributions and correlations between categories within a GAN, the generator can create a synthetic version of the data that will resemble the real, un-encoded data set after a softmax function is applied and values are stochastically selected. While the statistical metric used demonstrated that the synthetic and real data sets did not come from the same distribution, a visual inspection shows general similarity between the two data sets. A rematching simulation performed on the real and synthetic data set shows relatively similar results.

To my wife, for all the love and encouragement during these interesting times.

ACKNOWLEDGMENTS

I would like to thank Dr. Green for his support and guidance during my graduate career. His dedication to his students was an inspiration. I would also like to acknowledge the help that Dr. Bekbolsynov provided during the early phases of my thesis when working with drug regimens. I would also like to thank Dr. Tian and Dr. Roy for their valuable feedback on my thesis.

The data reported here have been supplied by the Hennepin Healthcare Research Institute (HHRI) as the contractor for the Scientific Registry of Transplant Recipients (SRTR). The interpretation and reporting of these data are the responsibility of the author(s) and in no way should be seen as an official policy of or interpretation by the SRTR or the U.S. Government. Principles of the Helsinki declaration were followed when working with SRTR data.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION ...... 1

CHAPTER 2 RELATED WORKS AND BACKGROUND INFORMATION ...... 4
    2.1 Generative Adversarial Neural Networks ...... 4
    2.2 Synthetic Electronic Health Records ...... 7
    2.3 Privacy Evaluation ...... 7
    2.4 Synthetic Data Quality Evaluation ...... 7
    2.5 Generation of Kidney Transplant Data ...... 8

CHAPTER 3 METHODOLOGY ...... 9
    3.1 Data Set Preparation ...... 9
        3.1.1 Data Cleaning ...... 10
        3.1.2 Induction Drug Columns ...... 10
        3.1.3 Maintenance Drug Columns ...... 13
        3.1.4 Data Set Summary ...... 14
    3.2 Method Overview ...... 16
    3.3 Categorical Encoding ...... 17
    3.4 WGAN with Logit-based Critic (WGAN-LC) ...... 18
    3.5 Random Sampling of Conditional Probability Distributions ...... 20
    3.6 Donor and Recipient Matching Algorithm ...... 22

CHAPTER 4 RESULTS ...... 24
    4.1 Platform ...... 24
    4.2 Patient Medical Profile Results ...... 25
    4.3 Age Assignment Results ...... 31
    4.4 Donor and Recipient Matching Results ...... 34
    4.5 Rematching Results ...... 40

CHAPTER 5 CONCLUSION ...... 42
    5.1 Future Work ...... 42

BIBLIOGRAPHY ...... 44

APPENDIX A SELECT PYTHON CODE FRAGMENTS ...... 48

LIST OF FIGURES

2.1 Generic GAN setup...... 5

3.1 Example of the process used to develop the induction drug columns...... 12
3.2 Example of the process used to develop the maintenance drug columns...... 16
3.3 Example of the process used to encode the categorical data...... 18
3.4 WGAN architecture using logit-encoded categorical data...... 19

4.1 Patient profile frequency distributions, part 1 of 5...... 26
4.2 Patient profile frequency distributions, part 2 of 5...... 27
4.3 Patient profile frequency distributions, part 3 of 5...... 28
4.4 Patient profile frequency distributions, part 4 of 5...... 29
4.5 Patient profile frequency distributions, part 5 of 5...... 30
4.6 Comparison between the real donor and recipient data CW1 HLA distributions...... 31
4.7 Age distribution plots for real and synthetic data sets...... 32
4.8 Age distribution plots for real and synthetic data sets for white males...... 33
4.9 Distribution plots for donor columns in matched data sets, part 1 of 2...... 36
4.10 Distribution plots for donor columns in matched data sets, part 2 of 2...... 37
4.11 Distribution plots for recipient columns in matched data sets, part 1 of 2...... 38
4.12 Distribution plots for recipient columns in matched data sets, part 2 of 2...... 39
4.13 Rematching results on real and synthetic data sets...... 41

LIST OF TABLES

3.1 Donor column descriptions and data types [4]...... 10
3.2 Candidate column descriptions and data types [4]...... 11
3.3 Transplant column descriptions and data types [4]...... 11
3.4 The disposition of missing data that was not dropped...... 12
3.5 Induction drug categories...... 13
3.6 Drugs categorized as "other"...... 13
3.7 Description of categories used to summarize the induction data...... 14
3.8 Maintenance drug categories...... 15
3.9 Maintenance regimens...... 15
3.10 Description of categories used to summarize the maintenance regimens...... 16
3.11 Patient information columns and their data types that are generated by the WGAN...... 21
3.12 Summary of the dimensions and activation functions in the generator architecture...... 21
3.13 Summary of the dimensions and activation functions in the critic architecture...... 21
3.14 Key values used in the generator and critic models...... 22

4.1 Fisher-Exact results for the patient profile data generated by the WGAN...... 25
4.2 Fisher-Exact results for the paired patient data...... 35

CHAPTER 1 INTRODUCTION

Machine learning has seen major advances in a wide variety of domains in recent years. While the medical domain has seen its share of advances, one major challenge that has impeded its progress is the lack of publicly available data sets. Medical data, by its very nature, routinely contains protected health information (PHI) and in the US is regulated by HIPAA privacy rules. According to [1], one possibility to solve this issue is through de-identification of the PHI. When de-identification is not feasible due to restrictions or disclosure risk, another option is generating a synthetic data set. Specifically, this research focused on assembling a synthetic data set for kidney transplantation. While generating a synthetic data set solves many of the privacy concerns, it is not a panacea for privacy [2]. Synthetic data also has its own challenges, such as evaluating the quality of the data set, handling categorical and time-series data, and architecture-specific challenges such as mode collapse with generative adversarial networks (GANs) [3]. While knowledge of the domain-specific nature of any given medical data set would be helpful, it is not a requirement for understanding the more general issues and methods presented in this thesis for handling and generating categorical data. The SRTR kidney transplant data set [4] has a significant number of categorical columns with many different levels. In order to generate a synthetic version of such a data set, the more general issue of synthesizing categorical data with GANs must be addressed. This research attempted to simplify the deployment of GANs for categorical data by training in the latent space that contains the continuous distribution of logits, encoding the real data in a logit-like manner. This eliminated the need for stochastic selection from a softmax during training or for a distribution that can be hardened via a temperature parameter, like the Gumbel-Softmax [5].
This thesis attempted to reduce the number of joint distributions that must be modeled at once by breaking the data generation up into functional units, emphasizing the joint distributions that matter most. This synthetic data generation treated the generation of the donor and recipient data separately and then worked to combine them using a matching simulation. This functional data aggregation should increase the level of privacy in the synthetic data set while attempting to maintain the most important joint distributions. While a statistical analysis is performed on the resultant data using a Fisher-Exact test, the limitations of statistical analysis on nominal categorical data necessitate some visual inspection of the distributions. The results of the Fisher-Exact test demonstrate that the synthetic data set generated by this method and another data set generated by the Synthetic Data Vault (SDV) [6] both fail to come from the same distribution as the real data set. The distribution plots, however, show that some of the distribution information has been captured with this method. Additionally, the machine learning efficacy test provides some hints that the synthetic data set can generate some machine learning insights for rematching algorithms. Considering these issues, this thesis makes three contributions regarding data related to kidney transplantation:

1. A proposal for a new method to train a WGAN using raw logits for categorical data as direct input to a critic;

2. A phased process to build up a synthetic data set using different methods; and

3. The application of the previous two contributions to partially generate a synthetic data set for kidney transplants that is similar to the actual data set.

Stated differently, this thesis centers on answering three questions:

1. Can a Wasserstein GAN (WGAN) generate categorical data while avoiding the need for Gumbel-Softmax by using a logit output layer on the generator and encoding the real data in a logit-like scheme?

2. Can a synthetic data set be built using an iterative method that combines generative and non-generative methods?

3. Can a synthetic data set be developed for kidney transplants that is statistically similar to the original data set?

In answering these questions, the remainder of this thesis is structured as follows: Chapter 2 covers related works on these topics; Chapter 3 details the methods used to generate and evaluate the synthetic data set; Chapter 4 covers the results of each phase of the iterative method as well as the overall results; and Chapter 5 concludes the thesis and highlights areas for future work.

CHAPTER 2 RELATED WORKS AND BACKGROUND INFORMATION

In considering the generation of synthetic data for use in the simulation and evaluation of kidney transplantation, there are many related topics that are of interest including:

• Generative Adversarial Neural Networks

• Synthetic Electronic Health Records

• Privacy Evaluation

• Synthetic Data Quality Evaluation

• Generation of Kidney Transplant Data

This chapter reviews these areas.

2.1 Generative Adversarial Neural Networks

GANs were introduced in [7]. At their introduction, they consisted of two neural networks, a generator and a discriminator, that face each other in a competitive game of sorts. The discriminator's role is to take in data as input and classify that data as coming from the training data or as an output of the generator. The generator takes in a noise vector as input, produces an output, and attempts to fool the discriminator into classifying the generated data as part of the training data set. The general setup can be seen in Fig. 2.1. Any advance that the generator makes comes at the detriment of the discriminator and vice versa. The two compete in this zero-sum game and, in the ideal case, find a Nash equilibrium. For the generator this equilibrium corresponds to an output whose probability distribution matches the training data set [7]. In the original GAN, the log-likelihood was used as the objective function for the neural networks, with the networks being updated via stochastic gradient descent (SGD) [7].

As work continued on GANs, other objective functions saw use, such as the Wasserstein Distance in what has been aptly named Wasserstein GANs (WGANs) [8]. The definition of the exact

Figure 2.1 Generic GAN setup.

Wasserstein Distance can be found in [3, 8], but is intractable. It is estimated using (2.1), where the generator G maps the noise vector z, drawn from the distribution p_z, into a generated distribution in an effort to minimize the expectation E of D(G(z)), while the discriminator maximizes the expectation of D(x) and the negative expectation of D(G(z)). The optimum for this is when p_g = p_data, the real data distribution [8]. This formulation requires the discriminator to have Lipschitz continuity, which can be enforced with weight clipping [8]. Alternative methods for enforcing Lipschitz continuity include adding a gradient penalty, as proposed in [9].

W(x, z) = \min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[D(x)] - \mathbb{E}_{z \sim p_z(z)}[D(G(z))] \qquad (2.1)
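In practice, the expectations in (2.1) are estimated with sample means over minibatches of critic scores. The following sketch is illustrative only; the function names and the NumPy formulation are assumptions, not the thesis's implementation:

```python
import numpy as np

def critic_objective(d_real, d_fake):
    """Empirical estimate of E[D(x)] - E[D(G(z))] from minibatch
    critic scores. The critic ascends this quantity while the
    generator descends it, per (2.1)."""
    return float(np.mean(d_real) - np.mean(d_fake))

def clip_weights(weights, c=0.01):
    """Weight clipping used in the original WGAN to enforce the
    Lipschitz constraint on the critic (a gradient penalty is the
    common alternative)."""
    return [np.clip(w, -c, c) for w in weights]
```

In a typical WGAN training loop, the critic is updated several times per generator step, with its weights clipped after each critic update.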

With typical GAN training using SGD, the layers of the generator need to be differentiable, but this has posed a challenge for GANs trying to model discrete data [3]. Another notable challenge for GANs is an issue called mode collapse, where the generator only models certain modes in the training data [3]. Attempts have been made to address this challenge by adding regularization terms to the objective function or changing the objective function entirely [3, 8]. Some machine learning methods, such as Random Forest, can generate synthetic categorical data without issue and have been used to generate partially synthetic data sets [10], but they don't generate fully synthetic data sets from noise vectors and require data to traverse the decision trees. In GANs, however, the difficulty with discrete data extends to categorical data. The work in [9] achieved some level of success by applying the softmax to the generator output without including a sampling step before sending the data to the critic. The real data was maintained as one-hot vectors and sent to the critic in that form. After training, samples were taken from the generator by applying the argmax function. There have been some methods established to overcome this limitation by relaxing the discrete random variables into continuous distributions [11], which can alternatively be viewed as applying the Gumbel-Softmax to the output layer of the generator [5]. The Gumbel-Softmax takes a continuous distribution and approaches a one-hot encoded vector representation as the temperature is lowered towards zero. By eliminating the stochastic selection required to transform a standard softmax into a one-hot representation, the layer remains differentiable and accommodates SGD. The Gumbel-Softmax is represented in (2.2), where y_i is an element in a sample vector y of length k, π_i is the probability of that element, g_i is a sample from the Gumbel(0, 1) distribution, and τ is the temperature. Another method for multi-categorical data was discussed in [12], which proposed adding, at the output of the model, a separate dense layer for each category, each followed by a Gumbel-Softmax, with the results concatenated back together to form the output.

y_i = \frac{\exp((g_i + \log \pi_i)/\tau)}{\sum_{j=1}^{k} \exp((g_j + \log \pi_j)/\tau)} \quad \text{for } i = 1, \ldots, k. \qquad (2.2)
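A NumPy sketch of sampling from (2.2) (illustrative only; the method proposed in this thesis avoids this layer entirely):

```python
import numpy as np

def gumbel_softmax(logits, tau, rng=np.random.default_rng(0)):
    """Sample a relaxed one-hot vector from class log-probabilities.

    logits: log(pi_i) for each of the k classes; tau: temperature.
    As tau -> 0 the output approaches a one-hot vector.
    """
    # g_i ~ Gumbel(0, 1) via the inverse CDF of uniform samples
    u = rng.uniform(1e-12, 1.0, size=len(logits))
    g = -np.log(-np.log(u))
    y = (g + logits) / tau
    y = np.exp(y - np.max(y))   # numerically stable softmax
    return y / y.sum()
```

Because the sample stays a continuous vector (rather than a hard one-hot selection), gradients can flow through it during SGD.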

In many ways, the proposed method takes the simplicity of sending continuous data to the critic (similarly to [9]), while transforming the real data into a continuous space more akin to the precursor values for Gumbel-Softmax methods [5]. Other alternatives would be abandoning SGD and updating network weights with other optimization methods like Particle Swarm Optimization (PSO) [13] or other non-gradient-based methods.

2.2 Synthetic Electronic Health Records

There are several notable works that have progressed the efforts to synthesize electronic health records (EHRs) using some form of GAN [14–18]. Autoencoders were used in [14, 17, 18] to decode the continuous outputs of the generator into discrete features. Frameworks like medGAN [14], medWGAN [17], and PATE-GAN [16] were limited to binary and count features and did not directly address the synthesis of more advanced data types, like multi-label categorical features. Even COR-GAN [18] was limited to discrete and continuous variables. Recent attempts to improve EHR synthetic data generation have employed the Wasserstein distance [15, 17, 18].

2.3 Privacy Evaluation

While not exclusively the domain of EHR, privacy metrics for synthetic data are a major focus in the synthetic EHR frameworks since even fully synthetic data sets can be prone to certain disclosure risks [19]. Two categories of risks are membership and disclosure attacks, and either one or both are measured in several of the discussed frameworks as a privacy metric [14, 15, 17, 18]. Some methods [16, 19] take a more mathematically rigorous approach to privacy guarantees. In [16], differential privacy is produced through training the discriminator with differentially private real data. This mechanism in essence produces noisy gradients which provide the privacy guarantees.

2.4 Synthetic Data Quality Evaluation

Evaluating the quality of synthetic data can take several forms. One broad category is a focus on statistical similarity via dimension-wise statistics and dimension-wise predictions [18, 20], while the other looks at the utility of the data. The utility approach is very use-case specific and is highly dependent on the goals of the synthetic data set. A utility-based approach might choose to focus on a synthetic-data-trained model's performance on real data [21] or on maintaining relative model performance over several models evaluated on real and synthetic data [16, 22]. The Synthetic Data Vault [6] has a suite of synthetic data quality metrics of both major types.

2.5 Generation of Kidney Transplant Data

Simulations that develop patient pools for kidney transplants have been developed [23], but typically do not generate patient HLA information and instead rely on a simplification that collapses the matching down to probabilities. While this can be effective to test certain kidney allocation programs, for survival analysis [24] and rematching simulations that involve items such as high-resolution HLA and immunogenicity [25], a more comprehensive set of patient data is required. This has left these types of analyses dependent on access to the real patient data.

CHAPTER 3 METHODOLOGY

The methodology proposed in this thesis leverages customized GANs (among other processes) in order to generate synthetic kidney transplantation data for use in simulation, etc. In order to accomplish this, multiple steps were needed including data set preparation, categorical data encoding, GAN training, donor and recipient patient pool generation, patient age generation, and donor-recipient matching.

3.1 Data Set Preparation

This study used data from the Scientific Registry of Transplant Recipients (SRTR). The SRTR data system includes data on all donors, wait-listed candidates, and transplant recipients in the US, submitted by the members of the Organ Procurement and Transplantation Network (OPTN). The Health Resources and Services Administration (HRSA), U.S. Department of Health and Human Services provides oversight to the activities of the OPTN and SRTR contractors.

The transplant data set obtained covers the years 1987 through 2019. The focus of this project was specifically on the kidney transplant subset of that data set. Transplant records were developed by joining various tables, including DONOR DECEASED, REC HISTO, CAND KIPA, and TX KI. Maintenance and induction drug categories were derived from IMMUNO and FOL IMMUNO. The details of the table linking are provided with the data set, or can be viewed online¹. All categorical columns were dictionary encoded using the SRTR 1912 Public SAFs Data Dictionary [4]. Only specific columns were extracted from the tables and can be broadly categorized into three groups: Donor Information in Table 3.1, Candidate Information in Table 3.2, and Transplant Information in Table 3.3. The bold columns in Table 3.3 were derived and not columns in the data set. The induction and maintenance drug tables were condensed down to form the columns for each transplant instance. The transplants were limited to those with kidneys from deceased donors, as that tends to be the focus of many pairing simulations. The pre-cleaned kidney transplant data set with deceased donors contained almost 470,000 records.

¹ https://www.srtr.org/assets/media/docs/SAFsLinkingDiagram.pdf

Table 3.1 Donor column descriptions and data types [4].

COLUMN NAME    TYPE           DESCRIPTION
DONOR ID       Numerical      Donor Identifier
DON AGE        Numerical      Donor Age in Years
DON GENDER     Categorical    Donor Gender
DON RACE       Categorical    Donor Race
DON ABO        Categorical    Donor Blood Type
DON A1         Categorical    Donor HLA - A (1) antigen
DON A2         Categorical    Donor HLA - A (2) antigen
DON B1         Categorical    Donor HLA - B (1) antigen
DON B2         Categorical    Donor HLA - B (2) antigen
DON CW1        Categorical    Donor HLA - CW (1) antigen
DON CW2        Categorical    Donor HLA - CW (2) antigen
DON DR1        Categorical    Donor HLA - DR (1) antigen
DON DR2        Categorical    Donor HLA - DR (2) antigen
DON DQ1        Categorical    Donor HLA - DQ (1) antigen
DON DQ2        Categorical    Donor HLA - DQ (2) antigen

3.1.1 Data Cleaning

The process of data cleaning involved dropping all records with missing data, with the exceptions and their dispositions listed in Table 3.4. Later, the candidate body mass index (BMI) values that were out of range were dropped. In hindsight, it would have been better to drop the BMI outliers before calculating the mean value for missing data replacement. While included in this section for completeness, the BMI values and other columns that were also exceptions were not used for later portions of this project.

3.1.2 Induction Drug Columns

To derive the recipient's induction drug category (IND CAT) and whether a steroid or prednisone was used for induction (IND PRED), the drug information for each patient from the IMMUNO table was used. Each drug was translated into a category via a dictionary lookup based on information provided in [26], a summary of which can be found in Table 3.5. Some drugs were categorized

Table 3.2 Candidate column descriptions and data types [4].

COLUMN NAME        TYPE           DESCRIPTION
PX ID              Numerical      Candidate Identifier
REC AGE AT TX      Numerical      Candidate Age at Date of Transplant
CAN GENDER         Categorical    Candidate Gender
CAN RACE           Categorical    Candidate Race
CAN ABO            Categorical    Candidate Blood Type
REC A1             Categorical    Candidate HLA - A (1) antigen
REC A2             Categorical    Candidate HLA - A (2) antigen
REC B1             Categorical    Candidate HLA - B (1) antigen
REC B2             Categorical    Candidate HLA - B (2) antigen
REC CW1            Categorical    Candidate HLA - Cw (1) locus
REC CW2            Categorical    Candidate HLA - Cw (2) locus
REC DR1            Categorical    Candidate HLA - DR (1) antigen
REC DR2            Categorical    Candidate HLA - DR (2) antigen
REC DQW1           Categorical    Candidate HLA - DQ (1) locus
REC DQW2           Categorical    Candidate HLA - DQ (2) locus
CAN EDUCATION      Categorical    Candidate Education Status
CAN PRIMARY PAY    Categorical    Source of Payment
CAN BMI            Numerical      Candidate Body Mass Index
CAN DIAB           Categorical    Candidate Diabetes Status
CAN DIAL           Categorical    Candidate Dialysis Status
CAN LISTING DT     datetime       Date Candidate added to Wait List

Table 3.3 Transplant column descriptions and data types [4].

COLUMN NAME         TYPE           DESCRIPTION
TX ID               Numerical      Transplant Identifier
REC HISTO TX ID     Numerical      Another Transplant Identifier
REC COLD ISCH TM    Numerical      Total Cold Ischemic Time for Organ
REC TX DT           datetime       Date of Transplant
TX ERA              Categorical    Calculated Column to Bin Transplant Date Ranges
IND CAT             Categorical    Category of Induction Drug Regimen
IND PRED            Categorical    Status of Steroid/Prednisone use for Induction
IS Summary          Categorical    Maintenance Regimen Category over the History
IS Discharge        Categorical    Maintenance Regimen Category at Discharge
Pred Summary        Categorical    Steroid/Prednisone use at any time for Maintenance
Pred Discharge      Categorical    Steroid/Prednisone use at Discharge

Table 3.4 The disposition of missing data that was not dropped.

COLUMN NAME        DISPOSITION
CAN DIAB           Replaced with code for unknown value
CAN DIAL           Replaced with code for unknown value
CAN EDUCATION      Replaced with code for unknown value
CAN PRIMARY PAY    Replaced with code for unknown value
CAN LISTING DT     Replaced with value from REC TX DT
CAN BMI            Replaced with mean BMI

as "other" and can be found in Table 3.6. These categories expand upon the induction categories in [24]. This categorization was used to one-hot encode the category for each patient entry if it was also tagged as used for induction in the REC DRUG INDUCTION column. Then, for each patient, the rows were reduced using a summation of the one-hot vectors to provide a single-row summary of all the drug categories used for induction. An example is provided in Fig. 3.1. The reduced rows were then analyzed to determine which category was appropriate according to the conditions in Table 3.7. A separate column exclusively tracks prednisone use for induction; it copies the prednisone column of the one-hot vector described above.
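The reduction illustrated in Fig. 3.1 amounts to grouping the per-drug rows by patient and summing one-hot indicators. A sketch under assumed column and category names (not the thesis code):

```python
from collections import Counter

# Hypothetical category labels; the real set comes from Table 3.5.
CATEGORIES = ["ATG", "IL-2 receptor inhibitor", "Alemtuzumab",
              "Prednisone", "Other"]

def reduce_induction_rows(rows):
    """Collapse per-drug rows into one count vector per patient.

    rows: iterable of (patient_id, category, used_for_induction).
    Returns {patient_id: Counter} counting only drugs flagged as
    used for induction -- equivalent to summing one-hot vectors.
    """
    summary = {}
    for patient_id, category, used_for_induction in rows:
        counts = summary.setdefault(
            patient_id, Counter({c: 0 for c in CATEGORIES}))
        if used_for_induction:
            counts[category] += 1
    return summary
```

Each patient's count vector can then be mapped to an IND CAT value using the conditions of Table 3.7.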

Figure 3.1 Example of the process used to develop the induction drug columns.

Table 3.5 Induction drug categories extracted from the Excel file provided by [26].

Drug name                       Induction category 1       Induction category 2
ALG                             T-cell depleting agent     ATG
OKT4                            T-cell depleting agent     ATG
Atgam                           T-cell depleting agent     ATG
NRATG/NRATS                     T-cell depleting agent     ATG
OKT3 (Orthoclone, muromonab)    T-cell depleting agent     ATG
Anti-LFA-1                      LFA1 blocker               LFA1 blocker
IL-1 Receptor Antagonist        IL1 blocker                IL1 blocker
T10B9 (Medimmune)               T-cell depleting agent     ATG
Thymoglobulin                   T-cell depleting agent     ATG
Zenapax                         IL-2 receptor inhibitor    IL-2 receptor inhibitor
Simulect ()                     IL-2 receptor inhibitor    IL-2 receptor inhibitor
Campath ()                      CD52 antagonist            Alemtuzumab
Rituxan ()                      CD20 antagonist            CD20 antagonist
Nulojix ()                      CTLA-4 analog              CTLA-4 analog

Table 3.6 Drugs categorized as "other" extracted from the Excel file provided by [26].

Drug name
Anti-ICAM-1
Leflunomide (LFL)
Cytoxan (cyclophosphamide)
(Folex PFS, Mexate-AQ, Rheumatrex)
(Bredinin)
Xoma Zyme-CD5+
DAB486-IL-2
Anti-IL-6
Anti-TNF
Soluble IL-1 Receptor
Aldesleukin (IL-2)
Deoxyspergualin (DSG, 15-DSG, Gusperimus, Spanidin)
FTY 720

3.1.3 Maintenance Drug Columns

Developing the maintenance drug categorization was more involved than the induction drugs, as it deals with a time series of treatment regimens. As in the case of the induction drugs, each

Table 3.7 Description of categories used in IND CAT to summarize the induction data.

CATEGORY             CONDITION
No Induction         Only zero values recorded
Prednisone Only      Only non-zero value is prednisone
specific category    If only one non-prednisone category has a non-zero value, that specific category is listed.
specific category    If only one non-prednisone category has a non-zero value, in addition to one "other" categorized drug, that specific category is listed.
Multiple             For non-zero values not included in the above conditions
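The conditions in Table 3.7 amount to a small decision rule. A sketch, where the function name, the dictionary shape, and the category labels are assumptions for illustration:

```python
def induction_category(counts):
    """Map a patient's per-category drug counts to an IND CAT value
    following the conditions of Table 3.7. 'Prednisone' and 'Other'
    are treated specially."""
    active = {c for c, n in counts.items() if n > 0}
    if not active:
        return "No Induction"
    if active == {"Prednisone"}:
        return "Prednisone Only"
    specific = active - {"Prednisone", "Other"}
    if len(specific) == 1:
        # one specific category, optionally alongside prednisone/'other'
        return specific.pop()
    return "Multiple"
```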

drug was translated into a category via a dictionary lookup based on information provided in [26], a summary of which can be found in Table 3.8. The treatment regimens are a combination of these categories and also expand upon the maintenance categories in [24]. This categorization was used to one-hot encode the category for each patient entry if it was also tagged as used for a current maintenance regimen in the TFL IMMUNO DRUG MAINT CUR or REC DRUG MAINT columns. Instead of reducing all the rows down to one for each patient, the rows are reduced by patient and time code. Each row is then translated into a treatment regimen based on the combination of drug categories used. The maintenance regimens are listed in Table 3.9. This time series of treatment regimens is then condensed down into the treatment regimen at discharge and a historical summary of treatment regimens. The historical summary is assigned in one of three ways listed in Table 3.10. There are two other columns that track the use of prednisone for maintenance. One tracks if prednisone was used initially after discharge from transplant surgery. The other tracks if prednisone was used at any time during the time series of maintenance regimens. An example is provided in Fig. 3.2; note that the column TFL FOL CD tracks the time code bins, with 10 being time of discharge.

3.1.4 Data Set Summary

Out of the almost 470,000 kidney transplants listed in the original data, only 58,656 entries survived the cleaning process. The final version of the data set consists of 46

Table 3.8 Maintenance drug categories extracted from the Excel file provided by [26].

Drug name                                    Maint. Category 1        Maint. Category 2
Imuran (, AZA)                               Antimetabolite           Aza
CellCept (MMF)                               Antimetabolite           Aza
Brequinar Sodium (BQR)                       Antimetabolite
Myfortic ()                                  Antimetabolite           MMF
Generic MMF (generic CellCept)               Antimetabolite           MMF
Generic Mycophenolic Acid                    Antimetabolite           MMF
Sandimmune                                   Calcineurin inhibitor    CsA
Neoral                                       Calcineurin inhibitor    CsA
Prograf (Tacrolimus)                         Calcineurin inhibitor    TAC
Cyclosporin                                  Calcineurin inhibitor    CsA
Sang Cy A                                    Calcineurin inhibitor    CsA
Gengraf                                      Calcineurin inhibitor    CsA
EON (generic cyclosporine)                   Calcineurin inhibitor    CsA
Generic cyclosporine                         Calcineurin inhibitor    CsA
Astagraf XL (extended release tacrolimus)    Calcineurin inhibitor    TAC
Generic tacrolimus (generic Prograf)         Calcineurin inhibitor    TAC
Envarsus XR (tacrolimus XR)                  Calcineurin inhibitor    TAC
Rapamune ()                                  mTOR inhibitor           mTOR inhibitor
Zortress ()                                  mTOR inhibitor           mTOR inhibitor
Generic sirolimus                            mTOR inhibitor           mTOR inhibitor
Prednisone                                   Steroids
Methylprednisolone                           Steroids
Steroids                                     Steroids

Table 3.9 Maintenance regimens [26] as an expanded list from the regimens used in [24].

REGIMEN
TAC & MMF
TAC
CsA & AzA
CsA
TAC & AzA
TAC & mTOR
Other
None

Figure 3.2 Example of the process used to develop the maintenance drug columns.

Table 3.10 Description of categories used in IS Summary and IS Discharge to summarize the maintenance regimens.

CATEGORY            CONDITION
No Maintenance      If no non-null regimens exist
specific regimen    If only one regimen is used, the specific regimen is listed
Multiple            If more than one regimen is used
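The rules of Table 3.10 reduce to a few lines of code. A sketch with assumed names (not the thesis implementation):

```python
def summarize_regimens(regimens):
    """Condense a time-ordered list of maintenance regimen labels
    into an IS Summary value per Table 3.10. None entries represent
    time codes with no recorded regimen."""
    used = {r for r in regimens if r is not None}
    if not used:
        return "No Maintenance"
    if len(used) == 1:
        return used.pop()      # the single specific regimen
    return "Multiple"
```

The discharge value (IS Discharge) would simply be the regimen recorded at the discharge time code (TFL FOL CD of 10).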

columns, 36 categorical, 8 numerical, and 2 dates. The categorical column for the transplant era, TX ERA, was not included in this final version, but could be derived from the included transplant date.

3.2 Method Overview

The development of the synthetic data set was done in several phases:

• Phase I. The patient race, gender, blood type, and HLA information were generated using a WGAN. Two patient pools were produced, one for donors and one for recipients;

• Phase II. Patient age was added to each record by randomly sampling from the probability distribution conditioned on race and gender;

• Phase III. An existing kidney transplant re-matching algorithm [25] was re-purposed to provide the patient matching between the synthetic donor and recipient pools.

At the end of the donor-recipient matching, the resultant synthetic data set contained race, gender, blood type, and HLA information for the donor-recipient pairs. One caveat of the final synthetic data set is that the age information was lost during the rematching algorithm. This could be rectified by either modifying the rematching algorithm or applying the age phase after the matching algorithm. Also, the race information was converted from a higher-resolution, numerically encoded categorical value to a lower-resolution categorical column. A statistical analysis was performed using the Fisher-Exact test, and distribution plots were used as visualizations.
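For a 2×2 contingency table, the Fisher-Exact p-value can be computed directly from the hypergeometric distribution. A self-contained sketch for intuition (the thesis's analysis presumably used a library routine such as SciPy's; this function is illustrative):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def p_table(x):
        # P(first cell = x) given the fixed row and column margins
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p_table(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(p for x in range(lo, hi + 1)
               if (p := p_table(x)) <= p_obs + 1e-12)
```

Exact tests like this are attractive for nominal categorical data because they make no large-sample assumptions, though, as noted above, they cannot capture everything that visual inspection of the distributions reveals.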

3.3 Categorical Encoding

To facilitate modeling of categorical variables without the Gumbel-Softmax used in [21], this thesis proposes a method to encode the categorical data of the real data set that eliminates the need for a Gumbel-Softmax at the output of the generator and lets the discriminator deal directly with the logit data coming out of the generator. The encoding process takes a categorical column and makes a separate column for each unique category, similar to one-hot encoding. Instead of encoding a value with a one in its respective column, a large positive number is used; instead of zeros in the other columns, a large negative number is used. See Listings 4 and 5 in Appendix A for an implementation of a helper class that stores the encoding and decoding information. A value of ±10 produced an encoding that was reversible with the use of a softmax function to reproduce the original data. While the donor data reversed without error in this trial, reversibility is not mathematically guaranteed. The value of ten was chosen to be as small as possible while still successfully reversing the donor data without error; this value can be changed to bound the probability of complete reversibility given the number of categorical columns, the number of unique categories in each, and the number of rows in the data set. Noise was introduced into the encoding by adding, to each value, a random variable drawn from a normal distribution centered at zero with a standard deviation of one, ten percent of the chosen encoding scale of ten. See Listing 1 in Appendix A for the noising code, which uses the NumPy module [27]. Noise was added to mitigate the potential for the discriminator to use the uniformity of the real data against the generated data. An abbreviated example is provided in Fig. 3.3.

Figure 3.3 Example of the process used to encode the categorical data.
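The encode/decode round trip described above can be sketched in a few lines. This is a toy illustration, not the thesis implementation; the column name and category values are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy categorical column (name and values invented for illustration).
df = pd.DataFrame({"ABO": ["A", "O", "B", "O"]})
cats = sorted(df["ABO"].unique())        # ['A', 'B', 'O']
scale = 10.0

# One column per category: +scale where the row matches, -scale elsewhere.
encoded = np.where(df["ABO"].to_numpy()[:, None] == np.array(cats)[None, :],
                   scale, -scale)

# Decoding: softmax each row into probabilities, then sample a category.
shifted = encoded - encoded.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
decoded = [cats[np.argmax(rng.multinomial(1, p))] for p in probs]
```

At a scale of ±10 the softmax places essentially all probability mass on the encoded category, which is why the stochastic selection reverses the encoding with near certainty.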

3.4 WGAN with Logit-based Critic (WGAN-LC)

The donor and candidate patient profiles share a number of columns. In addition to the categorical columns listed in Table 3.11, the patient profiles both have two numerical columns, patient age and an ID number. Both numerical columns are added in a later phase. This leaves thirteen categorical columns to generate. The categorical data was encoded using the method described in the previous section.

A WGAN was chosen due to the success on tabular data experienced with WGAN variants [14, 15, 17, 21]. The basic architecture of the implemented WGAN can be seen in Fig. 3.4. The WGAN was implemented using TensorFlow [28]. The Wasserstein loss implementations can be seen in Listing 2 in Appendix A.

Figure 3.4 WGAN architecture using logit-encoded categorical data. Dotted arrows correspond to the unencoded real data input and fake data output.

The generator and critic architectures consist of two hidden layers and use a leaky ReLU activation function. The details of the generator architecture can be seen in Table 3.12 and of the critic in Table 3.13. The exact architecture and hyperparameters did not undergo rigorous tuning and were instead set based on experience. Other design choices of the GAN are summarized in Table 3.14. The critic was trained five times per training step with weight clipping, while the generator was trained only once per step. One way to accomplish the weight clipping in TensorFlow can be seen in Listing 3 in Appendix A. The generator output and discriminator input dimension of 369 corresponds to the encoded size of the selected categorical columns.

Training was performed over 200,000 epochs on the donor records. The training was monitored at approximately 50,000-epoch intervals, and the decision to continue training was based on experience. To facilitate a form of temperature annealing of the data, the encoded values of the real data started at ±0.5 and were linearly scaled up to ±10.0 over the first 500 epochs to help facilitate separation between the two modes. Once the model was trained, the generator was used to generate two separate sets of 58,656 samples, corresponding to the size of the real data set after cleaning. One synthetic set was compared to the donor data while the other was compared to the recipient data. The synthetic and real column distributions were plotted for visual comparisons.

Initially, the plan was to train separate models for the donors and recipients. As such, the first model was trained on only the donor data. Once the donor generator was trained, samples were generated and analyzed for suitability for both donors and recipients. The donor-trained generator proved suitable for donors and, to some extent, recipients.
There were some differences between the donor and recipient information, including in the CW1 antigen column. Since CW1 was not used in the subsequent matching algorithm, the CW1 columns were dropped from both the donor and recipient pools for convenience. Other differences, observable in the results in Chapter 4, are in blood type and, to a minor extent, in all HLA columns. A separate network trained on the recipient data is highly desirable for future work, but was not implemented in this research.
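The linear annealing schedule described above can be sketched as a small helper. The function name, signature, and defaults are my own; this illustrates the schedule rather than reproducing the thesis code.

```python
import numpy as np

def anneal_real_logits(encoded, epoch, start=0.5, end=10.0, warmup=500):
    """Linearly rescale real-data logits from ±start to ±end over the first
    `warmup` epochs. Hypothetical helper illustrating the schedule in the
    text; `encoded` is assumed to be stored at full magnitude ±end."""
    scale = end if epoch >= warmup else start + (end - start) * epoch / warmup
    return np.asarray(encoded) * (scale / end)
```

Early in training the real logits sit near ±0.5, so the two modes are close together; by epoch 500 they have spread to the full ±10 used by the encoder.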

3.5 Random Sampling of Conditional Probability Distributions

The numerical column of patient age was added to the donor and recipient data by means of random selection from the real conditional data distributions. Separate age distributions were

Table 3.11 Patient information columns and their data types that are generated by the WGAN.

CANDIDATE COLUMN   DONOR COLUMN   TYPE
CAN GENDER         DON GENDER     Categorical
CAN RACE           DON RACE       Categorical
CAN ABO            DON ABO        Categorical
REC A1             DON A1         Categorical
REC A2             DON A2         Categorical
REC B1             DON B1         Categorical
REC B2             DON B2         Categorical
REC CW1            DON CW1        Categorical
REC CW2            DON CW2        Categorical
REC DR1            DON DR1        Categorical
REC DR2            DON DR2        Categorical
REC DQW1           DON DQ1        Categorical
REC DQW2           DON DQ2        Categorical

Table 3.12 Summary of the dimensions and activation functions in the generator architecture.

Layer     Size   Activation
Input     100    None
Hidden1   200    Leaky ReLU
Hidden2   369    Leaky ReLU
Output    369    None

Table 3.13 Summary of the dimensions and activation functions in the critic architecture.

Layer     Size   Activation
Input     369    None
Hidden1   369    Leaky ReLU
Hidden2   100    Leaky ReLU
Output    1      None

Table 3.14 Key values used in the generator and critic models.

                            GENERATOR   CRITIC
Leaky ReLU negative slope   0.1         0.1
Weight clipping             None        ±0.01
Learning rate               0.00001     0.00001
Optimizer                   RMSprop     RMSprop

captured for donors and recipients, and those distributions were further conditioned on race and gender. This provided a level of data aggregation while maintaining at least some important joint distributions. An age was randomly chosen using the applicable distribution for each row in a donor or recipient table. For race-gender combinations with fewer than 10 samples in the real data, the non-conditional probability distribution was used to preserve privacy. Upon review of the final data set, it is clear that this age data was lost during the matching algorithm used in the next phase. Since some analysis was performed at the end of each phase, results of using this method are still presented in Chapter 4.
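The conditional sampling just described can be sketched as follows. The function and column names are illustrative assumptions, not the thesis code, but the logic (resample real ages per race-gender group, falling back to the unconditional distribution for small groups) mirrors the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def sample_ages(real, synth, min_group=10):
    """Assign ages to synthetic rows by resampling real ages conditioned on
    (race, gender), falling back to the unconditional distribution for
    combinations with fewer than `min_group` real samples. A sketch with
    illustrative column names, not the thesis implementation."""
    pools = {key: grp["AGE"].to_numpy()
             for key, grp in real.groupby(["RACE", "GENDER"])}
    all_ages = real["AGE"].to_numpy()
    ages = []
    for _, row in synth.iterrows():
        pool = pools.get((row["RACE"], row["GENDER"]), all_ages)
        if len(pool) < min_group:
            pool = all_ages          # privacy fallback for rare combinations
        ages.append(rng.choice(pool))
    return pd.Series(ages, index=synth.index, name="AGE")
```

Because each synthetic age is drawn from an empirical pool, the synthetic marginal should track the real one up to sampling noise, which is why the deviations reported in Chapter 4 point at the implementation rather than the method.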

3.6 Donor and Recipient Matching Algorithm

To create the donor and recipient pairs, a rematching algorithm from [25] was used to match the donor and recipient pools. Some code modifications were required to use the low-resolution HLA data that the synthetic and real data sets contained. Since the rematching code was designed to produce the best possible matches within the given patient pools, using the original rematching code, modified for low-resolution HLA, produced pairs skewed towards low HLA mismatches when compared to the HLA mismatch distribution of the real data set. Additionally, the rematching code only checked blood type compatibility, which did not correspond to the distributions in the real data set, which were biased towards exact blood type matches.

To adjust the HLA mismatch distribution to more closely resemble the real data set, an HLA mismatch value was randomly chosen from the real calculated distribution for each recipient, and the first donor that matched that mismatch value exactly was chosen. To adjust the blood type matching, a new hyperparameter was introduced that set the fraction of the time that an exact blood type match was required when pairing, instead of blood type compatibility alone. A value of 0.25 was chosen manually, by trial and error, to produce blood type pairings similar to the real data set. This blood type restriction was not used when an exact HLA match was searched for, as it proved too restrictive and underrepresented perfect HLA matches.

To allow for the more restrictive matching, it was necessary to have more donors than recipients. A recipient pool of 10,000 was sampled from a pool of almost 60,000, whereas the donor pool consisted of 50,000 samples from a pool of almost 60,000. An abbreviated version of the adapted donor-recipient matching code is in Listing 6 of Appendix A. The race columns for the donor and recipient pools, DON RACE and CAN RACE, had to be translated into lower-resolution race columns, DON RACE SRTR and CAN RACE SRTR.
These RACE SRTR columns happen to also be in the real data set, but were not originally slated for capture and generation. This translation was performed with a simple translation function that assigned the five race codes for White, Black, Asian, Native, and Pacific to those categories, and grouped all other race codes under a catch-all “Multi” category.
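A translation of this kind can be sketched as a dictionary lookup with a default. The numeric codes below are placeholders for illustration, not the actual SRTR code book.

```python
# Hypothetical numeric race codes mapped to the five named SRTR categories;
# every other code collapses into the catch-all "Multi" bucket.
RACE_SRTR_MAP = {8: "White", 16: "Black", 64: "Asian",
                 32: "Native", 128: "Pacific"}

def to_race_srtr(code):
    return RACE_SRTR_MAP.get(code, "Multi")
```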

CHAPTER 4 RESULTS

Between each phase (described in the previous chapter), an analysis was performed to evaluate the method used for that particular phase. The initial intention was to apply the chi-squared test as the primary goodness-of-fit metric, as it is well suited to categorical data. The chi-squared goodness-of-fit test turned out to be inappropriate, however, since fewer than 5 expected or observed samples existed for some levels in a category. Instead, the Fisher-Exact test was performed using the stats library in R [29] through the Python library rpy2 [30]. The statistical analysis was complemented by a visual analysis of the distribution plots. These analyses include:

• The WGAN with logit-based critic (WGAN-LC), which produced the patient medical profile of HLAs, blood type, gender, and race;

• The conditional distributions that generated and assigned an age based on gender and race; and

• An evaluation of the matching algorithm that generated donor and recipient pairs.
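As noted above, the chi-squared test was ruled out because some expected counts fell below five. That condition can be checked directly: the expected counts of a real-vs-synthetic contingency table are the outer product of its margins divided by the total. The counts below are invented for illustration.

```python
import numpy as np

def expected_counts(table):
    """Expected cell counts for a chi-squared test of independence on a
    contingency table (rows: real/synthetic, columns: category levels)."""
    table = np.asarray(table, dtype=float)
    rows = table.sum(axis=1, keepdims=True)
    cols = table.sum(axis=0, keepdims=True)
    return rows * cols / table.sum()

observed = np.array([[5000, 300, 2],     # real counts per level (invented)
                     [4900, 410, 1]])    # synthetic counts (invented)
# A rare third level drives an expected count below 5, invalidating chi-squared.
assert (expected_counts(observed) < 5).any()
```

Rare HLA antigens behave exactly like the third level here, which is what pushed the analysis towards the Fisher-Exact test.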

For a final comparison, two Fisher-Exact test benchmarks were developed: one using a random sampling from the real data set, the second using a synthetic data set generated with the Gaussian Copula from the Synthetic Data Vault (SDV) [6]. For a utility metric, the real and final synthetic data sets were both run through an unmodified, low-resolution HLA rematching algorithm with a sample size of 1000, and their HLA mismatch score distributions were compared.
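For context, the Gaussian-copula baseline works, in outline, by mapping each column to normal scores, capturing their correlation, sampling correlated normals, and mapping back through the empirical quantiles. The sketch below illustrates that idea for numeric columns only; it is not the SDV implementation, which also handles discrete columns.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
nd = NormalDist()

def gaussian_copula_sample(data, n):
    """Sample n rows resembling `data` (an m x d numeric array) via a
    Gaussian copula: ranks -> normal scores -> correlated normals ->
    empirical quantiles. Illustrative sketch only."""
    data = np.asarray(data, dtype=float)
    m, d = data.shape
    ranks = data.argsort(axis=0).argsort(axis=0)   # 0..m-1 within each column
    u = (ranks + 0.5) / m                          # uniform scores in (0, 1)
    z = np.vectorize(nd.inv_cdf)(u)                # normal scores
    corr = np.corrcoef(z, rowvar=False)            # dependence structure
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n)
    u_new = np.vectorize(nd.cdf)(z_new)
    # Back-transform through each column's empirical quantiles.
    return np.column_stack([np.quantile(data[:, j], u_new[:, j])
                            for j in range(d)])
```

Because every generated value passes through the empirical quantiles, the marginals of the output closely track the input while the copula preserves the pairwise dependence.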

4.1 Platform

All computational work was performed on a BGSU server approved for PHI storage and with the following hardware specifications:

• Intel Xeon E5-2620 (x2)

• 32GB RAM (4GB x8)

• 1TB 7200RPM HDD

• NVIDIA Tesla K20

4.2 Patient Medical Profile Results

The WGAN using the logit-encoded real data produced the following results after training on the donor data for 200,000 epochs. For the donor data, a synthetic sample of 59,856 records was generated and compared to the real data of the same size via a Fisher-Exact test. The Fisher-Exact results are in Table 4.1. With the p-values less than 0.05, we can reject the null hypothesis that the real and synthetic data are drawn from the same distribution.

Table 4.1 Fisher-Exact results for the patient profile data generated by the WGAN.

Column       p-value   Column       p-value
DON A1       0.0005    REC A1       0.0005
DON A2       0.0005    REC A2       0.0005
DON B1       0.0005    REC B1       0.0005
DON B2       0.0005    REC B2       0.0005
DON DR1      0.0005    REC DR1      0.0005
DON DR2      0.0005    REC DR2      0.0005
DON ABO      0.0005    CAN ABO      0.0005
DON DQ1      0.0005    REC DQW1     0.0005
DON DQ2      0.0005    REC DQW2     0.0005
DON GENDER   0.0000    CAN GENDER   0.0000

Given that the distributions are not statistically similar enough to say they come from the same distribution, a visual analysis of the distributions was performed. The categories were reindexed and ordered in descending order of frequency based on the real data distribution for each column. It is important to note that the order of reindexed categories is not guaranteed to be the same between the donor and recipient data. While the y-axis has been normalized to frequency, note that there are almost 60,000 samples; frequency deviations that might have been statistically acceptable at low sample sizes are further restricted at high sample sizes. The visualizations for the race data and CW HLA are omitted here, as the race had already been translated in Phase III and the CW antigen dropped before this visualization was created. The distribution plots can be seen in Figs. 4.1 – 4.5.
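The reindexing used for these plots can be sketched as follows (the helper name is illustrative): category levels are ranked by their frequency in the real column, and both the real and synthetic columns are mapped onto that shared integer index.

```python
import pandas as pd

def reindex_by_real_frequency(real_col, synth_col):
    """Map both columns onto integer indices ordered by descending
    frequency in the real data (0 = most frequent real level)."""
    order = real_col.value_counts().index.tolist()   # most frequent first
    mapping = {level: i for i, level in enumerate(order)}
    return real_col.map(mapping), synth_col.map(mapping)

real = pd.Series(["O", "A", "O", "B", "A", "O"])     # toy category column
synth = pd.Series(["A", "O", "B"])
r_idx, s_idx = reindex_by_real_frequency(real, synth)
```

Since the ordering is derived from each real column independently, the same integer index can denote different antigens in the donor and recipient plots, which is the caveat noted above.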

Figure 4.1 Patient profile frequency distributions, part 1 of 5.

Figure 4.2 Patient profile frequency distributions, part 2 of 5.

Figure 4.3 Patient profile frequency distributions, part 3 of 5.

Figure 4.4 Patient profile frequency distributions, part 4 of 5.

Figure 4.5 Patient profile frequency distributions, part 5 of 5.

Comparing the blood type graphs for the donor and recipient, it is clear that there are significant differences between the real data of the two, even without identifying the exact blood types. There are also differences between the HLA distributions of the real donors and recipients. This can be seen with the CW HLA, for example, in Fig. 4.6, which compares the real CW1 data for the donors and recipients. A separate generator could be trained on the recipient data to improve the results.

Figure 4.6 Comparison between the real donor and recipient data CW1 HLA distributions.

4.3 Age Assignment Results

The Fisher-Exact results for the donor and recipient age data were both 0.0005. With the p-values less than 0.05, we can reject the null hypothesis that the real and synthetic ages are drawn from the same distribution. At first glance this might seem a surprising result, considering the method of generation was random sampling from the real distribution. A deeper consideration of the method should focus on the conditional distributions of age on race and gender. Inspecting the age distribution results for white males in Fig. 4.8, there seems to be a bias of the synthetic data towards the higher ages. This might indicate an issue in the random sampling implementation used in this method. The deviations in the overall age distribution are some combination of the distribution anomalies in race and gender and potential issues with the random sampling implementation.

Figure 4.7 Age distribution plots for real and synthetic data sets. Donor plot on top and recipient plot on bottom.

Figure 4.8 Age distribution plots for real and synthetic data sets for white males. Donor plot on top and recipient plot on bottom.

4.4 Donor and Recipient Matching Results

To utilize the matching algorithm to pair the donor and recipient data, the race columns had to be converted to RACE SRTR columns. Since the RACE SRTR columns have fewer race categories, it is worth checking the distributions again with the Fisher-Exact test before the matching process. The Fisher-Exact result for the RACE SRTR columns is 0.0005 for both donor and recipient.

The pairing of donors and recipients provides an opportunity to look at the Fisher-Exact results for the pairs. The distribution plots should also be reevaluated, since the pairing algorithm used 1000 recipients and 10,000 donors to produce 1000 donor-recipient pairs, which could change the donor distributions. Unsurprisingly, the Fisher-Exact results were the same as in previous phases, with the phased approach of WGAN and matching algorithm achieving a p-value of 0.0005 in all columns. The Fisher-Exact results for the synthetic data set generated by the SDV [6] with Gaussian copulas also had a p-value of 0.0005 in all columns. With the p-values being less than 0.05, the null hypothesis that the real and synthetic data sets come from the same distribution can be rejected for both synthetic data sets. The statistical results are summarized in Table 4.2. The distribution plots for the donor columns can be viewed in Figs. 4.9 and 4.10 and for the recipient columns in Figs. 4.11 and 4.12.

It is worth noting that several variables were lost during the matching algorithm, including age, gender, and the DQ HLA, and were not included in the statistical tests after matching. Had the CW HLA data not been dropped previously, it too would have been lost in this step. This could be corrected in future work by further modifying the matching algorithm to retain those columns.

Table 4.2 Fisher-Exact results for the paired patient data.

Column          WGAN+M p-value   SDV p-value
DON A1          0.0005           0.0005
DON A2          0.0005           0.0005
DON B1          0.0005           0.0005
DON B2          0.0005           0.0005
DON DR1         0.0005           0.0005
DON DR2         0.0005           0.0005
DON ABO         0.0005           0.0005
DON RACE SRTR   0.0005           0.0005
REC A1          0.0005           0.0005
REC A2          0.0005           0.0005
REC B1          0.0005           0.0005
REC B2          0.0005           0.0005
REC DR1         0.0005           0.0005
REC DR2         0.0005           0.0005
CAN ABO         0.0005           0.0005
CAN RACE SRTR   0.0005           0.0005

Figure 4.9 Distribution plots for donor columns in matched synthetic and real data sets, part 1 of 2.

Figure 4.10 Distribution plots for donor columns in matched synthetic and real data sets, part 2 of 2.

Figure 4.11 Distribution plots for recipient columns in matched synthetic and real data sets, part 1 of 2.

Figure 4.12 Distribution plots for recipient columns in matched synthetic and real data sets, part 2 of 2.

4.5 Rematching Results

The rematching results for the real and synthetic data yielded two improved and visually similar results. The real data improved from an average mismatch score of 4.03 to 1.77, and the synthetic data improved from 4.06 to 1.53. Not reflected in the scores are the 149 and 148 unmatched donor-recipient pairs for the real and synthetic data, respectively. The original mismatch distribution at the top of Fig. 4.13 can also be inspected as a derived column of the original matching in Phase III, which also shows visually similar results between the real and synthetic data. Overall, it seems this synthetic data set may provide insights for rematching models that are translatable to real-world data performance. The synthetic data show that the current greedy algorithm makes better matches, with a mode of 2 mismatches, but leaves about 15% of the pairs unmatched in a 1000-recipient by 1000-donor rematching trial.
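For reference, the mismatch score underlying these distributions is computed per pair as in Listing 6: the number of the donor's six low-resolution antigens (A1, A2, B1, B2, DR1, DR2) absent from the recipient's six. The antigen values below are invented for illustration.

```python
# Illustrative low-resolution antigen sets (values invented):
donor_hla = {"A2", "A24", "B7", "B8", "DR4", "DR15"}
recipient_hla = {"A2", "A3", "B7", "B44", "DR4", "DR7"}

# Mismatch score: donor antigens absent from the recipient's set.
mismatches = len(donor_hla - recipient_hla)
```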

Figure 4.13 Rematching results on real and synthetic data sets. Note that 7 mismatches represents the unmatched pairs.

CHAPTER 5 CONCLUSION

Significant progress was made towards developing a synthetic data set for kidney transplantation. The kidney transplant data set is highly categorical, which has been a challenge for GAN-type architectures. This thesis functioned as a proof of concept for a new method to encode categorical data using a logit-based encoding that removes the need for a Gumbel-Softmax at the output layer of the generator. It also demonstrated that an iterative approach that divided the data into functional groups and assembled them in phases was feasible. While the Fisher-Exact test showed that the synthetic data set was not statistically similar to the real data, the distribution plots displayed significant similarities between the two data sets. The results of the rematching algorithm performed on the completed synthetic and real data sets demonstrate that exact statistical similarity is not necessary to achieve useful machine learning results. While a complete synthetic data set was not finished within the bounds of this project, these methods and results demonstrate that such a synthetic data set is achievable with a combination of WGANs and matching algorithms.

5.1 Future Work

There are several avenues available for future work. Focusing on the kidney transplant data set, work should continue on expanding to the columns listed in Tables 3.1 – 3.3 and on improving the columns generated in this project. One such improvement would be a recipient-trained generator during Phase I of patient data generation. In general, the statistical and utility metrics used to evaluate the synthetic data set should be expanded, with a focus on machine learning efficacy metrics.

It would also be interesting to expand the patient generator to produce complete donor-recipient pairs and evaluate whether the matching algorithm phase could be bypassed completely. It is unclear whether the blood type compatibility and HLA mismatch distributions would be maintained, but if they are, it would simplify the generation process significantly. This could potentially be expanded to all categorical columns.

A proper privacy analysis could be performed on the synthetic data set to evaluate the privacy of the proposed method. This could then be compared to other methods with empirical privacy measurements, such as EMR-WGAN [14] and medWGAN [17], and also to ones specifically designed with differential privacy guarantees, such as CorGAN [18] and PATE-GAN [16].

Another focus of future work would be a deeper investigation into using the logit encoding as a means to replicate categorical data in a GAN architecture. Significant benchmarking could be performed versus alternatives such as the Gumbel-Softmax. In the domain of benchmarking, a comparative analysis could be performed between the WGAN using the logit encoding and other synthetic data generators, such as CTGAN [21], Gaussian copulas, and Tabular Variational Autoencoders (TVAE) [6].

BIBLIOGRAPHY

[1] Office for Civil Rights and B. Malin, “Methods for de-identification of PHI,” Nov 2015, last accessed on 2021-03-12. [Online]. Available: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html

[2] S. M. Bellovin, P. K. Dutta, and N. Reitinger, “Privacy and synthetic datasets,” Stan. Tech. L. Rev., vol. 22, p. 1, 2019.

[3] J. Gui, Z. Sun, Y. Wen, D. Tao, and J. Ye, “A review on generative adversarial networks: Algorithms, theory, and applications,” 2020.

[4] Scientific Registry of Transplant Recipients, “SAF Data Dictionary.” Accessed Mar. 21, 2021. [Online]. Available: https://www.srtr.org/requesting-srtr-data/saf-data-dictionary/

[5] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with Gumbel-Softmax,” 2017.

[6] N. Patki, R. Wedge, and K. Veeramachaneni, “The Synthetic Data Vault,” in 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Oct 2016, pp. 399–410.

[7] I. J. Goodfellow et al., “Generative adversarial nets,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’14. Cambridge, MA, USA: MIT Press, 2014, p. 2672–2680.

[8] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ser. ICML’17. JMLR.org, 2017, p. 214–223.

[9] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of Wasserstein GANs,” in Advances in Neural Information Processing Systems, 45 I. Guyon et al., Eds., vol. 30. Curran Associates, Inc., 2017. [Online]. Available: https: //proceedings.neurips.cc/paper/2017/file/892c3b1c6dccd52936e27cbd0ff683d6-Paper.pdf

[10] G. Caiola and J. P. Reiter, “Random forests for generating partially synthetic, categorical data,” Trans. Data Privacy, vol. 3, no. 1, p. 27–42, Apr. 2010.

[11] C. J. Maddison, A. Mnih, and Y. Whye Teh, “The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables,” arXiv e-prints, p. arXiv:1611.00712, Nov. 2016.

[12] R. D. Camino, “GAN applications with discrete data,” SEDAN Lab, SnT, University of Luxembourg, Luxembourg, 2018.

[13] M. Zamani and A. Sadeghian, “A variation of particle swarm optimization for training of artificial neural networks,” in Computational Intelligence and Modern Heuristics, A.-D. Ali, Ed. IntechOpen, February 2010, pp. 131–144.

[14] E. Choi et al., “Generating multi-label discrete patient records using generative adversarial networks,” in Proceedings of the 2nd Machine Learning for Healthcare Conference, ser. Proceedings of Machine Learning Research, F. Doshi-Velez et al., Eds., vol. 68. Boston, Massachusetts: PMLR, 18–19 Aug 2017, pp. 286–305. [Online]. Available: http://proceedings.mlr.press/v68/choi17a.html

[15] Z. Zhang, C. Yan, D. A. Mesa, J. Sun, and B. A. Malin, “Ensuring electronic medical record simulation through better training, modeling, and evaluation,” Journal of the American Med- ical Informatics Association, vol. 27, no. 1, pp. 99–108, 10 2019.

[16] J. Yoon, J. Jordon, and M. van der Schaar, “PATE-GAN: Generating synthetic data with differential privacy guarantees,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=S1zk9iRqF7

[17] M. K. Baowaly, C. C. Lin, C. L. Liu, and K. T. Chen, “Synthesizing electronic health records using improved generative adversarial networks,” J Am Med Inform Assoc, vol. 26, no. 3, pp. 228–241, 03 2019.

[18] A. Torfi and E. A. Fox, “CorGAN: Correlation-capturing convolutional generative adversarial networks for generating synthetic healthcare records,” in Proceedings of the Thirty-Third International Florida Artificial Intelligence Research Society Conference, originally to be held in North Miami Beach, Florida, USA, May 17-20, 2020, R. Barták and E. Bell, Eds. AAAI Press, 2020, pp. 335–340. [Online]. Available: https://aaai.org/ocs/index.php/FLAIRS/FLAIRS20/paper/view/18458

[19] J. Hu, J. P. Reiter, and Q. Wang, “Disclosure risk evaluation for fully synthetic categorical data,” in Privacy in Statistical Databases, J. Domingo-Ferrer, Ed. Cham: Springer Interna- tional Publishing, 2014, pp. 185–199.

[20] C. Soneson and M. D. Robinson, “Towards unified quality verification of synthetic count data with countsimQC,” Bioinformatics, vol. 34, no. 4, pp. 691–692, 02 2018.

[21] L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni, “Modeling tabular data using conditional GAN,” in Advances in Neural Information Processing Systems, H. Wallach et al., Eds., vol. 32. Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/254ed7d2de3b23ab10936522dd547b78-Paper.pdf

[22] J. Jordon, J. Yoon, and M. van der Schaar, “Measuring the quality of synthetic data for use in competitions,” arXiv e-prints, vol. abs/1806.11345, 2018. [Online]. Available: http://arxiv.org/abs/1806.11345

[23] N. Santos, P. Tubertini, A. Viana, and J. P. Pedroso, “Kidney exchange simulation and optimization,” Journal of the Operational Research Society, vol. 68, no. 12, pp. 1521–1532, 2017. [Online]. Available: https://doi.org/10.1057/s41274-016-0174-3

[24] B. Li et al., “Predicting patient survival after deceased donor kidney transplantation using flexible parametric modelling,” BMC Nephrology, vol. 17, no. 1, 2016.

[25] J. Kleinknecht, “Machine learning and computational methods for evaluating kidney graft allocation,” Master’s thesis, Bowling Green State University, Aug 2020.

[26] D. A. Bekbolsynov, “Drug codes 092320,” Email exchange, Sep 2020, Excel file.

[27] C. R. Harris et al., “Array programming with NumPy,” Nature, vol. 585, no. 7825, pp. 357–362, Sep. 2020. [Online]. Available: https://doi.org/10.1038/s41586-020-2649-2

[28] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/

[29] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2013, ISBN 3-900051-07-0. [Online]. Available: http://www.R-project.org/

[30] L. Gautier, “rpy2: Python interface to the R language (embedded R),” last accessed on 2021-03-19. [Online]. Available: https://pypi.org/project/rpy2/

APPENDIX A SELECT PYTHON CODE FRAGMENTS

import numpy as np

data += np.random.normal(0., encoding_magnitude / 10.0, size=data.shape)

Listing 1: Adding noise to encoded categorical data.

import tensorflow as tf

def discriminator_loss(real_output, gen_output):
    return tf.reduce_mean(gen_output) - tf.reduce_mean(real_output)

def generator_loss(gen_output):
    return -tf.reduce_mean(gen_output)

Listing 2: Loss functions for the generator and discriminator.

# During each discriminator training step
for weights in discriminator.trainable_variables:
    weights.assign(tf.clip_by_value(weights, -0.01, 0.01, name=None))

Listing 3: Discriminator weight clipping.

import numpy as np

def random_choice(choices, p):
    pick = np.random.multinomial(1, p)
    pick = np.argmax(pick)
    return choices[pick]

Listing 4: Function to help with softmax decoding.

import pandas as pd
from scipy.special import softmax  # import added; decode() relies on it
# random_choice is defined in Listing 4

class LogitEncoder():
    def fit(self, df, cat_columns):
        self.non_cat = df.columns.tolist()
        self.columns_orig = df.columns.tolist()
        self.dtype_orig = {}
        for column in df.columns:
            self.dtype_orig[column] = df[column].dtype
        for column in cat_columns:
            self.non_cat.remove(column)
        self.length = len(self.non_cat)
        self.cat = cat_columns
        self.cat_dict = {}
        cat_len = 0
        for column in self.cat:
            self.cat_dict[column] = df[column].unique()
            cat_len += len(df[column].unique())
        self.length += cat_len
        self.columns = self.non_cat.copy()
        for column in self.cat:
            for cat in self.cat_dict[column]:
                self.columns.append(str(column) + '-' + str(cat))

    def encode(self, df, values=(-10.0, 10.0)):
        data = []
        def f1(x, cat):
            if x == cat:
                return values[1]
            else:
                return values[0]
        for column in self.non_cat:
            data.append(df[column])
        for column in self.cat:
            for cat in self.cat_dict[column]:
                data.append(df[column].apply(lambda x: f1(x, cat)))
        encoded_df = pd.concat(data, axis=1, keys=self.columns)
        return encoded_df

    def decode(self, encoded_df):
        data = []
        for column in self.non_cat:
            data.append(encoded_df[column])
        for column in self.cat:
            names = []
            for cat in self.cat_dict[column]:
                names.append(str(column) + '-' + str(cat))
            # wrap the softmax output back into a DataFrame so apply() works
            encoded_soft = pd.DataFrame(
                softmax(encoded_df[names].to_numpy(), axis=1),
                columns=names, index=encoded_df.index)
            data.append(encoded_soft.apply(lambda x: random_choice(names, x),
                        axis=1).apply(lambda x: x[len(str(column)) + 1:]))
        df = pd.concat(data, axis=1, keys=self.columns_orig)
        for column in df.columns:
            df[column] = df[column].astype(self.dtype_orig[column])
        return df

Listing 5: Helper class to store encoding information.

import numpy as np

def customMatchHLA(donData, recData, cutOff, resultsPath, resultsFilename,
                   considerBloodType=True, writeOutput=True):
    results = []
    for rRow in recData.itertuples():
        matchFound = False
        # misM_p: empirical HLA mismatch distribution (a pandas Series of
        # probabilities indexed by mismatch count), defined elsewhere
        cutOff = np.random.choice(misM_p.index, size=1,
                                  p=misM_p.values, replace=True)[0]
        alpha = 0.25  # tuning parameter
        choice = 1
        if cutOff > 0:
            choice = np.random.choice([0, 1], size=1,
                                      p=[alpha, 1.0 - alpha], replace=True)[0]
        for dRow in donData.itertuples():
            runMatch = True
            if considerBloodType == True:
                if choice == 0:
                    runMatch = checkExactABO(dRow[7], rRow[7])
                else:
                    runMatch = checkABO(dRow[7], rRow[7])
            if runMatch == True:
                misMatches = len(set(dRow[1:7]).difference(set(rRow[1:7])))
                if misMatches == cutOff:
                    result = {
                        "DON_ID": dRow[8],
                        "DON_RACE_SRTR": dRow[9],
                        "DON_A1": dRow[1],
                        "DON_A2": dRow[2],
                        "DON_B1": dRow[3],
                        "DON_B2": dRow[4],
                        "DON_DR1": dRow[5],
                        "DON_DR2": dRow[6],
                        "DON_ABO": dRow[7],
                        "REC_ID": rRow[8],
                        "REC_RACE_SRTR": rRow[9],
                        "REC_A1": rRow[1],
                        "REC_A2": rRow[2],
                        "REC_B1": rRow[3],
                        "REC_B2": rRow[4],
                        "REC_DR1": rRow[5],
                        "REC_DR2": rRow[6],
                        "REC_ABO": rRow[7],
                        "Mismatches": misMatches,
                        "cutOff": cutOff,
                        "bloodTypeConsidered": considerBloodType,
                    }
                    matchFound = True
                    results.append(result)
                    recData.drop(rRow[0], inplace=True)
                    donData.drop(dRow[0], inplace=True)
                    break  # Only doing first come, first served
    return donData, recData, results

Listing 6: Matching algorithm adapted from [25].