Paper DV03

Using Sankey to Analyze Drug Pipeline

Tanmay Khole, Bristol-Myers Squibb, Berkeley Heights NJ, USA

ABSTRACT

Sankey diagrams are a specific type of flow diagram in which the width of each arrow is proportional to the flow quantity. They put visual emphasis on the major transfers or flows within a system and help locate the dominant contributions to an overall flow. This paper focuses on the drug pipeline of a sponsor and leverages data from clinicaltrials.gov to analyze the number of clinical trials a sponsor has with respect to conditions, interventions, and phases. The results are visualized with Sankey diagrams, which display the weightage a sponsor has given to a drug or a condition based on the phases of its clinical trials. A drug pipeline gives an idea of the future of a company, and this paper takes a deep dive into some of its aspects using Sankey diagrams.

INTRODUCTION

This paper analyzes data from clinicaltrials.gov for a few selected clinical trial sponsors and uses that information to create Sankey diagrams. A Sankey diagram is used to depict a flow from one set of values to another. The things being connected are called nodes and the connections are called links. Sankey diagrams are best used to show a many-to-many mapping between two domains or multiple paths through a set of stages. Data from clinicaltrials.gov is an excellent fit: it lets us analyze a sponsor’s drug pipeline and see which clinical conditions or interventions the sponsor focuses on with respect to the stages of clinical trials. Data mapping, data analysis, and data visualization are used to create the Sankey diagrams displayed in this paper. Phase 1 clinical trials are excluded from data analysis and data visualization to make the flow of clinical trials in Phases 2-4 easier to follow. Data is obtained in csv file format from clinicaltrials.gov using the advanced search option and searching only on the sponsor field. Analysis is performed on trials with status "Active, not recruiting", "Available", "Enrolling by invitation", "Not yet recruiting", or "Recruiting".


SANKEY DIAGRAM FOR CLINICALTRIALS.GOV DATA

Data obtained from clinicaltrials.gov in csv format has one record per trial (see figure 1). To use it for a Sankey diagram, it needs to be processed in the following steps:

• Data Mapping

• Data Analysis

• Data Visualization

Figure 1: Data obtained from clinicaltrials.gov and imported into SAS® dataset.

The sponsors listed in table 1 are considered in this paper for data analysis, and Sankey diagrams are created for the on-going clinical trials of each sponsor. Clinical trials with status "Active, not recruiting", "Available", "Enrolling by invitation", "Not yet recruiting", or "Recruiting" are considered on-going. Only clinical trials in which the sponsor is the lead sponsor are selected.
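The selection rule above can be sketched in a few lines of Python (used here only for illustration; the paper itself works in SAS®). The column names "Status" and "Sponsor/Collaborators", the pipe-delimited sponsor list with the lead sponsor first, and the sample records are assumptions about the csv export, not details given in this paper.

```python
# On-going statuses as listed in the paper.
ONGOING_STATUSES = {
    "Active, not recruiting",
    "Available",
    "Enrolling by invitation",
    "Not yet recruiting",
    "Recruiting",
}


def is_ongoing_lead_trial(record, sponsor):
    """Return True if the trial is on-going and `sponsor` is its lead sponsor."""
    # Assumption: sponsors are pipe-delimited and the lead sponsor comes first.
    lead = record["Sponsor/Collaborators"].split("|")[0].strip()
    return record["Status"] in ONGOING_STATUSES and lead == sponsor


# Hypothetical records standing in for rows read from the csv file.
records = [
    {"NCT Number": "TRIAL-1", "Status": "Recruiting",
     "Sponsor/Collaborators": "Sponsor X|Partner Y"},
    {"NCT Number": "TRIAL-2", "Status": "Completed",
     "Sponsor/Collaborators": "Sponsor X"},
    {"NCT Number": "TRIAL-3", "Status": "Recruiting",
     "Sponsor/Collaborators": "Partner Y|Sponsor X"},
]

selected = [r["NCT Number"] for r in records
            if is_ongoing_lead_trial(r, "Sponsor X")]
print(selected)
```

Only the first record passes both checks: the second is not on-going and the third lists the sponsor as a collaborator.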

            Sponsor                 Distinct On-going Clinical Trials Count    Data Extraction Date
Sponsor 1   Bristol-Myers Squibb    250                                        22NOV2019
Sponsor 2   Janssen                 126
Sponsor 3   Merck & Co.             173                                        22JAN2020
Sponsor 4   Amgen                   56
Sponsor 5   Bayer                   56

Table 1: List of Sponsors


DATA MAPPING

Data mapping is essential for connecting links and nodes in Sankey diagrams. Clinical trials data obtained from clinicaltrials.gov contains multiple names for the same condition (e.g., "NSCLC", "Non-Small Cell Lung Cancer", or "Carcinoma, Non-Small-Cell Lung"; see figure 2) and multiple names for the same drug/biologic compound (e.g., "Nivolumab", "Opdivo", "BMS-936558", "ONO-4538"; see figure 3). Hence it is important to assign each condition and intervention to the correct category. As there are numerous conditions, they are mapped into high-level categories such as Solid Tumors, Cardiovascular, Leukemia & Lymphoma, etc. See figure 4 for an example of mapping different conditions to a high-level category.

Figure 2: Mapping different names of same condition into single category.

Figure 3: Mapping different names of same compound/intervention into single category.

Figure 4: Mapping different conditions to high-level category.
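As a rough illustration of the mapping shown in figures 2 to 4, the synonym lists can be held in lookup tables. This is a Python sketch, not the flag-based SAS® approach used in the paper; the function name `normalize`, the dictionary names, and the assignment of Non-Small Cell Lung Cancer to "Solid Tumors" are illustrative assumptions.

```python
# Synonyms taken from the examples in the text, keyed in lower case.
CONDITION_SYNONYMS = {
    "nsclc": "Non-Small Cell Lung Cancer",
    "non-small cell lung cancer": "Non-Small Cell Lung Cancer",
    "carcinoma, non-small-cell lung": "Non-Small Cell Lung Cancer",
}

INTERVENTION_SYNONYMS = {
    "nivolumab": "Nivolumab",
    "opdivo": "Nivolumab",
    "bms-936558": "Nivolumab",
    "ono-4538": "Nivolumab",
}

# Assumed high-level category for illustration only.
HIGH_LEVEL_CATEGORY = {
    "Non-Small Cell Lung Cancer": "Solid Tumors",
}


def normalize(name, synonyms):
    """Map a raw name to its single category; unknown names pass through."""
    cleaned = name.strip()
    return synonyms.get(cleaned.lower(), cleaned)


condition = normalize("Carcinoma, Non-Small-Cell Lung", CONDITION_SYNONYMS)
category = HIGH_LEVEL_CATEGORY.get(condition, "Unmapped")
compound = normalize(" Opdivo ", INTERVENTION_SYNONYMS)
print(condition, "|", category, "|", compound)
```

Names that are missing from the tables are returned unchanged, which makes unmapped values easy to spot when reviewing the data.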


The mapping rules below are applied before the data analysis step. They are designed to identify the focus of the sponsor with regard to clinical conditions and interventions.

• Clinical trials with multiple phases are mapped toward the higher phase

• Clinical trials with multiple clinical conditions are mapped towards each condition

• Clinical trials with multiple interventions are mapped towards each intervention of the respective sponsor

Example 1: Clinical trial NCT03331198, title “Study Evaluating Safety and Efficacy of JCAR017 in Subjects With Relapsed or Refractory Chronic Lymphocytic Leukemia (CLL) or Small Lymphocytic Lymphoma (SLL)”, has trial design for phase 1 and phase 2. As per the mapping rules, it will be mapped for Phase 2 only. This trial also has multiple clinical conditions listed such as Chronic Lymphocytic Leukemia, Small Lymphocytic Lymphoma, and will be mapped to each clinical condition as per the mapping rules.

Example 2: Clinical trial NCT04088500, title “A Study of Combination Nivolumab and Ipilimumab Retreatment in Patients With Advanced Renal Cell Carcinoma” has multiple interventions: Nivolumab and Ipilimumab. As per the mapping rules, this trial will be mapped to each intervention listed.

Example 3: Clinical trial NCT03036098, title “Study of Nivolumab in Combination With Ipilimumab or Standard of Care Chemotherapy Compared to the Standard of Care Chemotherapy Alone in Treatment of Patients With Untreated Inoperable or Metastatic Urothelial Cancer” has multiple interventions: nivolumab, ipilimumab, gemcitabine, cisplatin, carboplatin but only the first two are sponsor’s compounds, hence this trial will be mapped to two interventions: nivolumab & ipilimumab.
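The three mapping rules and the examples above can be expressed as small helper functions. The following Python sketch is an illustration only. It assumes that multi-valued fields are pipe-delimited and that interventions carry a type prefix such as "Drug:" or "Biological:"; the function names and the field layout are assumptions, not part of the paper's SAS® code.

```python
# Phases kept in the analysis, in ascending order (Phase 1 is excluded).
PHASE_ORDER = ["Phase 2", "Phase 3", "Phase 4"]


def highest_phase(phases_field):
    """Rule 1: a multi-phase trial is mapped to its higher phase."""
    phases = [p.strip() for p in phases_field.split("|")]
    kept = [p for p in phases if p in PHASE_ORDER]
    if not kept:
        return None  # e.g. a Phase 1 only trial, which is excluded
    return max(kept, key=PHASE_ORDER.index)


def split_conditions(conditions_field):
    """Rule 2: a trial is mapped to each of its conditions."""
    return [c.strip() for c in conditions_field.split("|") if c.strip()]


def sponsor_interventions(interventions_field, sponsor_compounds):
    """Rule 3: keep only the interventions that are the sponsor's compounds."""
    names = [part.split(":", 1)[-1].strip()
             for part in interventions_field.split("|")]
    return [n for n in names if n.lower() in sponsor_compounds]


# Values modelled on Examples 1 and 3 (field layout is assumed).
print(highest_phase("Phase 1|Phase 2"))
print(split_conditions(
    "Chronic Lymphocytic Leukemia|Small Lymphocytic Lymphoma"))
print(sponsor_interventions(
    "Biological: Nivolumab|Biological: Ipilimumab|Drug: Gemcitabine"
    "|Drug: Cisplatin|Drug: Carboplatin",
    {"nivolumab", "ipilimumab"},
))
```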

Data mapping for this paper is performed by creating flags/identifiers for each condition and intervention listed in the respective sponsor’s clinical trials data. Each sponsor listed in table 1 has unique compounds, and mapping each compound/intervention requires closely inspecting the data.

Data obtained from clinicaltrials.gov is one record per trial (horizontal data format) and it needs to be transformed into vertical data format as shown in figure 5 by using the flags created for each condition category and intervention.


Figure 5: Horizontal data mapped and transformed into vertical data format
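One way to picture the horizontal-to-vertical transformation is to emit one row per combination of mapped condition and intervention for each trial. The Python sketch below assumes that combination rule. The paper does not spell out how conditions and interventions are paired, and the record layout, key names, and sample values are hypothetical.

```python
from itertools import product


def to_vertical(trial):
    """Expand one trial record into one row per (condition, intervention)."""
    rows = []
    # Assumption: every mapped condition is paired with every mapped
    # intervention of the trial.
    for condition, intervention in product(trial["conditions"],
                                           trial["interventions"]):
        rows.append({
            "trial_id": trial["trial_id"],
            "sponsor": trial["sponsor"],
            "conditions": condition,
            "interventions": intervention,
            "phases": trial["phase"],
        })
    return rows


# Hypothetical trial with two conditions and two interventions.
trial = {
    "trial_id": "TRIAL-1",
    "sponsor": "Sponsor X",
    "conditions": ["Condition A", "Condition B"],
    "interventions": ["Drug P", "Drug Q"],
    "phase": "Phase 2",
}

vertical = to_vertical(trial)
for row in vertical:
    print(row)
```

The output column names match the node names passed to %sankey_nodes (sponsor, conditions, interventions, phases), so each row describes one path through the diagram.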

DATA ANALYSIS

Data analysis is performed by counting the objects in each category that needs to be displayed in the Sankey diagram. The categories are used as nodes, and the object counts determine the width of the links between the selected categories. In this paper, data analysis is performed by counting clinical trials with respect to sponsor, conditions, interventions, and phases. This step is performed after data mapping to ensure that links and nodes are connected correctly. The SAS® macro %sankey_nodes is used for data analysis; reference code can be found in the appendix.

%sankey_nodes(inds  = ct_gov
             ,outds = sankey_out
             ,nodes = %str(sponsor|conditions|interventions|phases)
             ,cond  =
             );

%sankey_nodes calculates the number of objects, in this case the number of clinical trials. The mapped data is passed in the “inds” macro parameter. The nodes (categories) that need to be displayed in the Sankey diagram are listed in the “nodes” macro parameter, and any condition that needs to be applied can be listed in the “cond” macro parameter. The macro creates the output dataset named in “outds” and a macro variable &sankeydata., both of which hold the data for the Sankey diagram. The macro variable is used in the data visualization step to create the Sankey diagram.
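For readers who want to follow the counting logic outside SAS®, the sketch below approximates in Python what the appendix macro appears to do. For each adjacent pair of node columns, it groups the rows by all node columns up to and including the target and counts the rows in each group. The function name `sankey_links` and the sample rows are hypothetical.

```python
from collections import Counter


def sankey_links(rows, nodes):
    """Count rows for each source/target pair of adjacent node columns."""
    links = []
    for i in range(len(nodes) - 1):
        # Group by every node column up to and including the target,
        # mirroring the GROUP BY built in the appendix macro.
        group_cols = nodes[: i + 2]
        counts = Counter(tuple(row[col] for col in group_cols)
                         for row in rows)
        for key, value in sorted(counts.items()):
            links.append({"source": key[-2], "target": key[-1],
                          "value": value})
    return links


# Hypothetical mapped rows in vertical format.
rows = [
    {"sponsor": "Sponsor X", "conditions": "Solid Tumors",
     "interventions": "Drug A", "phases": "Phase 3"},
    {"sponsor": "Sponsor X", "conditions": "Solid Tumors",
     "interventions": "Drug A", "phases": "Phase 2"},
    {"sponsor": "Sponsor X", "conditions": "Cardiovascular",
     "interventions": "Drug B", "phases": "Phase 3"},
]

nodes = ["sponsor", "conditions", "interventions", "phases"]
links = sankey_links(rows, nodes)
for link in links:
    print(link)
```

Because the grouping includes the preceding node columns, the same source/target pair can appear in more than one link when it is reached through different upstream categories.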


DATA VISUALIZATION

The data visualization step is performed using the SAS® macro %sankey2html and D3.js, a JavaScript library. The output is created in HTML format.

%sankey2html(indata   = %nrbquote(&sankeydata.)
            ,outfl    = %sysfunc(pathname(outg,f))/sankey.html
            ,width    = 2100
            ,height   = 700
            ,flow_num =
            );

The %sankey2html macro reads the macro variable &sankeydata. created by %sankey_nodes and embeds it in an HTML file. The output file location and HTML filename are specified in the “outfl” macro parameter. The “width” and “height” parameters set the width and height of the Sankey diagram. The “flow_num” parameter is used to display link labels only for links above a specified number.
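The effect of a threshold such as “flow_num” can be illustrated with a short Python sketch. It reads "above a specified number" as strictly greater than the threshold and treats a blank threshold as "label every link". Both readings are assumptions, as are the function name `link_labels` and the sample links.

```python
def link_labels(links, flow_num=None):
    """Return label text for links whose value exceeds flow_num."""
    labels = []
    for link in links:
        # Assumption: no threshold means every link gets a label.
        if flow_num is None or link["value"] > flow_num:
            labels.append("{} -> {}: {}".format(
                link["source"], link["target"], link["value"]))
    return labels


# Hypothetical links of the kind produced in the data analysis step.
links = [
    {"source": "Sponsor X", "target": "Solid Tumors", "value": 12},
    {"source": "Sponsor X", "target": "Cardiovascular", "value": 3},
]

all_labels = link_labels(links)
big_labels = link_labels(links, flow_num=5)
print(all_labels)
print(big_labels)
```

Suppressing labels on thin links keeps a dense diagram readable while the link widths still convey the smaller counts.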

The Sankey diagrams display the flow of the number of clinical trials from SPONSOR → CONDITIONS → INTERVENTIONS → PHASES, which are also the nodes of the Sankey diagrams displayed in this paper. The thickness of a link signifies the number of clinical trials connecting the nodes.


SANKEY DIAGRAM 1 Sponsor: Bristol-Myers Squibb

Node 1: Sponsor; Node 2: Clinical Conditions; Node 3: Interventions; Node 4: Clinical Trial Phases

The number of on-going clinical trials for each node is displayed in parentheses. Clinical trials with multiple phases are counted toward the higher phase; clinical trials with multiple clinical conditions are counted toward each condition; clinical trials with multiple interventions are counted toward each intervention.

Note: The data analysis and data visualization performed in this paper are not an official representation of any sponsor’s pipeline; they are based on the data acquired from clinicaltrials.gov.


SANKEY DIAGRAM 2 Sponsor: Janssen

Node 1: Sponsor; Node 2: Clinical Conditions; Node 3: Interventions; Node 4: Clinical Trial Phases

The number of on-going clinical trials for each node is displayed in parentheses. Clinical trials with multiple phases are counted toward the higher phase; clinical trials with multiple clinical conditions are counted toward each condition; clinical trials with multiple interventions are counted toward each intervention.

Note: The data analysis and data visualization performed in this paper are not an official representation of any sponsor’s pipeline; they are based on the data acquired from clinicaltrials.gov.

SANKEY DIAGRAM 3 Sponsor: Merck & Co.

Node 1: Sponsor; Node 2: Clinical Conditions; Node 3: Interventions; Node 4: Clinical Trial Phases

The number of on-going clinical trials for each node is displayed in parentheses. Clinical trials with multiple phases are counted toward the higher phase; clinical trials with multiple clinical conditions are counted toward each condition; clinical trials with multiple interventions are counted toward each intervention.

Note: The data analysis and data visualization performed in this paper are not an official representation of any sponsor’s pipeline; they are based on the data acquired from clinicaltrials.gov.

SANKEY DIAGRAM 4 Sponsor: Amgen

Node 1: Sponsor; Node 2: Clinical Conditions; Node 3: Interventions; Node 4: Clinical Trial Phases

The number of on-going clinical trials for each node is displayed in parentheses. Clinical trials with multiple phases are counted toward the higher phase; clinical trials with multiple clinical conditions are counted toward each condition; clinical trials with multiple interventions are counted toward each intervention.

Note: The data analysis and data visualization performed in this paper are not an official representation of any sponsor’s pipeline; they are based on the data acquired from clinicaltrials.gov.

SANKEY DIAGRAM 5 Sponsor: Bayer

Node 1: Sponsor; Node 2: Clinical Conditions; Node 3: Interventions; Node 4: Clinical Trial Phases

The number of on-going clinical trials for each node is displayed in parentheses. Clinical trials with multiple phases are counted toward the higher phase; clinical trials with multiple clinical conditions are counted toward each condition; clinical trials with multiple interventions are counted toward each intervention.

Note: The data analysis and data visualization performed in this paper are not an official representation of any sponsor’s pipeline; they are based on the data acquired from clinicaltrials.gov.

CONCLUSION

The Sankey diagram is an effective data visualization tool for understanding the flow of clinical trials. It helps track many clinical trials in a single view. It also conveys the weightage of a clinical condition or intervention with respect to the phases of clinical trials, and it represents flow in a manner that can be grasped quickly. The Sankey diagrams in this paper allow the user to see the complex pipeline of a sponsor in a single image, with a focus on the clinical conditions and interventions/compounds of that sponsor. Sankey diagrams make dominant clinical conditions or interventions stand out, and they help users see relative magnitudes and/or the areas with the largest opportunities. Sankey diagrams also support multiple viewing levels: using the provided macros, users can get a high-level view, see specific details, or generate custom diagrams adjusted to their needs.

NOTE FROM THE AUTHOR

The data analysis and data visualization performed in this paper are not an official representation of any sponsor’s pipeline; they are based on the public data acquired from clinicaltrials.gov.

This presentation reflects views of the author and should not be construed to represent any of the clinical trial sponsors’ pipeline.

ClinicalTrials.gov is a Web-based resource that provides patients, their family members, health care professionals, researchers, and the public with easy access to information on publicly and privately supported clinical studies on a wide range of diseases and conditions.

ACKNOWLEDGMENTS

I would like to thank Vineet Mathur and Simon Xue for their guidance and support for this paper. You supported me greatly and were always willing to help me.

I would like to thank the dedicated people who manage and maintain clinicaltrials.gov and d3js.org. Without the resources from these sources, this paper wouldn’t be possible.


CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Author Name: Tanmay Khole
Company: Bristol-Myers Squibb
Address: 300 Connell Drive, Berkeley Heights
City / Postcode: NJ 07922
Email: [email protected]

Brand and product names are trademarks of their respective companies.

APPENDIX

%macro sankey_nodes(inds=, outds=, nodes=, cond=);

  %local i j k cnt;

  /* Number of node variables in the pipe-delimited list */
  %let cnt = %sysfunc(countw(&nodes.,|));
  %put &cnt.;

  data _inds;
    set &inds.;
  run;

  /* Store each node variable name in single1, single2, ... */
  %do i = 1 %to &cnt.;
    %let single&i. = %scan(&nodes.,&i.,|);
    %put single&i. = &&single&i.;
  %end;

  /* One link table per adjacent pair of nodes: source = node i, target = node i+1 */
  proc sql;
    %do i = 1 %to %eval(&cnt. - 1);
      %let j = %eval(&i. + 1);
      create table &&single&i.._wt_chk as
        select distinct
               %do k = 1 %to &i.;
                 &&single&k.,
               %end;
               &&single&j.,
               &&single&i. as SOURCE length=100,
               &&single&j. as TARGET length=100,
               count(&&single&i.) as VALUE,
               " {'source':'" || strip(&&single&i.) || "','target':'" || strip(&&single&j.)
                 || "','value':" || strip(put(count(&&single&i.), 5.0)) || "}," as final length=1000
        from _inds
        /* Apply the optional filter only when cond is not blank */
        %if %length(%superq(cond)) > 0 %then %do;
          where &cond.
        %end;
        group by
               %do k = 1 %to &i.;
                 &&single&k.,
               %end;
               &&single&j.
        ;
    %end;
  quit;

  /* Stack the link tables into the output dataset */
  data &outds.;
    set
      %do i = 1 %to %eval(&cnt. - 1);
        &&single&i.._wt_chk
      %end;
    ;
  run;

  options linesize=max;
  %global sankeydata;

  /* Concatenate all link strings into one macro variable for %sankey2html */
  proc sql noprint;
    select final into :sankeydata separated by " "
      from &outds.;
  quit;

  %put &sankeydata.;

%mend sankey_nodes;

%macro sankey2html(indata=, outfl=, width=, height=, flow_num=);

  data _null_;
    file "&outfl.";
    /* put statements that write the HTML page and the D3.js Sankey script; */
    /* their quoted content is not preserved in this copy of the paper      */
  run;

%mend sankey2html;
