Using Sankey Diagram to Analyze Drug Pipeline Tanmay Khole, Bristol-Myers Squibb, Berkeley Heights NJ, USA

Paper DV03

ABSTRACT
Sankey diagrams are a specific type of flow diagram in which the width of each arrow is proportional to the flow quantity. They put visual emphasis on the major transfers or flows within a system and help locate the dominant contributions to an overall flow. This paper focuses on the drug pipeline of a sponsor and uses data from clinicaltrials.gov to analyze the number of clinical trials a sponsor has with respect to conditions, interventions, and phases. The results are visualized as Sankey diagrams that show the weight a sponsor has given to a drug or a condition across the phases of clinical trials. A drug pipeline gives an idea of the future of a company, and this paper takes a deep dive into some of its aspects using Sankey diagrams.

INTRODUCTION
This paper analyzes data from clinicaltrials.gov for a few selected clinical trial sponsors and uses that information to create Sankey diagrams. A Sankey diagram is a visualization that depicts a flow from one set of values to another. The things being connected are called nodes and the connections are called links. Sankey diagrams are best used to show a many-to-many mapping between two domains or multiple paths through a set of stages. Data from clinicaltrials.gov is a good fit: it can be used to analyze a sponsor’s drug pipeline and see which clinical conditions or interventions the sponsor focuses on at each stage of clinical trials. Data mapping, data analysis, and data visualization are used to create the Sankey diagrams displayed in this paper. Phase 1 clinical trials are excluded from data analysis and data visualization so that the flow of clinical trials in Phases 2-4 is easier to follow.
Data is obtained in CSV format from clinicaltrials.gov using the advanced search option and searching only on the sponsor field. Analysis is performed on trials with status "Active, not recruiting", "Available", "Enrolling by invitation", "Not yet recruiting", or "Recruiting".

SANKEY DIAGRAM FOR CLINICALTRIALS.GOV DATA
Data obtained from clinicaltrials.gov in CSV format has one record per trial (see Figure 1). To use it for a Sankey diagram, it needs to be processed in the following steps:
• Data Mapping
• Data Analysis
• Data Visualization

Figure 1: Data obtained from clinicaltrials.gov and imported into SAS® dataset.

The sponsors listed in Table 1 are considered in this paper for data analysis and for creating Sankey diagrams of each sponsor's on-going clinical trials. Clinical trials with the five statuses listed above are considered on-going. Only clinical trials where the sponsor is the lead sponsor are selected.

             Sponsor                 Distinct On-going        Data Extraction Date
                                     Clinical Trials Count
Sponsor 1    Bristol-Myers Squibb    250                      22NOV2019
Sponsor 2    Janssen                 126
Sponsor 3    Merck & Co.             173                      22JAN2020
Sponsor 4    Amgen                   56
Sponsor 5    Bayer                   56

Table 1: List of Sponsors

DATA MAPPING
Data mapping is essential for connecting links and nodes in Sankey diagrams. Clinical trials data obtained from clinicaltrials.gov contains multiple names for the same condition (e.g., “NSCLC”, “Non-Small Cell Lung Cancer”, or “Carcinoma, Non-Small-Cell Lung”; see Figure 2) and multiple names for the same drug/biologic compound (e.g., "Nivolumab", "Opdivo", "BMS-936558", "ONO-4538"; see Figure 3). Hence it is important to assign each condition and intervention to the correct category. As there are numerous conditions, they are mapped into high-level categories such as Solid Tumors, Cardiovascular, Leukemia & Lymphoma, etc.
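The synonym mapping described above can be sketched as a pair of lookups. This is a minimal illustration in Python, not the SAS® code used for this paper; the table contents are assumptions built from the examples in the text, and assigning Non-Small Cell Lung Cancer to Solid Tumors is an assumed category assignment. The function names `canonical` and `condition_category` are hypothetical.

```python
# Hypothetical lookup tables; a real mapping is built by inspecting each sponsor's data.
CONDITION_SYNONYMS = {
    "nsclc": "Non-Small Cell Lung Cancer",
    "non-small cell lung cancer": "Non-Small Cell Lung Cancer",
    "carcinoma, non-small-cell lung": "Non-Small Cell Lung Cancer",
}
# Assumed high-level category for the canonical condition name.
CONDITION_CATEGORY = {"Non-Small Cell Lung Cancer": "Solid Tumors"}

INTERVENTION_SYNONYMS = {
    "nivolumab": "Nivolumab",
    "opdivo": "Nivolumab",
    "bms-936558": "Nivolumab",
    "ono-4538": "Nivolumab",
}


def canonical(name, synonyms):
    """Return the canonical name for a raw value, or None if it is unmapped."""
    return synonyms.get(name.strip().lower())


def condition_category(name):
    """Map a raw condition name to its high-level category (None if unmapped)."""
    canon = canonical(name, CONDITION_SYNONYMS)
    return CONDITION_CATEGORY.get(canon) if canon else None
```

Returning None for unmapped values, rather than passing the raw text through, makes it easy to list the names that still need a manual mapping decision.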
See Figure 4 for an example of mapping different conditions to a high-level category.

Figure 2: Mapping different names of same condition into single category.

Figure 3: Mapping different names of same compound/intervention into single category.

Figure 4: Mapping different conditions to high-level category.

The following mapping rules are applied before the data analysis step. They are designed to identify the sponsor's focus with regard to clinical conditions and interventions.
• Clinical trials with multiple phases are mapped to the higher phase
• Clinical trials with multiple clinical conditions are mapped to each condition
• Clinical trials with multiple interventions are mapped to each intervention of the respective sponsor

Example 1: Clinical trial NCT03331198, titled “Study Evaluating Safety and Efficacy of JCAR017 in Subjects With Relapsed or Refractory Chronic Lymphocytic Leukemia (CLL) or Small Lymphocytic Lymphoma (SLL)”, has a trial design covering Phase 1 and Phase 2. As per the mapping rules, it is mapped to Phase 2 only. This trial also lists multiple clinical conditions, Chronic Lymphocytic Leukemia and Small Lymphocytic Lymphoma, and is mapped to each of them.

Example 2: Clinical trial NCT04088500, titled “A Study of Combination Nivolumab and Ipilimumab Retreatment in Patients With Advanced Renal Cell Carcinoma”, has multiple interventions: Nivolumab and Ipilimumab. As per the mapping rules, this trial is mapped to each intervention listed.
Example 3: Clinical trial NCT03036098, titled “Study of Nivolumab in Combination With Ipilimumab or Standard of Care Chemotherapy Compared to the Standard of Care Chemotherapy Alone in Treatment of Patients With Untreated Inoperable or Metastatic Urothelial Cancer”, has multiple interventions: nivolumab, ipilimumab, gemcitabine, cisplatin, and carboplatin. Only the first two are the sponsor’s compounds, hence this trial is mapped to two interventions: nivolumab and ipilimumab.

Data mapping for this paper is performed by creating flags/identifiers for each condition and intervention listed in the respective sponsor’s clinical trials data. Each sponsor listed in Table 1 has unique compounds, so each compound/intervention has to be mapped by closely inspecting the data. Data obtained from clinicaltrials.gov has one record per trial (horizontal data format) and needs to be transformed into a vertical data format, as shown in Figure 5, using the flags created for each condition category and intervention.

Figure 5: Horizontal data mapped and transformed into vertical data format

DATA ANALYSIS
Data analysis is performed by counting the objects that fall into each of the categories to be displayed in the Sankey diagram. The categories are used as nodes, and the object counts determine the width of the links between the selected categories. In this paper, data analysis is performed by counting the number of clinical trials with respect to sponsor, conditions, interventions, and phases. This step is performed after data mapping to ensure that links and nodes are connected correctly. The SAS® macro %sankey_nodes is used for data analysis; reference code can be found in the appendix.

    %sankey_nodes(inds  = ct_gov
                 ,outds = sankey_out
                 ,nodes = %str(sponsor|conditions|interventions|phases)
                 ,cond  = );

%sankey_nodes calculates the number of objects, in this case the number of clinical trials. The mapped data is passed in the “inds” macro parameter.
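The mapping rules and the counting behind the link widths can be sketched as follows. This is an illustration in Python under stated assumptions, not the %sankey_nodes macro, whose code is in the appendix. The field names, the "Phase 1|Phase 2" layout of the phase field, the "Sponsor" label, and the choice to count distinct trial IDs per link are all assumptions; the trial values come from Example 1.

```python
import re
from collections import defaultdict
from itertools import product


def highest_phase(phases_field):
    """Rule 1: a trial listed under several phases is assigned to the highest one."""
    numbers = [int(n) for n in re.findall(r"Phase (\d)", phases_field)]
    return f"Phase {max(numbers)}" if numbers else None


def to_vertical(trial, sponsor_compounds):
    """Rules 2 and 3: one row per condition and per sponsor-owned intervention."""
    drugs = [d for d in trial["interventions"] if d.lower() in sponsor_compounds]
    phase = highest_phase(trial["phases"])
    return [
        (trial["nct_id"], trial["sponsor"], cond, drug, phase)
        for cond, drug in product(trial["conditions"], drugs)
    ]


def count_links(rows):
    """Count distinct trials on each link between adjacent stages
    (sponsor -> condition -> intervention -> phase)."""
    trials_per_link = defaultdict(set)
    for nct_id, *stages in rows:
        for source, target in zip(stages, stages[1:]):
            trials_per_link[(source, target)].add(nct_id)
    return {link: len(ids) for link, ids in trials_per_link.items()}


# Example 1 from the text; the record layout and sponsor label are placeholders.
trial = {
    "nct_id": "NCT03331198",
    "sponsor": "Sponsor",
    "phases": "Phase 1|Phase 2",
    "conditions": ["Chronic Lymphocytic Leukemia", "Small Lymphocytic Lymphoma"],
    "interventions": ["JCAR017"],
}
rows = to_vertical(trial, sponsor_compounds={"jcar017"})
links = count_links(rows)
```

Counting distinct trial IDs rather than rows keeps a trial with two conditions from being counted twice on the intervention-to-phase link, which matches the idea that link thickness reflects the number of trials.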
The nodes (categories) to be displayed in the Sankey diagram are listed in the “nodes” macro parameter, and any condition to be applied can be given in the “cond” macro parameter. The macro creates the macro variable &sankeydata. and an output dataset containing the data for the Sankey diagram. &sankeydata. is then used in the data visualization step to create the Sankey diagram.

DATA VISUALIZATION
The data visualization step is performed using the SAS® macro %sankey2html and D3.js, a JavaScript library. The output is created in HTML format.

    %sankey2html(indata   = %nrbquote(&sankeydata.)
                ,outfl    = %sysfunc(pathname(outg,f))/sankey.html
                ,width    = 2100
                ,height   = 700
                ,flow_num = );

The %sankey2html macro reads the macro variable &sankeydata. created by %sankey_nodes and embeds it in the HTML file. The output file location and HTML filename are specified in the “outfl” macro parameter. The “width” and “height” parameters set the width and height of the Sankey diagram. The “flow_num” parameter is used to display link labels above a specified number.

The Sankey diagrams display the flow of the number of clinical trials from SPONSOR → CONDITIONS → INTERVENTIONS → PHASES, which are also the nodes of the Sankey diagrams displayed in this paper. The thickness of a link signifies the number of clinical trials connecting its nodes.

SANKEY DIAGRAM 1
Sponsor: Bristol-Myers Squibb
Node 1: Sponsor; Node 2: Clinical Conditions; Node 3: Interventions; Node 4: Clinical Trial Phases
The number of on-going clinical trials for each node is displayed in parentheses. Clinical trials with multiple phases are counted toward the higher phase; clinical trials with multiple clinical conditions are counted toward each condition; clinical trials with multiple interventions are counted toward each intervention.
Note: The data analysis and data visualization performed in this paper are not an official representation of any sponsor’s pipeline; they are based on the data acquired from clinicaltrials.gov.
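The node-and-link data behind diagrams like these can be sketched as a small conversion step. The example below is a Python illustration, not the %sankey2html macro, and the exact content of &sankeydata. is not shown in this paper. It assumes the layout commonly used with D3's Sankey plugin: a list of named nodes and a list of links whose source and target are node indices and whose value sets the link width. The function name `to_d3_sankey` and the sample counts are hypothetical.

```python
import json


def to_d3_sankey(link_counts):
    """Turn {(source_name, target_name): count} into a nodes/links structure."""
    names = []   # node names in first-seen order
    index = {}   # node name -> position in `names`
    for source, target in link_counts:
        for name in (source, target):
            if name not in index:
                index[name] = len(names)
                names.append(name)
    return {
        "nodes": [{"name": name} for name in names],
        "links": [
            {"source": index[s], "target": index[t], "value": count}
            for (s, t), count in link_counts.items()
        ],
    }


# Hypothetical counts for a two-link flow, for illustration only.
payload = to_d3_sankey({("Sponsor", "Solid Tumors"): 2, ("Solid Tumors", "Phase 3"): 1})
sankey_json = json.dumps(payload)
```

Serializing the structure to JSON is one way the counts could be embedded in an HTML page for D3.js to render.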
SANKEY DIAGRAM 2
Sponsor: Janssen
Node 1: Sponsor; Node 2: Clinical Conditions; Node 3: Interventions; Node 4: Clinical Trial Phases
The number of on-going clinical trials for each node is displayed in parentheses. Clinical trials with multiple phases are counted toward the higher
