IDENTIFICATION OF MUTATIONAL LANDSCAPES IN

AFRICAN AMERICAN TRIPLE-NEGATIVE BREAST CANCER

by

JASPREET KAUR

Submitted in partial fulfilment of the requirements for the degree of

Master of Science

Systems Biology and Bioinformatics

School of Medicine

CASE WESTERN RESERVE UNIVERSITY

May, 2018

CASE WESTERN RESERVE UNIVERSITY SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis of

JASPREET KAUR

Candidate for the degree of Master of Science

Committee Chair Dr. Vinay Varadan

Committee Member Dr. Kishore Guda

Committee Member Dr. Gurkan Bebek

Date of Defense April 3, 2018

*We also certify that written approval has been obtained

for any proprietary material contained therein

Table of Contents

List of Tables

List of Figures

Acknowledgements

Abstract

CHAPTER 1: INTRODUCTION

1.1 Triple-Negative Breast Cancer

1.2 Incidence and Mortality Rate

1.2.1 Incidence by breast cancer subtype and race

1.2.2 Mortality rates by breast cancer subtype and race

1.3 Hypothesis

CHAPTER 2: METHODS

2.1 Data Set

2.2 Technique Used and Tissue Description

2.3 Pre-Processing and Variant Calling

2.3.1 Read Alignment, Sorting and Replacing read groups

2.3.2 Mark and Remove Duplicates

2.3.3 Base Recalibration

2.3.4 Indel Realignment

2.3.5 Variant Calling

CHAPTER 3: RESULTS

3.1 Number of variant calls reported by individual callers

3.2 Concordance between somatic variant callers

3.3 Fraction of overlap calls in individual callers

3.4 Mutation Burden across two ethnicities

3.5 Mutational frequencies in Discovery & Reference dataset

3.6 Significantly Mutated Genes

CHAPTER 4: DISCUSSION

CHAPTER 5: FUTURE DIRECTIONS

CHAPTER 6: CONCLUSION

REFERENCES

List of Tables

Table 1: Breast Cancer Subtypes

Table 2: Data Sets used in study

Table 3: Computational tools used to perform preprocessing and variant calling

Table 4: Open-source computational tools for variant detection

Table 5: Mutation Frequencies of differentially mutated genes across three datasets

Table 6: Mutational Frequencies of Breast Cancer Driver genes across three datasets

Table 7: Significantly mutated genes

Table 8: Variants across significantly mutated genes

List of Figures

Figure 1: Female breast cancer incidence by Subtype and Ethnicity

Figure 2: Workflow for preprocessing sequencing data and somatic variant calling

Figure 3: Read Duplication rate in TCGA-Frozen and UH-FFPE

Figure 4: Effects of base recalibration in samples for base Substitution

Figure 5: Effect of base recalibration on base quality for Insertions

Figure 6: Effect of base recalibration on base quality for Deletions

Figure 7: Non-synonymous variants called by VarScan2 and MuTect2

Figure 8: Jaccard’s Index for non-synonymous calls in FFPE and Frozen samples

Figure 9: Fraction of overlapped calls in MuTect2 and VarScan2 in FFPE

Figure 10: Fraction of overlapped calls in MuTect2 and VarScan2 in Frozen

Figure 11: Tumor Mutation Burden across datasets

Figure 12: Differentially mutated genes across discovery dataset

Figure 13: IGV for CRIPAK gene showing frame shift insertion

Figure 14: IGV for KMT2C missense mutation

Figure 15: IGV for ANO1 gene missense mutation

Figure 16: Mutation Mapper visualization for KMT2C

Figure 17: Mutation Mapper visualization for ANO1

Acknowledgements

I feel grateful to be a part of the Case Comprehensive Cancer Center at CWRU, which would not have been possible if Dr. David J. Alouani at UHTL had not mentioned Dr. Vinay Varadan when I was searching for a genomics lab.

This work would not have been possible without the mentorship and support of Dr. Vinay Varadan. I thank him for providing me this opportunity to work in a great research environment that offers tremendous opportunities for innovation and growth as a research professional. He has provided guidance throughout my work while also allowing me to work independently the majority of the time. Many thanks are extended to my thesis committee members, Dr. Kishore Guda and Dr. Gurkan Bebek, for their thoughtful advice, time and attention, which helped me frame my thesis.

Within the Varadan Lab, Salendra Singh helped me deal with computational problems and provided constant checkpoints, emphasizing what could go wrong in any task. Dr. R Murthy, through his direct communication, helped clarify my doubts about the algorithms and statistical methods of various bioinformatics tools. Communication with Dr. S Khalighi regarding the central dogma was fun. I would like to thank all of them for their support and motivation, without which I would not have reached this far. I will miss being a part of the Varadan Lab, but I will join again if Dr. Varadan moves to Harvard!

I would like to express my gratitude to Dr. David J. Alouani for introducing me to genomics, and to Meka for his time in helping me understand HPC and Linux back in 2016 when I was struggling with it. I want to thank my Dad, who did not freak out after learning the tuition fee at CWRU, unlike my Mom, and calmly supported my decision. Finally, a big thanks to my friends, who were there as a constant support.

Identification of Mutational Landscapes in

African American Triple-Negative Breast Cancer

Abstract

by

JASPREET KAUR

African American women with Triple-Negative Breast Cancer (TNBC) exhibit poor survival rates even after controlling for socioeconomic and treatment variation, which indicates the presence of intrinsic biological differences contributing to the observed racial disparities. We analyzed somatic mutation profiles of archival Formalin-Fixed Paraffin-Embedded (FFPE) tumor tissues from a discovery dataset of 36 African American TNBC patients and compared them with publicly available control and validation datasets of 97 European American and 51 African American TNBC samples, respectively, from The Cancer Genome Atlas (TCGA). This analysis revealed 15 differentially mutated genes across the African American (AA) and European American (EA) datasets. However, the mutation frequencies of known breast cancer driver genes such as TP53 (AA=71% vs EA=75%, p=0.65) and PIK3CA (AA=17% vs EA=12.36%, p=0.56) did not differ significantly between the two ethnicities. Our results indicate the presence of novel genes that could delineate the differences in TNBC outcomes across African American and European American ethnicities.


CHAPTER 1

INTRODUCTION

Breast carcinoma, a highly heterogeneous disease, is the second most common malignancy in American women, with a lifetime risk of diagnosis close to 12.4%, or 1 in 8, as of 2017. Socioeconomic conditions, access to health care, increased life expectancy, changes in reproductive patterns, parity, obesity and increased detection through screening are listed as potential reasons behind the increase in breast cancer incidence [1].

Two distinct types of epithelial cells are found in the mammary gland: basal (and/or myoepithelial) cells and luminal epithelial cells [2]. Luminal cells lie on the apical surface of ducts and are surrounded by basal cells. Breast cancer is categorized into five intrinsic subtypes: Luminal A, Luminal B, HER2-enriched, Basal-like and Normal breast-like [3].

Breast cancer subtypes are described in Table 1 based on the presence or absence of three immunohistochemical (IHC) markers: estrogen receptor (ER), progesterone receptor (PR) and Human Epidermal Growth Factor Receptor 2 (HER2), a protein encoded by the ERBB2 gene.


Table 1: Breast Cancer Subtypes. Characterization of intrinsic breast cancer subtypes by presence (+) or absence (–) of the IHC markers HR (Hormone Receptors) and HER2 (Human Epidermal Growth Factor Receptor 2), together with the clinical features associated with these subtypes and their frequency of occurrence across breast cancer cases [4].

Subtype          Biological Markers  Distribution  Features
Luminal-A        HR+/HER2-           71%           Less aggressive; high survival rates; lower rate of relapse; responsive to hormonal therapy (e.g. Tamoxifen)
Luminal-B        HR+/HER2+           12-15%        Aggressive phenotype; low survival rates compared to Luminal-A; respond to neo-adjuvant chemotherapy
HER2 Enriched    HR-/HER2+           5-10%         More aggressive and proliferative tumors; sensitive to cytotoxic agents (e.g. Doxorubicin); improved survival outcomes with targeted therapies (e.g. Trastuzumab)
Triple-Negative  HR-/HER2-           12-15%        Show poor prognosis; lack targeted therapies


1.1 Triple-Negative Breast Cancer

Triple-Negative Breast Cancer (TNBC) is the immunohistochemical classification of breast cancer that lacks expression of ER, PR and HER2 but expresses basal myoepithelial markers at higher levels. TNBC accounts for 12-15% of all breast cancer cases and is one of the most aggressive subtypes, associated with a higher risk of distant recurrence, a high rate of metastasis, a higher probability of relapse and worse overall survival when compared to other subtypes [4]. About 75% of TNBCs are considered basal-like by gene expression microarray profiling [5]. BRCA1-related breast cancers often show a triple-negative phenotype and are positive for TP53 mutations, EGFR, Ki67 and basal cytokeratins [4]. Basal-like cancers with a deficient BRCA1 pathway may respond to poly-ADP ribose polymerase (PARP) enzyme inhibitors [4]. Due to the lack of targeted therapies and the challenges in identifying key driver genes, chemotherapy is the treatment currently available for TNBC [4].

1.2 Incidence and Mortality rate

The increase in incidence over the years results from contributing factors such as changes in reproductive patterns, parity and genetic predisposition, and largely from the increased use of mammography screening, with which cancers were diagnosed 1-3 years earlier than they would have been in the absence of screening [1]. Non-Hispanic Black (NHB) women show higher incidence rates before 40 years of age and lower rates at ages 65-84 years, but are more likely to die at every age when compared to Non-Hispanic White (NHW) women, even after improvements in treatment and early diagnosis [1]. Overall, breast cancer mortality rates had declined by 39% as of 2015, but racial differences still exist.

1.2.1 Incidence by breast cancer subtype and race

The prevalence of breast cancer differs by subtype. Figure 1 shows the incidence rate by breast cancer subtype among different ethnicities; the incidence rate of TNBC is twice as high in the Non-Hispanic Black population (24 per 100,000) as in Non-Hispanic Whites (12 per 100,000) [1].

Figure 1: Female breast cancer incidence rates per 100,000 women (y-axis) by Subtype (x-axis) and Ethnicity (colored) (2010-2014). The Triple-Negative subtype shows 2 times higher incidence in Non-Hispanic Black (NHB) women than in Non-Hispanic White (NHW) women and ~3 times higher incidence than in other races. Data derived from C. E. DeSantis et al. (2017) [1].

A high frequency of mutations in the BRCA1 tumor suppressor gene has been considered the factor promoting the disparity in incidence rates, but recent studies reported that African Americans had a lower rate of deleterious BRCA1 and BRCA2 mutations [6] but a higher rate of sequence variations in genes other than BRCA1/2 (27.9% vs 46.2% and 44.2% vs 11.5%; P<.001 for overall comparison) [7]. Of the variations found in African American women, only a few result in structural defects that block BRCA1 function, and these defects occur in women of African origin at a lower rate than in women of European descent [8]. Akinyemiju TF et al. showed no association between the incidence of TNBC and socioeconomic status [9], which is considered another contributing factor. Hence, these studies indicate that socioeconomic factors and genetic predisposition are not driving the incidence of triple-negative breast cancer in African Americans.

1.2.2 Mortality rates by breast cancer and its subtypes

Overall breast cancer death rates vary significantly by breast cancer subtype and race. Ademuyiwa FO et al. (2017) showed that African Americans with breast cancer had a shorter time to progression (hazard ratio 1.5, p-value 0.012) as well as worse disease-free survival than Caucasians [10]. They also concluded that African American women with basal-like breast cancer showed worse survival than European American women with basal-like breast cancer.

Socioeconomic conditions were thought to be the factor associated with this racial disparity, but several studies have shown that even after controlling for socioeconomic conditions, African American ethnicity is associated with poor prognosis for breast cancer and specifically for its TNBC subtype.


A meta-analysis of 14 studies involving over 10,000 African American patients and 40,000 White American patients, conducted by LA Newman et al., showed that the mortality odds ratio after controlling for socioeconomic status was 1.27 (95% C.I., 1.17-1.38). They also reported subset meta-analyses of studies based on income or insurance data and on access to health care, in which the mortality odds ratios were 1.23 (95% C.I., 1.14–1.34) and 1.35 (95% C.I., 1.00–1.82), respectively. Hence, they concluded that African American ethnicity is an independent predictor of worse breast cancer outcomes [11].

For Triple-Negative Breast Cancer compared to other subtypes of breast cancer, Bauer et al. (2007) [12], Lund et al. (2009) [13] and Dietz et al. (2015) [14] reported mortality odds ratios of 1 and 1.12 in African American patients with high and low socioeconomic status, respectively. African American TNBC survival outcomes are poor compared to Luminal-A or to European American TNBC [14].

The above-mentioned studies concluded that the survival rate of Triple-Negative Breast Cancer is worse even after adjusting for socioeconomic status when compared by race and breast cancer subtype, suggesting that other genetic and epigenetic mechanisms promote poor prognosis in the African American population.


1.3 Hypothesis

The studies discussed above suggest that socioeconomic conditions and/or healthcare access are insufficient to explain outcome disparities in African American (AA) vs European American (EA) TNBCs. On this basis, we hypothesize that intrinsic biological differences other than germline events exist and contribute to the observed disparities between AA and EA TNBCs.

Therefore, we examined the somatic mutation profiles of triple-negative breast cancers arising in African American and European American patients to determine:

● Whether tumor mutation burden differs significantly between the two ethnicities

● Whether the frequency of mutated genes differs significantly in AA compared to EA


CHAPTER 2

METHODS

In this chapter, we describe the steps applied to observe differences in the mutational profiles of the two ethnicities and highlight the impact of preprocessing steps on downstream variant calling.

2.1 Data Set

We accrued a unique cohort of archival Formalin-Fixed Paraffin-Embedded (FFPE) tumor tissues from 36 AA TNBC patients (discovery dataset) from University Hospitals, Cleveland (OH). For comparative analysis of mutational landscapes in AA and EA TNBCs, we integrated 97 EA TNBC frozen samples (control dataset) and 51 AA TNBC frozen samples (validation dataset) out of 1097 total breast cancer samples reported through The Cancer Genome Atlas (TCGA) via the Genomic Data Commons (GDC) Data Portal, as shown in Table 2.

The clinical information about the samples was gathered using TCGAbiolinks [15], an R package. Samples with negative IHC status for ER, PR and HER2 were downloaded from the GDC Data Portal. Samples that had not previously been assigned HER2-negative status were added to our list based on a recent study by Huo D et al. (2017) [16].


Table 2: Data Sets used in this study. The table lists dataset names, tumor tissue type, ethnicity, number of samples and location of accrual.

DATASET TISSUE ETHNICITY # SAMPLES LOCATION

Discovery FFPE African American (AA) 36 Cleveland, OH

Validation Frozen African American (AA) 51 TCGA

Control Frozen European American (EA) 97 TCGA

2.2 Technique Used and Tissue Description

Whole exome sequencing was performed on the discovery cohort of archival Formalin-Fixed Paraffin-Embedded (FFPE) tumor-matched normal tissues. The platform used was the Illumina HiSeq 2500, generating paired-end reads of ~150 bp. This strategy targets the protein-coding regions of the genome, which account for <2% of the human genome but contain 85% of all mutations associated with complex genetic disorders. The paired-end approach provides accurate read alignment, comprehensive coverage of coding regions and the ability to detect insertions and deletions (not possible with single-end reads). The discovery dataset contained on average 109 million reads per tumor-matched normal sample, with an average coverage depth of 151X. The percentage of sequenced bases aligning to the exome was 85%, suggesting strong target enrichment. Similarly, whole exome sequencing data was downloaded for Frozen samples from the TCGA data portal.


Formalin fixation and paraffin embedding preserves fragile structures within and between the cells in the tissue, which makes FFPE tissue a great source for morphology and IHC analyses but brings certain disadvantages for molecular analyses [17]. Compared to DNA isolated from frozen samples, FFPE DNA exhibits a high frequency of non-reproducible sequence alterations due to the formation of formalin-cytosine crosslinks on either strand, as a result of which PCR creates artificial C>T or G>A mutations [17]. Unrecognized errors such as duplicate reads, false variants and inaccurate base qualities affect downstream analyses. Hence, to overcome these issues, we optimized the workflow for detecting somatic mutations shown in Figure 2.
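As a rough illustration of how the C>T / G>A artifact signature described above could be monitored, the following minimal sketch computes the fraction of single-nucleotide calls that fall into those two substitution classes. The function name and the `(ref, alt)` input format are hypothetical and are not part of the pipeline described in this thesis.

```python
from collections import Counter

# Substitution classes commonly attributed to formalin-induced cytosine damage.
FFPE_LIKE = {("C", "T"), ("G", "A")}


def ffpe_like_fraction(snvs):
    """Return the fraction of SNVs that are C>T or G>A.

    `snvs` is an iterable of (ref, alt) single-base pairs (hypothetical format).
    """
    counts = Counter((ref.upper(), alt.upper()) for ref, alt in snvs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for pair, n in counts.items() if pair in FFPE_LIKE)
    return hits / total


# Toy example: three of the four calls are C>T or G>A.
calls = [("C", "T"), ("G", "A"), ("A", "G"), ("C", "T")]
print(ffpe_like_fraction(calls))
```

A markedly higher fraction in FFPE samples than in frozen samples would be consistent with fixation artifacts, although this simple ratio alone cannot distinguish artifacts from genuine C>T mutations.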

2.3 Pre-Processing and Variant Calling

The data obtained from sequencing runs is fragmented, incoherent and subject to sequencing errors. Therefore, step-by-step preprocessing is required before calling variants, as shown in Figure 2. The processes were chained in a pipeline written in Python, which on execution allocates resources on a high performance computing (HPC) cluster with SLURM as its job scheduler. Careful use of a few parameters (discussed later) available in each tool, adjustments to memory and CPU requirements, and stringent filters on coverage and variant allele count made the pipeline more reliable and robust.


Figure 2: Workflow for preprocessing sequencing data and somatic variant calling. FFPE and Frozen samples were processed using same pipeline to ensure no bias in preprocessing of all samples. Frozen samples underwent an additional step of converting BAM files (aligned to GRCh38) to FASTQ format so as to realign all samples to GRCh37 reference.


Table 3: Computational tools used to perform preprocessing and variant calling.

Toolkit                        Tool Used               Function
SAMtools                       view                    Prints and views all alignments
SAMtools                       mpileup                 Groups alignment records in pileup format
BEDtools                       genomecov               Reports per-base coverage in a given genome
BEDtools                       intersect               Finds overlap between two sets of genomic features
Picard Tools                   SamToFastq              Converts SAM/BAM to FASTQ format
Picard Tools                   SortSam                 Sorts SAM file by coordinate
Picard Tools                   AddOrReplaceReadGroups  Adds or replaces read groups in a BAM file
Picard Tools                   MarkDuplicates          Marks and removes duplicates in BAM
Picard Tools                   CalculateHsMetrics      Collects metrics specific to sequence data generated through hybrid selection
Genome Analysis ToolKit (GATK) BaseRecalibrator        Calibrates reported base quality
Genome Analysis ToolKit (GATK) PrintReads              Prints reads in a BAM file
Genome Analysis ToolKit (GATK) RealignerTargetCreator  Creates target intervals for local realignment
Genome Analysis ToolKit (GATK) IndelRealigner          Performs local realignment around indels
Genome Analysis ToolKit (GATK) MuTect2                 Somatic variant caller
Genome Analysis ToolKit (GATK) AnalyzeCovariates       Provides covariates during recalibration
VarScan2                       somatic                 Somatic variant caller

2.3.1 Read Alignment, Sorting and Replacing read groups

The Burrows-Wheeler Aligner Maximal Exact Matches (BWA-MEM) algorithm [18] was used to align all three datasets because it supports reads up to 1 Mbp in length and is fast compared to the BWA-backtrack algorithm. Next-generation sequencing generates short reads (~200 bp) without any positional information, which creates a mappability problem. Aligners map these reads to the corresponding region of the reference genome while tolerating a certain amount of mismatch to allow subsequent variant detection.


Whole exome sequencing data for the validation and control datasets from TCGA, downloaded in Binary Alignment Map (BAM) format, was originally aligned to the GRCh38 reference genome. These files were therefore first converted from BAM to FASTQ format and realigned to HG19/GRCh37 using the BWA-MEM algorithm.

The result of alignment is a Sequence Alignment Map (SAM) file, which serves as input to the second preprocessing step. Aligners output reads in the order in which they appear in the FASTQ files, which is effectively random with respect to genomic position; hence, to sort the SAM file in genomic order, we used SortSam from the Picard Tools toolkit. SortSam orders reads by genomic coordinate on each chromosome and generates a sorted Binary Alignment Map (BAM) file along with a BAI index.

A set of reads generated from a single run of a sequencing instrument is termed a read group. Certain tools require read group information to be present in their input files and fail with errors if this requirement is not satisfied. For our study, the platform (RGPL) was Illumina and samples (RGSM) were named either Tumor or Normal. The tool used was AddOrReplaceReadGroups from Picard Tools, and the output of this step was a BAM file.


As mentioned above, FFPE samples are being used to evaluate exome sequencing profiles. We anticipated a significant amount of artifacts due to the chemistry of FFPE samples, which might hamper the quality of downstream analysis, so we asked:

• Whether FFPE samples show a higher duplication rate than frozen samples

• Whether the reported base qualities are accurate

These questions were addressed by marking duplicates and recalibrating reported base qualities in both the FFPE and Frozen datasets.

2.3.2 Mark and Remove Duplicates

Reads sampled from the same DNA template and having the same unclipped alignment start and end are termed duplicate reads. During Polymerase Chain Reaction (PCR), if more than one copy of a fragment hybridizes to different locations on the flowcell lawn, the sequencer ends up reading the same DNA in two wells; these repeats are termed PCR duplicates. Optical duplicates, on the other hand, arise from a single amplification cluster incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. The tool used for marking and removing duplicates was MarkDuplicates from Picard Tools. This tool ranks reads by the sum of their base quality scores to differentiate between primary and duplicate reads.
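The ranking idea described above can be sketched in a few lines. This is a simplified illustration, not Picard's implementation: reads are represented as plain dictionaries with hypothetical keys, and including the strand in the grouping key is an assumption of this sketch.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Each read is a dict with hypothetical keys: 'name', 'chrom',
    'start' and 'end' (unclipped alignment coordinates), 'strand',
    and 'quals' (list of per-base Phred scores).
    """
    groups = defaultdict(list)
    for read in reads:
        # Reads sharing the same unclipped start/end are duplicate candidates.
        key = (read["chrom"], read["start"], read["end"], read["strand"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        # The read with the highest summed base quality is kept as primary.
        primary = max(group, key=lambda r: sum(r["quals"]))
        duplicates.update(r["name"] for r in group if r is not primary)
    return duplicates


reads = [
    {"name": "r1", "chrom": "chr1", "start": 100, "end": 250, "strand": "+", "quals": [30, 30, 20]},
    {"name": "r2", "chrom": "chr1", "start": 100, "end": 250, "strand": "+", "quals": [35, 35, 35]},
    {"name": "r3", "chrom": "chr1", "start": 400, "end": 550, "strand": "+", "quals": [25, 25, 25]},
]
print(sorted(mark_duplicates(reads)))  # ['r1']
```

Here r1 and r2 share the same coordinates, so the lower-quality r1 is flagged while r2 is retained as the primary read.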


Figure 3: Read duplication rate in TCGA-Frozen (EA+AA) and UH-FFPE samples. Beeswarm plot showing datasets on the x-axis and percentage read duplication on the y-axis. The dark black line in the middle indicates the median percentage read duplication, which is ~15.9% for FFPE and ~12.6% for Frozen samples. Student’s t-test indicates a significantly higher duplication rate in FFPE samples (p < 0.003).

Comparing the duplication rates in both sample types, we concluded that FFPE samples, as expected, show a significantly (p<0.003) higher percentage of duplicates (~16%) than Frozen samples (~13%), as shown in Figure 3.

2.3.3 Base Recalibration

Base recalibration is a data preprocessing step that detects errors in the base quality score assigned to each called base. These quality scores express how confident the machine is in calling a particular base at a given position; for instance, a quality score of 20 on the Phred scale means that the sequencer is 99% sure that the base is called correctly.

Phred quality score is logarithmically related to the error probability (PE) and calculated as:

Q = -10 log10 (PE)
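The relationship above can be checked numerically with a short sketch; the function names are illustrative only.

```python
import math


def phred_from_error(p_error):
    """Convert an error probability (0 < p_error <= 1) to a Phred quality score."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_from_phred(q):
    """Convert a Phred quality score back to an error probability."""
    return 10.0 ** (-q / 10.0)


# A 1% error probability corresponds to Q20, i.e. 99% confidence in the call.
q20 = phred_from_error(0.01)
# Q30 corresponds to an error probability of one in a thousand.
p30 = error_from_phred(30)
```

Each increase of 10 in Q therefore corresponds to a tenfold decrease in the probability that the base call is wrong.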

Bases called by machines are subject to chemistry errors during sequencing by synthesis, random noise captured by the sequencer, or systematic bias, producing over- or under-estimated base quality scores, which introduce more false positive or false negative calls into the resulting data.

Here we asked how accurate the reported base qualities were for FFPE samples before calling variants.

The base quality graph is reported for three events: base substitutions, insertions and deletions, in Figure 4, Figure 5 and Figure 6 respectively. The accuracy of a quality score is the difference between the empirical quality and the reported quality of the base at a position. The empirical quality is derived from the ratio of mismatches to total bases within a bin defined by read group, machine cycle and dinucleotide pattern.

Empirical quality = -10 log10 (No of mismatches/Total bases defined by cycle)

Accuracy = Empirical Quality – Reported Quality

Confidence in the bases called by the sequencer increases as the difference between Empirical and Reported quality approaches 0.
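A minimal numerical sketch of the two formulas above is given below. It follows the definitions in the text and is not GATK's implementation; the floor of one mismatch is an assumption of this sketch, used only to avoid taking the logarithm of zero.

```python
import math


def empirical_quality(mismatches, total_bases):
    """Empirical Phred quality for one covariate bin (e.g. one machine cycle)."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    # Assumption of this sketch: treat zero mismatches as one to avoid log10(0).
    error_rate = max(mismatches, 1) / total_bases
    return -10.0 * math.log10(error_rate)


def accuracy(mismatches, total_bases, reported_quality):
    """Accuracy = Empirical Quality - Reported Quality."""
    return empirical_quality(mismatches, total_bases) - reported_quality


# 10 mismatches in 10,000 bases is an empirical error rate of 0.001 (about Q30).
# If the sequencer reported Q35 for this bin, the accuracy is about -5,
# meaning the reported quality was over-estimated.
delta = accuracy(10, 10_000, 35.0)
```

Negative values indicate over-reported qualities, positive values indicate under-reported qualities, and values near 0 indicate well-calibrated scores.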


Figure 4: Effect of base recalibration on base quality in samples for base substitution. The x-axis shows the machine cycle, defined as the position of each base in a read, and the y-axis shows accuracy, stated as the difference between Empirical and Reported quality scores. (A) FFPE samples: low accuracy is observed in the quality scores of preceding and succeeding bases at a particular cycle before recalibration compared with after recalibration. (B) Frozen samples: little difference is observed in the quality scores of preceding and succeeding bases at a particular cycle before and after recalibration. All samples were considered.

According to the graph, reported base qualities in FFPE samples tend to have lower accuracy than reported base qualities in Frozen samples. After recalibration, the accuracy for substitutions in FFPE samples increases, raising confidence that the base called at a given position in a read is true.


Figure 5: Effect of base recalibration on base quality in samples for insertions. The x-axis shows the machine cycle, defined as the position of each base in a read, and the y-axis shows accuracy, stated as the difference between Empirical and Reported quality scores. For insertions, bases at a particular cycle had low reported quality scores before recalibration, which raises the accuracy value and indicates errors in the reported scores, in both (A) FFPE samples and (B) Frozen samples. Adjusting the scores through recalibration increases confidence in the bases called.

The graph in Figure 5 shows a similar trend in base quality scores before recalibration in FFPE and Frozen samples, indicating low reported base quality scores for insertions, whereas Figure 6 shows high reported base quality scores for deletions, irrespective of tissue type.


Figure 6: Effect of base recalibration on base quality in samples for deletions. The x-axis shows the machine cycle, defined as the position of each base in a read, and the y-axis shows accuracy, stated as the difference between Empirical and Reported quality scores. For deletions, bases at a particular cycle had high reported quality scores before recalibration, which lowers the accuracy value and indicates errors in the reported scores, in both (A) FFPE samples and (B) Frozen samples. Adjusting the scores through recalibration increases confidence in the bases called.

The trend in accuracy of reported base qualities for indels is similar for FFPE and Frozen samples, indicating that the sequencer tends to under-report base quality for insertions and over-report it for deletions.

This analysis shows that FFPE samples are more prone to artifacts such as duplication errors and reported base quality errors, which bias downstream variant analysis by reporting false variants. Sequencing data in general is prone to these errors, but their extent varies with the tissue type being sequenced; hence these steps are highly advisable to control for biases in downstream analysis.

2.3.4 Indel realignment

A large percentage of the regions requiring local realignment arise from the presence of insertions or deletions (indels). Such alignment artifacts result in many bases mismatching the reference near the misalignment, which are easily mistaken for SNPs. Local realignment therefore transforms regions with indel-induced misalignments into cleanly aligned reads, further improving the accuracy of variant calling.

2.3.5 Variant Calling

Detection of germline and somatic variants from whole exome and genome sequencing relies on tools based on a statistical score or a probability test, along with simple heuristic methods, as listed in Table 4.

True mutations at low allelic frequencies can be masked by sequencing chemistry errors, tumor heterogeneity and sub-clonality within the cancer population. Hence, a variant caller with high sensitivity and specificity for a true variant signal, even at low allelic fractions, is essential [19].

Each variant caller listed has its own strengths and limitations. Therefore, a multi-caller strategy is generally applied, on the basis that a candidate event called by several independent algorithms is significantly less likely to be a false positive than one called by a single caller.


Table 4: Open-source computational tools for variant detection. MuTect2 and VarScan2 were used based on the high performance scores these callers obtained in the mutation calling challenge conducted by The Cancer Genome Atlas [21].

Tools          Analysis             Synopsis                Used (Y/N)
MuTect2        SNV/indel detection  Bayesian classifiers    Y
VarScan2       SNV/indel detection  Fisher’s Exact & FDR    Y
Somatic Sniper SNV/indel detection  Bayesian probability    N
Strelka        SNV/indel detection  Bayesian probability    N
SNVMix         SNV detection        Binomial mixture model  N

The ICGC-TCGA somatic mutation calling challenge [20] tested various callers for accuracy and low false positive rate and concluded that, with increasing tumor cellularity and sub-clonality in sample set 3, MuTect from the Broad Institute and VarScan from Washington University achieved high F-scores of 0.949 and 0.941, respectively, based on a low false positive rate.
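For readers unfamiliar with the metric, the sketch below shows how an F-score is computed from counts of true positive, false positive and false negative calls. It assumes the usual F1 definition (harmonic mean of precision and recall); the exact scoring used by the challenge is not restated here, and the counts in the example are made up for illustration.

```python
def f_score(true_pos, false_pos, false_neg):
    """F1 score: harmonic mean of precision and recall."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 90 true calls recovered, 5 false calls, 10 missed.
score = f_score(true_pos=90, false_pos=5, false_neg=10)
```

Because precision falls as false positives rise, a caller with a low false positive rate can achieve a high F-score provided it does not miss too many true variants.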

Two variant calling algorithms, MuTect2 and VarScan2, were used for this analysis. Even after controlling for artifacts such as read duplication and base quality errors, challenges such as tumor coverage, tumor allele frequency, tumor subclonality and heterogeneity remain when calling mutations, emphasizing the need for stringent filters and a multi-caller strategy to reduce the rate of false positives and thereby increase the accuracy of downstream analysis.


CHAPTER 3

RESULTS

We performed whole exome sequencing on the discovery dataset comprising 36 AA FFPE TNBC samples. The publicly available control dataset of 97 EA TNBC samples and validation dataset of 51 AA TNBC samples were downloaded from TCGA. These sample sets underwent a series of data preprocessing steps and variant calling with MuTect2 and VarScan2, as detailed in Figure 2 of Chapter 2. The resulting somatic mutational profiles of AA and EA triple-negative breast cancer were analyzed as outlined in this chapter.

3.1 Number of variant calls reported by individual callers

The variant calls from both callers were obtained in VCF format for FFPE and Frozen samples. Figure 7 displays the number of non-synonymous variants called by both callers at a tumor read depth of 30X and a tumor allele read count of 2. In FFPE samples, VarScan2 reported a significantly higher number of calls than MuTect2 (p << 0.001, 95% C.I.), whereas in TCGA-Frozen samples the numbers of variants called by VarScan2 and MuTect2 did not differ significantly, with median calls of 51 and 58 respectively. Thus, for analytically challenging FFPE samples, VarScan2 tends to call more variants, raising concern about false positives interfering with downstream analysis.
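The depth and allele-count criteria mentioned above can be expressed as a simple filter. This sketch assumes the stated values act as minimum thresholds and that each call has already been parsed from the VCF into a dictionary; the field names are hypothetical.

```python
def passes_filters(call, min_tumor_depth=30, min_alt_reads=2):
    """Keep a call only if tumor depth and variant-supporting reads meet the minimums.

    `call` is a dict with hypothetical keys 'tumor_depth' and 'tumor_alt_reads'.
    """
    return (
        call["tumor_depth"] >= min_tumor_depth
        and call["tumor_alt_reads"] >= min_alt_reads
    )


calls = [
    {"gene": "TP53", "tumor_depth": 85, "tumor_alt_reads": 12},
    {"gene": "PIK3CA", "tumor_depth": 22, "tumor_alt_reads": 6},   # fails depth
    {"gene": "KMT2C", "tumor_depth": 140, "tumor_alt_reads": 1},   # fails alt reads
]
kept = [c["gene"] for c in calls if passes_filters(c)]
print(kept)  # ['TP53']
```

Applying the same thresholds to the output of both callers keeps the comparison of call counts between them consistent.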


Figure 7: Number of non-synonymous variants called by VarScan2 and MuTect2 in FFPE and Frozen samples at tumor read depth 30 and tumor allele count 2. The black line in the middle of each boxplot shows the median number of variant calls. Student’s t-test at 95% CI indicates a significantly higher number of VarScan2 calls (p<<0.001) for FFPE samples; no difference was observed for Frozen samples (p=0.12).


3.2 Concordance between somatic variant callers

Jaccard’s index, a statistical measure of similarity between two data sets that ranges from 0% to 100%, was used to quantify the percentage of calls made by both callers. Jaccard’s index was measured at different allele frequencies and tumor read depths to identify the combination at which the number of overlapping calls is greatest. A multi-caller strategy was applied to retain high-confidence calls, on the basis that a candidate event called by several independent algorithms is significantly less likely to be a false positive than one called by a single caller. Given the high number of VarScan2 calls, we anticipated that the overlap between the two callers would exceed 80%, but the results showed that far fewer calls overlapped between the two callers. The same trend was observed in TCGA-Frozen samples. Figure 8 shows Jaccard’s index (y-axis) for non-synonymous calls in FFPE and Frozen samples at different coverage (x-axis) and tumor allele frequencies (represented by different colors).

Jaccard’s Index = ( |A∩B| ) / ( |A| + |B| - |A∩B| )
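As a minimal illustration of the index defined above, the following Python sketch computes it for two toy call sets. The variant key layout (chromosome, position, reference allele, alternate allele), the function names and the example calls are assumptions made for illustration; they are not taken from the study data or pipeline.

```python
def variant_key(chrom, pos, ref, alt):
    """Identify a variant by locus and allele change (hypothetical key layout)."""
    return (chrom, pos, ref, alt)


def jaccard_index(calls_a, calls_b):
    """Return |A ∩ B| / (|A| + |B| - |A ∩ B|) for two collections of variant keys."""
    a, b = set(calls_a), set(calls_b)
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


# Toy call sets for two callers; not real calls from either tool.
mutect2_calls = {
    variant_key("chr1", 100, "C", "T"),
    variant_key("chr2", 200, "G", "A"),
    variant_key("chr3", 300, "A", "G"),
}
varscan2_calls = {
    variant_key("chr1", 100, "C", "T"),
    variant_key("chr2", 200, "G", "A"),
    variant_key("chr4", 400, "T", "C"),
    variant_key("chr5", 500, "G", "C"),
    variant_key("chr6", 600, "A", "T"),
}

jaccard = jaccard_index(mutect2_calls, varscan2_calls)
print(f"Jaccard index: {jaccard:.1%}")
```

In this toy case two calls are shared out of six distinct calls, so the index is one third. In practice the call sets would first be restricted to a given tumor read depth and allele frequency before the index is computed.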


Figure 8: Jaccard’s index for non-synonymous calls in FFPE and Frozen samples. Jaccard’s index does not increase beyond 30X coverage in FFPE samples (A), whereas it remains constant across all allele frequencies and coverage levels in Frozen samples (B). This indicates low concordance between MuTect2 and VarScan2 for both FFPE and Frozen samples.

The fraction of calls found by MuTect2 that were also reported by VarScan2 was close to 17% at 30X tumor read depth and 7% allele frequency, showing a low percentage of overlap between the two somatic mutation callers. Low concordance between VarScan and MuTect2 (20-30%) was also reported by Cai L. et al. [19] and Krøigård AB. et al. [21].


3.3 Fraction of overlap calls in individual callers

As the concordance between MuTect2 and VarScan2 was relatively low in both data sets, we chose to move forward with the results of a single caller. To select that caller, we looked at the fraction of each caller’s calls that overlapped with the other caller. The caller showing the higher fraction of overlapping calls in both data sets was chosen for further analysis. For instance, if A is the set of MuTect2 calls and B the set of VarScan2 calls, then the fraction defined here is the proportion of MuTect2 calls that were also called by VarScan2; exchanging A and B gives the corresponding fraction for VarScan2.

Fraction = (|A∩B|) / (|A|)
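The asymmetry of this fraction is easy to see in a small Python sketch. The variant identifiers, set sizes and function name are hypothetical and chosen only to show that the same shared calls can be a large share of one caller’s output and a small share of the other’s.

```python
def overlap_fraction(calls_a, calls_b):
    """Return |A ∩ B| / |A|: the share of caller A's calls also made by caller B."""
    a, b = set(calls_a), set(calls_b)
    return len(a & b) / len(a) if a else 0.0


# Toy data: caller A makes 4 calls, caller B makes 15, and 3 calls are shared.
mutect2_calls = {"chr1:100:C>T", "chr2:200:G>A", "chr3:300:A>G", "chr7:700:C>A"}
shared_calls = mutect2_calls - {"chr7:700:C>A"}
varscan2_only = {f"chr9:{pos}:G>T" for pos in range(12)}
varscan2_calls = shared_calls | varscan2_only

fraction_mutect2 = overlap_fraction(mutect2_calls, varscan2_calls)
fraction_varscan2 = overlap_fraction(varscan2_calls, mutect2_calls)
print(f"Shared share of MuTect2 calls:  {fraction_mutect2:.0%}")
print(f"Shared share of VarScan2 calls: {fraction_varscan2:.0%}")
```

Because the denominator is the size of one caller’s own call set, a caller that reports many more variants will show a lower fraction even when the shared calls are identical.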

Figure 9: Fraction of overlapping calls in MuTect2 and VarScan2 for FFPE samples; Fraction (y-axis) and Coverage - tumor read depth (x-axis). (A) Fraction of VarScan2 calls that were also called by MuTect2. (B) Fraction of MuTect2 calls that were also called by VarScan2. MuTect2 shows a higher fraction of overlapping calls than VarScan2.


Panels (A) and (B) in Figure 9 show the fraction of overlapping calls for MuTect2 (70-75%) and VarScan2 (17-20%) in FFPE samples, respectively. The results indicate a higher fraction of overlapping calls for MuTect2 in FFPE samples.

Figure 10: Fraction of overlapping calls in MuTect2 and VarScan2 for Frozen samples; Fraction (y-axis) and Coverage - tumor read depth (x-axis). (A) Fraction of VarScan2 calls that were also called by MuTect2. (B) Fraction of MuTect2 calls that were also called by VarScan2. MuTect2 shows a higher fraction of overlapping calls than VarScan2.


Similarly, for Frozen samples, panels (A) and (B) in Figure 10 show the fraction of overlapping calls for MuTect2 (75-80%) and VarScan2 (~40%), respectively. The fractions obtained indicate a higher fraction of overlapping calls for MuTect2 in Frozen samples.

This analysis illustrates that the fraction of overlapping calls was higher for MuTect2 in both FFPE and Frozen samples; therefore, results from MuTect2 were used for further analysis.

3.4 Mutation distribution across the two ethnicities

Tumor mutational burden (TMB) for our analysis was defined as the number of somatic, non-synonymous base substitution and indel mutations per megabase of genome examined.

All base substitution and indel variants remaining after applying stringent thresholds on tumor read depth (30) and number of reads supporting the tumor allele (2) were examined. To calculate the tumor mutational burden of each sample, the total number of variants was divided by the number of positions showing 30X tumor read depth in that particular sample.

TMBsample = (Number of variants after filtering / Positions in sample showing 30x Tumor read depth)*1000000
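The calculation can be sketched in Python as follows. The thresholds (tumor read depth 30, tumor allele read count 2) and the hypermutation cut-off of 10 mutations per megabase come from the text; the `Variant` record, the function names and the example values are assumptions made for illustration only.

```python
from dataclasses import dataclass

MIN_TUMOR_DEPTH = 30       # tumor read depth threshold stated in the text
MIN_TUMOR_ALT_READS = 2    # reads supporting the tumor allele, stated in the text
HYPERMUTATED_TMB = 10.0    # mutations per megabase, cut-off stated in the text


@dataclass
class Variant:
    """Hypothetical minimal record for one somatic call."""
    tumor_depth: int
    tumor_alt_reads: int
    non_synonymous: bool


def passes_filters(variant):
    """Keep non-synonymous calls that meet both read-support thresholds."""
    return (variant.non_synonymous
            and variant.tumor_depth >= MIN_TUMOR_DEPTH
            and variant.tumor_alt_reads >= MIN_TUMOR_ALT_READS)


def tumor_mutational_burden(variants, positions_at_30x):
    """Filtered variant count per megabase of positions covered at 30X."""
    if positions_at_30x <= 0:
        raise ValueError("positions_at_30x must be positive")
    kept = sum(1 for v in variants if passes_filters(v))
    return kept / positions_at_30x * 1_000_000


# Toy sample: 30 calls pass, 3 are removed by the filters.
sample_variants = (
    [Variant(45, 6, True)] * 30
    + [Variant(12, 5, True)]     # fails the depth threshold
    + [Variant(60, 1, True)]     # fails the allele read count threshold
    + [Variant(80, 9, False)]    # synonymous, not counted
)
tmb = tumor_mutational_burden(sample_variants, positions_at_30x=30_000_000)
is_hypermutated = tmb > HYPERMUTATED_TMB
print(f"TMB = {tmb:.2f} mutations/Mb, hypermutated: {is_hypermutated}")
```

Using the per-sample count of well-covered positions as the denominator, rather than a fixed exome size, keeps samples with uneven coverage comparable.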

An analysis of 100,000 human cancer genomes, a study conducted in 2017 [22], reported the tumor mutational burden of breast carcinoma to be around 3 mutations/Mb.

The hypothesis of a significant difference in mutational distribution is addressed in Figure 11, which plots tumor mutational burden across the three data sets: the Discovery dataset (UH-FFPE-AA), the Control dataset (TCGA-EA) and the Validation dataset (TCGA-AA). Four samples from the control dataset and one sample from the discovery dataset showing TMB > 10 mutations/Mb were categorized as hypermutated and excluded from further analysis.

Figure 11: Tumor mutational burden across the three data sets. Student’s t-test showed no significant difference in TMB between the Validation Set (TCGA-AA) and the Control Set (TCGA-EA) or between the Discovery Set (UH-FFPE-AA) and the Control Set (TCGA-EA), with p-values of 0.11 and 0.95, respectively.

No significant difference was observed in triple-negative breast cancer tumor mutational burden when stratified by race. The mean TMB was 1.02 mutations/Mb across UH-FFPE samples, 0.85 mutations/Mb across TCGA-AA samples and 1.08 mutations/Mb across TCGA-EA samples.


Based on this analysis, we did not observe any significant difference in mutational distribution between the two ethnic groups with triple-negative breast cancer.

3.5 Mutational frequencies in Discovery & Control dataset

Having found that the rate of mutations per million bases is similar irrespective of race, we next asked whether any genes are differentially mutated between the two datasets (discovery and control) and, of these genes, how many show a significantly higher mutation frequency.

Figure 12: Differentially mutated genes across the discovery dataset. The x-axis shows the mutated genes and the y-axis the logarithm of the probability of a difference in mutation frequency between the discovery and control datasets, 95% CI. Fisher’s exact test was used to assess significance at alpha = 5%. Genes highlighted in yellow had p-value < 0.05, and genes highlighted in black showed no significant difference (p > 0.05) in the percentage of samples mutated.


Figure 12 shows differentially mutated genes (x-axis) and the logarithm of the probability of a difference in mutation frequency between the discovery and control datasets (y-axis). Breast cancer driver genes (TP53, PIK3CA, PIK3R1, PTEN, MAP3K1), highlighted in black with nominal p-value > 0.05, showed no difference in mutational frequency between the two datasets, whereas candidate genes with nominal p-value < 0.05, highlighted in yellow, showed significant differences in mutation frequency between the discovery and control datasets. Mutation frequency was calculated as the ratio of mutated samples to total samples in a dataset; two mutations in the same gene in the same sample were counted as one. Significance of differentially mutated genes in the discovery dataset was calculated using a two-tailed Fisher’s exact test at 95% CI.
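The per-gene frequency and test described above can be sketched in Python with the standard library alone. The counts used here (5 of 35 discovery samples and 0 of 89 control samples mutated in ANO1) are an assumption back-calculated from the percentages reported for ANO1, and the call list, sample identifiers and function names are hypothetical; the p-value convention (summing all tables no more probable than the observed one) is one common definition of the two-tailed test and may differ from the software used in the study.

```python
from math import comb


def mutated_samples(calls, gene):
    """Samples carrying at least one mutation in `gene` (repeats count once)."""
    return {sample for sample, g in calls if g == gene}


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / denom
             for x in range(lo, hi + 1)}
    cutoff = probs[a] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(p for p in probs.values() if p <= cutoff))


# Hypothetical call list of (sample ID, gene); sample D1 carries two ANO1 variants.
discovery_calls = [("D1", "ANO1"), ("D1", "ANO1"), ("D2", "ANO1"),
                   ("D3", "ANO1"), ("D4", "ANO1"), ("D5", "ANO1"),
                   ("D6", "TP53")]
n_discovery, n_control = 35, 89   # assumed denominators, see the note above
control_mutated = 0               # assumed: no control sample mutated

discovery_mutated = len(mutated_samples(discovery_calls, "ANO1"))
frequency = discovery_mutated / n_discovery
p_value = fisher_exact_two_sided(
    discovery_mutated, n_discovery - discovery_mutated,
    control_mutated, n_control - control_mutated,
)
print(f"ANO1: {frequency:.2%} of discovery samples mutated, p = {p_value:.4f}")
```

Collecting sample identifiers in a set is what enforces the rule that several mutations in one gene in one sample count once.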

Fifteen genes showed significant differences in mutational frequency between the discovery and control datasets (listed in Table 5). Of these, ANO1 (p = 0.001) showed the most significant difference and NOTCH1 (p = 0.02) the least. KMT2C (a.k.a. MLL3) was mutated at a significantly higher frequency in the discovery dataset (17%, p = 0.006) than in the control dataset (2.25%); its frequency in the validation dataset was 9.8%. BRCA1 and BRCA2 showed no non-synonymous mutations in the discovery dataset. PIK3CA, despite showing no significant difference in frequency (Table 6), was mutated in ~17% of samples in the discovery dataset, compared with 12.3% in the control dataset and 5.8% in the validation dataset. NOTCH1 showed a significant difference (p = 0.02), with a mutational frequency of 11% in the discovery dataset and 1.12% in the control dataset, whereas its frequency in the validation dataset was 1.96% (Table 5). TP53 was mutated in 71-75% of samples across the three datasets (Table 6). PIK3R1, MAP3K1, PTEN and ERBB2 each showed a mutational frequency of 2.8% in the discovery set, whereas in the control set PIK3R1 and PTEN showed a mutation frequency of about 4%, ERBB2 was mutated in ~1% of samples and MAP3K1 did not show any mutation (Table 6).

Table 5: Mutation Frequencies of differentially mutated genes across three datasets. 15 genes show significant differences (p<0.05) in mutation frequencies across discovery dataset when compared to control dataset.

Genes      Mutation     Mutation     Mutation     Fisher’s P-value   Fisher’s P-value
           Frequency    Frequency    Frequency    (Discovery v/s     (Validation v/s
           Discovery    Validation   Control      Control)           Control)
           Dataset      Dataset      Dataset
           N=36         N=51         N=89
ANO1       14.29%       0.00%        0.00%        0.0014             1.0000
CRIPAK     11.43%       0.00%        0.00%        0.0056             1.0000
KMT2C      17.14%       9.80%        2.25%        0.0063             0.0992
ANO8       8.57%        0.00%        0.00%        0.0211             1.0000
BCAN       8.57%        1.96%        0.00%        0.0211             0.3643
KCNB2      8.57%        0.00%        0.00%        0.0211             1.0000
LYZL2      8.57%        0.00%        0.00%        0.0211             1.0000
MDC1       8.57%        1.96%        0.00%        0.0211             0.3643
NLRP5      8.57%        1.96%        0.00%        0.0211             0.3643
NPIPB15    8.57%        0.00%        0.00%        0.0211             1.0000
OR2T33     8.57%        0.00%        0.00%        0.0211             1.0000
PDE4DIP    8.57%        0.00%        0.00%        0.0211             1.0000
TLN2       8.57%        1.96%        0.00%        0.0211             0.3643
TMEM128    8.57%        0.00%        0.00%        0.0211             1.0000
NOTCH1     11.43%       1.96%        1.12%        0.0221             1.0000

Two-tailed Fisher’s exact p-values were corrected using the Benjamini-Hochberg FDR method, after which no significant differences were observed across the three datasets.


Table 6: Breast Cancer Driver Genes showing no significant differences (p>0.05) in mutational frequencies across three datasets.

Genes      Mutation     Mutation     Mutation     Fisher’s P-value   Fisher’s P-value
           Frequency    Frequency    Frequency    (Discovery v/s     (Validation v/s
           Discovery    Validation   Control      Control)           Control)
           Dataset      Dataset      Dataset
           N=36         N=51         N=89
PIK3CA     17.14%       5.88%        12.36%       0.5636             0.2575
TP53       71.43%       74.51%       75.28%       0.6550             1.0000
PIK3R1     2.86%        1.96%        4.49%        1.0000             0.6527
PTEN       2.86%        5.88%        4.49%        1.0000             0.7053
CDH1       5.71%        0.00%        1.12%        0.1919             1.0000
MAP3K1     2.86%        0.00%        0.00%        0.2823             1.0000
ERBB2      2.86%        5.88%        1.12%        0.4865             0.1372
ERBB3      0.00%        1.96%        2.25%        1.0000             1.0000
BRCA1      0.00%        1.96%        3.37%        0.5580             1.0000
BRCA2      0.00%        3.92%        6.74%        0.1829             0.7105
ABCA13     11.43%       1.96%        2.25%        0.0528             1.0000
EGFR       8.57%        0.00%        1.12%        0.0677             1.0000

Nominal p-values (Fisher’s exact test) for all genes were corrected using the Benjamini-Hochberg FDR method. No significant differences were observed across the three datasets after correction, which was expected given the small sample size.
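For readers unfamiliar with the correction, the following Python sketch shows the Benjamini-Hochberg step-up adjustment on a toy list of p-values. The function name and the example values are illustrative assumptions; the study itself reports using R for this step.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is capped by the adjusted values of all larger p-values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy nominal p-values, not taken from the study.
nominal = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(nominal)
for p, q in zip(nominal, adjusted):
    print(f"nominal p = {p:.3f} -> adjusted p = {q:.3f}")
```

Each p-value is scaled by the number of tests divided by its rank, so when many genes are tested at once and the cohort is small, modest nominal p-values can easily lose significance after adjustment.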

3.6 Significantly Mutated Genes

To distinguish genes in which mutations occurred by chance from genes whose mutations may contribute to TNBC, we used the hypergeometric test. For each gene, it gives the probability of observing the gene’s number of variants among its bases covered at 30X, given the total numbers of mutant and non-mutant bases covered at 30X across the cohort, i.e. given the background mutation rate.

$$p(x) = \frac{\binom{m}{x}\,\binom{n}{k-x}}{\binom{m+n}{k}}$$

where,

m = Total number of mutant bases covered at 30x across cohort

n = Total number of non-mutant bases covered at 30x across cohort

k = Total number of bases in particular gene covered at 30x

x = number of variants in gene covered at 30x

R provides the function phyper to perform the hypergeometric test and the function p.adjust, used here with the Benjamini-Hochberg method, to correct p-values for multiple comparisons.
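A small Python sketch of the same probability, using the symbols m, n, k and x defined above, is given here for illustration. The use of the upper tail P(X >= x) as the test statistic is an assumption, since the text does not state which tail was evaluated, and the toy numbers are not from the study; log-gamma is used so that the very large binomial coefficients arising from exome-scale base counts do not overflow.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def hypergeom_pmf(x, m, n, k):
    """P(X = x) = C(m, x) * C(n, k - x) / C(m + n, k)."""
    if x < max(0, k - n) or x > min(m, k):
        return 0.0
    return exp(log_comb(m, x) + log_comb(n, k - x) - log_comb(m + n, k))


def hypergeom_upper_tail(x, m, n, k):
    """P(X >= x): chance of at least x mutant bases among the k bases of a gene."""
    return sum(hypergeom_pmf(i, m, n, k) for i in range(x, min(m, k) + 1))


# Toy cohort: 20 mutant and 980 non-mutant covered bases; a gene spans 50 of them.
m, n, k = 20, 980, 50
x = 5  # variants observed in the gene
p_gene = hypergeom_upper_tail(x, m, n, k)
print(f"P(X >= {x}) = {p_gene:.3e}")
```

In R, the equivalent upper-tail probability would be obtained with phyper(x - 1, m, n, k, lower.tail = FALSE).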

Genes with more mutations than would be expected by chance (corrected p-value < 0.05) were flagged as significantly mutated genes (SMG). Out of the 15 differentially mutated genes with Fisher’s p-value < 0.05 (Table 5), only 7 were significantly mutated in the discovery cohort of 35 African American TNBC patients, which had an average background mutation rate of 1.5845E-09.


Table 7: Significantly Mutated Genes. 7 out of 15 differentially mutated genes were found to be significantly mutated in discovery dataset.

Genes      Discovery Dataset        Discovery Dataset        SMG          SMG-TCGA
           (Nominal Pr)             (FDR)                    Discovery    (Validation+Control)
CRIPAK     0.000000003              4.64601563037786E-05     SMG          NO_SMG
OR2T33     8.13724260506796E-08     0.0012562275             SMG          NO_SMG
TMEM128    3.46326096159129E-07     0.0053462359             SMG          NO_SMG
LYZL2      5.27673907878875E-07     0.0081451744             SMG          NO_SMG
ANO1       1.20687937188967E-06     0.0186257693             SMG          NO_SMG
ANO8       1.62660117360408E-06     0.0251017093             SMG          NO_SMG
KMT2C      1.8125091594974E-06      0.0279688288             SMG          SMG
PIK3CA     4.43866669645271E-11     6.85330137932298E-07     SMG          SMG
TP53       6.05885394560399E-52     9.35547637740711E-48     SMG          SMG
NPIPB15    4.25672233435324E-06     0.0656727122             NO_SMG       NO_SMG
KCNB2      8.4309429644653E-06      0.1300051405             NO_SMG       NO_SMG
BCAN       6.20128268713743E-05     0.9419669737             NO_SMG       NO_SMG
MDC1       0.0002519225             0.9419669737             NO_SMG       NO_SMG
NLRP5      6.15389102184376E-05     0.9419669737             NO_SMG       NO_SMG
PDE4DIP    0.0053548833             0.9419669737             NO_SMG       NO_SMG
TLN2       0.0044695936             0.9419669737             NO_SMG       NO_SMG
NOTCH1     7.57930742326473E-05     0.9419669737             NO_SMG       NO_SMG

Table 7 shows, for each gene, the probability (nominal Pr) from the test of whether the gene is significantly mutated across its entire length, and the corresponding probability corrected using the Benjamini-Hochberg FDR method (FDR), in the discovery dataset. Of the 7 genes found to be significantly mutated in the discovery dataset, only KMT2C was also significantly mutated in the TCGA (Validation+Control) data. Breast cancer driver genes such as TP53 and PIK3CA, which are considered significantly mutated according to publicly available data (Tumor Portal, cBioPortal), were in concordance with our data. NOTCH1, despite being frequently mutated with a Fisher’s p-value of 0.02, was not significantly mutated in our cohort or in TCGA. Variants in the significantly mutated genes specific to the discovery cohort (KMT2C, ANO1, CRIPAK, OR2T33, ANO8, TMEM128, LYZL2) were manually curated using the Integrative Genomics Viewer (IGV) to check whether they are true variants rather than FFPE artifacts.

Examples of variant curation using IGV are shown in Figures 13, 14 and 15.

Figure 13: CRIPAK gene insertion at position chr4:1388788-1388789 for FFPE sample TN76. A few reads appear as red blocks, indicating an inferred insert size larger than expected (suggestive of a deletion). The insertion took place in a repeat region.

CRIPAK (cysteine-rich PAK1 inhibitor) is an intronless gene at locus 4p16.3. As represented in Figure 13, some reads showed an inferred insert size larger than expected. CRIPAK was excluded from our analysis because the area around position chr4:1388788 was noisy owing to frequent point mutations, the position itself showed a silent mutation and a frameshift insertion simultaneously, and the position fell in a repeat region of the gene.


Table 8: Variants across significantly mutated genes. 25 Somatic, Non-Synonymous missense and indel mutations for 6 SMGs in discovery dataset along with in-silico prediction scores by Polyphen, SIFT and MutationTaster are listed.

MutationTaster : A – disease causing automatic ; D – disease causing; N- benign.

Polyphen Scores : D – probably damaging (0.85-1.0), P - possibly damaging (0.15-0.85), B – benign (0.0-0.15)

SIFT Scores : D – damaging (0.0-0.05) ; T – tolerated (0.05-1.0)

Details of the variants observed in SMGs across the discovery dataset are described in Table 8, along with in-silico predictions of the effects these variants might have. PolyPhen2, SIFT and MutationTaster were the three variant effect prediction tools used. The MutationTaster score is the probability of the prediction, i.e. a value close to 1 indicates high confidence that the prediction is correct. The PolyPhen2 score represents the probability that a substitution is damaging, i.e. the higher the probability, the more detrimental the effect. The SIFT score uses the same range as PolyPhen2 but with the opposite meaning: SIFT < 0.05 is detrimental and SIFT > 0.05 is benign or tolerated. A variant was considered deleterious or benign if at least two of the three prediction tools agreed.
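The two-out-of-three rule can be expressed as a short Python sketch. The cut-offs follow the score legends given for Table 8 (SIFT damaging below 0.05, PolyPhen benign up to 0.15, MutationTaster codes A and D as disease causing and N as benign); the function names, the handling of a missing score and the example inputs are assumptions for illustration and do not describe how the annotation was actually automated.

```python
SIFT_DAMAGING_BELOW = 0.05          # from the SIFT legend
POLYPHEN_BENIGN_UP_TO = 0.15        # from the PolyPhen legend
MUTATIONTASTER_DAMAGING = {"A", "D"}
MUTATIONTASTER_BENIGN = {"N"}


def sift_vote(score):
    return "deleterious" if score < SIFT_DAMAGING_BELOW else "benign"


def polyphen_vote(score):
    return "benign" if score <= POLYPHEN_BENIGN_UP_TO else "deleterious"


def mutationtaster_vote(code):
    if code in MUTATIONTASTER_DAMAGING:
        return "deleterious"
    if code in MUTATIONTASTER_BENIGN:
        return "benign"
    raise ValueError(f"unknown MutationTaster code: {code!r}")


def consensus(sift=None, polyphen=None, mutationtaster=None):
    """Label a variant when at least two available predictors agree."""
    votes = []
    if sift is not None:
        votes.append(sift_vote(sift))
    if polyphen is not None:
        votes.append(polyphen_vote(polyphen))
    if mutationtaster is not None:
        votes.append(mutationtaster_vote(mutationtaster))
    for label in ("deleterious", "benign"):
        if votes.count(label) >= 2:
            return label
    return "undetermined"


# Hypothetical inputs, not rows of Table 8.
print(consensus(sift=0.01, polyphen=0.99, mutationtaster="D"))   # all three agree
print(consensus(sift=0.40, polyphen=0.05, mutationtaster="D"))   # two benign votes
print(consensus(sift=None, polyphen=0.06, mutationtaster="D"))   # one vote each
```

When one score is unavailable, as with SIFT for some variants, two disagreeing predictors leave the variant undetermined under this rule.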


Figure 14 and Figure 15 show IGV panels for KMT2C and ANO1, respectively, with true variant calls at the positions highlighted in each panel.

Figure 14: KMT2C gene missense mutation at chr7:151,876,953 for FFPE sample TN65. Red marks show mutation of T in place of C.

Figure 15: ANO1 gene missense mutation at chr11:69,970,477 for FFPE sample TN54. Red highlight shows mutation of A in place of C.

Simultaneously, other genes were curated to check if there are any false positives.

KMT2C and ANO1, the top hits with highly significant mutation frequencies compared to the other FFPE SMGs, were selected as genes of interest. Their protein coding domains were analyzed individually using MutationMapper, a visualization tool from cBioPortal, and their links to known oncogenic pathways or networks were examined using STRING-db. Of the 6 genes listed in Table 8, ANO1 was associated with the CFTR gene, with text mining and experimental evidence as the sources of interaction, whereas the other genes were not strongly associated with any known oncogenic pathway.


KMT2C: Lysine Methyltransferase 2C, also known as MLL3, is involved in histone methylation. It methylates Lys-4 of histone H3, which represents a specific tag for epigenetic transcriptional activation.

Figure 16: KMT2C mutations visualized using Mutation Mapper. Plant homeodomains (PHD) in KMT2C are highlighted in blue, red and green. Phenylalanine and tyrosine rich (FYR) domains are highlighted in yellow and pink. The SET domain, which methylates Lys-4 of histone H3, is colored orange.

This gene is an SMG in our FFPE African American cohort and in TCGA, with ~17% of samples mutated in UH-FFPE, ~10% in TCGA-AA and 2.2% in TCGA-EA, making it specific to African American ethnicity. Of the 7 mutations in UH-FFPE, 5 were missense, 1 was a frameshift deletion and 1 was an in-frame deletion in a plant homeodomain (PHD) zinc finger. Frameshift and in-frame deletions across the length of the gene likely suggest loss of function.

Of the 5 missense mutations, the mutation changing the amino acid at position 2470 from valine to isoleucine (V2470I) was considered benign by SIFT (0.46) and PolyPhen2 (0.451), whereas the mutations causing the amino acid changes P27R, E864G, I4055M and R4145C were predicted to be probably damaging, with probabilities in the range (0.975, 1) for MutationTaster and PolyPhen2 and scores < 0.05 for SIFT, as shown in Table 8.

ANO1: Anoctamin 1 functions as a calcium-activated chloride channel (CaCC) expressed in various tissues including smooth muscle, secretory epithelia and sensory neurons [23]. GO annotations related to this gene include protein homodimerization activity, supported by the dimerization domain of the Ca2+-activated chloride channel (Anoct_dimer), and intracellular calcium-activated chloride channel activity, supported by the Anoctamin domain, as shown in Figure 17.

Figure 17: ANO1 mutations visualized using Mutation Mapper. Red bar shows calcium activated chloride activity domain and Green bar shows dimerization domain.

ANO1 showed 5 missense mutations, 2 in the Anoct_dimer domain and 3 in the calcium-activated chloride channel domain, indicating evolutionary conservation of the mutated positions; the gene is therefore more likely to exhibit oncogenic behavior.


The mutation causing the amino acid change R307S in the Anoct_dimer domain was predicted to be benign by PolyPhen2 and MutationTaster, whereas the mutations R226Q, F628L and R762H were categorized as probably damaging, with probabilities in the range (0.99, 1). The mutation changing the amino acid at position 704 from glutamate to lysine (E704K) in the anoctamin domain was considered benign by PolyPhen2 (0.06) and damaging by MutationTaster (high confidence). SIFT scores for ANO1 were not provided during variant annotation.

There is evidence that CFTR expression is significantly downregulated in primary human breast cancer samples and is closely associated with poor prognosis in different cohorts of breast cancer patients [24]. In addition, ANO1 promotes breast cancer progression by activating EGFR and CAMK signaling [25].

In summary, we provided an in-depth analysis of the mutation distribution across the two ethnicities, which was not found to be significantly different, and evaluated genes that were differentially mutated in the discovery dataset. Mutations in CRIPAK were found to be FFPE artifacts, and this gene was therefore excluded from further analysis. The top hits (KMT2C, ANO1) from the discovery cohort were analyzed for association with known oncogenic pathways and for in-silico predictions of variant effects.


CHAPTER 4

DISCUSSION

The central question, whether any differences exist in the somatic mutational profiles of triple-negative breast cancer arising in African American and European American ethnicities, was addressed in Chapter 3. The inferences drawn from the results and the potential pitfalls that might affect the results of this analysis are discussed in this Chapter.

We observed that FFPE samples showed a significantly (p<0.003) higher read duplication rate and lower reported base quality scores for base substitution events than Frozen samples. It can be inferred that FFPE samples are more prone to artifacts than frozen samples, which makes them challenging for molecular studies.

Minor differences in variables such as time of tissue fixation, time required for paraffin infiltration (processing time), instrumentation and methodology, sample handling, work environment and demographics of the tissue being collected can introduce significant variation, calling into question the quality and reliability of downstream assay results.

An FFPE tissue analyzed at 60 days of fixation might show a different mutational profile a year later. Therefore, the possibility that some results are FFPE-driven artifacts is a potential pitfall of this study.


Even after controlling for FFPE artifacts in the preprocessing steps, challenges remain in calling somatic mutations, including insufficient tumor read depth, low allele frequency, availability of a matched normal sample and tumor heterogeneity. To address these issues we applied a multi-caller strategy, on the premise that a call made by more than one somatic mutation caller is more likely to be a true variant and less likely to be a false positive. For MuTect2 and VarScan2, we observed a percentage of overlapping calls in the range of 15-20%, indicative of low concordance. As we observed a high fraction of overlapping calls for MuTect2 (70-75%), results from MuTect2 were used for further analysis. Although MuTect2 was more specific and showed the best performance with few false positives, we anticipate that our results might still include false positives, a drawback of relying on a single caller. Another measure taken to reduce the chance of somatic, non-synonymous base substitutions and indel mutations being false positives was the implementation of stringent thresholds on tumor coverage (>=30) and tumor allele count (>=2). These thresholds reduced false positives by 20-30%, but handling false positives beyond these filters is an area that needs further development.

We observed no significant differences in tumor mutation burden across the three datasets (discovery, validation and control), from which we concluded that mutation distribution did not differ between the two ethnicities. Next, we asked whether there were any differentially mutated genes between UH-FFPE and TCGA-EA and, if so, how many of those genes show significant mutational frequencies and whether these mutations are occurring by chance.


We observed 15 differentially mutated genes with nominal p-values < 0.05, indicating a significant difference in mutation frequency for these genes. After FDR correction no significant differences remained, owing to the small sample size. As shown in Table 5, the discovery dataset showed a higher frequency of mutation for these 15 genes than the validation and control datasets. ANO1, KMT2C and CRIPAK, the top hits with highly significant nominal p-values, showed mutational frequencies in the range of 11-17% in the discovery cohort. ANO1 and CRIPAK were not mutated in the validation (TCGA-AA) and control (TCGA-EA) datasets.

One potential reason for this difference is that these mutations are FFPE artifacts appearing randomly in the discovery set. We therefore performed IGV analysis of our candidate genes and found two mutations in CRIPAK to be FFPE artifacts, an instructive example of false positives intruding even after stringent measures are applied. CRIPAK was excluded from further analysis.

Demographic differences between the two sample sets, local sequencing errors and the absence of a locally accrued UH-FFPE-EA dataset are other potential reasons for this difference. These mutational frequencies might vary substantially if compared with Cleveland-based European American FFPE TNBC samples, described later as future analysis, which could be a unique way of exhibiting demographic variation between the two ethnicities. Comparison of molecular data between two different tissue types (FFPE and Frozen) can also introduce substantial differences.


KMT2C and ANO1, along with four other genes, were found to be significantly mutated across the length of the gene in the discovery cohort. KMT2C was observed as an SMG in the discovery and TCGA datasets, with a high mutation frequency in the discovery and validation datasets, making it specific to African American ethnicity. Owing to frameshift mutations across the length of the gene, KMT2C likely exhibits loss of function, whereas ANO1 showed missense mutations and is likely to exhibit the properties of an oncogene.

Additional exploratory analyses that could not be included in this study because of time constraints, but that could provide further insights, are described below.

Subnetwork and Pathway Analysis: Genes interact in various signaling and regulatory pathways and protein complexes. Therefore, a genome-scale analysis of mutated subnetworks [26] can be performed to identify the subnetworks and regulatory pathways in which our candidate genes interact. HotNet2 identifies significantly mutated subnetworks that encompass pathways and complexes with characterized roles in cancer [26].

Implementation of other somatic mutation calling algorithms: In our study, because of the low concordance between MuTect2 and VarScan2, we proceeded with a single-caller approach using MuTect2 results for downstream analysis. The other somatic variant callers described in Table 4 can also be applied to see whether the frequency of mutated genes across the three datasets and the variants obtained by MuTect2 fall in the same ballpark when called by other mutation callers. We might again not see high concordance, but this would provide more certainty about the frequency of differentially mutated genes.


CHAPTER 5

FUTURE DIRECTIONS

We identified genes that were significantly differentially mutated in African American TNBCs as compared to their European American counterparts using comparative analyses of whole-exome sequencing data derived from an in-house cohort of 36 AA TNBCs and publicly available datasets.

With mutational profiles in hand, the next important analysis could be to determine context-specific mutational frequencies. Mutational processes such as infidelity of DNA replication, exposure to mutagens, post-translational modifications and defective DNA repair mechanisms generate unique combinations of mutation types, a.k.a. mutation signatures. Mutations can be categorized into AT transitions, AT transversions, CpG transitions, CpG transversions, CG (non-CpG) transitions and transversions, and an indel category using Genome MuSiC [27]. The probability of the observed mutations in each of these context categories can be calculated using the binomial distribution. The significance of mutational frequencies in the categories defined above can be tested using the hypergeometric test at a given context-specific background mutation rate. This analysis will help determine whether specific mutational processes contribute to the pathogenesis of TNBCs in African Americans.
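To make the context categories concrete, the following Python sketch assigns a single-base substitution to one of the six substitution categories named above. It is a simplified illustration and not the Genome MuSiC implementation; the function name, the convention of passing the two flanking reference bases and the example variants are assumptions.

```python
from collections import Counter

# Purine-to-purine and pyrimidine-to-pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def classify_snv(ref, alt, prev_base, next_base):
    """Return the context category of a substitution on the reference strand.

    prev_base and next_base are the reference bases flanking the variant.
    A reference C followed by G, or a reference G preceded by C, is a CpG site.
    """
    ref, alt = ref.upper(), alt.upper()
    prev_base, next_base = prev_base.upper(), next_base.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError("expected a single-base substitution")
    kind = "transition" if (ref, alt) in TRANSITIONS else "transversion"
    if ref in ("A", "T"):
        context = "AT"
    elif (ref == "C" and next_base == "G") or (ref == "G" and prev_base == "C"):
        context = "CpG"
    else:
        context = "CG (non-CpG)"
    return f"{context} {kind}"


# Hypothetical variants as (ref, alt, previous base, next base).
variants = [
    ("C", "T", "A", "G"),
    ("G", "T", "C", "A"),
    ("A", "G", "C", "C"),
    ("C", "A", "T", "T"),
    ("C", "T", "G", "G"),
]
category_counts = Counter(classify_snv(*v) for v in variants)
for category, count in sorted(category_counts.items()):
    print(f"{category}: {count}")
```

Per-category counts of this kind are the input to the binomial or hypergeometric tests described above, each evaluated against its own context-specific background rate.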


Our comparison of mutational landscapes in the discovery (AA TNBC), control (EA TNBC from the TCGA) and validation (AA TNBC from the TCGA) datasets revealed 15 differentially mutated genes exhibiting significantly higher mutation rates in AA TNBCs, as shown in Table 5. First, orthogonal assessments can be performed for these candidate genes in the discovery dataset (36 FFPE African American TNBC samples) using targeted gene sequencing panels. Targeted resequencing enables higher depth, high sensitivity (allele-specific limit of detection <1%) and the ability to discover novel variants [28]. Subsequently, these genes can be validated using targeted gene sequencing on 300 locally sourced FFPE TNBC samples comprising 146 AA and 156 EA samples. Sequencing data preprocessing and variant calling can be done as per the workflow described in Figure 2. This validation set will also allow us to check whether the observed differences in mutational frequencies are due to non-intrinsic factors such as the geographical locations from which samples are accrued, different sequencing platforms or tumor tissue type (FFPE vs Fresh Frozen).

Selected candidates can additionally be evaluated using functional studies to assess the molecular and phenotypic relevance of these genes. For example, to verify changes in chloride channel activity or behavior due to mutations in ANO1, a patch-clamp assay on cell lines expressing mutated ANO1 can be used. Next, to verify the effects of ANO1 mutations on cell proliferation and viability, we first need to check whether ANO1 has a role in promoting cell proliferation in breast cancer. This can be tested by treating breast cancer cell lines with an ANO1 inhibitor; reduced proliferation would be suggestive of ANO1 biochemical activity and would support its role in promoting cell proliferation in breast cancer. Once its role in breast cancer cell proliferation is established, mutations can be introduced into cell lines expressing ANO1 to determine whether these mutations drastically increase cell proliferation. These studies could help us understand the contribution of ANO1 to breast cancer growth.

Secondly, to test the impact of mutations on the methyltransferase activity of KMT2C, we can perform Methylated DNA Immunoprecipitation Sequencing (MeDIP) on DNA extracted from a wild-type breast cell line and from a breast cancer cell line with introduced mutations in KMT2C. Regions with methylated cytosine in wild-type DNA may not be methylated in mutant DNA, showing thymine at those sites. High-throughput sequencing yields patterns of methylation in mutant and wild-type cells, which can be compared to elucidate differences in segments of DNA resulting from bisulfite conversion. These analyses will identify the target genes affected by mutations in KMT2C. Gene expression studies using RT-PCR can later be performed on the identified targets to determine the effects of changes in transcriptional activity due to mutations in KMT2C.

Somatic copy-number alterations could be another potential contributor to differences in the observed cancer burdens in AA versus EA TNBCs. To identify copy-number differences between AA and EA TNBCs, we can employ a robust computational framework, ENVE (Extreme Value Distribution Based Somatic Copy-Number Variation Estimation) [29], which can determine somatic copy-number alterations (sCNAs) using whole-exome sequencing data. Accordingly, we can apply ENVE to analyze the WES profiles of the 36 AA samples (discovery cohort) and a local control set of 36 FFPE European American TNBC samples. Candidate sCNAs can be further validated using our locally accrued validation cohort comprising 146 AA and 156 EA tumor samples.

Candidate sCNAs identified using ENVE can be validated using qPCR-based assessments and/or microarray-based assays such as Affymetrix OncoScan arrays [30].

Yet another source of potential biologic differences between AA and EA TNBCs could be differential regulation of DNA methylation profiles. In fact, KMT2C, one of the differentially mutated genes in our study, is known to influence DNA methylation, suggesting that there could indeed be epigenetic differences between AA and EA TNBCs. Accordingly, we can perform genome-scale DNA methylation profiling on the discovery cohort (36 AA FFPE) and a matched local cohort of 36 FFPE EA TNBCs, using methylation arrays such as the Illumina EPIC. Candidate differentially methylated loci in AA versus EA TNBCs can then be orthogonally validated by methylation-specific PCR, thus revealing additional complementary mechanisms likely contributing to differential cancer burdens between AA and EA TNBCs.
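Genome-scale arrays test a very large number of loci at once, so some form of multiple-testing control is usually needed before loci are nominated for orthogonal validation. The text does not specify how candidates would be selected; the following is a minimal sketch assuming per-locus p-values are already available and that the Benjamini-Hochberg false discovery rate procedure is used. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-locus p-values (illustration only, not study data)
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
# Loci passing an assumed 5% FDR threshold
candidates = [i for i, q in enumerate(example_q) if q <= 0.05]
print(example_q, candidates)
```

Loci whose adjusted value falls below the chosen threshold would be the ones carried forward to methylation-specific PCR; the 5% threshold here is an assumption, not a value taken from the study.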

Of note, the functional effects of each of the above genomic and epigenetic aberrations are likely observable in the gene expression profiles of the respective tissues. Therefore, we can perform genome-scale transcriptome profiling on the discovery cohort (36 AA FFPE) and a matched local cohort of 36 FFPE EA TNBCs. Select differentially expressed genes can then be validated using either targeted RNA sequencing or RT-PCR on the 300 archival FFPE TNBC samples from Northeast Ohio (146 African American and 156 European American). Additionally, given that prior studies have identified recurrent, clinically significant expression-based subtypes within TNBCs (basal-like 1, basal-like 2, immunomodulatory, mesenchymal, mesenchymal stem-like and luminal androgen receptor) [31], we can evaluate whether the above aberrations are associated with any of these known subtypes of TNBC. Also, given some prior evidence that the basal-like 1 (BL1) subtype is more prevalent in AA TNBCs [32], we can assess differences in subtype frequencies using the 300 archival FFPE TNBC samples from Northeast Ohio (146 African American and 156 European American).
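A difference in the frequency of a single subtype between the two groups can be assessed with a 2x2 contingency test. The sketch below assumes a two-sided Fisher's exact test is an acceptable choice; the text does not name a specific test. The cohort sizes (146 AA, 156 EA) come from the text, but the BL1 counts are hypothetical placeholders, and the function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Cohort sizes from the text; BL1 counts are HYPOTHETICAL placeholders
aa_total, ea_total = 146, 156
aa_bl1, ea_bl1 = 40, 25
p_value = fisher_exact_two_sided(aa_bl1, aa_total - aa_bl1,
                                 ea_bl1, ea_total - ea_bl1)
print(f"Two-sided Fisher's exact p-value: {p_value:.4g}")
```

The same function could be applied to each of the six subtypes in turn, in which case the resulting p-values would need correction for multiple comparisons.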

Each of the methods described above can provide information on individual molecular mechanisms specific to AA TNBC. However, it is possible that the individual genomic, epigenomic and transcriptomic aberrations all contribute to convergent dysregulation of specific signaling networks in AA versus EA TNBCs. It is therefore important to perform an integrative analysis of these various omics profiles using robust computational frameworks that can provide additional insights into the underlying signaling network deregulations between AA and EA TNBCs. One such methodology, developed in our laboratory, is InFlo [33], which can capture patient-specific pathway deregulations by integrating multi-omics data including mutations, CNVs, gene-expression and epigenetic profiles. Thus, we can use InFlo to integrate the individual molecular profiles described above to determine signaling sub-networks that are differentially activated in AA versus EA TNBCs.

Taken together, these additional analyses and validation studies will provide a comprehensive view of the intrinsic biologic differences between AA and EA TNBCs, which can then be assessed for associations with clinical outcome, thus providing much-needed insights into the as yet unidentified biologic factors promoting poor clinical outcomes in TNBC patients with African American ancestry.

CHAPTER 6

CONCLUSION

In summary, this study revealed differences in the somatic mutation profiles of African American Triple-Negative Breast Cancer patients when compared with those of European American patients. Beyond the somatic mutational profiles, we noted potential differences between the FFPE and frozen sample sets used. We implemented multi-caller somatic mutation calling but observed low concordance among the calls made by individual callers.

Tumor mutation burden in triple-negative breast cancer did not differ significantly between the two ethnicities (AA and EA). KMT2C and ANO1 were found to be significantly differentially mutated, with mutational frequencies of 17% and 14%, respectively. KMT2C was also found to be significantly mutated in the validation dataset, with a frequency of 10%, whereas in the control dataset only 2.2% of samples carried KMT2C mutations, supporting its association with AA ethnicity. ANO1 was mutated only in the discovery cohort. Differences in mutational frequencies arising from FFPE artifacts, demographics, sequencing platform, etc., will be addressed with a validation set of 300 locally accrued AA and EA FFPE TNBC samples matched for platform, tissue and demographics. Additional exploratory analyses that can be done include context-specific mutational frequencies and identification of significantly mutated subnetworks and regulatory pathways associated with the candidate genes. Taken together, our results indicate the presence of novel genes that can potentially highlight unsuspected genetic mechanisms promoting the worse prognosis of Triple-Negative Breast Cancers arising in African American women.

REFERENCES

[1] C. E. DeSantis, J. Ma, A. Goding Sauer, L. A. Newman, and A. Jemal, “Breast cancer statistics, 2017, racial disparity in mortality by state,” CA Cancer J Clin, vol. 67, no. 6, pp. 439–448, Nov. 2017.

[2] C. M. Perou et al., “Molecular portraits of human breast tumours,” Nature, vol. 406, no. 6797, pp. 747–752, Aug. 2000.

[3] A. Prat and C. M. Perou, “Deconstructing the molecular portraits of breast cancer,” Mol Oncol, vol. 5, no. 1, pp. 5–23, Feb. 2011.

[4] O. Yersal and S. Barutca, “Biological subtypes of breast cancer: Prognostic and therapeutic implications,” World J Clin Oncol, vol. 5, no. 3, pp. 412–424, Aug. 2014.

[5] A. M. Badowska-Kozakiewicz and M. P. Budzik, “Immunohistochemical characteristics of basal-like breast cancer,” Contemp Oncol (Pozn), vol. 20, no. 6, pp. 436–443, 2016.

[6] R. Greenup et al., “Prevalence of BRCA mutations among women with triple-negative breast cancer (TNBC) in a genetic counseling cohort,” Ann. Surg. Oncol., vol. 20, no. 10, pp. 3254–3258, Oct. 2013.

[7] R. Nanda et al., “Genetic testing in an ethnically diverse cohort of high-risk women: a comparative analysis of BRCA1 and BRCA2 mutations in American families of European and African ancestry,” JAMA, vol. 294, no. 15, pp. 1925–1933, Oct. 2005.

[8] C. I. Szabo and M. C. King, “Population genetics of BRCA1 and BRCA2,” Am J Hum Genet, vol. 60, no. 5, pp. 1013–1020, May 1997.

[9] T. Akinyemiju, J. X. Moore, and S. F. Altekruse, “Breast cancer survival in African-American women by hormone receptor subtypes,” Breast Cancer Res. Treat., vol. 153, no. 1, pp. 211–218, Aug. 2015.

[10] F. O. Ademuyiwa, Y. Tao, J. Luo, K. Weilbaecher, and C. X. Ma, “Differences in the mutational landscape of triple-negative breast cancer in African Americans and Caucasians,” Breast Cancer Res. Treat., vol. 161, no. 3, pp. 491–499, 2017.

[11] L. A. Newman et al., “African-American ethnicity, socioeconomic status, and breast cancer survival: a meta-analysis of 14 studies involving over 10,000 African-American and 40,000 White American patients with carcinoma of the breast,” Cancer, vol. 94, no. 11, pp. 2844–2854, Jun. 2002.

[12] K. R. Bauer, M. Brown, R. D. Cress, C. A. Parise, and V. Caggiano, “Descriptive analysis of estrogen receptor (ER)-negative, progesterone receptor (PR)-negative, and HER2-negative invasive breast cancer, the so-called triple-negative phenotype: a population-based study from the California cancer Registry,” Cancer, vol. 109, no. 9, pp. 1721–1728, May 2007.

[13] M. J. Lund et al., “Race and triple negative threats to breast cancer survival: a population-based study in Atlanta, GA,” Breast Cancer Res. Treat., vol. 113, no. 2, pp. 357–370, Jan. 2009.

[14] E. C. Dietze, C. Sistrunk, G. Miranda-Carboni, R. O’Regan, and V. L. Seewaldt, “Triple-negative breast cancer in African-American women: disparities versus biology,” Nat. Rev. Cancer, vol. 15, no. 4, pp. 248–254, 2015.

[15] A. Colaprico et al., “TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data,” Nucleic Acids Res., vol. 44, no. 8, p. e71, May 2016.

[16] D. Huo et al., “Comparison of Breast Cancer Molecular Features and Survival by African and European Ancestry in The Cancer Genome Atlas,” JAMA Oncol, vol. 3, no. 12, pp. 1654–1662, Dec. 2017.

[17] M. Srinivasan, D. Sedmak, and S. Jewell, “Effect of Fixatives and Tissue Processing on the Content and Integrity of Nucleic Acids,” Am J Pathol, vol. 161, no. 6, pp. 1961–1971, Dec. 2002.

[18] H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows-Wheeler transform,” Bioinformatics, vol. 25, no. 14, pp. 1754–1760, Jul. 2009.

[19] L. Cai, W. Yuan, Z. Zhang, L. He, and K.-C. Chou, “In-depth comparison of somatic point mutation callers based on different tumor next-generation sequencing depth data,” Sci Rep, vol. 6, p. 36540, Nov. 2016.

[20] A. D. Ewing et al., “Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection,” Nat. Methods, vol. 12, no. 7, pp. 623–630, Jul. 2015.

[21] A. B. Krøigård, M. Thomassen, A.-V. Lænkholm, T. A. Kruse, and M. J. Larsen, “Evaluation of Nine Somatic Variant Callers for Detection of Somatic Mutations in Exome and Targeted Deep Sequencing Data,” PLoS ONE, vol. 11, no. 3, p. e0151664, 2016.

[22] Z. R. Chalmers et al., “Analysis of 100,000 human cancer genomes reveals the landscape of tumor mutational burden,” Genome Med, vol. 9, no. 1, p. 34, Apr. 2017.

[23] Y. D. Yang et al., “TMEM16A confers receptor-activated calcium-dependent chloride conductance,” Nature, vol. 455, no. 7217, pp. 1210–1215, Oct. 2008.

[24] J. T. Zhang et al., “Downregulation of CFTR promotes epithelial-to-mesenchymal transition and is associated with poor prognosis of breast cancer,” Biochim. Biophys. Acta, vol. 1833, no. 12, pp. 2961–2969, Dec. 2013.

[25] A. Britschgi et al., “Calcium-activated chloride channel ANO1 promotes breast cancer progression by activating EGFR and CAMK signaling,” Proc. Natl. Acad. Sci. U.S.A., vol. 110, no. 11, pp. E1026–1034, Mar. 2013.

[26] M. D. M. Leiserson et al., “Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes,” Nature Genetics, vol. 47, no. 2, pp. 106–114, Feb. 2015.

[27] N. D. Dees et al., “MuSiC: Identifying mutational significance in cancer genomes,” Genome Research, vol. 22, no. 8, p. 1589, Aug. 2012.

[28] “targeted-resequencing-guide-770-2016-012.pdf.”

[29] V. Varadan et al., “ENVE: a novel computational framework characterizes copy-number mutational landscapes in colorectal cancers from African American patients,” Genome Med, vol. 7, no. 1, Jul. 2015.

[30] G. Ciriello, M. L. Miller, B. A. Aksoy, Y. Senbabaoglu, N. Schultz, and C. Sander, “Emerging landscape of oncogenic signatures across human cancers,” Nat. Genet., vol. 45, no. 10, pp. 1127–1133, Oct. 2013.

[31] B. D. Lehmann et al., “Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies,” J. Clin. Invest., vol. 121, no. 7, pp. 2750–2767, Jul. 2011.

[32] R. Lindner et al., “Molecular phenotypes in triple negative breast cancer from African American patients suggest targets for therapy,” PLoS ONE, vol. 8, no. 11, p. e71915, 2013.

[33] N. Dimitrova et al., “InFlo: a novel systems biology framework identifies cAMP-CREB1 axis as a key modulator of platinum resistance in ovarian cancer,” Oncogene, vol. 36, no. 17, pp. 2472–2482, Apr. 2017.
