
Author Manuscript Published OnlineFirst on March 23, 2018; DOI: 10.1158/0008-5472.CAN-17-1721 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

A large-scale, exome-wide association study of Han Chinese women identifies three

novel loci predisposing to breast cancer

Bo Zhang1,2, Men-Yun Chen2, Yu-Jun Shen3, Xian-Bo Zhuo3, Ping Gao4, Fu-Sheng

Zhou3, Bo Liang3, Jun Zu3, Qin Zhang4, Sufyan Suleman4, Yi-Hui Xu2, Min-Gui Xu2,

Jin-Kai Xu2, Chen-Cheng Liu2, Nikolaos Giannareas4, Ji-Han Xia4, Yuan Zhao3,

Zhong-Lian Huang1, Zhen Yang1, Huai-Dong Cheng1, Na Li1, Yan-Yan Hong1, Wei

Li1, Min-Jun Zhang1, Ke-Da Yu5, Guoliang Li6, Meng-Hong Sun5, Zhen-Dong Chen1,

Gong-Hong Wei4 & Zhi-Min Shao5

1Department of Oncology, No.2 Hospital, Anhui Medical University, Hefei, Anhui,

China

2School of Life Sciences, Anhui Medical University, Hefei, Anhui, China

3State Key Laboratory Incubation Base of Dermatology, Ministry of National Science

and Technology, Hefei, Anhui, China

4Biocenter Oulu, Faculty of Biochemistry and Molecular Medicine, University of

Oulu, Oulu, Finland

5Department of Breast Surgery, Fudan University Shanghai Cancer Center/Cancer

Institute, Shanghai, China

6Bio-Medical Center, College of Informatics, Huazhong Agricultural University,

Wuhan, China

Downloaded from cancerres.aacrjournals.org on September 26, 2021. © 2018 American Association for Cancer Research.

Running title: A large exome-wide association study of Han Chinese women

Corresponding Authors: Bo Zhang, Department of Oncology, No.2 Hospital, Anhui

Medical University, Hefei, Anhui, China. Phone: +86-551-62923252; E-mail:

[email protected]; Gong-Hong Wei, Biocenter Oulu, Faculty of Biochemistry and

Molecular Medicine, P.O.Box 5000, FIN-90014 University of Oulu, Finland. Phone:

+358-504288121; E-mail: [email protected]; and Zhi-Min Shao, Fudan

University Shanghai Cancer Center, Shanghai, China. E-mail:

[email protected]

Conflict of interest: The authors disclose no potential conflicts of interest.

Abstract

Genome-wide association studies have identified more than 90 susceptibility loci for

breast cancer. However, the missing heritability is evident, and the contributions of

coding variants to breast cancer susceptibility have not yet been systematically

evaluated. Here we present a large-scale whole-exome association study for breast

cancer consisting of 24,162 individuals (10,055 cases and 14,107 controls). In addition

to replicating known susceptibility loci (e.g. ESR1, FGFR2 and TOX3), we identify two

novel missense variants in C21orf58 (rs13047478, Pmeta = 4.52 × 10⁻⁸) and ZNF526

(rs3810151, Pmeta = 7.60 × 10⁻⁹) and one new non-coding variant at 7q21.11 (P < 5 ×

10⁻⁸). C21orf58 and ZNF526 possessed functional roles in the control of breast cancer

cell growth, and the two coding variants were found to be eQTLs for several nearby

genes. rs13047478 was significantly (P < 5.00 × 10⁻⁸) associated with the expression of

genes MCM3AP and YBEY in breast mammary tissues. rs3810151 was found to be

significantly associated with the expression of genes PAFAH1B3 (P = 8.39 × 10⁻⁸) and

CNFN (P = 3.77 × 10⁻⁴) in human blood samples. C21orf58 and ZNF526, together with

these eQTL genes, were differentially expressed in breast tumors versus normal breast.

Our study reveals additional loci and novel genes for genetic predisposition to breast

cancer and highlights a polygenic basis of disease development.

Significance: Large-scale genetic screening identifies novel missense variants and a

noncoding variant as predisposing factors for breast cancer.

Introduction

Breast cancer is the most common type of cancer and the leading cause of cancer deaths

in women worldwide (1). Morbidity and mortality associated with breast cancer have

increased rapidly in China (2). Although the precise mechanisms underlying this

heterogeneous disease have not been fully elucidated, increasing evidence indicates

that common genetic variants may contribute to the heritable risk of breast cancer (3).

Our understanding of the genetic architecture of breast cancer has rapidly

improved through genome-wide association studies (GWASs), which have identified

numerous breast cancer risk-associated variants within more than 90 susceptibility loci

(www.genome.gov/gwastudies). However, these variants only explain a small

proportion of the genetic variation in breast cancer. The “missing heritability” in breast

cancer is evident (4, 5). Furthermore, most of the previously identified variants related

to breast cancer susceptibility are located in non-coding genomic regions (6) and thus

provide few clues to the functional mechanisms through which these variants affect

susceptibility to the disease. Analysis of coding variation could provide more direct

biological and functional interpretation of disease etiology. Assessing the role of

high-penetrance coding variants, which are poorly covered by conventional GWAS,

may help to identify the “missing heritability” in polygenic disorders (7-9).

Recent technological advances in high-throughput sequencing (10) have provided

an opportunity to re-sequence multiple genetic regions. Such studies have generated

compelling evidence that coding variants contribute to the mechanisms of breast

cancer (4) and other complex disorders (11-16). Recently, studies employing new exome

chips have demonstrated that such chips can be used to comprehensively identify

coding variants for several complex traits. Due to the relatively high cost of high-

throughput sequencing, exome chips provide a cost-effective method for the

investigation of coding variants. In this study, we sought to identify novel genetic loci

predisposing to breast cancer using exome chips in a Chinese population.

Materials and Methods

Study samples

We implemented a two-stage case-control design in this study. The subjects, consisting

of 10,055 cases and 14,107 healthy controls, were enrolled through a collaborative

consortium in China (Table 1). All cases were diagnosed by at least two pathologists,

and their clinical information was collected through a comprehensive clinical check-up

by professional investigators. Additionally, demographic information was collected

from all participants through a structured questionnaire. All of the healthy controls were

clinically determined to be free of breast cancer and to have no family history of breast cancer

(including first- and second-degree relatives). All cases and controls were female. All

samples were self-reported Han Chinese. Written, informed consent was given by all

participants. The study was approved by the institutional ethics committee of each

hospital and was conducted according to the Declaration of Helsinki principles.

Exome array and genotyping

In this study, we used a custom Illumina Human Exome Asian BeadChip (Exome_Asian

Array). The platform includes 242,102 markers focused on putative functional coding

variants from > 12,000 exome and genome sequences representing multiple ethnicities

and complex traits in addition to 30,642 Chinese population-specific coding variants,

identified by whole exome sequencing performed in 676 controls by our group (17).

The details of the SNP content and selection strategies are described on the exome array

design webpage (http://genome.sph.umich.edu/wiki/Exome_Chip_Design).

In this study, a cohort including 16,066 samples (8,031 cases and 8,035 controls) was

genotyped using the Exome_Asian array. The genotyping was conducted at the State

Key Lab Incubation Base of Dermatology, Ministry of National Science and

Technology (Anhui Medical University). The genotype calling and the clustering of

study sample genotypes were performed using Illumina’s GenTrain (version 1.0)

clustering algorithm in Genome Studio (version 2011.1).

Quality controls

We excluded samples with genotyping call rates < 98% in the first stage. Then we

examined potential genetic relatedness based on pairwise identity by state (IBS) for all

the successfully genotyped samples using PLINK 1.07 software (18). On the

identification of a first- or second-degree relative pair, we removed one of the two

related individuals (the sample with the lower call rate was removed). We defined close

relatives as those for whom the estimated genome-wide identity-by-descent proportion

of alleles shared was > 0.10. In total, 452 samples (388 cases and 64 controls) were

removed due to sample duplication and genetic relatedness. The remaining samples

were subsequently assessed for population outliers and stratification using a PCA-based

approach (19). For all PCA, all HLA SNPs on chr.6 and SNPs on non-autosomes were

removed (Supplementary Figure 1). Furthermore, we excluded SNPs with a call rate

< 99%, a minor allele frequency (MAF) < 0.01 and/or a significant deviation from

Hardy-Weinberg equilibrium (HWE) in the controls (P < 10⁻⁴) during each stage. In total,

272,744 variants (Exome_Asian Array) entered quality control. After quality

control, the genotype data of 33,347 autosomal variants in 8,031 cases and 8,035 controls

were included for further analysis.
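
The variant-level filters described above (call rate, MAF and HWE in controls) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, genotype counts are given as (AA, AB, BB) tuples, and HWE is checked with a 1-df chi-square approximation instead of an exact test.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P value of a 1-df chi-square test for deviation from HWE (approximation)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic variant: no deviation can be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2))


def passes_variant_qc(case_counts, control_counts, n_samples,
                      min_call_rate=0.99, min_maf=0.01, min_hwe_p=1e-4):
    """Apply the three thresholds quoted in the text to one variant.

    case_counts / control_counts: (AA, AB, BB) counts among called genotypes.
    n_samples: number of samples in which genotyping was attempted.
    """
    called = sum(case_counts) + sum(control_counts)
    if called == 0:
        return False
    call_rate = called / n_samples
    b_alleles = (case_counts[1] + control_counts[1]
                 + 2 * (case_counts[2] + control_counts[2]))
    freq_b = b_alleles / (2 * called)
    maf = min(freq_b, 1 - freq_b)
    hwe_p = hwe_chi2_pvalue(*control_counts)  # HWE is tested in controls only
    return call_rate >= min_call_rate and maf >= min_maf and hwe_p >= min_hwe_p
```

A variant is kept only if it clears all three thresholds; the default values mirror those quoted in the text and can be overridden per stage.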

SNP selection and genotyping for replication

To replicate the association results of the Exome_Asian array, we further analysed the

60 top variants in an additional 8,096 samples (2,024 cases and 6,072 controls)

(Supplementary Table S1) using the Sequenom MassARRAY system (Sequenom, Inc.,

USA) and Multiplex SnapShot technology (Applied Biosystems, Inc., USA); six of

these SNPs failed primer design. All of these selected SNPs met the following

quality criteria: 1) the MAF was higher than 1% in both the cases and controls; 2) HWE

in the controls was P ≥ 0.01 and the HWE in the cases was P > 10⁻⁴; 3) in each locus,

one or two of the most significant SNPs were selected for validation.

Statistical analyses

Single variant association analyses were performed to test for disease-SNP associations,

assuming an additive allelic effect and using logistic regression in each stage. The

Cochran-Armitage trend test was conducted in these two-stage samples. We performed

heterogeneity tests (I² and P values of the Q statistics) between the two groups using

the Breslow-Day test (20), and the extent of heterogeneity was assessed using the I²

index (21). To improve the statistical power, we combined the association results from the two

stages using meta-analysis. The fixed effect model (Mantel-Haenszel) (22) was applied

when I² was less than 30%. Otherwise, the random effect model (DerSimonian-Laird)

(23) was implemented.
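
The model-selection rule above (fixed effect when I² < 30%, random effect otherwise) can be illustrated with a short sketch. Note the assumptions: the fixed-effect estimate here uses inverse-variance weighting of per-stage log odds ratios as a stand-in for the Mantel-Haenszel method named in the text, and the function name and inputs are hypothetical.

```python
import math


def meta_analysis(log_ors, ses, i2_threshold=30.0):
    """Combine per-stage log odds ratios, choosing the model from I².

    log_ors: per-stage log odds ratios; ses: their standard errors.
    Fixed effect: inverse-variance weighting (assumption, not Mantel-Haenszel).
    Random effect: DerSimonian-Laird estimate of between-stage variance.
    """
    w = [1 / se ** 2 for se in ses]
    sw = sum(w)
    fixed = sum(wi * b for wi, b in zip(w, log_ors)) / sw
    # Cochran's Q and the I² index (percentage of variation due to heterogeneity)
    q = sum(wi * (b - fixed) ** 2 for wi, b in zip(w, log_ors))
    df = len(log_ors) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    if i2 < i2_threshold:
        model, estimate, se = "fixed", fixed, math.sqrt(1 / sw)
    else:
        c = sw - sum(wi ** 2 for wi in w) / sw
        tau2 = max(0.0, (q - df) / c)  # between-stage variance
        w_star = [1 / (s ** 2 + tau2) for s in ses]
        estimate = sum(wi * b for wi, b in zip(w_star, log_ors)) / sum(w_star)
        se = math.sqrt(1 / sum(w_star))
        model = "random"

    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal P value
    return {"model": model, "estimate": estimate, "se": se, "i2": i2, "p": p}
```

With two stages, as in this design, Q has a single degree of freedom, so the I² estimate is coarse and the threshold acts mainly as a guard against clearly discordant stages.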

Cell culture

The experiments were performed using MCF7, MDA-MB-231 and T47D breast cancer

cell lines that were originally purchased from American Type Culture Collection

(ATCC) in 2011, and regularly tested for mycoplasma. Only mycoplasma negative cells

were used for experimentation. All the cell lines are usually used for 4 to 9 passages

from the initial expansion and frozen down. MCF7 cells were grown in RPMI1640

medium (Sigma), and MDA-MB-231 was grown in low glucose DMEM (21885025,

Invitrogen). All media used for cell culture were supplemented with 10% FBS and

1% penicillin and streptomycin (Sigma). All cells were grown at 37°C with 95% air and

5% CO2.

siRNA transfection

The siRNAs used in this experiment were purchased from Qiagen and are listed in

Supplementary Table S2. The siRNA knockdown assay was performed as described

previously (24). In brief, 50-60% confluent MDA-MB-231 cells were seeded in 6-well

plates. Twelve hours later, we transfected the cells with siRNA following the instructions

for HiPerFect Transfection Reagent (301705, Qiagen). The cells were collected for RNA

purification after 48 hours.

Quantitative real time PCR

PureLink® RNA Mini Kit (12183018A, Invitrogen) was used to isolate RNA from cells.

The DNA was removed by RNase-Free DNase (79254, QIAGEN). High-Capacity

cDNA Reverse Transcription Kit (4368814, Applied Biosystems) was applied to

synthesize cDNA from RNA. Quantitative RT-PCR reactions were performed by using

the SYBR Select Master Mix (4472908, Applied Biosystems). We selected two high

specificity primers for each target. Primer sequences used in this experiment can be

found in Supplementary Table S3. To determine the mRNA level of each

gene, three replicates were performed and the data were normalized

against an endogenous ACTB (β-actin) control.
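
The text states only that expression was normalized to ACTB. Assuming the widely used 2^(-ΔΔCt) method (an assumption, not confirmed by the text), the calculation can be sketched as below; the function name and the Ct values in the usage note are hypothetical.

```python
from statistics import mean


def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene by the 2^(-ddCt) method (assumed here).

    Each argument is a list of replicate Ct values; the reference gene
    plays the role of ACTB in the text.
    """
    d_ct_sample = mean(ct_target_sample) - mean(ct_ref_sample)
    d_ct_control = mean(ct_target_control) - mean(ct_ref_control)
    dd_ct = d_ct_sample - d_ct_control
    return 2 ** (-dd_ct)
```

For example, triplicate Ct values of 26 (target) and 18 (reference) in knockdown cells against 24 and 18 in control cells give a ΔΔCt of 2, that is, a fold change of 0.25.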

Plasmids, gene and SNP region cloning, and Site-directed mutagenesis

The cDNA of human C21orf58 or ZNF526 gene was amplified from a human cDNA

library and cloned into pcDNA3.1-V5 vector. Primer sequences are shown

in Supplementary Table S3. Site-directed mutagenesis was also performed to obtain the G

allele at the rs3810151 site of ZNF526 cDNA and the A allele at the rs13047478 site of

C21orf58 cDNA in the pcDNA3.1-V5 constructs.

For cloning SNP and promoter regions, the pGL3 basic and pGL3 or pGL4 promoter

vectors (Promega) were used; both vectors encode the luciferase reporter gene luc2

(Photinus pyralis). Experimental inserts of 787bp for rs13047478 (chr21: 47734341-

47735127, GRCh37/hg19) and 751bp for rs3810151 (chr19: 42728444-42729194,

GRCh37/hg19) were amplified from the VCaP genomic DNA using the cloning primers

listed in Supplementary Table S3. The inserts were cloned upstream of the luciferase gene

into pGL3 promoter vectors, and the insert sequences and alleles of SNPs rs13047478 and

rs3810151 were confirmed by sequencing. The alleles determined were rs13047478-A

and rs3810151-A. To measure allele-specific enhancer activity, the

determined alleles were mutated to rs13047478-G and rs3810151-G by using site-

directed mutagenesis primers (Supplementary Table S3). In order to eliminate the

possibility of enhancer activity from regions other than SNP containing regions, two

control regions from C21orf58 gene were selected and cloned into pGL3 promoter

vector. Fragment I is an intergenic region between YBEY and C21orf58; fragment

II is a random intronic region of C21orf58.

To test and validate the allele-specific impact of rs13047478 on gene-specific promoter

regions, MCM3AP and YBEY promoter regions were cloned downstream of SNP

regions and upstream of luciferase gene into pGL3 basic vector using the primers listed

in Supplementary Table S3. In addition, promoter regions of MCM3AP and YBEY

were also cloned into pGL3 basic vector to run as control along with the empty pGL3

basic vector. All cloning inserts were confirmed by sequencing. In all reporter assays a

Renilla luciferase reporter vector pGL75 (Promega) was used as an internal control for

transfection efficiency.

Transient Transfection

For plasmid transfection (pcDNA3.1-V5-C21orf58/ZNF526) on 6-well culture plates,

1.5 × 10⁶ breast cancer MCF7 cells per well were seeded. Transient transfections were

performed using Lipofectamine™ 3000 Transfection Reagent (Thermo Fisher Scientific)

following the manufacturer's instructions. After 48 h, cells were harvested for western blot

analysis.

Western blot analysis

Cell lysate was prepared in lysis buffer (600 mM NaCl, 1% Triton X-100 in PBS).

Protein samples were denatured in 1× SDS loading buffer (Thermo Scientific) and 100

mM DTT, separated by 10% SDS-PAGE and blotted onto 0.45-μm PVDF transfer

membrane (Immobilon-P, Millipore) with a Semi-Dry transfer cell (Trans-Blot SD,

Bio-Rad). Membranes were blocked with 5% nonfat milk (Cell Signaling Technology)

in TBST (50 mM Tris-HCl, pH 7.5, 150 mM NaCl, 0.05% Tween-20) and then exposed

to antibodies (1:5,000 dilution) targeting V5 tag (Monoclonal Antibody, HRP, R961-

25, Invitrogen) and actin (ab20272, abcam). Membranes were developed with

SuperSignal West Femto Maximum Sensitivity Substrate (34095, Thermo Scientific).

Membranes were imaged with a LAS-3000 Luminescent Image Analyzer (FujiFilm).

Cell viability and proliferation assays

HiPerFect Transfection Reagent (301705, Qiagen) was used for reverse transfection of

control siRNA (1027281), C21orf58 siRNA (SI04208631, SI04282838, SI04314142

and SI04367083), and ZNF526 siRNA (SI00775593, SI04130966, SI04205859 and

SI04230422) into MDA-MB-231 cells. The detailed protocol was as described

previously (48). Briefly, 6 µM siRNA was diluted in 18 µl Opti-MEM, and then 1.5 µl

HiPerFect Transfection Reagent was added and mixed for each well of a 96-well plate.

After incubation for 10 min, cells (2.5 × 10³ per well) were added. We added XTT

reagent (11465015001, Roche) at 3 days and 5 days and measured the

absorbance at 450 nm according to the manufacturer's instructions. Data were

collected from five replicate wells and analysed with a two-tailed t test to determine the

significance of differences between target cells at each time point.

Transfection and enhancer reporter assay

For the enhancer reporter assay, cells were cultured in white 96-well plates with 100 µl

of MCF7 suspension at 4 × 10⁵ cells/ml per well. Cells were reverse transfected

according to the manufacturer's protocol with 100 ng of experimental and control

plasmids and 4 ng of pGL75 as an internal control per well using X-tremeGENE™ HP

DNA Transfection Reagent (Roche) and incubated at 37 °C. After 48 hours the

luciferase activity was measured using the Dual-Glo® Luciferase Assay System

(Promega) following the manufacturer's protocol. In all transfection reactions, three

replicates were made, which were compared with control plasmid for enhancer activity

determination.
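
As a rough illustration of how dual-luciferase readings of this kind are commonly reduced to a single enhancer activity value, the sketch below divides the firefly signal by the Renilla internal control in each well, averages the replicates, and expresses the result relative to the control plasmid. The function name and data layout are hypothetical, and the exact normalization used by the authors is not specified in the text.

```python
from statistics import mean


def relative_luciferase_activity(test_wells, control_wells):
    """Enhancer activity of a test plasmid relative to a control plasmid.

    Each well is a (firefly, renilla) pair of luminescence readings.
    The firefly/Renilla ratio corrects for transfection efficiency.
    """
    test_ratio = mean(firefly / renilla for firefly, renilla in test_wells)
    control_ratio = mean(firefly / renilla for firefly, renilla in control_wells)
    return test_ratio / control_ratio
```

A value above 1 indicates higher reporter activity for the test insert than for the control plasmid under this normalization.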

Differential gene expression analysis of the clinical breast cancer data sets

The differential gene expression analyses were performed in order to identify which

transcripts/genes from the breast cancer tumor samples were being produced at a

significantly higher or lower level than that in the healthy tissues. Our study involved

a batch of datasets from Oncomine (25) and TCGA cohorts (26). Genes with missing

expression value in more than 50% of total samples were not taken for analysis. Mann-

Whitney U test was employed to investigate the differential gene expression between

the tumor samples and normal samples. Kruskal–Wallis H test was also applied in

certain datasets where more than two groups were available. Figures were produced

with R (27).

Survival analysis

To investigate the association of the expression of certain genes and the overall survival

or biochemical relapse rate in breast cancer, we analyzed a batch of datasets from

Oncomine (25) and TCGA (26). Gene expression values from Oncomine are

microarray-based, whereas those from TCGA are RNA-seq-based.

The Kaplan-Meier survival function was applied in the survival analyses. The idea is

to define the probability of surviving to a certain time period. The survival probability

or biochemical event free probability at any particular time period is calculated by the

formula shown below:
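
The formula is not reproduced in this text. Assuming the authors refer to the standard Kaplan-Meier product-limit estimator, it can be written as:

```latex
\hat{S}(t) = \prod_{i:\, t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)
```

Here t_i denotes the distinct event times, d_i the number of events (deaths or biochemical relapses) at t_i, and n_i the number of subjects still at risk just before t_i.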

In our analyses, we did not take into account the genes whose expression data were

missing in more than 50% of total samples. We used the average gene expression value

as a measurement to stratify tumor samples into two groups. The strategy for

stratification is illustrated as follows:

Higher expression: Expression of gene X in subject i > mean (gene X in all subjects k)

Lower expression: Expression of gene X in subject i < mean (gene X in all subjects k)

where i ∈ [1, k]

Subjects with a gene expression value above the sample mean were assigned to the

higher expression group, and those below the mean to the lower

expression group. In other words, positive/negative scores relative to the mean indicate

higher/lower gene expression compared to the average expression. We then performed

the Kaplan-Meier analysis based on the stratification, and figures were plotted using the R

package “survplot” (27) with modifications to fit our needs.
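
The stratification and survival estimate described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' R code: the function names are hypothetical, and subjects whose expression equals the mean exactly (a case the text does not define) are assigned to the lower group here by assumption.

```python
from statistics import mean


def stratify_by_mean(expression):
    """Label each subject 'higher' or 'lower' relative to the cohort mean.

    Ties with the mean are labelled 'lower' (assumption; undefined in the text).
    """
    m = mean(expression)
    return ["higher" if x > m else "lower" for x in expression]


def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up time per subject; events: 1 = event, 0 = censored.
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all subjects sharing the same follow-up time
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving  # events and censored subjects leave the risk set
    return curve
```

In practice the curve would be computed separately for the "higher" and "lower" groups and the two curves compared, for example with a log-rank test.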

Ethics approval and consent to participate

All participants have given written and informed consent. The study was approved by

the institutional ethics committee of each hospital and was conducted according to the

Declaration of Helsinki principles.

Results

Overview of exome-wide association analyses

To identify novel loci conferring susceptibility to breast cancer, we conducted a large-

scale exome-wide association study using a two-stage case-control design (10,055

cases and 14,107 controls) (Table 1) in a Han Chinese population by using the Illumina

Human Exome_Asian Array (Illumina, Inc., USA), Sequenom MassArray system

(Sequenom, Inc., USA) and Multiplex SnapShot technology (Applied Biosystems, Inc.,

USA). In the discovery stage, 272,744 markers were genotyped in 8,031 cases and

8,035 controls using the Exome_Asian Array (Figure 1). After quality control and

principal component analysis (Online Methods), 33,347 non-MHC single-nucleotide

polymorphisms (SNPs) were identified and selected for further analysis. Quantile-

quantile (QQ) plots and Manhattan plots were generated using the Cochran-Armitage

test for trend (Figure 2 and Supplementary Figure S1-S2). A clear deviation from the

expected null distribution was observed in the QQ plot (Supplementary Figure S2).
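
For readers unfamiliar with the Cochran-Armitage test for trend, the following minimal sketch computes it for a single SNP from genotype counts, using additive weights (0, 1, 2) for the number of minor alleles. The function name and the counts in the usage note are hypothetical; this is not the software used in the study.

```python
import math


def cochran_armitage_trend(case_counts, control_counts, weights=(0, 1, 2)):
    """Cochran-Armitage trend test on a 2 x 3 genotype table.

    case_counts / control_counts: counts for genotypes carrying 0, 1 and 2
    copies of the tested allele. Returns (z, two-sided P value).
    """
    totals = [r + s for r, s in zip(case_counts, control_counts)]
    n_cases = sum(case_counts)
    n_controls = sum(control_counts)
    n = n_cases + n_controls

    sum_t_r = sum(t * r for t, r in zip(weights, case_counts))
    sum_t_n = sum(t * c for t, c in zip(weights, totals))
    sum_t2_n = sum(t * t * c for t, c in zip(weights, totals))

    # Trend statistic and its variance under the null hypothesis
    stat = n * sum_t_r - n_cases * sum_t_n
    var = n_cases * n_controls * (n * sum_t2_n - sum_t_n ** 2) / n
    if var == 0:
        return 0.0, 1.0  # monomorphic variant or empty group: no trend testable

    z = stat / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal P value
    return z, p
```

For example, cases with genotype counts (10, 20, 30) against controls with (30, 20, 10) give z² = 20, while identical genotype distributions in cases and controls give z = 0 and P = 1.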

Using the genome-wide results from the discovery stage, we first investigated the

evidence for the previously reported GWAS loci in breast cancer. To date, 125 breast

cancer susceptibility SNPs within 98 loci have been discovered at genome-wide

significance (P < 5 × 10⁻⁸) (Supplementary Table S1). In this study, twenty-three of

these SNPs were directly covered by the Exome_Asian array and passed our quality

control. Significant associations were observed for 15 known breast cancer risk variants

within 10 loci such as CCDC170 (rs3734805, P = 7.10 × 10⁻²⁹), ESR1 (rs2046210, P =

9.08 × 10⁻²⁵), TOX3 (rs4784227, P = 3.19 × 10⁻¹⁹), TNRC9 (rs3803662, P = 4.99 × 10⁻¹¹),

and FGFR2 (rs2981579, P = 7.80 × 10⁻⁸ and rs1219648, P = 2.69 × 10⁻⁷) (Supplementary

Table S5 and Supplementary Figure S3). Our data also showed nominal association

for another three previously reported SNPs within three loci (ANKLE1 rs2363956,

FGF10 rs4415084, and PTPN22 rs11552449; P < 0.05 each). Most of the twenty-three

SNPs also showed effects in the same direction as in previously reported

studies, though some variants displayed no significant association with breast cancer

risk in our study cohort. Together, these analyses not only confirmed the associations

of these 15 SNPs within 10 independent reported loci with breast cancer in the Chinese

population, but also supported the quality of the genotype data obtained from the

Exome_Asian Array for our downstream analyses.

Discovery of new susceptibility loci for breast cancer

To identify true genetic factors and novel susceptibility loci for breast cancer, we next

selected the top 60 SNPs with P values of less than 10⁻⁶ (Supplementary Table S1)

for the stage 2 replication study (six of them failed primer design). These selected

SNPs were further genotyped in an independent replication cohort including 2,024

cases and 6,072 controls. We evaluated whether the SNPs that passed the replication stage

achieved nominal evidence of association without Bonferroni correction, similar to the

approach adopted in previous studies (28-30). During quality

control, we also evaluated association heterogeneity for these SNPs between the

discovery and the replication studies. After quality control at the replication stage, 14

SNPs at 14 different loci exhibited significant or nominal association with breast cancer

(1.93 × 10⁻²² ...). Meta-analysis of the SNPs in the discovery (stage 1) and replication (stage 2) studies

identified two new missense variants, in ZNF526 (rs3810151, Pmeta = 7.60 × 10⁻⁹)

and C21orf58 (rs13047478, Pmeta = 4.52 × 10⁻⁸). In addition, a new non-coding variant

was identified at 7q21.11 (rs7807771, Pmeta = 6.71 × 10⁻⁹) (Table 2 and Supplementary

Figure S4-S5). Notably, none of these three SNPs exhibited any significant

heterogeneity in the discovery and the replication studies. For the three novel SNPs, we

analysed the linkage disequilibrium (LD) patterns using the genotyping data (only

SNPs with MAF > 0.01) from our Illumina Human Exome_Asian Array in

Haploview (31) (Supplementary Figure S6).

Functional analyses of C21orf58 and ZNF526 in breast cancer

In the meta-analysis of this exome-wide association study, we discovered two novel

coding variants associated with breast cancer, rs13047478 within the exon of C21orf58

(chromosome 21 open reading frame 58) at 21q22.3 and rs3810151 in the zinc finger

protein 526 gene (ZNF526) at 19q13.2. The role of C21orf58 and ZNF526 at these two

loci in breast cancer development remains totally unknown. Using the data of genome-

wide CRISPR-Cas9-based loss-of-function screens in 33 cancer cell lines for the

identification of genes that are essential for cell growth and survival (32), we found that

C21orf58 and ZNF526 were important for the survival of breast cancer cells (Figure

3A,B and Supplementary Figure S7), suggesting that these genes have a previously uncharacterized

function in the control of breast cancer cell growth. Consistent with this, our cell

proliferation assays showed that breast cancer cells harboring short interfering RNA

(siRNA) against C21orf58 or ZNF526 showed markedly reduced cell growth and

viability compared to cells harboring control siRNA (Figure 3C,D and

Supplementary Figure S8). Furthermore, using the Oncomine analysis tool (25), we

compared the mRNA expression levels of C21orf58 and ZNF526 in the Finak breast

cancer dataset (33), and found that both genes were highly expressed in invasive breast

carcinoma compared to normal breast (Figure 3E). The analysis of several additional

large-scale clinical data sets (26,34,35) showed that C21orf58 and ZNF526 were greatly

upregulated in breast cancer in comparison with normal breast tissues (Figure 3F,G

and Supplementary Figure S9A). Furthermore, high mRNA levels of C21orf58 or

ZNF526 showed marginal associations with poor survival in multiple independent

cohorts of breast cancer patients (36-38) (Supplementary Figure S9B-D). Together,

our analyses reveal a previously unknown role of C21orf58 and ZNF526 in breast

carcinogenesis.

Functional annotation of the variants at the three novel breast cancer susceptibility loci

The SNP rs13047478 at 21q22.3 is located within the exon of C21orf58, which results

in an amino acid change of proline to serine. C21orf58 is an uncharacterized gene, and

its role in breast cancer remains completely unknown. Here we provide experimental

and clinical evidence of a potential role for C21orf58 in breast cell growth and

carcinogenesis (Figure 3 and Supplementary Figure S7-S9). We next cloned

C21orf58 with different alleles of rs13047478 into mammalian expression vector

pcDNA3.1-V5 and examined the effect of rs13047478 on C21orf58 expression by

transient transfection in the breast cancer cell line MCF7 (see Methods). We found that

the expression levels of C21orf58 with the A allele of rs13047478 are approximately

1.5-fold higher than that of C21orf58 with the G allele (Supplementary Figure S10A).

In contrast, we observed no impact of the rs13047478-associated amino acid change on

the function of C21orf58 in the growth control of breast cancer cell line

(Supplementary Figure S10B).

Although rs13047478 in C21orf58 is a coding variant that causes an

amino acid substitution, by querying a large collection of ChIP-seq datasets (39), we

observed binding of multiple transcription factors and epigenetic features at the

rs13047478 region (Supplementary Figure S11), suggesting this SNP-containing

genomic region may be a possible exonic transcriptional regulatory element with

impact on gene expression (40). Consistent with this observation, our enhancer

luciferase reporter assay showed that the rs13047478 region may possess enhancer

activity (Figure 4A). We also mapped rs13047478 within several transcription factor

DNA-binding position weight matrices (PWMs) derived from the HaploReg database

(41,42) (Supplementary Table S7) but found no direct impact of rs13047478 on

PWMs of the transcription factors enriched at rs13047478 region (Supplementary

Figure S11).

We next performed eQTL analysis using the Genotype-Tissue Expression (GTEx)

database (43) and unexpectedly revealed a significant association of rs13047478 with

several genes across many types of human tissues and cells. In particular, we found that

rs13047478 was an eQTL for the genes MCM3AP (P = 4.90 × 10⁻⁸) and YBEY (P

= 3.00 × 10⁻¹¹) in normal breast tissues (Figure 4B,C). Chromatin looping data (44)

indicated direct physical interactions among rs13047478/C21orf58, MCM3AP and

YBEY in the breast cancer cell line MCF7 (Figure 4D). To directly test the effect of the

rs13047478-containing enhancer in MCM3AP or YBEY regulation, we inserted

rs13047478-centered DNA fragment upstream of the MCM3AP or YBEY promoter in a

pGL3-Basic vector and performed luciferase reporter assays in MCF7 cells (Figure

4E,F). The results showed that, compared to the A allele of rs13047478, the G allele

showed lower activity on the basal MCM3AP promoter and higher activity on the

YBEY promoter, consistent with the eQTL results, which showed a significant association of

the G allele of rs13047478 with decreased mRNA levels of MCM3AP and elevated

expression of YBEY in breast mammary tissues (Figure 4B,C). Collectively, these

analyses suggest a causal effect of rs13047478 on the expression of MCM3AP and

YBEY.

A recent study showed that the expression of MCM3AP was significantly

decreased in human breast tumors (45). In addition, MCM3AP can serve as an

independent predictor and its lower expression was associated with poor prognosis of

patients with breast cancer. Mammary gland-specific MCM3AP knockout mice showed

severe impairment of mammary gland development during pregnancy and were more

likely to develop mammary gland tumors (45). Moreover, tumor formation also

occurred in female mice with MCM3AP heterozygosity. In addition, MCM3AP plays a

significant role in the suppression of DNA damage caused by estrogen in human breast

cancer cell lines (45). Together, these results indicate that MCM3AP is associated

with breast cancer resistance. Consistent with these observations, we found that

MCM3AP was greatly downregulated in breast cancer compared with normal breast

samples in a cohort of over 2000 breast cancer patients (33) (Supplementary Figure

S12A). Furthermore, we observed that lower expression of MCM3AP showed a strong

association with decreased metastasis-free survival in a collection of 195 breast tumors

(46) (Supplementary Figure S12B). Together, these results suggest a causal role for

MCM3AP in breast cancer.

Although no link has been suggested between the other rs13047478 eQTL gene YBEY and

breast cancer, we observed that YBEY is likely to be essential for the growth and

survival of ER-positive breast cancer cells (Supplementary Figure S13A,B).

Furthermore, we found that YBEY was greatly upregulated in breast cancers in

comparison with adjacent normal breast tissues (33) (Supplementary Figure S13C).

Notably, higher expression of YBEY and lower mRNA levels of MCM3AP in breast

cancer are also consistent with the stronger activity of the YBEY promoter and the weaker activity

of the MCM3AP promoter, respectively, observed in the luciferase reporter experiments in

MCF7 cells (Figure 4E,F). Altogether, these analyses point to potential roles for the

rs13047478 eQTL genes MCM3AP and YBEY in breast cancer, and indicate a likely

function of rs13047478 as an exonic enhancer variant. Further work will be needed to

address whether rs13047478 tags regulatory genetic variants within active

transcription factor binding sites and enhancers that regulate the expression

of MCM3AP and YBEY and thereby confer breast cancer susceptibility.

The SNP rs3810151 at 19q13.2 is located within the ZNF526 gene and is a missense

variant in which the minor allele G results in a valine-to-alanine amino acid change in

ZNF526. This single amino acid change also showed a slight effect on ZNF526

expression (Supplementary Figure S10C). ZNF526 is a member of the zinc finger protein

family and its exact function is unclear. Interestingly, many zinc finger proteins have

been reported to be associated with breast cancer, such as ZNF365 (47-49) and ZNF545

(50). Here, we show that ZNF526 may be essential for the survival of breast cancer

cells and is strikingly upregulated in breast cancer samples compared with

normal samples, indicating a potential function of ZNF526 in breast tumorigenesis (Figure 3

and Supplementary Figures S7-S9).

Similar to rs13047478, we observed that the rs3810151-containing region in

ZNF526 also had the features of a cis-regulatory element, including binding of several transcription

factors, active chromatin marks and enhancer activity (39) (Supplementary Figure

S14A,B), that may impact gene regulation. Regulatory motif analysis of the DNA

sequence surrounding rs3810151 showed that rs3810151 may alter the position weight matrices (PWMs) of several

transcription factors (41,42) (Supplementary Table S7), but not those of ESR1, FOXA1,

MYC and other factors that occupy the rs3810151 region (39) (Supplementary Figure S14A).

We next performed an eQTL analysis to examine whether the variant rs3810151

correlates with the expression of nearby genes, using a publicly available database (51). This

analysis revealed a significant cis-association of rs3810151 with the expression of two

genes: PAFAH1B3 (P = 8.39 × 10⁻⁸) and CNFN (P = 3.77 × 10⁻⁴) (Supplementary

Figure S14C). Consistent with this, a long-range chromatin interaction between

rs3810151/ZNF526 and PAFAH1B3 was observed in MCF7 breast cancer cells

using the ChIA-PET method (44) (Supplementary Figure S15). PAFAH1B3 has been

previously identified as a key metabolic driver of breast cancer pathogenicity, which is

upregulated in primary human breast tumors and correlated with poor prognosis (52).

Metabolomic profiling suggests that PAFAH1B3 inactivation attenuates cancer

pathogenicity through enhancing tumor-suppressing signaling lipids (52). Consistent

with this putative oncogenic role of PAFAH1B3 in breast cancer, our analysis of multiple

large-scale breast cancer data sets revealed that PAFAH1B3 was highly expressed in

breast tumor samples and significantly associated with poor prognosis of the patients

with breast cancer (26, 34, 35, 37, 46, 53) (Supplementary Figure S16). We also

observed a significant upregulation of the other rs3810151 eQTL gene CNFN in breast

tumor tissues (26, 34) (Supplementary Figure S17A,B), and an increased risk of relapse

in breast cancer patients harboring tumors with high CNFN mRNA

levels (54) (Supplementary Figure S17C).

In addition to the two novel regulatory exonic SNPs, we identified a new

non-coding variant, rs7807771 (7q21.11). Regulatory motif analysis indicated that

rs7807771 may affect the DNA-binding PWMs of several transcription factors (41,42)

(Supplementary Table S7), whereas we observed no enrichment of transcription factor

binding or active chromatin marks at this region based on a query of a large

collection of ChIP-seq data (39). rs7807771 is located 332 kb upstream of SEMA3D.

SEMA3D is a member of the class-3 semaphorin family, and could inhibit tumor

development through affecting the expression of appropriate semaphorin receptors (55).

The expression of SEMA3D is increased in pancreatic ductal adenocarcinoma (PDA)

tumors. Mouse PDA cells in which SEMA3D was knocked down exhibited decreased

invasive and metastatic potential in culture and in mice (56). Consistent with its

oncogenic property, we showed that SEMA3D was highly upregulated in breast cancer

tissues (34) (Supplementary Figure S18).

Protein-protein interaction and pathway enrichment analysis

Finally, we evaluated the connectivity at the protein-protein interaction (PPI) level for

the genes at 98 previously reported GWAS loci in breast cancer (Supplementary Table

S4) and the three novel loci discovered in this study (Table 2). Using the Search Tool for the

Retrieval of Interacting Genes/Proteins (STRING) database (57), we observed a

significant PPI enrichment (P-value: 0, Hypergeometric test; Supplementary Figure

S19) for these genes, suggesting that they are at least partially connected functionally and

biologically. The most significantly overrepresented pathways are related to

mammary gland epithelium development (Supplementary Table S8) (P = 2.62 × 10⁻⁶).
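As a rough illustration of how an over-representation P value of this kind can be obtained from a hypergeometric test, the following minimal Python sketch computes the upper-tail probability of observing at least a given overlap between a gene list and a pathway. All counts (BACKGROUND_GENES, PATHWAY_GENES, STUDY_GENES, OVERLAP) and the function name are hypothetical placeholders chosen for illustration; they are not values from this study, and the STRING implementation may differ in its details.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """Return P(X >= k) for X ~ Hypergeometric(population, successes, draws).

    population: number of background genes
    successes:  number of background genes annotated to the pathway
    draws:      number of genes in the study list
    k:          observed overlap between the study list and the pathway
    """
    if k <= 0:
        return 1.0
    max_k = min(successes, draws)
    total = comb(population, draws)
    # Sum the exact integer counts first, then divide once to limit rounding.
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, max_k + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only (not taken from the study).
BACKGROUND_GENES = 20000
PATHWAY_GENES = 150
STUDY_GENES = 100
OVERLAP = 8

p_value = hypergeom_upper_tail(OVERLAP, BACKGROUND_GENES, PATHWAY_GENES, STUDY_GENES)
print(f"P(X >= {OVERLAP}) = {p_value:.3g}")
```

With very strong enrichment the tail probability can become smaller than the precision at which a tool reports it, so a displayed value of 0 presumably indicates a probability below that reporting precision rather than an exact zero.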

Discussion

In this study, to the best of our knowledge, we performed the first comprehensive

assessment of coding variation using the exome array for breast cancer in Han Chinese

women. This analysis led to the identification of two novel missense variants within

two uncharacterized genes (ZNF526 and C21orf58) in breast cancer, and a new non-

coding variant at 7q21.11. These loci have not been discovered in previous GWAS and

other genetic association studies in breast cancer. We showed that ZNF526 and

C21orf58 play roles in breast cancer cell growth and disease progression. We

unexpectedly found that the two missense variants act as regulatory coding SNPs

with eQTL associations to several genes, including MCM3AP, YBEY, PAFAH1B3 and CNFN,

that are potentially important for breast cancer. Our findings suggest that genetic

variants and genes at these loci contribute to the development of breast cancer. Our

work highlights polygenic contributions to the pathogenesis of breast cancer and

identifies additional susceptibility loci for breast cancer. We acknowledge that, although

the overall evidence reaches genome-wide significance for the three newly-identified

breast cancer risk loci, the sample size and thus the power of the validation study are

rather limited. Further studies in large independent samples will be needed to replicate

these findings. In addition, future studies involving fine-mapping of targeted regions

and deep functional studies are required to delineate the molecular mechanisms

underlying the risk variants identified in this study.

Acknowledgements

We thank the State Key Laboratory Incubation Base of Dermatology, Ministry of

National Science and Technology (Hefei, China) for providing the research platform. This

work was supported by the Young Program of the National Natural Science Foundation

of China awarded to B. Zhang (81301771), and by the Academy of Finland (284618 and

279760), University of Oulu Strategic Funds and Jane & Aatos Erkko Foundation

grants awarded to G.-H. Wei.

References

1 DeSantis C, Ma J, Bryan L, and Jemal A. Breast cancer statistics, 2013. CA Cancer J Clin 2014;64: 52-62.

2 Hortobagyi GN, de la Garza Salazar J, Pritchard K, Amadori D, Haidinger R, Hudis CA, et al. The global breast cancer burden: variations in epidemiology and survival. Clin Breast Cancer 2005;6: 391-401.

3 Fachal L and Dunning AM. From candidate gene studies to GWAS and post-GWAS analyses in breast cancer. Curr Opin Genet Dev 2015;30: 32-41.

4 Complexo, Southey MC, Park DJ, Nguyen-Dumont T, Campbell I, Thompson E, et al. COMPLEXO: identifying the missing heritability of breast cancer via next generation collaboration. Breast Cancer Res 2013;15: 402.

5 So HC, Gui AH, Cherny SS, and Sham PC. Evaluating the heritability explained by known susceptibility variants: a survey of ten complex diseases. Genet Epidemiol 2011;35: 310-7.

6 Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 2012;337: 1190-5.

7 Ahituv N, Kavaslar N, Schackwitz W, Ustaszewska A, Martin J, Hebert S, et al. Medical sequencing at the extremes of human body mass. Am J Hum Genet 2007;80: 779-91.

8 Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, and Hobbs HH. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science 2004;305: 869-72.

9 Ji W, Foo JN, O'Roak BJ, Zhao H, Larson MG, Simon DB, et al. Rare independent mutations in renal salt handling genes contribute to blood pressure variation. Nat Genet 2008;40: 592-9.

10 Mardis ER. The impact of next-generation sequencing technology on genetics. Trends Genet 2008;24: 133-41.

11 Diogo D, Kurreeman F, Stahl EA, Liao KP, Gupta N, Greenberg JD, et al. Rare, low-frequency, and common variants in the protein-coding sequence of biological candidate genes from GWASs contribute to risk of rheumatoid arthritis. Am J Hum Genet 2013;92: 15-27.

12 Kiezun A, Garimella K, Do R, Stitziel NO, Neale BM, McLaren PJ, et al. Exome sequencing and the genetic basis of complex traits. Nat Genet 2012;44: 623-30.

13 Momozawa Y, Mni M, Nakamura K, Coppieters W, Almer S, Amininejad L, et al. Resequencing of positional candidates identifies low frequency IL23R coding variants protecting against inflammatory bowel disease. Nat Genet 2011;43: 43-7.

14 Rivas MA, Beaudoin M, Gardet A, Stevens C, Sharma Y, Zhang CK, et al. Deep resequencing of GWAS loci identifies independent rare variants associated with inflammatory bowel disease. Nat Genet 2011;43: 1066-73.

15 Zhan X, Larson DE, Wang C, Koboldt DC, Sergeev YV, Fulton RS, et al. Identification of a rare coding variant in complement 3 associated with age-related macular degeneration. Nat Genet 2013;45: 1375-9.

16 Kozlitina J, Smagris E, Stender S, Nordestgaard BG, Zhou HH, Tybjaerg-Hansen A, et al. Exome-wide association study identifies a TM6SF2 variant that confers susceptibility to nonalcoholic fatty liver disease. Nat Genet 2014;46: 352-6.

17 Tang H, Jin X, Li Y, Jiang H, Tang X, Yang X, et al. A large-scale screen for coding variants predisposing to psoriasis. Nat Genet 2014;46: 45-50.

18 Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007;81: 559-75.

19 Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, and Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 2006;38: 904-9.

20 Higgins JP and Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med 2002;21: 1539-58.

21 Higgins JP, Thompson SG, Deeks JJ, and Altman DG. Measuring inconsistency in meta-analyses. BMJ 2003;327: 557-60.

22 Mantel N and Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst 1959;22: 719-48.

23 DerSimonian R and Laird N. Meta-analysis in clinical trials. Control Clin Trials 1986;7: 177-88.

24 Huang Q, Whitington T, Gao P, Lindberg JF, Yang Y, Sun J, et al. A prostate cancer susceptibility allele at 6q22 increases RFX6 expression by modulating HOXB13 chromatin binding. Nat Genet 2014;46: 126-35.

25 Rhodes DR, Kalyana-Sundaram S, Mahavisno V, Varambally R, Yu J, Briggs BB, et al. Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia 2007;9: 166-80.

26 Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 2012;490: 61-70.

27 R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2016. URL https://www.R-project.org/.

28 Chang X, Zhao Y, Hou C, Glessner J, McDaniel L, Diamond MA, et al. Common variants in MMP20 at 11q22.2 predispose to 11q deletion and neuroblastoma risk. Nat Commun 2017;8: 569.

29 Hoffmann TJ, Passarelli MN, Graff RE, Emami NC, Sakoda LC, Jorgenson E, et al. Genome-wide association study of prostate-specific antigen levels identifies novel loci independent of prostate cancer. Nat Commun 2017;8: 14248.

30 Mhatre S, Wang Z, Nagrani R, Badwe R, Chiplunkar S, Mittal B, et al. Common genetic variation and risk of gallbladder cancer in India: a case-control genome-wide association study. Lancet Oncol 2017;18: 535-44.

31 Barrett JC, Fry B, Maller J, and Daly MJ. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 2005;21: 263-5.

32 Aguirre AJ, Meyers RM, Weir BA, Vazquez F, Zhang CZ, Ben-David U, et al. Genomic Copy Number Dictates a Gene-Independent Cell Response to CRISPR/Cas9 Targeting. Cancer Discov 2016;6: 914-29.

33 Finak G, Bertos N, Pepin F, Sadekova S, Souleimanova M, Zhao H, et al. Stromal gene expression predicts clinical outcome in breast cancer. Nat Med 2008;14: 518-27.

34 Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 2012;486: 346-52.

35 Ma XJ, Dahiya S, Richardson E, Erlander M, and Sgroi DC. Gene expression profiling of the tumor microenvironment during breast cancer progression. Breast Cancer Res 2009;11: R7.

36 Kao KJ, Chang KM, Hsu HC, and Huang AT. Correlation of microarray-based breast cancer molecular subtypes and clinical outcomes: implications for treatment optimization. BMC Cancer 2011;11: 143.

37 Pawitan Y, Bjohle J, Amler L, Borg AL, Egyhazi S, Hall P, et al. Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res 2005;7: R953-64.

38 Loi S, Haibe-Kains B, Desmedt C, Wirapati P, Lallemand F, Tutt AM, et al. Predicting prognosis using molecular profiling in estrogen receptor-positive breast cancer treated with tamoxifen. BMC Genomics 2008;9: 239.

39 Mei S, Qin Q, Wu Q, Sun H, Zheng R, Zang C, et al. Cistrome Data Browser: a data portal for ChIP-Seq and chromatin accessibility data in human and mouse. Nucleic Acids Res 2017;45: D658-62.

40 Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, et al. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res 2012;22: 1059-68.

41 Kheradpour P and Kellis M. Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Res 2014;42: 2976-87.

42 Ward LD and Kellis M. HaploReg v4: systematic mining of putative causal variants, cell types, regulators and target genes for human complex traits and disease. Nucleic Acids Res 2016;44: D877-81.

43 GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat Genet 2013;45: 580-5.

44 Li G, Ruan X, Auerbach RK, Sandhu KS, Zheng M, Wang P, et al. Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation. Cell 2012;148: 84-98.

45 Kuwahara K, Yamamoto-Ibusuki M, Zhang Z, Phimsen S, Gondo N, Yamashita H, et al. GANP protein encoded on human chromosome 21/mouse chromosome 10 is associated with resistance to mammary tumor development. Cancer Sci 2016;107: 469-77.

46 Symmans WF, Hatzis C, Sotiriou C, Andre F, Peintinger F, Regitnig P, et al. Genomic index of sensitivity to endocrine therapy for breast cancer. J Clin Oncol 2010;28: 4111-9.

47 Cai Q, Long J, Lu W, Qu S, Wen W, Kang D, et al. Genome-wide association study identifies breast cancer risk variant at 10q21.2: results from the Asia Breast Cancer Consortium. Hum Mol Genet 2011;20: 4991-9.

48 Turnbull C, Ahmed S, Morrison J, Pernet D, Renwick A, Maranian M, et al. Genome-wide association study identifies five new breast cancer susceptibility loci. Nat Genet 2010;42: 504-7.

49 Lindstrom S, Thompson DJ, Paterson AD, Li J, Gierach GL, Scott C, et al. Corrigendum: genome-wide association study identifies multiple loci associated with both mammographic density and breast cancer risk. Nat Commun 2015;6: 8358.

50 Xiao Y, Xiang T, Luo X, Li C, Li Q, Peng W, et al. Zinc-finger protein 545 inhibits cell proliferation as a tumor suppressor through inducing apoptosis and is disrupted by promoter methylation in breast cancer. PLoS One 2014;9: e110990.

51 Westra HJ, Peters MJ, Esko T, Yaghootkar H, Schurmann C, Kettunen J, et al. Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat Genet 2013;45: 1238-43.

52 Mulvihill MM, Benjamin DI, Ji X, Le Scolan E, Louie SM, Shieh A, et al. Metabolic profiling reveals PAFAH1B3 as a critical driver of breast cancer pathogenicity. Chem Biol 2014;21: 831-40.

53 Pereira B, Chin SF, Rueda OM, Vollan HK, Provenzano E, Bardwell HA, et al. The somatic mutation profiles of 2,433 breast cancers refines their genomic and transcriptomic landscapes. Nat Commun 2016;7: 11479.

54 Ciriello G, Gatza ML, Beck AH, Wilkerson MD, Rhie SK, Pastore A, et al. Comprehensive Molecular Portraits of Invasive Lobular Breast Cancer. Cell 2015;163: 506-19.

55 Kigel B, Varshavsky A, Kessler O, and Neufeld G. Successful inhibition of tumor development by specific class-3 semaphorins is associated with expression of appropriate semaphorin receptors by tumor cells. PLoS One 2008;3: e3287.

56 Foley K, Rucki AA, Xiao Q, Zhou D, Leubner A, Mo G, et al. Semaphorin 3D autocrine signaling mediates the metastatic role of annexin A2 in pancreatic cancer. Sci Signal 2015;8: ra77.

57 Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerta-Cepas J, et al. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res 2015;43: D447-52.

Table 1. Summary of the samples analyzed in this study

Characteristics             Stage 1 (Exome_Asian array)    Stage 2 (Replication)
                            Cases        Controls          Cases        Controls
Sample size                 8031         8035              2024         6072
Mean age (s.d.)             50.8±11.4    37.8±15.0         49.7±11.7    40.4±13.6
Mean age of onset (s.d.)    49.3±10.9    /                 47.6±11.5    /


Table 2. The meta-analysis results of the two stages

Chr        SNP               BP(hg19)    Gene        Function      Allele
19q13.2    rs3810151 (a)     42728836    ZNF526      missense      G/A
21q22.3    rs13047478 (a)    47734659    C21orf58    missense      A/G
7q21.11    rs7807771 (b)     85148963    None        intergenic    G/A

Exome-Asian Array
SNP           Cases     Controls    P value     OR (95%CI)
rs3810151     0.0638    0.0797      3.52E-08    0.78 (0.72-0.86)
rs13047478    0.2838    0.3104      6.79E-07    0.88 (0.84-0.93)
rs7807771     0.0848    0.1026      4.37E-08    0.81 (0.75-0.87)

Genotyping validation
SNP           Cases     Controls    P value     OR (95%CI)
rs3810151     0.0606    0.0699      4.33E-02    0.86 (0.74-0.99)
rs13047478    0.2863    0.3029      4.95E-02    0.92 (0.85-0.99)
rs7807771     0.0810    0.0923      3.00E-02    0.87 (0.76-0.99)

Meta
SNP           P           P-HET     I²     OR (95%CI)
rs3810151     7.60E-09    0.3139    1.4    0.83 (0.77-0.89)
rs13047478    4.52E-08    0.3174    0      0.90 (0.86-0.93)
rs7807771     6.71E-09    0.3855    0      0.84 (0.79-0.90)

(a) validation by Multiplex SnapShot technology; (b) validation by Sequenom MassArray system
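To make the relation between the Cases and Controls columns and the odds ratios in Table 2 concrete, the following minimal Python sketch computes an unadjusted allelic odds ratio with a Wald 95% confidence interval. It rests on several assumptions that are not stated in the table itself: that the Cases and Controls columns are minor allele frequencies, that every individual in Table 1 contributes two called alleles, and that no covariate adjustment is applied. The function name allelic_or and the variable names are illustrative, and the authors' actual analysis pipeline may differ.

```python
from math import exp, log, sqrt


def allelic_or(freq_cases, freq_controls, n_cases, n_controls, z=1.96):
    """Unadjusted allelic odds ratio with a Wald confidence interval.

    Allele frequencies are converted to approximate allele counts,
    assuming two alleles per individual and no missing genotypes.
    """
    a = freq_cases * 2 * n_cases              # minor alleles in cases
    b = (1 - freq_cases) * 2 * n_cases        # major alleles in cases
    c = freq_controls * 2 * n_controls        # minor alleles in controls
    d = (1 - freq_controls) * 2 * n_controls  # major alleles in controls
    log_or = log((a * d) / (b * c))
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    return exp(log_or), exp(log_or - z * se), exp(log_or + z * se)


# Stage 1 frequencies (cases, controls) transcribed from Table 2;
# stage 1 sample sizes transcribed from Table 1.
STAGE1_FREQS = {
    "rs3810151": (0.0638, 0.0797),
    "rs13047478": (0.2838, 0.3104),
    "rs7807771": (0.0848, 0.1026),
}
N_CASES, N_CONTROLS = 8031, 8035

results = {
    snp: allelic_or(f_cases, f_controls, N_CASES, N_CONTROLS)
    for snp, (f_cases, f_controls) in STAGE1_FREQS.items()
}
for snp, (odds_ratio, lower, upper) in results.items():
    print(f"{snp}: OR = {odds_ratio:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

Under these assumptions the point estimates fall within about 0.01 of the stage 1 odds ratios listed in Table 2; small differences are expected from rounding of the frequencies and from any adjustment used in the original analysis.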


Figure Legends

Figure 1 A two-stage EWAS involving 10,055 cases and 14,107 controls was performed

in Han Chinese women. The stage 1 consisted of 8,031 cases and 8,035 controls. The

most highly significant SNPs were followed up in the stage 2 of replication with an

additional 2,024 cases and 6,072 controls.

Figure 2 Manhattan plot of the association evidence in the Exome_Asian Array (8,031

cases and 8,035 controls).

Figure 3 The rs13047478-associated gene C21orf58 and the rs3810151-associated

ZNF526 show potential effects on breast cancer cell survival and display high

expression in breast cancer tissues. (A,B) Genome-wide loss-of-function screening of the

genes that are essential for the survival of the ER-negative breast cancer cell line

CAL120 (A), and the ER-positive breast cancer cell line T47D (B). Lower ATARiS

values indicate elevated dependency of the cells on given genes. BRD4, CCND1 and

MYC are known to be important for breast cancer cell growth and survival (29). Note

that the essentiality is strikingly higher for C21orf58 than for CCND1 (A),

and for ZNF526 than for MYC (B). (C,D) Cell proliferation was measured at 5 days (C) and 3

days (D), respectively, by XTT colorimetric assay (absorbance at 450 nm, OD450); mean

± s.d. of 5 technical replicates. *P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001,

two-tailed Student's t test. (E) The expression levels of two genes in cancerous and

normal tissues of the patients with breast cancer. Oncomine analysis of the Finak data

set (33) of 53 invasive breast tumors and 6 normal samples shows the expression of

both genes to be deregulated (P < 10⁻⁶) in breast cancer. Colors indicate z-score

normalized to depict relative values in each row. (F,G) C21orf58 (F) or ZNF526 (G)

mRNA expression was significantly upregulated in human breast cancers. The

horizontal lines represent the median values. The P values were calculated using Mann-

Whitney U-tests. The analyses are based on the data sets from Curtis et al. (34) and

TCGA at the Oncomine database.

Figure 4 Association of rs13047478 genotype with the expression of MCM3AP and

YBEY, and the chromatin interaction between rs13047478/C21orf58 and the two eQTL

genes. (A) Luciferase reporter assays showing enhancer activity of the rs13047478-

centered genomic region compared with the pGL3-promoter control vector. Note

that the randomly selected control regions 1 and 2 show no enhancer activity. NS, not

significant. (B,C) The expression quantitative trait locus (eQTL) analysis of the breast

cancer risk-associated SNP rs13047478. The eQTL analyses indicate that rs13047478

is associated with the expression of two genes MCM3AP (B) and YBEY (C),

respectively, in breast tissue samples. Note that the risk allele G of rs13047478 is

associated with decreased mRNA levels of MCM3AP, and increased expression of

YBEY. The analyses are based on the source data of the GTEx project (43). (D) The

chromatin interactions among C21orf58 (rs13047478), MCM3AP and YBEY were

defined by ChIA-PET experiments in the breast cancer cell line MCF-7 (44). (E,F)

Luciferase reporter assays indicate increased enhancer activity with the A allele of

rs13047478 relative to the G allele for the MCM3AP promoter (E), and the reverse for

the YBEY promoter (F). In A,E,F, error bars show mean ± s.d. (n = 4 technical replicates).

The P values were evaluated using two-tailed Student’s t-tests. *** P < 0.001; **** P

< 0.0001.

A large-scale, exome-wide association study of Han Chinese women identifies three novel loci predisposing to breast cancer

Bo Zhang, Men-Yun Chen, Yu-Jun Sheng, et al.

Cancer Res Published OnlineFirst March 23, 2018.

Updated version: Access the most recent version of this article at doi:10.1158/0008-5472.CAN-17-1721

Supplementary Material: Access the most recent supplemental material at http://cancerres.aacrjournals.org/content/suppl/2018/03/23/0008-5472.CAN-17-1721.DC1
