Supplementary Information for

Can cancer GWAS variants modulate immune cells in the tumor microenvironment?

Yi Zhang1,2, Mohith Manjunath2, Jialu Yan2,3, Brittany A. Baur4, Shilu Zhang4, Sushmita Roy4,5, and Jun S. Song2,3,*

1Department of Bioengineering, 2Carl R. Woese Institute for Genomic Biology, 3Department of

Physics, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA. 4Wisconsin

Institute for Discovery, 5Department of Biostatistics and Medical Informatics, University of

Wisconsin–Madison, Madison, WI 53792, USA.

*Correspondence: [email protected]

SUPPLEMENTARY METHODS

Genome-wide association studies (GWAS) variants

A list of estrogen receptor-positive (ER+) breast cancer-associated variants were first obtained from Michailidou et al.1 and the NHGRI-EBI GWAS catalog2. To identify the GWAS variants associated with immuno-inflammatory traits, we used the NHGRI-EBI GWAS catalog v1.0.2 and the annotation table from the GWAS catalog mapping the GWAS traits to the parent disease category using information from the Experimental Factor Ontology (EFO) database3. Using the

“Parent term” column, we first selected SNP-trait associations from the GWAS catalog containing the term “Immune system disorder” or “Inflammatory measurement”. We also included additional traits in other disease categories with recursive ontology parents containing the keyword “immun” or “inflamm”. For example, Crohn’ disease was categorized as a digestive

1 system disorder in the GWAS EFO mapping file, but we included it in our study since it is under the “inflammatory bowel disease” category in the EFO ontology database. Finally, we obtained a set of 3404 SNPs associated with immuno-inflammatory traits. For each ER+ breast cancer

GWAS SNP, we scanned the proximal region using a moving window of length 100 kb containing the SNP and counted the number of immuno-inflammatory GWAS SNPs within each window. The maximum number of immuno-inflammatory SNPs in these running windows was then recorded for each breast cancer GWAS SNP. When the breast cancer GWAS SNPs were ranked based on these maximum counts, rs3903072 emerged as the top SNP.

A supplementary table from Michailidou et al.4 provides all SNPs associated with breast cancer with p-value less than 10. However, as Michailidou et al.4 used the p-value threshold of 10 in the final report, we mainly focused on rs3903072 (� = 2 × 10) rather than the nearby breast cancer GWAS SNP rs617791 showing a weaker association (� = 7 × 10, Figure 1A).

The Cancer Genome Atlas (TCGA) cancer data

The germline genotypes at tag SNPs of breast cancer (BRCA) patients in the TCGA dataset were downloaded from the TCGA Data Portal. The tumor copy number segmentation data in hg19 from the NCI Genomic Data Commons (GDC) Legacy Archive5 were used to compute copy number (CN). The processed gene expression data in Fragments Per Kilobase Million

(FPKM) measured by RNA-seq were downloaded from the TCGA GDC data portal5. Germline genotypes from normal tissues and CN/RNA-seq data from tumor tissues were matched using

TCGA barcodes representing patients. For three other TCGA cancer datasets – Uterine Corpus

Endometrial Carcinoma (UCEC), Head-Neck Squamous Cell Carcinoma (HNSC) and Low

2 Grade Glioma (LGG) – only the germline genotypes and processed gene expression levels were downloaded.

The Genotype-Tissue Expression (GTEx) project data

The GTEx6; 7 gene expression levels in RPKM (Reads Per Kilobase Million) were downloaded from the GTEx portal (file: GTEx_Analysis_v6_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct), and the annotations for the samples and tissues were obtained from

GTEx_Data_V6_Annotations_SampleAttributesDS.txt. We analyzed the GTEx data with two aims: comparing the expression levels of a certain gene across different tissues, and analyzing the correlation between the genotype at a GWAS SNP and candidate target gene expression level.

For the first purpose, we used the mean expression level of CTSW measured in the GTEx tissues

(Figure 2D). The tissue-wise distributions of other nearby were also shown using the

UCSC GTEx track8 (Supplementary Figure 5). For the second aim, we used the fully processed, filtered and normalized gene expression levels in the breast mammary tissue and the whole blood tissue from GTEx Analysis V7 (dbGaP Accession phs000424.v7.p2); the imputed genotypes were extracted from the controlled-access dbGaP Accession phg000520.v2 (GTEx V2) dataset.

Genotype imputation for TCGA datasets

For genotype imputation in BRCA, UCEC, HNSC and LGG datasets, raw genotypes with genotype confidence score greater than 0.1 in birdseed format were marked as missing genotypes to be imputed along with the non-probed SNPs. The filtered genotypes were then uploaded to the

Michigan Imputation Server9 for genotype imputation, selecting the Haplotype Reference

3 Consortium (HRC) r1.1 201610 as a reference panel, Eagle v2.311 for phasing, and EUR population as the quality control option.

Expression quantitative trait loci (eQTL) analysis for TCGA-BRCA

In this paper, a pilot eQTL analysis was first performed among ER+ breast cancer patients. For this ER+ breast cancer analysis, we constructed a multivariate linear model, regressing the gene expression levels against the genotypes at the GWAS SNP rs3903072 as well as the gene CN12.

The gene expression levels in FPKM were log-transformed as ���(���� + 1) and taken to be the response. The SNP genotypes were encoded as the number of risk alleles (0, 1 or 2), and the gene CN was computed by taking gene length-weighted average of tumor CN segmentation data, transforming the segmentation unit back to the copy number unit (�� = 2 × 2).

Multivariate linear regression was then performed for each gene within the 3 Mb region centered at rs3903072. Genes with ���� ≥ 1 (����: mean expression among tumor samples) and the genotype p-value ≤ 0.05 from the linear regression were selected for further investigation.

Among them, CTSW was identified as a signal different from other genes, as described in the main manuscript; the three significant eQTL genes near CTSW – FIBP, MUS81 and EIF1AD – were then selected for comparison with CTSW in Figure 1B, because of their strong eQTL correlation that was also observed in an earlier study1.

A second stage of eQTL analyses was performed to validate the genotype correlation of CTSW in other cancer types. For these eQTL analyses using BRCA, UCEC, HNSC LGG and GTEx data, linear regression models between CTSW expression and the genotype status at rs3903072 were constructed. All RNA-seq FPKM values from TCGA were log-transformed as in BRCA. For

4 GTEx, the normalized gene expression matrices of whole blood and breast mammary tissues were downloaded from GTEx Analysis V7 release (dbGaP Accession phs000424.v7.p2).

TCGA survival analysis

Survival analysis in TCGA ER+ breast cancer patients was performed using the clinical data obtained from TCGA-GDC. The differences in survival rate between the two breast cancer patient groups separated by CTSW median expression level was tested using log-rank test with the R packages survival13 and survminer14. Survival analysis results were also obtained in endometrial cancer (UCEC), head and neck cancer (HNSC), and renal cancer, all from the human atlas (THPA)15, choosing the median expression level as the cutoff threshold for grouping patients (UCEC: https://www.proteinatlas.org/ENSG00000172543-

CTSW/pathology/tissue/endometrial+cancer; HNSC: https://www.proteinatlas.org/ENSG00000172543-

CTSW/pathology/tissue/head+and+neck+cancer; Renal cancer: https://www.proteinatlas.org/ENSG00000172543-CTSW/pathology/tissue/renal+cancer). The three datasets in renal cancer were checked separately, including Kidney Renal Clear Cell

Carcinoma (KIRC; https://www.proteinatlas.org/ENSG00000172543-

CTSW/pathology/tissue/renal+cancer/KIRC), Kidney Renal Papillary Cell Carcinoma (KIRP; https://www.proteinatlas.org/ENSG00000172543-CTSW/pathology/tissue/renal+cancer/KIRP), and Kidney Chromophobe (KICH; https://www.proteinatlas.org/ENSG00000172543-

CTSW/pathology/tissue/renal+cancer/KICH).

CTSW expression and promoter transcription activity

5 CTSW gene expression in a variety of tissues and cell lines was obtained from BioGPS

GeneAtlas16, cancer cell line encyclopedia (CCLE)17 and functional annotation of the mammalian genome (FANTOM)18 resources. Microarray gene expression data for CTSW were directly downloaded from the BioGPS web resource. There were 176 samples including replicates, and the mean expression of replicates was calculated for each tissue. CCLE gene expression data in RPKM units from RNA-seq data of 1156 samples were downloaded from the

CCLE website (file: CCLE_DepMap_18q3_RNAseq_RPKM_20180718.gct). FANTOM gene expression data in TPM units was obtained from the FANTOM web resource (file: hg19.gene_phase1and2combined_tpm.osc.txt). The FANTOM data consisted of 1829 samples

(tissue and cell type information obtained from HumanSamples2.0.sdrf.xlsx).

Chromatin accessibility at CTSW promoter

DNase-seq chromatin accessibility measurements in various tissues were obtained from

Encyclopedia of DNA Elements (ENCODE)19 and the Roadmap Epigenomics project20. For the

ENCODE data, the DNase I hypersensitivity sites (DHS) tracks of all cell types in ENCODE

Tier 1 were displayed, together with the three cell lines related to breast tissue (HMEC, MCF-7,

T-47D). We also examined all cell types in ENCODE Tier 2 and Tier 3, selecting the ones with a

DHS at CTSW promoter to include in Figure 2C. For the Roadmap Epigenomics data, we displayed the wiggle track of the first DNase-seq replicate in each cell type (Figure 2C).

ChIA-PET data analysis

We searched the ENCODE and Gene Expression Omnibus (GEO)21 databases for available three-dimension (3D) chromatin interaction data in lymphocyte cell lines expressing CTSW, and

6 found two ChIA-PET datasets in the Jurkat cell line for the SMC1 (GSE6897822) and

RAD21 (ENCODE Accession ENCSR361AYD). For the SMC1 ChIA-PET data, we used the significant interactions processed and merged by the authors from GEO. For the RAD21 ChIA-

PET data, we collected all the raw sequences and generated the chromatin interactions using

ChIA-PET 223 with default parameters.

Random forest regression approach for predicting high-resolution chromatin contact counts

To predict high-resolution Hi-C interactions around the GWAS SNP rs3903072 in Natural Killer cells, T cells and vHMEC, we trained a local random forest regression model within1 Mb of the

SNP, using an approach similar to our previously published method24. vHMEC serves as a control cell line, since it is a homogeneous, non-cancerous mammary epithelial cell line and does not contain other cell types such as T cells and NK cells. We trained the models on region-pairs involving the chr11:65580000-65585000 (hg19) 5 kb bin, which overlaps the GWAS SNP and

PRE1, and 5 kb bins within 1 Mb from the SNP, using published high-resolution (5 kb) Hi-C datasets in five different cell lines25 and complementary one-dimensional signals as features; these signals were histone marks, DNase I and DNase I accessible sequence specific motifs for

CTCF, RAD21 and TBP26. Histone marks and DNase I data were obtained from ENCODE and the Roadmap Epigenomics Project for the five training (GM12878, K562, HUVEC, NHEK,

HMEC) and three test (NK, T cells and vHMEC) cell lines. Since histone datasets for all the features were not available in vHMEC, we used imputed signals from the Avocado pipeline27.

Data processing and normalization were done as described in Zhang et al.24 and included normalization for sequencing depth and collapsing replicates by median. To account for overall

7 differences in signal across cell lines, we additionally discretized each of 5 kb ChIP-seq signals using k-means clustering with k=20.

A region was represented as a 10-dimensional feature vector, each dimension corresponding to one of the 10 genome-wide datasets (6 histone ChIP-seq, DNase-seq and 3 DNase-seq derived motifs). Features for a pair of regions were obtained by concatenating the 10-dimensional feature vectors of the two regions together with the feature vector of the intervening region between the two regions and the distance between the two regions to obtain a feature vector of size 31. The feature associated with the intervening region was the mean signal value of the features in the region for ChIP-seq and DHS, similar to the “WINDOW” feature in TargetFinder28. Once trained, we used the models to generate contact count predictions in the 1 Mb region in NK cells,

CD8+ αβ T cells and vHMEC, using feature datasets from the Roadmap Epigenomics database.

Prioritization of functional SNPs linked to GWAS SNPs

We first selected all common (��� ≥ 0.05) SNPs from 1000 Genomes Project Phase 3 in high linkage disequilibrium (LD) (� ≥ 0.8) with rs3903072. All LD values were calculated using the haplotypes from the 1000 Genomes Project Phase 3 EUR population, since the original GWAS

SNP rs3903072 was discovered in the European population. To prioritize SNPs located in putative regulatory regions (PREs), we collected DHS from ENCODE and the Roadmap

Epigenomics project in lymphocytes-related cells, such as T cells, NK cells, B cells, T helper cells and common myeloid progenitor cells (Supplementary Table 5). The LD SNPs overlapping any of the DHS peaks were prioritized for further investigation. In addition to DHS,

8 we also used H3K4me1 modification (processed wiggle track in Jurkat cells from GSE11943929) to prioritize SNPs within putative regulatory elements.

Motif analysis

TF position-specific weight matrices (PWM) were collected from HOCOMOCO Human v1030,

FACTORBOOK31; 32, TRANSFAC33, JASPAR vertebrates34 and Jolma201335. To identify potential binding sites affected by SNPs, we used the program FIMO36 (version 4.12.0) to scan the 51 bp sequences carrying either allele of each prioritized SNP in the center (FIMO threshold

10). The statistical significance of motif disruption or creation effect of the SNP alleles was then measured using our previous method of simulating null mutations in motif sequences12.

Motif logos were plotted using WebLogo337 based on PWMs or their reverse complement matrices.

ChIP-seq analysis

ChIP-seq data for relevant TFs were collected from ENCODE and GEO. Processed wiggle tracks and peaks were downloaded and presented when available in hg19 (Jurkat H3K4me1 ChIP-seq track from GSE11943929; TBX21 ChIP-seq pooled wiggle track and peaks in GM12878 from

ENCODE ENCFF193RDB and ENCFF869HSY). For TCF3 ChIP-seq data in Kasumi1 and

KLF1 ChIP-seq in GM12878, the raw sequences were downloaded from GSE4383438 and

GSE4362539, respectively, and processed using SRA Toolkit40, mapped to hg19 using BWA41 ( - n 2), and analyzed for peaks using MACS242 (callpeak: -q 0.1 --SPMR; bdgcmp: -m FC).

9 SUPPLEMENTARY TABLES

Supplementary Table 1. GWAS traits around rs3903072 for the region shown in Figure 1A.

GWAS SNP Trait Category GWAS traits reported in each PUBMED ID for ID study each study rs10750766 Blood cells-related Diastolic blood pressure x alcohol 29912962 ; consumption interaction (2df test) ; 27863252 ; High light scatter reticulocyte 27863252 ; count ; High light scatter 27863252 ; reticulocyte percentage of red cells ; 29912962 Immature fraction of reticulocytes ; Systolic blood pressure x alcohol consumption interaction (2df test) rs10791824 Immuno- Atopic dermatitis 26482879 inflammatory rs10896045 Blood cells-related Blood protein levels 29875488 rs11227302 Immuno- Systemic lupus erythematosus 28714469 inflammatory rs11227306 Other traits DNA methylation (variation) 23725790 rs11602769 Other traits Allergic sensitization 30013184 rs11604462 Other traits Glomerular filtration rate 28452372 (creatinine) rs118086960 Immuno- Psoriasis 28537254 inflammatory rs12223803 Blood cells-related Albumin-globulin ratio 29403010 rs12576766 Other traits Serum uric acid levels 29403010 rs185542523 Other traits Maximum cranial width 29698431 rs201316070 Other traits Systolic blood pressure x smoking 29455858 ; status (current vs non-current) 29455858 interaction (2df test) ; Systolic blood pressure x smoking status (ever vs never) interaction (2df test) rs2231884 Immuno- Inflammatory bowel disease 23128233 inflammatory rs3825068 Blood cells-related Blood protein levels 29875488 rs3903072 Breast cancer Breast cancer ; Breast cancer ; 23535729 ; Breast cancer 25751625 ; 29059683 rs4014195 Other traits Chronic kidney disease ; Glomerular 20383146 ; filtration rate (creatinine) 26831199 rs478304 Immuno- Acne (severe) ; Spherical equivalent 24927181 ; inflammatory or myopia (age of diagnosis) 29808027

10 rs479844 Other traits Allergic disease (asthma, hay fever 29083406 ; or eczema) ; Atopic dermatitis ; 22197932 ; Atopic dermatitis ; Atopic march 26482879 ; 26542096 rs489574 Immuno- Systemic lupus erythematosus 28714469 inflammatory rs494003 Immuno- Systemic lupus erythematosus ; 26502338 ; inflammatory Systemic lupus erythematosus 27399966 rs526631 Blood cells-related Eosinophil percentage of 27863252 ; granulocytes ; Neutrophil percentage 27863252 of granulocytes rs568617 Immuno- Chronic inflammatory diseases 26974007 ; inflammatory (ankylosing spondylitis, Crohn's 26192919 disease, psoriasis, primary sclerosing cholangitis, ulcerative colitis) (pleiotropy) ; Crohn's disease rs5792377 Other traits Heel bone mineral density 30048462 rs593982 Immuno- Atopic dermatitis 23042114 inflammatory rs617791 Other traits Breast cancer 29059683 rs634534 Blood cells-related Eosinophil counts ; Sum eosinophil 27863252 ; basophil counts 27863252 rs637571 Blood cells-related Eosinophil percentage of white cells 27863252 rs642803 Other traits Educational attainment (MTAG) ; 30038396 ; Highest math class taken (MTAG) ; 30038396 ; Urate levels 23263486 rs7123489 Other traits Creatinine levels 29403010 rs72941051 Other traits Diastolic blood pressure x smoking 29455858 ; status (current vs non-current) 29455858 ; interaction (2df test) ; Diastolic 29455858 ; blood pressure x smoking status 29455858 (ever vs never) interaction (2df test) ; Systolic blood pressure x smoking status (current vs non- current) interaction (2df test) ; Systolic blood pressure x smoking status (ever vs never) interaction (2df test) rs77291001 Other traits Maximum cranial width 29698431 rs77779142 Immuno- Rosacea symptom severity* 29771307 inflammatory rs9795139 Other traits Serum uric acid levels 29403010 * rs77779142 is not shown in Figure 1A, because Rosacea symptom was not listed as an immune-related disease in the original GWAS annotation; it is recorded here as an immuno- inflammatory variant, since Rosacea is an inflammatory skin condition.

11 Supplementary Table 2. Genetic linkage between rs3903072 and the nearby immuno- inflammatory variants.

Immuno- �� D' Immuno- Immuno- Allele correlated inflammatory inflammatory inflammatory with rs3903072-G SNP risk allele risk allele (risk) frequency rs478304 0.0598 0.2474 T 0.534 Linkage Equilibrium rs593982 0.0227 0.4476 C 0.883 Linkage Equilibrium rs494003 0.1078 0.737 A 0.189 A rs489574 0.0003 0.0226 A 0.343 Linkage Equilibrium rs479844 0.3262 0.6931 G 0.557 A rs11227302 0.1057 0.7416 A 0.184 A rs10791824 0.3234 0.7157 G 0.575 A rs118086960 0.0233 0.1553 T 0.532 Linkage Equilibrium rs77779142 0.1673 1.0000 T 0.164 T rs568617 0.1851 0.9657 T 0.189 T rs2231884 0.1586 0.9737 T 0.164 T

12 Supplementary Table 3. List of eQTL genes correlated with the rs3903072 genotype within 3 Mb of the SNP.

Available as a separate supplementary file.

13 Supplementary Table 4. Annotation of genes in the rs3903072-CTSW region from PANTHER43.

Gene PANTHER Mappe Name; Gene ID Family/Subfa PANTHER Protein Class d IDs Gene mily Symbol AP-5 AP-5 COMPLEX complex HUMAN|HGNC=25104|U SUBUNIT AP5B1 subunit niProtKB=Q2VPB7 BETA-1 beta- (PTHR34033:S 1;AP5B1 F1) Galactose- GALACTOSE- 3-O- 3-O- HUMAN|HGNC=24144|U GAL3 sulfotransf SULFOTRAN niProtKB=Q96A11 ST3 erase SFERASE 3 3;GAL3ST (PTHR14647:S 3 F76) FOS- Fos-related RELATED basic leucine zipper HUMAN|HGNC=13718|U FOSL1 antigen ANTIGEN 1 transcription niProtKB=P15407 1;FOSL1 (PTHR23351:S factor(PC00056) F6) HISTONE Histone acetyltransferase(PC00038); ACETYLTRA acetyltrans chromatin/chromatin-binding HUMAN|HGNC=5275|Uni NSFERASE KAT5 ferase protein(PC00077);zinc ProtKB=Q92993 KAT5 KAT5;KA finger transcription (PTHR10615:S T5 factor(PC00244) F124) SORTING Sorting HUMAN|HGNC=26423|U NEXIN-32 SNX32 nexin- niProtKB=Q86XE0 (PTHR45850:S 32;SNX32 F3) ACIDIC Acidic FIBROBLAST fibroblast GROWTH growth FACTOR HUMAN|HGNC=3705|Uni factor INTRACELLU FIBP ProtKB=O43427 intracellula LAR- r-binding BINDING protein;FI PROTEIN BP (PTHR13223:S F2)

14 TESTIS- Testis- SPECIFIC specific PROTEIN 10- HUMAN|HGNC=26555|U TSGA protein 10- INTERACTIN niProtKB=Q3SY00 10IP interacting G PROTEIN protein;TS (PTHR21501:S GA10IP F5) SPLICING Splicing FACTOR 3B HUMAN|HGNC=10769|U factor 3B SF3B2 SUBUNIT 2 niProtKB=Q13435 subunit (PTHR12785:S 2;SF3B2 F6) CYSTATIN-M HUMAN|HGNC=2478|Uni Cystatin- CST6 (PTHR47033:S ProtKB=Q15828 M;CST6 F1) CATHEPSIN cysteine HUMAN|HGNC=2546|Uni Cathepsin W CTSW protease(PC00081);protease ProtKB=P56202 W;CTSW (PTHR12411:S inhibitor(PC00191) F101) U4/U6.U5 TRI- U4/U6.U5 SNRNP- tri-snRNP- HUMAN|HGNC=10538|U SART ASSOCIATED extracellular matrix associated niProtKB=O43290 1 PROTEIN 1 protein(PC00102) protein (PTHR14152:S 1;SART1 F5) TRANSCRIPT Putative ION FACTOR transcriptio HUMAN|HGNC=8525|Uni OVOL OVO-LIKE 1- n factor ProtKB=O14753 1 RELATED Ovo-like (PTHR10032:S 1;OVOL1 F217) Cation CATION channel CHANNEL sperm- SPERM- HUMAN|HGNC=17116|U CATS associated ASSOCIATED niProtKB=Q8NEC5 PER1 protein PROTEIN 1 1;CATSPE (PTHR47193:S R1 F1) Ribonuclea RIBONUCLE HUMAN|HGNC=24116|U se H2 ASE H2 niProtKB=Q8TDP1 EH2C subunit SUBUNIT C C;RNASE (PTHR47063:S

15 H2C F1) COFILIN-1 HUMAN|HGNC=1874|Uni Cofilin- non-motor actin binding CFL1 (PTHR11913:S ProtKB=P23528 1;CFL1 protein(PC00165) F17) CYSTATIN-B HUMAN|HGNC=2482|Uni Cystatin- cysteine protease CST6 (PTHR11414:S ProtKB=P04080 B;CSTB inhibitor(PC00082) F22) BARRIER-TO- Barrier-to- AUTOINTEG autointegra HUMAN|HGNC=17397|U BANF RATION tion niProtKB=O75531 1 FACTOR factor;BA (PTHR12912:S NF1 F10) PHOSPHOFU Phosphofur RIN ACIDIC in acidic CLUSTER HUMAN|HGNC=30032|U cluster PACS1 SORTING niProtKB=Q6VY07 sorting PROTEIN 1 protein (PTHR13280:S 1;PACS1 F16) RNA- Probable BINDING RNA- PROTEIN HUMAN|HGNC=28147|U EIF1A binding EIF1AD- niProtKB=Q8N9N8 D protein RELATED EIF1AD;E (PTHR21641:S IF1AD F0) EGF- CONTAINING annexin(PC00050);calmodul EGF- FIBULIN- in(PC00061);cell adhesion containing LIKE molecule(PC00069);extracell fibulin-like HUMAN|HGNC=3219|Uni EFEM EXTRACELL ular matrix extracellul ProtKB=O95967 P2 ULAR glycoprotein(PC00100);extra ar matrix MATRIX cellular matrix structural protein PROTEIN 2 protein(PC00103);signaling 2;EFEMP2 (PTHR24034:S molecule(PC00207) F96) UPF0696 UPF0696 PROTEIN HUMAN|HGNC=28801|U C11orf protein C11ORF68 niProtKB=Q9H3H3 68 C11orf68; (PTHR31977:S C11orf68 F1)

16 Crossover CROSSOVER junction JUNCTION HUMAN|HGNC=29814|U MUS8 endonuclea ENDONUCLE niProtKB=Q96NY9 1 se ASE MUS81 MUS81;M (PTHR13451:S US81 F0)

17

Supplementary Table 5. List of DNase-seq peak files for lymphocytes.

Cell line ENCODE accessions numbers of DHS peak files (treatment) Cd4+, helper T ENCFF569GSL, ENCFF907KBL, ENCFF988HSM cell Cd8+, alpha- ENCFF071RMN, ENCFF422HNI, ENCFF662NTN beta T cell Jurkat clone ENCFF582KJR, ENCFF837AOM E61 Natural killer ENCFF224LJW, ENCFF933OXV cell T cell ENCFF026SFK, ENCFF286UIJ, ENCFF304TBE, ENCFF345YDG, ENCFF923EVD B cell ENCFF210RAG, ENCFF654IWG, ENCFF772OPR T helper17 cell ENCFF001WCL, ENCFF001WTC T helper1 cell ENCFF434LIX, ENCFF001WCS, ENCFF001WTE, ENCFF001WTI, ENCFF001WTL, ENCFF773IYG, ENCFF001WCQ, ENCFF001WTF, ENCFF001WTM T helper2 cell ENCFF570MGY, ENCFF001WCW, ENCFF001WTO, ENCFF001WTS, ENCFF001WTU, ENCFF001WCU, ENCFF001WTQ Common ENCFF037XOG, ENCFF387SIU, ENCFF401NSY, ENCFF457SNT, myeloid ENCFF479XZN, ENCFF499OEI, ENCFF600EJV, ENCFF664WUJ, progenitor, ENCFF686ZXP, ENCFF727NEX, ENCFF770BVB, ENCFF918ICP, CD34+ ENCFF182JTX, ENCFF264UIE, ENCFF927MCZ

18 SUPPLEMENTARY FIGURES

8

6

4

SNP count 2

0 rs616488 rs3903072 rs2588809 rs4808801 rs9790517 rs4322600 rs6762644 rs1432679 rs2046210 rs1011970 rs11780156 rs11552449 rs13329835 rs10995190 rs17356907 rs17817449 rs10069690 rs140068132

Supplementary Figure 1. Number of GWAS variants associated with immuno-inflammatory traits around each ER+ breast cancer GWAS SNP. Only those breast cancer GWAS SNPs with a non-zero count of immuno-inflammatory SNPs within +/- 100 kb are shown.

19

A BRCA UCEC n = 245 n = 306 n = 129 n = 186 n = 220 n = 94 -5 -3 5 p = 2.79×10 6 p = 1.52×10

5 4 4 3 3 2 2 2

log (FPKM+1) log (FPKM+1) 2

1 1

0 0 rs3903072 risk/ protective/ protective/ rs3903072 risk/ protective/ protective/ Genotype risk risk protective Genotype risk risk protective HNSC LGG 6 n = 161 n = 200 n = 105 n = 145 n = 251 n = 97 -3 2 .5 -3 5 p = 5.45×10 p = 7.09×10

2 .0 4

1 .5 3 2 2 2 1 .0 log (FPKM+1) log (FPKM+1)

1 0 .5

0 0 .0 rs3903072 risk/ protective/ protective/ rs3903072 risk/ protective/ protective/ Genotype risk risk protective Genotype risk risk protective

B GTEx Breast GTEx Whole Blood 3 n = 61 n = 76 n = 35 n = 101 n = 147 n = 65 3 2 2 1 1

0 0

−1 −1

−2 −2

−3 GTEx Normalized Expression −3 -4 GTEx Normalized Expression p = 8.64×10 p = 1.60×10-2 rs3903072 risk/ protective/ protective/ rs3903072 risk/ protective/ protective/ Genotype risk risk protective Genotype risk risk protective

Supplementary Figure 2. Violin plots for the eQTL analysis in different datasets. A linear model was constructed between the CTSW expression level and the genotype status at the GWAS SNP rs3903072; the p-values shown are for the linear coefficient of genotype. Gene copy number is not included in the model for this figure. (A) eQTL analysis in cancers from TCGA, using ER+ breast cancer subtype in BRCA, endometrial cancer (UCEC), head and neck cancer (HNSC), and low grade glioma (LGG). (B) eQTL analysis in normal tissues from GTEx, using mammary tissue and whole blood tissue.

20 PLACENTA EYE BUCCAL OESOPHAGUS ADRENAL_CORTEX LARGE_INTESTINE BILIARY_TRACT FIBROBLAST OVARY AUTONOMIC_GANGLIA BREAST PLEURA LIVER PANCREAS SMALL_INTESTINE SOFT_TISSUE BONE CERVIX ENDOMETRIUM PROSTATE STOMACH SKIN LUNG UPPER_AERODIGESTIVE_TRACT CENTRAL_NERVOUS_SYSTEM SALIVARY_GLAND THYROID KIDNEY URINARY_TRACT HAEMATOPOIETIC_AND_LYMPHOID_TISSUE 0 5 10 15 20 25 30 RPKM

Supplementary Figure 3. CTSW expression across different cell lines using data from CCLE. Cell lines are grouped by their tissue type; tissues are ranked based on the mean CTSW expression of cell lines within each group. An error bar is also shown for each group indicating standard deviation.

21 iPS osteoclast smooth muscle cell chondrocyte urothelial cell granulosa cell nucleus pulposus cell of intervertebral disc clara cell epithelial cell glioblast hair follicle, dermal papilla cell adipocyte ciliated epithelial cell macrophage epithelial cell of lung, small airway satellite cell, skeletal muscle b cell synovial cell monoblast hepatocyte tracheal epithelial cell lymphocyte human monocyte-derived macrophage eosinophil progenitor cell dendritic cell, myeloid, immature Cell mixture − tissue sample neutrophil langerhans cell monocyte basophil progenitor cell blood vessel endothelial progenitor cell stem cell dendritic cell, plasmacytoid t cell granulocyte monocyte progenitor cell t cell, CD4+ eosinophil mast cell hematopoietic stem cell, CD34+ myeloid progenitor cell megakaryoblast common myeloid progenitor granulocyte macrophage progenitor promyelocytes/myelocytes mononuclear cell basophil t cell, CD8+ t cell, nk, immature natural killer cell t cell, gamma−delta 0 500 1000 1500 TPM

Supplementary Figure 4. The CTSW promoter transcription activity measured by FANTOM5 for different cell types. Top 50 cell types are shown, ranked by mean CTSW expression within each cell type. The cell types listed are obtained from an annotation file provided by the FANTOM5 consortium. An error bar is also shown for each cell type indicating standard deviation. TPM, tags per million.

22 Scale 100 kb hg19 chr11: 65,500,000 65,600,000 65,700,000 65,800,000 UCSC Genes (RefSeq, GenBank, CCDS, Rfam, tRNAs & Comparative Genomics) Adipose RELA KAT5 AP5B1 SNX32 C11orf68 EIF1AD PACS1 RELA KAT5 OVOL1 CFL1 DRAP1 EIF1AD PACS1 Adipose RELA KAT5 OVOL1 CFL1 TSGA10IP CST6 Adrenal Gland RELA KAT5 SNX32 C11orf68 EIF1AD RELA KAT5 SNX32 TSGA10IP AX746604 Artery RELA MUS81 TSGA10IP GAL3ST3 Artery RELA MUS81 SART1 GAL3ST3 RELA EFEMP2 SART1 SF3B2 Artery RNASEH2C EFEMP2 SART1 SF3B2 Brain RNASEH2C CTSW EIF1AD FIBP EIF1AD Breast FIBP EIF1AD EBV-transformed FIBP EIF1AD Cells lymphocytes Transformed FIBP BANF1 Cells fibroblast CCDC85B BANF1 Colon FOSL1 AX747517 FOSL1 AX747517 Esophagus CATSPER1 Esophagus SF3B2 Gene Expression in 53 tissues from GTEx RNA-seq of 8555 samples (570 donors) Heart RELA SNX32 CST6 Heart RN7SL309P CTSW PACS1 Lung KAT5 FIBP Muscle RNASEH2C C11orf68 Nerve KRT8P26 DRAP1 Ovary RP11-770G2.2 TSGA10IP Pancreas AP5B1 SART1 Pituitary AP001266.1 EIF1AD

Prostate OVOL1 BANF1

Skin OVOL1-AS1 CATSPER1 RP11-770G2.5 GAL3ST3 Skin CFL1 SF3B2 Spleen MUS81 snoU13 Stomach EFEMP2 Testis CCDC85B Thyroid FOSL1 Vagina RP11-1167A19.6 Whole Blood RP11-1167A19.2 Supplementary Figure 5. Tissue-wise expression of CTSW and nearby genes. The four eQTL genes shown in Figure 1B (MUS81, CTSW, FIBP, EIF1AD) are boxed. The tissue specificity of CTSW expression is clearly visible.

23

Supplementary Figure 6. The putative regulatory SNP in CTSW promoter in weak linkage with the GWAS SNP. (A) The CTSW promoter SNP is in high �′ but low � with the breast cancer GWAS SNP rs3903072, and it is a rarer SNP compared to rs3903072. Enlarging the CTSW promoter region shows that the promoter SNP is located near the center of the CTSW DHS in several blood cell lines. This is a zoomed-in region of Figure 1A where the immuno- inflammatory GWAS variants are marked magenta, with one more Rosacea SNP rs77779142 (Supplementary Table 1). (B) The linkage structure between the CTSW promoter SNP and the GWAS SNP. Most haplotypes carrying the rs658524-A allele have the rs3903072-G (risk) allele (data presentation from LDLink: https://ldlink.nci.nih.gov, computed based on 1000 Genomes phase 3 EUR population).

24 A BRCA UCEC n = 49 n = 240 n = 391 n = 36 n = 184 n = 280 -17 -11 5 p = 1.02×10 6 p = 1.50×10

5 4 4 3 3 2 2 2 log (FPKM+1) log (FPKM+1) 2

1 1

0 0 rs658524 rs658524 Genotype A /A A / G G/ G Genotype A /A A / G G/ G

6 HNSC LGG n = 30 n = 146 n = 290 n = 28 n = 155 n = 320 -12 2 .5 -6 5 p = 1.32×10 p = 1.43×10

2 .0 4

1 .5 3 2 2 2 1 .0 log (FPKM+1) log (FPKM+1)

1 0 .5

0 0 .0 rs658524 A /A A / G G/ G rs658524 A /A A / G G/ G Genotype Genotype

B GTEx Breast GTEx Whole Blood 3 n = 19 n = 44 n = 109 n = 25 n = 95 n = 193 p = 2.19×10-11 3 -5 2 p = 8.48×10 2 1 1

0 0

−1 −1

−2 −2 −3 GTEx Normalized Expression −3 GTEx Normalized Expression −4 rs658524 rs658524 A /A A / G G/ G A /A A / G G/ G Genotype Genotype Supplementary Figure 7. Violin plots for the eQTL analysis in different datasets, similar to Supplementary Figure 2 but using the genotypes at the CTSW promoter SNP. A linear model was constructed between the CTSW expression level and the genotype status at the promoter SNP rs658524; the p-values are for the linear coefficient of genotype, and gene copy number is not included in the model for this figure. Note that rs658524-A is marked as being the risk allele (red), as inferred through its haplotype structure with rs3903072. (A) eQTL analysis in cancers from TCGA, using ER+ breast cancer subtype in BRCA, endometrial cancer (UCEC), head and neck cancer (HNSC), and low grade glioma (LGG). (B) eQTL analysis in normal tissues from GTEx, using mammary tissue and whole blood tissue.

25 (Supplementary Figure 6B)

GTEx Whole Blood rs3903072 n = 25 n = 95 n = 193 3 Genotype 2 A

1 CTSW n = 37 n = 92 n = 64 0 −1 2 −2

−3 GTEx Normalized Expression 0 −4 CTSW rs658524 A /A A / G G/ G 2 Genotype

−2log (FPKM+1) p = 0.18

The haplotypes with rs658524-A rs3903072 risk/ protective/ protective/ risk risk protective mostly carries rs3903072-G (risk) Genotype GT_GWAS

-4 p = 6×10 2

0 2 CTSW

−2log (FPKM+1)

rs3903072 w/o w/ Genotype protective protective GT_GWAS Supplementary Figure 8. eQTL analysis conditioned on the protective homozygous genotype at the CTSW promoter SNP. We selected the samples carrying G/G at the promoter SNP, and studied whether there exists a residual effect from the distal GWAS genotype on the CTSW expression level. A recessive effect of the GWAS SNP is observed in the bottom plot. The p- value in the top right plot is for the genotype coefficient from the eQTL linear model, and the p- value in the bottom plot is computed using two-sided Welch’s t test.

26

Supplementary Figure 9. ChIA-PET data and anchoring elements around the GWAS region. The same region as in Figure 1A is shown, presenting DHS tracks, ChIA-PET data in the Jurkat cell line, putative regulatory elements, and anchoring elements that may mediate multi-way chromatin interactions. The putative regulatory elements (PRE), PRE1, PRE2, and PRE3, all contain GWAS-linked SNPs overlapping a blood cell DHS (Supplementary Table 5), but PRE1 is highlighted, because of its interaction with the 3’ end of CTSW. Several possible TF motifs affected by the PRE1 SNP, besides those displayed in Figure 3B, are shown here.

27

Train: GM12878 Train: HMEC Train: K562 Train: HUVEC Train: NHEK

Supplementary Figure 10. Predicted log contact counts for the pair between rs3903072-PRE1 and CTSW promoter in three cell lines: Natural Killer cells, CD8+ αβ T cells and vHMEC.

28 SUPPLEMENTARY REFERENCES

1. Michailidou, K., Hall, P., Gonzalez-Neira, A., Ghoussaini, M., Dennis, J., Milne, R.L., Schmidt, M.K., Chang-Claude, J., Bojesen, S.E., Bolla, M.K., et al. (2013). Large-scale genotyping identifies 41 new loci associated with breast cancer risk. Nature genetics 45, 353-361e3612. 2. MacArthur, J., Bowler, E., Cerezo, M., Gil, L., Hall, P., Hastings, E., Junkins, H., McMahon, A., Milano, A., Morales, J., et al. (2017). The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic acids research 45, D896- D901. 3. Malone, J., Holloway, E., Adamusiak, T., Kapushesky, M., Zheng, J., Kolesnikov, N., Zhukova, A., Brazma, A., and Parkinson, H. (2010). Modeling sample variables with an Experimental Factor Ontology. Bioinformatics (Oxford, England) 26, 1112-1118. 4. Michailidou, K., Lindström, S., Dennis, J., Beesley, J., Hui, S., Kar, S., Lemaçon, A., Soucy, P., Glubb, D., Rostamianfar, A., et al. (2017). Association analysis identifies 65 new breast cancer risk loci. Nature 551, 92. 5. Grossman, R.L., Heath, A.P., Ferretti, V., Varmus, H.E., Lowy, D.R., Kibbe, W.A., and Staudt, L.M. (2016). Toward a Shared Vision for Cancer Genomic Data. New England Journal of Medicine 375, 1109-1112. 6. Carithers, L.J., Ardlie, K., Barcus, M., Branton, P.A., Britton, A., Buia, S.A., Compton, C.C., DeLuca, D.S., Peter-Demchok, J., Gelfand, E.T., et al. (2015). A Novel Approach to High-Quality Postmortem Tissue Procurement: The GTEx Project. Biopreservation and Biobanking 13, 311-319. 7. GTEx Consortium. (2013). The Genotype-Tissue Expression (GTEx) project. Nature genetics 45, 580-585. 8. Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, D. (2002). The browser at UCSC. Genome research 12, 996- 1006. 9. Das, S., Forer, L., Schönherr, S., Sidore, C., Locke, A.E., Kwong, A., Vrieze, S.I., Chew, E.Y., Levy, S., McGue, M., et al. (2016). Next-generation genotype imputation service and methods. Nature genetics 48, 1284-1287. 10. Loh, P.-R., Danecek, P., Palamara, P.F., Fuchsberger, C., A Reshef, Y., K Finucane, H., Schoenherr, S., Forer, L., McCarthy, S., Abecasis, G.R., et al. (2016). Reference-based phasing using the Haplotype Reference Consortium panel. Nature genetics 48, 1443-1448. 11. Loh, P.-R., Palamara, P.F., and Price, A.L. (2016). Fast and accurate long-range phasing in a UK Biobank cohort. Nature Genetics 48, 811. 12. Zhang, Y., Manjunath, M., Zhang, S., Chasman, D., Roy, S., and Song, J.S. (2018). Integrative Genomic Analysis Predicts Causative Cis-Regulatory Mechanisms of the Breast Cancer-Associated Genetic Variant rs4415084. Cancer Research 78, 1579-1591. 13. Therneau, T. (2015). A Package for Survival Analysis in S. version 2.38. 14. Kassambara, A., and Kosinski, M. (2017). Survminer: Drawing Survival Curves using 'ggplot2'. 15. Uhlen, M., Zhang, C., Lee, S., Sjöstedt, E., Fagerberg, L., Bidkhori, G., Benfeitas, R., Arif, M., Liu, Z., Edfors, F., et al. (2017). A pathology atlas of the human cancer transcriptome. Science 357, eaan2507.

29 16. Wu, C., Jin, X., Tsueng, G., Afrasiabi, C., and Su, A.I. (2016). BioGPS: building your own mash-up of gene annotations and expression profiles. Nucleic acids research 44, D313- D316. 17. Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A.A., Kim, S., Wilson, C.J., Lehár, J., Kryukov, G.V., Sonkin, D., et al. (2012). The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603. 18. The Fantom Consortium and the Riken PMI and CLST (DGT), Forrest, A.R.R., Kawaji, H., Rehli, M., Kenneth Baillie, J., de Hoon, M.J.L., Haberle, V., Lassmann, T., Kulakovskiy, I.V., Lizio, M., et al. (2014). A promoter-level mammalian expression atlas. Nature 507, 462. 19. The Encode Project Consortium, Dunham, I., Kundaje, A., Aldred, S.F., Collins, P.J., Davis, C.A., Doyle, F., Epstein, C.B., Frietze, S., Harrow, J., et al. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57. 20. Bernstein, B.E., Stamatoyannopoulos, J.A., Costello, J.F., Ren, B., Milosavljevic, A., Meissner, A., Kellis, M., Marra, M.A., Beaudet, A.L., Ecker, J.R., et al. (2010). The NIH Roadmap Epigenomics Mapping Consortium. Nature biotechnology 28, 1045-1048. 21. Barrett, T., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F., Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., Holko, M., et al. (2013). NCBI GEO: archive for functional genomics data sets--update. Nucleic acids research 41, D991-D995. 22. Hnisz, D., Weintraub, A.S., Day, D.S., Valton, A.-L., Bak, R.O., Li, C.H., Goldmann, J., Lajoie, B.R., Fan, Z.P., Sigova, A.A., et al. (2016). Activation of proto-oncogenes by disruption of neighborhoods. Science 351, 1454-1458. 23. Li, G., Chen, Y., Snyder, M.P., and Zhang, M.Q. (2017). ChIA-PET2: a versatile and flexible pipeline for ChIA-PET data analysis. Nucleic acids research 45, e4-e4. 24. Zhang, S., Chasman, D., Knaack, S., and Roy, S. (2018). In silico prediction of high- resolution Hi-C interaction matrices. bioRxiv, 406322. 25. Rao, S.S.P., Huntley, M.H., Durand, N.C., Stamenova, E.K., Bochkov, I.D., Robinson, J.T., Sanborn, A.L., Machol, I., Omer, A.D., Lander, E.S., et al. (2014). A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665-1680. 26. Sherwood, R.I., Hashimoto, T., O'Donnell, C.W., Lewis, S., Barkal, A.A., van Hoff, J.P., Karun, V., Jaakkola, T., and Gifford, D.K. (2014). Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape. Nature Biotechnology 32, 171. 27. Schreiber, J., Durham, T.J., Bilmes, J., and Noble, W.S. (2018). Multi-scale deep tensor factorization learns a latent representation of the human epigenome. bioRxiv, 364976. 28. Whalen, S., Truty, R.M., and Pollard, K.S. (2016). Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin. Nature Genetics 48, 488. 29. Leong, W.Z., Tan, S.H., Ngoc, P.C.T., Amanda, S., Yam, A.W.Y., Liau, W.-S., Gong, Z., Lawton, L.N., Tenen, D.G., and Sanda, T. (2017). ARID5B as a critical downstream target of the TAL1 complex that activates the oncogenic transcriptional program and promotes T-cell leukemogenesis. Genes & development 31, 2343-2360. 30. Kulakovskiy, I.V., Vorontsov, I.E., Yevshin, I.S., Soboleva, A.V., Kasianov, A.S., Ashoor, H., Ba-Alawi, W., Bajic, V.B., Medvedeva, Y.A., Kolpakov, F.A., et al. (2016).

30 HOCOMOCO: expansion and enhancement of the collection of transcription factor binding sites models. Nucleic acids research 44, D116-D125. 31. Wang, J., Zhuang, J., Iyer, S., Lin, X., Whitfield, T.W., Greven, M.C., Pierce, B.G., Dong, X., Kundaje, A., Cheng, Y., et al. (2012). Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome research 22, 1798-1812. 32. Pinello, L., Xu, J., Orkin, S.H., and Yuan, G.-C. (2014). Analysis of chromatin-state plasticity identifies cell-type-specific regulators of H3K27me3 patterns. Proceedings of the National Academy of Sciences of the United States of America 111, E344-E353. 33. Matys, V., Kel-Margoulis, O.V., Fricke, E., Liebich, I., Land, S., Barre-Dirrie, A., Reuter, I., Chekmenev, D., Krull, M., Hornischer, K., et al. (2006). TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic acids research 34, D108-D110. 34. Mathelier, A., Fornes, O., Arenillas, D.J., Chen, C.-Y., Denay, G., Lee, J., Shi, W., Shyr, C., Tan, G., Worsley-Hunt, R., et al. (2016). JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic acids research 44, D110-D115. 35. Jolma, A., Yan, J., Whitington, T., Toivonen, J., Nitta, Kazuhiro R., Rastas, P., Morgunova, E., Enge, M., Taipale, M., Wei, G., et al. (2013). DNA-Binding Specificities of Human Transcription Factors. Cell 152, 327-339. 36. Grant, C.E., Bailey, T.L., and Noble, W.S. (2011). FIMO: scanning for occurrences of a given motif. Bioinformatics (Oxford, England) 27, 1017-1018. 37. Crooks, G.E., Hon, G., Chandonia, J.-M., and Brenner, S.E. (2004). WebLogo: a sequence logo generator. Genome research 14, 1188-1190. 38. Sun, X.-J., Wang, Z., Wang, L., Jiang, Y., Kost, N., Soong, T.D., Chen, W.-Y., Tang, Z., Nakadai, T., Elemento, O., et al. (2013). A stable transcription factor complex nucleated by oligomeric AML1–ETO controls leukaemogenesis. Nature 500, 93. 39. Su, M.Y., Steiner, L.A., Bogardus, H., Mishra, T., Schulz, V.P., Hardison, R.C., and Gallagher, P.G. (2013). Identification of biologically relevant enhancers in human erythroid cells. The Journal of biological chemistry 288, 8433-8444. 40. Leinonen, R., Sugawara, H., Shumway, M., and International Nucleotide Sequence Database, C. (2011). The sequence read archive. Nucleic acids research 39, D19-D21. 41. Li, H., and Durbin, R. (2010). Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics (Oxford, England) 26, 589-595. 42. Zhang, Y., Liu, T., Meyer, C.A., Eeckhoute, J., Johnson, D.S., Bernstein, B.E., Nusbaum, C., Myers, R.M., Brown, M., Li, W., et al. (2008). Model-based analysis of ChIP-Seq (MACS). Genome biology 9, R137-R137. 43. Mi, H., Huang, X., Muruganujan, A., Tang, H., Mills, C., Kang, D., and Thomas, P.D. (2017). PANTHER version 11: expanded annotation data from and Reactome pathways, and data analysis tool enhancements. Nucleic acids research 45, D183-D189.

31