Functional characterization of genetic variants in human disease
Vincent Gant, Ph.D.
Post-doctoral associate, Rabin Lab

Nov. 16th 2018

Inherited genetic susceptibility to cancer

• Familial clustering of cancer has long been observed and is indicative of inherited susceptibility to cancer
• Epidemiological studies in the 1940s → 3-fold increase in the risk of most common cancers in first-degree relatives of patients.1
• Twin studies → a higher concordance in monozygotic twins compared with dizygotic twins for many common cancers.2
• Inherited genetic susceptibility makes a significant contribution to the familial clustering of cancer, as opposed to shared lifestyle or environmental risk factors.

1. Houlston R, Peto J. Genetics and the common cancers. In: Eeles R, Ponder BA, Easton DF, Horwich A (eds). Genetic Predisposition of Cancer, Vol. 1. London: Chapman and Hall, 1996.
2. Lichtenstein P, Holm NV, Verkasalo PK et al. Environmental and heritable factors in the causation of cancer—analyses of cohorts of twins from Sweden, Denmark, and Finland. N Engl J Med 2000;343:78–85.

Inherited genetic susceptibility to cancer

• Family-based linkage studies and positional cloning in the 1980s successfully identified high-risk loci.
• Genes of note include:
  • BRCA1 and BRCA2, associated with risk of breast and ovarian cancers.3,4
  • APC, MLH1, and MSH2, associated with colorectal cancer risk.5-7
  • CDKN2A, associated with melanoma and pediatric acute lymphoblastic leukemia risk.8,9

3. Hall JM, Lee MK, Newman B et al. Linkage of early-onset familial breast cancer to chromosome 17q21. Science 1990;250:1684–9.
4. Wooster R, Neuhausen SL, Mangion J et al. Localization of a breast cancer susceptibility gene, BRCA2, to chromosome 13q12–13. Science 1994;265:2088–90.
5. Bodmer WF, Bailey CJ, Bodmer J et al. Localization of the gene for familial adenomatous polyposis on chromosome 5. Nature 1987;328:614–6.
6. Peltomaki P, Aaltonen LA, Sistonen P et al. Genetic mapping of a locus predisposing to human colorectal cancer. Science 1993;260:810–2.
7. Lindblom A, Tannergard P, Werelius B et al. Genetic mapping of a second locus predisposing to hereditary non-polyposis colon cancer. Nat Genet 1993;5:279–82.
8. Cannon-Albright LA, Goldgar DE, Meyer LJ et al. Assignment of a locus for familial melanoma, MLM, to chromosome 9p13-p22. Science 1992;258:1148–52.
9. Sherborne et al. Variation in CDKN2A at 9p21.3 influences childhood acute lymphoblastic leukemia risk. Nat Genet 2010;42:492–94.

Cancer susceptibility genes (CSGs)

• Most of these genes are CSGs with very high penetrance
• CSGs conform to the two-hit model of cancer susceptibility, or Mendelian predisposition
  • See BRCA1 and BRCA2
• As of 2017, more than 70 CSGs associated with high-penetrance cancer susceptibility syndromes have been identified.10
• Although the risk of cancer conferred by these loci is substantial, a significant proportion of the excess familial risk remains unexplained.

10. M. L. et al. Monogenic and polygenic determinants of sarcoma risk: an international genetic study. Lancet Oncol. 17, 1261–1271 (2016).

Cancer susceptibility genes (CSGs)

• High-penetrance mutations are responsible for most breast cancer cases in families with more than three affected individuals → Mendelian inheritance
• However, these mutations are responsible for only a minority of cases in families with two affected individuals.11–12
• Mutations in known predisposition genes, including BRCA1/2, account for <25% of the 2-fold excess risk in the relatives of patients with breast cancer
• Where does the rest come from?

11. Peto J, Collins N, Barfoot R et al. Prevalence of BRCA1 and BRCA2 gene mutations in patients with early-onset breast cancer. J Natl Cancer Inst 1999;91:943–9.
12. Anglian Breast Cancer Study Group. Prevalence and penetrance of BRCA1 and BRCA2 mutations in a population-based series of breast cancer cases. Br J Cancer 2000;83:1301–8.

Cancer susceptibility genes

• The findings from the twin studies, and the fact that we have not identified more genes with profiles similar to BRCA1/2, suggest that most of the missing heritable risk is polygenic
• Co-inheritance of medium-penetrance mutations and low-penetrance variants contributes to cumulative heritable risk for cancer in the population

Genetic architecture of cancer risk

Adapted from Sud et al. Nat Reviews, 2017

Single nucleotide polymorphisms (SNPs)

• SNPs are variations at single positions in the genome between individuals
• If >1% of the population carries a different base at a given position, the variant is considered a SNP
• SNPs vs mutations
  • SNPs
    • Common genetic variation in the genome
    • Can occur anywhere in the genome and are not usually pathogenic
    • Heritable
  • Mutations
    • Acquired changes to the genetic sequence
    • Mutations can arise from errors during DNA replication or from exposure to mutagens or radiation (ionizing radiation and UV light)
    • Many mutations are somatic, but some mutations can be heritable if they occur in the germ line

Single nucleotide polymorphisms (SNPs)
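The >1% rule stated above can be sketched in Python. This is a minimal illustration rather than a standard pipeline: it reads the rule as "minor allele frequency above 0.01", which is an assumption, and the function names and genotype counts are hypothetical.

```python
def alt_allele_frequency(n_ref_hom: int, n_het: int, n_alt_hom: int) -> float:
    """Frequency of the alternate allele from diploid genotype counts."""
    total_alleles = 2 * (n_ref_hom + n_het + n_alt_hom)
    if total_alleles == 0:
        raise ValueError("no genotypes supplied")
    # Heterozygotes carry one alternate allele, alt homozygotes carry two.
    return (n_het + 2 * n_alt_hom) / total_alleles


def is_snp(n_ref_hom: int, n_het: int, n_alt_hom: int, threshold: float = 0.01) -> bool:
    """Apply the >1% rule to the less common (minor) allele."""
    alt = alt_allele_frequency(n_ref_hom, n_het, n_alt_hom)
    minor = min(alt, 1.0 - alt)
    return minor > threshold


# Hypothetical counts: 900 ref/ref, 95 ref/alt and 5 alt/alt individuals.
print(alt_allele_frequency(900, 95, 5))
print(is_snp(900, 95, 5))
```

With these made-up counts the alternate allele makes up 105 of 2000 alleles, so the position would pass the 1% cutoff.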

• SNPs do not always cause a disorder, but they can be associated with certain diseases
• These associations can be studied to understand the mechanism underlying an individual’s genetic predisposition
• SNPs are connected by linkage disequilibrium (LD)

• LD is the non-random association of alleles at different sites in a given population
• Alleles in high LD co-segregate and are inherited together more often than expected by chance
• During prophase I of meiosis
  • Pairs of homologous chromosomes undergo synapsis
  • Chromosomal material is exchanged during recombination
  • Whole blocks of genes are crossed over
  • Whole blocks of SNPs are crossed over

Linkage disequilibrium
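LD between two sites is commonly summarized with the statistics D, D′ and r². The sketch below computes them from haplotype and allele frequencies. The function name `ld_stats` and the example frequencies are hypothetical, and D′ is reported as an absolute value, which is one of two common conventions.

```python
def ld_stats(p_ab: float, p_a: float, p_b: float) -> dict:
    """Return D, |D'| and r^2 for two biallelic sites.

    p_ab: frequency of the haplotype carrying allele A (site 1) and allele B (site 2)
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # deviation from independent assortment
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return {"D": d, "D_prime": d_prime, "r2": r2}


# Hypothetical frequencies: both alleles at 50%, AB haplotype at 40%.
print(ld_stats(0.40, 0.50, 0.50))
```

If the two alleles were independent, the AB haplotype would sit at 25% and all three statistics would be zero. The excess here (40%) is what "non-random association" means in practice.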

• Because recombination exchanges whole blocks, clustered SNPs travel together in blocks that are co-inherited more often than expected by chance

The genome-wide association study (GWAS)

• GWAS is an agnostic case-control study comparing genetic variation between people with a particular condition and unaffected individuals
• The markers compared are SNPs, which form haplotypes across the genome
• GWAS have been completed for many diseases or phenotypes
  • Abdominal aortic aneurysms
  • Acute lymphoblastic leukemia
  • Body mass and obesity
  • Celiac disease
  • Parkinson’s disease
  • White matter integrity

Adapted from Sud et al. Nat Reviews, 2017

Example of GWAS data in a Manhattan plot

[Manhattan plot; labeled peak: rs10761598, ARID5B]

The genome-wide association study (GWAS)

• Made possible by advances in…
  • Sequencing technologies
  • Our understanding of SNPs and LD structure
  • Imputation of untyped genotypes using sequence reference panels from the 1000 Genomes Project, the UK10K consortium, and the Haplotype Reference Consortium
• Underlying basis is the non-random association of alleles, LD

Cancer risk loci identified by GWAS

• Over the last decade, multiple cancer risk loci in the European population have been identified by GWAS
• Cancer risk loci specific to East Asian and African American populations have also been identified
• As of 2017, over 430 cancer associations at 262 distinct genomic positions have been discovered by GWAS
• Breast and prostate cancers have yielded the most risk loci

Clinical relevance of GWAS

Adapted from Sud et al. Nat Reviews, 2017

Here’s the GWAS data!

rs10761598 ARID5B

Adapted from Trevino et al. Nat Genetics, 2009

What is the data trying to show us?

• Alternate allele of SNP 1 (risk allele) is significantly enriched in patients

rs10761598

Taking a closer look at the data
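Enrichment of a risk allele in patients is typically assessed on a 2x2 table of allele counts in cases and controls. The sketch below computes an odds ratio and a Pearson chi-square test with one degree of freedom. It is a simplified illustration: real GWAS analyses usually use regression with covariates, and the counts and the function name `allelic_test` are hypothetical.

```python
import math


def allelic_test(case_risk: int, case_other: int, ctrl_risk: int, ctrl_other: int):
    """Odds ratio, chi-square statistic (1 df) and p-value for a 2x2 allele-count table."""
    a, b, c, d = case_risk, case_other, ctrl_risk, ctrl_other
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this simple test")
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square distribution with 1 df equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value


# Hypothetical allele counts: 600 risk / 400 other in cases, 500 / 500 in controls.
odds_ratio, chi2, p_value = allelic_test(600, 400, 500, 500)
print(odds_ratio, chi2, p_value)
```

An odds ratio above 1 with a small p-value is what "significantly enriched in patients" refers to. A Manhattan plot shows one such p-value per SNP across the genome.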

rs10761598

• Tag SNPs are detected in cases and controls using massive genotyping arrays
• Tag SNPs have been previously identified and selected to serve as proxies for nearby SNPs based on LD studies
• Tag SNPs will direct an investigator’s attention to the region

Taking a closer look at the data

NR (non-risk) allele is not associated

R (risk) allele is associated

Individual 1
ATCGTCGACGACTGACGTCTCATGAACGATCGTCATGCTTTACCAGCTCAGCATTATTGCATGCCGCTAGCTGACGCTATGATTAGCATGCA

Individual 2
ATCGTTGACGACTGACGTCTCATGCACGATCGTCATGCTTTACTAGCTCAGCATTATTGCATGCCACTAGCTGACGCTATGATTAGCNATGCA

Taking a closer look at the data
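The differences between the two individuals can be located programmatically. The sketch below compares only the first 30 bases of each sequence shown above, because Individual 2's sequence appears to contain an extra N near the end and the full strings would need to be aligned before a position-by-position comparison. The function name `mismatch_positions` is hypothetical.

```python
def mismatch_positions(seq_a: str, seq_b: str) -> list:
    """1-based positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (align them first)")
    return [i + 1 for i, (x, y) in enumerate(zip(seq_a, seq_b)) if x != y]


# First 30 bases of each individual's sequence from the slide.
individual_1 = "ATCGTCGACGACTGACGTCTCATGAACGAT"
individual_2 = "ATCGTTGACGACTGACGTCTCATGCACGAT"

print(mismatch_positions(individual_1, individual_2))  # [6, 25]
```

Two variant positions within 30 bases illustrate why a single tag SNP stands in for a larger set of co-inherited differences.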

• LD structure of each allele reveals more differences than a single base
• At a haplotype scale, variation between alleles results in significant sequence differences
• The consequences of variation on basic genetic processes are manifold

Individual 1
ATCGTCGACGACTG CGTCTCATGAACGATCGTCATGCTTTACCAGCTCAGCATTATTGCATGCCGCTAGCTGACGCTATGATTAGCATGCA

Individual 2
ATCGTTGACGACTGACGTCTCATGCACGATCGTCATGCTTTACTAGCTCAGCATTATTGCATGCCACTAGCTGACGCTATGATTAGCNATGCA

Motivation

• GWAS highlight an associated region of the genome, or some set of haplotypes, using tag SNPs
• These data will either identify new genes/pathways in disease or confirm those previously characterized
• We can use GWAS data to study a stretch of DNA near the SNP cluster to identify genes and regulatory regions that contribute to disease risk

Motivation

• The majority of tag SNPs identified in GWAS map to non-coding regions of the genome
• We can use data from ENCODE to annotate non-coding regions in order to predict the function of a disease-associated non-coding variant
• The tag SNP may not be causal, but it can help us identify associated variants that are in LD with it (challenging)

Identifying associated variants

• HaploReg and RegulomeDB
  • https://pubs.broadinstitute.org/mammals/haploreg/haploreg.php
  • http://www.regulomedb.org/
• Starting points to consolidate haplotype data that is linked to ENCODE data
• Identify haplotype blocks of associated variants
• Begin to functionally annotate the region around your risk allele
• Make use of tools that exploit haplotype structure of the genome

Conceptual example – chromatin states

Lead tag SNP

• Highly significant association of a SNP with a particular disease
• Maps to an intergenic region with no obvious or previously studied function
• Functional annotation of the region would help
  • Promoter?
  • Enhancer?

Conceptual example – chromatin states

Lead tag SNP

• Chromatin state data from a relevant cell line can inform on a potential mechanism underlying the association
  • Enhancer – decreased gene activation, factor binding, chromatin looping
  • Promoter – decreased transcription
• But what if the SNP still fails to hit?

[Figure: chromatin state tracks for Cell line 1, Cell line 2, and Cell line 3, annotated "?"]

Conceptual example – chromatin states

Associated SNPs

• HaploReg identifies SNPs in LD with tag SNPs
  • Increases your repertoire of hits
  • Some might overlap with regions of functional importance

[Figure: chromatin state tracks for Cell line 1, Cell line 2, and Cell line 3, with associated SNPs marked across a haplotype block]

HaploReg – Annotating linked variants
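Expanding a tag SNP into the set of variants in LD with it amounts to filtering a table of pairwise LD values by a threshold. The sketch below shows that step only. The SNP IDs and r² values are placeholders invented for the example, the function name `proxies` is hypothetical, and the 0.8 cutoff is a commonly used value assumed here rather than taken from the slides.

```python
def proxies(tag_snp: str, ld_table: list, min_r2: float = 0.8) -> list:
    """SNP IDs whose r^2 with the tag SNP meets the threshold, highest r^2 first."""
    hits = [(snp, r2) for tag, snp, r2 in ld_table if tag == tag_snp and r2 >= min_r2]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return [snp for snp, _ in hits]


# Placeholder LD table: (tag SNP, linked SNP, r^2). All values are invented.
ld_table = [
    ("rsTAG", "rsB", 0.85),
    ("rsTAG", "rsA", 0.95),
    ("rsTAG", "rsC", 0.40),
    ("rsOTHER", "rsD", 0.99),
]

print(proxies("rsTAG", ld_table))  # ['rsA', 'rsB']
```

Each SNP returned is a further candidate to check against functional annotation, which is how the lookup increases the repertoire of hits.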

• Navigate to: https://pubs.broadinstitute.org/mammals/haploreg/haploreg.php
  • Google: HaploReg
  • Click 1st link – Broad Institute
• Type into Query: rs10761598
• Click submit
• Scroll to and click on SNP rs7090445

https://pubs.broadinstitute.org/mammals/haploreg/detail_v4.1.php?query=&id=rs7090445

Potentially a regulatory region in hematopoietic tissues

Proteins relevant to transcription and immune cell signaling bind in immortalized B cells

Evidence for MEF2 binding at this SNP

SNP was previously identified in a GWAS

Alt allele possibly alters a motif for MEF2…

HaploReg
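The detail-page link shown for rs7090445 follows a simple pattern, so links for several SNPs can be generated instead of clicked through one by one. The sketch below copies the query-string layout from that link. Whether every rsID resolves to a page is an assumption, and the helper name `haploreg_detail_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base address taken from the detail link shown in the slides.
HAPLOREG_DETAIL = "https://pubs.broadinstitute.org/mammals/haploreg/detail_v4.1.php"


def haploreg_detail_url(rsid: str) -> str:
    """Build a detail-page URL for one SNP, mirroring the link format in the slides."""
    if not (rsid.startswith("rs") and rsid[2:].isdigit()):
        raise ValueError(f"not an rsID: {rsid!r}")
    # The example link carries an empty 'query' parameter followed by 'id'.
    return f"{HAPLOREG_DETAIL}?{urlencode({'query': '', 'id': rsid})}"


print(haploreg_detail_url("rs7090445"))
```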

• Using HaploReg we identified a group of SNPs that may act as the causal variants and that point to potential future experiments
• To gain a clearer picture of the association signal and the identified associated variant, we will functionally annotate the region using a genome browser that has access to epigenetic data from ENCODE

WashU EpiGenome Browser – Functional annotation of a

• Navigate to: http://epigenomegateway.wustl.edu/legacy
  • The old version of the browser – the one I know how to use!
• https://epigenomegateway.wustl.edu/
  • This is the new version of the browser – the one I don’t know how to use… because they updated it last night… while this presentation was being made
• Google: WashU Epigenome
  • Click 1st link

Load data tracks
• Encyclopedia of DNA Elements (ENCODE) data
• Long-range chromatin interaction experiments

Right side (Cells and tissues)
• Expand Adult cells/tissues
• Expand Blood
• Expand Lymphocyte > GM12878 is a Tier 1 cell line, an EBV-transformed B cell line

Top (Assay tracks)
• Expand epigenetic mark
• Expand histone mark
• Expand H3
• Expand H3K27 and load “Bernstein GM12878 H3K27Ac” track
• Expand H3K4 and load “Bernstein GM12878 H3K4me1” track and “Bernstein GM12878 H3K4me3” track
• Expand Other epigenetic mark
• Under DNase I hypersensitivity, load “Duke DnaseSeq GM12878”
• Under FaireSeq, load “UNC FaireSeq GM12878”
• Under chromHMM, load “GM12878 chromatin state”
• Expand Transcription Regulator
• Expand Other Transcription Regulator
• Under CTCF, load “Bernstein GM12878 CTCF” track
• Under Pol2, load “UT-A ChipSeq GM12878 Pol2” track
• Under MEF2C, load “HudsonAlpha ChipSeq GM12878 MEF2C” track
• Expand Transcription Factor Binding Sites
• Under NFKB, load “Stanford ChipSeq GM12878 TNFa”
• Expand Long Range Interaction
• Under ChIA-PET, load “ChIA-PET GM12878 CTCF” – make sure it is from Tang et al by clicking info button “i”

Reorganize tracks (personal preference)
• Click the X to close the tracks matrix, search “chr10:63721176”, highlight SNP, zoom out 5-fold 5 times, click and drag around the gene to zoom in
• Delete RepeatMasker track
• Click the gear > Expand Track name width 5x
• Genes
• Chromatin state
• DNase-seq
• Faire-seq
• Histone markers
• TFs
• Chromatin remodeling
• ChIA-PET – right-click and click “Full”

Resize tracks
• Holding SHIFT, left-click each track; it will turn yellow when selected
• Release SHIFT
• Right-click one of the selected tracks
• Click configure
• Adjust height until histograms are shown (click Big “+” 3x)
• Select Smooth window option
• Right-click one of the yellow highlighted track names
• Click cancel multiple select
• Right-click single tracks to adjust their display individually to taste (color, size, background color, smoothness etc)
• Save yourself the headache, use this session: vGWjEsoHE5 > Ginger the chicken

Analyze the data
• SNP maps to a region exhibiting strong enhancer qualities
  • Peaks for DNase-seq and FAIRE-seq – open, accessible chromatin
  • H3 modifications that mark enhancers and open chromatin
• SNP overlaps binding of MEF2C and NFKB within an enhancer
  • Variation at this SNP could modulate binding by these factors
• Enhancers interact with their cognate promoters through long-range chromatin interactions that are insulated
  • CTCF and SMC3 peaks flank the SNP
  • These proteins are part of the CTCF-cohesin complex that mediates chromatin structure for insulation
  • Might the SNP be involved in long-range chromatin interactions?

A construct of data
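The browser analysis above asks, track by track, whether a peak covers the SNP. The same check can be sketched as an interval overlap test. The SNP coordinate is the one searched in the browser steps, but the peak intervals are hypothetical values made up for the example, as is the function name `overlapping_tracks`.

```python
def overlapping_tracks(position: int, peaks_by_track: dict) -> list:
    """Names of tracks with at least one peak covering the position (inclusive ends)."""
    return sorted(
        track
        for track, peaks in peaks_by_track.items()
        if any(start <= position <= end for start, end in peaks)
    )


snp_position = 63721176  # chr10 coordinate searched in the browser

# Hypothetical peak intervals on chr10 as (start, end); not real ENCODE calls.
peaks_by_track = {
    "H3K27Ac": [(63720500, 63722000)],
    "DNase-seq": [(63721100, 63721300)],
    "MEF2C": [(63721150, 63721200)],
    "H3K4me3": [(63800000, 63801000)],
}

print(overlapping_tracks(snp_position, peaks_by_track))  # ['DNase-seq', 'H3K27Ac', 'MEF2C']
```

In practice the intervals would come from the peak files behind each loaded track, and the same function could be run for every SNP in the haplotype block.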

• These lines of evidence suggest regulatory mechanisms by which the SNPs from this GWAS may affect an individual’s risk for ALL
• While each piece of evidence is relatively weak on its own, together they point to further experiments by which molecular biologists could more definitively establish mechanisms
• For example, the clustering of enhancer markers around the SNP in hematopoietic tissues suggests global differential ARID5B regulation in bone marrow HSCs, which has been associated with ALL in many studies. It also suggests a tissue in which to study gene expression directly in animal models.
• ChIP-seq and motif data suggest testing whether MEF2C, NFKB, and Pol2 bind differentially to the alleles of rs7090445, and the strong motif data suggest testing whether HOX binds differentially to rs7090445 in a BM HSC model.
• Finally, ChIP-seq and chromatin state data suggest experiments to dissect the mechanism of ARID5B differential expression, perhaps modulated by MEF2C or NFKB at the enhancer marked by rs7090445. It may also be useful to perform ALL-relevant proliferation and differentiation assays in BM HSCs of mice with MEF2 or NFKB knockdown (KD) or CRISPR knockout (KO)

Examples of experiments

• Luciferase reporter assay (Dual luciferase reporter assay, Promega)
  • Clone the risk and non-risk variants of a regulatory region into a luciferase expression vector
  • Transfect vectors into relevant cells
  • Lyse cells and measure luminescence as an indirect readout of luciferase expression driven by the regulatory element

Examples of experiments
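Readings from a dual luciferase assay are usually analyzed by normalizing the firefly signal to the co-transfected Renilla control in each well, then comparing constructs. The sketch below follows that approach. The luminescence values are hypothetical, the function names are illustrative, and a real analysis would add replicates and a statistical test.

```python
from statistics import mean


def normalized_activity(firefly: list, renilla: list) -> list:
    """Per-well firefly/Renilla ratios, which correct for transfection efficiency."""
    if len(firefly) != len(renilla):
        raise ValueError("need one Renilla reading per firefly reading")
    return [f / r for f, r in zip(firefly, renilla)]


def fold_change(risk_wells: list, non_risk_wells: list) -> float:
    """Mean normalized activity of the risk construct relative to the non-risk construct."""
    return mean(risk_wells) / mean(non_risk_wells)


# Hypothetical luminescence readings from three wells per construct.
non_risk = normalized_activity([1000.0, 1200.0, 1100.0], [100.0, 100.0, 100.0])
risk = normalized_activity([500.0, 600.0, 550.0], [100.0, 100.0, 100.0])

print(fold_change(risk, non_risk))
```

A fold change below 1, as with these made-up numbers, would be consistent with the risk allele weakening the regulatory element.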

• Electrophoretic mobility shift assay (EMSA)
  • Mix nuclear protein with fluorescent oligos encoding the R and NR variants of a regulatory region
  • Include an antibody to do a super-shift if you suspect binding by a specific factor
  • Resolve DNA-protein complexes on a gel and observe on a fluorescence imager (Li-Cor)

Conclusion

• GWAS are agnostic studies that identify germline variants associated with complex human diseases
  • We are leaving behind the days when we focused on single harmful mutations
  • We are entering a time when we are beginning to understand the complex genetic programming underlying disease etiology
• In addition to being massive accomplishments when completed, GWAS also serve as springboards to functionally characterize mechanisms of heritable risk for disease
• LD and haplotype blocks are key to deciphering GWAS data
• HaploReg should be your go-to tool to expand your understanding of a particular risk locus
• WashU EpiGenome Browser is indispensable for gaining a clearer picture of the structure and function of risk loci
• Use these tools to develop testable hypotheses and relevant tissue and animal models