Genetic and Genomics Laboratory Tools and Approaches

Meredith Yeager, PhD Cancer Genomics Research Laboratory Division of Cancer Epidemiology and Genetics [email protected]

DCEG Radiation Epidemiology and Dosimetry Course 2019

www.dceg.cancer.gov/RadEpiCourse (Recent) history of genetics

2 Sequencing of the Genome

Science 291, 1304-1351 (2001)

3 The – 2019

• ~3.3 billion bases (A, C, G, T) • ~20,000 -coding , many non-coding RNAs (~2% of the genome) • Annotation ongoing – the initial sequencing in 2001 is still being refined, assembled and annotated, even now – hg38 • Variation (polymorphism) present within – Population-specific – Cosmopolitan

4 Types of polymorphisms

. Single nucleotide polymorphisms (SNPs) . Common SNPs are defined as > 5% in at least one population . Abundant in genome (~50 million and counting)

ATGGAACGA(G/C)AGGATA(T/A)TACGCACTATGAAG(C/A)CGGTGAGAGG . Repeats of DNA (long, short, complex, simple), insertions/deletions . A small fraction of SNPs and other types of variation are very or slightly deleterious and may contribute by themselves or with other genetic or environmental factors to a phenotype or disease

5 Different mutation rates at the nucleotide level

Mutation type Mutation rate (per generation) Transition on a CpG 1.6X10-7 Transversion on a CpG 4.4X10-8 Transition: purine to purine Transition out of CpG 1.2X10-8 Transversion: purine to pyrimidine Transversion out of CpG 5.5X10-9 Substitution (average) 2.3X10-8 A and G are purines Insertion/ (average) 2.3X10-9 C and T are pyrimidines

Mutation rate (average) 2.4X10-8

. Size of haploid genome : 3.3X109 nucleotides . 80 new mutations per haploid genome per generation. . Assume 2% of genome under natural selection . About 1.6 new deleterious mutations in each gamete . Most of these deleterious mutations are recessive Nachman & Crowell, Genetics 156:297-304 (2000)

6 Confusing terminology, SNPs versus mutations

. Germline versus somatic . Inherited versus acquired . Common versus rare germline . Common variants are generally called ‘polymorphisms’ . However, all polymorphisms started as germline mutations (de novo mutations) . Singletons or exceedingly rare SNPs??

7 Most SNPs are rare

(n = 67,135,025 SNPs from 1000 Genomes)

8 7X106 Estimated number of common SNPs in the human genome as a function of their minor allele frequency 6X106 Estimated number of SNPs 5X106

4X106

3X106

2X106

106

0

>5% >10% >15% >20% >25% >30% >35% >40% >45% Minor Allele Frequency (MAF) Adapted from Reich et al. Nat Genet (2003) 9 SNPs & function . Vast majority are “silent” . No known functional change . Alter function of product . Change sequence of protein . Alter /regulation . /enhancer . mRNA stability . Small RNA binding sites . Disrupt CpG site

10 11 Which variants contribute to disease?

. Loss of function – premature stops, frame-shifts . Splice variants . changes . SNPs in regulatory regions . Structural variants . ?? . There are tools that help predict which variants might be important

12 Mapping Genetic Susceptibility

13 Trajectory of the Genetics of Cancer Susceptibility

Pre 2001 ~2004

Very few candidate genes panned out • Underpowered studies? • Bad guesses? Candidate SNP Functional Data

Candidate Genes Biological Plausibility Genetic Markers

Candidate Pathway Biological Plausibility 14 Technological advances + community efforts

15 16 17 Trajectory of the Genetics of Cancer Susceptibility (2)

Pre 2001 ~2004 2006

Candidate SNP Functional Data Genome Wide Association Studies (GWAS) Candidate Genes Biological Plausibility Genetic Markers

Candidate Pathway Biological Plausibility 18 GWAS

. Somewhat randomly-distributed SNPs are genotyped as “markers” . Capitalize on phenomenon known as linkage disequilibrium (non- random association of alleles at different loci) . “Agnostic”

19 Basic principle of genetic association studies in unrelated individuals

20 Published Cancer GWAS Etiology Hits: June 2019

TARDBP AJAP1 NOC2L ITPR3 NOTCH4 6p25.2 PEX14 RCC2 KIF1B EXOC2 UHRF1BP1HLA-DOA ATAT1 TRIM31 6p21.31HLA-G >1,100 Disease Loci marked by SNPs SERPINB6BTNL2 HCP5 6p21 6p21.1 KLHDC7A M DS2 WNT4 STG GPX6 DDX1 HLA- HLA- 2p25.1 HLA-DQFLJ22536 GRM 4 LACE1 1p34.3 DRB5 HLA DQB1 C1orf200 GNL2 2p25.2C2orf48 For > 30 “cancers” RSPO1 NOL10 TNF BAK1 C6orf106RREB1HLA-DRA HLA-A HLA-DRB6 PIK3R3 1p34.2 HIVEP3 M YCN GRHL1 GDF7 EGOT 3p26 M ICA 6p21.32 M ICB SOX4C6orf10DTNBP1CYP21A2 2 Loci marked by a CNV ADCY3 2p24.1 DAZL HLA- HLA- 1p32.3 FAF1 DTNB FOXQ1 HLA-F HIST1H1E DRB1 DPB2 GABBR1 GPN1 2p23.3 2p24 TGFBR2 SLC4A7 EOM ESITGA9 RAVER2 L1TD1 5p15.33 EXOC2 IRF4 RANBP9 C11orf21 TSPAN32 OR52N2 WDR43NCOA12p22.3 MAD1L1 CTNNB1ULK4 3p22.1 TERT/ CDKAL1PSORS1C1 CASC15 GATA4 AK5 QPCT EPAS1CYP1B1 CLPTM1L 5p15 AGR3 LSP1 PIDD1 LMO1 3p21.31 TACC3 HAUS6 CCND2 DENND5B UNC5CL AIF1 CCHCR1 ITGB8 DMRT1 IGF2 PPFIBP2 RNU6-491P NRXN1 EHBP1 THADA 5p15.1 AHRRTPPP SP8 AHR 8p23.3 NKX3-1 8p21 LARP4B MOB2 LM O4 1p22.3 CPZ 12p13.31 DOCK3 6p22.1 ATXN1 NEU1 DNAH11 DMRTA1 SPSB2 WNK1 REL 2p14 MEIS1-AS3 CREB5 SP4 EBF2 BNC2 9p21.3 LOC338591 5p15.33BASP1ELL2 HLA- NAT2 INTS10 GATA3 10p14 ETAA1 FHIT HLA-B 6p22.2 7p15 7p14.1 CDKN2A 12p13.1 ETV6 GSTM 1 ZNF638BCL11A 5p14.3 DAB2 5p13.1 DQA2 7p13 CHRNA2 CDKN2A AS1 WT1 CDKN1B BCL2L15 KCNIP4 HLA- 8p12 TNFRSF10 /B PIP4K2A DNAJC1 NEBL SLC25A26 LCORL 5p13.3 PRKAA1 TFEB CDKN1A IKZF1 TNS3 KIAA1551 ATF7IP deletion PPP1R21 FOXP1 SUGCT WAC SSPN GGCX 4p14 DQA1 8p11.23 NRG1 MTAP RN7SKP114 NAMPTL HSD17B12 11p11.2 FGF10 5p12 SLC45A2 6p12.1 6p21.32 EGFR 7p11.2 12p11.22 ITPR2 EM BP1 1p12 VGLL3 3p11 3p12 HPVC1 STK38L FADS1 EXOC1 COX18 KHDRBS2 OR8U8 11q12 10q11.23 PAX8 CM SS1 WDR52-AS1 CWC27 RAB3C PDE4D 10q11 2q13 2q14.1 CDKL2 SMARCAD1 11q13.1 MYRF INCENP KRT5 DIP2B ACVR1B DUBR 6q14.1 MYO6 ZNF365 BCL2L112q14.2 ACOXL ARID5B 10q21.2 HELQ HEL308 PDLIM5 5q11.2 5q11.1 DDX4 ABCB4ABCB1LMTK2 KRT8 PRPH 12q13 RNF115 OTUD7B NBPF23 ZBTB203q13.2 PRDM14 EGR2 11q13.3 POLD3 CCND1 TFCP2L1 ADH1B ZM Z1 AS1 ADAM15 GOLPH3LDENND4B 4q22.1 ADH6 5q12.3 12q1212q13.312q24.12 SLO2A1 3q22.1 HNF4G 8q21.13 NKX2-3 TYR 11q14.1 11q13 ATG10 ARRDC3 XRCC4 HACE1 LIN28B CHMP4C FOXE1 9q22.33 9q31.1 ATP1B3 4q24 CENPE 4q22.2 CUX1 7q21.3 ABCC2 FFAR4 RASSF3 RNU6-148P12q15 TRIM 46M UC1 ARNT NCK1 3q21 ATG5 TGFBR1 PLCE1 ATM CXCR5 KDM2A TET2 BANK1 5q14.3 ADGRV1 5q15 GPRC6A 8q22.3 9q31.3 LEF1 OLIG3 RGS22 ODF1 ENTPD7 12q21.31 12q21.2 M TX1ASH1LDUSP12 ZEB2 ZBTB38 3q23 3q25.31 10q23.33 FAS POLD3 11q22.1 POU2AF1 NR1H4 CAMK2D ECHDC1PTPRKROS1 ZFHX4 ZBTB10 ZFPM3 9q31.2 TMEM38B SYNPO2 NREP POT1 ASZ1 9q34.2 10q25.1 10q24.33 RNLS KDELC2 TMPRSS5 C11orf93 ELK3 NTN4 POLR3B CLCN6SLC25A44 ITGA6 2q24.2 M BNL1 RSRC13q25.31 ALDH7A1 LINC00536EIF3H POU5FIP1 TCFL2 CYP17A1 4q28.1 L3MBTL3 AHI1 NEDD9 KLF14 FLJ43663LINC-PINT ASTN2LM X1B9q33 OBFC1 PHLDB1 SH2B3 ACAD10 BRAP UCK2 HOXD1HOXD32q31.1 M YNNLRRIQ4M DS1 11q23.3 DLG2 PITX1TMEM173 IL3 CCAT2 CCDC26 VTI1A TCF7L2 10q25.2 NR5A2 LRRN2 SLC41A1 LOC643623HEY2 HBS1L PVT1 LAMC3 11q24.1 GRAMD1BMPZL2 TBX3 CUX2 C12orf51 CDCA7 DLX2-AS1 TNFSF10 3q26.31 TERC 4q31.21 HSPA4 CSNK1A1 RNU6-456P CASC19POU5F1B SEC16A TOR1A 10q26.12 FGFR2 PRLHR NOS1 PHLDA3 ZC3H11A ESR1 6q25.1IPCEF1 NOBOX 8q24.21 ABO 11q24.3 SCN3BMMP20 OAS3 TBX5 MDM4 CLDN11 PPP2R2B HTR3CST6GAL1 CATSPER3SPRY4 CNNM2 TCF7L2 NABP1 STAT4NUP35 FSTL54q32.3 SLC22A3 SLC22A2 PSCA ANXA13 8q24.13 LHPP MMP8 12q24.31 VTI1A SCARB1 LGR6 ESRRG DUSP10 M ECOM TAB2 SMARCD3 NCAPG2 ETS1 C11orf30 EBF1 10q26 LHPP KITLG ALS2CR12 CASP8HIBCH TP63 LPP ADAM29 CCAT1 8q24 ATM PKNOX2 SH2B3 ALDH2 PARP1 TPRG1 MIR3939 RNASET2 RGS17 HTR3B RHOU 2q33.1 C3orf21 12q24.31 TBX5 12q24.21 PCNXL2 ERBB4 PNKD 3q28 B3GAT1 GAB2 ZBTB16 GPR160 IRF2 ZFP42 5q35.1 BOD1 6q27 KATNA1 BARD1 LRRC34 DOCK2 FBRSL1 MSRB3 EXO1 2q35 DIRC3 COL23A1 SATB2 EFCAB2 2q36.3 SP140

ZNF257 ZNF728 FARP2 BOK-AS1 MLPH ZNF726 SHROOM2 GPR143 CLEC16A NFIX1 MYO9B TCF3 C20orf54 BM P2 ADCY9 LM F1 ZBTB7A MPG NXN SM G6 CHD3 PLIN3 TSPAN16 TGM 3 MCM8 HAO1 CIITA MPV17L BCAR4 MERIT40 19p13.12 19p13.11 20p12.2 20p12.3 TP53 17p13.1 DBIL5P CLUL1 JAG1 TNFRSF13B 19p12 20p11.22 MAPK3 PIRT PTPN2 FAM38B GATAD2A ELL KIZ NUDT10 XAGE3

GPATCH1 RHPN2 TNFRSF19 IKZF3 ATAD5HNF1B CHST918q11.2 GAREM CCNE1 RALY BP1FB1 CDC91L1 22q12.3 M IPEP PDX1 CEBPE MMP14NKX2-1 OCA2HERC2 GREM1 NRIP1 GRIK1 XBP1 MTMR3 AR 17q11.2 TCF2 17q12 KCCN4 GIPR 16q12.1 TOX3 FTO SLC14A1 PRKD2 20q13.12 PREX1 LAMA5 BRCA2 FRY PAX9 ATG14 BM F CRAC1 15q22.33 CDH2SM AD7 21q22.12 RUNX1 BACH1 CBX6 EMID1 CHEK2 NLGN3 SLC7A M BIP CNTNAP1 GSDMB HAP1 CYP2A6 IFNL319q13.2 13q13.3 SEMA6D CASC5 CHRNA5 16q12.2 AM FR HEATR3 GAREM1 MBD2 20q13.13 RPS21 20q13.33 TFF1 22q12.1 ZNRF3 TBX1 KANSL1 COX11 GRP THEG5 LINC00111 MX2 DDHD1 SIX1 RFX7 17q21.31 TGFB1SYM PK BMP4 RFX7 CYP19A1 CDH1 RFWD3 GPR56 SETBP1 RTEL1 CYP24A1 DNMT3B M KL1 PLA2G6 XRCC6 DCC M BD1 ATP5SL KLK3 TMPRSS2 MCM3APAS ESR2 TTC9 17q21.32 BRIP1 NGFR 19q13 14q23.1 CHRNA3 MAP2K1 PRTG PHLPP2 16q23.3 ZFPM1 ADNP 22q13.31 13q22 PMAIP1 18q21.1 LILRA3 POU2F2 19q11 GTPBP5 STMN3 SCO2 SLC16A8 RAD51B DPF3 BPTF ZNF652 RNF43 BCL2 CLK3 PCAT29 RPLP1 CDYL2 M C1R IRF8 BCAS1 22q13.1 CBX7 TTC28 13q22.1 KLF5 KLF12 TSHZ1 TCF4 OACYLP NLRP12 19q13.43 20q13.2 20q13.13 17q25.3 LINC00673 17q24 PRC1ANP32A 16q24.2 22q13 CCDC88C RIN3 ETFA FOXL1 16q24.1 SALL3 GNAS NCOA5 20q11.21 FAM19A5 AIFM3 SLC28A1SM AD3 DEF8 BCAR1 TEX14 SKAP1 EVI2A UBAC2 ADSSL1 16q23.2 AKT1 17q24.3 M CF2LUPF3A IRS2 13q34 TKTL1

36 Basal Cell 17 Bladder 184Breast 15 Cervical 81 CLL 122Colorectal 15Cutaneous Sq 22 Endometrial 29 Esophageal 6 Ewing Sarcoma 3 Gallbladder

14 Gastric 38 Glioma 29 Hodgkins 13 Kidney 7 Liver 48 Lung 22 Melanoma 103 Multiple 23Multiple Myeloma 14 Nasopharyngeal 18 Neuroblastoma 15Non-Hodgkin 10 Oral 2 Osteosarcoma 39 Ovary 28 Pancreas 16 Pediatric ALL 170Prostate 60 Testicular 17 3 Wilms

21 Trajectory of the Genetics of Cancer Susceptibility (3)

Pre 2001 ~2004 2006 2010

Whole Genome Sequencing Candidate SNP Functional Data Genome Wide Association Exome Studies (GWAS) Sequencing Candidate Genes Biological Plausibility Genetic Markers

Candidate Pathway Biological Plausibility Regional Sequencing GWAS & Linkage 22 23 24 25 https://gnomad.broadinstitute.org/

125,748 exomes 15,708 whole genomes

26 Examples: Genetic studies

. Genome-wide association studies – use LD to find SNPs associated with a trait or disease. . Most GWAS “hits” intergenic . Most associations relatively weak (OR ~ 1.2) with common SNPs . Culprit probably a variant in a regulatory region? . Exome sequencing studies – sequence the protein-coding regions of the genome . Rare variant of strong effect

27 Examples: Genetic studies, continued

. Target-gene(s) sequencing to study only a particular gene or set of genes . Example: BRCA1/BRCA2 . Regional sequencing of a contiguous part of a . Example: GWAS follow-up . Whole genome sequencing . Look everywhere in the genome; structural variation detection

28 NGS Data Analysis: Like panning for gold

Generate high-quality raw DATA sequences

Reassemble reads to so they represent underlying INFORMATION biology more closely

Enable hypothesis KNOWLEDGE & generation and testing UNDERSTANDING (HOPEFULLY) tailored to specific scientific questions

Golden Helix: A Hitchhiker's Guide to Next Generation Sequencing http://gettinggeneticsdone.blogspot.com/2011/05/golden-helix-hitchhikers-guide-to-next.html 29 “Data Cleaning”: A Process of Information Losing

Have we got ALL the right variants? Are we really discarding only dirt and not gold?

Sequencing

Mapping

Variant Calling The Hidden Truth Filtering

Association Our Discovery

Variant Functional Larger-population Validation Studies Follow up Have we got the TRUE variants?

Data management and quality assurance along the WHOLE process is the key to success! 30 Other Challenges . Computational requirements . Sequence data storage (many terabytes of data) . Raw trace base calling and genomic mapping . Variant detection and annotation . In silico functional prediction . Libraries of in-house and public variation – when to use and how?

31 More things to consider

. Study design – laboratory point of view is also important . Case / control . Distribute cases and controls evenly on each plate . Perform same assay on cases and controls . Genetic ancestry is important

32 Population Structure (Admixture)

33 Many SNPs have different allele frequencies in different populations

34 “The philosophy is straightforward: The more easily smart people can see the data, the more likely they are to make discoveries that can benefit us all.”

https://www.npr.org/sections/health-shots/2019/09/06/755402750/how-sho uld-scientists-access-to-health-databanks-be-managed 35 36 Conclusions

. Technological advances and community efforts have allowed for many ways to affordably approach our questions of disease . Multiple technologies and assays are available in order to best explore biological questions . There are ever-growing issues in analyzing and storing large datasets . Study design from a laboratory point-of-view is (and always will be) important . Data sharing is important for the advancement of science

37 U.S. Department of Health & Human Services National Institutes of Health | National Cancer Institute

dceg.cancer.gov/

1 - 800- 4 - CANCER Produced September 2019