Transcriptomics-Guided Design of Synthetic Promoters for a Mammalian System
Joseph K. Cheng¹ and Hal S. Alper¹,²,*

¹ Department of Chemical Engineering, The University of Texas at Austin, 200 E Dean Keeton St. Stop C0400, Austin, Texas 78712
² Institute for Cellular and Molecular Biology, The University of Texas at Austin, 2500 Speedway Avenue, Austin, Texas 78712
* Correspondence and requests for materials should be addressed to H.S.A. ([email protected])

Supporting Information

Accession codes of nucleotide sequences used in this manuscript

Human cytomegalovirus: M60321.1
Simian Virus 40: J02400.1
Human chromosomes, GRCh38.p2 assembly: NC_000001.11, NC_000002.12, NC_000003.12, NC_000004.12, NC_000005.10, NC_000006.12, NC_000007.14, NC_000008.11, NC_000009.12, NC_000010.11, NC_000011.10, NC_000012.12, NC_000013.11, NC_000014.9, NC_000015.10, NC_000016.10, NC_000017.11, NC_000018.10, NC_000019.10, NC_000020.11, NC_000021.9, NC_000022.11, NC_000023.11, NC_000024.10

Equations

From the derived Gaussian mixture model (GMM), we can determine the probabilities of three key concerns:
1) the false positive probability, at a particular threshold expression value (log-transformed), of belonging to the high expression group;
2) the false negative probability, at a particular threshold expression value (log-transformed), of belonging to the high expression group; and
3) the probability of observing an expression value (log-transformed), X, less than or equal to a specified expression value (e.g. the median expression value) if X belongs to the high expression group.

[Equations not recoverable from the source text.]

Supporting Tables and Graphics

Chromosome: GRCh38.p2 Primary Assembly. Regions: nucleotides considered from ref. file.

Gene      Chromosome  2000-bp region               500-bp region
EEF1A1*   6           73,521,382 - 73,520,027      73,521,074 - 73,520,027
CLUAP1    16          3,498,945 - 3,500,944        3,500,445 - 3,500,944
TPT1      13          45,343,284 - 45,341,285      45,341,784 - 45,341,285
TUBA1B    12          49,133,521 - 49,131,522      49,132,021 - 49,131,522
GGA1      22          37,606,385 - 37,608,384      37,607,885 - 37,608,384
LAIR1     19          54,372,553 - 54,370,554      54,371,053 - 54,370,554
UBC*      12          124,915,549 - 124,913,772    124,915,081 - 124,913,772
VIM       10          17,225,935 - 17,227,934      17,227,435 - 17,227,934
ITIH5     10          7,668,998 - 7,666,999        7,667,498 - 7,666,999
F2R       5           76,714,043 - 76,716,042      76,715,543 - 76,716,042
TMEM158   3           45,228,322 - 45,226,323      45,226,822 - 45,226,323
MSH3      5           80,652,648 - 80,654,647      80,654,148 - 80,654,647
CCR6      6           167,109,807 - 167,111,806    167,111,307 - 167,111,806
ACTG1     17          81,514,866 - 81,512,867      81,513,366 - 81,512,867
FTL       19          48,963,309 - 48,965,308      48,964,809 - 48,965,308
TMSB10    2           84,903,639 - 84,905,638      84,905,139 - 84,905,638
GNB2L1    5           181,245,906 - 181,243,907    181,244,406 - 181,243,907
ROCK2     2           11,346,585 - 11,344,586      11,345,085 - 11,344,586
HSP90AA1  14          102,141,749 - 102,139,750    102,140,249 - 102,139,750
EIF4A1*   17          7,571,970 - 7,572,705        7,572,206 - 7,572,705

Table S1a. Details of the 20 most highly expressed genes. * denotes an alternative promoter length considered for annotation.

Chromosome: GRCh38.p2 Primary Assembly. Regions: nucleotides considered from ref. file.

Gene      Chromosome  2000-bp region               500-bp region
SUMO4     6           149,398,359 - 149,400,358    149,399,859 - 149,400,358
MAGED4B   23, X       52,071,272 - 52,069,273      52,069,572 - 52,069,273
ARTN      1           43,931,320 - 43,933,319      43,932,820 - 43,933,319
DUSP23    1           159,778,940 - 159,780,939    159,780,440 - 159,780,939
KLC4      6           43,057,594 - 43,059,593      43,059,094 - 43,059,593
NR1I2     3           119,778,484 - 119,780,483    119,779,984 - 119,780,483
PRIM2     6           57,312,805 - 57,314,804      57,314,305 - 57,314,804
MAP2      2           209,422,009 - 209,424,008    209,423,509 - 209,424,008
SMN1      5           70,922,941 - 70,924,940      70,924,441 - 70,924,940
COL25A1   4           109,304,643 - 109,302,644    109,303,143 - 109,302,644
NPTX1     17          80,478,604 - 80,476,605      80,477,104 - 80,476,605
NPSR1     7           34,656,285 - 34,658,284      34,657,785 - 34,658,284
LRRIQ1    12          85,034,321 - 85,036,320      85,035,821 - 85,036,320
MED31     17          6,653,799 - 6,651,800        6,652,299 - 6,651,800
CACNB4    2           152,101,079 - 152,099,080    152,099,579 - 152,099,080
HDAC9     7           18,084,942 - 18,086,941      18,086,442 - 18,086,941
SHC2      19          462,996 - 460,997            461,496 - 460,997
POMC      2           25,170,690 - 25,168,691      25,169,190 - 25,168,691
AGTR1     3           148,695,871 - 148,697,870    148,697,371 - 148,697,870
FGL1      8           17,912,365 - 17,910,366      17,910,865 - 17,910,366

Table S1b. Details of 20 genes with median expression.

Chromosome: GRCh38.p2 Primary Assembly. Regions: nucleotides considered from ref. file.

Gene      Chromosome  2000-bp region               500-bp region
ALB       4           73,402,255 - 73,404,254      73,403,755 - 73,404,254
ALG8      11          78,141,660 - 78,139,661      78,140,160 - 78,139,661
ARHGEF11  1           157,048,640 - 157,046,641    157,047,140 - 157,046,641
ASPSCR1   17          81,975,550 - 81,977,549      81,977,050 - 81,977,549
CLIP1     12          122,424,632 - 122,422,633    122,423,132 - 122,422,633
CRNN      1           152,416,274 - 152,414,275    152,414,774 - 152,414,275
DRAM      12          101,875,327 - 101,877,326    101,876,827 - 101,877,326
FAM134B   5           16,619,058 - 16,617,059      16,617,558 - 16,617,059
HDHD3     9           113,379,009 - 113,377,010    113,377,509 - 113,377,010
KRTAP4-2  17          41,180,208 - 41,178,209      41,178,708 - 41,178,209
LYSMD4    15          99,735,458 - 99,733,459      99,733,958 - 99,733,459
MUC17     7           101,018,076 - 101,020,075    101,019,576 - 101,020,075
OR13F1    9           104,502,263 - 104,504,262    104,503,763 - 104,504,262
PLSCR4    3           146,253,179 - 146,251,180    146,251,679 - 146,251,180
POLR3H    22          41,546,606 - 41,544,607      41,545,106 - 41,544,607
POP2      3           191,459,163 - 191,461,162    191,460,663 - 191,461,162
SFI1      22          31,486,688 - 31,488,687      31,488,188 - 31,488,687
TAF9      5           69,372,013 - 69,370,014      69,370,513 - 69,370,014
TCP1      6           159,791,703 - 159,789,704    159,790,203 - 159,789,704
TFDP3     23, X       133,220,348 - 133,218,349    133,218,848 - 133,218,349

Table S1c. Details of 20 randomly selected genes.

Regions: nucleotides considered from ref. file.

Promoter  Chromosome  2000-bp region               500-bp region
CMV IE    viral       1 - 2,105                    534 - 1,193
SV40      viral       5,176 - 346                  5,176 - 346
HSV-TK    viral       48,633 - 47,881              48,633 - 47,881
ACTB      7           5,532,044 - 5,529,628        (not recoverable)
PGK1      23, X       78,102,169 - 78,104,168      78,103,669 - 78,104,168
GAPDH     12          6,532,405 - 6,534,404        6,533,905 - 6,534,404
COL1A2    7           94,392,561 - 94,394,560      94,394,061 - 94,394,560

Table S1d. Details of some common viral-derived and endogenous promoters.
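As a rough illustration of the three GMM-derived probabilities listed under Equations, the sketch below assumes a two-component Gaussian mixture fitted to log-transformed expression values, with one "low" and one "high" component. The function names, the specific definitions (false positive as the low-component tail at or above the threshold, false negative as the high-component mass below it), and all parameter values are assumptions made for illustration; they are not taken from the manuscript.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def false_positive_prob(threshold, low_mean, low_sd):
    """Assumed definition: P(X >= threshold | X from the low component)."""
    return 1.0 - normal_cdf(threshold, low_mean, low_sd)


def false_negative_prob(threshold, high_mean, high_sd):
    """Assumed definition: P(X < threshold | X from the high component)."""
    return normal_cdf(threshold, high_mean, high_sd)


def prob_at_or_below(value, high_mean, high_sd):
    """P(X <= value | X from the high component)."""
    return normal_cdf(value, high_mean, high_sd)


if __name__ == "__main__":
    # Hypothetical component parameters on a log-expression scale.
    low_mean, low_sd = 2.0, 1.0
    high_mean, high_sd = 6.0, 1.5
    threshold = 4.0      # hypothetical cut-off for the high expression group
    median_value = 3.0   # hypothetical median expression value

    print("false positive:", false_positive_prob(threshold, low_mean, low_sd))
    print("false negative:", false_negative_prob(threshold, high_mean, high_sd))
    print("P(X <= median | high):", prob_at_or_below(median_value, high_mean, high_sd))
```

Under these assumed definitions, raising the threshold lowers the false positive probability and raises the false negative probability, so the threshold trades one error against the other.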
Human cytomegalovirus immediate-early (CMV IE) promoter taken from M60321.1; Simian Virus 40 (SV40) promoter taken from J02400.1, reverse complement; Herpes Simplex Virus thymidine-kinase (HSV-TK) promoter taken from NC_001806.1, human herpesvirus 1, reverse complement. Endogenous sequences are taken from GRCh38.p2 primary assembly.

Transcription factor  Consensus sequence
AR                    ARGAACANNNTGTNC
BATF::JUN             RNWATGASTCA
BRCA1                 NCAACMS
CDX2                  NNGYMATAAAA
CEBPA                 ATTGCAYAAYN
CEBPB                 KATTGCAYMAY
CREB (half)           CGTCA
CREB1                 TGACGTCA
CREB1 (other)         BNBGRTGACGYN
CTCF                  YNRCCASYAGRKGGCRSYN
CTCF (short)          CCGCGNGGNGGCAG
DUX4                  TAAYYYAATCA
E2F1                  NGGGCGGGARV
E2F4                  NGGCGGGARRN
E2F6                  RGGCGGGARRN
EBF1                  NYCCCCWGGGA
EGR1                  NCNCCGCCCCCKCN
EHF                   MCTTCCTS
ELF1                  NRANCMGGAAGTG
ELK1                  NNNCCGGAAR
ELK4                  NCRCTTCCGGN
ESR1                  NNNNNAGGTCACCCTGACCY
ESR2                  AGGTCASNNTGNCCY
ESRRA                 YCAAGGTCACN
ETS1                  YWTCCK
EWSR1-FLI1            GGAAGGAAGGAAGGAAGG
FEV                   CAGGAART
FLI1                  RCAGGAAGTGR
FOS                   NNTGASTCATN
FOSL1                 RRTGASTCAKN
FOSL2                 NRRTGASTCAB
FOXA1                 TGTTTRCWYWN
FOXC1                 NNNNNGTA
FOXD1                 GTAAACAN
FOXF2                 NNANSGTAAACAAN
FOXH1                 NNNAATCCACA
FOXI1                 NNNTRTTTRTTT
FOXL1                 WNNANATA
FOXO3                 KGTAAACA
FOXP1                 NNNANGTAAACAAAN
FOXP2                 NWGTAAACARN
GABPA                 ACCGGAAGNS
GATA2                 NNNTTCTTATCTSN
GATA3                 AGAGAAGA
HIF1A::ARNT           VNACGTGN
HINFP                 NAACGTCCGC
HLF                   NRTTACRYAATN
HNF1A                 GGTTAATNATTANC
HNF1B                 YYAATRWTTAAC
HNF4A                 NTGRACTTTGNNCYN
HNF4G                 NRRGNNCAAAGKYCA
HOXA5                 CDBWAATK
HSF1                  NTTCTRGAANNTTCY
INSM1                 TGYCWGGGGGCR
IRF1                  NNNYRSTTTCACTTTCNNTTT
IRF2                  SGAAAGYGAAASCNWWWM
JUN                   NNRATGATGTMAT
JUN (var.2)           NNNNRRTGASTCAN
JUN::FOS              TGACTCA
JUNB                  NRRTGASTCAK
JUND                  NRTGASTCATN
JUND (var.2)          NNNRATGABGTCATN
KLF5                  GCCCCDCCCH
MAFF                  NCTNASTCAGCANWWWNN
MAFK                  MNNASTCAGCANWWW
MAX                   RRGCACATGK
MEF2A                 NKCTAAAAATAGMNN
MEF2C                 NNNCYAAAAATAGMN
MYC::MAX              RASCACGTGGT
MZF1_1-4              NGGGGA
MZF1_5-13             NKAGGGGKAR
NF1 (consensus)       TTGGCNNNNNNCC
NFATC2                TTTTCCA
NFE2::MAF             ATGACTCAGCANWWN
NFE2L2                RTGACWMAGCA
NFIC                  TTGGCN
NFIL3                 TTAYGTAAYNN
NFKB1                 KGGRNTTTCCM
NFYA                  AGNSYKCTGATTGGTNNR
NFYB                  NNNYNRRCCAATCAG
NHLH1                 NCGCAGCTGCGN
NKX3-1                ATACTTA
NR1H2::RXRA           AAAGGTCAAAGGTCAAC
NR2C2                 NGRGGTCARAGGTCA
NR2F1                 TGAMCTTTGVMCHT
NR4A2                 AAGGTCAC
NRF1                  GCGCNTGCGCR
PAX5                  NNGNKCAGYSRAGCRTGAC
Pax6                  TTCACGCWTSANTK
PBX1                  NCATCAATCAAW
PLAG1                 GGGGCCCWAGGG
POU2F2                NNNATTTGCATRW
PPARG                 STRGGTCACNGTGACCYANT
PPARG::RXRA           NNWGRGGTCAAAGGTCANNN
PRDM1                 NRAAAGTGAAAGTNN
REL                   BGGRNWTTCC
RELA                  GGGRRTTTCC
REST                  TTCAGCACCATGGACAGCKCC
RFX2                  GTYNCCATGGCAACNRNNN
RFX5                  YNSCMTRGCAACAGN
RORA_1                AWNNAGGTCA
RORA_2                NWWAWNTAGGTCAN
RREB1                 CCCCMAAMCAMCCMCMMMCN
RUNX1                 WWYTGYGGTWW
RUNX2                 KNKNNYTGTGGTYTK
RXR::RAR_DR5          RGKTCANNNRSAGGTCA
RXRA::VDR             GGGTCAWNGRGTTCA
SMAD2::SMAD3::SMAD4   NTGTCTGNCACCT
SOX10                 CWTTGT
SOX9                  CYATTGTTN
SP1                   NCCCCKCCCCC
SP1 (short)           CCKCCY
SP2                   GYCCCGCCYCYNNNN
SPI1                  AGGAAGT
SPIB                  WSMGGAA
SREBF1                VTCACCCCAY
SREBF2                RTGGGGTGAY
SRF                   NNTKNCCAWATAWGGNAA
SRY                   NWWAACAAT
STAT1                 NTTCYRGGAAA
STAT2::STAT1          TNAGTTTCNNTTTCY
STAT3                 NTTCYKGGAAN
TAL1::GATA1           NTTATCWNNNNNNNNCAG
TAL1::TCF3            NNAMCATCTGKT
TBP                   STATAWAWRNNNNNN
TCF7L2                NNASWTCAAAGNNN
TEAD1                 YACATTCCWSNG
TFAP2A                NNNNGCCYSAGGGCA
TFAP2C                NNNNSCCYCAGGSCN
THAP1                 YTGCCCNNA
TLX1::NFIC            TGGCASSRNGCCAA
TP53                  RCATGYCCAGACATG
TP63                  NNRCAWGYNCARRCWTGYNN
USF1                  NNCAYGTGACC
USF2                  RYCAYGTGACY
YY1                   CAARATGGCNGC
ZBTB33                NTCTCGCGAGANYTN
ZEB1                  NCWCACCTG
ZNF263                GGAGGAGGRRGRGGRGGRRRR
ZNF354C               MTCCAC

Table S2.
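The consensus sequences in Table S2 are written with IUPAC degenerate nucleotide codes (for example, W stands for A or T, and N for any base). The sketch below shows one simple way such a consensus could be scanned against a DNA sequence, by expanding each code into a regular-expression character class. It is an illustrative assumption, not the annotation method used in the manuscript: it checks the forward strand only, requires exact matches, and the helper names `consensus_to_pattern` and `find_sites` are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC_CLASSES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def consensus_to_pattern(consensus):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC_CLASSES[base] for base in consensus.upper())


def find_sites(consensus, sequence):
    """Return (start, matched_text) for every forward-strand match.

    A lookahead is used so that overlapping matches are also reported.
    Start positions are 0-based.
    """
    pattern = re.compile("(?=(" + consensus_to_pattern(consensus) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    # SOX10 consensus from Table S2 scanned against a made-up sequence.
    for start, site in find_sites("CWTTGT", "AACATTGTCCTTTGTA"):
        print(start, site)
```

Scanning the reverse strand as well would require running the same search on the reverse complement of the sequence; that step is left out here to keep the sketch short.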