Statistical Power for RNA-Seq Data to Detect Two Epigenetic Phenomena

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Dao-Peng Chen, B.B.A., M.S.
Graduate Program in Statistics
The Ohio State University
2013

Dissertation Committee:
Prof. Shili Lin, Advisor
Prof. Dennis K. Pearl
Prof. Asuman Turkmen

© Copyright by Dao-Peng Chen 2013

Abstract

Epigenetics is the study of heritable changes in gene expression or cellular phenotype caused by mechanisms other than changes to the underlying DNA sequence. Two epigenetic phenomena, genomic imprinting and allelic expression imbalance (AEI), are discussed in this dissertation. Genomic imprinting is an epigenetically regulated process by which imprinted genes are expressed in a parent-of-origin-specific manner, and AEI refers to asymmetric expression of two different alleles at the same locus.

Many analysis tools have been used to investigate these two phenomena. Among these tools, RNA-seq is a powerful new technology for mapping and quantifying transcriptomes using ultra-high-throughput next-generation sequencing technologies. Using RNA-seq, a study can investigate genomic imprinting and AEI genome-wide without prior knowledge of genes or coding regions. Compared to microarray hybridization-based methods, RNA-seq does not have background noise due to hybridization. Data from RNA-seq experiments are digital, so they do not have a limited range of signals. Compared to the traditional sequence-based methods, RNA-seq is more economical, which makes genome-wide mapping feasible. Nevertheless, RNA-seq has its own limitations, such as errors in base calling, read mapping uncertainty in the genome, and biases from transcript length and sequence base composition.
In this dissertation, we focus on investigating how sequencing parameters may affect the power of tests for detecting imprinting and/or AEI, and whether the current technology can provide sufficient power for such an endeavor for mouse and human data. Since existing methods in the literature are not amenable to detecting such effects, and since these two effects may be confounded with one another, we also propose a joint test for simultaneous detection of imprinting and AEI. For mouse data, the reciprocal cross design and two definitions of informative reads based on the binomial distribution are used throughout this dissertation. The proposed joint test and the two-chi-squares test in the literature are used for power calculation and a simulation study, and their results are compared and contrasted. The results show that the joint test is not only applicable for simultaneous detection of the two epigenetic effects, but is also more powerful than the two-chi-squares test. Furthermore, we note that the formula for power calculation in terms of sequencing parameters (sequencing depth and read length) and sequence divergence is applicable to other tests, not just to the joint test.

We provide theoretical power under some combinations of parameters. If an informative read is defined as covering at least one SNP for the mouse reciprocal cross design, a sequencing depth of E(T) = 40 is necessary to achieve sufficient power under read length l = 100 and sequence divergence d = 1%, especially when the imprinting and AEI effects (p1, p2) are moderate. Fixing E(T) = 30 and d = 1%, the power does not improve much when l > 100. This suggests that increasing sequencing depth, not read length, is the key to improving power, although it can be expensive.
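The first definition of an informative read (a read covering at least one SNP) can be illustrated with a small calculation. The sketch below is not the dissertation's formula: it assumes, purely for illustration, that each base is a SNP independently with probability d, so that a read of length l covers at least one SNP with probability 1 - (1 - d)^l, and it multiplies that probability by the mean depth E(T) to get an expected count of informative reads. The function names are hypothetical.

```python
def prob_read_covers_snp(read_length: int, divergence: float) -> float:
    """Probability that a read spans at least one SNP.

    Assumption (not taken from the dissertation): every base is a SNP
    independently with probability `divergence`.
    """
    if read_length < 0:
        raise ValueError("read_length must be non-negative")
    if not 0.0 <= divergence <= 1.0:
        raise ValueError("divergence must lie in [0, 1]")
    return 1.0 - (1.0 - divergence) ** read_length


def expected_informative_reads(mean_depth: float, read_length: int,
                               divergence: float) -> float:
    """Expected number of informative reads at mean depth E(T).

    Uses linearity of expectation, assuming each read is informative
    with the same probability, independently of the depth T.
    """
    return mean_depth * prob_read_covers_snp(read_length, divergence)


if __name__ == "__main__":
    # Settings quoted in the abstract: E(T) = 40, d = 1%, l around 100.
    for read_length in (50, 100, 150, 200, 250):
        q = prob_read_covers_snp(read_length, 0.01)
        n = expected_informative_reads(40, read_length, 0.01)
        print(f"l={read_length:3d}  P(informative)={q:.3f}  E[informative reads]={n:.1f}")
```

Under these assumptions a read of length 100 at d = 1% is informative with probability of about 0.63. The expected count of informative reads grows linearly with E(T) but is capped by the factor 1 - (1 - d)^l as l increases, which is at least consistent with the abstract's emphasis on depth over read length.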
On the other hand, if an informative read is defined as covering a particular SNP, then E(T) of at least 130 and l of at least 250 are necessary to achieve sufficient power under d = 2%, even when the imprinting and AEI effects are strong. Because longer reads may carry a higher base-calling error rate with currently available technology, we suggest increasing sequencing depth as a more reliable, albeit more expensive, alternative. Note that the two definitions of "an informative read" may be interpreted as corresponding to a genome-wide study and a candidate-gene-based study, respectively.

As for human data, we discuss which trio structures are informative, and we again use the joint test to detect imprinting and AEI. In the theoretical power calculation, in addition to the effects of sequencing parameters and sequence divergence, the number of families N in a random sample of trios is also considered. If an informative read is defined as covering at least one SNP, a sequencing depth of E(T) = 8 is necessary to achieve reasonable power under the setting of N = 50, l = 100 and d = 1%, especially when the imprinting and AEI effects are moderate. Fixing E(T) = 4 and d = 1%, the power does not improve much when l > 100. As for the number of families N, under E(T) = 4, d = 1% and l = 100, even N = 20 leads to sufficient power for detecting strong imprinting and AEI effects (such as when one of the p_i is 0.1). However, a larger sample such as N = 100 is necessary for more moderate effects. If an informative read is defined as covering a particular SNP, E(T) of at least 10 and l of at least 250 are necessary to achieve sufficient power under N = 100 and d = 2%, even when the imprinting and/or AEI effects are strong. As for the effect of N, under E(T) = 10, d = 2% and l = 250, N = 200 is necessary to achieve sufficient power even for strong imprinting and AEI effects.

Dedicated to my parents, sister and fiancee.

Acknowledgments

I sincerely thank my adviser Dr. Shili Lin for her guidance and patience in these years. This dissertation would never have been accomplished without her. I would like to thank Dr. Dennis K. Pearl and Dr. Asuman Turkmen for serving on my dissertation committee, and Dr. Hong Zhu for serving on my Ph.D. candidacy exam committee.

Vita

June 28, 1980: Born, Taipei, Taiwan
2002: B.B.A. Statistics, National Chengchi University, Taiwan
2004: M.S. Statistics, National Tsing Hua University, Taiwan
2008-present: Graduate Research/Teaching Associate, Department of Statistics, The Ohio State University

Fields of Study

Major Field: Statistics
Studies in:
RNA-seq data analysis: Prof. Shili Lin
Statistical Genetics: Prof. Shili Lin

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
  1.1 Epigenetics and RNA-seq
  1.2 Genomic imprinting
  1.3 Allelic expression imbalance
  1.4 Connection among the previous two topics
  1.5 Contribution and organization of this dissertation
2. Statistical power for detecting imprinting and AEI in mouse data
  2.1 Experimental design for mouse
  2.2 Distribution of number of informative reads
    2.2.1 A read covers at least one SNP
    2.2.2 A read covers a particular SNP
  2.3 The joint test
    2.3.1 Currently available tests for detecting only imprinting or AEI
    2.3.2 Rationale for the joint test
    2.3.3 Simulation study
    2.3.4 Real data study
  2.4 Theoretic power for joint test
    2.4.1 A read covers at least one SNP
    2.4.2 A read covers a particular SNP
3. Statistical power for detecting imprinting and AEI in human data
  3.1 Informative trio structures
  3.2 Joint test and parameter setting
  3.3 Theoretic power for joint test
    3.3.1 A read covers at least one SNP
    3.3.2 A read covers a particular SNP
4. Summary and Discussion
  4.1 Summary
  4.2 Discussion and future extensions
    4.2.1 Arguable assumptions in this dissertation
    4.2.2 Discussion in human data
    4.2.3 Allele specific methylation
    4.2.4 Testing for imprinting, AEI and ASM sequentially
    4.2.5 A confidence region for p1 and p2
Bibliography

List of Tables

2.1 2 x 2 table in a reciprocal cross design
2.2 Simulation setting under H1
2.3 Summary of power in simulated data
2.4 Imprinted genes in mouse brain
2.5 Imprinted genes in the non-brain tissues of the mouse
3.1 The six types of informative trio structures

List of Figures

2.1 Rejection regions (black) of the three tests when n1=30 and n2=40
2.2 Four groups according to the underlying values of p1 and p2
2.3 The nine subsets of H1
2.4 Counts of total SNPs under H0, and counts of rejection
2.5 Counts of total SNPs under H1.A, and counts of rejection
2.6 Counts of total SNPs under H1.B, and counts of rejection
2.7 Counts of total SNPs under H1.C1 and H1.C2, and counts of rejection
2.8 Counts of total SNPs under H1.C3, and counts of rejection
2.9 Counts of total SNPs under H1.C4, and counts of rejection
2.10 Testing result of a SNP (UCSC id: uc009kou.1 2) in gene Cd81
2.11 Power image plots for different E(T), where an informative read is defined as covering at least one SNP
2.12 Power curve plots for different E(T), where an informative read is defined as covering at least one SNP
2.13 Power image plots for different d, where an informative read is defined as covering at least one SNP