Detecting and Analyzing Genomic Structural Variation Using Distributed Computing

Total Page:16

File Type:pdf, Size:1020Kb

Detecting and Analyzing Genomic Structural Variation Using Distributed Computing Detecting and Analyzing Genomic Structural Variation using Distributed Computing Christopher Whelan A.B. Computer Science, Harvard University, 1997 Presented to the Center for Spoken Language Understanding within the Oregon Health & Science University School of Medicine in partial fulfillment of the requirements for the degree Doctor of Philosophy in Computer Science & Engineering February 2014 Center for Spoken Language Understanding School of Medicine Oregon Health & Science University CERTIFICATE OF APPROVAL This is to certify that the Ph.D. dissertation of Christopher Whelan has been approved. Dr. Kemal Sönmez, Thesis Advisor Associate Professor Dr. Izhak Shafran Associate Professor Dr. Brian Roark Research Scientist, Google, Inc. Dr. Lucia Carbone Assistant Professor, Dept. of Behavioral Neuroscience Dr. Steven Bedrick Assistant Professor ii Acknowledgments This dissertation would not be possible without my co-advisors, Kemal Sönmez and Lucia Carbone. Kemal has been an excellent mentor and a great friend. He helped me to develop many ideas that went into this dissertation. Of all of his valuable contributions, I am most thankful for his ability to point me in new directions and re-energize me when I was discouraged or had painted myself into a corner. Lucia has also been a terrific role model and an inspiring scientific leader. The day I was given the chance to collaborate with her and her lab was my luckiest moment at OHSU. Lucia has taught me most of the biology I know, given me the opportunity to work on tremendously exciting projects, and gone out of her way many times to introduce me to members of the genomics community; in particular, I owe a great debt of gratitude to her for bringing me annually to the Biology of Genomes meeting at Cold Spring Harbor. Com- putational work cannot have real value without actual biological data and real scientific questions to work on and learn from, both of which Lucia provided. Of course, Lucia also introduced me to the problem of structural variation detection; this dissertation could not exist without her. I would also like to thank the other members of the Carbone Lab, who have been a pleasure to work with and learn from: Larry Wilhelm, Josh Meyer, Nathan Lazar, Liz Terhune, and Kim Nevonen. I would also like to thank Brian Roark and Zak Shafran for all of the great help they’ve given me. Brian is a fabulous teacher and always provided excellent feedback and made himself available for questions or just to talk. Cloudbreak started as a project in a course taught by Zak and Richard Sproat on problem solving with large clusters, which provided me with the core idea that eventually turned into this dissertation. Zak has done a great deal of other things for me as well: providing me with ideas to use in my research; inviting me to explore new topics in machine learning with him and his group; and last but not iii least acquiring and helping to configure and administer the Hadoop compute clusters at CSLU, without which this work would not be possible. On that last note, I would also like to thank cluster administrators who have put up with my sometimes taxing usage of our compute resources, including Rob Stites and Jason Brooks. I would also like to acknowledge members of the broader genomics community who have provided feedback and encouragement for my projects, including Bob Handsaker, Steve McCarroll, and Ben Raphael. Finally, my greatest thanks go to my wife Sarah and my son Colin, who displayed tremendous patience in putting up with their husband and father’s graduate-school lifestyle. Sarah’s support, encouragement, and love have been an integral component of this work, without which it could never have happened. iv Contents Acknowledgments :::::::::::::::::::::::::::::::::::: iii 1 Introduction ::::::::::::::::::::::::::::::::::::: 1 2 Biological Background ::::::::::::::::::::::::::::::: 6 2.1 Structural Variations . 6 2.1.1 SV Effects on Phenotype and Disease . 7 2.1.2 Mechanisms and Signatures of SV Formation . 8 2.2 High-Throughput Short-Read Sequencing . 10 2.2.1 Sequencing Analysis Pipelines . 12 2.2.2 Big Data from Sequencing . 16 3 Algorithms for Structural Variation Detection :::::::::::::::: 18 3.1 The Four Signals of SVs in Sequencing Data . 18 3.2 Read Pair Approaches . 19 3.2.1 Ambiguously Mapped Read Pairs . 21 3.2.2 Concordant Read Pairs . 22 3.3 Read Depth Approaches . 24 3.4 Split Read Approaches . 24 3.5 Assembly-Based Approaches . 25 3.6 Hybrid Approaches . 26 3.6.1 Support from Secondary Signals . 27 3.6.2 Pipelines . 27 3.6.3 Integrative Models . 28 3.7 An Example of an SV Detection Pipeline for a Cancer Dataset . 28 4 A Framework for SV Detection in MapReduce :::::::::::::::: 32 4.1 MapReduce and Hadoop . 32 4.2 Uses of Hadoop and MapReduce in Sequencing Tasks . 35 4.2.1 DNA Resequencing . 36 4.2.2 RNA-seq . 38 v 4.2.3 de novo Assembly . 39 4.2.4 Frameworks and Toolkits . 40 4.2.5 Other uses of Hadoop and MapReduce . 40 4.3 MapReduce Constraints on SV Algorithms . 41 4.4 A General MapReduce SV Detection Algorithm . 41 4.5 Discussion . 44 5 Cloudbreak :::::::::::::::::::::::::::::::::::::: 46 5.1 Variant types detected . 46 5.2 Framework infrastructure . 47 5.3 Implementation of a MapReduce SV Algorithm . 49 5.4 Filtering Incorrect and Ambiguous Mappings . 52 5.5 Genotyping . 54 5.6 Running in the Cloud . 55 5.6.1 Cloud-Enabled Genomics Tools . 55 5.6.2 Enabling Cloud Computing with Whirr . 56 5.7 Discussion . 58 6 Evaluating Cloudbreak ::::::::::::::::::::::::::::::: 59 6.1 Evaluation Methods . 59 6.1.1 Choice of SV Detection Tools to Compare To . 59 6.1.2 Simulated and Biological Data Sets . 60 6.1.3 Parameters Used for Alignment and SV Detection . 61 6.1.4 SV Prediction Evaluation . 62 6.2 Results on Simulated Data . 63 6.2.1 Accuracy and Runtime . 63 6.2.2 Choice of Window Size . 66 6.3 Results on Biological Data . 68 6.3.1 Accuracy and Runtime . 68 6.3.2 Breakpoint Resolution . 71 6.4 Results on a Low-Coverage Cancer Data Set . 73 6.5 Genotyping Variants . 74 6.6 Notes on Evaluating Runtime . 74 6.7 Choice of Aligner and Use of Multiple Mappings . 76 6.8 Discussion . 78 vi 7 Extending Local Feature Based Models of SV Detection in a Discrimi- native Machine Learning Framework :::::::::::::::::::::: 80 7.1 Related Work . 81 7.2 SV Detection as a Sequence Labeling Problem . 81 7.3 Graphical Models for Sequence Labeling . 82 7.4 Integrating Features with Conditional Random Fields . 84 7.5 Features for SV Detection . 85 7.5.1 Read Pair Features . 86 7.5.2 Split-read features . 87 7.5.3 Read depth features . 88 7.5.4 Genome annotations . 89 7.5.5 Binarization of real-valued features . 89 7.5.6 A feature example . 90 7.5.7 Interaction and neighbor features . 91 7.6 Training the CRF . 91 7.7 Improving Cloudbreak Calls with CRF Predictions . 94 7.8 Results . 95 7.9 Features Selected by the CRF . 96 7.10 Discussion . 98 8 Analysis of Evolutionary Breakpoint Features :::::::::::::::: 101 8.1 Background . 102 8.2 Evolutionary Breakpoints in the Gibbon Genome Identified by BACs . 104 8.2.1 Methods . 105 8.2.2 Results . 106 8.3 Analysis of Breakpoints from the Gibbon Genome Reference Sequence . 106 8.3.1 Overlap with Genomic Features: Repeats and Genes . 108 8.3.2 Overlap with CTCF Binding Sites . 111 8.4 Discussion . 115 9 Future Work ::::::::::::::::::::::::::::::::::::: 118 9.1 Cloudbreak . 118 9.2 SV Detection with Discriminative Machine Learning . 120 9.3 Gibbon Genome Breakpoint Analysis . 121 Bibliography ::::::::::::::::::::::::::::::::::::::: 122 vii List of Tables 3.1 A summary of published SV detection algorithms that combine more than one sequencing signal. 27 6.1 Deletions and insertions in the simulated data detected at maximum sensi- tivity. 67 6.2 Deletions and insertions in the simulated data detected at a low false dis- covery rate. 67 6.3 Cloudbreak accuracy with varying window sizes. 69 6.4 Precision, recall, and size of variants found in the NA18507 data set. 72 6.5 Confusion matrices for the predicted genotype of deletions. 74 6.6 Detailed runtimes for both data sets. 76 6.7 Detected deletions and insertions on the simulated and NA18507 data sets that overlap with repetitive elements. 78 7.1 Feature definition for the CRF training and test data. 90 7.2 Most important features learned by the CRF model for deletion and inser- tion breakpoints. 98 8.1 Enrichment counts and scores of features in gibbon breakpoint flanking re- gions. 110 viii List of Figures 1.1 Similarity of genomic breakpoints that occur in cancer and evolution. 3 2.1 An example of a computational pipeline for high-throughput short-read DNA resequencing projects. ..
Recommended publications
  • SM08 Programme
    SATELLITE MEETING Regulation of gene expression from a distance: exploring mechanisms The Royal Society at Chicheley Hall, home of the Kavli Royal Society International Centre Organised by Professor Wendy Bickmore and Professor Veronica van Heyningen FRS Wednesday 24 – Thursday 25 October 2012 DAY 1 DAY 2 SESSION 1 SESSION 2 SESSION 3 SESSION 4 Enhancer assays Quantitative and dynamic analysis Quantitative & dynamic analysis of Defining enhancers and their mechanisms – transgenes, genetics, and interactomes of transcription protein binding at regulatory of action Chair: Professor Nick Hastie CBE FRS Chair: elements Chair: Dr Duncan Odom Welcome by RS & lead 09.00 Professor Anne Ferguson-Smith Chair: Professor Constance Bonifer organiser The evolution of global Dynamic use of enhancers in Design rules for bacterial Massively parallel functional 09.05 enhancers 13.30 development 09.00 enhancers 13.15 dissection of mammalian enhancers Professor Denis Duboule Professor Mike Levine Dr Roee Amit Dr Rupali Patwardhan 09.30 Discussion 14.00 Discussion 09.30 Discussion 13.45 Discussion Complex protein dynamics at HERC2 rs12913832 modulates The pluripotent 3D genome Gene expression genomics eukaryotic regulatory human pigmentation by attenuating 09.45 14.15 09.45 Professor Wouter de Laat Dr Sarah Teichmann elements chromatin-loop formation between a 14.00 Dr Gordon Hager long-range enhancer and the OCA2 promoter 10.15 Discussion 14.45 Discussion 10.15 Discussion Dr Robert-Jan Palstra 10.30 Coffee 15.00 Tea 10.30 Coffee 14.15 Tea Maps of open chromatin
    [Show full text]
  • High-Resolution Mapping of Transcriptional Dynamics Across Tissue Development Reveals a Stable Mrna–Trna Interface
    Downloaded from genome.cshlp.org on September 30, 2021 - Published by Cold Spring Harbor Laboratory Press Research High-resolution mapping of transcriptional dynamics across tissue development reveals a stable mRNA–tRNA interface Bianca M. Schmitt,1,5 Konrad L.M. Rudolph,2,5 Panagiota Karagianni,3 Nuno A. Fonseca,2 Robert J. White,4 Iannis Talianidis,3 Duncan T. Odom,1 John C. Marioni,2 and Claudia Kutter1 1University of Cambridge, Cancer Research UK Cambridge Institute, Cambridge, CB2 0RE, United Kingdom; 2European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom; 3B.S.R.C. Alexander Flemming, 16672, Vari, Athens, Greece; 4University of York, Department of Biology, Heslington, York, YO10 5DD, United Kingdom The genetic code is an abstraction of how mRNA codons and tRNA anticodons molecularly interact during protein synthesis; the stability and regulation of this interaction remains largely unexplored. Here, we characterized the expression of mRNA and tRNA genes quantitatively at multiple time points in two developing mouse tissues. We dis- covered that mRNA codon pools are highly stable over development and simply reflect the genomic background; in contrast, precise regulation of tRNA gene families is required to create the corresponding tRNA transcriptomes. The dynamic regulation of tRNA genes during development is controlled in order to generate an anticodon pool that closely corresponds to messenger RNAs. Thus, across development, the pools of mRNA codons and tRNA anticodons are invariant and highly correlated, revealing a stable molecular interaction interlocking transcription and translation. [Supplemental material is available for this article.] Transcription of the mammalian genome is divided among mul- differencesinexpressionofspecifictRNAgenesmaycontribute tiple RNA polymerases (Pol), each transcribing a nonoverlapping to tumorigenesis by favoring translation of cancer-promoting set of genes.
    [Show full text]
  • DKFZ Postdoctoral Fellowships 2021 Hosting Group Information For
    DKFZ Postdoctoral Fellowships 2021 Hosting group information for applicants Name of DKFZ research division/group: Division of Regulatory Genomics and Cancer Evolution (B270) Contact person: Duncan Odom Group homepage: https://www.dkfz.de/en/regulatorische-genomik/index.php Please visit our website for further information on our research and recent publications. RESEARCH PROFILE AND PROJECT TOPICS: Dr Odom’s laboratory studies how genetic sequence information shapes the cell's DNA regulatory landscape and thus the trajectory of cancer genome evolution. Our long-standing interest in interspecies analysis of matched functional genomic data has revealed the extensive and rapid turn-over of tissue-specific transcription factor binding, insulator elements, polymerase occupancies, and enhancer activities during organismal evolution. My laboratory has begun exploiting single-cell RNA-sequencing and large-scale whole genome sequencing in understanding molecular and cancer genome evolution. Recent high-profile studies from the Odom lab used single-cell transcriptional analysis to conclusively demonstrate that ageing results in substantial increases in cell-to-cell transcriptional variability and exploited chemical carcinogenesis in mouse liver to reveal how the persistence of DNA lesions profoundly shapes tumour genomes. First, we are currently interviewing for experimentally focused postdoctoral fellows to spearhead multiple diverse and interdisciplinary collaborations in single-cell analysis, functional genomics, and spatial transcriptomics with Gerstung, Stegle, Hoefer, and Goncalves laboratories at DKFZ. Such an appointment would offer unusual opportunities for an experimentally-focused postdoc to learn and deploy bioinformatics tools in collaboration with one or more world-leading computational biology laboratories. Second, we are seeking a postdoctoral researcher to spearhead a collaboration with Edith Heard's group at EMBL, focused on how X-chromosome dynamics can shape disease susceptibility in older individuals.
    [Show full text]
  • Paterson Institute for Cancer Research Scientific Report 2008 Contents
    paterson institute for cancer research scientific report 2008 cover images: Main image supplied by Karim Labib and Alberto sanchez-Diaz (cell cycle Group). Budding yeast cells lacking the inn1 protein are unable to complete cytokinesis. these cells express a fusion of a green fluorescent protein to a marker of the plasma membrane, and have red fluorescent proteins attached to components of the spindle poles and actomyosin ring (sanchez-Diaz et al., nature cell Biology 2008; 10: 395). Additional images: front cover image supplied by Helen rushton, simon Woodcock and Angeliki Malliri (cell signalling Group). the image is of a mitotic spindle in fixed MDcK (Madin-Darby canine kidney) epithelial cells, which have been stained with an anti-beta tubulin antibody (green), DApi (blue) and an anti-centromere antibody (crest, red) which recognises the kinetochores of the chromosomes. the image was taken on the spinning disk confocal microscope using a 150 x lens. rear cover image supplied by Andrei ivanov and tim illidge (targeted therapy Group). Visualisation of tubulin (green) and quadripolar mitosis (DnA stained with DApi), Burkitt’s lymphoma namalwa cell after 10 Gy irradiation. issn 1740-4525 copyright 2008 © cancer research UK Paterson Institute for Cancer Research Scientific Report 2008 Contents 4 Director’s Introduction Researchers’ pages – Paterson Institute for Cancer Research 8 Crispin Miller Applied Computational Biology and Bioinformatics 10 Geoff Margison Carcinogenesis 12 Karim Labib Cell Cycle 14 Iain Hagan Cell Division 16 Nic Jones
    [Show full text]
  • Register of Declared Private, Professional Commercial and Other Interests
    REGISTER OF DECLARED PRIVATE, PROFESSIONAL COMMERCIAL AND OTHER INTERESTS Please refer to the attached guidance notes before completing this register entry. In addition to guidance on each section, examples of information required are also provided. Where you have no relevant interests in the relevant category, please enter ‘none’ in the register entry. Name: Michaela Frye Please list all MRC bodies you are a member of: E.g. Council, Strategy Board, Research Board, Expert Panel etc and your position (e.g. chair, member). Regenerative Medicine Research Committee member Main form of employment: Name of University and Department or other employing body (include location), and your position. University of Cambridge, Department of Genetics, Cancer Research UK Senior Fellow Research group/department web page: Provide a link to any relevant web pages for your research group or individual page on your organisation’s web site. http://www.stemcells.cam.ac.uk/research/affiliates/dr-frye http://www.gen.cam.ac.uk/research-groups/frye Please give details of any potential conflicts of interests arising out of the following: 1. Personal Remuneration: Including employment, pensions, consultancies, directorships, honoraria. See section 1 for further guidance. None 2. Shareholdings and Financial Interests in companies: Include the names of companies involved in medical/biomedical research, pharmaceuticals, biotechnology, healthcare provision and related fields where shareholdings or other financial interests. See section 2 for thresholds and further guidance. None 3. Research Income during current session (financial year) : Declare all research income from bodies supported by the MRC and research income from other sources above the limit of £50k per grant for the year.
    [Show full text]
  • Transcription Factor Binding Cells and Their Progeny
    RESEARCH HIGHLIGHTS knock-in alleles at the Lgr6 locus, allowing them to visualize Lgr6+ Transcription factor binding cells and their progeny. Shortly after birth, Lgr6 expression in the skin becomes restricted, eventually marking a unique population located at Most transcription factor binding events differ between closely the central isthmus just above the bulge region in resting hair follicles. related species despite conservation of functional targets. Now, Paul Flicek and Duncan Odom and colleagues report an Through a series of lineage-tracing studies in adult mice, they showed + evolutionary comparison of transcription factor CEBPA binding that the progeny of Lgr6 cells contribute not only to hair follicles but among five vertebrate species (Science published online, also to sebaceous glands and interfollicular epidermis. They also showed + doi:10.1126/science.1186176, 8 April 2010). The authors that transplanted Lgr6 stem cells could reconstitute fully formed hair compared genome-wide binding of CEBPA as determined by follicles and contribute to all three lineages in host skin. Finally, they + ChIP-sequencing in livers of human, mouse, dog, short-tailed showed that Lgr6 cells contribute to wound healing. These findings opossum and chicken. The CEBPA binding motif was virtually identify Lgr6 as a marker of a distinct epidermal stem cell population identical between the five species, but most binding events contributing to all three skin lineages. KV were species-specific. Only 10% to 22% of binding events were shared between any two placental mammals, and there was even greater divergence of binding events in opossum and chicken. Identifying nonobvious phenologs Only 35 binding events were shared by all five species.
    [Show full text]
  • Decoupling of Evolutionary Changes in Transcription Factor Binding and Gene Expression in Mammals
    Downloaded from genome.cshlp.org on April 13, 2016 - Published by Cold Spring Harbor Laboratory Press Research Decoupling of evolutionary changes in transcription factor binding and gene expression in mammals Emily S. Wong,1 David Thybert,1 Bianca M. Schmitt,2 Klara Stefflova,2,4 Duncan T. Odom,2,3 and Paul Flicek1,3 1European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom; 2University of Cambridge, Cancer Research UK - Cambridge Institute, Li Ka Shing Centre, Cambridge, CB2 0RE, United Kingdom; 3Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom To understand the evolutionary dynamics between transcription factor (TF) binding and gene expression in mammals, we compared transcriptional output and the binding intensities for three tissue-specific TFs in livers from four closely related mouse species. For each transcription factor, TF-dependent genes and the TF binding sites most likely to influence mRNA expression were identified by comparing mRNA expression levels between wild-type and TF knockout mice. Independent evolution was observed genome-wide between the rate of change in TF binding and the rate of change in mRNA ex- pression across taxa, with the exception of a small number of TF-dependent genes. We also found that binding intensities are preferentially conserved near genes whose expression is dependent on the TF, and the conservation is shared among binding peaks in close proximity to each other near the TSS. Expression of TF-dependent genes typically showed an increased sensitivity to changes in binding levels as measured by mRNA abundance.
    [Show full text]
  • Extensive Compensatory Cis-Trans Regulation in the Evolution of Mouse Gene Expression
    Downloaded from genome.cshlp.org on September 25, 2021 - Published by Cold Spring Harbor Laboratory Press Research Extensive compensatory cis-trans regulation in the evolution of mouse gene expression Angela Goncalves,1,5 Sarah Leigh-Brown,2,3,5 David Thybert,1,6 Klara Stefflova,3,6 Ernest Turro,2,3 Paul Flicek,1,4 Alvis Brazma,1 Duncan T. Odom,2,3,4,7 and John C. Marioni1,7 1European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom; 2Department of Oncology, University of Cambridge, Hills Road, Cambridge CB2 0XZ, United Kingdom; 3Cancer Research UK, Cambridge Research Institute, Robinson Way, Cambridge CB2 0RE, United Kingdom; 4Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom Gene expression levels are thought to diverge primarily via regulatory mutations in trans within species, and in cis between species. To test this hypothesis in mammals we used RNA-sequencing to measure gene expression divergence between C57BL/6J and CAST/EiJ mouse strains and allele-specific expression in their F1 progeny. We identified 535 genes with parent-of-origin specific expression patterns, although few of these showed full allelic silencing. This suggests that the number of imprinted genes in a typical mouse somatic tissue is relatively small. In the set of nonimprinted genes, 32% showed evidence of divergent expression between the two strains. Of these, 2% could be attributed purely to variants acting in trans, while 43% were attributable only to variants acting in cis. The genes with expression divergence driven by changes in trans showed significantly higher sequence constraint than genes where the divergence was explained by variants acting in cis.
    [Show full text]
  • Decoupling of Evolutionary Changes in Transcription Factor Binding and 10 Gene Expression in Mammals
    Downloaded from genome.cshlp.org on September 24, 2021 - Published by Cold Spring Harbor Laboratory Press 5 Decoupling of evolutionary changes in transcription factor binding and 10 gene expression in mammals 15 20 Emily S. Wong1, David Thybert1, Bianca M. Schmitt2, Klara Stefflova2,4, Duncan T. Odom*2,3, Paul Flicek*1,3 25 * Corresponding authors: DTO ([email protected]); PF ([email protected]) 30 1 European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. 2 University of Cambridge, Cancer Research UK - Cambridge Institute, Li Ka Shing Centre, 35 Cambridge, CB2 0RE, UK. 3 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. 4 Present address: Caltech Division of Biology and Biological Engineering, 1200 E California Blvd, Pasadena, CA, USA, 91125 40 Running title: Evolution of TF binding and gene expression Keywords: gene regulation, transcriptional evolution Downloaded from genome.cshlp.org on September 24, 2021 - Published by Cold Spring Harbor Laboratory Press ABSTRACT To understand the evolutionary dynamics between transcription factor (TF) binding and gene expression in mammals, we compared transcriptional output and the binding intensities for 5 three tissue-specific TFs in livers from four closely related mouse species. For each transcription factor, TF dependent genes and the TF binding sites most likely to influence mRNA expression were identified by comparing mRNA expression levels between wildtype and TF knockout mice. Independent evolution was observed genome-wide between the rate of change in TF binding and the rate of change in mRNA expression across taxa, with the 10 exception of a small number of TF dependent genes.
    [Show full text]
  • Interplay of Cis and Trans Mechanisms Driving Transcription Factor Binding and Gene Expression Evolution
    bioRxiv preprint doi: https://doi.org/10.1101/059873; this version posted July 29, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Interplay of cis and trans mechanisms driving transcription factor binding and gene expression evolution Emily S Wong1*, Bianca M Schmitt2*, Anastasiya Kazachenka3, David Thybert1, Aisling Redmond2, Frances Connor2, Tim F. Rayner2, Christine Feig2, Anne C. Ferguson-Smith3, John C Marioni1,2, Duncan T Odom2,4#, Paul Flicek1,4# * These authors contributed equally # Co-corresponding authors: DTO ([email protected]), PF ([email protected]). 1 European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. 2 University of Cambridge, Cancer Research UK - Cambridge Institute, Li Ka Shing Centre, Cambridge, CB2 0RE, UK. 3 University of Cambridge, Department of Genetics, Cambridge, CB2 3EH, UK. 4 Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. 1 bioRxiv preprint doi: https://doi.org/10.1101/059873; this version posted July 29, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Abstract Noncoding regulatory variants play a central role in the genetics of human diseases and in evolution. Here we measure allele-specific transcription factor binding occupancy of three 5 liver-specific transcription factors between crosses of two inbred mouse strains to elucidate the regulatory mechanisms underlying transcription factor binding variations in mammals.
    [Show full text]
  • Tumours at Cellular Resolution
    Cancer Research UK Cambridge Institute Annual International Symposium Unanswered Questions: Tumours at Cellular Resolution 4 – 5 March 2016 Programme Programme CRUK CI Annual International Symposium 2016 11 Friday 4 March 9.30-9.35 Welcome from Simon Tavaré Session 1 Tissue architecture: imaging and bioengineering Chair: John Marioni Programme Cancer Research UK Cambridge Institute and EBI 9.35-9.40 Session introduction by John Marioni 9.40-10.10 Mikala Egeblad Cold Spring Harbor Laboratory Caught in the act: cancer cell activities revealed by imaging in live tissues 10.10-10.40 Lucas Pelkmans University of Zurich Control of transcript variability in single mammalian cells 10.40-10.55 Selected talk from submitted abstracts: to be confirmed 10.55-11.25 Coffee/tea 11.25-11.55 AlisonDRAFT McGuigan University of Toronto TRACER: An engineered tumour for exploring cellular phenotype and microenvironment in hypoxic gradients 11.55-12.25 Itai Yanai Technion - Israel Institute of Technology Deciphering the expression of all genes in all cells throughout the development of the nematode C. elegans 12.25-13.10 "Unanswered questions" panel discussion introduced by John Marioni, Cancer Research UK Cambridge Institute and EBI 13.10-15.00 Lunch (and poster session) 12 CRUK CI Annual International Symposium 2016 Session 2 Approaches to understanding tumours at cellular resolution Programme Chair: Simon Tavaré Cancer Research UK Cambridge Institute 15.00-15.05 Session introduction by Simon Tavaré 15.05-15.35 Rahul Satija New York University Learning the
    [Show full text]
  • Waves of Retrotransposon Expansion Remodel Genome Organization and CTCF Binding in Multiple Mammalian Lineages
    Waves of Retrotransposon Expansion Remodel Genome Organization and CTCF Binding in Multiple Mammalian Lineages Dominic Schmidt,1,2,6 Petra C. Schwalie,3,6 Michael D. Wilson,1,2 Benoit Ballester,3 Aˆ ngela Gonc¸alves,3 Claudia Kutter,1,2 Gordon D. Brown,1,2 Aileen Marshall,1,5 Paul Flicek,3,4,* and Duncan T. Odom1,2,4,* 1Cancer Research UK, Cambridge Research Institute, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, UK 2Department of Oncology, Hutchison/MRC Research Centre, University of Cambridge, Hills Road, Cambridge CB2 0XZ, UK 3European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK 4Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK 5Cambridge Hepatobiliary Service, Addenbrooke’s Hospital, Hills Road, Cambridge CB2 2QQ, UK 6These authors contributed equally to this work *Correspondence: fl[email protected] (P.F.), [email protected] (D.T.O.) DOI 10.1016/j.cell.2011.11.058 SUMMARY from fly to human (Burcin et al., 1997; Klenova et al., 1993; Moon et al., 2005). Originally identified as a transcriptional regu- CTCF-binding locations represent regulatory se- lator for the c-myc oncogene (Baniahmad et al., 1990; Filippova quences that are highly constrained over the course et al., 1996; Lobanenkov et al., 1990), CTCF remains the only of evolution. To gain insight into how these DNA identified sequence-specific DNA-binding protein that helps elements are conserved and spread through the establish vertebrate insulators (Bell et al., 1999). Additionally, genome, we defined the full spectrum of CTCF- CTCF has been implicated in transcriptional activation, repres- binding sites, including a 33/34-mer motif, and iden- sion, silencing, and imprinting of genes (Awad et al., 1999; Burcin et al., 1997; Filippova et al., 1996; Klenova et al., 1993; Lobanen- tified over five thousand highly conserved, robust, kov et al., 1990).
    [Show full text]