Characterizing the Dna-Binding Site Specificities of Cis2his2 Zinc Fingers

Total Page:16

File Type:pdf, Size:1020Kb

Characterizing the Dna-Binding Site Specificities of Cis2his2 Zinc Fingers MQP-ID-DH-UM1 C H A R A C T E RI Z IN G T H E DN A-BINDIN G SI T E SPE C I F I C I T I ES O F C IS2H IS2 Z IN C F IN G E RS A Major Qualifying Project Report Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE in partial fulfillment of the requirements for the Degrees of Bachelor of Science in Biochemistry and Biology and Biotechnology by _________________________ Heather Bell April 26, 2012 APPROVED: ____________________ ____________________ ____________________ Scot Wolfe, PhD Destin Heilman, PhD David Adams, PhD Gene Function and Exp. Biochemistry Biology and Biotech UMass Medical School WPI Project Advisor WPI Project Advisor MAJOR ADVISOR A BST R A C T The ability to modularly assemble Zinc Finger Proteins (ZFPs) as well as the wide variety of DNA sequences they can recognize, make ZFPs an ideal framework to design novel DNA-binding proteins. However, due to the complexity of the interactions between residues in the ZF recognition helix and the DNA-binding site there is currently no comprehensive recognition code that would allow for the accurate prediction of the DNA ZFP binding motifs or the design of novel ZFPs for a desired target site. Through the analysis of the DNA-binding site specificities of 98 ZFP clones, determined through a bacterial one-hybrid selection system, a predictive model was created that can accurately predict the binding site motifs of novel ZFPs. 2 T A B L E O F C O N T E N TS Signature Page ««««««««««««««««««««««««««« $EVWUDFW«««««««««««««««««««««««««««««« 7DEOHRI&RQWHQWV«««««««««««««««««««««««««« $FNQRZOHGJHPHQWV««««««««««««««««««««««««« %DFNJURXQG«««««««««««««««««««««««««««« Project Purpose «««««««««««««««««««««««««««15 0HWKRGV««««««««««««««««««««««««««««««16 5HVXOWV««««««««««««««««««««««««««««««21 'LVFXVVLRQ«««««««««««««««««««««««««««««28 Bibliograph\«««««««««««««««««««««««««««« 6XSSOHPHQWDO««««««««««««««««««««««««««« 3 A C K N O W L E D G E M E N TS I would like to thank Dr. Scot Wolfe and the University of Massachusetts Medical School for sponsoring this project and providing guidance along the way. I would also like to thank Ankit Gupta for his guidance in carrying out this product as well as The Joung Lab and The Stormo Lab for their collaboration with the project. Lastly, I would like to thank Dr. Dave Adams and Dr. Destin Heilman, my WPI major advisors, for advising this project and helping to edit the final report. 4 PR OJE C T PURPOSE The purpose of this project was to create a comprehensive ZFP predicative model that could be used to both accurately predict the DNA-binding sites of ZFPs, as well as to aid in designing ZFPs for a desired target site. This predictive model was developed based upon Bacterial One-Hybrid (B1H) binding site selections on 116-engineered ZFP clones based upon the structural framework of Zif268 and was created in collaboration with the laboratories of Keith Joung and Gary Stormo. 5 B A C K G R O UND T H E DISC O V E R Y O F Z IN C FIN G E R PR O T E INS Through the work of Robert Roeder and Donald Brown on the 5S RNA genes of Xenopus laevis, it was discovered that the binding of a 40-kDa protein factor, called transcription factor IIIA (TFIIIA), is required for the initiation of transcription. Through further experiments Roeder and Brown determined that TFIIIA interacts with a 50- nucleotide long region within the gene, called the internal control region, resulting in TFIIIA being the first eukaryotic transcription factor ever described. In 1982, further studies of TFIIIA by Miller, Roeder, and Brown showed that the transcription factor contained a high concentration of zinc and could be broken down into nine tandemly repeating units of 30-amino acids in length with each unit containing a similar pattern of cysteines and histidines. The remarkable repeating motif within TFIIIA was later termed a zinc finger (ZF). The protein was given this name due to the presence of the zinc ion and the manner with which the zinc finger gripped the DNA (Klug, 2010). The ability to modularly assemble Zinc Finger Proteins (ZFPs) as well as the wide variety of DNA sequences they can recognize, make ZFPs an ideal framework to design novel DNA-binding proteins. Of the 30 amino acids in each ZF, 25 fold around a Zn ion WRIRUPDQLQGHSHQGHQWVWUXFWXUDOGRPDLQRU³WKHILQJHU´DQGWKHUHPDLQLQJDPLQR acids serve a linker between consecutive fingers. In later studies by Neuhaus, it was shown that this linker is extremely flexible and can vary depending on the organism. The discovery of the flexibility of the linker also showed that tandemly linked ZFs are structurally independent of one another and thus opened the door to a whole new world of DNA recognition and DNA binding proteins (Klug, 2010). 6 Since the discovery of TFIIIA, three major classes of ZFs have been defined. The first class is characterized by the presence of a Cys6-Zn cluster motif and is found in metabolic regulators of Fungi. The second class consists of Cys2Cys2 (or Cys4) ZFs with a conserved Zn-binding consensus of Cys-X2-Cys-X13-Cys-X2-Cys and is found mainly in nuclear steroid or hormone receptors. The third class contains the most common ZF, the Cys2His2 =)RUWKH³FODVVLFDO´=)GRPDLQDQGFDQEHIRXQGLQDODUJHQXPEHURI regulatory proteins spanning almost all branches of the evolutionary tree (Papworth et al, 2006). DN A BINDIN G O F C YS2 H IS2 Z IN C FIN G E R PR O T E INS Cys2His2 Zinc Fingers are the most common DNA-binding domain in metazoan genomes and are naturally occurring in animals, plants, and fungi. They are the best understood out of the three major classes of ZFs and are therefore the most widely used for designing proteins with novel DNA-binding specificities (Wolfe et al., 2000). Cys2His2 Zinc Fingers contain a conserved Cys-&\V«+LV-His pattern with the zinc ion tetrahedrally coordinated between the two cysteines at one end of the E-strands and the two histidines in the C-terminal portion of the D-helix to stabilize its folds (figure 1). In addition to the conserved Cys-Cys and His-His pattern, each ZF also contains Figure 1: The structure of a Zinc Finger from a two- dimensional NMR study of three conserved large hydrophobic residues: typically Tyr6, two-finger peptide in solution (Klug, 2010). Phe17, and Leu23, which help to stabilize the finger structure. Therefore, the conserved sequence of Cys2His2 ZFs can be written as (Tyr, Phe)-X-Cys- 7 X2-5-Cys-X3-(Tyr, Phe)-X5-Leu-X2-His-X3-5-His, with X representing any amino acid (Klug, 2010). Additionally, ZFs can be linked tandemly in a linear, polar fashion to recognize DNA (or RNA) sequences of different lengths as well as proteins (Wolfe et al., 2000). However, while each finger domain has a similar structural framework, it can achieve chemical distinctiveness through variations in the 4 key amino acid positions used in DNA-binding (Klug, 2010). In 1991, Pavletich and Pabo conducted an in depth study of the crystal structure of a DNA oligonucleotide bound to a three-finger DNA-binding domain of the mouse transcription factor Zif268. From this study, Pavletich and Pabo showed that the primary sequence-specific contacts are made by the D-helix, which binds in the major groove of the DNA. This binding occurs through specific hydrogen-bond or hydrophobic interactions from amino acids in the recognition helix at positions -1, 3, and 6 with the three positions contacting three consecutive basepairs on one strand of the DNA. Later studies showed that the amino acid in the recognition helix at position 2 also plays an important role in DNA binding, contacting a single basepair on the opposite DNA strand as positions -1, 3, and 6. These studies also demonstrated the importance of cross-strand interaction between the D-helices of tandemly linked ZFs due to the fact that the complement of the basepair contacted by position 2 of the first finger is contacted by position 6 of the second finger and so on (figure 2) (Klug, 2010). 8 Figure 2: Schematic of the DNA-binding of the mouse transcription factor, Zif268. Contacts with the DNA are made from positions -1, 3, and 6 of the recognition helix on one strand of the DNA with the residue at position 2 contacting the complement strand. Positions -1, 3, and 6 contact three consecutive basepairs while position two contacts the complement of the basepair contacted by position 6 of the next finger. 7KHWKUHHILQJHUPRGXOHELQGVZLWKSRVLWLRQRIILQJHUWKUHHFRQWDFWLQJWKH¶HQGRI the DNA and position -RIILQJHUFRQWDFWLQJWKH¶HQGRIWKH'1$ ,PDJHWDNHQ from Klug, 2010). The most common linker arrangement found in ZFs contains five residues between the final histidine of one finger and the first conserved aromatic of the next. Roughly half of the known ZFs contain a linker with the consensus sequence TGEKP. This particular linker is flexible in the free ZFP but becomes rigid upon binding of the ZFP to DNA. The docking of adjacent fingers in a ZFP are stabilized further by a contact LQYROYLQJWKHVLGHFKDLQRIWKHUHVLGXHDWSRVLWLRQRIWKHSUHFHGLQJILQJHU¶VKHOL[DORQJ with the backbone carbonyl or side chain at position -2 of the subsequent finger. This contact using position 9 has been observed to correlate with the use of a canonical linker such as TGEKP. Due to this interfinger contact at position 9 and the highly conserved TGEKP linger, it can be implied that the interfinger organization is important in DNA recognition as well (Wolfe et al., 2000). 9 The mode of DNA recognition can be considered a one-to-one interaction between individual amino acids from the recognition helix to individual DNA bases (Wolfe et al., 2000). This combined with the fact that ZFs function as independent modules, allows for the design of ZFPs that will recognize longer DNA sequences through combining several ZFs with different triplet specificities.
Recommended publications
  • Department of Energy Office of Health and Environmental Research SEQUENCING the HUMAN GENOME Summary Report of the Santa Fe Workshop March 3-4, 1986
    Department of Energy Office of Health and Environmental Research SEQUENCING THE HUMAN GENOME Summary Report of the Santa Fe Workshop March 3-4, 1986 Los Alamos National Laboratory Los Alamos Los Alamos, New Mexico 87545 Los Alamos National Laboratory is operated by the University of California for the United States Department of Energy under contract W-7405-ENG-36. DEPARTMENT OF ENERGY OFFICE OF HEALTH AND ENVIRONMENTAL RESEARCH SEQUENCING THE HUMAN GENOME SUMMARY REPORT ON THE SANTA FE WORKSHOP (MARCH 3-4, 1986) Executive Summary. The following is a summary of the Santa Fe Workshop held on March 3 and 4, 1986. The workshop was sponsored by the Office of Health and Environmental Research (OHER) and Los Alamos National Laboratory (LANL) and dedicated to examining the feasibility, advisability, and approaches to sequencing the human genome. The workshop considered four principal topics: I. Technologies to be employed. II. Expected benefits. III. Architecture of the enterprise. IV. Participants and funding. I . Technology The participants of the workshop foresaw extraordinary and continuing progress in the efficiency and accuracy of mapping, ordering , and sequencing technologies. They suggested that a coordinated analysis of the human genome begin with the task of ordering overlapping recombinant DNA fragments obtained from purified human chromosomes that would provide an infrastructure for sequencing activity. At the same time, they support in-depth evaluation of current and developing strategies for sequencing including possible applications of automation and robotics that would minimize the time and cost of sequencing. II. Benefits The socio-political and health benefits, and the benefit:cost ratio were seen as highly favorable not only for human health, but in addition for the development of new diagnostic, preventative and therapeutic tools, jobs, and industries.
    [Show full text]
  • Mapping Our Genes—Genome Projects: How Big? How Fast?
    Mapping Our Genes—Genome Projects: How Big? How Fast? April 1988 NTIS order #PB88-212402 Recommended Citation: U.S. Congress, Office of Technology Assessment, Mapping Our Genes-The Genmne Projects.’ How Big, How Fast? OTA-BA-373 (Washington, DC: U.S. Government Printing Office, April 1988). Library of Congress Catalog Card Number 87-619898 For sale by the Superintendent of Documents U.S. Government Printing Office, Washington, DC 20402-9325 (order form can be found in the back of this report) Foreword For the past 2 years, scientific and technical journals in biology and medicine have extensively covered a debate about whether and how to determine the function and order of human genes on human chromosomes and when to determine the sequence of molecular building blocks that comprise DNA in those chromosomes. In 1987, these issues rose to become part of the public agenda. The debate involves science, technol- ogy, and politics. Congress is responsible for ‘(writing the rules” of what various Federal agencies do and for funding their work. This report surveys the points made so far in the debate, focusing on those that most directly influence the policy options facing the U.S. Congress, The House Committee on Energy and Commerce requested that OTA undertake the project. The House Committee on Science, Space, and Technology, the Senate Com- mittee on Labor and Human Resources, and the Senate Committee on Energy and Natu- ral Resources also asked OTA to address specific points of concern to them. Congres- sional interest focused on several issues: ● how to assess the rationales for conducting human genome projects, ● how to fund human genome projects (at what level and through which mech- anisms), ● how to coordinate the scientific and technical programs of the several Federal agencies and private interests already supporting various genome projects, and ● how to strike a balance regarding the impact of genome projects on international scientific cooperation and international economic competition in biotechnology.
    [Show full text]
  • ISMB 99 August 6 – 10, 1999 Heidelberg, Germany the Seventh
    ______________________________________ Welcome to ISMB 99 August 6 – 10, 1999 Heidelberg, Germany The Seventh International Conference on Intelligent Systems for Molecular Biology ______________________________________ Final Program and Detailed Schedule Friday, August 6, 1999 Tutorial Day The tutorials will take place in the following rooms: 8:30 – 12:30 (Coffee break around 10:30) Tutorial #1 Trübnersaal Piere Baldi Probabilistic graphical models Tutorial #2 Robert-Schumann-Zimmer Douglas L. Brutlag Bioinformatics and Molecular Biology Tutorial #3 Ballsaal Martin Reese The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster Tutorial #4 Gustav-Mahler-Zimmer Tandy Warnow Computational and statistical Junhyong Kim challenges involved in reconstructing evolutionary trees Tutorial #5 Sebastian-Münster-Saal Thomas Werner The biology and bioinformatics of regulatory regions in genomes Lunch (on this day served in "Grosser Saal" on the ground floor) 13:30 – 17:30 (Coffee break around 15:30) Tutorial #6 Sebastian-Münster-Saal Rob Miller EST Clustering Alan Christoffels Winston Hide Tutorial #7 Trübnersaal Kevin Karplus Getting the most out of hidden Markov Melissa Cline models Christian Barrett Tutorial #8 Robert-Schumann-Zimmer Arthur Lesk Sequence-structure relationships and evolutionary structure changes in proteins Tutorial #9 Gustav-Mahler-Zimmer David States PERL abstractions for databases and Brian Dunford distributed computing Shore Tutorial # 10 Ballsaal Zoltan Szallasi Genetic network analysis
    [Show full text]
  • Download File
    Topics in Signal Processing: applications in genomics and genetics Abdulkadir Elmas Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences COLUMBIA UNIVERSITY 2016 c 2016 Abdulkadir Elmas All Rights Reserved ABSTRACT Topics in Signal Processing: applications in genomics and genetics Abdulkadir Elmas The information in genomic or genetic data is influenced by various complex processes and appropriate mathematical modeling is required for studying the underlying processes and the data. This dissertation focuses on the formulation of mathematical models for certain problems in genomics and genetics studies and the development of algorithms for proposing efficient solutions. A Bayesian approach for the transcription factor (TF) motif discovery is examined and the extensions are proposed to deal with many interdependent parameters of the TF-DNA binding. The problem is described by statistical terms and a sequential Monte Carlo sampling method is employed for the estimation of unknown param- eters. In particular, a class-based resampling approach is applied for the accurate estimation of a set of intrinsic properties of the DNA binding sites. Through statistical analysis of the gene expressions, a motif-based computational approach is developed for the inference of novel regulatory networks in a given bacterial genome. To deal with high false-discovery rates in the genome-wide TF binding predictions, the discriminative learning approaches are examined in the context of sequence classification, and a novel mathematical model is introduced to the family of kernel-based Support Vector Machines classifiers. Furthermore, the problem of haplotype phasing is examined based on the genetic data obtained from cost-effective genotyping technologies.
    [Show full text]
  • Baldi Bioinformatics
    Vol. 15 no. 11 1999 BIOINFORMATICS Pages 865–866 A.M. Shmatkov, A.A. Melikyan, F.L. Chernousko and Editorial M. Borodovsky’s paper, ‘Finding prokaryotic genes by the “frame-by-frame” algorithm: targeting gene starts THE SECOND GEORGIA TECH INTERNA- and overlapping genes’, addresses the important practical TIONAL CONFERENCE ON BIOINFORMATICS: problem of gene recognition in prokaryotic genomes, SEQUENCE, STRUCTURE AND FUNCTION of which over a dozen have already been completely (NOVEMBER 11–14, 1999, ATLANTA, GEORGIA, sequenced. The authors suggest an approach to one of USA) the few remaining open problems in prokaryotic gene finding: accurate prediction of gene starts allowing for Steering & Program Committee: Pierre Baldi, Mark the possibility of overlapping protein-coding regions, a Borodovsky, Soren Brunak, Chris Burge, Jim Fickett, relatively common occurrence in prokaryotic genomes Steven Henikoff, Eugene Koonin, Andrej Sali, Chris which seems to be rare in eukaryotes. Their algorithm Sander, Gary Stormo involves application of a hidden Markov model of gene structure to each of the six global reading frames (three on This issue of Bioinformatics contains reports on selected each strand) of a genome separately, followed by a simple papers presented at the international conference in At- post-processing step to remove completely overlapping lanta. The conference was held at one of the midtown genes which rarely occur in nature. Promising results are hotels offering a magnificent bird’s-eye view to the obtained in identifying gene starts, and the possibility of cosmopolitan capital of the Southeast of the USA. The systematic biases in the annotation of several bacterial conference agenda included keynote lectures by Russell genomes is raised.
    [Show full text]
  • Gene Regulatory Network Inference Using Machine Learning Techniques
    GENE REGULATORY NETWORK INFERENCE USING MACHINE LEARNING TECHNIQUES Stephanie Kamgnia Wonkap A Thesis in the department of Computer Science and Software Engineering Presented in Partial Fulfillment of the Requirements For the Degree of Doctor of Philosophy(Computer Science) Concordia University Montreal,´ Quebec,´ Canada August 26 2020 c Stephanie Kamgnia Wonkap, 2020 Concordia University School of Graduate Studies This is to certify that the thesis prepared By: Miss. Stephanie Kamgnia Wonkap Entitled: Gene Regulatory Network Inference using Machine Learning Techniques and submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science) complies with the regulations of this University and meets the accepted standards with respect to originality and quality. Signed by the final examining committee: Chair Dr. Liangzhu Wang External Examiner Dr. Mathieu Blanchette Examiner Dr. Leila Kosseim Examiner Dr. Malcolm Whiteway Examiner Dr. Volker Haarslev Supervisor Dr. Gregory Butler Approved Dr. Leila Kosseim, Graduate Program Director August 26th, 2020 Dr. Amir Asif, Dean Date of Defence Faculty of Engineering and Computer Science Abstract Gene Regulatory Network Inference using Machine Learning Techniques Stephanie Kamgnia Wonkap, Ph.D. Concordia University, 2020 Systems Biology is a field that models complex biological systems in order to better understand the working of cells and organisms. One of the systems modeled is the gene regulatory network that plays the critical role of controlling an organism's response to changes in its environment. Ideally, we would like a model of the complete gene regulatory network. In recent years, several advances in technology have permitted the collection of an unprecedented amount and variety of data such as genomes, gene expression data, time-series data, and perturbation data.
    [Show full text]
  • Highlights (PDF)
    GENETICS A PERIODICAL RECORD OF INVESTIGATIONS BEARING ON HEREDITY AND VARIATION Founded in 1916 and published by The Genetics Society of America VOLUME 179, MAY–AUGUST 2008 GENETICS VOLUME 179, MAY–AUGUST 2008 EDITORIAL BOARD Elizabeth W. Jones, Editor-in-Chief Carnegie Mellon University Mark Johnston, Acting Editor-in-Chief Washington University School of Medicine Montserrat Aguade´ Kent Golic Rasmus Nielsen Universitat de Barcelona University of Utah University of Copenhagen, Centre for Bioinformatics Eric E. Alani Susan Gottesman Cornell University National Institutes of Health-NCI Michael Nonet Washington University School of Medicine Kathryn V. Anderson David I. Greenstein Sloan-Kettering Institute University of Minnesota Magnus Nordborg University of Southern California Brenda J. Andrews David Jonah Grunwald University of Utah University of Toronto Peter J. Oefner hris aley Robert R. H. Anholt C H Stanford University Roslin Institute (Edinburgh) North Carolina State University Andrew Paterson ichael ampsey Elja Arjas M H University of Georgia University of Helsinki Robert Wood Johnson Medical School-UMDNJ David Rand orman rnheim Brown University N A awrence arshman University of Southern California L G. H University of Nebraska, Lincoln Eric J. Richards onnie artel Washington University B B ancy ollingsworth Rice University N H Stony Brook University Mark D. Rose David Begun afri umayun Princeton University University of California, Davis M. Z H UMDNJ-New Jersey Medical School Paul Russell ames irchler J A. B ancy enkins The Scripps Research Institute University of Missouri N A. J National Cancer Institute-FCRDC Matthew S. Sachs arl roman K W. B homas aufman Texas A&M University University of Wisconsin, Madison T C.
    [Show full text]
  • 2014 ISCB Accomplishment by a Senior Scientist Award: Gene Myers
    Message from ISCB 2014 ISCB Accomplishment by a Senior Scientist Award: Gene Myers Christiana N. Fogg1, Diane E. Kovats2* 1 Freelance Science Writer, Kensington, Maryland, United States of America, 2 Executive Director, International Society for Computational Biology, La Jolla, California, United States of America The International Society for Computa- tional Biology (ISCB; http://www.iscb. org) annually recognizes a senior scientist for his or her outstanding achievements. The ISCB Accomplishment by a Senior Scientist Award honors a leader in the field of computational biology for his or her significant contributions to the com- munity through research, service, and education. Dr. Eugene ‘‘Gene’’ Myers of the Max Planck Institute of Molecular Cell Biology and Genetics in Dresden has been selected as the 2014 ISCB Accom- plishment by a Senior Scientist Award winner. Myers (Image 1) was selected by the ISCB’s awards committee, which is chaired by Dr. Bonnie Berger of the Massachusetts Institute of Technology (MIT). Myers will receive his award and deliver a keynote address at ISCB’s 22nd Image 1. Gene Myers. Image credit: Matt Staley, HHMI. Annual Intelligent Systems for Molecular doi:10.1371/journal.pcbi.1003621.g001 Biology (ISMB) meeting. This meeting is being held in Boston, Massachusetts, on July 11–15, 2014, at the John B. Hynes guidance of his dissertation advisor, An- sequences and how to build evolutionary Memorial Convention Center (https:// drzej Ehrenfeucht, who had eclectic inter- trees. www.iscb.org/ismb2014). ests that included molecular biology. Myers landed his first faculty position in Myers was captivated by computer Myers, along with fellow graduate students the Department of Computer Science at programming as a young student.
    [Show full text]
  • Protein-DNA Recognition Models for the Homeodomain and C2H2 Zinc Finger Transcription Factor Families Ryan Christensen Washington University in St
    Washington University in St. Louis Washington University Open Scholarship All Theses and Dissertations (ETDs) 1-1-2011 Protein-DNA Recognition Models for the Homeodomain and C2H2 Zinc Finger Transcription Factor Families Ryan Christensen Washington University in St. Louis Follow this and additional works at: https://openscholarship.wustl.edu/etd Recommended Citation Christensen, Ryan, "Protein-DNA Recognition Models for the Homeodomain and C2H2 Zinc Finger Transcription Factor Families" (2011). All Theses and Dissertations (ETDs). 564. https://openscholarship.wustl.edu/etd/564 This Dissertation is brought to you for free and open access by Washington University Open Scholarship. It has been accepted for inclusion in All Theses and Dissertations (ETDs) by an authorized administrator of Washington University Open Scholarship. For more information, please contact [email protected]. WASHINGTON UNIVERSITY IN ST. LOUIS Division of Biology and Biomedical Sciences Computational Biology Dissertation Examination Committee: Gary D. Stormo, Chair Michael R. Brent Jeremy D. Buhler James J. Havranek Garland R. Marshall Robi D. Mitra PROTEIN-DNA RECOGNITION MODELS FOR THE HOMEODOMAIN AND C2H2 ZINC FINGER TRANSCRIPTION FACTOR FAMILIES by Ryan Goaslind Christensen A dissertation presented to the Graduate School of Arts and Sciences of Washington University in partial fulfillment of the requirements for the degree of Doctor of Philosophy August 2011 Saint Louis, Missouri ABSTRACT OF THE DISSERTATION Protein-DNA Recognition Models for the Homeodomain and C2H2 Zinc Finger Transcription Factor Families By Ryan Goaslind Christensen Doctor of Philosophy in Biology and Biomedical Sciences Computational Biology Washington University in St. Louis, 2011 Professor Gary D. Stormo, Chairperson Transcription factors (TFs) play a central role in the gene regulatory network of each cell.
    [Show full text]
  • Reading Genomes Bit by Bit
    Reading genomes bit by bit Sean Eddy HHMI Janelia Farm Ashburn, Virginia Why did we sequence so many different flies? the power of comparative genome sequence analysis Why did we sequence a single-celled pond protozooan? exploiting unusual adaptations and unusual genomes Oxytricha trifallax Symbolic texts can be cracked by statistical analysis “Cryptography has contributed a new weapon to the student of unknown scripts.... the basic principle is the analysis and indexing of coded texts, so that underlying patterns and regularities can be discovered. If a number of instances can be collected, it may appear that a certain group of signs in the coded text has a particular function....” John Chadwick, The Decipherment of Linear B Cambridge University Press, 1958 Linear B, from Mycenae ca. 1500-1200 BC deciphered by Michael Ventris and John Chadwick, 1953 How much data are we talking about, really? STORAGE TIME TO COST/YEAR DOWNLOAD raw images 30 TB $36,000 20 days unassembled reads 100 GB $120 1 hr mapped reads 100 GB $120 1 hr assembled genome 6 GB $7 5 min differences 4 MB $0.005 0.2 sec my coffee coaster selab:/misc/data0/genomes 3 GB 450 GB JFRC computing, available disk ~ 1 petabyte (1000 TB) 1000 Genomes Project pilot 5 TB (30 GB/genome) selab:~eddys/Music/iTunes 128 albums 3 MB 15 GB NCBI Short Read Archive 200 TB + 10-20 TB/mo GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGG CGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATG
    [Show full text]
  • Footer: a Quantitative Comparative Genomics Method for Efficient Recognition of Cis-Regulatory Elements
    Methods Footer: A quantitative comparative genomics method for efficient recognition of cis-regulatory elements David L. Corcoran,1,2 Eleanor Feingold,1,2 Jessica Dominick,2 Marietta Wright,4 Jo Harnaha,4 Massimo Trucco,4 Nick Giannoukakis,4 and Panayiotis V. Benos2,3,5 1Department of Biostatistics, Graduate School of Public Health, 2Department of Human Genetics, GSPH, and 3Department of Computational Biology and University of Pittsburgh Cancer Institute, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania 15621, USA; 4Children’s Hospital of Pittsburgh, Pittsburgh 15213, USA The search for mammalian DNA regulatory regions poses a challenging problem in computational biology. The short length of the DNA patterns compared with the size of the promoter regions and the degeneracy of the patterns makes their identification difficult. One way to overcome this problem is to use evolutionary information to reduce the number of false-positive predictions. We developed a novel method for pattern identification that compares a pair of putative binding sites in two species (e.g., human and mouse) and assigns two probability scores based on the relative position of the sites in the promoter and their agreement with a known model of binding preferences. We tested the algorithm’s ability to predict known binding sites on various promoters. Overall, it exhibited 83% sensitivity and the specificity was 72%, which is a clear improvement over existing methods. Our algorithm also successfully predicted two novel NF-␬B binding sites in the promoter region of the mouse autotaxin gene (ATX, ENPP2), which we were able to verify by using chromatin immunoprecipitation assay coupled with quantitative real-time PCR.
    [Show full text]
  • Combined Evidence Annotation of Transposable Elements in Genome Sequences
    Combined Evidence Annotation of Transposable Elements in Genome Sequences Hadi Quesneville1*, Casey M. Bergman2¤, Olivier Andrieu1, Delphine Autard1, Danielle Nouaud1, Michael Ashburner2, Dominique Anxolabehere1 1 Laboratoire Dynamique du Ge´nome et Evolution, Institut Jacques Monod, Paris, France, 2 Department of Genetics, University of Cambridge, Cambridge, United Kingdom Transposable elements (TEs) are mobile, repetitive sequences that make up significant fractions of metazoan genomes. Despite their near ubiquity and importance in genome and chromosome biology, most efforts to annotate TEs in genome sequences rely on the results of a single computational program, RepeatMasker. In contrast, recent advances in gene annotation indicate that high-quality gene models can be produced from combining multiple independent sources of computational evidence. To elevate the quality of TE annotations to a level comparable to that of gene models, we have developed a combined evidence-model TE annotation pipeline, analogous to systems used for gene annotation, by integrating results from multiple homology-based and de novo TE identification methods. As proof of principle, we have annotated ‘‘TE models’’ in Drosophila melanogaster Release 4 genomic sequences using the combined computational evidence derived from RepeatMasker, BLASTER, TBLASTX, all-by-all BLASTN, RECON, TE-HMM and the previous Release 3.1 annotation. Our system is designed for use with the Apollo genome annotation tool, allowing automatic results to be curated manually to produce reliable annotations. The euchromatic TE fraction of D. melanogaster is now estimated at 5.3% (cf. 3.86% in Release 3.1), and we found a substantially higher number of TEs (n ¼ 6,013) than previously identified (n ¼ 1,572).
    [Show full text]