Evolution and Function of Drososphila Melanogaster Cis-Regulatory Sequences

Evolution and Function of Drososphila melanogaster cis-regulatory Sequences By Aaron Hardin A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Molecular and Cell Biology in the Graduate Division of the University of California, Berkeley Committee in charge: Professor Michael Eisen, Chair Professor Doris Bachtrog Professor Gary Karpen Professor Lior Pachter Fall 2013 Evolution and Function of Drososphila melanogaster cis-regulatory Sequences This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License 2013 by Aaron Hardin 1 Abstract Evolution and Function of Drososphila melanogaster cis-regulatory Sequences by Aaron Hardin Doctor of Philosophy in Molecular and Cell Biology University of California, Berkeley Professor Michael Eisen, Chair In this work, I describe my doctoral work studying the regulation of transcription with both computational and experimental methods on the natural genetic variation in a population. This works integrates an investigation of the consequences of polymorphisms at three stages of gene regulation in the developing fly embryo: the diversity at cis-regulatory modules, the integration of transcription factor binding into changes in chromatin state and the effects of these inputs on the final phenotype of embryonic gene expression. i I dedicate this dissertation to Mela Hardin who has been here for me at all times, even when we were apart. ii Contents List of Figures iv List of Tables vi Acknowledgments vii 1 Introduction1 2 Within Species Diversity in cis-Regulatory Modules6 2.1 Introduction....................................6 2.2 Results.......................................8 2.2.1 Genome wide diversity in transcription factor binding sites......8 2.2.2 Genome wide purifying selection on cis-regulatory modules......9 2.3 Discussion.....................................9 2.4 Methods for finding polymorphisms....................... 11 2.5 Methods for inferring selection on binding sites................. 11 2.6 Figures....................................... 12 3 Early Embryo Chromatin Landscape 17 3.1 Introduction.................................... 17 3.2 Results....................................... 19 3.3 Discussion..................................... 22 3.4 Methods of Embryo Collection and Treatment................. 23 3.5 Methods of Comparing DNase-seq........................ 26 3.6 Figures....................................... 27 3.7 Tables....................................... 35 4 Interaction of Chromatin and Expression in Early Embryogenesis 36 4.1 Introduction.................................... 36 4.2 Results....................................... 38 4.3 Discussion..................................... 38 4.4 Methods...................................... 40 4.5 Figures....................................... 40 Contents iii Bibliography 44 iv List of Figures 2.1 Histogram of allele frequencies from two populations of N. American strains. The alleles found from the Raleigh population measured with short reads correlate highly with alleles in the Winters population from Sanger sequencing...... 12 2.2 Distribution of Tajima’s D values for classes of D. melanogaster DNA. Bar are given with whiskers indicating two standard errors. Transcription factors regions for each factor were appended together........................ 13 2.3 Strength of AP transcription factors binding in relationship to population motif diversity in bound regions. Strength of binding is measured by the maximum fragment depth of the region and motifs were considered as binary absent or present if the motif passed a p-value threshold................... 14 2.4 Binding of AP transcription factors in relationship of excess low frequency motifs D as calculated in equation 2.1........................... 15 2.5 Transcription factor binding sites are under similar selective pressures as the entire cis-regulatory module. Each motif was shuffled by column followed by row and scored against the bound regions using the same thresholds as the canonical motif. The score distributions were compared with Wilcoxon rank-sum test and no significant differences were found......................... 16 3.1 DGRP strains used for mixture combinations. Each strain was mixed with all others at least once and biological replicates for DGRP705 mixed with DGRP380 and DGRP437 were collected and sequenced..................... 28 3.2 DNase landscape at eve. A subset of the strain mixtures shows the range of accessibilities and general consistency between strains............... 29 3.3 DNase signal at TF peaks from stage 5 embryos.................. 30 3.4 DNase I cleavage (per nucleotide per 1M reads) at ChIP-seq identified ZLD binding sites. Red line is the average cuts on the positive strand and green represents DNase cuts mapping to the negative strand..................... 31 3.5 Kmer effects on chromatin accessibility....................... 32 3.6 Fold change in accessibility at kmers with largest mean effects on chromatin state. 32 3.7 Change in accessibility around differentially accessibility SNPs. The difference in expected accessibility corrected for mixture ratios around SNPs which deviated significantly from expected ratios........................... 33 List of Figures v 3.8 Regions with conserved and variable chromatin accessibility are both under purifying selection. The mean pairwise diversity around the peak of chromatin accessible regions were smoothed with LOWESS for each class of chromatin accessible regions, shared within all strains in red and polymorphic in blue......... 34 4.1 RNA fragment distribution between stain mixtures, including biological replicates. 41 4.2 Differential Expression in mixture of strains with alleles classified into matching strains 324 and 303. Red transcripts are expressed higher in 324, green transcripts are expressed higher in 303 with an FDR of 0.05 calculated with EBSeq..... 42 4.3 Expression of AP and DV target genes. Of the 7064 expressed genes, 4% show differential expression................................ 42 4.4 Differential chromatin accessibility within 2Kb of TSS and expression per transcript. The colors of each transcript match Figure 4.2. All correlations of expression and DNase accessibility per transcript partitioned into increased expression, decreased expression, no change, and all transcripts were non-significant, p = 0.06 to 0.90...................................... 43 vi List of Tables 3.1 Kmer effects on chromatin accessibility....................... 35 vii Acknowledgments I would like to acknowledge the following individuals for their invaluable contributions to this work and my personal growth in and out of the lab: Matt Davis, Xiao-Yong Li, Tommy Kaplan, Jacqueline Villalta, Mike Eisen, Devin Scannell, Doris Bachtrog, John Pool, Ryan Shultzaberger, Colin Brown, Susan Lott, Rich Lusk, Kelly Schiabor, Peter Combs, Dan Richter, Matt Taliaferro, Oh-Kyu Yoon, Tera Levin, Heather Bruce, Adam Session, Lior Pachter, Adam Roberts, Christopher Villalta, Eric Buhlis, Gary Karpen, Craig Miller, Robert Fowler, and many other members of the Eisen Lab, the Department of Molecular and Cell Biology, and the community of the University of California. 1 Chapter 1 Introduction The power of Selection, whether exercised by man or brought into play under na- ture through the struggle for existence and the consequent survival of the fittest, absolutely depends on the variability of organic beings. Without variability, noth- ing can be effected; slight individual differences, however, suffice for the work, and are probably the chief or sole means in the production of new species. (Charles Darwin 1868[1]) When Charles Darwin described the mechanisms of evolution[2], he had no knowledge of the genetic mechanisms of inheritance. He proposed gemmules[1], small packages carrying information for each part of the organism to the germ cells where they were collected and integrated forming the next generation. Darwin’s erroneous solution was motivated by the puzzle as to how quantitative and discrete heritable variation could function. We now know that both result from the interaction of a multitude of genes and their variants [3]. These networks of interactions have explained several of Darwin’s original observations of variability in and between species. The heritable variance that lead to differences in finch beaks[4] and dog varieties[5] have been traced down to the level of individual nucleotides through the study of the molecule of inheritance we now know to be DNA[6]. DNA contains the information for producing enzymatic and structural proteins and ul- timately all metabolic processes and products are encoded in DNA. DNA is heritable and is passed down from parents to offspring over generations, accumulating changes in both the regions which encode proteins and the regulatory regions that control when, where, how much, and what form of protein is produced. DNA can be modified in the short term through the addition of chemical modifications such as methylation[7] but these effects do not persist over evolutionary time scales[8], only the changes of the sequence itself. Understanding how the primary sequence of DNA is interpreted and translated into the complex form of life has been one of the major scientific efforts since the 1950s. As time and resources were put into determining the genomes of species, we collected genomes on larger scales starting with small viruses with only 4 genes[9], bacteria[10], and finally model eukaryotes such

Evolution and Function of Drososphila Melanogaster Cis-Regulatory Sequences

Enhanced Representation of Natural Product Metabolism in Uniprotkb

The EMBL-European Bioinformatics Institute the Hub for Bioinformatics in Europe

A Multiscale Tool to Explore Genomic Conservation

A Zebrafish Reporter Line Reveals Immune and Neuronal Expression of Endogenous Retrovirus

Tunca Doğan , Alex Bateman , Maria J. Martin Your Choice

A Burst of Protein Sequence Evolution and a Prolonged Period of Asymmetric Evolution Follow Gene Duplication in Yeast

A Phd Position Is Available in the Research Group of Aoife Mclysaght

Molecular Genetics & Genomics

Pfam: the Protein Families Database in 2021 Jaina Mistry 1,*, Sara Chuguransky 1, Lowri Williams 1, Matloob Qureshi 1, Gustavo A

Rapid Identification of Novel Protein Families Using Similarity Searches [Version 1; Peer Review: 2 Approved]

Structure-Based Realignment of Non-Coding Rnas in Multiple Whole Genome Alignments

Generating Functional Protein Variants with Variational Autoencoders