Identification of Cis-Regulatory Sequence
Total Page:16
File Type:pdf, Size:1020Kb
Worsley-Hunt et al. Genome Medicine 2011, 3:65 http://genomemedicine.com/content/3/9/65 REVIEW Identication of cis-regulatory sequence variations in individual genome sequences Rebecca Worsley-Hunt1,2, Virginie Bernard1 and Wyeth W Wasserman1* Abstract for the appropriate patterning or magnitude of gene expression. Similarly, mutations can disrupt other sequence- Functional contributions of cis-regulatory sequence specific regulatory controls, such as elements regulating variations to human genetic disease are numerous. For RNA splicing or stability. Although much emphasis in the instance, disrupting variations in a HNF4A transcription age of exome sequencing has been placed on variation factor binding site upstream of the Factor IX gene within protein-encoding sequences, it is apparent that contributes causally to hemophilia B Leyden. Although regulatory sequence disruptions will become a key focus clinical genome sequence analysis currently focuses as full genome sequences become widely accessible for on the identication of protein-altering variation, the medical genetics research. impact of cis-regulatory mutations can be similarly e emerging collection of cis-regulatory variations strong. New technologies are now enabling genome that cause human disease or altered phenotype is grow- sequencing beyond exomes, revealing variation across ing [2-4]. Reports identify cis-acting, expression-altering the non-coding 98% of the genome responsible for mutations observed within introns, far upstream of developmental and physiological patterns of gene genes, at splicing sites or within microRNA target sites. activity. The capacity to identify causal regulatory For example, cis-regulatory mutations have roles in hemo- mutations is improving, but predicting functional philia, Gilbert’s syndrome, Bernard-Soulier syndrome, changes in regulatory DNA sequences remains a great irritable bowel syndrome, beta-thalassemia, cholesterol challenge. Here we explore the existing methods and homeostasis and altered limb formation [5-11]. e software for prediction of functional variation situated number of cis-regulatory variants reported in the litera- in the cis-regulatory sequences governing gene ture has continued to expand over the past 2 years [12-18]. transcription and RNA processing. In addition, compilations of cis-regulatory variants have been reported [4,19,20]. Although many studies associate cis-regulatory variations with phenotype, it is rare for Introduction researchers to conclusively demonstrate causality. e Cis-regulatory sequences control when, where and with strongest causal evidence is obtained with transgenic what intensity genes are expressed. Key to this control is approaches, in cell culture or animal models, to identify the recruitment of transcription factors (TFs) that bind phenotypes triggered by such variations [21,22]. e to regulatory sequences, such as promoters, enhancers, impor tance of regulatory changes is nevertheless apparent. repressors and insulators. ese target sequences are Ultimately genetics researchers seeking regulatory spread across DNA. Probably reflecting the secondary muta tions are best served by high-quality annotations of structure properties of looped DNA within a nucleus, there the human genome, with clearly designated functional are confirmed cases of cis-regulatory elements up to about elements. Most routinely expressed protein coding exons 106 bp distant from the transcription initiating promoter are known, making initial identification of protein- of a gene [1]. Mutations in TF binding sites (TFBSs) can altering genetic changes simple. In contrast, despite disrupt the essential protein-DNA inter actions required ongoing ambitious efforts to annotate non-coding genome features, the inventory of cis-regulatory elements is far from complete. Large-scale chromatin immuno- *Correspondence: [email protected] precipitation (ChIP) experiments provide the vast 1Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, majority of data, eclipsing the compiled information of 950 West 28th Avenue, Vancouver, BC V5Z 4H4, Canada the past 25 years derived from targeted studies of specific Full list of author information is available at the end of the article regulatory elements. Many of the new ChIP-derived data, however, highlight segments of DNA (about 200 to © 2010 BioMed Central Ltd © 2011 BioMed Central Ltd 1,000 bp) containing a functional element rather than a Worsley-Hunt et al. Genome Medicine 2011, 3:65 Page 2 of 14 http://genomemedicine.com/content/3/9/65 1.4 2.1 1.1 1.2 2.2 Exon Intron Exon 1.3 Transcription 3.1 4.1 4.2 3.2 Translation AAAAA AUG 5’ Cap 5’ UTR 3’ UTR RNA stability Transcription 3.1 RNA structure 1.1 Insulator region 3.2 Polyadenylation site 1.2 Enhancer/repressor region 1.3 TFBSs 1.4 Initiation regions/core promoter Translation 4.1 Initation site 4.2 Termination site Splicing 2.1 Splice junctions 2.2 Splicing enhancers/suppressors Non-coding RNA 5.1 ncRNA targets not shown Figure 1. Classification of the regulatory sequences in a gene that can be modified by genetic variation. The four colored groups reflect mutations that have a deleterious impact on gene transcription (class 1 in red), splicing (class 2 in blue), RNA stability (class 3 in orange) and translation (class 4 in green). A fifth class, non-coding RNA interaction sites, is not shown as sites can occur throughout DNA and RNA sequences. specific element (average <15 bp). Similarly, DNase I A workflow for identifying disease-causing hyper sensitivity analysis specifies regions likely to contain cis-regulatory variants regulatory elements [23]. Thus, the experi mentally An example workflow for the identification of cis- defined regions must be coupled to additional methods regulatory TFBSs linked to a disease is outlined in the to assess the potential for a specific DNA variation to followingNon- coding section, and is illustrated in Figure 2. Given a affect gene activity. Some of the key data resources set of high-throughput sequencing data from an reporting regulatory regions and delineating specific individual,RNA the short sequences are individually aligned to elements are introduced below. a5.1 reference human genome sequence. Sets of overlapping Our perspective is biased to elements with sequence- readsncRNA are analyzed to determine the genotype at each specific properties, including TFBSs, microRNA, splice- position for which sufficient aligned sequences are regulating target sequences, and immediate core- available.targets Common polymorphisms and rare variants are promoter sequences critical to the initiation of trans- distinguishednot relative to the reference genome. For cription (Figure 1). Although diverse types of cis-regula- familialshown studies, variations segregating with the pheno- tory variations will become accessible for future studies, type can be determined, and researchers emerge with a at present the bioinformatics resources for the study of set of disease-causing candidates. For each candidate for variation within TFBSs are the most accessible and there- causality, it is desirable to assess the potential for the fore our primary focus here. sequence mutation to disrupt a biological function. Many We begin by outlining an example workflow. Then we researchers will be content to focus on the subset of step through elements of the workflow in greater detail, candidates predicted to create a severe alteration in a including a brief overview of the discovery of sequence protein sequence. Software for the prediction of damag- variations from high-throughput sequence data. Finally, ing changes, including modules from SIFT [24] and we review TFBS identification approaches and strategies PolyPhen-2 [25], identify variants that substitute amino for the prioritization of cis-regulatory variations for acids expected to cause changes in protein structure, further analysis. We conclude with a brief mention of two alter a critical position in a protein domain, or change an emerging experimental techniques that may be used in amino acid of high evolutionary conservation. Many the future to associate cis-regulatory variants and their researchers will stop at this step. gene targets. Our aim is to assist medical genetics For those interested in potential cis-regulatory changes, researchers to identify potential regulatory variants a panel of computational analyses can be performed on within non-coding regions of the human genome. the candidate variations. At the core of the processes is a Worsley-Hunt et al. Genome Medicine 2011, 3:65 Page 3 of 14 http://genomemedicine.com/content/3/9/65 Identify variants in genome Identify regulatory elements Coding variant characterization ChIP-Seq data Motif discovery TFBS profiles * Known common variants Known rare Classify variants variants Genetic variant data Unknown variants Select variants within regulatory elements TFBS variant characterization data management and visualization Use regulatory region filter(s) to refine variant list Data management and visualization Experimental analysis on final list of variants Figure 2. Overview of a workflow for cis-regulatory variant detection. The boxes represent steps in the workflow, and the italicized descriptions under the boxes correspond to analysis resources listed in Table 1. For identification of regulatory elements and regions, the order in the workflow may be changed without loss of information. Common variants, flagged by