The Genome in Action
THE GENOME IN ACTION: HIGH-THROUGHPUT SEQUENCING AND EPIGENOMICS

Epigenomics * IFT6299 H2014 * UdeM * Miklós Csűrös

Regulation of expression

Transcription of a DNA region requires:
* protein-DNA binding (a transcription factor and its recognition site)
* chromatin accessibility

[Figure 1 labels: chromatin, distal TFBS, co-activator complex, transcription initiation complex, transcription initiation, CRM, proximal TFBS]

Figure 1 | Components of transcriptional regulation. Transcription factors (TFs) bind to specific sites (transcription-factor binding sites; TFBS) that are either proximal or distal to a transcription start site. Sets of TFs can operate in functional cis-regulatory modules (CRMs) to achieve specific regulatory properties. Interactions between bound TFs and cofactors stabilize the transcription-initiation machinery to enable gene expression. The regulation that is conferred by sequence-specific binding TFs is highly dependent on the three-dimensional structure of chromatin.

[...] does not reveal the entire picture. There is only partial correlation between transcript and protein concentrations [3]. Nevertheless, the selective transcription of genes by RNA polymerase-II under specific conditions is crucially important in the regulation of many, if not most, genes, and the bioinformatics methods that address the initiation of transcription are sufficiently mature to influence the design of laboratory investigations.

Below, we introduce the mature algorithms and online resources that are used to identify regions that regulate transcription. To this end, underlying methods are introduced to provide the foundation for understanding the correct use and limitations of each approach. We focus on the analysis of cis-regulatory sequences in metazoan genes, with an emphasis on methods that use models that describe transcription-factor binding specificity. Methods for the analysis of regulatory sequences in sets of co-regulated genes will be addressed elsewhere. We use a case study of the human skeletal muscle troponin gene TNNC1 to demonstrate the specific execution of the described methods. A set of accompanying online exercises provides the means for researchers to independently explore some of the methods highlighted in this review (see online links box). Because the field is rapidly changing, emerging classes of software will be described in anticipation of the creation of accessible online analysis tools.

Identification of regions that control transcription

An initial step in the analysis of any gene is the identification of larger regions that might harbour regulatory control elements. Several advances have facilitated the prediction of such regions in the absence of knowledge about the specific characteristics of individual cis-regulatory elements. These tools broadly fall into two categories: promoter (transcription start site; TSS) and enhancer detection. The methods are influenced by sequence conservation between ORTHOLOGOUS genes (PHYLOGENETIC FOOTPRINTING), nucleotide composition and the assessment of available transcript data.

Functional regulatory regions that control transcription rates tend to be proximal to the initiation site(s) of transcription. Although there is some circularity in the data-collection process (regulatory sequences are sought near TSSs and are therefore found most often in these regions), the current set of laboratory-annotated regulatory sequences indicates that sequences near a TSS are more likely to contain functionally important regulatory controls than those that are more distal. However, specification of the position of a TSS can be difficult. This is further complicated by the growing number of genes that selectively use alternative start sites in certain contexts.

Underlying most algorithms for promoter prediction is a reference collection known as the 'Eukaryotic Promoter Database' (EPD) [4]. Early bioinformatics algorithms that were used to pinpoint exact locations for TSSs were plagued by false predictions [5]. These TSS-detection tools were frequently based on the identification of TATA-box sequences, which are often located ~30 bp upstream of a TSS. The leading TATA-box prediction method [6], reflecting the promiscuous binding characteristics of the TATA-binding protein, predicts TATA-like sequences nearly every 250 bp in long genome sequences.

A new generation of algorithms has shifted the emphasis to the prediction of promoters, that is, regions that contain one or more TSS(s). Given that many genes have multiple start sites, this change in focus is biochemically justified.

The dominant characteristic of promoter sequences in the human genome is the abundance of CpG dinucleotides. Methylation plays a key role in the regulation of gene activity. Within regulatory sequences, CpGs remain unmethylated, whereas up to 80% of CpGs in other regions are methylated on a cytosine. Methylated cytosines are mutated to adenosines at a high rate, resulting in a 20% reduction of CpG frequency in sequences without a regulatory function as compared with the statistically predicted CpG concentration [7]. Computationally, the CG dinucleotide imbalance can be a powerful tool for finding regions in genes that are likely to contain promoters [8].

Numerous methods have been developed that directly or indirectly detect promoters on the basis of the CG dinucleotide imbalance. Although complex computational MACHINE-LEARNING algorithms have been directed towards the identification of promoters, simple methods that are strictly based on the frequency of CpG dinucleotides perform remarkably well at correctly predicting regions that are proximal to or that contain the [...]

ORTHOLOGY: Two sequences are orthologous if they share a common ancestor and are separated by speciation.

PHYLOGENETIC FOOTPRINTING: An approach that seeks to identify conserved regulatory elements by comparing genomic sequences between related species.

MACHINE LEARNING: The ability of a program to learn from experience, that is, to modify its execution on the basis of newly acquired information. In bioinformatics, neural networks and Monte Carlo Markov Chains are well-known examples.

NATURE REVIEWS | GENETICS, VOLUME 5, APRIL 2004, p. 277. © 2004 Nature Publishing Group

Wasserman & Sandelin, Nat Rev Genet 5:276 (2004)

Annotating genomic activity

We want to annotate:
* transcription factor binding sites
* "open" chromatin
* DNA methylation
* histone modifications
* DNA-DNA interactions

Hawkins, Hon & Ren, Nat Rev Genet 11:476 (2010)

Everything can be done by sequencing...

1. filtering / enrichment of regions of interest
2. sequencing
3. alignment, to determine where the fragments come from

Wold & Myers, Nature Methods 5:19 (2008)

MNase-seq: nucleosome positions

MNase: micrococcal nuclease

[Supplementary Figure 1 (Supplementary Information, doi:10.1038/nature10002). In vivo nucleosome mapping: gradient-based and IG-bead cell sorting into CD4+ T lymphocytes, CD8+ T lymphocytes and granulocytes; lyse the cells; micrococcal nuclease; isolate and sequence mononucleosome cores.]

Supplementary Figure 1. Schematic depiction of the in vivo nucleosome mapping experiment. Blood cells were isolated from a human donor's blood and sorted into populations representing CD4+ T-cells, CD8+ T-cells and granulocytes.
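As a rough illustration of what an MNase-seq experiment like this one yields computationally, the sketch below turns aligned mononucleosome reads into candidate nucleosome centres. It is a minimal sketch under stated assumptions, not the procedure of the cited paper: the core length of 147 bp, the `(five_prime_pos, strand)` input format, and the function names `dyad_counts` and `call_positions` are all hypothetical choices made for the example, and the greedy peak picking is a deliberately simple stand-in for a real positioning method.

```python
from collections import Counter

# Assumption: a mononucleosome core protects about 147 bp of DNA.
# This value is not given in the slides.
NUCLEOSOME_BP = 147


def dyad_counts(reads, shift=NUCLEOSOME_BP // 2):
    """Count estimated nucleosome centres (dyads) per genomic position.

    reads: iterable of (five_prime_pos, strand), strand being '+' or '-'.
    A read's 5' end marks one border of the protected fragment, so the
    centre lies `shift` bp downstream on '+' and upstream on '-'.
    """
    counts = Counter()
    for pos, strand in reads:
        dyad = pos + shift if strand == "+" else pos - shift
        counts[dyad] += 1
    return counts


def call_positions(counts, min_reads=3, min_spacing=NUCLEOSOME_BP):
    """Greedy peak picking: take the best-supported dyads first and
    keep only those at least `min_spacing` bp from every kept dyad."""
    chosen = []
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for pos, n in ranked:
        if n < min_reads:
            break  # remaining positions have even less support
        if all(abs(pos - c) >= min_spacing for c in chosen):
            chosen.append(pos)
    return sorted(chosen)
```

Reads from both strands of the same fragment vote for the same centre, which is why the strand-dependent shift is applied before counting. The `min_spacing` constraint encodes the assumption that two nucleosomes cannot overlap on the same molecule.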
Nuclear chromatin was released by crushing the cells, followed by micrococcal nuclease treatment. The mononucleosome fraction was isolated by gel electrophoresis and sequenced to high depth using the SOLiD platform.

Valouev et al., Nature 474:516 (2011)

LETTERS, NATURE | Vol 458 | 19 March 2009

[...] in vitro yeast-based model and that of in vivo nucleosome occupancy in C. elegans [19] (Fig. 3f). Moreover, our model classifies nucleosome-enriched regions from nucleosome-depleted regions in C. elegans with high accuracy (Supplementary Fig. 4), and the 5-base-pair sequence preferences of the C. elegans in vivo map agree well with those of the yeast in vitro map (Fig. 3g). The poorer classification performance in comparison with yeast may indicate that factors other than the DNA sequence preferences make a greater contribution to nucleosome organization in more complex eukaryotes. Alternatively, the poorer performance may indicate that distinct sequence types are present in C. elegans for which our yeast in vitro data do not provide statistics. Nonetheless, our model is significantly correlated with the in vivo nucleosome organization across C. elegans.

[...] organization in several growth conditions, with local, condition-specific changes superimposed.

To address concerns regarding biases that may be caused by the sequence specificity of micrococcal nuclease [20] and possible biases in parallel sequencing, we performed a different kind of in vitro experiment that measures the relative nucleosome affinity of ~40,000 double-stranded 150-bp oligonucleotides without the use of micrococcal nuclease or parallel sequencing. The resulting 5-base-pair nucleosome sequence preferences are in excellent agreement with those discovered in the genome-wide in vitro reconstitution (correlation of 0.83), and there is a good correlation (0.51) between the measured oligonucleotide affinities and those predicted by the model constructed from the genome-wide in vitro map (Supplementary Fig.

We next compared the DNA-encoded nucleosome