Automated Construction of Structural Motifs for Predicting Functional Sites on Protein Structures

Total Page:16

File Type:pdf, Size:1020Kb

Automated Construction of Structural Motifs for Predicting Functional Sites on Protein Structures Automated Construction of Structural Motifs for Predicting Functional Sites on Protein Structures M.P. Liang, D.L. Brutlag, R.B. Altman Pacific Symposium on Biocomputing 8:204-215(2003) AUTOMATED CONSTRUCTION OF STRUCTURAL MOTIFS FOR PREDICTING FUNCTIONAL SITES ON PROTEIN STRUCTURES M.P. LIANG, D.L. BRUTLAG, R.B. ALTMAN Stanford University, 251 Campus Drive X-215, Stanford, CA 94305-5479, USA E-mail: {mliang, brutlag, altman}@smi.stanford.edu Structural genomics initiatives are beginning to rapidly generate vast numbers of protein structures. For many of the structures, functions are not yet determined and high-throughput methods for determining function are necessary. Although there has been extensive work in function prediction at the sequence level, predicting function at the structure level may provide better sensitivity and predictive value. We describe a method to predict functional sites by automatically creating three dimensional structural motifs from amino acid sequence motifs. These structural motifs perform comparably well with manually generated structural motifs and perform better than sequence motifs. Automatically generated structural motifs can be used for structural-genomic scale function prediction on protein structures. 1 Introduction 1.1 Structural Genomics With the sequencing of the human genome and many other organisms, rapid determination of gene and protein function is becoming increasingly important. Structural genomics initiatives are helping elucidate the function of these gene products by developing high-throughput methods for determining structures for all unique protein folds. These new structure targets are specifically selected not to have sequence similarity to existing proteins1-4. With the anticipated explosion of available structures, it is imperative to develop computational methods for high- throughput function prediction on protein structures. 1.2 Sequence Motifs There has been extensive work in identifying conserved residues in protein sequences with similar function. Amino acid sequence patterns that represent these conserved residue positions can be created from multiple alignments of sequences with similar function. These patterns, or sequence motifs, can be used to assign function to sequences that contain the pattern. Numerous sequence motif databases have been established with different methods for creating sequence motifs. Some of the databases are manually curated by experts, while others are automatically derived. The BLOCKS+ database provides an integration of many sequence motif databases by generating blocks of ungapped alignment of sequences from families in the databases5, 6. The eMotif database uses sequence alignments from the BLOCKS+ database to create sequence motifs (eMotifs) at various specificities for a family of protein sequences. By providing motifs at different specificities for a block of sequences, eMotifs can represent families with high specificity and yet provide sensitivity by combining multiple motifs7, 8. Although sequence motifs can provide insight into protein function, when new proteins do not have significant sequence similarity with known proteins, using only sequence information fails. Proteins that do not have high sequence similarity may still have similar function because of conservation of physicochemical properties at the structural level9, 10. 1.3 Structural Motifs Analogous to sequence motifs, structural motifs provide a description of conserved properties in the three dimensional structure of proteins sharing molecular function. Investigators have devised different techniques to construct and define structural motifs; each technique emphasizes different conserved properties. Wallace et al have developed a system, PROCAT, for identifying catalytic sites by geometric orientation of residues with known functional importance. By using previous knowledge of the critical residues involved in the catalytic activity, a structural motif representing the conserved relative positions of those residues is constructed. This motif can be used to scan a new protein structure for occurrence of the catalytic site using a geometric hashing algorithm11, 12. Fetrow and Skolnick have developed Fuzzy Functional Forms (FFF) for representing distances between interesting residues. The critical residues involved in a functional site are identified by careful examination of the literature. Examples of known structures containing these residues are used to find mean distance and variance between the residues. The structural motif representing the conserved distance and variance of the residues is used to identify functional sites on protein structures13, 14. Wei and Altman have developed the FEATURE system that describes the physicochemical environment around functional sites. The environment, characterized by observing the frequency of physicochemical properties in radial shells around the site of interest, represents a structural motif used for predicting the functional sites15, 16. Unlike PROCAT and FFF, FEATURE does not require the conserved properties to be known in advance, but can discover them automatically given a training set. Once a set of examples of the functional site and a background control set is provided, FEATURE can automatically build the structural motif without having to manually identify which residues are considered important. We introduce a new method, SeqFEATURE, that automatically creates structural motifs around functional sites by characterizing the structural environment around sequence motifs using FEATURE. Since the three dimensional structures of protein are more conserved than the sequence of amino acids10, focusing on the protein structure should provide better sensitivity in identifying protein function. 2 Methods 2.1 SeqFEATURE overview SeqFEATURE is a method for automatically building a structural motif from sequence motifs (Figure 1). The sequence motifs can be obtained from any of the sequence motif databases; here we use the eMotif database. Protein structures that contain the sequence motif are identified and a training set for the FEATURE system is automatically generated. FEATURE then uses the training set to construct a structural motif describing the physicochemical environment around the sequence motif. Site description Motif Databases FEATURE training FEATURE scan Training Set SeqFEATURE Protein Sites Gridding pdbid (x y z) Feature Extraction Structure Site Selection pdbid (x y z) Sequence … Motifs Non-Sites Feature Extraction Non-site Selection pdbid (x y z) Model pdbid (x y z) prop-vol: value Functional … prop-vol: value volume values Sites prop-vol: value prop-vol: value Hits (x y z) score (x y z) score properties Score Calculation … Score Calculation Figure 1: Data flow diagram for SeqFEATURE. Sequence motifs are selected by querying a sequence motif database. These motifs are fed into SeqFEATURE for automatic selection of sites and non-sites to create a training set. The training set is used by FEATURE for creating a model of the microenvironment around the functional site. The model is used to predict functional sites on new protein structures. 2.2 FEATURE system The FEATURE system is detailed and evaluated elsewhere16; here we describe it briefly. The FEATURE system is used to build structural motifs and predict functional sites. FEATURE builds a physicochemical model of the environment around functional sites. The model represents the statistical distribution of physicochemical properties at radial distances from the site of interest. The physicochemical properties range from atom names and atom properties to residue names and secondary structure (Table 1). By using properties at the atomic level and not just at the residue level, FEATURE provides a closer view of the chemistry necessary for function. Table 1: List of FEATURE’s physicochemical properties AtomName C N O S ANY OTHER ChemicalGroup Hydroxyl Amide Amine Carbonyl RingSystem Peptide AtomProperties VDWVolume Charge NegCharge PosCharge ChargeWithHis Hydrophobicity Mobility SolventAccessibility ResidueName ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL HOH OTHER ResidueProperties Hydrophobic Charged Polar NonPolar Basic Acidic SecondaryStructure 3Helix 4Helix 5Helix Bridge Strand Turn Bend Coil Het Unknown FEATURE requires a training set composed of positive examples of the functional site and negative background control examples. Each example is a specific 3D coordinate in a protein structure locating the site of interest. The environment around each site of interest is divided into radial shell volumes. FEATURE then creates a feature vector v for each site by counting the occurrence of physicochemical properties within each volume around the site. The structural motif of the functional site is constructed by comparing the distribution of feature vectors from the positive examples (sites) with those from the negative examples (non-sites). Properties at each volume are evaluated as statistically more present or absent in the functional site by using the Wilcoxon rank sum test. Finally, using a Bayesian inference method, FEATURE predicts functional sites at specific locations on protein structures. It places a 3D grid over the new protein structure and calculates the feature vector described above from the environment around each grid point. The grid point is given a score proportional to the likelihood that it is a functional site given its feature vector v. p(Site | v) Score = log (1) p(Site) By assuming that the
Recommended publications
  • Representation for Discovery of Protein Motifs
    From: ISMB-93 Proceedings. Copyright © 1993, AAAI (www.aaai.org). All rights reserved. Representation for Discovery of Protein Motifs Darrell ConklinT, Suzanne Fortier~, Janice Glasgowt Departments~t of Computing and Information Sciencet and Chemistry Queen’s University Kingston, Ontario, Canada K7L 3N6 [email protected] Abstract pattern of amino acid residues. Protein motifs can be roughly classified into four categories. Sequence There are several dimensions and levels of com- motifs are linear strings of residues with an implicit plexity in which information on protein mo- topological ordering. Sequence-structure motifs are tifs may be available. For example, one- sequence motifs with secondary structure identifiers at- dimensional sequence motifs may be associated tached to one or more residues in the motif. Structure with secondary structure identifiers. Alterna- motifs are 3D structural objects, described by posi- tively, three-dimensional information on polypep- tions of residue objects in 3D Euclidian space. Apart tide segments may be used to induce prototypical from a topological linear ordering on the residues, three-dimensional structure templates. This pa- structure motifs are free of sequence information. Fi- per surveys various representations encountered nally, structure-sequence motifs are combined 1D- in the protein motif discovery literature. Many 3D structures that associate sequence information with of the representations are based on incompatible a structure motif. Figure 1 illustrates these four types semantics, making difficult the comparison and of protein motifs, along with somefurther subclassifi- combination of previous results. To make better cations which are elaborated upon later in the paper. use of machine learning techniques and to pro- The first three motif types are discussed by Thornton vide for an integrated knowledge representation and Gardner (1989); the structure-sequence motif will framework, a general representation language -- be presented in this paper.
    [Show full text]
  • Introduction to Sequence Motif Discovery and Search the Complete
    Introduction to sequence motif discovery and search EMBNET course Bioinformatics of transcriptional regulation Jan 28 2008 Christoph Schmid The complete genome era Components of transcriptional regulation Distal transcription-factor binding sites (enhancer) cis-regulatory modules Wasserman 5, 276-287 (2004) DNA-protein interaction Allen et al. © 1998 EMBL Jordan et al. © 1991 Prentice-Hall Sequence motifs sequence variants of site: ..A T C G C A.. ..T T G G A C.. ..T T G G T G.. ..A T C G G T.. matrix: A 2 0 0 0 1 1 (simplest version) C 0 0 2 0 1 1 G 0 0 2 4 1 1 T 2 4 0 0 1 1 + cutoff ! sequence logo: Title Representation of the binding specificity by a scoring matrix (also referred to as weight matrix) 1 2 3 4 5 6 7 8 9 A -10 -10 -14 -12 -10 5 -2 -10 -6 C 5 -10 -13 -13 -7 -15 -13 3 -4 G -3 -14 -13 -11 5 -12 -13 2 -7 T -5 5 5 5 -10 -9 5 -11 5 Strong C T T T G A T C T Binding site 5 + 5 + 5 + 5 + 5 + 5 + 5 + 3 + 5 = 43 Random A C G T A C G T A Sequence -10 -10 -13 + 5 -10 -15 -13 -11 - 6 = -83 Biophysical interpretation of protein binding sites Columns of a weight matrix characterize the specificity of base-pair acceptor sites on the protein surface. Weight matrix elements represent negated energy contributions to the total binding energy → weight matrix score inversely proportional to binding energy Motif search: Statistical over-representation of genomic sequence motifs Conserved motifs represent protein binding sites Putative conservation in nucleotide sequence (motif) position relative to: TSS other binding sites (protein complexes)
    [Show full text]
  • Tools for Motif and Pattern Searching
    TOOLS FOR MOTIF AND PATTERN SEARCHING Prat Thiru OUTLINE • What are motifs? • Algorithms Used and Programs Available • Workflow and Strategies • MEME/MAST Demo (online and command line) Protein Motifs DNA Motifs MEME Output Definitions • Motif: Conserved regions of protein or DNA sequences • Pattern: Qualitative description of a motif eg. regular expression C[AT]AAT[CG]X •Profile: Quantitative description of a motif eg. position weight matrix Patterns • Regular Expression Symbols ¾[ ] – OR eg. [GA] means G or A ¾{ } – NOT eg. {P,V} means not P or V ¾( ) – repeats eg. A(3) means AAA ¾X or N or “.” – any • Complex patterns representation difficult • Loose frequency information eg. [AT] vs 20%A 80%T Profiles Sequence Logos Algorithms • Enumeration • Probabilistic Optimization • Deterministic Optimization 1. Identify motifs 2. Build a consensus Enumeration • Exhaustive search: word counting method, count all n-mers and look for overrepresentation • Less likely to get stuck in a local optimum • Computationally expensive ¾YMF http://wingless.cs.washington.edu/YMF/YMFWeb/YMFInput.pl ¾Weeder http://159.149.109.9/weederaddons/locator.html Probabilistic Optimization • Uses a Gibbs sampling approach • One n-mer from each sequence is randomly picked to determine initial model. In subsequent iterations, one sequence, i, is removed and the model is recalculated. Pick a new location of motif in sequence i iterate until convergence • Assumes most sequences will have the motif ¾ AlignAce http://atlas.med.harvard.edu/cgi-bin/alignace.pl ¾ Gibbs Motif Sampler http://bayesweb.wadsworth.org/gibbs/gibbs.html Deterministic Optimization • Based on expectation maximization (EM) • EM: iteratively estimates the likelihood given the data that is present I.
    [Show full text]
  • Degsampler: Gibbs Sampling Strategy for Predicting E3-Binding Sites with Position-Specific Prior Information
    DegSampler: Gibbs sampling strategy for predicting E3-binding sites with position-specic prior information Osamu Maruyama ( [email protected] ) Kyushu University https://orcid.org/0000-0002-5760-1507 Fumiko Matsuzaki Kyushu University Research article Keywords: motif, substrate protein, E3 ubiquitin ligase, prior, posterior, disorder, collapsed, Gibbs sampling, DegSampler Posted Date: November 6th, 2020 DOI: https://doi.org/10.21203/rs.3.rs-101967/v1 License: This work is licensed under a Creative Commons Attribution 4.0 International License. Read Full License Maruyama and Matsuzaki RESEARCH DegSampler: Gibbs sampling strategy for predicting E3-binding sites with position-specific prior information Osamu Maruyama1* and Fumiko Matsuzaki2 Background Eukaryotic cells have two major pathways for degrad- ing proteins in order to control cellular processes and maintain intracellular homeostasis. One of the path- ways is autophagy, a catabolic process that deliv- ers intracellular components to lysosomes or vacuoles Abstract [1]. The other pathway is the ubiquitin-proteasome Background: The ubiquitin-proteasome system is system, which degrades polyubiquitin-tagged pro- a pathway in eukaryotic cells for degrading teins through the proteasomal machinery [2]. In the polyubiquitin-tagged proteins through the ubiquitin-proteasome system, an E3 ubiquitin ligase proteasomal machinery to control various cellular (hereinafter E3) selectively recognizes and binds to processes and maintain intracellular homeostasis. specific regions of the target substrate proteins. These In this system, the E3 ubiquitin ligase (hereinafter binding sites are called degrons [3]. E3) plays an important role in selectively It is important to identify the E3s and substrate recognizing and binding to specific regions of its proteins that interact with one another to character- substrate proteins.
    [Show full text]
  • Bioinformatics: a Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D
    BIOINFORMATICS A Practical Guide to the Analysis of Genes and Proteins SECOND EDITION Andreas D. Baxevanis Genome Technology Branch National Human Genome Research Institute National Institutes of Health Bethesda, Maryland USA B. F. Francis Ouellette Centre for Molecular Medicine and Therapeutics Children’s and Women’s Health Centre of British Columbia University of British Columbia Vancouver, British Columbia Canada A JOHN WILEY & SONS, INC., PUBLICATION New York • Chichester • Weinheim • Brisbane • Singapore • Toronto BIOINFORMATICS SECOND EDITION METHODS OF BIOCHEMICAL ANALYSIS Volume 43 BIOINFORMATICS A Practical Guide to the Analysis of Genes and Proteins SECOND EDITION Andreas D. Baxevanis Genome Technology Branch National Human Genome Research Institute National Institutes of Health Bethesda, Maryland USA B. F. Francis Ouellette Centre for Molecular Medicine and Therapeutics Children’s and Women’s Health Centre of British Columbia University of British Columbia Vancouver, British Columbia Canada A JOHN WILEY & SONS, INC., PUBLICATION New York • Chichester • Weinheim • Brisbane • Singapore • Toronto Designations used by companies to distinguish their products are often claimed as trademarks. In all instances where John Wiley & Sons, Inc., is aware of a claim, the product names appear in initial capital or ALL CAPITAL LETTERS. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration. Copyright ᭧ 2001 by John Wiley & Sons, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic or mechanical, including uploading, downloading, printing, decompiling, recording or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher.
    [Show full text]
  • RNA Structural Motif Recognition Based on Least-Squares Distance
    Downloaded from rnajournal.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press BIOINFORMATICS RNA structural motif recognition based on least-squares distance YING SHEN,1 HAU-SAN WONG,2,4 SHAOHONG ZHANG,3 and LIN ZHANG1 1School of Software Engineering, Tongji University, Shanghai 200092, China 2Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong 3Department of Computer Science, Guangzhou University, Guangzhou 510006, China ABSTRACT RNA structural motifs are recurrent structural elements occurring in RNA molecules. RNA structural motif recognition aims to find RNA substructures that are similar to a query motif, and it is important for RNA structure analysis and RNA function prediction. In view of this, we propose a new method known as RNA Structural Motif Recognition based on Least-Squares distance (LS-RSMR) to effectively recognize RNA structural motifs. A test set consisting of five types of RNA structural motifs occurring in Escherichia coli ribosomal RNA is compiled by us. Experiments are conducted for recognizing these five types of motifs. The experimental results fully reveal the superiority of the proposed LS-RSMR compared with four other state-of-the-art methods. Keywords: RNA structural motif; RNA motif recognition INTRODUCTION RNA molecule (i.e., the search space). There are two prevalent ways to search for RNA structural motifs. The first class of LongnoncodingRNA(ncRNA),whichconsistsof>200ntand methods is to find motifs based on their geometric features. has little or no protein-coding capability, is a remarkable class of RNAs. They are important in various biological processes NASSAM (Harrison et al. 2003) represents RNA motifs and (Dinger et al.
    [Show full text]
  • Sequence Conservation
    2/14/17 Sequence conservation Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, FeBruary 16th 2017 After this lecture, you can… … list the mechanisms of DNA mutation … sort different biological information levels By conservation … discuss the multiple roles of protein domains and draw the parallel with computer scripting … use conservation to identify functional importance … make and interpret consensus sequences … make and interpret sequence logos & information content … recognize and explain why TFBS are often palindromic … explain the functional importance of unexpected variation 1 2/14/17 Evolution • Replication (copy) • Mutation (modify) • Selection (fitness) Time Mutations • Nucleotide suBstitutions – Replication error – Physical or chemical reaction • Insertions or deletions (indels) – Unequal crossing over during meiosis – Replication slippage • Inversions or rearrangements • Duplications of: – Partial or whole gene – Partial (polysomy) or whole chromosome (aneuploidy, polysomy) – Whole genome (polyploidy) • Horizontal gene transfer (HGT) – Conjugation (direct transfer between Bacteria) – Transformation by naturally competent Bacteria – Transduction by bacteriophages 2 2/14/17 Phenotypic/genotypic similarity • We exploit similarity to… …identify homology (shared ancestry) …determine evolutionary relationships …transfer functional information • Sequences (genotype) rarely converge • Functions (phenotype) can converge • Analogy • Homology – Similar function – Similar ancestry Low-complexity regions •
    [Show full text]
  • Cross-Talk Between Overlap Interactions in Biomolecules: a Case Study of the Β-Turn Motif
    molecules Article Cross-Talk between Overlap Interactions in Biomolecules: A Case Study of the b-Turn Motif Jayashree Nagesh Solid State and Structural Chemistry Unit, Indian Institute of Science Bangalore, Bengaluru 560012, Karnataka, India; [email protected] Abstract: Noncovalent interactions play a pivotal role in regulating protein conformation, stability and dynamics. Among the quantum mechanical (QM) overlap-based noncovalent interactions, n ! p∗ is the best understood with studies ranging from small molecules to b-turns of model proteins such as GB1. However, these investigations do not explore the interplay between multiple overlap interactions in contributing to local structure and stability. In this work, we identify and characterize all noncovalent overlap interactions in the b-turn, an important secondary structural element that facilitates the folding of a polypeptide chain. Invoking a QM framework of natural bond orbitals, we demonstrate the role of several additional interactions such as n ! s∗ and p ! p∗ that are energetically comparable to or larger than n ! p∗. We find that these interactions are sensitive to changes in the side chain of the residues in the b-turn of GB1, suggesting that the n ! p∗ may not be the only component in dictating b-turn conformation and stability. Furthermore, a database search of n ! s∗ and p ! p∗ in the PDB reveals that they are prevalent in most proteins and have significant interaction energies (∼1 kcal/mol). This indicates that all overlap interactions must be taken into account to obtain a comprehensive picture of their contributions to protein structure and energetics. Lastly, based on the extent of QM overlaps and interaction energies, we propose geometric criteria using which these additional interactions can be efficiently tracked in broad database searches.
    [Show full text]
  • Alignment, Clustering and Extraction of Structured Motifs in DNA Promoter Sequences
    Syracuse University SURFACE Electrical Engineering and Computer Science - Dissertations College of Engineering and Computer Science 8-2012 Alignment, Clustering and Extraction of Structured Motifs in DNA Promoter Sequences Faisal Abdulmalek Alobaid Syracuse University Follow this and additional works at: https://surface.syr.edu/eecs_etd Part of the Computer Engineering Commons Recommended Citation Alobaid, Faisal Abdulmalek, "Alignment, Clustering and Extraction of Structured Motifs in DNA Promoter Sequences" (2012). Electrical Engineering and Computer Science - Dissertations. 322. https://surface.syr.edu/eecs_etd/322 This Dissertation is brought to you for free and open access by the College of Engineering and Computer Science at SURFACE. It has been accepted for inclusion in Electrical Engineering and Computer Science - Dissertations by an authorized administrator of SURFACE. For more information, please contact [email protected]. ABSTRACT A simple motif is a short DNA sequence found in the promoter region and believed to act as a binding site for a transcription factor protein. A structured motif is a sequence of simple motifs (boxes) separated by short sequences (gaps). Biologists theorize that the presence of these motifs play a key role in gene expression regulation. Discovering these patterns is an important step towards understanding protein-gene and gene-gene interaction thus facilitates the building of accurate gene regulatory network models. DNA sequence motif extraction is an important problem in bioinformatics. Many studies have proposed algorithms to solve the problem instance of simple motif extraction. Only in the past decade has the more complex structured motif extraction problem been examined by researchers. The problem is inherently challenging as structured motif patterns are segmented into several boxes separated by variable size gaps for each instance.
    [Show full text]
  • ER-Resident Transcription Factor Nrf1 Regulates Proteasome Expression and Beyond
    International Journal of Molecular Sciences Review ER-Resident Transcription Factor Nrf1 Regulates Proteasome Expression and Beyond Jun Hamazaki * and Shigeo Murata Laboratory of Protein Metabolism, Graduate School of Pharmaceutical Sciences, the University of Tokyo, Tokyo 1130033, Japan; [email protected] * Correspondence: [email protected] Received: 4 May 2020; Accepted: 20 May 2020; Published: 23 May 2020 Abstract: Protein folding is a substantively error prone process, especially when it occurs in the endoplasmic reticulum (ER). The highly exquisite machinery in the ER controls secretory protein folding, recognizes aberrant folding states, and retrotranslocates permanently misfolded proteins from the ER back to the cytosol; these misfolded proteins are then degraded by the ubiquitin–proteasome system termed as the ER-associated degradation (ERAD). The 26S proteasome is a multisubunit protease complex that recognizes and degrades ubiquitinated proteins in an ATP-dependent manner. The complex structure of the 26S proteasome requires exquisite regulation at the transcription, translation, and molecular assembly levels. Nuclear factor erythroid-derived 2-related factor 1 (Nrf1; NFE2L1), an ER-resident transcription factor, has recently been shown to be responsible for the coordinated expression of all the proteasome subunit genes upon proteasome impairment in mammalian cells. In this review, we summarize the current knowledge regarding the transcriptional regulation of the proteasome, as well as recent findings concerning the regulation of Nrf1 transcription activity in ER homeostasis and metabolic processes. Keywords: proteasome; Nrf1; DDI2; bounce back; proteostasis 1. Introduction Proteins are essential to the vast majority of cellular and organismal functions. Cellular protein levels are defined by the balance among protein synthesis, folding, and degradation, and the failure to accurately coordinate these processes leads to disease [1].
    [Show full text]
  • Lecture 7: Sequence Motif Discovery
    Sequence motif: definitions COSC 348: Computing for Bioinformatics • In Bioinformatics, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has Lecture 7: been proven or assumed to have a biological significance. Sequence Motif Discovery • Once we know the sequence pattern of the motif, then we can use the search methods to find it in the sequences (i.e. Lubica Benuskova Boyer-Moore algorithm, Rabin-Karp, suffix trees, etc.) • The problem is to discover the motifs, i.e. what is the order of letters the particular motif is comprised of. http://www.cs.otago.ac.nz/cosc348/ 1 2 Examples of motifs in DNA Sequence motif: notations • The TATA promoter sequence is an example of a highly • An example of a motif in a protein: N, followed by anything but P, conserved DNA sequence motif found in eukaryotes. followed by either S or T, followed by anything but P − One convention is to write N{P}[ST]{P} where {X} means • Another example of motifs: binding sites for transcription any amino acid except X; and [XYZ] means either X or Y or Z. factors (TF) near promoter regions of genes, etc. • Another notation: each ‘.’ signifies any single AA, and each ‘*’ Gene 1 indicates one member of a closely-related AA family: Gene 2 − WDIND*.*P..*...D.F.*W***.**.IYS**...A.*H*S*WAMRN Gene 3 Gene 4 • In the 1st assignment we have motifs like A??CG, where the Gene 5 wildcard ? Stands for any of A,U,C,G. Binding sites for TF 3 4 Sequence motif discovery from conservation Motif discovery based on alignment • profile analysis is another word for this.
    [Show full text]
  • Comparison of GHT-Based Approaches to Structural Motif Retrieval
    Comparison of GHT-Based Approaches to Structural Motif Retrieval Alessio Ferone1 and Ozlem Ozbudak2 1 University of Naples Parthenope, Department of Applied Science, Centro Direzionale Napoli - Isola C4, 80143, Napoli, Italy [email protected] 2 Istanbul Technical University, Department of Electronics and Communication Engineering, 34469, Istanbul, Turkey [email protected] Abstract. The structure of a protein gives important information about its function and can be used for understanding the evolutionary relation- ships among proteins, predicting protein functions, and predicting pro- tein folding. A structural motif is a compact 3D protein block referring to a small specific combination of secondary structural elements which appears in a variety of molecules. In this paper we present a compari- son between few approaches for motif retrieval based on the Generalized Hough Transform (GHT). Performance comparisons, in terms of preci- sion and computation time, are presented considering the retrieval of motifs composed by three to five SSs for more than 15 million searches. The approaches object of this study can be easily applied to the retrieval of greater blocks, up to protein domains, or even entire proteins. Keywords: Hough transform, Protein motif retrieval, Protein structure comparison. 1 Introduction Proteins are central molecules in biological phenomena because they form the functional and structural cell components of every organisms and their function is determined, to a large extend, by their spatial structures. Starting from the linear sequence of amino acid given in Protein Data Bank (PDB) [1], two basic regular 3D structures can be envisaged [11], called SSs: helices and sheets.Small specific combinations of SSs, which appear in a variety of molecules, are called motifs, and can be considered as super-SSs [12].
    [Show full text]