Basics on Bioinformatics Lecture 8

Nunzio D’Agostino
[email protected]; [email protected]

Protein domain: terminology

Superfamily: Proteins that have low sequence identity, but whose structural and functional features suggest a common evolutionary origin.
Family: Proteins clustered into a family are clearly evolutionarily related (accepted rule: pairwise residue identity between the proteins > 30%).
Domain: A polypeptide chain, or part of a polypeptide chain, that can fold independently into a stable tertiary structure. Domains are also units of function. They are not unique to the protein products of one gene or one gene family, but appear in a variety of proteins.
Motif: A pattern of amino acids that is conserved across many proteins and confers a particular function on the protein.
Site: The binding site where catalysis occurs (the active site). Its structure and chemical properties allow the recognition and binding of the substrate.

Why protein domain identification?

By identifying domains we can:
o classify a new protein as belonging to a specific family
o infer functionality
o infer the cellular localization of a protein

Domain representation – patterns

Some biologically significant amino acid patterns (motifs) can be summarised in the form of regular expressions. A regular expression is a powerful notational algebra that describes a string or a set of strings. Regular expressions are useful whenever one wants to find patterns in strings. The standard notation for describing regular expressions uses these conventions:

[AS] = A or S allowed.
D = D allowed.
x = Any symbol.
x4 = Four arbitrary symbols.
{PG} = Any symbol except P and G.
[FY]2 = Two positions where F or Y is allowed.
x(3,7) = Minimum 3 and maximum 7 residues.
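The conventions listed above map almost one-to-one onto the regular-expression syntax of most programming languages. The following Python sketch translates a pattern written in the slide notation into a Python regular expression and searches a sequence with it. The function name pattern_to_regex, the example pattern and the example sequence are hypothetical and chosen only for illustration; the sketch assumes that elements are separated by "-" and that a repeat count is written either bare (x4) or in parentheses (x(3,7)).

```python
import re

# One pattern element: a residue class followed by an optional repeat count.
ELEMENT = re.compile(
    r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])"      # [AS], {PG}, a single residue, or x
    r"(?:\((\d+)(?:,(\d+))?\)|(\d+))?$"    # (n), (n,m), or a bare n as on the slide
)


def pattern_to_regex(pattern: str) -> str:
    """Translate a '-'-separated pattern in the slide notation into a Python regex."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        residues, low, high, bare = match.groups()
        if residues == "x":                    # x = any symbol
            residues = "."
        elif residues.startswith("{"):         # {PG} = any symbol except P and G
            residues = "[^" + residues[1:-1] + "]"
        if bare:                               # x4, [FY]2
            residues += "{" + bare + "}"
        elif low and high:                     # x(3,7)
            residues += "{" + low + "," + high + "}"
        elif low:                              # x(4)
            residues += "{" + low + "}"
        parts.append(residues)
    return "".join(parts)


# Hypothetical pattern and sequence, for illustration only.
regex = pattern_to_regex("[AS]-D-x4-{PG}-[FY]2")
hit = re.search(regex, "MSDKLMNAFYR")
print(regex)        # [AS]D.{4}[^PG][FY]{2}
print(hit.group())  # SDKLMNAFY
```

Because a pattern either matches or does not, a single unexpected substitution at a conserved position is enough to miss a true family member, which is one motivation for the scoring-based representations that follow.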
Domain representation – patterns

A multiple sequence alignment (MSA) is used to detect functionally important residues, which are then written as a pattern in regular-expression form:

[AVL]-L-[IV]-M-[TS]-C-[DE]-R-[FY]2-Q

Domain representation – PSSM/profile

A PSSM is a Position-Specific Scoring Matrix. A profile is one type of PSSM. A sequence profile (Gribskov et al. 1987) is essentially a table that lists the frequencies of each amino acid at each position of a protein sequence. Frequencies are calculated from multiple alignments of related sequences (containing a domain of interest). PSSM scores are generally shown as positive or negative integers. Positive scores indicate that the given amino acid substitution occurs more frequently in the alignment than expected by chance, while negative scores indicate that the substitution occurs less frequently than expected. Large positive scores often indicate critical functional residues, which may be active-site residues or residues required for other intermolecular interactions.

Domain representation – Markov model

Markov model: a way of describing a process that goes through a series of states. In a regular Markov model, the state is directly visible to the observer. Each state has a probability of transitioning to each of the other states. Xk is the random variable for the state at step k. States are ∈ {A, C, G, T}.

State transition example (rows: state at the k-th step; columns: state at the (k+1)-th step):

     A    C    G    T
A   0.3  0.2  0.1  0.4
C   0.1  0.6  0.2  0.1
G   0.2  0.4  0.1  0.3
T   0.5  0.1  0.2  0.1

(As printed, the T row sums to 0.9 rather than 1.)

Domain representation – hidden Markov model

In a hidden Markov model (HMM), the state is not directly visible, but variables influenced by the state are visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states. An essential characteristic of a Markov process is that the change depends only on the current state (Bayes' theorem). The history of the system does not matter.
The states that the system has been in before are not relevant; only the current state determines what will happen next. The system has no memory. The hidden Markov model method was originally used in speech recognition before being applied to biological sequence analysis. HMMs are used to represent sequence families. A profile HMM is a particular type of HMM suited to modeling multiple alignments.

Domain representation – hidden Markov model

[Figure: chain of hidden states X1 to X4, each emitting an observation Y1 to Y4]

Each oval shape represents a random variable that can adopt a number of values. The random variable x2 is the hidden state at time 2. From the diagram, it is clear that the value of the hidden variable x2 (at time 2) depends only on the value of the hidden variable x1 (at time 1). The arrows in the diagram denote conditional dependencies. Similarly, the value of the observed variable y2 depends only on the value of the hidden variable x2 (both at time 2).

Hidden Markov model: each state x emits an output y with a specific probability. We only know the output (the observations); thus, the states are hidden.

Basic architecture of a profile HMM

[Figure: profile HMM with Start and End states, match states m1 to m3, insert states i0 to i3, and delete states d1 to d3]

Match states: model the distribution of symbols in the corresponding column of an alignment.
Insert states: model insertions of random letters between two alignment positions.
Silent (delete) states: model deletions, which correspond to gaps in the alignment.
Transitions: states of neighboring positions are connected by transitions, which indicate the possibility of going from one state to the other.

Domain representation – hidden Markov model

Given a multiple sequence alignment of a particular domain family, one uses statistical methods to build a specific HMM for that domain family. In contrast to patterns and profiles, HMMs allow consistent treatment of insertions and deletions. In contrast to patterns and profiles, Markov models take into account information about neighboring residues.
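To make the idea of hidden states emitting observed symbols concrete, the following Python sketch computes the probability of an observed DNA string under a toy two-state HMM using the forward algorithm, which sums over all possible hidden-state paths. It is a simplified illustration, not a profile HMM with match, insert and delete states, and not how any particular tool is implemented. The state names (AT_rich, GC_rich) and every probability in the tables are made-up assumptions.

```python
# Toy two-state HMM over DNA. All names and numbers are illustrative assumptions.
STATES = ("AT_rich", "GC_rich")

# Probability of starting in each hidden state.
START = {"AT_rich": 0.5, "GC_rich": 0.5}

# TRANSITION[p][s] = probability of moving from hidden state p to hidden state s.
TRANSITION = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}

# EMISSION[s][y] = probability that hidden state s emits the observed symbol y.
EMISSION = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def forward_likelihood(observed: str) -> float:
    """Probability of the observed symbols, summed over all hidden-state paths."""
    if not observed:
        return 1.0
    # forward[s] = P(symbols seen so far, current hidden state is s)
    forward = {s: START[s] * EMISSION[s][observed[0]] for s in STATES}
    for symbol in observed[1:]:
        # The next value depends only on the current one: the Markov property.
        forward = {
            s: EMISSION[s][symbol]
            * sum(forward[p] * TRANSITION[p][s] for p in STATES)
            for s in STATES
        }
    return sum(forward.values())


print(forward_likelihood("A"))  # 0.5 * 0.4 + 0.5 * 0.1 = 0.25
```

The loop never looks further back than the previous step, which is the "no memory" property described above. A profile HMM applies the same kind of calculation to a chain of match, insert and delete states built from an alignment.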
Protein domain db – PROSITE

PROSITE is a database of biologically significant patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. A number of protein families, as well as functional or structural domains, cannot be detected using patterns because of their extreme sequence divergence; techniques based on weight matrices (also known as profiles) allow the detection of such domains.
http://prosite.expasy.org
Example entry: ABL1_HUMAN

Protein domain db – PRINTS

The PRINTS database houses a collection of protein family fingerprints. Fingerprints: a recognized and powerful method of classifying new protein families is to use conserved regions within multiple alignments of related proteins. Each homologous region is a "motif", and sets of motifs provide a signature, or fingerprint, for unique identification. These motifs usually denote a common structure and/or function among individual family members.
http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
Example entry: ABL1_HUMAN

Protein domain db – Pfam

Pfam is a database of protein domain families. It contains curated multiple sequence alignments for each family and the corresponding profile hidden Markov models (HMMs). The HMMs are built with HMMER, a hidden Markov model software package that is also available stand-alone. The database is made up of two parts:

o Pfam-A: curated multiple alignments
    - Grows slowly
    - Quality controlled by experts
o Pfam-B: automatically generated (ProDom derived)
    - Complements Pfam-A
    - New sequences instantly incorporated
    - Unchecked: false positives, ...

http://pfam.sanger.ac.uk
Example entry: ABL1_HUMAN

Protein domain db – ProDom

ProDom families are built by an automated process based on a recursive use of PSI-BLAST (Position-Specific Iterated BLAST) homology searches.

PSI-BLAST

"PSI-BLAST is designed for more sensitive protein-protein similarity searches."
"Position-Specific Iterated (PSI)-BLAST is the most sensitive BLAST program, making it useful for finding very distantly related proteins or new members of a protein family."
"Use PSI-BLAST when your standard protein-protein BLAST search either failed to find significant hits, or returned hits with descriptions such as 'hypothetical protein' or 'similar to...'."

Protein domain db – SMART

SMART (Simple Modular Architecture Research Tool) domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. SMART alignments are optimised manually, and the corresponding hidden Markov models (HMMs) are then constructed.
http://smart.embl-heidelberg.de
Example entry: ABL1_HUMAN

Protein domain db – InterPro

The resources examined so far have different areas of optimum application, owing to the different strengths and weaknesses of their underlying analysis methods. Thus, for best results, search strategies should ideally combine all of them. InterPro (The InterPro Consortium 2001) is a collaborative project aimed at providing an integrated layer on top of the most commonly used signature databases by creating a unique, non-redundant characterisation of a given protein family, domain or functional site. The InterPro project home page is available at: http://www.ebi.ac.uk/interpro

Entry types in InterPro

o Family: a group of evolutionarily related proteins that share one or more domains/repeats in common.
o Domain: an independent structural unit that can be found alone or in conjunction with other domains or repeats.
o Repeat: a region occurring more than once that is not expected to fold into a globular domain on its own.
o PTM (post-translational modification): the sequence motif is defined by the molecular recognition of this region in a cell.
o Active site: catalytic pockets of enzymes where the catalytic residues are known.
o Binding site: binds compounds but is not necessarily involved in catalysis.
Protein domain db – InterPro: InterProScan

InterProScan is a tool that combines the different protein signature recognition methods native to the InterPro member databases into one resource, with look-up of the corresponding InterPro and GO annotation.